
Self-fusion's gain lives in the synthesizer, not the judge

2026-06-24 · TrustedRouter-Fusion-Draco on GitHub

Synth is TrustedRouter's multi-model fusion: a panel of models, a judge, and a synthesizer behind one API. This is the research behind it.

[Figure: In self-fusion, the synthesizer is the lever. DRACO self-fusion gain (mean N≥2 − N=1), same Sonnet drafts, 23 shared tasks, Sonnet-4.6 grader; whiskers show 95% bootstrap CIs. Sonnet synthesizer: +8.0 with a Sonnet judge, +9.2 with a Haiku judge. Haiku synthesizer: +4.4 with a Sonnet judge, +2.2 with a Haiku judge. Downgrade the synthesizer from Sonnet to Haiku and the gain falls by 4–7 points; downgrade the judge and it barely moves.]

Fusion works even when every draft comes from the same model: Claude Sonnet 4.6 run ten times and fused into one answer climbs eight points on DRACO deep research — +8.0, with a 95% interval of +4.6 to +11.3. What that leaves open is where the eight points come from: the ten drafts, or the model stitching them together. So we held one fixed and changed the other.
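The interval quoted above is a bootstrap interval over tasks. As a rough illustration of how such an interval can be computed, here is a minimal sketch of a paired percentile bootstrap, assuming one fused score and one single-draft score per task. The function name, parameters, and the choice of the percentile method are assumptions for illustration; the published bootstrap code may do this differently.

```python
import random
from statistics import mean


def bootstrap_gain_ci(fused, single, n_resamples=10_000, alpha=0.05, seed=0):
    """Mean paired gain (fused - single) with a percentile bootstrap CI.

    `fused` and `single` are per-task scores in the same task order.
    Hypothetical helper, not the project's own implementation.
    """
    if len(fused) != len(single):
        raise ValueError("need one fused and one single-draft score per task")
    # Pair by task first, so each resample keeps a task's two scores together.
    diffs = [f - s for f, s in zip(fused, single)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample tasks with replacement and record the mean gain each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi
```

Called with the per-task scores of one configuration, this returns the point estimate and the two ends of the interval. Resampling whole tasks rather than individual scores is what makes the comparison paired.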

Swap only the fuser and the gain collapses. We took the ten Sonnet research reports, left them exactly as they were, and changed only the fuser (the model that reads the drafts and writes the final answer) from Sonnet to Haiku. Same drafts, same grader, same tasks. The gain fell from +8.0 to +2.2, and the interval on +2.2 sits on top of zero. Identical raw material, and a cheaper fuser threw away nearly three-quarters of the lift. Strong drafts do not rescue a fuser that cannot tell which one is right. The lever is the fuser, not the drafts.

And the fuser is two jobs. First a judge reads the drafts and writes a compact analysis — where they agree, where they contradict, what each one caught. Then a synthesizer takes that analysis and the drafts and writes the answer. On a frontier-mixed panel the synthesizer's judgment swung the score eight points while the judge was noise. Does that hold when every draft comes from one model? We ran the full two-by-two on the same Sonnet drafts, each cell graded by Sonnet 4.6.
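The two seats described above can be sketched as two sequential calls. This is a minimal illustration, assuming each model is exposed as a plain prompt-in, text-out callable; the `Model` type, the `fuse` function, and the prompt wording are all hypothetical and are not Synth's actual prompts or API.

```python
from typing import Callable, Sequence

# Hypothetical model interface: takes a prompt, returns a completion.
Model = Callable[[str], str]


def fuse(question: str, drafts: Sequence[str], judge: Model, synthesizer: Model) -> str:
    """Two-stage fusion: the judge writes notes, the synthesizer writes the answer."""
    numbered = "\n\n".join(f"[Draft {i}]\n{d}" for i, d in enumerate(drafts, 1))

    # Seat 1: the judge reads every draft and produces a compact analysis.
    analysis = judge(
        f"Question: {question}\n\n{numbered}\n\n"
        "Compare the drafts: where they agree, where they contradict, "
        "and what each one caught."
    )

    # Seat 2: the synthesizer sees the drafts AND the analysis, then decides
    # which claims survive into the final answer.
    return synthesizer(
        f"Question: {question}\n\n{numbered}\n\n"
        f"Judge analysis:\n{analysis}\n\n"
        "Write the single best final answer."
    )
```

Because the two seats are separate arguments, a two-by-two like the one in this post amounts to calling `fuse` on the same drafts with each pairing of judge and synthesizer.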

Self-fusion gain      Sonnet judge   Haiku judge
Sonnet synthesizer    +8.0           +9.2
Haiku synthesizer     +4.4           +2.2

Only one seat matters. With a Sonnet synthesizer the gain is +8.0 behind a Sonnet judge and +9.2 behind a Haiku judge; +9.2 sits well inside the +4.6 to +11.3 interval on the all-Sonnet cell, so swapping the expensive judge for the cheap one made no detectable difference. With a Haiku synthesizer it is +4.4 and +2.2. Downgrade the synthesizer and you lose four to seven points (3.6 behind a Sonnet judge, 7.0 behind a Haiku judge); downgrade the judge and the gain moves by a point or two with no consistent sign, which is roughly zero. A cheap Haiku judge feeding a Sonnet synthesizer matches the all-Sonnet fuser. The judge can be cheap. The synthesizer cannot.
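The comparison can be made explicit by averaging the two-by-two along each seat. The sketch below uses only the four point estimates reported in this post and computes the average effect of putting Sonnet rather than Haiku in each seat. It ignores the confidence intervals, so treat the output as arithmetic on the table, not as a significance test; the `main_effect` helper is a hypothetical name.

```python
# Self-fusion gain by (synthesizer, judge), taken from the table in this post.
gain = {
    ("sonnet", "sonnet"): 8.0,
    ("sonnet", "haiku"): 9.2,
    ("haiku", "sonnet"): 4.4,
    ("haiku", "haiku"): 2.2,
}


def main_effect(seat: str) -> float:
    """Average gain with Sonnet in `seat` minus average gain with Haiku there."""
    if seat not in ("synthesizer", "judge"):
        raise ValueError("seat must be 'synthesizer' or 'judge'")
    idx = 0 if seat == "synthesizer" else 1
    sonnet = [g for key, g in gain.items() if key[idx] == "sonnet"]
    haiku = [g for key, g in gain.items() if key[idx] == "haiku"]
    return sum(sonnet) / len(sonnet) - sum(haiku) / len(haiku)


print(round(main_effect("synthesizer"), 1))  # 5.3 points from the synthesizer seat
print(round(main_effect("judge"), 1))        # 0.5 points from the judge seat
```

Averaged over the other seat, upgrading the synthesizer is worth about 5.3 points and upgrading the judge about 0.5.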

It makes sense once you see what each seat does. The judge produces structured notes, and getting the notes a little wrong is recoverable. The synthesizer makes the actual call — out of ten messy reports, which claim survives into the answer — and that one decision is the whole game. Reading a pile of research and walking out holding the single correct fact is the skill that scales with raw model strength, and it lives in the writing step, not the analysis step. It is also why fusion sat as a footnote for years: the recipe is old, but a model good enough to run the synthesis seat cheaply is recent.

So the cheap version of fusion is real, but only in one place. The judge in front of the synthesizer can be the fast, free model; the model that writes the final answer has to be the good one. That is a routing decision, not a model decision: ten draft calls and a judge call go to whatever is cheap, and the single synthesis call goes to the frontier. A gateway that places each call on the right model, and fans the drafts across every provider serving them at once so one provider's rate limit does not sink the run, is what makes the cheap version practical. The same fan-out that makes self-fusion work is what makes a router worth pointing it at.
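The routing pattern described above can be sketched with `asyncio`: drafts fan out concurrently across providers with failover on rate limits, then one judge call and one synthesis call follow. Everything here is an assumption for illustration, including the `call(provider, model, prompt)` coroutine, the `RateLimited` exception, and the prompt strings; it is not the TrustedRouter API.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical client: (provider, model, prompt) -> completion text.
Call = Callable[[str, str, str], Awaitable[str]]


class RateLimited(Exception):
    """Hypothetical error raised by a provider client when it is throttled."""


async def call_with_failover(call: Call, providers: Sequence[str], model: str,
                             prompt: str, start: int = 0) -> str:
    # Rotate the starting provider so concurrent calls spread across providers,
    # then fall through to the next provider if one is rate limited.
    for i in range(len(providers)):
        provider = providers[(start + i) % len(providers)]
        try:
            return await call(provider, model, prompt)
        except RateLimited:
            continue
    raise RateLimited(f"all providers throttled for {model}")


async def routed_fusion(call: Call, question: str, providers: Sequence[str],
                        draft_model: str, judge_model: str, synth_model: str,
                        n_drafts: int = 10) -> str:
    # Fan out: all draft calls run concurrently on the cheap draft model.
    drafts = await asyncio.gather(*(
        call_with_failover(call, providers, draft_model, question, start=i)
        for i in range(n_drafts)
    ))
    joined = "\n\n".join(f"[Draft {i}]\n{d}" for i, d in enumerate(drafts, 1))
    # One cheap judge call, then the single call that goes to the strong model.
    notes = await call_with_failover(
        call, providers, judge_model, f"{question}\n\n{joined}\n\nCompare the drafts.")
    return await call_with_failover(
        call, providers, synth_model,
        f"{question}\n\n{joined}\n\nJudge analysis:\n{notes}\n\nWrite the final answer.")
```

Keeping the three model names as separate parameters is the point of the design: only `synth_model` needs to be the expensive one.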

A note on the scores: this ran end to end on Claude Code subagents, graded by Sonnet 4.6 criterion by criterion, over the 23 DRACO tasks the four configurations share. Grading the whole rubric in one call runs a few points high, so read the gaps between configurations, not the absolutes. The drafts, every fused answer, the per-task scores, and the bootstrap code are public. This is the research we do at TrustedRouter; if you want to work on what makes an ensemble of models think, apply.

