Fusion works now, even with the same model: self-fusion

2026-06-23 · TrustedRouter-Fusion-Draco on GitHub

Synth is TrustedRouter's multi-model fusion: a panel of models, a judge, and a synthesizer behind one API. This is the research behind it.

Fusion works now, and it works even when every model in the panel is the same model. Run a strong model ten times on a hard question, have it fuse its own ten answers, and the score goes up for real. The catch is the word strong: a cheap model barely moves. Last month OpenRouter showed the cheap-committee version of the trick and we reproduced it: MiniMax-M3 self-fused from 66.2 to 69.4 on DRACO deep research. What was left open is how far down the price curve it survives. So we ran self-fusion twice: first with Claude Haiku 4.5 doing its own web research, judging, and synthesizing, then with Sonnet 4.6 doing all three. Every answer was graded by Sonnet 4.6.
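The recipe in the paragraph above can be sketched in a few lines. This is a minimal illustration, not the harness used for these results: `complete` is a hypothetical callable that sends a prompt to one model and returns its text, the fuse prompt wording is invented, and the judging and synthesizing steps are collapsed into a single call for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical interface: takes a prompt, returns the model's text answer.
Complete = Callable[[str], str]


def self_fuse(question: str, complete: Complete, n_runs: int = 10) -> str:
    """Sample the same model n_runs times, then have it fuse its own drafts."""
    # Fire all runs at once; each run is an independent attempt at the question.
    with ThreadPoolExecutor(max_workers=n_runs) as pool:
        drafts = list(pool.map(complete, [question] * n_runs))

    numbered = "\n\n".join(f"[Draft {i + 1}]\n{d}" for i, d in enumerate(drafts))
    # Illustrative prompt only: it asks for the best-supported claims,
    # not for the majority view.
    fuse_prompt = (
        "You wrote the drafts below in answer to the same question. "
        "Check their claims against each other, keep what is best supported "
        "even if only one draft says it, and write one final answer.\n\n"
        f"Question: {question}\n\n{numbered}"
    )
    return complete(fuse_prompt)


def fake_model(prompt: str) -> str:
    """Stub model for a dry run: recognizes the fuse prompt by its draft markers."""
    return "fused answer" if "[Draft 1]" in prompt else "draft answer"


result = self_fuse("What changed in the filing?", fake_model, n_runs=3)
```

With a real client in place of `fake_model`, the same function makes `n_runs + 1` calls to one model: the drafts plus the fuse step.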

Fusing helps the smart model a lot and the cheap one a little. Sonnet self-fusion climbs from 66 solo to about 74, a gain of eight points that holds up as significant (95% interval +4.6 to +11.2, across 23 tasks). Haiku self-fusion, across 44 tasks, moves from 55 to about 58, a gain of 2.6 points that looks real but does not quite clear the bar: the 95% interval is −0.3 to +5.4, which still includes zero, with about 96% of bootstrap resamples positive. Same recipe, same grader; the payoff tracks how smart the model doing the fusing is.
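For readers who want to see how an interval like these is computed, here is a minimal sketch of a paired bootstrap over tasks. It assumes the percentile method on per-task gains (fused score minus solo score); the published harness may differ in its details, and the gains below are made-up placeholders, not the experiment's data.

```python
import random


def bootstrap_gain(gains, n_boot=10_000, seed=0):
    """Percentile bootstrap for the mean per-task gain (fused minus solo).

    Returns (mean gain, 2.5th percentile, 97.5th percentile,
    fraction of resamples whose mean gain is positive).
    """
    rng = random.Random(seed)
    n = len(gains)
    # Resample tasks with replacement and record each resample's mean gain.
    boot_means = sorted(sum(rng.choices(gains, k=n)) / n for _ in range(n_boot))
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot) - 1]
    frac_positive = sum(m > 0 for m in boot_means) / n_boot
    return sum(gains) / n, lo, hi, frac_positive


# Placeholder per-task gains, invented purely to exercise the function.
example_gains = [4.0, -2.0, 7.5, 1.0, -5.0, 9.0, 3.0, 0.5, -1.5, 6.0]
point, lo, hi, frac_positive = bootstrap_gain(example_gains)
print(f"gain {point:+.2f}, 95% interval {lo:+.2f} to {hi:+.2f}, "
      f"{frac_positive:.0%} of resamples positive")
```

An interval that straddles zero with most resamples positive is the Haiku situation: the direction is probably right, but the sample cannot rule out no effect.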

model	solo	self-fused	gain	grader
MiniMax-M3	66.2	69.4	+3.2	gemini-3.1-pro
Sonnet 4.6	66	~74	+8.0	Sonnet 4.6
Haiku 4.5	55	~58	+2.6	Sonnet 4.6

We almost published the wrong version of this. The first cut was eight tasks, and on those eight Haiku self-fusion looked like it actively hurt, down three points, and the story wrote itself: cheap fusion backfires. Eighteen more tasks killed it — that batch gained +3.7, the merged number went to +1.5 at 26 tasks, then +2.6 by 44. The dramatic backfire was small-sample luck. Read the directions, not the decimals; the decimals moved every time we added tasks.
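The merge is a task-weighted average of the batch gains. A quick check using the rounded figures quoted above (treating "down three points" as −3.0, which is an assumption) lands at about +1.6, which matches the reported +1.5 to within the rounding of the inputs.

```python
def pooled_gain(batches):
    """Task-weighted mean gain over (n_tasks, mean_gain) batches."""
    total_tasks = sum(n for n, _ in batches)
    return sum(n * gain for n, gain in batches) / total_tasks


# Rounded inputs from the text: 8 tasks at about -3.0, then 18 tasks at +3.7.
merged = pooled_gain([(8, -3.0), (18, 3.7)])
print(round(merged, 2))  # 1.64
```

The larger second batch dominates the average, which is why eight bad tasks could not hold the sign once eighteen more arrived.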

You can still watch a weak fuser do the dumb thing on one task. On a needle-in-a-haystack question, where the score hangs on a single buried fact, a lone Haiku run found the needle and scored 87. Fusing ten runs dropped it to 63: nine of the ten runs had missed the needle, and reading all ten, Haiku wrote the consensus and sided with the nine. Sonnet kept its needle. Fusing is a vote, and a vote only helps if the model counting it can pick the one right answer out of a crowd of wrong ones.
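A toy version of that failure makes the mechanism concrete. The sketch below is an illustration only: answers are reduced to short strings, and `is_supported` is a hypothetical check standing in for a fuser's ability to verify a claim against its evidence.

```python
from collections import Counter


def consensus_fuse(answers):
    """Weak fuser: return whatever most runs said."""
    return Counter(answers).most_common(1)[0][0]


def verifying_fuse(answers, is_supported):
    """Stronger fuser: vote only among answers that pass a support check.

    Falls back to plain consensus when no answer passes.
    """
    supported = [a for a in answers if is_supported(a)]
    return consensus_fuse(supported) if supported else consensus_fuse(answers)


# Nine runs miss the buried fact, one finds it.
runs = ["fact not found"] * 9 + ["the buried fact"]

weak = consensus_fuse(runs)
strong = verifying_fuse(runs, lambda a: a == "the buried fact")
```

Here `weak` sides with the nine and `strong` keeps the needle. The whole difference sits in the quality of the support check, which is the job the fusing model has to do.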

This is probably why fusion has been a footnote until now. The idea is old — sample a model a few times and have something stitch the samples together — but the stitcher was always the weak link. A model that cannot tell its good run from its bad ones blurs them into an average and the gain washes out, which is what Haiku does here. You need a synthesizer good enough to read a pile of messy research reports and walk out holding the one correct claim, and that is recent. Sonnet 4.6 is the first cheap model we have watched clear that bar. The fusion recipe did not get better this year; the models finally got good enough to run it.

And this is where a router earns its keep, even when the panel is one model. Self-fusion fires that model ten times at once, and any single provider will rate-limit you or fall over mid-run. We lost half of this experiment to exactly that, whole batches dying against one provider's quota. Route the ten calls through TrustedRouter and they fan out across every provider serving that model at the same time; when one returns a 429 or drops, the others carry the load. No single provider promises 100% uptime, but a handful of them in parallel get you close. The same fan-out that makes self-fusion work is what makes the router worth pointing it at.
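The fan-out-with-failover idea can be sketched as follows. This is not TrustedRouter's implementation or API: the providers are hypothetical callables, `RateLimited` is a stand-in for a 429 or a dropped connection, and the uptime arithmetic assumes provider failures are independent, which real outages often are not.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Stand-in for an HTTP 429 or a dropped connection."""


def call_with_failover(prompt, providers):
    """Try providers in random order; move on when one is rate-limited."""
    order = list(providers)
    random.shuffle(order)  # spread load instead of always hitting the first
    last_error = None
    for provider in order:
        try:
            return provider(prompt)
        except RateLimited as err:
            last_error = err
    raise RuntimeError("all providers failed") from last_error


def fan_out(prompt, providers, n_runs=10):
    """Issue n_runs calls in parallel, each with its own failover chain."""
    with ThreadPoolExecutor(max_workers=n_runs) as pool:
        futures = [pool.submit(call_with_failover, prompt, providers)
                   for _ in range(n_runs)]
        return [f.result() for f in futures]


def availability(p, k):
    """Chance that at least one of k providers is up, assuming independence."""
    return 1 - (1 - p) ** k


# Hypothetical providers for a dry run: one always throttles, one always answers.
def flaky(prompt):
    raise RateLimited("429 Too Many Requests")


def healthy(prompt):
    return f"answer to: {prompt}"


answers = fan_out("q", [flaky, healthy], n_runs=4)
```

Under the independence assumption, three providers at 99% each give `availability(0.99, 3)` of roughly 0.999999: close to always up, though never exactly 100%.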

A note on the scores, because they are softer than they look. We graded with Sonnet 4.6 standing in for the gemini-3.1-pro grader the rest of this series uses, after checking it tracked gemini on the published DRACO answers — 0.92 correlation, no average bias. These are different tasks, and grading the whole rubric in one call runs a few points high, so read each model against itself, not across. The harness, the per-task scores, and the bootstrap intervals are all public.
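A grader swap like this is usually checked with two numbers: the correlation between the two graders' scores on the same answers, and the mean difference between them. The sketch below assumes Pearson correlation (the text does not say which coefficient was used), and the scores are invented placeholders.

```python
from math import sqrt


def grader_agreement(scores_a, scores_b):
    """Return (Pearson r, mean bias of grader A relative to grader B)."""
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    return cov / sqrt(var_a * var_b), mean_a - mean_b


# Placeholder scores for the same four answers under two graders.
stand_in = [60, 70, 80, 90]
reference = [62, 68, 82, 88]
r, bias = grader_agreement(stand_in, reference)
```

High correlation with near-zero bias says the stand-in ranks answers the same way and sits at the same level on that set of answers. It does not guarantee the same behavior on a different task set, which is why the scores are best read within a model rather than across.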

The frontier-committee version of fusion still wins outright — a panel of different models reaches the best answer. But the cheaper trick is real now too: one good model, run several times against itself and fused by itself, beats running it once. It took the models getting good enough at the one job everyone overlooked. This is the research we do at TrustedRouter. If you have a PhD and want to work on what makes an ensemble of models think, apply.