Open-source models synthesize better than Opus, and GPT-5.5 is the worst of the capable synthesizers

2026-06-17 · TrustedRouter Synth Draco on GitHub

Synth is TrustedRouter's multi-model fusion: a panel of models, a judge, and a synthesizer behind one API. This is the research behind it.

Update: we have since run the full judge × synthesizer grid, with more complete recommendations — the best judge, and the best judge-and-synthesizer pairing. See Synth is two jobs: the best synthesizer pairs a Kimi-k2.6 judge with a GLM-5.2 synthesizer, 73.4 on DRACO.

Synthesizing a panel of research reports into one answer is a skill of its own, and the best model at it is open-weights. We ran the test directly. On DRACO, our suite of 100 agentic deep-research tasks judged by gemini-3.1-pro, we held one thing fixed and varied one thing. The fixed part is a five-model panel — gpt-5.5, opus-4.8, gemini-3-flash, kimi-k2.6, and deepseek-v4-pro — plus a single fixed judge analysis. Every panelist sees the same task and writes the same report each time. The one thing we swap is the final model that reads those five reports and writes the answer, the slot we call the synthesizer. Whatever moves in the score is the synthesizer and nothing else.
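The design described above is a single-variable ablation, and it can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual TrustedRouter harness: `run_ablation`, the `Synthesizer` type, the toy synthesizers, and the word-count `grade` function are all hypothetical names invented here, and the sketch assumes panel reports and the judge analysis are produced once and reused for every synthesizer.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical type: a synthesizer receives the task, the fixed panel reports,
# and the fixed judge analysis, and returns one final answer.
Synthesizer = Callable[[str, List[str], str], str]


def run_ablation(
    tasks: List[str],
    panel_reports: Dict[str, List[str]],   # task -> reports (assumed produced once, then reused)
    judge_analysis: Dict[str, str],        # task -> judge analysis, also held fixed
    synthesizers: Dict[str, Synthesizer],  # the only component that varies
    grade: Callable[[str, str], float],    # (task, answer) -> score
) -> Dict[str, float]:
    """Run every synthesizer on identical inputs and return its mean score."""
    results = {}
    for name, synthesize in synthesizers.items():
        scores = [
            grade(task, synthesize(task, panel_reports[task], judge_analysis[task]))
            for task in tasks
        ]
        results[name] = mean(scores)
    return results


# Toy stand-ins so the sketch runs end to end. Real model calls and the real
# grader would replace these; the word-count metric is a placeholder, not DRACO.
tasks = ["t1", "t2"]
panel_reports = {t: [f"report {i} on {t}" for i in range(5)] for t in tasks}
judge_analysis = {t: f"analysis of {t}" for t in tasks}
synthesizers = {
    "concat": lambda task, reports, analysis: " ".join(reports),
    "first_only": lambda task, reports, analysis: reports[0],
}
grade = lambda task, answer: float(len(answer.split()))

print(run_ablation(tasks, panel_reports, judge_analysis, synthesizers, grade))
```

Because `panel_reports` and `judge_analysis` are passed in unchanged for every entry in `synthesizers`, any difference between the returned means can only come from the synthesizer function itself.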

MiniMax-M3 and GLM-5.2 are the two best, within half a point — 71.6 and 71.1 — and our fuller judge × synthesizer grid edges GLM-5.2 ahead as the best synthesizer. Both are open-weights, and both finish ahead of Claude Opus 4.8 at 70.6. That ordering should give you pause. The frontier closed model loses the synthesis slot to two models you can download, on a task where the panel feeding all three is identical.

Synthesizer	DRACO score (all 100 tasks)
minimax-m3	71.6
glm-5.2	71.1
opus-4.8	70.6
kimi-k2.6	67.0
deepseek-v4-pro	65.7
gpt-5.5	62.2
gemma-4-31b	54.0

The sharpest result is GPT-5.5. Run on its own as a researcher, it is the strongest single model on this benchmark, scoring 63.0 solo. Hand it five reports to reconcile and it drops to 62.2, the bottom of the capable synthesizers, below DeepSeek V4 Pro at 65.7 and Kimi K2.6 at 67.0. The model that is best at doing the research alone lands among the worst at reconciling the research of others. Solving a task and synthesizing five reports are two different abilities, and being excellent at the first tells you almost nothing about the second. A great soloist defaults to its own view. A great synthesizer weighs five views it didn't write and resolves where they disagree.

Size matters here in a way it does not for a panelist. Gemma-4-31b collapses to 54.0, more than seventeen points under the leaders. A 31-billion-parameter model holds its own as one voice on the panel, then runs out of room when asked to hold five frontier reports in context and reconcile them at once. The synthesizer has to keep all the evidence live, track which source said what, and adjudicate conflicts. Panelists can stay small because each owns one slice. The synthesizer owns the whole thing, so it has to be big.
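The gaps quoted so far can be rechecked from the table. The short Python sketch below uses only the scores in the table and the 63.0 solo figure for GPT-5.5 quoted in the text; the variable and function names are illustrative.

```python
# Synthesizer scores copied from the table; the solo figure is the one quoted in the text.
synth_scores = {
    "minimax-m3": 71.6,
    "glm-5.2": 71.1,
    "opus-4.8": 70.6,
    "kimi-k2.6": 67.0,
    "deepseek-v4-pro": 65.7,
    "gpt-5.5": 62.2,
    "gemma-4-31b": 54.0,
}
gpt55_solo = 63.0

# Highest-scoring synthesizer in this table.
leader = max(synth_scores, key=synth_scores.get)


def gap(a: str, b: str) -> float:
    """Score of a minus score of b, rounded to one decimal to match the table."""
    return round(synth_scores[a] - synth_scores[b], 1)


top_two_spread = gap("minimax-m3", "glm-5.2")         # spread between the two leaders
gemma_vs_leader = gap("gemma-4-31b", leader)          # how far the small model trails
gpt55_solo_to_synth = round(synth_scores["gpt-5.5"] - gpt55_solo, 1)  # change from solo to synthesizer

print(leader, top_two_spread, gemma_vs_leader, gpt55_solo_to_synth)
# prints: minimax-m3 0.5 -17.6 -0.8
```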

The obvious objection: if GLM-5.2 is the best synthesizer, why not just use it? Because it goes blank on Taiwan and Hong Kong. As we documented in the best synthesizer goes blank on Taiwan, GLM-5.2 refuses politically sensitive China content, and a synthesizer that drops whole topics is unsafe as a default no matter how the average score reads. GLM-5.2 edges ahead as the strongest synthesizer in the fuller grid, but it carries that hole; MiniMax-M3 is a hair behind there, and ahead in the table above, with no such gap, which makes it the model we'd actually put in the synthesizer slot as the safe default.
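The reasoning here treats topic coverage as a hard constraint, not something to trade off against the average score. A minimal Python sketch of that selection rule follows. The `known_topic_refusals` flag and the `pick_default` function are hypothetical names for illustration, and only the two leaders are listed because those are the only models whose coverage status is stated here.

```python
# Scores are from the table. The refusal flag encodes the claim in the text:
# GLM-5.2 refuses some topics, MiniMax-M3 has no such gap.
candidates = {
    "glm-5.2": {"score": 71.1, "known_topic_refusals": True},
    "minimax-m3": {"score": 71.6, "known_topic_refusals": False},
}


def pick_default(models: dict) -> str:
    """Drop models that fail the coverage constraint, then rank survivors by score."""
    eligible = {
        name: m for name, m in models.items() if not m["known_topic_refusals"]
    }
    if not eligible:
        raise ValueError("no candidate passes the coverage constraint")
    return max(eligible, key=lambda name: eligible[name]["score"])


print(pick_default(candidates))
```

Under this rule a higher score for GLM-5.2, such as the edge it gets in the fuller grid, would not change the pick, because the filter runs before the ranking.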

This sits on top of the result in our synth evals post, where assembling a panel and synthesizing it beats any single frontier model, and it extends the finding from the best open models aren't on your leaderboard: solo-model rankings fail to predict who fuses well. Synthesis is a capability you have to measure on its own, because the usual proxies of raw smarts, parameter count, and solo benchmark rank all mislead. You can see every panelist and judge model on our models page, and the harness that produced these numbers is open at TrustedRouter Synth Draco.

If synthesis is a separate skill, it is a separate training target. A model could be built specifically to read N reports and produce one reconciled answer, optimized for that job alone, free of any requirement that it also be a pleasant chatbot or a strong solo researcher. The synthesizer slot is the most consequential position in an agentic research stack, since it decides what the user actually reads, and right now we fill it with general-purpose models that happen to be decent at it. The numbers suggest a purpose-built synthesizer could beat all of them.