What actually makes a synth panel work

2026-06-21 · TrustedRouter Synth Draco on GitHub

Synth is TrustedRouter's multi-model fusion: a panel of models, a judge, and a synthesizer behind one API. This is the research behind it.

Synth — a committee of models, each writing its own answer, merged into one — beats any single model on deep research. We have shown that. But a committee has parts: who sits on the panel, who judges, who synthesizes. Which parts actually matter? We took a five-model open-weights committee apart on DRACO — 100 agentic deep-research tasks, graded by gemini-3.1-pro — one piece at a time. The answers were not the ones we expected.
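To make the three roles concrete, here is a minimal sketch of a committee pipeline. It is not TrustedRouter's implementation: the `synth` function, the `Model` type, the prompt wording, and the assumption that the judge's notes are handed to the synthesizer are all hypothetical, chosen only to show where the panel, the judge, and the synthesizer sit.

```python
from typing import Callable, Sequence

# Hypothetical abstraction: a "model" is any callable from prompt to text.
Model = Callable[[str], str]


def synth(task: str, panel: Sequence[Model], judge: Model, synthesizer: Model) -> str:
    """Each panelist drafts an answer, the judge reviews, the synthesizer merges."""
    # 1. Every panelist writes its own answer independently.
    drafts = [member(task) for member in panel]
    numbered = "\n\n".join(f"[Draft {i + 1}]\n{d}" for i, d in enumerate(drafts))

    # 2. The judge assesses the drafts (assumed role: produce critique notes).
    critique = judge(f"Task: {task}\n\n{numbered}\n\nAssess each draft.")

    # 3. The synthesizer sees the drafts plus the notes and writes one answer.
    return synthesizer(
        f"Task: {task}\n\n{numbered}\n\nJudge notes:\n{critique}\n\n"
        "Write one merged answer."
    )
```

Treating each role as a plain callable makes every ablation in this post a matter of passing a different `panel`, `judge`, or `synthesizer`.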

The synthesizer barely matters, at least on an open panel. Hold the five panelists fixed and sweep every judge × synthesizer combination, and the score hardly moves: any config with a GLM-5.2 synthesizer lands around 69, and the judge is statistical noise (the three GLM-synth bars above overlap inside ±1 SE). That is the opposite of a frontier-mixed panel, where the same grid swings almost 25 points, from 48.7 to 73.4, and the choice of synthesizer is everything. The lesson is mechanical: a clever synthesizer only earns its keep when the panel is diverse in capability. When the panelists are close in capability, there is little to reconcile and the synthesizer washes out; mix in frontier models and suddenly the synthesizer's judgment decides the answer.
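The grid sweep and the "overlap inside ±1 SE" check can be sketched as follows. This is an illustration under stated assumptions: `run_config` is a hypothetical stand-in for the real harness and is assumed to return one graded score per task for a given judge and synthesizer.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev
from typing import Callable, Dict, List, Sequence, Tuple

MeanSE = Tuple[float, float]


def sweep(
    judges: Sequence[str],
    synthesizers: Sequence[str],
    run_config: Callable[[str, str], List[float]],  # hypothetical harness hook
) -> Dict[Tuple[str, str], MeanSE]:
    """Mean score and standard error for every judge x synthesizer pair."""
    results = {}
    for judge, synthesizer in product(judges, synthesizers):
        scores = run_config(judge, synthesizer)  # one score per task
        se = stdev(scores) / sqrt(len(scores))   # standard error of the mean
        results[(judge, synthesizer)] = (mean(scores), se)
    return results


def within_one_se(a: MeanSE, b: MeanSE) -> bool:
    """True if the two mean +/- 1 SE intervals overlap."""
    (mean_a, se_a), (mean_b, se_b) = a, b
    return abs(mean_a - mean_b) <= se_a + se_b
```

Overlapping error bars are only a quick visual screen; a paired per-task comparison is the stricter test when the same tasks are scored under both configs.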

Most of the panel looks like dead weight. So we pulled each panelist out and re-scored the committee. Only two of the five — MiniMax M3 and DeepSeek V4 Pro — significantly hurt the score when removed. The other three were individually indistinguishable from free, including the two models that were also serving as the judge and the synthesizer. Read the leave-one-out alone and you would fire three of your five panelists.
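The leave-one-out procedure is simple to sketch. In the code below, `score_panel` is a hypothetical callable standing in for a full benchmark run (panel in, mean score out), and the `k` parameter is an added generalization that removes several members at once. The coverage-based toy scorer is invented purely for illustration and is not how DRACO is graded.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple


def ablate(
    panel: Sequence[str],
    score_panel: Callable[[Sequence[str]], float],  # hypothetical: full benchmark run
    k: int = 1,
) -> Dict[Tuple[str, ...], float]:
    """Score change (ablated minus full) for every way of removing k members."""
    full = score_panel(panel)
    deltas = {}
    for removed in combinations(panel, k):
        kept = [m for m in panel if m not in removed]
        deltas[removed] = score_panel(kept) - full
    return deltas


# Toy scorer: each member "covers" some skills, and the panel earns one point
# per distinct skill covered. Every skill here is covered by two members.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {1, 3}}


def toy_score(members: Sequence[str]) -> float:
    covered = set()
    for m in members:
        covered |= COVERS[m]
    return float(len(covered))
```

In this toy, `ablate(["a", "b", "c"], toy_score, k=1)` reports a change of 0.0 for every member, while `k=2` reports -1.0 for every pair: a single-removal test cannot see coverage that is shared across members.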

But you cannot strip the panel. That would be a mistake, and we have the experiment to prove it. When we actually built a three-member panel, dropping the two "free" role-players, the score fell 3.4 points, nearly as much as removing the single most important model. The drop looks marginal in an unpaired view, but the two panels' per-task scores are 0.80-correlated, so the right test is paired: per task, the three-member panel scores 3.4 ± 0.95 points below the full committee (mean difference −3.4, t = −3.6; a 20,000-sample bootstrap 95% CI of [−5.3, −1.6] that excludes zero). Each of the "free" members is individually droppable; the panel is not. Removing one is free because the others cover the gap; remove two and the gaps open. The breadth is doing work that no single leave-one-out can see. Call it a redundancy floor.
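A paired comparison of this kind can be sketched in a few lines. Pairing helps because the variance of a per-task difference shrinks as the correlation between the two panels grows, so a shift that is hidden by task-to-task spread becomes visible. The function names, the percentile-bootstrap variant, and the fixed seed below are assumptions for illustration; the post does not specify the exact procedure used.

```python
import random
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_summary(full: Sequence[float], small: Sequence[float]) -> Tuple[float, float, float]:
    """Mean per-task difference (small minus full), its standard error, and t."""
    diffs = [s - f for f, s in zip(full, small)]
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # assumes the differences are not all equal
    return m, se, m / se


def bootstrap_ci(
    full: Sequence[float],
    small: Sequence[float],
    n_boot: int = 20_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean paired difference, resampling tasks."""
    rng = random.Random(seed)
    diffs = [s - f for f, s in zip(full, small)]
    n = len(diffs)
    # Resample whole tasks with replacement so each pair stays together.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the interval returned by `bootstrap_ci` lies entirely below zero, the smaller panel is reliably worse on these tasks even when the two unpaired score distributions overlap heavily.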

And the deepest surprise: the most valuable panelist is not the smartest model. DeepSeek V4 Pro is mid-pack as a solo researcher (59.9 on DRACO), and the strongest open solo, GPT-5.5 at 63.0, is not even on this committee. Yet DeepSeek and M3 carry it while stronger models sit on the bench. Synth does not reward raw capability — it rewards uncorrelated error. A model earns its seat by failing in different places than the rest, so that where one hallucinates a date or misses a source, the others do not, and the synthesizer keeps what survives cross-examination. So we measured it directly — grading every panelist's report on all 100 tasks and correlating the per-task scores. We expected a standout: one model whose errors are conspicuously independent of the rest. There isn't one.

Every pair of these very different models — open and closed lineages, five different labs — correlates between 0.47 and 0.71, mean 0.56, and each model's average correlation with the others spans just 0.55 to 0.58. That 0.03 spread is inside the noise. DeepSeek is nominally the least-correlated (0.550), but it is tied with everyone; the clean prediction — that one model's errors are the most independent and that is why it is load-bearing — is simply false.
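The measurement described here, pairwise correlation of per-task scores followed by each model's average correlation with the rest, can be sketched as below. The input layout (a dict from model name to a list of per-task scores, all in the same task order) is an assumption about how the grades are stored.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_correlation_with_others(scores: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """For each model, the average Pearson r against every other model."""
    names = list(scores)
    # Compute each unordered pair once.
    pair = {
        frozenset((a, b)): pearson(scores[a], scores[b])
        for a, b in combinations(names, 2)
    }
    return {
        a: sum(pair[frozenset((a, b))] for b in names if b != a) / (len(names) - 1)
        for a in names
    }
```

A model with conspicuously independent errors would show up as a clearly lower value in the returned dict; a spread of a few hundredths across models, as reported above, gives no such signal.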

The honest picture is subtler and, we think, more useful: diversity is real but diffuse. Roughly half of each pair's score variance is shared and half is independent — synth lives on that independent half — but no single model owns it. That is exactly why leave-one-out finds almost nothing (drop any one model and the others still cover its blind spots) while dropping two opens real gaps: the useful disagreement is spread thin across the whole panel, not banked in a hero. What sets the workhorses, M3 and DeepSeek, apart is not extra independence — it is that they pair that shared diversity with enough competence to put a correct answer on the table, where the weaker members supply uncorrelated error but rarely the right answer to keep.
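One way to read "roughly half shared, half independent" is through a simple common-factor model. This is a simplifying assumption made here for illustration, not a model the analysis above is stated to use: each score is a shared task effect plus a model-specific term, the two are uncorrelated, and the model-specific terms are uncorrelated across models. The symbols $c_t$, $e_{i,t}$, $\sigma_c^2$, and $\sigma_e^2$ are introduced only for this sketch.

```latex
% Score of model i on task t: shared task effect plus a model-specific term.
s_{i,t} = c_t + e_{i,t}, \qquad
\operatorname{Var}(c_t) = \sigma_c^2, \quad
\operatorname{Var}(e_{i,t}) = \sigma_e^2, \quad
\operatorname{Cov}(e_{i,t}, e_{j,t}) = 0 \ \ (i \neq j)

% Under these assumptions the pairwise correlation is the shared fraction of variance.
\operatorname{Corr}(s_i, s_j)
  = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_e^2}
  \approx 0.56
\quad\Longrightarrow\quad
\frac{\sigma_e^2}{\sigma_c^2 + \sigma_e^2} \approx 0.44
```

Under this reading the correlation itself, not its square, is the shared share of variance, so a mean pairwise correlation of 0.56 corresponds to a split of roughly 56% shared and 44% model-specific.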

Three findings, one shape: a synth panel's value lives in its diversity, and diversity is invisible to every single-component test. The synthesizer is flat until the panel is varied; each member looks free until you remove enough of them; the diversity that does the work is spread across the whole panel, not banked in any one model. None of it shows up if you only ever measure one thing at a time.

This is the research we do at TrustedRouter — open code, open results, reproducible end to end. The harness, the exact panel, and the DRACO tasks are all public. If you have a PhD and want to work on what actually makes an ensemble of models think, apply.