The Synthesis Layer Primacy Thesis:
Does Aggregation Rule Design Outweigh Debate Duration
in Multi-Agent LLM Systems?

A Systematic Literature Review with Experimental Design Proposal for Cabinet (Sparse Halo)
Sparse Halo Research · Cabinet System Study · April 2026
Abstract

This study examines Proposition P2 of the Cabinet research series: whether varying the Umpire's aggregation rule -- from plain synthesis to confidence-weighted synthesis to majority-position synthesis -- produces larger effect sizes than varying round count from 1 to 4, as measured by blind human evaluators on real Cabinet sessions. We synthesize evidence from eleven primary sources spanning multi-agent debate systems, mixture-of-agents architectures, financial multi-agent evaluation frameworks, and confidence-weighted communication protocols.

The literature presents a consistent, if not entirely settled, pattern: the mechanism by which agents' outputs are aggregated into a final response accounts for substantially more variance in output quality than the number of deliberation rounds. Quantitative anchors include: the Coordination Primacy Hypothesis (CPH), which documents a 15-30% Sharpe ratio reduction from removing coordination versus only 5-8% variance from model substitution in financial multi-agent systems; the Council Mode finding that replacing structured synthesis with majority voting increases hallucination rate by 32.7% (10.7% to 14.2%); the D3 ablation showing that aggregation mechanism (ensemble scaling from one to five jurors) contributes +8.8% accuracy while persona diversity adds only +3.8%; and the PartnerMAS supervisor ablation demonstrating that upgrading the aggregation agent alone yields the largest single-role performance gain (+4.6 percentage points). Against the thesis: failure mode analysis from 1,600 annotated multi-agent traces shows that specification failures (41.77%) and inter-agent misalignment (36.94%) account for far more total failure mass than task verification failures (21.30%), suggesting that the synthesis layer is not the dominant failure mode in deployed systems. DebUnc's confidence-weighting results are especially cautionary: the theoretical upper bound from perfect confidence signals is +14 percentage points, but real uncertainty metrics yield only +2 percentage points because current calibration quality (AUROC ~0.63) is only marginally better than chance.

For Cabinet specifically, these findings suggest that investing in Umpire aggregation logic -- particularly the transition from plain synthesis to evidence-weighted structured synthesis -- is likely to yield larger user-facing quality gains than increasing max_turns from 2 to 4. We propose a factorial experimental design (3 aggregation rules x 4 round counts, n=120 sessions minimum per cell) with ELO-style pairwise human evaluation to test this claim directly. Key limitations include: all primary studies use homogeneous evaluation benchmarks rather than consumer product sessions; the financial domain evidence (CPH) may not transfer cleanly to open-ended QA; and the 1-3 point run-to-run variance documented in representational collapse research means that small reported effects may not be reproducible in Cabinet's specific operating context.

Table of Contents

  1. Introduction and Research Question
  2. The Cabinet Architecture
  3. Theoretical Framework
  4. Evidence For Synthesis Layer Primacy
  5. Evidence Against or Complicating the Thesis
  6. Round Count Effects and Diminishing Returns
  7. The Confidence-Weighted Synthesis Question
  8. Proposed Experimental Design for Cabinet
  9. Implications for Cabinet's Product Roadmap
  10. Limitations
  11. Open Research Questions
  12. References

1. Introduction and Research Question

Multi-agent debate systems for large language models (LLMs) present a design space with at least two orthogonal dimensions of tuning: the number of deliberation rounds and the mechanism by which the final output is aggregated from agent contributions. Cabinet (Sparse Halo) exposes both dimensions explicitly -- max_turns governs how many rounds of agent exchange occur, while the synthesis_model and its associated system prompt define how the Umpire produces the user-facing response. Whether these two dimensions contribute equally to output quality, or whether one dominates, is a question with direct consequences for Cabinet's engineering priorities and product roadmap.

This study addresses Proposition P2 of the Cabinet research series:

Proposition P2 -- The Synthesis Layer Primacy Thesis

Varying the Umpire's aggregation rule (confidence-weighted synthesis vs. plain synthesis vs. majority-position synthesis) produces larger effect sizes on blind human evaluation scores than varying round count from 1 to 4, when measured on real Cabinet sessions using ELO-style pairwise comparison.

The practical stakes are real. If P2 holds, Cabinet's engineering resources should concentrate on the Umpire's aggregation logic -- the system prompt structure, the weighting of agent contributions, and the mechanism by which the synthesis model resolves contradictions. The current parallel-first-round + round-robin architecture with a fixed synthesis prompt may be leaving substantial quality on the table not because the agents debate too briefly, but because the Umpire synthesizes too crudely. If P2 does not hold -- if round count is the dominant lever -- then the product decision inverts: optimize for smarter early termination (max_turns reduction), not aggregation sophistication.

The question is also theoretically interesting because it pits two distinct mechanisms against each other. Round count governs the amount of deliberation -- how many times agents can challenge, refine, and respond to each other's positions. Aggregation rule governs the quality of the final distillation -- the cognitive work that converts a transcript of disagreements into a single coherent output. The literature, as we will show, suggests these are not symmetrically powerful. Aggregation appears to be generative: a high-quality synthesis can extract signal from a modest debate transcript that a poor synthesis would squander. The reverse does not hold as cleanly. More rounds with a weak aggregator do not reliably compensate for the aggregator's limitations.

This systematic review is organized as follows. Section 2 describes Cabinet's architecture as-built. Section 3 establishes the theoretical vocabulary. Sections 4 and 5 present evidence for and against the thesis in sequence. Section 6 isolates the round count effect with dedicated quantitative analysis. Section 7 examines the specific question of confidence-weighted synthesis. Section 8 proposes a rigorous experimental design. Sections 9-11 address implications, limitations, and open questions.

2. The Cabinet Architecture

Cabinet is a multi-agent debate orchestration system. Its core components -- drawn from analysis of orchestrator.py, judge.py, and cabinetPresets.ts -- are described below with the analytical vocabulary this study requires.

2.1 The Umpire (Synthesis Layer)

The Umpire is Cabinet's name for the synthesis model call that produces the user-facing output. It is a separate model call from all agent rounds. The synthesis_model is configurable independently of the agent models used for debate. The Umpire receives the full debate transcript -- all agent contributions, all judge annotations -- and produces a structured final response according to one of two system prompt templates.

The current synthesis prompt is structured but does not weight agent contributions by measured confidence, agreement score, or evidential quality. All agents' arguments enter the Umpire's context equally. This is the central design choice that Proposition P2 asks about: what happens when that weighting is made non-uniform?

2.2 The Judge Layer

Cabinet's judge.py module operates between debate rounds, scoring each round on three dimensions: disagreement score (0-10), evidence score (0-10), and a sycophancy flag (boolean). The judge uses JUDGE_MODEL_ID at temperature 0.1 for consistency. The SYCOPHANCY_NUDGE injection is triggered when the sycophancy flag fires -- injecting a prompt that nudges agents away from agreement convergence.

The judge's round-by-round scores are available in the debate transcript that the Umpire receives. Under the current architecture, the Umpire does not explicitly weight its synthesis by these scores. This is the gap that a confidence-weighted aggregation rule would fill.
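The gap can be made concrete. Below is a minimal sketch of what a confidence-weighted aggregation rule might look like, assuming hypothetical field names for the judge annotations (Cabinet's actual judge.py schema may differ) and an illustrative weighting function -- the exact mapping from judge scores to weights is itself a design choice the P2 experiment would need to fix in advance:

```python
from dataclasses import dataclass

@dataclass
class RoundScore:
    """Per-round judge annotations per Section 2.2 (hypothetical field names)."""
    disagreement: float  # 0-10
    evidence: float      # 0-10
    sycophancy: bool

def weight_contribution(score: RoundScore) -> float:
    """Map judge scores to a synthesis weight in [0, 1].

    Illustrative rule: evidence quality dominates, and rounds flagged
    as sycophantic are discounted. Not Cabinet's actual implementation.
    """
    w = score.evidence / 10.0
    if score.sycophancy:
        w *= 0.5  # discount agreement-seeking rounds
    return w

def build_weighted_prompt(transcript: list) -> str:
    """Prefix each contribution with an explicit weight so the Umpire can
    treat judge scores as a weighting signal, not narrative context."""
    lines = []
    for agent, text, score in transcript:
        lines.append(f"[weight={weight_contribution(score):.2f}] {agent}: {text}")
    return "\n".join(lines)
```

The design choice here is to surface the weights in the prompt itself rather than filter contributions, preserving the generative character of the synthesis step (see Section 3.3).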

2.3 Orchestration Parameters

Parameter | Value / Range | Implication for P2
max_turns | Capped at 5; presets use 3-4 | Round count variable in P2 experiment (1-4)
synthesis_model | Configurable separately from agents | Aggregation rule variable in P2 experiment
turn_strategy | "round_robin" or "parallel_initial_then_round_robin" | Controls whether agents see all peers' first responses before round 2
temperature | 0.7 default | Affects within-session variance; must be held fixed across P2 conditions
response_word_limit | 800 (debate) / 950 (planning) | Constrains agent response length; relevant to whether longer debates add information
Agent count | 1-5 configurable; presets use 2-4 | Should be held fixed in P2 experiment at 3 agents

Table 1. Cabinet orchestration parameters and their experimental relevance to Proposition P2.
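The factorial structure implied by Table 1 can be enumerated directly. A sketch with illustrative names (the class and constant names are not Cabinet identifiers); the fixed values follow the table:

```python
from dataclasses import dataclass
from itertools import product

AGGREGATION_RULES = ("plain", "confidence_weighted", "majority_position")
ROUND_COUNTS = (1, 2, 3, 4)

@dataclass(frozen=True)
class P2Condition:
    """One cell of the 3 x 4 factorial design implied by Table 1."""
    aggregation_rule: str
    max_turns: int
    temperature: float = 0.7   # held fixed across conditions, per Table 1
    agent_count: int = 3       # held fixed at 3 agents, per Table 1

def factorial_cells() -> list:
    """Enumerate all 12 experimental cells (3 rules x 4 round counts)."""
    return [P2Condition(rule, turns)
            for rule, turns in product(AGGREGATION_RULES, ROUND_COUNTS)]
```

With the abstract's n=120 sessions minimum per cell, the full design requires at least 12 x 120 = 1,440 Cabinet sessions.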

2.4 Cabinet Presets and Agent Diversity

Cabinet's eight production presets use heterogeneous model compositions, drawing from GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.20, DeepSeek V3.2, Qwen3-235B, Kimi K2.5, and Perplexity Sonar. Presets run 3-4 rounds with 2-4 agents using Socratic, Adversarial, Debate, or Planning personas. The deliberate heterogeneity is consistent with the literature's recommendation for cross-model diversity, though -- as Section 5 will discuss -- that diversity recommendation is more nuanced than it first appears.

2.5 The Umpire's Information Advantage

A structural point worth emphasizing: the Umpire in Cabinet's design has access to the full judge-annotated transcript, including disagreement scores and evidence scores per round. This is richer information than most aggregators in the literature receive. The question P2 raises is whether the Umpire is currently exploiting this information or merely receiving it. A plain synthesis prompt treats this structured annotation as narrative context; a confidence-weighted synthesis prompt would treat it as a weighting signal. This distinction -- between receiving evidence and acting on it -- is central to the experimental design in Section 8.

3. Theoretical Framework

3.1 Key Definitions

Synthesis layer refers to any component of a multi-agent system that converts the outputs of multiple agents into a single coherent response. In Cabinet, this is the Umpire. In the broader literature, it is variously called the coordinator (CPH), the consensus model (Council Mode), the supervisor (PartnerMAS), the aggregator (MoA), or the synthesizer (MedARC). Despite diverse terminology, these components share a function: they must decide how to weight, combine, and resolve contradictions among agent contributions.

Aggregation rule refers to the algorithm or prompt logic that governs the synthesis layer's weighting of agent inputs. The simplest rule is equal weighting (plain synthesis): all agents' arguments are treated as equally credible. More sophisticated rules weight by confidence, evidence quality, round-specific score, agent agreement intensity, or domain-specific reliability. The P2 experiment tests three specific rules: plain synthesis (current Cabinet default), confidence-weighted synthesis (using judge scores), and majority-position synthesis (selecting the most commonly expressed position).

Round count refers to the number of complete exchange cycles that occur before the Umpire synthesizes. One round means each agent responds once to the initial query. Two rounds means agents have seen each other's first response and have refined their positions once. Four rounds means three cycles of cross-agent challenge and refinement. Cabinet's max_turns parameter controls this value.

Effect size in this study refers to the standardized difference in blind human evaluation scores between two conditions. We use Cohen's d for continuous scores and the Bradley-Terry model for ELO-derived pairwise win rates. A Cohen's d of 0.2 is conventionally small, 0.5 medium, 0.8 large. The Chatbot Arena literature suggests that even the top-ranked models differ by d ~0.18 from second-ranked, meaning the practical effect sizes in LLM quality comparisons are inherently small.[19]
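For reference, Cohen's d with a pooled standard deviation is straightforward to compute. This is the standard textbook formula, not Cabinet-specific code; the Bradley-Terry fitting for pairwise win rates is omitted here:

```python
from statistics import mean, stdev

def cohens_d(a: list, b: list) -> float:
    """Cohen's d between two samples using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5
```

A mean shift of one pooled standard deviation yields d = 1.0; the d ~ 0.18 cited above corresponds to a shift of less than a fifth of a standard deviation.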

3.2 Mapping Cabinet's Umpire to Literature Roles

Literature System | Term for Synthesis Layer | Mechanism | Analogy to Cabinet Umpire
MoA (Wang et al., 2024) [1] | Aggregator | Generates new response from all proposer outputs; does not select | Direct -- Umpire generates new response, does not vote
Council Mode (Wu et al., 2026) [2] | Consensus model | 4-phase structured synthesis: consensus / disagreement / unique / analysis | Close -- current Umpire prompt has VERDICT + BREAKDOWN but does not explicitly separate consensus from unique findings
PartnerMAS (Li et al., 2025) [3] | Supervisor (SPA) | Two-step: consensus selection (frequency count), then conflict resolution (weighted inverse rank) | Partial -- Cabinet judge scores provide a proxy for the weighting signal SPA uses explicitly
D3 SAMRE (Harrasse et al., 2026) [4] | Jury + Judge | 5 jurors with persona diversity; majority vote; judge as tie-breaker | Diverges -- Cabinet uses a single synthesis model, not a juror panel
CPH (Nguyen & Pham, 2026) [5] | Coordinator | Inter-agent coordination protocol (hierarchical, debate-based, conference-based) | The Umpire is the binding constraint in Cabinet's hierarchy -- analogous to CPH's D2 dimension
GSA (Li et al., 2025) [6] | Generative self-aggregator | Single model generates new response from all candidate responses as context | Direct structural analog -- GSA's aggregation step is architecturally identical to Cabinet's Umpire call

Table 2. Mapping Cabinet's Umpire to synthesis layer terminology across the literature.

3.3 The Generative vs. Discriminative Distinction

A theoretical finding of independent importance: the literature converges on the view that generative aggregation -- producing a new response from all inputs as context -- outperforms discriminative aggregation -- selecting, ranking, or voting among existing responses. GSA demonstrates this directly: the generative synthesis step produces correct answers even when all candidate responses are incorrect, while choose-from-N cannot.[6] MoA similarly "significantly outperforms" an LLM ranker baseline that uses the same model to select rather than generate.[1] This has a direct implication for Cabinet's P2 experiment: "majority-position synthesis" -- the weakest of the three proposed aggregation rules -- is likely to be the worst-performing condition not just because it ignores confidence signals, but because it is discriminative rather than generative. This aligns with game-theoretic analysis showing that consensus-seeking equilibria outperform simple voting mechanisms in language model generation.[18]
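The generative/discriminative contrast can be sketched in a few lines. Here `generate` stands in for a synthesis-model call (a hypothetical signature, not a real API); the essential difference is the output space -- choose-from-N can only return an existing candidate, while generate-from-N can produce an answer absent from every candidate:

```python
def discriminative_aggregate(candidates: list, vote_counts: dict) -> str:
    """Choose-from-N: output is constrained to an existing candidate."""
    return max(candidates, key=lambda c: vote_counts.get(c, 0))

def generative_aggregate(candidates: list, generate) -> str:
    """Generate-from-N: a new response conditioned on all candidates.

    `generate` is a stand-in for a synthesis-model call. Because the
    output is freshly generated, a correct answer can emerge even when
    every candidate is wrong -- the GSA property cited above.
    """
    context = "\n".join(f"Response {i + 1}: {c}"
                        for i, c in enumerate(candidates))
    return generate(f"Synthesize a single best answer from:\n{context}")
```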

3.4 The Noise Floor Problem

Any experimental test of P2 must contend with the noise floor. Representational collapse research documents 1-3 point run-to-run variance in multi-agent debate scores.[7] Chatbot Arena analysis shows that even the GPT-4 vs. Claude-v1 comparison (321 pairs) yields only p=0.0265 with an effect size of d=0.18[19] -- which is statistically significant but practically small and potentially below the noise floor for Cabinet's operating context. This means the P2 experiment must be designed with adequate statistical power to distinguish genuine aggregation effects from run-to-run variance.

4. Evidence For Synthesis Layer Primacy

We organize the evidence into three categories: ablation studies that directly isolate the synthesis layer's contribution, architectural analyses that compare synthesis mechanisms against alternatives, and scaling analyses that reveal where marginal returns are largest.

4.1 Ablation Studies

4.1.1 The Coordination Primacy Hypothesis -- FinCon and TradingAgents

The most directly cited quantitative anchor for synthesis primacy comes from the CPH paper, which synthesizes ablation data from FinCon and TradingAgents.[5] In both systems, removing coordination (the mechanism that aggregates and arbitrates among agent outputs) reduced the Sharpe ratio by 15-30%. Substituting a smaller backbone LLM within a fixed coordination protocol produced only 5-8% performance variance. The ratio is approximately 3-6x: the coordination/synthesis layer's contribution exceeds model selection by that margin.

Key Finding -- CPH Ablation

Removing coordination: -15 to -30% Sharpe ratio. Model substitution: only 5-8% performance variance. Ratio: 3-6x magnitude advantage for coordination over model selection. Source: CPH (arXiv:2603.27539), Section 5.3.2.

Critical caveats apply. These are author-reported ablations from FinCon and TradingAgents, not independently replicated. The CPH paper classifies them as Tier 2 evidence -- "suggestive rather than confirmatory." The financial domain is also specific: trading multi-agent systems optimize for a single quantitative metric (Sharpe ratio) under a well-defined performance function, which may not generalize to open-ended QA tasks like Cabinet's core use case.

4.1.2 Council Mode -- Structured Synthesis vs. Majority Vote

The Council Mode paper provides the most direct ablation for Cabinet's context.[2] In the "without structured synthesis (majority vote)" ablation condition, replacing the dedicated consensus synthesis model with simple majority voting on agent outputs increased the hallucination rate from 10.7% to 14.2% -- a 32.7% relative increase. The Truthful score fell from 82.6% to 77.3% (-5.3 points). The Quality Score fell from 91.7% to 85.4% (-6.3 points).

Aggregation Rule | Halluc. Rate | Truthful Score | Quality Score | Latency
Full Council (3 experts + structured synthesis) | 10.7% | 82.6% | 91.7% | 8.4s
Without structured synthesis (majority vote) | 14.2% (+32.7% rel.) | 77.3% (-5.3 pp) | 85.4% (-6.3 pp) | 6.8s
2 Experts + Synthesis (reduced agents) | 12.8% | 79.1% | 88.2% | 7.1s
Same-model ensemble (3x GPT-5.4) | 15.6% | 75.8% | 83.1% | 7.9s
Best single model (Claude Opus 4.6) | 16.7% | 74.8% | 81.5% | 4.1s

Table 3. Council Mode ablation results (HaluEval + TruthfulQA composite). Source: Wu et al. (2026), Table 4.
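Because Table 3 mixes relative deltas ("+32.7% rel.") with absolute percentage-point deltas ("-5.3 pp"), the two helpers below make the distinction explicit; confusing the two is a common source of overstated effect sizes:

```python
def pct_point_change(before: float, after: float) -> float:
    """Absolute change in percentage points (pp)."""
    return after - before

def relative_change(before: float, after: float) -> float:
    """Relative change as a percentage of the baseline value."""
    return (after - before) / before * 100.0
```

For the hallucination-rate row: the absolute change is +3.5 pp (10.7% to 14.2%), but the relative change is +32.7% of the baseline.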

The Council Mode paper also shows that heterogeneous expert diversity roughly doubles the hallucination reduction benefit relative to a same-model ensemble: 35.9% relative reduction (heterogeneous) vs. 18.3% relative reduction (same-model). This implies that Cabinet's existing cross-model agent composition already captures part of the diversity benefit -- the remaining question is whether the Umpire is fully exploiting it.

4.1.3 D3 -- Ensemble Effect Dominates Persona Effect

The D3 ablation study (EACL 2026) isolates two components of aggregation quality: the ensemble effect (scaling from one juror to five) and the persona effect (adding diversity through role-based identities).[4] Scaling from one to five jurors yielded +8.8% accuracy on MT-Bench -- the "largest single contributor" to D3-MORE's performance. Persona effects added +3.8% on top of the ensemble effect.

Aggregation Configuration | MT-Bench Accuracy | Pos. Swap Consistency | Cohen's Kappa
Single juror, no persona | 72.5% | 81.7% | 0.45
Single juror, with persona | 74.8% (+2.3 pp) | 83.2% | 0.47
Multi-juror (k=5), no personas | 81.3% (+8.8 pp vs. base) | 90.1% | 0.54
Multi-juror (k=5), with personas | 85.1% (+12.6 pp total) | 94.8% | 0.58

Table 4. D3 ablation of ensemble (aggregation) vs. persona (diversity) effects on MT-Bench accuracy. Source: Harrasse et al. (2026), Table 3.

A crucial finding from the D3 SAMRE stopping analysis bears on the round count question: 58% of 1,200 SAMRE evaluations stopped by round 2 via budgeted stopping, with a mean of only 2.71 rounds out of a maximum of 5. Forcing continuation beyond convergence changed verdicts in only 6% of cases.[4] The deliberative protocol reaches diminishing returns quickly; the aggregation mechanism is what differentiates quality across the remaining variance.

4.1.4 PartnerMAS -- Supervisor Upgrade Has Largest Single-Role Effect

PartnerMAS tested role-specific backbone model upgrades in a hierarchical business partner selection system.[3] Holding the Planner Agent (PA) and Specialized Agents (SA) fixed at gpt-4o-mini, upgrading only the Supervisor Agent (SPA -- the aggregation role) from gpt-4o-mini to gpt-4.1-mini raised the match rate from 64.40% to 69.03% -- a gain of +4.6 percentage points. Equivalent upgrades to PA or SA alone yielded smaller or negligible gains.

Key Finding -- PartnerMAS Role-Swap Ablation

Upgrading SPA alone: +4.6 pp (64.40% to 69.03%). Upgrading SA alone: +0.3 pp. Upgrading PA alone: +0.79 pp. "Coordination, not specialization, is the current bottleneck." Source: Li et al. (2025), Table 1.

PartnerMAS also beats a debate-based MAS baseline by 10.7 percentage points (70.89% vs. 60.19% for gpt-4.1-mini), suggesting that structured hierarchical aggregation outperforms an alternative multi-agent configuration that does not use a dedicated synthesis layer at all.

4.2 Architectural Comparisons

4.2.1 MoA -- Aggregation is Generative, Not Discriminative

Wang et al. (2024) present the foundational architectural argument for synthesis layer primacy: the aggregator in MoA does not select among proposed answers but generates a new response from all proposer outputs as context.[1] This "Aggregate-and-Synthesize" approach significantly outperforms an LLM ranker baseline that uses the same model to select the best proposed answer. On the MATH task, Layer 1 to Layer 2 aggregation produces the largest single-step quality gain for all six tested aggregator models, with diminishing returns from Layer 2 to Layer 3. For Mixtral-8x22B as aggregator, the Layer 1-to-2 gain is +0.252 (0.282 to 0.534); the Layer 2-to-3 gain is only +0.022.

The MoA paper's core insight -- that "the first response aggregation has the most significant boost on generation quality" -- is directly relevant to Cabinet: the first synthesis pass (the Umpire) is where most of the quality improvement occurs, not in the preceding rounds of agent exchange.

4.2.2 GSA -- Self-Aggregation Outperforms Self-Consistency

Generative Self-Aggregation (Li et al., 2025) demonstrates that a single model generating a new response from all its own prior candidates as context outperforms self-consistency (majority voting among prior candidates) on multiple benchmarks.[6] On GPQA with Llama 3 8B, GSA achieves 35.04% vs. self-consistency's 33.26% -- and does so with one fewer candidate (3 vs. 4) due to budget standardization. On Alpaca-Eval with GPT-4o Mini, GSA achieves 55.85% vs. the greedy baseline's 47.99% (+7.86 pp), a task where self-consistency cannot be applied at all.

The critical GSA finding for P2: when all three candidate responses are incorrect, GSA can still synthesize a correct answer by combining correct intermediate steps from wrong final answers. Choose-from-N (the discriminative analog) cannot do this by construction. This demonstrates that the synthesis layer adds value that cannot be replicated by deliberation alone.

4.3 Scaling Analysis

4.3.1 Self-MoA -- Quality Coefficient Exceeds Diversity Coefficient

Li et al. (2025) conduct a regression analysis of MoA performance as a function of proposer quality and cross-model diversity.[8] The quality coefficient consistently exceeds the diversity coefficient across all three benchmarks: alpha=2.558 vs. beta=1.841 on MMLU; alpha=4.548 vs. beta=1.421 on CRUX; alpha=4.719 vs. beta=2.839 on MATH. The linear models explain 69-77% of MoA performance variance.

Benchmark | Alpha (Quality Coeff.) | Beta (Diversity Coeff.) | Ratio Alpha/Beta | R-squared
MMLU | 2.558 +/- 0.176 | 1.841 +/- 0.176 | 1.39 | 0.771
CRUX | 4.548 +/- 0.459 | 1.421 +/- 0.459 | 3.20 | 0.685
MATH | 4.719 +/- 0.416 | 2.839 +/- 0.416 | 1.66 | 0.760

Table 5. Self-MoA regression of MoA performance on quality and diversity. Source: Li et al. (2025), Table 4.

The practical implication: Self-MoA (repeated sampling from the best single model) outperforms Mixed-MoA (diverse models) by 6.6 percentage points on AlpacaEval 2.0 (65.7% vs. 59.1%). The aggregator's quality is the primary driver of MoA performance, not the diversity of the proposer pool -- which inverts the intuitive "more diversity is better" assumption.
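Under the fitted linear model, the quality/diversity trade-off can be read directly off the coefficients. A sketch using the Table 5 values (illustrative use only -- the original regression includes an intercept and error term not reproduced here):

```python
# Coefficients (alpha, beta) from Table 5, Self-MoA regression.
COEFFS = {
    "MMLU": (2.558, 1.841),
    "CRUX": (4.548, 1.421),
    "MATH": (4.719, 2.839),
}

def predicted_gain(benchmark: str, d_quality: float, d_diversity: float) -> float:
    """Predicted change in MoA performance under the fitted linear model
    performance ~ alpha * quality + beta * diversity (intercept omitted)."""
    alpha, beta = COEFFS[benchmark]
    return alpha * d_quality + beta * d_diversity
```

On CRUX, for example, a unit gain in proposer quality is predicted to be worth roughly 3.2 unit gains in diversity -- the alpha/beta ratio from the table.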

5. Evidence Against or Complicating the Thesis

The case for synthesis layer primacy is substantial but not uncontested. Six bodies of evidence complicate or qualify the thesis.

5.1 Multi-Agent Failure Modes Are Dominated by Specification, Not Synthesis

Cemri et al. (2025) analyze 1,600+ annotated traces from seven deployed multi-agent systems using grounded theory methodology (Cohen's Kappa = 0.88 inter-annotator agreement).[9] The resulting MAST failure taxonomy shows:

Complication -- Failure Mode Distribution

Only 21.30% of multi-agent failures fall under task verification (the category most relevant to synthesis layer failure). 78.70% of failures occur upstream: specification design (41.77%) and inter-agent coordination (36.94%). Optimizing the synthesis layer leaves the dominant failure modes untouched. Source: Cemri et al. (2025), arXiv:2503.13657.

This finding does not falsify P2 directly -- P2 is about output quality in successful debates, not about system failure rates. But it is a caution: the path to better Cabinet outputs may run through specification quality (better task decomposition, better persona prompts) or inter-agent dynamics (better sycophancy detection) rather than the Umpire exclusively. Even if the synthesis layer has primacy over round count, it may not have primacy over all other design dimensions.

5.2 MAD Protocols Do Not Reliably Beat Self-Consistency Out of the Box

Smit et al. (2024) benchmark seven strategies across seven datasets using GPT-3.5-turbo.[10] No single protocol dominates all datasets. Self-Consistency achieves 0.78 on MMLU, outperforming all tested MAD protocols (Society of Mind: 0.73, ChatEval: 0.71, Multi-Persona: 0.72). On GPQA, even the best MAD result (0.32) barely exceeds single-agent performance (0.33). The headline conclusion: "MAD approaches currently do not outperform other ensembling methods such as Medprompt and self-consistency using their original implementations."

The critical nuance is that MAD protocols are substantially more hyperparameter-sensitive than non-debate approaches. Agreement intensity -- prompting agents to agree with each other X% of the time -- is the most powerful lever discovered. With proper tuning, Multi-Persona goes from worst-performing to best-performing protocol (+15 pp on USMLE subset). This means that the synthesis layer's quality may be ceiling-limited by the quality of the debate it receives, and the debate's quality is highly tuning-dependent.

5.3 "Stop Overvaluing MAD" -- Homogeneous Debate Fails to Beat CoT

A 2025 paper (arXiv:2502.08788) argues that multi-agent debate fails to outperform chain-of-thought or self-consistency even with more compute, and that model heterogeneity is the "universal antidote" to MAD's failures rather than any aggregation mechanism.[11] This claim directly contests the synthesis layer primacy thesis by suggesting that the quality of inputs (agent heterogeneity) matters more than the quality of the synthesis pass.

The tension between this finding and the Self-MoA result (Section 4.3.1) is instructive. Self-MoA shows that repeated sampling from a single top model beats diverse model mixing by 6.6 pp on AlpacaEval. "Stop Overvaluing MAD" argues for heterogeneity. These findings are reconcilable: heterogeneity is valuable when all models are roughly equally capable, but when one model is substantially stronger, concentrating on that model's quality-weighted outputs outperforms diluting with weaker models. For Cabinet, this means the right aggregation rule may depend on whether the synthesis model is substantially stronger than the debate agents.

5.4 Decision Protocols Are Task-Dependent -- No Universal Winner

Kaesberg (2024) evaluates eight decision protocols across seven benchmark datasets using Llama 3 8B and 70B.[12] The results show clean task-type dependence: consensus protocols win on knowledge-intensive tasks (MMLU, MMLU-Pro, GPQA), while voting and judge protocols win on logic tasks (StrategyQA, MuSR). On MuSR with 8B: Judge achieves 59.3% vs. Majority Consensus's 27.8% -- a 31.5 pp reversal. On MMLU-Pro with 8B: Consensus Average achieves 36.0% vs. Voting Average's 31.1% -- a 4.9 pp advantage for consensus.

Complication -- Task-Dependence of Protocol Effectiveness

No aggregation protocol is universally best. Consensus beats voting on knowledge tasks by ~5 pp; voting/judge beats consensus on logic tasks by up to 31 pp. Cabinet's mixed task portfolio (research queries, analysis, planning) means no single aggregation rule is optimal. Source: Kaesberg (2024), Tables 2-3.

The information access variation finding is also noteworthy for P2: enriching the decision step with confidence scores, full discussion history, or Wikipedia context "had minimal impact -- all task performances within one standard deviation of each other." If decision-step enrichment does not matter much in the Kaesberg study, it raises the question of whether confidence weighting at the Umpire step will have the effect size P2 hypothesizes.

5.5 Representational Collapse -- Run-to-Run Variance Equals Protocol Differences

Recent representational collapse research documents cosine similarity of 0.888 and effective rank of 2.17/3.0 in multi-agent debate outputs, with 1-3 point run-to-run variance.[7] This noise floor means that many reported protocol differences in the literature -- including some of the effect sizes cited in Section 4 -- may not be reproducible in a specific deployment context. If Cabinet's between-session variance is 2 points and the aggregation rule effect is 3 points, the experiment needs very large sample sizes to detect the effect reliably.
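A back-of-envelope power calculation shows what the noise floor implies for sample size. The formula below is the standard two-sample normal approximation, not a full power analysis of the ELO/Bradley-Terry design:

```python
from math import ceil
from statistics import NormalDist

def sessions_per_cell(effect: float, noise_sd: float,
                      alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate n per condition for a two-sample comparison:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2, where d = effect / sd.

    A rough sketch under the usual normal-approximation assumptions.
    """
    z = NormalDist()
    d = effect / noise_sd
    n = 2 * ((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2
    return ceil(n)
```

For an effect of d ~ 0.18 (the Chatbot Arena-scale comparison from Section 3.4) this gives roughly 485 sessions per condition; at d = 0.5 it drops to about 63. Note that the relevant standard deviation combines run-to-run variance with human-rater variance, so realized d values in Cabinet's setting may be well below the raw effect-to-noise ratio.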

5.6 Adversarial Conditions Invert the Round Count Prediction

The Nature adversarial study (Kraidia et al., 2026) shows that under adversarial agent conditions, accuracy falls from ~0.5 at round 1 to below 0.2 by round 3, continuing to decay toward 0.1 at round 9.[13] Adding more debate rounds does not mitigate adversarial persuasion. While Cabinet's SYCOPHANCY_NUDGE injection is designed to prevent this degradation, the adversarial finding is a cautionary bound: maximum round counts above 3-4 carry real quality risks that are not offset by a better synthesis layer alone.

6. Round Count Effects and Diminishing Returns

This section directly addresses the P2 comparison variable: what does the literature say about the marginal value of additional debate rounds?

6.1 The D3 Stopping Analysis -- Most Debates Converge by Round 2

The most rigorous data on round count effects comes from D3's SAMRE stopping analysis of 1,200 evaluations.[4] Using a budgeted stopping rule (stop when score gap converges or budget is exceeded), the round distribution is:

Round 1: | 13.0% [=====]
Round 2: | 45.0% [==================]
Round 3: | 24.8% [=========]
Round 4: | 11.8% [====]
Round 5: |  5.3% [==]
                     (total: 58% stopped by round 2)
Mean rounds completed: 2.71 out of max 5

Forcing continuation beyond convergence changed verdicts in only 6% of cases, and primarily in tie scenarios. The token reduction from budgeted stopping was 40% vs. fixed 5-round protocols, while maintaining 92% of the accuracy achieved with unlimited rounds. The practical implication: for Cabinet's default 3-4 round presets, many debates are already past their informational convergence point by round 2.
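A budgeted stopping rule of the kind D3 describes can be sketched as follows. The convergence test and epsilon threshold here are illustrative; D3's exact criterion differs in detail:

```python
def budgeted_stop(score_gaps: list, max_rounds: int = 5,
                  epsilon: float = 0.5) -> int:
    """Return the (1-indexed) round at which a budgeted stopping rule halts:
    stop once the between-agent score gap stabilizes (changes by less than
    epsilon between consecutive rounds) or the round budget is exhausted.

    `score_gaps[r]` is the judge's score gap after round r+1. Illustrative
    sketch, not D3's actual stopping criterion.
    """
    for r in range(1, min(len(score_gaps), max_rounds)):
        if abs(score_gaps[r] - score_gaps[r - 1]) < epsilon:
            return r + 1  # gap converged at this round
    return min(len(score_gaps), max_rounds)  # budget exhausted
```

Applied to Cabinet, a rule like this would replace a fixed max_turns with an adaptive cutoff driven by the judge layer's existing scores, capturing the ~40% token savings D3 reports without architectural change.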

6.2 MoA Layer Effects -- Diminishing Returns Begin at Layer 2

The MATH task data from MoA (Wang et al., 2024) provides a direct scaling analysis.[1]

Aggregator              Layer 1   Layer 2 (gain)    Layer 3 (gain)
Mixtral-8x22B           0.282     0.534 (+0.252)    0.556 (+0.022)
Llama-3-70B-Instruct    0.456     0.584 (+0.128)    0.578 (-0.006)
dbrx-instruct           0.314     0.456 (+0.142)    0.522 (+0.066)

The pattern is consistent: "the first response aggregation has the most significant boost on generation quality." Layer 3 gains are an order of magnitude smaller than layer 2 gains, and for Llama-3-70B-Instruct the third layer slightly degrades performance; only dbrx-instruct still gains at layer 3, and by a smaller margin.

6.3 MAD Round Count Effects -- Small and Plateau Quickly

The MAD benchmark study shows Multi-Persona with 2, 3, or 4 rounds giving 0.68, 0.71, and 0.72 on MedQA respectively -- a total gain of only +0.04 over two additional rounds, with most of that gain occurring in the first additional round.[10] Society of Mind at 4 agents with 2 rounds ($3.46 cost) achieves 0.73; the same system at 3 rounds ($5.19 cost) achieves 0.72 -- spending 50% more for no gain.

6.4 GroupDebate and Cabinet's max_turns Cap

The CPH paper notes that "structured debate improves accuracy over two to four rounds but risks Degeneration-of-Thought, where agents converge to a shared wrong answer through social pressure."[5] Cabinet's SYCOPHANCY_NUDGE is designed to counter this, but the cap at 5 max_turns is consistent with the literature's recommendation of 2-4 optimal rounds. The practical budget guidance across the literature is "two to three interaction rounds."

6.5 The Adversarial Round Degradation -- Nature Study

Under adversarial agent conditions, rounds do not merely plateau -- they actively hurt. Accuracy falls from ~0.5 at round 1 to below 0.2 by round 3, decaying further to approximately 0.1 at rounds 4-9.[13] The paper describes "steep curves down by the third round." While Cabinet's context is not adversarial in the same sense (all agents are instructed to be helpful), the sycophancy detection mechanism suggests the architects were aware of convergence-to-wrong-answer risks that are structurally similar.

6.6 Score Gap Trajectories by Task Type (D3 SAMRE)

D3's per-task analysis of score gap trajectories provides the most granular data on when additional rounds contribute information.[4] Coding tasks: gap peaks at ~20 points by round 2, stable plateau through round 5. Reasoning and ethics tasks: moderate, steady convergence from ~5 to ~10-11 points by rounds 3-5. Writing, roleplay, math: tight gaps (0-6 points), early plateau. For Cabinet's most common task types (research, analysis, planning), the expected gap convergence pattern suggests most deliberative information is captured by rounds 2-3, with rounds 4-5 adding minimal additional signal.

Task Type             Score Gap Range            Convergence Round   Cabinet Analog
Coding / Factual QA   Peak ~20 pts by round 2    Round 2             Fact-checking tasks
Reasoning / Ethics    8-11 pts, steady           Rounds 3-5          Analysis, complex QA
Writing / Roleplay    0-6 pts, tight             Round 2             Planning, creative
Math                  0-6 pts, tightest          Rounds 1-2          Quantitative analysis

Table 6. Score gap trajectory by task type across up to 7 SAMRE debate rounds. Source: Harrasse et al. (2026), Section G.4.

6.7 Synthesis of Round Count Evidence

Across all sources, the round count evidence converges on three findings: (1) the first aggregation pass yields the largest quality gain; (2) most debates reach informational convergence by round 2-3; and (3) forced continuation beyond convergence adds minimal quality while carrying adversarial and Degeneration-of-Thought risks.[17] In Cabinet's specific architecture, sweeping round count from 1 to 4 moves from inadequate deliberation (round 1 misses the first cross-agent challenge) into diminishing-returns territory (rounds 3-4 capture only late marginal signal). The effect size of moving from 2 rounds to 4 rounds should be measurably smaller than the effect size of moving from plain synthesis to evidence-weighted synthesis -- though the experimental test in Section 8 is needed to confirm this quantitatively.

7. The Confidence-Weighted Synthesis Question

One of P2's three aggregation conditions is confidence-weighted synthesis -- an Umpire that uses judge scores (disagreement, evidence quality) to weight agent contributions differentially. This section examines what the literature says about the practical value of confidence signals in multi-agent aggregation.

7.1 DebUnc -- The Theoretical-Practical Gap

DebUnc (Yoffe et al., 2025) is the most rigorous study of uncertainty-metric-guided debate.[14] The study tests Mean Token Entropy and TokenSAR as uncertainty metrics across 3-agent 3-round debates using Mistral-7B-Instruct-v0.2. Key findings:

The DebUnc Gap

Theoretical upper bound for confidence-weighted synthesis: +14 pp (Oracle AUROC = 1.0). Practical gain from current best uncertainty metrics: +2 pp (Entropy AUROC = 0.64). The 12 pp gap is not a design failure -- it reflects the fundamental difficulty of calibration. Current LLMs express "high confidence regardless of accuracy," meaning textual confidence signals (1-10 scale) are noisier than attention-based signals. Source: Yoffe et al. (2025), Tables 1, 4.

The architecture of confidence weighting matters significantly. DebUnc tests three communication methods: Confidence in Prompt (text-based confidence scores), Attention-Others (re-scale attention to other agents' tokens), and Attention-All (re-scale attention to all tokens including own). Attention-All performs best, and has a performance-AUROC slope of 0.59 -- meaning each unit improvement in the uncertainty metric's AUROC translates to a 0.59 unit improvement in accuracy. Confidence-in-Prompt has a slope of only 0.17. For Cabinet's Umpire, which operates at the prompt level rather than the attention level, the applicable gain function is closer to the shallow 0.17 slope than the steeper 0.59 slope.
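At the prompt level, confidence weighting reduces to summing per-agent confidences over candidate answers. A minimal sketch follows; the (answer, confidence) pair representation is an assumption for illustration, not DebUnc's actual interface:

```python
from collections import defaultdict

def confidence_weighted_answer(responses):
    """Select a final answer by confidence-weighted voting.

    responses: list of (answer, confidence) pairs with confidence in [0, 1].
    When confidences are poorly calibrated (AUROC near 0.5), the weights
    are nearly uniform and this degenerates toward plain majority voting,
    which is why the practical gain over standard debate is small.
    """
    weights = defaultdict(float)
    for answer, confidence in responses:
        weights[answer] += confidence
    return max(weights, key=weights.get)
```

A single high-confidence dissenter can outvote two low-confidence agents only if the confidence signal is trustworthy, which is precisely the calibration question DebUnc raises.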

7.2 Cabinet's Existing Judge Scores as Confidence Signal

Cabinet already generates per-round judge scores (disagreement 0-10, evidence 0-10, sycophancy flag) that are available in the synthesis prompt. These scores are a form of structured per-round annotation unavailable in most systems studied in the literature. The question is whether including explicit instructions to the Umpire to weight high-evidence-score rounds more heavily would have a material effect.

The Kaesberg study's information-access variation finding is a caution: enriching the decision step with confidence scores, logprob confidence, or consistency confidence had no statistically significant effect -- "all task performances within one standard deviation of each other."[12] However, this study uses Llama 3 8B/70B with relatively weak confidence calibration. Cabinet's judge model runs at temperature 0.1 with structured JSON output, which may provide better-calibrated scores than the confidence estimators tested in DebUnc and Kaesberg. This is ultimately an empirical question that the P2 experiment would answer.
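One concrete form of the A2 condition is to annotate each round with its judge scores before synthesis. The sketch below is hypothetical: the field names and instruction wording are assumptions for illustration, not Cabinet's actual schema or Umpire prompt:

```python
def build_weighted_synthesis_prompt(rounds):
    """Prepend judge-score annotations to each round of the transcript.

    rounds: list of dicts with 'text', 'evidence_score', and
    'disagreement_score' keys (hypothetical field names).
    """
    parts = [
        "Synthesize the debate below. Weight rounds with higher evidence "
        "scores more heavily; treat low-evidence rounds as weakly supported."
    ]
    for i, r in enumerate(rounds, start=1):
        parts.append(
            f"[Round {i} | evidence={r['evidence_score']}/10 | "
            f"disagreement={r['disagreement_score']}/10]\n{r['text']}"
        )
    return "\n\n".join(parts)
```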

7.3 MedARC -- Confidence-Aware Aggregation Has Independent Contribution

MedARC (Miao et al., 2025) implements two mechanisms: structured inter-agent summarization and confidence-aware aggregation.[15] The ablation studies confirm that both modules contribute "independently" to performance improvements on PubMedQA (DeepSeek-V3 backbone: zero-shot baseline 72.9% to MedARC full 77.2%, +4.3 pp absolute). The confidence-aware aggregation module weights final answers by agent confidence and reasoning quality, rather than equal-weighting all contributions.

The MedARC finding is consistent with the DebUnc oracle: when confidence signals are available and reasonably calibrated for the domain (medical QA), confidence-weighted aggregation does add measurable value. The question for Cabinet is whether Cabinet's judge scores are calibrated well enough to replicate this effect in general-domain QA and analysis tasks.

7.4 ReConcile and Collaborative Calibration

The broader confidence calibration literature (beyond DebUnc and MedARC) supports a conditional view: confidence weighting improves performance when agents are well-calibrated for the specific task domain, and degrades or adds noise when calibration is poor. LLMs are systematically overconfident on factual claims, systematically underconfident on subjective evaluation, and differentially calibrated by model family. For Cabinet's heterogeneous agent configurations (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro across different personas), confidence signal quality will vary by model and task type in ways that are difficult to predict a priori.

7.5 Council Mode's Four-Phase Synthesis as a Non-Confidence Alternative

Council Mode's structured synthesis -- which explicitly identifies consensus claims, disagreement claims, unique findings, and then synthesizes -- achieves 32.7% better hallucination performance than majority vote without any confidence weighting.[2] The synthesis model must identify "why experts disagree, not just that they disagree" and "preserve unique insights that might be lost in a voting scheme." This suggests that structural synthesis sophistication (what the Umpire does with the transcript) may be more tractable than confidence calibration (how it weights inputs), and potentially more impactful given current confidence signal quality.
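The four-phase structure can be represented as a simple output schema. The partition rule below (consensus = all agents, unique = one agent, contested = otherwise) is a crude counts-based approximation of Council Mode's description; the paper's exact format is not specified in the sources:

```python
from dataclasses import dataclass, field

@dataclass
class StructuredSynthesis:
    consensus_claims: list = field(default_factory=list)     # asserted by all agents
    disagreement_claims: list = field(default_factory=list)  # contested between agents
    unique_findings: list = field(default_factory=list)      # asserted by one agent only
    synthesis: str = ""                                      # final resolved analysis

def partition_claims(claim_support, n_agents):
    """claim_support: dict mapping claim -> number of agents asserting it.

    Real disagreement detection requires identifying conflicting positions,
    not just partial support; this counts-based rule is a simplification.
    """
    out = StructuredSynthesis()
    for claim, n in claim_support.items():
        if n == n_agents:
            out.consensus_claims.append(claim)
        elif n == 1:
            out.unique_findings.append(claim)
        else:
            out.disagreement_claims.append(claim)
    return out
```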

8. Proposed Experimental Design for Cabinet

8.1 Design Overview

We propose a fully crossed factorial design with three aggregation rule conditions and four round count conditions, evaluated by blind human raters using ELO-style pairwise comparison. The design follows the methodology of Chatbot Arena (Bradley-Terry model)[19] adapted for Cabinet's session structure.

Factor: Aggregation Rule (A)
  Levels:
    A1: Plain synthesis (current default)
    A2: Evidence-weighted structured synthesis
    A3: Majority-position synthesis
  Implementation: A1 uses the current Umpire system prompt. A2 uses a modified prompt that instructs the Umpire to weight rounds with higher evidence_score more heavily and to separately identify consensus vs. contested claims (Council Mode structure). A3 uses a modified prompt that identifies the most commonly expressed position and synthesizes from that majority view.

Factor: Round Count (R)
  Levels:
    R1: 1 round
    R2: 2 rounds
    R3: 3 rounds
    R4: 4 rounds
  Implementation: Set max_turns to 1, 2, 3, 4 respectively. Hold turn_strategy = "parallel_initial_then_round_robin" constant.

Table 7. Factorial design structure for P2 experiment. 3 x 4 = 12 conditions.
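The 12-cell grid can be enumerated directly. In the sketch below, the condition-dictionary keys are illustrative; only max_turns and turn_strategy correspond to Cabinet parameters named in this document:

```python
from itertools import product

AGGREGATION_RULES = ("A1_plain", "A2_evidence_weighted", "A3_majority_position")
ROUND_COUNTS = (1, 2, 3, 4)

conditions = [
    {
        "aggregation_rule": rule,
        "max_turns": rounds,
        "turn_strategy": "parallel_initial_then_round_robin",  # held constant
    }
    for rule, rounds in product(AGGREGATION_RULES, ROUND_COUNTS)
]
# 3 aggregation rules x 4 round counts = 12 factorial cells
```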

8.2 Task Stratification

Given the Kaesberg finding of strong task-type dependence, sessions must be stratified across task types. We recommend four task strata:

  1. Factual/research QA (e.g., "What is the current evidence on X?") -- analogous to knowledge-intensive MMLU tasks where consensus protocols tend to dominate
  2. Analysis and synthesis (e.g., "Evaluate the trade-offs of X vs. Y") -- analogous to reasoning tasks
  3. Planning and decision-making (e.g., "What should I do about X?") -- Cabinet's planning persona tasks
  4. Subjective evaluation (e.g., "Is this argument convincing?") -- tasks where persona diversity matters most per D3

Each factorial cell should contain at least 30 sessions per task stratum (30 x 4 = 120 sessions per cell). With 12 cells and 4 strata, this yields 1,440 total sessions. This is large but necessary given the noise floor problem (Section 5.5).

8.3 Power Analysis

The minimum detectable effect size drives sample size. Using the Chatbot Arena literature as a reference[19]: GPT-4 vs. Claude-v1 (321 pairs) yields Cohen's d = 0.18, p = 0.0265. This is statistically significant but the study is arguably underpowered for practical decision-making (the confidence interval spans effects that would and would not affect product decisions).

We target detection of Cohen's d = 0.25 (small-to-medium effect) with power = 0.80 and alpha = 0.05 (two-tailed). For a two-sample t-test comparison of ELO scores, this requires approximately 253 observations per condition. For a 12-cell design, this yields approximately 3,036 total pairwise evaluations.
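The per-group figure can be reproduced with a standard normal-approximation power calculation; the exact noncentral-t computation used by power-analysis software gives 253, and the approximation below gives 252:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample t-test.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05, two-tailed
    z_beta = z.inv_cdf(power)           # ~0.84 for power = 0.80
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)
```

n_per_group(0.25) returns 252, within one observation of the quoted 253.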

However, pairwise ELO evaluation is more efficient than independent-sample comparisons because each comparison produces information about two conditions simultaneously. Using a Bradley-Terry model with pairwise comparisons, 1,440 sessions generating approximately 2,160 pairwise comparisons (each session is paired against 1.5 sessions on average in a round-robin design) should provide adequate power for the primary comparisons.
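The Bradley-Terry strengths can be fit with the classic minorization-maximization iteration. A minimal sketch, assuming every condition has at least one win and no ties:

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins: dict mapping (winner, loser) -> number of comparisons won.
    Returns positive strengths normalized to sum to the number of items;
    P(i beats j) is modeled as p_i / (p_i + p_j).
    """
    items = sorted({k for pair in wins for k in pair})
    games = {}  # symmetric comparison counts per unordered pair
    for (i, j), w in wins.items():
        key = frozenset((i, j))
        games[key] = games.get(key, 0) + w
    p = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for key, g in games.items():
                if i in key:
                    (j,) = key - {i}  # the opponent in this pairing
                    denom += g / (p[i] + p[j])
            new_p[i] = total_wins / denom
        scale = len(items) / sum(new_p.values())
        p = {i: v * scale for i, v in new_p.items()}
    return p
```

With A beating B 3 times and B beating A once, the fit recovers a 3:1 strength ratio, matching the maximum-likelihood win probability of 3/4.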

8.4 Evaluation Protocol -- Blind Human Evaluation

Raters: Minimum 5 blind human raters per comparison, drawn from the target user population (graduate students, researchers, business analysts). Raters are shown two Cabinet outputs (labeled A and B, randomized order) and asked to choose the better response on two dimensions: factual quality and overall helpfulness. Position effects are controlled by counterbalancing (some raters see the same pair in reverse order).

ELO scoring: Initial ELO scores set to 1000 for all 12 conditions. After each pairwise comparison, ELO scores update using the standard formula with K=32. Final ELO scores after all comparisons provide a cardinal ranking of all 12 conditions.
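The K=32 online update described above is the standard ELO formula; a minimal sketch (note that Chatbot Arena's published rankings ultimately use a Bradley-Terry fit rather than order-dependent online ELO):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Update two ELO ratings after one pairwise comparison.

    score_a: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Starting both conditions at 1000, a single win for A moves the pair to (1016.0, 984.0).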

Blind protocol: Raters are not told which condition produced which output. Cabinet-specific formatting cues (debate breakdown structure, watchpoints) that might reveal round count are stripped from the comparison views. Only the final synthesis output (THE VERDICT or THE PLAN equivalent) is shown.

8.5 Primary Statistical Analysis

Main effect of Aggregation Rule (P2 primary test): Two-way ANOVA on ELO scores with Aggregation Rule (3 levels) and Round Count (4 levels) as factors, plus their interaction. The P2 hypothesis predicts that the main effect of Aggregation Rule will have a larger eta-squared (variance explained) than the main effect of Round Count -- consistent with debate collapse research showing that system-level synthesis loss dominates over intra-agent and inter-agent loss components in multi-agent hierarchies.[16]
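The eta-squared comparison reduces to ratios of sums of squares from the ANOVA decomposition; the numbers in the usage check below are illustrative, not predictions:

```python
def eta_squared(ss):
    """Compute eta^2 for each term of an ANOVA decomposition.

    ss: dict of sums of squares, e.g. {'aggregation': ..., 'rounds': ...,
    'interaction': ..., 'residual': ...}. eta^2 = SS_term / SS_total.
    The P2 prediction is eta2['aggregation'] > eta2['rounds'].
    """
    ss_total = sum(ss.values())
    return {term: value / ss_total for term, value in ss.items()}
```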

Effect size comparison: Cohen's d between the best and worst aggregation rules (A2 vs. A3, predicted) and between the best and worst round counts (R2 vs. R1, predicted as the primary contrast). P2 holds if d(A) > d(R) and the confidence intervals of the two effect sizes do not overlap.

Interaction testing: If Aggregation Rule x Round Count interaction is significant, the claim needs qualification: the synthesis layer may have primacy only at certain round counts. The interaction plot would reveal whether confidence-weighted synthesis has differential effects at 1 vs. 4 rounds.

Stratified analysis: Run all analyses separately for each of the four task strata, following the Kaesberg finding that optimal aggregation protocols are task-type dependent. Report stratum-specific effect sizes and confidence intervals.

8.6 Control Variables

Agent models: 3 agents (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro). Rationale: holds diversity constant; all are production-quality models currently in Cabinet presets.
Synthesis model: GPT-5.4 (consistent with Council Mode methodology). Rationale: only the prompt/aggregation rule varies, not the model used for synthesis.
Temperature: 0.7 for all agents; 0.1 for the judge. Rationale: Cabinet defaults.
turn_strategy: parallel_initial_then_round_robin. Rationale: all conditions start with the same parallel first-round visibility.
Task prompts: identical across conditions. Rationale: each task prompt is run in all 12 factorial cells.
Persona: Debate persona for rounds 1-4; Planning persona for planning tasks. Rationale: consistent with current Cabinet preset behavior.

Table 8. Control variables for P2 factorial experiment.

8.7 Metrics Hierarchy

  1. Primary: ELO-derived win rate per condition (Bradley-Terry model)
  2. Secondary: Factual accuracy on a held-out reference set (for task strata 1 and 2 where ground truth is available)
  3. Secondary: Self-reported user satisfaction (1-5 scale, same raters)
  4. Exploratory: Judge disagreement score distributions across round count conditions (do additional rounds increase or decrease judge-assessed disagreement?)
  5. Cost tracking: Token count and latency per condition (to compute cost-adjusted effect sizes)

8.8 Implementation Timeline

The proposed experiment requires two implementation phases. Phase 1 (weeks 1-2): Implement the three Umpire system prompt variants in a feature branch. Instrument Cabinet to log round count, judge scores per round, and synthesis model calls. Generate a seed set of 30 task prompts per stratum (120 total), reviewed for difficulty balance. Phase 2 (weeks 3-6): Run all 120 prompts x 12 conditions = 1,440 sessions. Conduct blind pairwise evaluation (2,160 comparisons, 5 raters each = 10,800 individual ratings). Analyze results using the statistical protocol in Section 8.5.

9. Implications for Cabinet's Product Roadmap

9.1 If P2 Holds (Aggregation Rule Effect > Round Count Effect)

If the P2 experiment confirms synthesis layer primacy, the roadmap implications are substantial. The highest-ROI engineering investment would be redesigning the Umpire's system prompt to implement structured evidence-weighted synthesis along the lines of Council Mode's four-phase output structure. This involves: (1) an explicit consensus extraction step (claims supported by all agents), (2) a disagreement mapping step (claims where agents conflict, with both positions preserved), (3) a unique-findings step (claims appearing only in one agent's contribution), and (4) a comprehensive analysis that resolves disagreements using judge evidence scores as weights.

A secondary finding -- if A2 (evidence-weighted) substantially outperforms A3 (majority-position) -- would validate maintaining the generative Umpire approach rather than moving to a voting-based aggregation. The literature strongly predicts this outcome (Council Mode's 32.7% hallucination rate increase from structured synthesis to majority vote), but the Cabinet-specific confirmation would justify the inference.

If P2 holds, max_turns optimization becomes less urgent. The current 3-4 round presets are in the diminishing-returns zone by the literature's analysis, and reducing to 2-3 rounds for most task types would save latency and cost without meaningful quality loss. This frees up compute budget for a more capable synthesis model or richer judge annotation.

9.2 If P2 Does Not Hold (Round Count Effect >= Aggregation Rule Effect)

If round count has an equal or larger effect, the roadmap inverts. The priority becomes smarter round management: task-adaptive stopping rules (stop when judge disagreement score drops below threshold, analogous to DebUnc's convergence detection), early-exit protocols for high-consensus sessions, and possibly a dynamic round count that allocates more rounds to high-disagreement sessions and fewer to low-disagreement ones. The D3 SAMRE budgeted stopping mechanism is the engineering template.

If the round count effect is larger, it also suggests the current Umpire synthesis is extracting most of the available signal from any given transcript length, and the quality ceiling is set more by debate depth than by synthesis quality. In this case, the product direction would be to increase the information available to the Umpire by improving the debate itself.

9.3 Task-Type Conditional Conclusions

Given the strong task-type dependence found in Kaesberg (2024), Cabinet should expect a stratified result even if P2 holds on average. For planning tasks (Cabinet's Planning persona), the literature suggests that consensus aggregation is especially valuable, making the aggregation rule difference larger. For analytical reasoning tasks, the round count effect may be relatively stronger. A product roadmap that treats aggregation rule as a default-configurable parameter per preset (similar to how models are already configured) is likely to outperform a one-size-fits-all solution.

9.4 The Sycophancy Nudge Interaction

Cabinet's SYCOPHANCY_NUDGE is triggered by the judge's sycophancy flag. If confidence-weighted synthesis (A2) uses the evidence score as a weighting signal, there is a potential interaction: rounds with high sycophancy (where agents agreed too quickly) will have lower evidence scores, and the Umpire will down-weight them. This is a feature, not a bug -- it means the Umpire's confidence weighting provides a second layer of sycophancy correction beyond the NUDGE injection. The experiment in Section 8 should track sycophancy flag rates across round count conditions to test whether this interaction is active.

10. Limitations

This section documents what the literature does not prove and what the proposed experiment would not resolve. These are substantive research gaps, not rhetorical hedges.

10.1 The Financial Domain Transfer Problem

The largest quantitative effect sizes cited in support of P2 -- the 15-30% Sharpe reduction from removing coordination vs. 5-8% variance from model substitution -- come from financial multi-agent systems (FinCon, TradingAgents) in a context that is structurally different from Cabinet's open-ended QA and analysis tasks.[5] Financial trading agents optimize for a single, continuous, measurable outcome (Sharpe ratio) under a well-defined objective function. Cabinet's tasks produce outputs evaluated by human preference -- a fundamentally noisier, multi-dimensional signal. The magnitude of coordination effects in financial settings may systematically overstate what is achievable in Cabinet's context. The CPH paper explicitly acknowledges this: "CPH is presented as a falsifiable research hypothesis supported by tiered structural evidence rather than as an empirically validated conclusion."

10.2 The Absence of Consumer Product Testing

No primary study in this review was conducted on a deployed consumer AI product. The benchmarks used -- MT-Bench, AlpacaEval, MMLU, HaluEval, TruthfulQA, PubMedQA -- are curated academic evaluation sets designed to measure specific capabilities in controlled conditions. Consumer product sessions have qualitatively different properties: open-ended, multi-turn, varying task types within sessions, users with different expertise levels, and evaluation criteria that include factors like tone, response length, and citation style that are not captured in benchmark accuracy scores. The effect sizes observed on academic benchmarks may be systematically different from effect sizes on Cabinet's actual user sessions.

10.3 The Noise Floor Is Not Trivially Solved by More Data

The 1-3 point run-to-run variance documented in representational collapse research is a structural property of the experimental setup, not an artifact that more data can eliminate.[7] This variance arises from non-determinism in LLM generation (even at temperature 0.7), prompt sensitivity to exact wording, and variability in how models interpret ambiguous task framings. For Cabinet specifically, the parallel-first-round design amplifies initial variance: each session's first-round outputs are generated independently and stochastically, and later rounds are causally dependent on the first round. Two sessions running the same task prompt with the same conditions will produce meaningfully different debates and thus different synthesis outputs. The P2 experiment must account for this not just in statistical power calculations but in the design of the session-level analysis (fixed-effects for individual task prompts are recommended).

10.4 Confidence Signal Quality Is Not Known for Cabinet's Judge

The DebUnc study uses token-level entropy (AUROC ~0.64) and TokenSAR (AUROC ~0.62) as uncertainty metrics, finding only +2 pp practical gain over standard debate because calibration quality is marginal.[14] Cabinet's judge model produces structured disagreement and evidence scores at temperature 0.1 -- a different mechanism that might have better or worse AUROC characteristics. We do not know the AUROC of Cabinet's judge scores because no calibration study has been conducted on them. The P2 experiment would provide one indirect test: if confidence-weighted synthesis (A2) substantially outperforms plain synthesis (A1), it implies the judge scores have meaningful discriminative power. But the reverse is not informative -- poor A2 performance could mean weak judge scores or weak synthesis prompt design.

10.5 The Interaction Between Synthesis Quality and Agent Heterogeneity

The Self-MoA finding (quality coefficient alpha dominates diversity coefficient beta) and the "Stop Overvaluing MAD" argument both suggest that agent heterogeneity is not straightforwardly beneficial.[8][11] Cabinet's presets use 2-4 heterogeneous agents from different model families. If the synthesis model is substantially weaker than the best debate agent, there is a theoretical risk that the synthesis step is a quality floor rather than a quality ceiling -- the Umpire cannot reliably extract the best signal from agents that may individually outperform it. The P2 experiment holds the synthesis model constant (GPT-5.4) to isolate the prompt effect, but a follow-on experiment varying the synthesis model's capability relative to agent capability is warranted.

10.6 The Three-Condition Aggregation Design Incompletely Covers the Space

P2 tests three aggregation rules: plain synthesis, evidence-weighted synthesis, and majority-position synthesis. These represent a subset of the design space. Missing conditions include: hybrid rules that combine structural synthesis with confidence weighting (Council Mode's full four-phase approach); chain-of-thought synthesis that makes the Umpire's reasoning explicit before committing to a verdict; adversarial synthesis that specifically seeks out the strongest challenge to the most popular position; and task-adaptive synthesis that switches protocol based on judge-assessed disagreement intensity. Testing only three conditions means P2's conclusions are bounded to comparisons within that subset -- the globally optimal aggregation rule for Cabinet may be outside the tested conditions.

10.7 Human Rater Agreement May Be Low on Cabinet's Task Types

The D3 study achieves 94.8% positional swap consistency with its pairwise evaluation design.[4] However, D3 evaluates the quality of responses to well-defined benchmark questions with relatively clear quality criteria. Cabinet's open-ended tasks -- "help me think through this decision" or "what are the key risks in this situation?" -- have lower inter-rater agreement by nature. Lower rater consistency inflates the noise floor and requires larger sample sizes to achieve the same statistical power. The proposed design's 5 raters per comparison is the minimum; for Cabinet's task types, 7-10 raters per comparison may be necessary to achieve adequate rater reliability.

11. Open Research Questions

The following propositions are framed as falsifiable hypotheses for future research. Each follows from the analysis in this study but is not directly testable within the P2 experimental design.

P3: Task-Adaptive Round Count Beats Fixed Round Count at Equal Average Token Cost

Based on D3's budgeted stopping result (58% of debates converge by round 2, 92% accuracy retained vs. fixed 5-round protocols at 40% token reduction), a Cabinet variant that stops debate when the judge disagreement score drops below a threshold will achieve equal or better user-rated quality than a fixed 3-round protocol at lower average token cost. Falsification condition: fixed 3-round sessions outperform adaptive-stopping sessions in blind pairwise evaluation, or adaptive sessions have higher per-session cost variance that users find unpredictable.

P4: Sycophancy Nudge Injection Reduces the Value of Additional Rounds

Cabinet's SYCOPHANCY_NUDGE is triggered when the sycophancy flag fires. If the nudge successfully breaks premature consensus, round 2 becomes more informationally valuable (agents genuinely revise positions). If the nudge is ineffective, subsequent rounds are low-signal. P4 proposes: sessions where the sycophancy flag fires in round 1 show a larger quality gain from adding round 2 (measured by judge evidence score) than sessions where no sycophancy flag fires. Falsification condition: no statistically significant difference in round 2 quality gain between flagged and unflagged sessions.

P5: The Synthesis Model's Capability Relative to Agent Capability Is the Binding Constraint on Umpire Quality

Based on Self-MoA's finding that aggregator quality is the primary driver of MoA performance (alpha=4.5 for CRUX vs. beta=1.4 for diversity), Cabinet's Umpire output quality is more sensitive to the synthesis model's capability than to the debate agents' heterogeneity. P5 predicts: holding the synthesis prompt and round count constant, upgrading the synthesis model from the current default to a substantially more capable model produces a larger effect size than upgrading all debate agents to the same substantially more capable model. Falsification condition: debate agent upgrade produces equal or larger effect than synthesis model upgrade on blind human evaluation.

P6: Structured Synthesis Outperforms Confidence Weighting When Judge Score Calibration Is Weak

Based on the DebUnc finding (practical confidence signals yield only +2 pp vs. +14 pp theoretical maximum) and the Council Mode finding (structural synthesis changes -- identifying consensus vs. unique claims -- yield +32.7% hallucination rate improvement without any confidence weighting), Cabinet's A2 condition (evidence-weighted) will underperform a Council-Mode-structured synthesis condition (four-phase output with explicit disagreement mapping) when the judge evidence score AUROC is below 0.75. Falsification condition: evidence-weighted synthesis (A2 as designed) produces equivalent or better user-rated quality to a four-phase structured synthesis variant across all task strata.

P7: The Optimal Number of Agents for a Fixed Token Budget Is Three, Not Two or Four

PartnerMAS finds that performance peaks with approximately 4-5 active agents (with more concentrated opinion diversity), Council Mode finds that three experts balance diversity and computational cost, and the MoA collaborativeness result shows that all tested LLMs improve when given auxiliary responses. Balancing these, P7 predicts that the optimal agent count for Cabinet's 3-round default sessions is three agents, not two (insufficient diversity) or four (diminishing returns relative to cost). Falsification condition: two-agent or four-agent Cabinet sessions outperform three-agent sessions in blind pairwise evaluation at matched token budgets.

P8: The Adversarial Degradation Risk Increases Non-Linearly After Round 3

The Nature adversarial study shows accuracy falling from ~0.5 at round 1 to below 0.2 by round 3 under adversarial conditions.[13] In the non-adversarial case, Cabinet's SYCOPHANCY_NUDGE provides partial protection, but sycophancy and adversarial persuasion are different failure modes that the nudge addresses differently. P8 predicts: for Cabinet sessions where the judge sycophancy flag fires, the quality of the round-4 output will be statistically lower than the quality of the round-3 output, controlling for task type and initial disagreement score. Falsification condition: no quality degradation in round 4 vs. round 3 in sycophancy-flagged sessions.

References

  [1] Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv:2406.04692. Duke University, Together AI, University of Chicago, Stanford University. https://arxiv.org/abs/2406.04692
  [2] Wu, S., Li, X., Feng, Y., Li, Y., & Wang, Z. (2026). Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus. arXiv:2604.02923. https://arxiv.org/abs/2604.02923
  [3] Li, L., Wu, H., Li, Z., Hu, J., Wang, Y., Huang, X., Hua, W., & Wang, W. (2025). PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features. arXiv:2509.24046. https://arxiv.org/abs/2509.24046
  [4] Harrasse, A., Bandi, C., & Bandi, H. (2026). Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation. In Proceedings of EACL 2026, pages 8376-8392. DOI: 10.18653/v1/2026.eacl-long.392. https://aclanthology.org/2026.eacl-long.392/
  [5] Nguyen, P., & Pham, T. (2026). Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness. arXiv:2603.27539. Georgia Institute of Technology / Adobe Inc. https://arxiv.org/abs/2603.27539
  [6] Li, Z., Feng, X., Cai, Y., Zhang, Z., Liu, T., Liang, C., Chen, W., Wang, H., & Zhao, T. (2025). LLMs Can Generate a Better Answer by Aggregating Their Own Responses. arXiv:2503.04104. Georgia Tech / Microsoft Azure / Amazon. https://arxiv.org/abs/2503.04104
  [7] Representational Collapse in Multi-Agent Debate. arXiv:2604.03809. [Cosine similarity 0.888, effective rank 2.17/3.0, 1-3 point run-to-run variance.] https://arxiv.org/abs/2604.03809
  [8] Li, W., Lin, Y., Xia, M., & Jin, C. (2025). Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? arXiv:2502.00674. Princeton University. https://arxiv.org/abs/2502.00674
  [9] Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E., & Stoica, I. (2025). Why Do Multi-Agent LLM Systems Fail? arXiv:2503.13657. UC Berkeley / MIT CSAIL. https://arxiv.org/abs/2503.13657
  [10] Smit, A., Grinsztajn, N., Duckworth, P., Barrett, T. D., & Pretorius, A. (2024). Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs. arXiv:2311.17371. InstaDeep. https://arxiv.org/abs/2311.17371
  [11] Stop Overvaluing Multi-Agent Debate. (2025). arXiv:2502.08788. [MAD fails to outperform CoT/SC; model heterogeneity as universal antidote.] https://arxiv.org/abs/2502.08788
  [12] Kaesberg, L. B. (2024). Decision Protocols in Multi-Agent Large Language Model Conversations. Master's thesis, Georg August University of Göttingen. GippLab. https://gipplab.uni-goettingen.de/wp-content/papercite-data/pdf/kaesberg2024.pdf
  [13] Kraidia, I., Qaddara, I., Almutairi, A., Alzaben, N., & Belhouari, S. B. (2026). When collaboration fails: persuasion driven adversarial influence in multi-agent large language model debate. Scientific Reports. DOI: 10.1038/s41598-026-42705-7. https://www.nature.com/articles/s41598-026-42705-7
  [14] Yoffe, L., Amayuelas, A., & Wang, W. Y. (2025). DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics. arXiv:2407.06426. UC Santa Barbara. https://arxiv.org/abs/2407.06426
  [15] Miao, Y., Wen, J., Luo, Y., & Li, J. (2025). MedARC: Adaptive multi-agent refinement and collaboration for enhanced medical reasoning in large language models. International Journal of Medical Informatics, 206:106136. DOI: 10.1016/j.ijmedinf.2025.106136. https://pubmed.ncbi.nlm.nih.gov/41109093/
  [16] Debate Collapse / System-level Loss. (2026). arXiv:2602.07186. [System-level loss L_sys most important; followed by L_intra, then L_inter.] https://arxiv.org/abs/2602.07186
  [17] Pryzant, R., et al. (2023). Automatic Prompt Optimization with "Gradient Descent" and Beam Search. [Degeneration-of-Thought: agents converging to shared wrong answers.] See also: Shi, F., et al. (2023). Large Language Models Can Be Easily Distracted by Irrelevant Context. Related: Du, Y., et al. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325. https://arxiv.org/abs/2305.14325
  [18] Jacob, A. P., Shen, Y., Farina, G., & Andreas, J. (2023). The Consensus Game: Language Model Generation via Equilibrium Search. [Consensus vs. voting game-theoretic analysis.] arXiv:2310.09139. https://arxiv.org/abs/2310.09139
  [19] Chiang, W., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132. [Bradley-Terry model; 1.37M+ comparisons; Cohen's d ~0.18 between top models.] https://arxiv.org/abs/2403.04132
  19. [19] Chiang, W., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132. [Bradley-Terry model; 1.37M+ comparisons; Cohen's d ~0.18 between top models.] https://arxiv.org/abs/2403.04132