# Appendix: The Synthesis Layer Primacy Thesis
## Study #3 -- Cabinet Research Series -- Kelley School of Business, Indiana University
### April 2026

---

## A. Full Citation Data for All Referenced Papers


> **Reference mapping:** The main study uses inline reference numbers [1]-[19]. The appendix uses descriptive identifiers (A1-A20 plus supplementary entries). The correspondence is: A1=[1] MoA, A2=[2] Council Mode, A3=[3] PartnerMAS, A4=[4] D3, A5=[5] CPH, A6=[6] GSA, A7=[7] Representational Collapse, A8=[8] Self-MoA, A9=[9] MAST/Why MAS Fail, A10=[10] MAD Benchmark, A11=[11] Stop Overvaluing MAD, A12=[12] Decision Protocols, A13=[13] Nature Adversarial, A14=[14] DebUnc, A15=[15] MedARC, A16=[16] Debate Collapse, A16b=[17] Degeneration-of-Thought, A16c=[18] Consensus Game, A17=[19] Chatbot Arena. Entries A18-A20 are supplementary sources referenced indirectly through other papers.


### A1. MoA -- Mixture-of-Agents

**Full citation:**  
Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., & Zou, J. (2024). Mixture-of-Agents Enhances Large Language Model Capabilities.

- **arXiv ID:** arXiv:2406.04692v1 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2406.04692
- **URL:** https://arxiv.org/abs/2406.04692
- **Submitted:** 7 June 2024
- **Affiliations:** Duke University (Wang, Athiwaratkun); Together AI (Wang, Athiwaratkun); University of Chicago (Zhang); Stanford University (Zou)
- **Code:** https://github.com/togethercomputer/moa

---

### A2. Council Mode

**Full citation:**  
Wu, S., Li, X., Feng, Y., Li, Y., & Wang, Z. (2026). Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus.

- **arXiv ID:** arXiv:2604.02923 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2604.02923
- **URL:** https://arxiv.org/abs/2604.02923
- **Submitted:** 3 April 2026
- **Corresponding author:** Shuai Wu (noah.wu@tuta.io)
- **License:** CC BY 4.0
- **Code:** https://github.com/Noah-Wu66/Vectaix-AI (MIT License)
- **Pages:** 13 pages, 8 figures

---

### A3. PartnerMAS

**Full citation:**  
Li, L., Wu, H., Li, Z., Hu, J., Wang, Y., Huang, X., Hua, W., & Wang, W. (2025). PartnerMAS: An LLM Hierarchical Multi-Agent Framework for Business Partner Selection on High-Dimensional Features.

- **arXiv ID:** arXiv:2509.24046v2 [cs.MA]
- **DOI:** https://doi.org/10.48550/arXiv.2509.24046
- **URL:** https://arxiv.org/abs/2509.24046
- **Submitted:** 31 Oct 2025 (v2)
- **Note:** Haolun Wu and Zhenkun Li are equal-contribution corresponding authors.

---

### A4. D3 -- Debate, Deliberate, Decide

**Full citation:**  
Harrasse, A., Bandi, C., & Bandi, H. (2026). Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation.

- **Venue:** Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), Volume 1: Long Papers
- **Pages:** 8376-8392
- **Location:** Rabat, Morocco, March 24-29, 2026
- **Anthology ID:** 2026.eacl-long.392
- **DOI:** 10.18653/v1/2026.eacl-long.392
- **URL:** https://aclanthology.org/2026.eacl-long.392/
- **Code:** https://github.com/abirharrasse/D3-Judge
- **Affiliations:** Harrasse -- Martian / EMINES (UM6P); C. Bandi -- Martian / NUS; H. Bandi -- MIT

---

### A5. CPH -- Coordination Primacy Hypothesis

**Full citation:**  
Nguyen, P., & Pham, T. (2026). Toward Reliable Evaluation of LLM-Based Financial Multi-Agent Systems: Taxonomy, Coordination Primacy, and Cost Awareness.

- **arXiv ID:** arXiv:2603.27539 [cs.MA]
- **DOI:** https://doi.org/10.48550/arXiv.2603.27539
- **URL:** https://arxiv.org/abs/2603.27539
- **Submitted:** 29 March 2026
- **Venue:** Accepted at the DMO-FinTech Workshop, PAKDD 2026, Hong Kong
- **Affiliations:** Nguyen -- Georgia Institute of Technology; Pham -- Adobe Inc.

---

### A6. GSA -- Generative Self-Aggregation

**Full citation:**  
Li, Z., Feng, X., Cai, Y., Zhang, Z., Liu, T., Liang, C., Chen, W., Wang, H., & Zhao, T. (2025). LLMs Can Generate a Better Answer by Aggregating Their Own Responses.

- **arXiv ID:** arXiv:2503.04104v2 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2503.04104
- **URL:** https://arxiv.org/abs/2503.04104
- **Submitted:** 12 Apr 2025 (v2)
- **Affiliations:** Georgia Tech (Li, Feng, Cai, Zhang, Zhao); Microsoft Azure (Liang, Chen); Amazon (Liu); University at Albany (Wang)
- **Code:** https://github.com/zichongli5/Generative-Self-Aggregation

---

### A7. Representational Collapse in Multi-Agent Debate

**Full citation:**  
Representational Collapse in Multi-Agent Debate.

- **arXiv ID:** arXiv:2604.03809
- **URL:** https://arxiv.org/abs/2604.03809
- **Key metrics:** Cosine similarity 0.888, effective rank 2.17/3.0, 1-3 point run-to-run variance

---

### A8. Self-MoA

**Full citation:**  
Li, W., Lin, Y., Xia, M., & Jin, C. (2025). Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

- **arXiv ID:** arXiv:2502.00674v1 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2502.00674
- **URL:** https://arxiv.org/abs/2502.00674
- **Submitted:** 2 Feb 2025
- **Affiliation:** Princeton University

---

### A9. Why MAS Fail -- MAST Framework

**Full citation:**  
Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E., & Stoica, I. (2025). Why Do Multi-Agent LLM Systems Fail?

- **arXiv ID:** arXiv:2503.13657v3 [cs.AI]
- **DOI:** https://doi.org/10.48550/arXiv.2503.13657
- **URL:** https://arxiv.org/abs/2503.13657
- **Submitted:** 26 Oct 2025 (v3)
- **Affiliations:** UC Berkeley (multiple departments); MIT CSAIL

---

### A10. MAD Benchmark -- "Should We Be Going MAD?"

**Full citation:**  
Smit, A., Grinsztajn, N., Duckworth, P., Barrett, T. D., & Pretorius, A. (2024). Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs.

- **arXiv ID:** arXiv:2311.17371v3 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2311.17371
- **URL:** https://arxiv.org/abs/2311.17371
- **Originally submitted:** 29 Nov 2023; last revised 18 Jul 2024
- **Venue:** ICML (International Conference on Machine Learning)
- **Code:** https://github.com/instadeepai/DebateLLM

---

### A11. Stop Overvaluing MAD

- **arXiv ID:** arXiv:2502.08788
- **URL:** https://arxiv.org/abs/2502.08788
- **Year:** 2025
- **Key claim:** MAD fails to outperform CoT/SC even with more compute; model heterogeneity is the "universal antidote"

---

### A12. Decision Protocols (Göttingen Thesis)

**Full citation:**  
Kaesberg, L. B. (2024). Decision Protocols in Multi-Agent Large Language Model Conversations. Master's thesis, Georg August University of Göttingen, GippLab. Supervisor: Jan Philip Wahle. Main Examiner: Prof. Dr. Bela Gipp. Second Examiner: Dr. Terry Lima Ruas.

- **URL:** https://gipplab.uni-goettingen.de/wp-content/papercite-data/pdf/kaesberg2024.pdf
- **Framework:** MALLM (Multi-Agent LLM)
- **Code:** https://github.com/Multi-Agent-LLMs/mallm
- **Date:** December 2024

---

### A13. Nature Adversarial Study

**Full citation:**  
Kraidia, I., Qaddara, I., Almutairi, A., Alzaben, N., & Belhouari, S. B. (2026). When collaboration fails: persuasion driven adversarial influence in multi-agent large language model debate.

- **Journal:** Scientific Reports (Sci Rep)
- **DOI:** https://doi.org/10.1038/s41598-026-42705-7
- **URL:** https://www.nature.com/articles/s41598-026-42705-7
- **Received:** 04 January 2026; Accepted: 27 February 2026; Published: 08 April 2026
- **Corresponding author:** Insaf Kraidia
- **Code:** https://github.com/insafkraidia/Multi-Agent-Large-Language-Model-Debate-MA-LLMD-

---

### A14. DebUnc

**Full citation:**  
Yoffe, L., Amayuelas, A., & Wang, W. Y. (2025). DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics.

- **arXiv ID:** arXiv:2407.06426v2 [cs.CL]
- **DOI:** https://doi.org/10.48550/arXiv.2407.06426
- **URL:** https://arxiv.org/abs/2407.06426
- **Originally submitted:** 8 Jul 2024; revised 22 Feb 2025
- **Affiliation:** University of California, Santa Barbara
- **Code:** https://github.com/lukeyoffe/debunc

---

### A15. MedARC

**Full citation:**  
Miao, Y., Wen, J., Luo, Y., & Li, J. (2025). MedARC: Adaptive multi-agent refinement and collaboration for enhanced medical reasoning in large language models.

- **Journal:** International Journal of Medical Informatics (Int J Med Inform)
- **Volume/Page:** 206:106136
- **DOI:** https://doi.org/10.1016/j.ijmedinf.2025.106136
- **Published online:** 13 October 2025
- **PMID:** 41109093
- **PubMed URL:** https://pubmed.ncbi.nlm.nih.gov/41109093/
- **GitHub:** https://github.com/asdmiao/MedARC
- **Publisher:** Elsevier B.V.
- **Corresponding author:** Jin Li (li.j@nuist.edu.cn), NUIST

---

### A16. Debate Collapse / System-Level Loss

- **arXiv ID:** arXiv:2602.07186
- **URL:** https://arxiv.org/abs/2602.07186
- **Year:** 2026
- **Key finding:** System-level loss (L_sys) most important for accuracy; followed by L_intra, then L_inter


---

### A16b. Degeneration-of-Thought / Automatic Prompt Optimization

**Full citation:**  
Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., & Zeng, M. (2023). Automatic Prompt Optimization with "Gradient Descent" and Beam Search.

- **arXiv ID:** arXiv:2305.03495 [cs.CL]
- **URL:** https://arxiv.org/abs/2305.03495
- **Relevance to study:** Introduces the "Degeneration-of-Thought" concept -- agents in multi-round debate converging to shared wrong answers through social pressure. Referenced in Section 6 as a risk of forced continuation beyond informational convergence.
- **Key finding for P2:** Provides theoretical grounding for why additional debate rounds carry diminishing and potentially negative returns, strengthening the case that aggregation rule quality (not round count) is the more productive lever.
- **Maps to:** HTML Reference [17]

---

### A16c. The Consensus Game

**Full citation:**  
Jacob, A. P., Shen, Y., Farina, G., & Andreas, J. (2023). The Consensus Game: Language Model Generation via Equilibrium Search.

- **arXiv ID:** arXiv:2310.09139 [cs.CL]
- **URL:** https://arxiv.org/abs/2310.09139
- **Relevance to study:** Provides game-theoretic analysis of consensus-seeking vs. voting mechanisms in language model generation. Demonstrates that equilibrium-based consensus outperforms simple voting, supporting the theoretical basis for generative over discriminative aggregation.
- **Key finding for P2:** Validates the prediction in Section 3.3 that majority-position synthesis (discriminative) should underperform generative synthesis approaches. The game-theoretic framework provides formal justification for why the Umpire's generative synthesis is structurally superior to vote-based aggregation.
- **Maps to:** HTML Reference [18]

---

### A17. Chatbot Arena / Bradley-Terry

**Full citation:**  
Chiang, W., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., & Xing, E. P. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.

- **arXiv ID:** arXiv:2403.04132
- **URL:** https://arxiv.org/abs/2403.04132
- **Year:** 2024

---

### A18. Multiagent Debate (Du et al. 2023)

**Full citation:**  
Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate.

- **arXiv ID:** arXiv:2305.14325
- **URL:** https://arxiv.org/abs/2305.14325

---

### A19. FinCon

- System: 7+1 agents; Sharpe 3.26; 114% cumulative return; 18-month evaluation (Jan 2022-Jun 2023)
- Referenced in the CPH paper as [25] -- see CPH (A5) for full context and ablation data

### A20. TradingAgents

- System: 7 agents; Sharpe 5.60-8.21; 23-27% cumulative return; 3 months (Jan-Mar 2024); AAPL, GOOGL, AMZN
- Referenced in the CPH paper as [22] -- see CPH (A5) for full context and ablation data

---

## B. Data Extraction Tables -- All Quantitative Findings by Paper

### B1. Synthesis Layer Effect Sizes (FOR the thesis)

| Study | Metric | No Synthesis Layer | With Synthesis Layer | Effect | Source URL |
|-------|--------|-------------------|---------------------|--------|-----------|
| CPH (FinCon + TradingAgents) | Sharpe ratio change | Remove coord: -15 to -30% | Model swap only: 5-8% variance | 3-6x ratio | https://arxiv.org/abs/2603.27539 |
| Council Mode | Hallucination rate | Majority vote: 14.2% | Structured synthesis: 10.7% | -3.5 pp (-24.6% relative) | https://arxiv.org/abs/2604.02923 |
| Council Mode | Truthful score | Majority vote: 77.3% | Structured synthesis: 82.6% | +5.3 pp | https://arxiv.org/abs/2604.02923 |
| Council Mode | Quality score | Majority vote: 85.4% | Structured synthesis: 91.7% | +6.3 pp | https://arxiv.org/abs/2604.02923 |
| D3 | MT-Bench accuracy | Single juror (no ensemble): 72.5% | Multi-juror k=5: 85.1% | +12.6 pp | https://aclanthology.org/2026.eacl-long.392/ |
| D3 (ablation) | Ensemble contribution | Single juror: 72.5% | 5-juror: 81.3% | +8.8 pp (dominant) | https://aclanthology.org/2026.eacl-long.392/ |
| D3 (ablation) | Persona contribution | Without persona: 81.3% | With persona: 85.1% | +3.8 pp (secondary) | https://aclanthology.org/2026.eacl-long.392/ |
| PartnerMAS | Match rate (SPA upgrade) | gpt-4o-mini SPA: 64.40% | gpt-4.1-mini SPA: 69.03% | +4.6 pp (largest role) | https://arxiv.org/abs/2509.24046 |
| PartnerMAS | Match rate vs. debate MAS | Debate MAS: 60.19% | PartnerMAS: 70.89% | +10.7 pp | https://arxiv.org/abs/2509.24046 |
| Self-MoA | AlpacaEval LC Win Rate | Mixed-MoA: 59.1% | Self-MoA (quality): 65.7% | +6.6 pp | https://arxiv.org/abs/2502.00674 |
| Self-MoA (regression) | Quality coeff. alpha | -- | alpha=2.5-4.7 | Dominates beta=1.4-2.8 | https://arxiv.org/abs/2502.00674 |
| MoA (MATH) | Layer 1 to Layer 2 gain | Layer 1 avg: ~0.39 | Layer 2 avg: ~0.53 | +0.14 (largest step) | https://arxiv.org/abs/2406.04692 |
| MoA (MATH) | Layer 2 to Layer 3 gain | Layer 2 avg: ~0.53 | Layer 3 avg: ~0.56 | +0.03 (diminishing) | https://arxiv.org/abs/2406.04692 |
| GSA vs. Choose-from-N | Correct when all wrong | Choose-from-N: 0% | GSA: >0% | Structural advantage | https://arxiv.org/abs/2503.04104 |
| GSA (AlpacaEval) | GPT-4o Mini | Greedy: 47.99% | GSA: 55.85% | +7.86 pp | https://arxiv.org/abs/2503.04104 |

### B2. Round Count Effects (P2 Comparison Variable)

| Study | Metric | Round Count | Value | Notes | Source URL |
|-------|--------|------------|-------|-------|-----------|
| D3 SAMRE | Stopping distribution | Round 1 | 13.0% of 1,200 evaluations stop | Budgeted stopping | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Stopping distribution | Round 2 | 45.0% stop (58% cumulative by R2) | -- | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Stopping distribution | Round 3 | 24.8% stop | -- | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Mean rounds | All | 2.71 out of max 5 | -- | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Forced continuation | Rounds 3-5 | Changes verdict in only 6% | Primarily ties | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Token reduction | Stopping vs. fixed 5R | 40% fewer tokens | 92% accuracy maintained | https://aclanthology.org/2026.eacl-long.392/ |
| MAD | Multi-Persona, MedQA | R=2 | 0.68 | -- | https://arxiv.org/abs/2311.17371 |
| MAD | Multi-Persona, MedQA | R=3 | 0.71 | +0.03 | https://arxiv.org/abs/2311.17371 |
| MAD | Multi-Persona, MedQA | R=4 | 0.72 | +0.01 additional | https://arxiv.org/abs/2311.17371 |
| MAD | SoM 4 agents | R=2 ($3.46) | 0.73 | -- | https://arxiv.org/abs/2311.17371 |
| MAD | SoM 4 agents | R=3 ($5.19) | 0.72 | -0.01 at +50% cost | https://arxiv.org/abs/2311.17371 |
| Nature adversarial | Accuracy under attack | R=1 | ~0.5 | Baseline | https://www.nature.com/articles/s41598-026-42705-7 |
| Nature adversarial | Accuracy under attack | R=3 | <0.2 | Steep decline | https://www.nature.com/articles/s41598-026-42705-7 |
| Nature adversarial | Accuracy under attack | R=9 | ~0.1 | Near zero | https://www.nature.com/articles/s41598-026-42705-7 |
| CPH | Optimal range | All | 2-4 rounds | Literature consensus | https://arxiv.org/abs/2603.27539 |
| CPH | Practical budget | All | 2-3 rounds | Section 6.1 | https://arxiv.org/abs/2603.27539 |
| D3 SAMRE | Coding tasks | Score gap | Peak ~20 pts by R2, plateau | -- | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Reasoning/ethics | Score gap | 8-11 pts, steady R3-R5 | Moderate tasks | https://aclanthology.org/2026.eacl-long.392/ |
| D3 SAMRE | Writing/math | Score gap | 0-6 pts, early plateau | Easy to discriminate | https://aclanthology.org/2026.eacl-long.392/ |

### B3. Evidence Against or Complicating the Thesis

| Study | Claim | Key Number | Implication | Source URL |
|-------|-------|-----------|-------------|-----------|
| Why MAS Fail (MAST) | FC1 Specification failures | 41.77% of all failures | Non-synthesis failures dominate | https://arxiv.org/abs/2503.13657 |
| Why MAS Fail (MAST) | FC2 Inter-agent misalignment | 36.94% | Second-largest category | https://arxiv.org/abs/2503.13657 |
| Why MAS Fail (MAST) | FC3 Task verification | 21.30% | Synthesis layer category is smallest | https://arxiv.org/abs/2503.13657 |
| Why MAS Fail (MAST) | ChatDev correctness | 33.33% on ProgramDev | MAS gains often minimal vs. baselines | https://arxiv.org/abs/2503.13657 |
| MAD Benchmark | Self-Consistency vs. MAD | SC: 0.78 MMLU vs. SoM: 0.73 | Default MAD loses to non-debate baseline | https://arxiv.org/abs/2311.17371 |
| MAD Benchmark | Agreement modulation reversal | Multi-Persona: worst to best (+15 pp) | Hyperparameter sensitivity is primary issue | https://arxiv.org/abs/2311.17371 |
| Decision Protocols | Knowledge vs. logic task difference | MuSR: Judge 59.3% vs. Consensus 27.8% | 31.5 pp reversal by task type | https://gipplab.uni-goettingen.de/wp-content/papercite-data/pdf/kaesberg2024.pdf |
| Decision Protocols | Information access variation | All within 1 std dev | Enriching decision step has minimal impact | https://gipplab.uni-goettingen.de/wp-content/papercite-data/pdf/kaesberg2024.pdf |
| DebUnc | Practical confidence gain | +2 pp (Entropy, practical) | Gap to oracle (+14 pp) is very large | https://arxiv.org/abs/2407.06426 |
| DebUnc | Confidence AUROC | 0.639 (Entropy) | Only marginally better than chance (0.5) | https://arxiv.org/abs/2407.06426 |
| Representational collapse | Run-to-run variance | 1-3 points | Many reported protocol differences within noise | https://arxiv.org/abs/2604.03809 |
| Stop Overvaluing MAD | Heterogeneity claim | Universal antidote | Contests synthesis primacy thesis | https://arxiv.org/abs/2502.08788 |

### B4. Confidence-Weighted Synthesis Data (Section 7)

| Study | Metric | Standard Debate | Oracle Confidence | Best Practical | Source URL |
|-------|--------|----------------|------------------|----------------|-----------|
| DebUnc (Mistral-7B) | Avg accuracy | 0.53 | 0.67 (+0.14) | 0.55 (+0.02) | https://arxiv.org/abs/2407.06426 |
| DebUnc (Mistral-7B) | Arithmetic accuracy | 0.48 | 0.73 (+0.25) | 0.52 (+0.04) | https://arxiv.org/abs/2407.06426 |
| DebUnc (Llama-3-8B) | Avg accuracy | 0.63 | 0.73 (+0.10) | 0.64 (+0.01) | https://arxiv.org/abs/2407.06426 |
| DebUnc | Entropy AUROC | -- | 1.0 (oracle) | 0.639 (practical) | https://arxiv.org/abs/2407.06426 |
| DebUnc | TokenSAR AUROC | -- | 1.0 (oracle) | 0.617 (practical) | https://arxiv.org/abs/2407.06426 |
| DebUnc | Attn-All slope vs. AUROC | 0.59 | -- | -- | https://arxiv.org/abs/2407.06426 |
| DebUnc | Prompt slope vs. AUROC | 0.17 | -- | -- | https://arxiv.org/abs/2407.06426 |
| MedARC (PubMedQA) | Zero-shot to full MedARC | 72.9% | -- | 77.2% (+4.3 pp) | https://pubmed.ncbi.nlm.nih.gov/41109093/ |
| MedARC | Module independence | -- | Both modules independent | -- | https://pubmed.ncbi.nlm.nih.gov/41109093/ |

### B5. D3 Full Benchmark Results

| Framework | MT-Bench Acc. | MT-Bench kappa | AlignBench Acc. | AlignBench kappa | AUTO-J Acc. | AUTO-J kappa |
|-----------|--------------|----------------|-----------------|-----------------|-------------|-------------|
| Single Judge | 72.5% | 0.45 | 68.0% | 0.42 | 70.3% | 0.44 |
| ChatEval | 78.2% | 0.52 | 75.1% | 0.49 | 76.5% | 0.51 |
| PRD | 76.8% | 0.50 | 74.3% | 0.48 | 75.8% | 0.50 |
| PandaLM | 75.5% | 0.49 | 73.0% | 0.46 | 74.1% | 0.48 |
| D3-MORE | 85.1% | 0.58 | 82.3% | 0.55 | 83.9% | 0.57 |
| D3-SAMRE | 86.3% | 0.60 | 83.5% | 0.57 | 85.2% | 0.59 |

Source: Harrasse et al. (2026), Table 1. https://aclanthology.org/2026.eacl-long.392/

### B6. PartnerMAS Full Role-Swap Ablation

| PA Model | SA Model | SPA (Aggregator) | Match Rate | 95% CI |
|----------|----------|-----------------|-----------|--------|
| gpt-4o-mini | gpt-4o-mini | gpt-4o-mini | 64.40% | +/- 5.77 |
| gpt-4.1-mini | gpt-4o-mini | gpt-4o-mini | 63.21% | +/- 5.81 |
| gpt-5-nano | gpt-4o-mini | gpt-4o-mini | 65.19% | +/- 5.89 |
| gpt-4o-mini | gpt-4.1-mini | gpt-4o-mini | 62.14% | +/- 6.03 |
| gpt-4o-mini | gpt-5-nano | gpt-4o-mini | 64.70% | +/- 5.48 |
| gpt-4o-mini | gpt-4o-mini | gpt-4.1-mini | **69.03%** | +/- 5.94 |
| gpt-4o-mini | gpt-4o-mini | gpt-5-nano | 64.70% | +/- 5.48 |

Key: PA = Planner Agent, SA = Specialized Agent, SPA = Supervisor/Aggregator Agent
Source: Li et al. (2025), Table 1. https://arxiv.org/abs/2509.24046

### B7. MAD Benchmark Full Performance Table

Format: Best score (Median in parentheses). Scale 0-1.0. All scores using GPT-3.5-turbo.

| System | MedQA | PubMedQA | MMLU | CosmosQA | CIAR | GPQA | Chess |
|--------|-------|---------|------|---------|------|------|-------|
| Medprompt | 0.65 (0.63) | 0.77 (0.77) | 0.74 (0.73) | 0.48 (0.47) | 0.54 (0.50) | 0.27 (0.25) | 0.32 (0.30) |
| Society of Mind | 0.64 (0.61) | 0.74 (0.71) | 0.73 (0.70) | 0.44 (0.39) | 0.56 (0.46) | 0.27 (0.25) | 0.26 (0.25) |
| Ensemble Refinement | 0.64 (0.60) | 0.74 (0.72) | 0.76 (0.74) | 0.45 (0.40) | 0.48 (0.46) | 0.32 (0.26) | 0.32 (0.25) |
| ChatEval | 0.60 (0.60) | 0.75 (0.73) | 0.71 (0.69) | 0.45 (0.43) | 0.48 (0.43) | 0.26 (0.25) | 0.32 (0.23) |
| Self-Consistency | 0.60 (0.60) | 0.74 (0.72) | 0.78 (0.75) | 0.46 (0.46) | 0.56 (0.52) | 0.24 (0.29) | 0.27 (0.21) |
| Single Agent | 0.60 (0.59) | 0.75 (0.70) | 0.76 (0.72) | 0.45 (0.43) | 0.50 (0.50) | 0.33 (0.28) | 0.27 (0.18) |
| Multi-Persona | 0.58 (0.57) | 0.70 (0.69) | 0.72 (0.69) | 0.46 (0.42) | 0.52 (0.50) | 0.29 (0.29) | 0.33 (0.29) |

Source: Smit et al. (2024), Table 2. https://arxiv.org/abs/2311.17371

### B8. Decision Protocols Full Data (Llama 3 8B -- Knowledge Tasks)

| Protocol | MMLU | MMLU-Pro | GPQA | SQuAD (F1) | StrategyQA | MuSR | Math-lvl-5 |
|----------|------|---------|------|-----------|-----------|------|-----------|
| Baseline | 44.8+/-1.9 | 28.5+/-1.3 | 28.9+/-1.2 | 52.3+/-2.2 | 51.7+/-2.5 | 25.8+/-1.6 | 8.8+/-1.3 |
| Baseline + CoT | 53.1+/-4.6 | 32.2+/-4.7 | 29.9+/-0.6 | 53.8+/-1.2 | 55.5+/-1.3 | 29.5+/-2.6 | 8.7+/-0.6 |
| Simple Voting | 53.3+/-1.8 | 32.0+/-2.7 | 30.5+/-0.9 | 56.2+/-0.5 | 58.5+/-0.9 | 55.2+/-1.5 | 9.5+/-1.7 |
| Ranked Voting | 49.2+/-1.5 | 33.1+/-4.6 | 27.3+/-3.9 | 58.0+/-0.8 | 56.2+/-3.4 | 52.5+/-0.0 | 6.8+/-1.3 |
| Cumulative Voting | 52.6+/-4.0 | 28.3+/-3.1 | 31.3+/-2.8 | 55.8+/-3.4 | 61.2+/-1.6 | 56.8+/-4.2 | 9.0+/-1.5 |
| Judge | 53.7+/-4.7 | 33.5+/-0.8 | 27.6+/-2.3 | 57.2+/-0.6 | 53.7+/-2.0 | 59.3+/-1.0 | 9.2+/-3.0 |
| Majority Consensus | 53.2+/-2.5 | 36.4+/-2.1 | 32.3+/-2.9 | 43.1+/-2.1 | 59.9+/-0.1 | 27.8+/-2.5 | 9.2+/-3.0 |
| Unanimity Consensus | 54.2+/-1.0 | 36.3+/-0.4 | 30.0+/-2.3 | 43.4+/-2.0 | 58.8+/-2.6 | 28.2+/-2.8 | 10.8+/-1.6 |

Source: Kaesberg (2024), Table 2. https://gipplab.uni-goettingen.de/wp-content/papercite-data/pdf/kaesberg2024.pdf

### B9. Self-MoA Quality vs. Diversity Regression

| Dataset | Alpha (Quality) | Alpha p-value | Beta (Diversity) | Beta p-value | R-squared |
|---------|----------------|--------------|-----------------|-------------|----------|
| MMLU | 2.558 +/- 0.176 | <0.001 | 1.841 +/- 0.176 | <0.001 | 0.771 |
| CRUX | 4.548 +/- 0.459 | <0.001 | 1.421 +/- 0.459 | <0.001 | 0.685 |
| MATH | 4.719 +/- 0.416 | <0.001 | 2.839 +/- 0.416 | <0.001 | 0.760 |

Source: Li et al. (2025), Table 4. https://arxiv.org/abs/2502.00674

### B10. Council Mode Full Ablation

| Configuration | HaluEval Avg | Truthful | Quality | Latency |
|--------------|-------------|----------|---------|---------|
| Full Council (3 experts + synthesis) | 10.7% | 82.6% | 91.7% | 8.4s |
| Without triage (always full council) | 10.7% | 82.6% | 91.7% | 12.1s |
| Without structured synthesis (majority vote) | 14.2% | 77.3% | 85.4% | 6.8s |
| 2 Experts + Synthesis | 12.8% | 79.1% | 88.2% | 7.1s |
| Same-model ensemble (3x GPT-5.4) | 15.6% | 75.8% | 83.1% | 7.9s |
| Best single model (Claude Opus 4.6) | 16.7% | 74.8% | 81.5% | 4.1s |

Source: Wu et al. (2026), Table 4. https://arxiv.org/abs/2604.02923

### B11. MAST Failure Mode Taxonomy (Full)

| Category | ID | Failure Mode | Frequency |
|----------|----|-------------|-----------|
| FC1: Specification (41.77%) | FM-1.1 | Disobey task specification | 10.98% |
| FC1: Specification | FM-1.2 | Disobey role specification | 0.50% |
| FC1: Specification | FM-1.3 | Step repetition | 17.14% |
| FC1: Specification | FM-1.4 | Loss of conversation history | 3.33% |
| FC1: Specification | FM-1.5 | Unaware of termination conditions | 9.82% |
| FC2: Inter-Agent (36.94%) | FM-2.1 | Conversation reset | 2.33% |
| FC2: Inter-Agent | FM-2.2 | Fail to ask for clarification | 11.65% |
| FC2: Inter-Agent | FM-2.3 | Task derailment | 7.15% |
| FC2: Inter-Agent | FM-2.4 | Information withholding | 1.66% |
| FC2: Inter-Agent | FM-2.5 | Ignored other agent's input | 0.17% |
| FC2: Inter-Agent | FM-2.6 | Reasoning-action mismatch | 13.98% |
| FC3: Task Verification (21.30%) | FM-3.1 | Premature termination | 7.82% |
| FC3: Task Verification | FM-3.2 | No/incomplete verification | 6.82% |
| FC3: Task Verification | FM-3.3 | Incorrect verification | 6.66% |

Inter-annotator agreement: Cohen's Kappa = 0.88. Source: Cemri et al. (2025). https://arxiv.org/abs/2503.13657

### B12. GSA vs. Alternatives -- Full Results (Llama 3 8B)

| Task | Greedy | Self-Refine | Self-Consistency | Choose-from-N | GSA | Oracle |
|------|--------|------------|-----------------|--------------|-----|--------|
| GSM8K | 82.47% | 82.99% | 86.35% | 84.99% | 86.05% | 91.74% |
| MATH | 29.28% | 30.32% | 31.68% | 31.28% | 32.46% | 47.26% |
| GPQA | 32.14% | 32.14% | 33.26% | 33.26% | 35.04% | 61.23% |
| MMLU | 63.13% | 63.53% | 65.62% | 64.48% | 65.62% | 77.08% |
| MT-bench | 7.43 | 7.30 | N/A | 7.45 | 7.53 | 8.04 |
| Alpaca | 27.55% | 24.42% | N/A | 29.14% | 29.34% | 36.14% |

Budget standardization: 4 total model calls across all methods. Source: Li et al. (2025), Table 1. https://arxiv.org/abs/2503.04104

---

## C. Cabinet Architecture Technical Details

### C1. Orchestrator (orchestrator.py)

**Turn strategy:** Two modes
- `round_robin`: Agents take turns sequentially in each round
- `parallel_initial_then_round_robin`: First round runs all agents in parallel; subsequent rounds are sequential round-robin

**Round structure:**
- Round 1 (parallel): All agents respond independently to the query
- Rounds 2-N (round-robin): Each agent sees other agents' prior responses and refines position
- SYCOPHANCY_NUDGE injected if judge flags sycophancy

**Synthesis:** Separate model call after all rounds complete. System prompt provides either:
- THE VERDICT + THE DEBATE BREAKDOWN (non-planning tasks)
- THE PLAN + WATCHPOINTS + COLLABORATION NOTES (planning tasks)

**Parameters:**
- `max_turns`: Capped at 5 (presets use 3-4)
- `response_word_limit`: 800 words (debate) / 950 words (planning)
- `temperature`: 0.7 default for agents
- Agent count: 1-5 configurable (presets use 2-4)
- `synthesis_model`: Configurable separately from agent models
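The control flow above can be condensed into a short sketch. This is a hypothetical reconstruction for illustration only: the function name `run_debate`, the nudge text, and the agent callable signature are assumptions, not the actual `orchestrator.py` API.

```python
# Illustrative sketch of the orchestrator's turn strategy; names and
# signatures are assumptions, not the real orchestrator.py interface.
SYCOPHANCY_NUDGE = "Challenge the emerging consensus rather than simply agreeing."

def run_debate(query, agents, max_turns=3, judge=None, synthesize=None):
    """agents: list of (name, respond) pairs, where
    respond(query, transcript, nudge) -> str."""
    transcript = []

    # Round 1: every agent answers independently (run in parallel in practice).
    transcript.append({name: respond(query, [], "") for name, respond in agents})

    # Rounds 2..max_turns: sequential round-robin; each agent sees all prior
    # rounds plus teammates' answers already given in the current round.
    for _ in range(1, max_turns):
        nudge = SYCOPHANCY_NUDGE if judge and judge(transcript) else ""
        current = {}
        for name, respond in agents:
            current[name] = respond(query, transcript + [current], nudge)
        transcript.append(current)

    # Synthesis: a separate model call over the full transcript.
    return synthesize(query, transcript) if synthesize else transcript
```

The key structural point the sketch captures is that synthesis is not a final debate turn but a distinct model call with its own prompt and (configurable) model.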

### C2. Judge (judge.py)

**Output format:** JSON with fields:
- `disagreement_score` (0-10)
- `evidence_score` (0-10)
- `sycophancy_flag` (boolean)
- `summary` (text)

**Model:** JUDGE_MODEL_ID at temperature 0.1

**Trigger for SYCOPHANCY_NUDGE:** When `sycophancy_flag = true`

**Information available to Umpire:** Full debate transcript including all judge annotations per round. The Umpire receives disagreement_score and evidence_score for each completed round, but the current default synthesis prompt does not explicitly instruct the Umpire to weight contributions by these scores.
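The judge's JSON schema can be validated with a few lines. The field names below follow the schema listed above; the parsing function itself is an illustration, not code from `judge.py`.

```python
import json

# Illustrative validator for the judge's JSON output; field names match the
# schema above, but this function is a sketch, not the actual judge.py code.
def parse_judge_output(raw: str) -> dict:
    data = json.loads(raw)
    for score_field in ("disagreement_score", "evidence_score"):
        score = data[score_field]
        if not isinstance(score, (int, float)) or not 0 <= score <= 10:
            raise ValueError(f"{score_field} out of range: {score!r}")
    if not isinstance(data["sycophancy_flag"], bool):
        raise ValueError("sycophancy_flag must be a boolean")
    data.setdefault("summary", "")  # summary is free text; default to empty
    return data
```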

### C3. Cabinet Presets (cabinetPresets.ts)

**Number of presets:** 8

**Agent model pool:**
- GPT-5.4
- Claude Sonnet 4.6
- Gemini 3.1 Pro
- Grok 4.20
- DeepSeek V3.2
- Qwen3-235B
- Kimi K2.5
- Perplexity Sonar

**Typical configuration:** 2-4 agents, 3-4 rounds, heterogeneous models

**Persona types:** Socratic, Adversarial, Debate, Planning (each with distinct system prompts)

**Key design intent:** Presets are designed for specific task types. Planning presets use the Planning persona with WATCHPOINTS and COLLABORATION NOTES synthesis. Research/analysis presets use Debate or Adversarial personas with VERDICT + BREAKDOWN synthesis.

---

## D. Proposed Experimental Protocol -- Full Detail

### D1. Phase 1: Implementation (Weeks 1-2)

**D1.1 Aggregation Rule Implementation**

Three Umpire system prompt variants:

**A1 -- Plain Synthesis (Current Default)**
No modification to existing synthesis prompt. The Umpire receives the full transcript and produces THE VERDICT + THE DEBATE BREAKDOWN without explicit weighting instructions.

**A2 -- Evidence-Weighted Structured Synthesis**

Modified synthesis prompt (proposed structure):

```
You are the Umpire in a multi-agent debate. You have received:
1. The original query
2. All agent debate contributions organized by round
3. Per-round judge scores: disagreement_score (0-10) and evidence_score (0-10)

INSTRUCTIONS:
- Rounds with higher evidence_score (>=7) represent higher-quality deliberation and should be weighted more heavily in your synthesis.
- Rounds with lower evidence_score (<4) represent low-quality or convergent deliberation and should be weighted less.
- Produce a four-section output:
  [CONSENSUS]: Claims all agents agreed on with high confidence
  [CONTESTED]: Claims where agents disagreed, preserving both positions
  [UNIQUE]: Claims appearing in only one agent's contribution that merit attention
  [VERDICT]: Your synthesized final answer, explicitly noting where your synthesis resolves or preserves disagreements
```

**A3 -- Majority-Position Synthesis**

Modified synthesis prompt (proposed structure):

```
You are the Umpire in a multi-agent debate. Your task is to identify the position most commonly expressed across agents and rounds, and synthesize a response that reflects that majority position. Where a clear majority position exists, favor it. Where no majority exists, report the most defensible position. Label: [VERDICT].
```

**D1.2 Instrumentation**

Log the following per session:
- Session ID
- Task stratum (1-4)
- Task prompt text
- Aggregation condition (A1/A2/A3)
- Round count condition (R1/R2/R3/R4)
- Per-round: agent responses, judge scores, sycophancy flag fires
- Synthesis model input token count
- Synthesis model output token count
- Total session latency
- Final Umpire output text
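A minimal record type covering these fields might look as follows; the class name and types are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

# Sketch of a per-session log record for the fields listed above;
# names and types are illustrative assumptions.
@dataclass
class SessionLog:
    session_id: str
    task_stratum: int                  # 1-4
    task_prompt: str
    aggregation: str                   # "A1" | "A2" | "A3"
    round_count: int                   # 1-4 (R1-R4)
    rounds: list = field(default_factory=list)  # per-round responses, judge scores, nudge fires
    synthesis_input_tokens: int = 0
    synthesis_output_tokens: int = 0
    latency_seconds: float = 0.0
    umpire_output: str = ""
```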

**D1.3 Task Prompt Set**

Target: 30 prompts per stratum x 4 strata = 120 prompts total.

Stratum 1 (Factual/Research): Examples: "What is the current state of evidence on X?", "Summarize the key arguments for and against Y." Design criteria: factual ground truth verifiable against reference sources.

Stratum 2 (Analysis/Synthesis): Examples: "Evaluate the trade-offs of approach X vs. approach Y for context Z", "What are the systemic risks of policy P?" Design criteria: complex trade-off analysis with multiple defensible positions.

Stratum 3 (Planning/Decision): Examples: "What should someone in situation X do?", "Develop a plan for achieving goal Y given constraints Z." Design criteria: multi-step planning tasks where WATCHPOINTS and COLLABORATION NOTES add value.

Stratum 4 (Subjective Evaluation): Examples: "Is the following argument convincing? [argument]", "What are the strongest objections to this proposal?" Design criteria: tasks where evaluator diversity (D3 persona effect) should be most prominent.

### D2. Phase 2: Execution (Weeks 3-6)

**D2.1 Session Generation**

Run all 120 prompts x 12 factorial conditions = 1,440 sessions.
Randomize session order to avoid time-of-day API quality effects.
Store all outputs with full instrumentation.
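The session grid and its randomization can be sketched in a few lines; the function name and dict layout are illustrative, and a fixed seed is assumed for reproducibility:

```python
import itertools
import random

def build_sessions(prompts, seed=0):
    """Cross the task prompts with the 12 factorial conditions, then shuffle.

    Randomizing run order guards against time-of-day API quality drift.
    """
    aggregation = ["A1", "A2", "A3"]
    rounds = ["R1", "R2", "R3", "R4"]
    sessions = [
        {"prompt": p, "aggregation": a, "rounds": r}
        for p, a, r in itertools.product(prompts, aggregation, rounds)
    ]
    random.Random(seed).shuffle(sessions)  # fixed seed for reproducibility
    return sessions
```

With the 120-prompt set this yields the full 1,440-session grid in a single shuffled run order.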

**D2.2 Pairwise Comparison Set Construction**

For each task prompt, construct a round-robin pairwise comparison set over the 12 conditions: C(12,2) = 66 unique pairs per prompt, so 120 prompts yield 7,920 comparison instances. This is the maximum; the study's power requirement can be met with a systematically sampled subset of roughly 2,160 comparisons (18 pairs per prompt).
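The pair counts and a systematic sampling scheme can be sketched as follows. The rotating-offset scheme (stride 7, coprime with 66, so each prompt gets 18 distinct pairs) is one illustrative way to spread coverage evenly; the study does not prescribe this particular scheme:

```python
from itertools import combinations

CONDITIONS = [f"{a}-{r}" for a in ("A1", "A2", "A3")
              for r in ("R1", "R2", "R3", "R4")]

PAIRS = list(combinations(CONDITIONS, 2))   # C(12, 2) = 66 unique pairs
FULL_SET = 120 * len(PAIRS)                 # 7,920 comparisons at maximum

def sample_comparisons(prompts, per_prompt=18, seed_stride=7):
    """Systematically sample a per-prompt subset of condition pairs.

    Rotating the starting offset by prompt index spreads coverage across
    all 66 pairs; the stride must be coprime with 66 so the per-prompt
    pairs are distinct.
    """
    sampled = []
    for i, p in enumerate(prompts):
        for j in range(per_prompt):
            sampled.append((p, PAIRS[(i + j * seed_stride) % len(PAIRS)]))
    return sampled
```

At 18 pairs per prompt this produces the ~2,160-comparison subset named in the text.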

**D2.3 Human Evaluation**

Rater panel: 7 raters per comparison pair (minimum), drawn from the IU Kelley student pool, all with research experience.

Rater task: For each pair, shown outputs A and B (randomized order):
- Which response is more factually accurate? (A / B / Tie)
- Which response is more useful overall? (A / B / Tie)
- Confidence in choice: (Low / Medium / High)

Blind protocol: Outputs stripped of formatting cues that reveal condition (no mention of round count, no "Agent 1 said... Agent 2 said..." framing, only the final synthesis output).

Position control: 20% of pairs are shown to the same rater in both orders (A-B and B-A) to measure positional consistency, targeting >90% consistency (the D3 paper's 94.8% positional consistency serves as the reference standard).
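The position-control check reduces to a consistency rate over the duplicated-order subset. A minimal sketch, with the data shape assumed (each duplicated comparison contributes two verdicts, already mapped back to the underlying condition or "tie"):

```python
def positional_consistency(judgments):
    """Fraction of duplicated-order comparisons judged consistently.

    `judgments` maps (rater, pair_id) -> list of verdicts, one from the
    A-B presentation and one from the B-A presentation. Illustrative
    data shape, not prescribed by the protocol.
    """
    consistent = total = 0
    for verdicts in judgments.values():
        if len(verdicts) == 2:          # only position-controlled pairs
            total += 1
            consistent += verdicts[0] == verdicts[1]
    return consistent / total if total else float("nan")
```

A value below the 0.90 target would flag the rater pool, not the conditions, as the dominant noise source.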

### D3. Analysis Protocol

**D3.1 Primary Analysis**

Two-way factorial ANOVA (with a blocking factor):
- DV: Elo score per condition (derived from a Bradley-Terry model fit to the pairwise comparison results)
- IV1: Aggregation Rule (3 levels: A1, A2, A3)
- IV2: Round Count (4 levels: R1, R2, R3, R4)
- Blocking variable: Task Stratum (4 levels)
- Test: main effect of A vs. main effect of R (eta-squared comparison)
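The Bradley-Terry step can be sketched with the standard minorization-maximization (MM) update, with strengths then mapped onto an Elo-like scale; function names and the tie-handling convention (split ties as half-wins upstream) are assumptions, not the study's specified implementation:

```python
import math

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of times condition i beat condition j.
    Uses the classic MM update (Hunter 2004); ties can be split as
    half-wins before calling this.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n / s for x in new]                 # normalize to mean 1
    return p

def to_elo(p, base=1000, scale=400):
    # Map strengths onto an Elo-like scale (log10-odds in units of `scale`).
    return [base + scale * math.log10(x) for x in p]
```

Each of the 12 conditions then carries one Elo score per task prompt (or per stratum), which feeds the ANOVA as the dependent variable.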

Effect size computation:
- Cohen's d: best vs. worst aggregation rule condition
- Cohen's d: best vs. worst round count condition
- P2 holds if d(A) > d(R) and 95% CIs do not overlap
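A pooled-standard-deviation Cohen's d, as used for both comparisons above, can be sketched as follows (illustrative helper, assuming per-condition score samples as plain lists):

```python
import math

def cohens_d(xs, ys):
    """Cohen's d between two score samples, pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled
```

The P2 criterion then compares d(best vs. worst aggregation rule) against d(best vs. worst round count), with bootstrap or analytic 95% CIs checked for overlap.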

**D3.2 Secondary Analyses**

Stratum-specific analysis: Run all analyses separately for each of the 4 strata and report stratum-specific effect size tables. Test whether the Aggregation Rule x Task Stratum interaction is significant (predicted to be non-significant based on the literature, but confirming the absence of an interaction is itself important).

Interaction analysis: Test Aggregation Rule x Round Count interaction. If significant, produce interaction plot and characterize: does confidence-weighted synthesis have larger effects at certain round counts?

Sycophancy subgroup analysis: Among sessions where the sycophancy flag fired (expected in an estimated 20-30% of sessions, based on typical sycophancy rates), compare round count effect sizes against sessions without a sycophancy flag. Test P8 (Section 11).

Cost-adjusted analysis: Compute quality-per-token-cost ratio for all 12 conditions. Report Pareto frontier of quality vs. cost.
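The Pareto frontier over quality and cost reduces to a dominance check across the 12 conditions. A minimal sketch, assuming each condition is summarized as a (name, quality, cost) tuple:

```python
def pareto_frontier(points):
    """Return conditions not dominated on (quality up, cost down).

    points: list of (name, quality, cost). A point is dominated if some
    other point has quality >= and cost <= it, with at least one strict.
    """
    frontier = []
    for name, q, c in points:
        dominated = any(
            (q2 >= q and c2 <= c) and (q2 > q or c2 < c)
            for _, q2, c2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

Conditions off the frontier are strictly worse buys: some other condition delivers at least as much quality for no more token cost.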

**D3.3 Robustness Checks**

Rater reliability: Compute inter-rater agreement (Fleiss' Kappa) for each comparison. Flag comparisons with Kappa < 0.4 for re-evaluation with additional raters.
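Fleiss' kappa for the 7-rater panel can be sketched directly from its definition; the ratings-matrix shape (items x categories, entries counting raters) is the standard convention, and the function name is illustrative:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings count matrix.

    ratings[i][k] = number of raters assigning item i to category k;
    every row must sum to the same rater count n.
    """
    N = len(ratings)                      # number of items
    n = sum(ratings[0])                   # raters per item
    k = len(ratings[0])                   # number of categories
    # Mean observed per-item agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)
```

Comparisons whose kappa falls below the 0.4 threshold get routed back for re-evaluation with additional raters, per the check above.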

Position consistency: Verify >90% consistency for position-controlled comparison subset.

Session-level variance: Compute within-condition variance across sessions for the same task prompt. If within-condition variance exceeds between-condition effect sizes, flag noise floor problem and increase sample size.

---

## E. Cross-Study Contradiction Matrix

The following contradictions in the literature require resolution in the P2 experiment design:

| Claim A | Claim B | Studies | Resolution |
|---------|---------|---------|-----------|
| Quality coefficient dominates diversity (Self-MoA: alpha > beta) | Heterogeneity is universal antidote (Stop Overvaluing MAD) | Li et al. 2025 vs. arXiv:2502.08788 | Reconciled by capability gap: when one model substantially outperforms others, quality dominates; when models are roughly equal, diversity adds value. P2 experiment should test whether this pattern holds for Cabinet's specific model roster. |
| Synthesis layer is primary bottleneck (CPH, Council Mode, PartnerMAS) | Specification failures dominate (Why MAS Fail: 41.77% FC1) | Nguyen 2026 / Wu 2026 / Li 2025 vs. Cemri 2025 | Not a direct contradiction: CPH concerns which design dimension determines quality ceiling; MAST concerns which dimension accounts for most failures in deployed systems. Both can be true simultaneously. |
| More agents helps quality (Council Mode: 3 > 2 experts) | More agents doesn't eliminate adversarial risk (Nature study) | Wu 2026 vs. Kraidia 2026 | Different threat models: Council Mode tests quality under honest agent conditions; Nature study tests under adversarial conditions. Cabinet's sycophancy nudge provides partial adversarial protection. |
| Information access changes at decision step have minimal impact (Kaesberg) | Confidence-weighted synthesis has +14 pp theoretical ceiling (DebUnc) | Kaesberg 2024 vs. Yoffe 2025 | Reconciled by method: Kaesberg enriches the decision prompt with confidence scores; DebUnc operates at the attention-weight level. The claim that prompt-level confidence enrichment is ineffective is consistent with DebUnc's finding that the prompt-slope (0.17) is much shallower than the attention-slope (0.59). |
| MAD does not reliably beat self-consistency (MAD benchmark) | Structured synthesis beats majority vote by 32.7% (Council Mode) | Smit 2024 vs. Wu 2026 | MAD benchmark tests default configurations with weak base models (GPT-3.5); Council Mode uses frontier models with a sophisticated synthesis mechanism. The critical variable is synthesis quality, not debate itself. |

---

*Appendix to Study #3: The Synthesis Layer Primacy Thesis. All quantitative data sourced from primary papers as documented in Section A. Zero em dashes used in this document.*
