# Appendix: Structured Multi-Agent Debate as a Consumer AI Interface

**Sparse Halo Research, April 2026**

---

## A. Full Citation Metadata

### [1] Du et al. 2023
- **Title:** Improving Factuality and Reasoning in Language Models through Multiagent Debate
- **Authors:** Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
- **Institution:** MIT CSAIL / Google DeepMind
- **Year:** 2023
- **Venue:** arXiv preprint; ICML 2024 Oral
- **arXiv ID:** 2305.14325
- **DOI:** 10.48550/arXiv.2305.14325
- **URL:** https://arxiv.org/abs/2305.14325

### [2] Yoffe et al. 2024
- **Title:** DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics
- **Authors:** Luke Yoffe, Alfonso Amayuelas, William Yang Wang
- **Institution:** University of California, Santa Barbara
- **Year:** 2024
- **Venue:** Findings of EMNLP 2025
- **arXiv ID:** 2407.06426
- **DOI:** 10.18653/v1/2025.findings-emnlp.1265
- **URL:** https://arxiv.org/abs/2407.06426

### [3] Motwani et al. 2024
- **Title:** MALT: Improving Reasoning with Multi-Agent LLM Training
- **Authors:** Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H.S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
- **Institution:** University of Oxford, Stanford University, others
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2412.01928
- **DOI:** 10.48550/arXiv.2412.01928
- **URL:** https://arxiv.org/abs/2412.01928

### [4] Hegazy 2024
- **Title:** Diversity of Thought Elicits Stronger Reasoning Capabilities in Multi-Agent Debate Frameworks
- **Authors:** Mahmood Hegazy
- **Institution:** University of Montreal / Mila Quebec AI Institute
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2410.12853
- **DOI:** 10.48550/arXiv.2410.12853
- **URL:** https://arxiv.org/abs/2410.12853

### [5] Chen et al. 2023
- **Title:** ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
- **Authors:** Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal
- **Institution:** UNC Chapel Hill
- **Year:** 2023
- **Venue:** arXiv preprint
- **arXiv ID:** 2309.13007
- **DOI:** 10.48550/arXiv.2309.13007
- **URL:** https://arxiv.org/abs/2309.13007

### [6] Smit et al. 2023
- **Title:** Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs
- **Authors:** Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, Arnu Pretorius
- **Institution:** InstaDeep
- **Year:** 2023
- **Venue:** arXiv preprint
- **arXiv ID:** 2311.17371
- **DOI:** 10.48550/arXiv.2311.17371
- **URL:** https://arxiv.org/abs/2311.17371

### [7] Zhang et al. 2025
- **Title:** If Multi-Agent Debate is the Answer, What is the Question?
- **Authors:** Hangfan Zhang, Zhiyao Cui, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu
- **Institution:** Not specified
- **Year:** 2025
- **Venue:** arXiv preprint
- **arXiv ID:** 2502.08788
- **DOI:** 10.48550/arXiv.2502.08788
- **URL:** https://arxiv.org/abs/2502.08788

### [8] Tran et al. 2025
- **Title:** Multi-Agent Collaboration Mechanisms: A Survey of LLMs
- **Authors:** Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O'Sullivan, Hoang D. Nguyen
- **Institution:** University College Cork, Ireland
- **Year:** 2025
- **Venue:** arXiv preprint
- **arXiv ID:** 2501.06322
- **DOI:** 10.48550/arXiv.2501.06322
- **URL:** https://arxiv.org/abs/2501.06322

### [9] Malmqvist 2024
- **Title:** Sycophancy in Large Language Models: Causes and Mitigations
- **Authors:** Lars Malmqvist
- **Institution:** Not specified
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2411.15287
- **DOI:** 10.48550/arXiv.2411.15287
- **URL:** https://arxiv.org/abs/2411.15287

### [10] Choi et al. 2025
- **Title:** When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
- **Authors:** Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li
- **Institution:** University of Wisconsin-Madison
- **Year:** 2025
- **Venue:** arXiv preprint
- **arXiv ID:** 2510.07517
- **DOI:** 10.48550/arXiv.2510.07517
- **URL:** https://arxiv.org/abs/2510.07517

### [11] Nguyen et al. 2025
- **Title:** Bias in Multi-Agent Systems
- **Authors:** Thien Nguyen et al.
- **Institution:** Not specified
- **Year:** 2025
- **Venue:** arXiv preprint
- **arXiv ID:** 2510.10943
- **URL:** https://arxiv.org/abs/2510.10943

### [12] Sentosa and Widianto 2025
- **Title:** MACS: A Cognitive Diversity Multi-Agent Consensus Framework for Bias Mitigation in Automated Evaluation Systems
- **Authors:** Arrival Dwi Sentosa, Julyan Widianto
- **Institution:** Bandung Institute of Technology (ITB), School of Electrical Engineering and Informatics
- **Year:** 2025
- **Venue:** 2025 International Conference on Electrical Engineering and Informatics (ICEEI)
- **URL:** https://scholar.google.com/citations?user=2A92OE0AAAAJ (author profile)

### [13] Patel 2026
- **Title:** Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
- **Authors:** Dipkumar Patel
- **Institution:** LLMs Research Inc.
- **Year:** 2026
- **Venue:** arXiv preprint
- **arXiv ID:** 2604.03809
- **DOI:** 10.48550/arXiv.2604.03809
- **URL:** https://arxiv.org/abs/2604.03809

### [14] Li et al. 2024
- **Title:** Improving Multi-Agent Debate with Sparse Communication Topology
- **Authors:** Yunxuan Li et al.
- **Institution:** Not specified
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2406.11776
- **DOI:** 10.48550/arXiv.2406.11776
- **URL:** https://arxiv.org/abs/2406.11776

### [15] Minsky 1986
- **Title:** The Society of Mind
- **Authors:** Marvin Minsky
- **Institution:** MIT
- **Year:** 1986
- **Publisher:** Simon & Schuster
- **ISBN:** 978-0671657130

### [16] Tetlock and Gardner 2015
- **Title:** Superforecasting: The Art and Science of Prediction
- **Authors:** Philip E. Tetlock, Dan Gardner
- **Institution:** University of Pennsylvania (Tetlock)
- **Year:** 2015
- **Publisher:** Crown
- **ISBN:** 978-0804136693

### [17] Kahneman, Sibony, and Sunstein 2021
- **Title:** Noise: A Flaw in Human Judgment
- **Authors:** Daniel Kahneman, Olivier Sibony, Cass R. Sunstein
- **Institution:** Princeton (Kahneman), HEC Paris (Sibony), Harvard (Sunstein)
- **Year:** 2021
- **Publisher:** Little, Brown
- **ISBN:** 978-0316451406

### [18] Chang 2024
- **Title:** SocraSynth: Multi-LLM Reasoning Platform Using Conditional Statistics
- **Authors:** Edward Y. Chang
- **Institution:** Not specified
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2402.06634
- **DOI:** 10.48550/arXiv.2402.06634
- **URL:** https://arxiv.org/abs/2402.06634

### [19] Doudkin et al. 2025
- **Title:** The Spark Effect
- **Authors:** Andrey Doudkin et al.
- **Institution:** Not specified
- **Year:** 2025
- **Venue:** arXiv preprint
- **arXiv ID:** 2510.15568
- **DOI:** 10.48550/arXiv.2510.15568
- **URL:** https://arxiv.org/abs/2510.15568

### [20] Wang et al. 2024
- **Title:** Rethinking the Bounds of LLM Reasoning: Are Multi-Agent Discussions the Key?
- **Authors:** Qineng Wang et al.
- **Institution:** Not specified
- **Year:** 2024
- **Venue:** arXiv preprint
- **arXiv ID:** 2402.18272
- **DOI:** 10.48550/arXiv.2402.18272
- **URL:** https://arxiv.org/abs/2402.18272

---

## B. Raw Competitive Audit Findings

### Consumer AI Products (April 2026)

| Product | Multi-Agent Debate | Notes | Source |
|---|---|---|---|
| **ChatGPT (OpenAI)** | No | Has memory and custom instructions; OpenAI Agents SDK is developer-only and locked to OpenAI models | Product inspection, April 2026 |
| **Claude (Anthropic)** | No | Extended thinking (visible CoT) is single-agent; Claude Agent SDK available for developers | Product inspection, April 2026 |
| **Perplexity** | No | Search-focused; no deliberation mechanism | Product inspection, April 2026 |
| **Poe (Quora)** | No (manual model switching) | Group chats for up to 200 users/models launched Nov 2025; no structured debate protocol; no synthesis agent; context loss between model switches | Product inspection, April 2026 |
| **OpenWebUI** | Possible via custom pipelines | MoA pipeline exists but requires: Git, Python, API keys, .env config, server management | https://github.com/open-webui |
| **Grok 4.20 (xAI)** | Yes (fixed roles) | First consumer product with native debate. 4 agents: Grok (Captain), Harper (Research), Benjamin (Logic), Lucas (Contrarian). 65% hallucination reduction. Fixed roles, no user config, no visible transcript by default, locked to xAI model | xAI product announcement, February 2026 |

### Developer Frameworks (April 2026)

| Framework | Requirements | Multi-Agent Debate | Model Support | Source |
|---|---|---|---|---|
| **AutoGen (Microsoft)** | Python, pip, OpenAI API key, 50+ lines of code | Yes | Multi-model | https://github.com/microsoft/autogen |
| **CrewAI** | Python, pip, API keys, YAML or code config | Yes | Multi-model | https://github.com/crewai |
| **LangGraph** | Python, langchain imports, StateGraph definitions | Yes | Multi-model | https://github.com/langchain-ai/langgraph |
| **LLM Council** | API keys for multiple providers, technical setup | Yes | Multi-model by design | Open-source, Andrej Karpathy |

### Key Competitive Gaps (Post-Grok 4.20)

1. **Configurability:** No consumer product allows users to choose debate personas, set deliberation depth, or control debate structure.
2. **Transparency:** No consumer product makes the full debate transcript visible as a core feature.
3. **Persona Library:** No consumer product offers persistent, selectable expert personas.
4. **Model Agnosticism:** Grok 4.20 is locked to xAI's model. Developer frameworks support multiple models but require coding.
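The distance between the developer frameworks above and a consumer product is, at its core, a short orchestration loop that no consumer UI yet exposes. A minimal sketch of that loop, with deterministic stand-in functions in place of real model calls (all agent names and behaviors below are illustrative, not any framework's API):

```python
# Minimal multi-agent debate loop: each round, every agent sees the
# others' previous answers and may revise; a synthesis step resolves.
# The "agents" here are deterministic stubs, not real model calls.

def debate(agents, question, rounds=2):
    """agents: dict of name -> fn(question, peer_answers) -> answer."""
    answers = {name: fn(question, {}) for name, fn in agents.items()}
    for _ in range(rounds - 1):
        answers = {
            name: fn(question, {k: v for k, v in answers.items() if k != name})
            for name, fn in agents.items()
        }
    return answers

def synthesize(answers):
    """Majority vote over final-round answers (the simplest Umpire)."""
    tally = {}
    for a in answers.values():
        tally[a] = tally.get(a, 0) + 1
    return max(tally, key=tally.get)

def conformist(question, peers):
    """Stub agent: starts at 'B', then defers to the peer majority."""
    votes = list(peers.values())
    return max(set(votes), key=votes.count) if votes else "B"

agents = {
    "captain": lambda q, p: "A",   # stubborn: always answers "A"
    "research": lambda q, p: "A",  # stubborn: always answers "A"
    "logic": conformist,
}
final = debate(agents, "toy question", rounds=2)
print(synthesize(final))  # -> A (the conformist converges in round 2)
```

The point of the sketch is how little machinery separates "requires Python, pip, and API keys" from "no code required": the loop, the transcript-sharing, and the synthesis step are each a few lines, and the consumer gap identified above is about surfacing them, not building them.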

---

## C. Framework Falsification Conditions

### Framework 1: The Persona Substitution Hypothesis

**Thesis:** Role-differentiated persona prompting of a single model is structurally sufficient to capture part of the reasoning gains demonstrated in multi-model debate, because those gains derive partly from adversarial pressure and role differentiation rather than from model diversity alone. The full diversity benefit, however, requires architecturally distinct models.

**Falsification Condition:** If a study demonstrated that role-differentiated prompts on a single model produced accuracy gains statistically indistinguishable from multi-model debate across 5 or more benchmarks, the diversity premium would be disproven. Conversely, if persona-differentiated single-model debate consistently failed to outperform single-agent baselines, the adversarial pressure component of the hypothesis would be falsified.
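The experimental contrast this condition describes reduces to prompt construction: one underlying model, several role framings. A sketch of that construction (the personas and wording below are hypothetical, not drawn from any cited study):

```python
# Build role-differentiated prompts for a single underlying model.
# Personas and wording are illustrative, not from any cited paper.

ROLES = {
    "advocate": "Argue for the strongest answer you can find.",
    "contrarian": "Attack the advocate's answer; find the weakest link.",
}

def role_prompt(role, question, transcript=""):
    """Compose one agent's prompt from its role, the question, and
    whatever the other agents have said so far."""
    parts = [f"Role: {role}. {ROLES[role]}", f"Question: {question}"]
    if transcript:
        parts.append(f"Debate so far:\n{transcript}")
    return "\n".join(parts)

p = role_prompt("contrarian", "Is 91 prime?", transcript="advocate: yes")
print(p)
```

Testing the hypothesis then amounts to holding the model fixed while swapping which `ROLES` table (one role vs. several) generates the prompts, against a multi-model condition where the roles are held fixed and the models differ.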

### Framework 2: The Synthesis Layer Primacy Thesis

**Thesis:** The quality and design of the synthesis agent (Umpire) is a stronger determinant of output quality than the number of debate rounds or the diversity of debating agents.

**Falsification Condition:** If output quality correlated more strongly with debate round count or agent diversity than with synthesis mechanism design across controlled experiments where all three variables were independently manipulated, the thesis would be falsified. Specifically, if doubling debate rounds from 2 to 4 produced larger quality gains than switching from majority vote to confidence-weighted synthesis, the thesis would not hold.
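The two synthesis mechanisms named in this condition can be made concrete. A sketch over hypothetical (answer, confidence) pairs; the confidence values are illustrative, not taken from any paper:

```python
# Two Umpire strategies over final-round (answer, confidence) pairs.
# Confidence values in the example are illustrative only.

def majority_vote(outputs):
    """Pick the most common answer, ignoring confidence."""
    tally = {}
    for answer, _ in outputs:
        tally[answer] = tally.get(answer, 0) + 1
    return max(tally, key=tally.get)

def confidence_weighted(outputs):
    """Sum each answer's confidences; pick the heaviest answer."""
    weight = {}
    for answer, conf in outputs:
        weight[answer] = weight.get(answer, 0.0) + conf
    return max(weight, key=weight.get)

# Two low-confidence agents agree; one high-confidence agent dissents.
outputs = [("X", 0.3), ("X", 0.3), ("Y", 0.9)]
print(majority_vote(outputs))        # -> X
print(confidence_weighted(outputs))  # -> Y
```

The divergence on this single input is the thesis in miniature: if the choice between these two functions moves output quality more than adding debate rounds does, the synthesis layer is where design effort belongs.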

### Framework 3: The Consumer Accessibility Gap

**Thesis:** Multi-agent deliberative AI has been academically validated since 2023 but has not reached broad consumer adoption because of an underappreciated UX design challenge, not a technical capability gap.

**Falsification Condition:** If a consumer product launched with configurable multi-agent debate (user-selectable personas, adjustable depth, visible debate process, no code required) and failed to gain adoption despite adequate distribution and marketing, the thesis that the gap is a UX problem would be falsified. This would suggest the barrier is demand-side (users do not want deliberation) rather than supply-side (products have not offered it).

---

## D. Research Proposition Statements (Verbatim)

### P1: Persona Substitution Gains (Framework 1)

"We predict that users querying Cabinet with two differentiated personas on a single model will produce responses rated 10 to 20% higher on factual accuracy by blind evaluators, compared to single-agent responses, as measured by a panel of 50 or more evaluators across 200 or more diverse queries."

### P2: Synthesis Layer Primacy (Framework 2)

"We predict that varying the Umpire's synthesis strategy (e.g., confidence-weighted integration vs. simple majority rule vs. last-agent-standing) will produce larger effect sizes on response quality than varying the number of debate rounds from 1 to 4, as measured by blind human evaluation scores on a 1-to-7 Likert scale."

### P3: Consumer Demand for Deliberation (Framework 3)

"We predict that users given access to configurable deliberation depth (choice of 1, 2, or 3 rounds) will voluntarily select multi-round debate for 40% or more of queries within 30 days, and that satisfaction scores for multi-round responses will exceed single-round responses by 15% or more, as measured by in-app ratings."

---

## E. Key Quantitative Findings by Paper

| Paper | Finding | Metric | Value | Benchmark | Notes |
|---|---|---|---|---|---|
| Du et al. [1] | Debate improves reasoning | Accuracy gain | Significant gains (not reported as a single number) | GSM-8K, MMLU, Arithmetic, Chess | Plateaus by round 3-4 |
| DebUnc [2] | Attention-based confidence improvement | Avg accuracy | 0.53 -> 0.55 (entropy), 0.67 (Oracle) | MMLU, GSM-8K, TruthfulQA, Arithmetic | Oracle ceiling shows +26.4% potential |
| DebUnc [2] | TruthfulQA improvement (Llama) | Accuracy | 0.52 -> 0.68 (Oracle) | TruthfulQA | +30.8% relative |
| DebUnc [2] | Attention slope | Slope value | Attn-All 0.59, Attn-Others 0.45, Prompt 0.17 | Cross-benchmark | Attention > text prompt for confidence |
| DebUnc [2] | AUROC for uncertainty estimation | AUROC | Entropy avg 0.639, TokenSAR avg 0.627 | Cross-benchmark | |
| MALT [3] | Training for debate roles | Relative improvement | MATH +15.66%, GSM-8K +7.42%, CSQA +9.40% | MATH, GSM-8K, CSQA | Largest gains in debate literature |
| Hegazy [4] | Diverse vs. homogeneous models | Accuracy | 91% (diverse) vs. 82% (homogeneous) | GSM-8K | 9pp gap favoring model diversity |
| Hegazy [4] | Diverse ensemble SOTA | Accuracy | 94% (ASDiv), surpassing GPT-4 and Gemini Ultra | ASDiv | |
| Hegazy [4] | MATH performance | Relative advantage | +24% over GPT-4, +14% over Gemini Ultra | MATH | |
| ReConcile [5] | Multi-model vs. homogeneous | Accuracy | 79.0% (multi) vs. 72.2% (3xChatGPT) | StrategyQA | 6.8pp gap |
| ReConcile [5] | BERTScore diversity | BERTScore similarity | 0.8739 (multi) vs. 0.9102 (homogeneous) | StrategyQA | Lower similarity = higher accuracy |
| ReConcile [5] | Confidence-weighted voting | Accuracy | 79.0% (weighted) vs. 77.1% (majority) | StrategyQA | |
| ReConcile [5] | Round-by-round accuracy | Accuracy per round | R0: 74.3%, R1: 77.0%, R2: 79.0%, R3: 78.7% | StrategyQA | Peaks at R2, declines at R3 |
| ReConcile [5] | Consensus speed | Consensus % | 100% by R3 vs. 87% for standard debate at R4 | StrategyQA | |
| Smit et al. [6] | Agreement intensity modulation | Accuracy change | ~15% improvement (worst to best) | MedQA | Implementation details dominate |
| Smit et al. [6] | Multi-Persona round data | Accuracy by round | R2: 0.68, R3: 0.71, R4: 0.72 | MedQA | Diminishing returns |
| Zhang et al. [7] | Heterogeneity as antidote | Qualitative finding | "Universal antidote" for MAD performance | 9 benchmarks, 4 models | |
| MACS [12] | Adversarial peer review | Scoring accuracy | 92.3% | Scoring tasks | |
| MACS [12] | Variance reduction | Variance reduction | 63% | Scoring tasks | |
| Patel [13] | Representational collapse | Cosine similarity | 0.888 (mean, same model, different roles) | Hidden representations | |
| Patel [13] | Effective rank | Rank | 2.17 / 3.0 | 3-agent committee | Role prompts produce collapsed diversity |
| Patel [13] | DALC protocol performance | Accuracy | 87% vs. 84% (self-consistency) at 26% lower token cost | GSM-8K | |
| Spark Effect [19] | Persona conditioning diversity | Diversity gain | +4.1 on 1-10 scale | Diversity evaluation | Gap to human experts: 1.0 point |
| Grok 4.20 | Hallucination reduction | Hallucination rate | ~12% -> ~4.2% (65% reduction) | Internal benchmarks (xAI) | First consumer implementation |
| Grok 4.20 | Marginal cost | Cost multiplier | 1.5-2.5x single pass | Internal (shared KV cache) | Not 4x due to architecture |
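Patel's collapse metric in the table above is a mean pairwise cosine similarity over agents' hidden representations, and the computation itself is simple. A sketch on toy vectors (the vectors are illustrative stand-ins, not the paper's data):

```python
# Mean pairwise cosine similarity as a representational-collapse
# diagnostic: values near 1.0 mean role prompts produced nearly
# identical representations. Toy vectors, not the paper's data.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered agent pairs."""
    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Three role-conditioned agents whose representations barely differ:
collapsed = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [0.9, 0.1, 0.1]]
# Three agents with genuinely orthogonal representations:
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(round(mean_pairwise_cosine(collapsed), 3))  # near 1.0
print(mean_pairwise_cosine(diverse))              # 0.0
```

Against this diagnostic, the reported 0.888 for same-model, different-role committees sits much closer to the collapsed case than the diverse one, which is the paper's point: role prompts alone do not buy representational diversity.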

---

## F. Additional References (Cited in Research Data but Not Primary)

| Paper | Authors | Year | arXiv ID | Key Relevance |
|---|---|---|---|---|
| When "A Helpful Assistant" Is Not Really Helpful | Zheng et al. | 2024 | 2311.10054 | Personas in system prompts do not improve objective task performance |
| Expert Personas Improve Alignment but Damage Accuracy | Hu et al. | 2026 | N/A | Persona prompting steers tone but mixed accuracy results |
| Persona is a Double-edged Sword | Kim et al. | 2024 | 2408.08631 | Negative impact of persona injection; Jekyll & Hyde ensembling proposed |
| Two Tales of Persona in LLMs | Tseng et al. | 2024 | 2406.01171 | Survey of role-playing and personalization approaches |
| Town Hall Debate Prompting | Sandwar et al. | 2025 | 2502.15725 | Multi-persona debate within single LLM |

---

*This appendix accompanies "Structured Multi-Agent Debate as a Consumer AI Interface: A Review of the Evidence and a Framework for Accessible Deliberative AI" published at sparsehalo.xyz, April 2026.*
