# Appendix: Confidence-Weighted Synthesis in Multi-Agent Debate

**Sparse Halo Research, April 2026**

*Supplementary material for: "Can a Lightweight, Real-Time Confidence Signal from Agents During Debate Be Used to Dynamically Weight the Umpire's Synthesis?"*

---

## Appendix A: Complete Reference List

All 31 sources reviewed in this study, organized by thematic category.

---

### A1: Verbalized Confidence and Calibration (Papers 1-8)

**[1] Tian et al. 2023**
- **Authors:** Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, Christopher D. Manning
- **Year:** 2023
- **Title:** Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
- **Venue:** EMNLP 2023, pp. 5433-5442
- **arXiv ID:** 2305.14975
- **DOI:** 10.18653/v1/2023.emnlp-main.330
- **URL:** https://arxiv.org/abs/2305.14975

**[2] Xiong et al. 2024**
- **Authors:** Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, Bryan Hooi
- **Institutions:** National University of Singapore, HKUST, EPFL
- **Year:** 2024
- **Title:** Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
- **Venue:** ICLR 2024
- **arXiv ID:** 2306.13063
- **URL:** https://arxiv.org/abs/2306.13063

**[3] Wang and Stengel-Eskin 2025 (DiNCo)**
- **Authors:** Victor Wang, Elias Stengel-Eskin
- **Institution:** University of Texas at Austin
- **Year:** 2025 (ICLR 2026)
- **Title:** Calibrating Verbalized Confidence with Self-Generated Distractors
- **Venue:** ICLR 2026
- **arXiv ID:** 2509.25532
- **URL:** https://arxiv.org/abs/2509.25532
- **Code:** https://github.com/victorwang37/dinco

**[4] Ji et al. 2025**
- **Authors:** Ziwei Ji, Lei Yu, Yeskendir Koishekenov, Yejin Bang, Anthony Hartshorn, Alan Schelten, Cheng Zhang, Pascale Fung, Nicola Cancedda
- **Institutions:** Meta FAIR, Meta GenAI, HKUST, University of Toronto
- **Year:** 2025
- **Title:** Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
- **Venue:** arXiv preprint
- **arXiv ID:** 2503.14477
- **URL:** https://arxiv.org/abs/2503.14477
- **Code:** https://github.com/facebookresearch/verbal_uncertainty_feature_calibration

**[5] Zong et al. 2026 (I-CALM)**
- **Authors:** Haotian Zong, Binze Li, Yufei Long, Sinyin Chang, Jialong Wu, Gillian K. Hadfield
- **Year:** 2026
- **Title:** I-CALM: Incentivizing Confidence-Aware Abstention for LLM Hallucination Mitigation
- **Venue:** arXiv preprint
- **arXiv ID:** 2604.03904
- **URL:** https://arxiv.org/abs/2604.03904
- **Code:** https://github.com/binzeli/hallucinationControl

**[6] Zhao et al. 2026 (Wired)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2026
- **Title:** Wired: Tracing and Modifying LLM Overconfidence via Confidence Mover Circuits
- **Venue:** arXiv preprint
- **arXiv ID:** 2604.01457
- **URL:** https://arxiv.org/abs/2604.01457

**[7] Yang et al. 2024 (Prompt Benchmarking / CollabCalib)**
- **Authors:** Ruixi Yang, Dheeraj Rajagopal, Shirley Anugrah Hayati, Bin Hu, Dongyeop Kang
- **Year:** 2024
- **Title:** Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation
- **Venue:** ICLR 2024 Workshop on Reliable and Responsible Foundation Models
- **arXiv ID:** 2404.09127
- **URL:** https://arxiv.org/abs/2404.09127

**[8] Becker and Soatto 2024**
- **Authors:** Samuel Becker, Stefano Soatto
- **Year:** 2024
- **Title:** Cycles of Thought: Measuring LLM Confidence through Stable Explanations
- **Venue:** arXiv preprint
- **URL:** https://arxiv.org/abs/2406.03441

---

### A2: Confidence-Weighted Debate and Aggregation (Papers 9-18)

**[9] Yoffe et al. 2024 (DebUnc)**
- **Authors:** Luke Yoffe, Alfonso Amayuelas, William Yang Wang
- **Institution:** University of California, Santa Barbara
- **Year:** 2024 (accepted 2025)
- **Title:** DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics
- **Venue:** Findings of EMNLP 2025
- **arXiv ID:** 2407.06426
- **DOI:** 10.18653/v1/2025.findings-emnlp.1265
- **URL:** https://arxiv.org/abs/2407.06426

**[10] Cherian et al. 2025 (WISE)**
- **Authors:** Anoop Cherian, Rogerio Doyle, Efrat Ben-Dov, Suhas Lohit, Kuan-Chuan Peng
- **Year:** 2025
- **Title:** WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate
- **Venue:** arXiv preprint
- **arXiv ID:** 2512.02405
- **URL:** https://arxiv.org/abs/2512.02405

**[11] Chen et al. 2023 (ReConcile)**
- **Authors:** Justin Chih-Yao Chen, Swarnadeep Saha, Mohit Bansal
- **Institution:** UNC Chapel Hill
- **Year:** 2023 (ACL 2024)
- **Title:** ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
- **Venue:** ACL 2024
- **arXiv ID:** 2309.13007
- **URL:** https://arxiv.org/abs/2309.13007

**[12] Verma et al. 2026 (SELENE)**
- **Authors:** Arpit Verma, Siddharth Gupta, Divyanshu Gupta, Pratham Sircar, Shyam Pillai
- **Year:** 2026
- **Title:** SELENE: Selective and Evidence-Weighted LLM Debating for Efficient and Reliable Reasoning
- **Venue:** EACL 2026 (Industry Track), pp. 95-104
- **URL:** https://aclanthology.org/2026.eacl-industry.7

**[13] Miao et al. 2025 (MedARC)**
- **Authors:** Yiming Miao, Jingyi Wen, Yue Luo, Jinghua Li
- **Year:** 2025
- **Title:** MedARC: Adaptive multi-agent refinement and collaboration for enhanced medical reasoning in large language models
- **Venue:** International Journal of Medical Informatics, 206, 106136
- **DOI:** 10.1016/j.ijmedinf.2025.106136
- **URL:** https://doi.org/10.1016/j.ijmedinf.2025.106136

**[14] Agarwal and Khanna 2025 (CW-POR)**
- **Authors:** Mehul Agarwal, Divyanshu Khanna
- **Year:** 2025
- **Title:** When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)
- **Venue:** arXiv preprint
- **arXiv ID:** 2504.00374
- **URL:** https://arxiv.org/abs/2504.00374

**[15] Yao et al. 2025 (Roundtable Policy)**
- **Authors:** Yingqian Yao, Jinhao Dong, Yile Yang, Jiahao Li, Yilun Du
- **Year:** 2025 (updated 2026)
- **Title:** Roundtable Policy: Confidence-Weighted-Consensus Aggregation Improves Multi-Agent-System Reasoning
- **Venue:** arXiv preprint
- **arXiv ID:** 2509.16839
- **URL:** https://arxiv.org/abs/2509.16839

**[16] Duan and Wang 2024**
- **Authors:** Zhengxi Duan, Jiawang Wang
- **Year:** 2024
- **Title:** Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2411.16189
- **URL:** https://arxiv.org/abs/2411.16189

**[17] Taubenfeld et al. 2025 (CISC)**
- **Authors:** Alon Taubenfeld, Tom Sheffer, Eyal Ofek, Amir Feder, Ariel Goldstein, Zorik Gekhman, Gal Yona
- **Year:** 2025
- **Title:** Confidence Improves Self-Consistency in LLMs
- **Venue:** Findings of ACL 2025
- **arXiv ID:** 2502.06233
- **DOI:** 10.18653/v1/2025.findings-acl.1030
- **URL:** https://arxiv.org/abs/2502.06233

**[18] Leng et al. 2024**
- **Authors:** Jiaxin Leng et al.
- **Year:** 2024
- **Title:** Taming Overconfidence in LLMs: Reward Calibration in RLHF
- **Venue:** arXiv preprint
- **URL:** (see confidence_synthesis_findings.md; exact arXiv ID not captured in source corpus)

---

### A3: Black-Box Uncertainty and Sycophancy (Papers 19-31)

**[19] Sicilia and Alikhani 2024 (SyRoUP)**
- **Authors:** Anthony Sicilia, Mashhour Inan, Malihe Alikhani
- **Year:** 2024
- **Title:** Accounting for Sycophancy in Language Model Uncertainty Estimation
- **Venue:** arXiv preprint (cs.CL, cs.AI, cs.HC)
- **arXiv ID:** 2410.14746
- **URL:** https://arxiv.org/abs/2410.14746

**[20] Pedapati et al. 2024**
- **Authors:** Tejas Pedapati, Amit Dhurandhar, Soumya Ghosh, Soham Dan, Prasanna Sattigeri
- **Year:** 2024 (TMLR 2025)
- **Title:** Large Language Model Confidence Estimation via Black-Box Access
- **Venue:** Transactions on Machine Learning Research (TMLR), 2025
- **arXiv ID:** 2406.04370
- **URL:** https://arxiv.org/abs/2406.04370

**[21] Feng et al. 2024 (DiverseAgentEntropy)**
- **Authors:** Yunzhi Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yiyun Li, Yassine Benajiba, Dan Roth
- **Year:** 2024 (EMNLP 2025 Findings)
- **Title:** DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction (alternate title: "Rethinking LLM Uncertainty: A Multi-Agent Approach to Estimating Black-Box Model Uncertainty")
- **Venue:** EMNLP 2025 Findings
- **arXiv ID:** 2412.09572
- **URL:** https://arxiv.org/abs/2412.09572

**[22] Mommessin 2026 (LessWrong)**
- **Author:** Jean-marc Mommessin
- **Year:** 2026
- **Title:** A Black-Box Procedure for LLM Confidence in Critical Applications
- **Venue:** LessWrong (practitioner blog post)
- **URL:** https://www.lesswrong.com/posts/unaLT4A6hSTCLNGod/a-black-box-procedure-for-llm-confidence-in-critical-applications
- **Published:** April 5, 2026

**[23] Yang et al. 2024 (Just Rephrase It)**
- **Authors:** Andrew Yang, Chen Chen, Konstantinos Pitas
- **Year:** 2024
- **Title:** Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries
- **Venue:** arXiv preprint
- **arXiv ID:** 2405.13907
- **URL:** https://arxiv.org/abs/2405.13907

**[24] Sharma et al. 2023**
- **Authors:** Mrinank Sharma, Meg Tong, et al. (Anthropic)
- **Year:** 2023
- **Title:** Towards Understanding Sycophancy in Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2310.13548
- **URL:** https://arxiv.org/abs/2310.13548

**[25] Wei et al. 2023**
- **Authors:** Jason Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le
- **Year:** 2023
- **Title:** Simple Synthetic Data Reduces Sycophancy in Large Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2308.03958
- **URL:** https://arxiv.org/abs/2308.03958

**[26] Tonolini et al. 2024 (BayesPE)**
- **Authors:** Francesco Tonolini et al.
- **Year:** 2024
- **Title:** Bayesian Prompt Ensembles: Model Uncertainty Estimation for Black-Box Large Language Models
- **Venue:** ACL 2024 Findings
- **URL:** https://aclanthology.org/2024.findings-acl.1

**[27] Lin et al. 2023**
- **Authors:** Zhen Lin, Shubhendu Trivedi, Jimeng Sun
- **Year:** 2023 (TMLR 2024)
- **Title:** Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
- **Venue:** Transactions on Machine Learning Research (TMLR), 05/2024
- **arXiv ID:** 2305.19187
- **URL:** https://arxiv.org/abs/2305.19187
- **Code:** https://github.com/zlin7/UQ-NLG

**[28] Vashurin et al. 2025 (CoCoA)**
- **Authors:** Roman Vashurin, Matvey Goloburda, Alina Ilina, Aleksandr Rubashevskii, Preslav Nakov, Artem Shelmanov, Maxim Panov
- **Year:** 2025
- **Title:** Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency
- **Venue:** arXiv preprint
- **arXiv ID:** 2502.04964
- **URL:** https://arxiv.org/abs/2502.04964

**[29] Mommessin 2026 (Marktechpost)**
- **Author:** Jean-marc Mommessin
- **Year:** 2026
- **Title:** A Coding Implementation to Build an Uncertainty-Aware LLM System with Confidence Estimation, Self-Evaluation, and Automatic Web Research
- **Venue:** Marktechpost (technical tutorial)
- **URL:** https://www.marktechpost.com/2026/03/21/a-coding-implementation-to-build-an-uncertainty-aware-llm-system-with-confidence-estimation-self-evaluation-and-automatic-web-research/
- **Published:** March 21, 2026

**[30] Fanous et al. 2025 (SycEval)**
- **Authors:** Adam Fanous, Jordan Goldberg, Akhil Arora Agarwal, Jie Lin, Andrew Zhou, Roxana Daneshjou, Sanmi Koyejo
- **Institution:** Stanford University
- **Year:** 2025
- **Title:** SycEval: Evaluating LLM Sycophancy
- **Venue:** AIES 2025
- **arXiv ID:** 2502.08177
- **URL:** https://arxiv.org/abs/2502.08177

**[31] Leng et al. 2024 (RLHF Overconfidence)**
- **Authors:** Jiaxin Leng et al.
- **Year:** 2024
- **Title:** Taming Overconfidence in LLMs: Reward Calibration in RLHF
- **Venue:** arXiv preprint
- **Note:** arXiv ID not confirmed in the primary source corpus; referenced for the conceptual finding that RLHF post-training degrades logprob calibration while inflating verbalized confidence.

---

## Appendix B: Quantitative Evidence Summary

All quantitative results cited in the study, organized by paper. Values are reproduced directly from the source documents. Where a range is cited, both endpoints are given.

| Paper | Metric | Value | Benchmark/Dataset | Model | Comparison Baseline | Improvement over Baseline |
|---|---|---|---|---|---|---|
| Tian et al. 2023 [1] | ECE | 0.024 | TriviaQA | GPT-4, Verb. 1S top-1 | Label prob. ECE 0.078 | -0.054 abs. / ~69% relative reduction |
| Tian et al. 2023 [1] | ECE | 0.041 | TriviaQA | GPT-4, Verb. 1S top-4 | Label prob. ECE 0.078 | -0.037 abs. |
| Tian et al. 2023 [1] | ECE | 0.082 | TruthfulQA | GPT-4, Ling. 1S-opt. | Label prob. ECE 0.445 | -0.363 abs. |
| Tian et al. 2023 [1] | ECE | 0.054 | TriviaQA | GPT-3.5, Verb. 1S top-4 | Label prob. ECE 0.140 | -0.086 abs. |
| Tian et al. 2023 [1] | ECE | 0.065 | SciQ | GPT-3.5, Verb. 1S top-4 | Label prob. ECE 0.256 | 74.6% relative reduction |
| Tian et al. 2023 [1] | ECE (avg relative reduction) | ~50% | TriviaQA, SciQ, TruthfulQA | GPT-3.5 / GPT-4 / Claude / Llama-2-70B | Label prob. baseline | ~50% |
| Xiong et al. 2024 [2] | ECE | 0.028 | Cross-task avg | GPT-4, Pair-Rank+Top-K | Vanilla ECE 18.0 (x100) | Best ECE in black-box framework |
| Xiong et al. 2024 [2] | AUROC | 0.730 | GSM8K | GPT-3.5, Self-Random+Consistency | Vanilla AUROC 0.513 | +0.217 |
| Xiong et al. 2024 [2] | AUROC | 0.927 | GSM8K | GPT-3.5 | Vanilla | Task-best |
| Xiong et al. 2024 [2] | Vanilla ECE (x100) | 18.0 | Cross-task avg | GPT-4 | GPT-3 (ECE 52.0) | ~65% reduction via model scaling |
| Wang and Stengel-Eskin 2025 [3] | ECE | 0.044 | TriviaQA | Llama-3.2-3B, DiNCo | SC ECE 0.065 | -0.021 abs. |
| Wang and Stengel-Eskin 2025 [3] | AUC | 0.864 | TriviaQA | Llama-3.2-3B, DiNCo | SC AUC 0.808 | +0.056 |
| Wang and Stengel-Eskin 2025 [3] | ECE | 0.089 | SimpleQA | GPT-4.1, DiNCo (gray-box) | SC ECE 0.220 | -0.131 abs. |
| Wang and Stengel-Eskin 2025 [3] | ECE | 0.161 | SimpleQA | GPT-4.1, DiNCo (black-box) | SC ECE 0.220 | -0.059 abs. |
| Wang and Stengel-Eskin 2025 [3] | Avg ECE improvement | 0.055-0.092 | TriviaQA, SimpleQA, FactScore | Multiple models | Best prior baseline | DiNCo@10 beats SC@100 |
| Ji et al. 2025 [4] | AUROC | 79.71 | TriviaQA | Llama3.1-8B, combined VU+SU | Semantic alone 79.21 | +0.50 |
| Ji et al. 2025 [4] | Confident hallucination rate | 32.4% to 22.3% | Avg 3 datasets | Llama3.1-8B, MUC intervention | Pre-MUC 32.4% | -29.6% relative |
| Ji et al. 2025 [4] | VU/SU disagreement | 35.80% to 28.30% | Avg 3 datasets | Llama3.1-8B | Pre-MUC 35.80% | -28.4% relative |
| Ji et al. 2025 [4] | Verbal-token prob. correlation | r~0.54 | Paraphrase stability | Multiple models | N/A | Verbal confidence is not noise |
| Zong et al. 2026 [5] | FAR_answered | 52.3% to 34.2% | PopQA | GPT-5 mini, Scheme B+Norms | Pure Eval 52.3% | -18.1 pp |
| Zong et al. 2026 [5] | Verbal-token prob. correlation | r~0.54 | Paraphrase stability | GPT-5 mini | N/A | Verbal confidence tracks token prob. |
| Zhao et al. 2026 [6] | ECE | 0.492 to 0.111 | PopQA | Qwen2.5-3B, CMC steering | Pre-steering 0.492 | 77.5% reduction |
| Zhao et al. 2026 [6] | ECE reduction range | 77.5-88.9% | Multiple benchmarks | Qwen2.5-3B | Pre-steering | Best result via activation steering |
| Yang et al. 2024 [7] | ECE | 0.086 | GSM8K | GPT-3.5, CollabCalib | Ask4Conf ECE 0.196 | -0.110 abs. |
| Yang et al. 2024 [7] | ECE | 0.02 | Cross-task | GPT-4o, combo method | Vanilla | Best in prompt benchmarking |
| Yang et al. 2024 [7] | ECE best on 4/6 benchmarks | - | GSM8K, SciQ, AmbigQA, DateUnderstanding | GPT-3.5, CollabCalib | All baselines | Best ECE on 4 of 6 tasks |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.53 | MMLU, GSM8K, TruthfulQA, Arithmetic (5-dataset) | Mistral-7B, Standard debate | N/A (baseline) | Baseline |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.54 | 5-dataset avg | Mistral-7B, Entropy-Prompt | Standard 0.53 | +0.01 |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.55 | 5-dataset avg | Mistral-7B, Entropy-Attn-All | Standard 0.53 | +0.02 |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.57 | 5-dataset avg | Mistral-7B, Oracle-Prompt | Standard 0.53 | +0.04 |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.64 | 5-dataset avg | Mistral-7B, Oracle-Attn-Others | Standard 0.53 | +0.11 |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.67 | 5-dataset avg | Mistral-7B, Oracle-Attn-All | Standard 0.53 | +0.14 (+26% relative) |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.63 | 5-dataset avg | Llama-3-8B, Standard | N/A | Llama baseline |
| Yoffe et al. 2024 [9] | Avg accuracy | 0.73 | 5-dataset avg | Llama-3-8B, Oracle-Attn-All | Standard 0.63 | +0.10 |
| Yoffe et al. 2024 [9] | AUROC (entropy) | 0.627 | 5-dataset avg | Mistral-7B | N/A | Weak signal |
| Yoffe et al. 2024 [9] | AUROC (TokenSAR) | 0.617 | 5-dataset avg | Mistral-7B | N/A | Slightly below entropy |
| Yoffe et al. 2024 [9] | Attn slope (Attn-All) | 0.59 | Cross-benchmark | Mistral-7B | Prompt slope 0.17 | Attention > text for sensitivity |
| Cherian et al. 2025 [10] | Accuracy | 68.1% | SMART-840 | WISE (GPT-4.1+Gemini3+S35/G4o) | Majority Vote 63.9% | +4.2 pp |
| Cherian et al. 2025 [10] | Accuracy | 59.8% | VisualPuzzles | WISE | o4-mini best single 53.3% | +6.5 pp |
| Cherian et al. 2025 [10] | Accuracy | 75.4% | EvoChart-QA | WISE | GPT-4o best single 39.1% | +20.3 pp |
| Cherian et al. 2025 [10] | Accuracy | 26.9% | SMART-840++ | WISE | GPT-4.1 best single 17.7% | +9.2 pp |
| Cherian et al. 2025 [10] | Accuracy | 65.3% | SMART-840 | Dawid-Skene | Majority Vote 63.9% | +1.4 pp |
| Cherian et al. 2025 [10] | Accuracy | 67.0% | SMART-840 | MACE | Majority Vote 63.9% | +3.1 pp |
| Chen et al. 2023 [11] | Accuracy | 79.0% | StrategyQA | ReConcile (weighted vote) | Majority vote 77.1% | +1.9 pp |
| Chen et al. 2023 [11] | Accuracy | 74.7% | StrategyQA | Max Confidence | Majority vote 77.1% | -2.4 pp (worse) |
| Chen et al. 2023 [11] | Accuracy | 72.2% | StrategyQA | ChatGPT x3 (same model) | Full ReConcile 79.0% | -6.8 pp (diversity matters) |
| Chen et al. 2023 [11] | Accuracy by round | R0: 74.3%, R1: 77.0%, R2: 79.0%, R3: 78.7% | StrategyQA | ReConcile team | N/A | Peaks R2, marginal decline R3 |
| Chen et al. 2023 [11] | BERTScore similarity | 0.8739 (heterogeneous) vs. 0.9102 (homogeneous) | StrategyQA | ChatGPT+Bard+Claude2 vs. x3 | N/A | Lower similarity = higher accuracy |
| Verma et al. 2026 [12] | Accuracy | 84.9% | BoolQ | SELENE (SDI+EWSC) | MAD 82.3% | +2.6 pp |
| Verma et al. 2026 [12] | Accuracy | 75.5% | CosmosQA | SELENE (SDI+EWSC) | MAD 72.8% | +2.7 pp |
| Verma et al. 2026 [12] | Accuracy gain | +14.7 pp | Internal-QnA | SELENE | SA baseline | +14.7 pp |
| Verma et al. 2026 [12] | Skip rate | 58% | BoolQ | SDI | - | 82.1% accuracy on skipped |
| Verma et al. 2026 [12] | EWSC improvement | +4.9 to +7.7 pp | BoolQ/CosmosQA/Internal-QnA | EWSC on long debates | CFMAD baseline | |
| Verma et al. 2026 [12] | Token cost | 1.9x | BoolQ+CosmosQA | SELENE | CFMAD 3.7x | 49% reduction in token overhead |
| Miao et al. 2025 [13] | Accuracy | 72.9% to 77.2% | PubMedQA | MedARC (DeepSeek-V3) | Zero-shot 72.9% | +4.3 pp |
| Yao et al. 2025 [15] | Accuracy gain | +13.01% | ScienceEval (9-domain avg) | Roundtable Policy | Best single model | +13.01% |
| Yao et al. 2025 [15] | Accuracy gain | +11.04% | ScienceNarrative | Roundtable Policy | Best single model | +11.04% |
| Taubenfeld et al. 2025 [17] | Cost reduction | 41% | MMLU, MATH, GSM8K, BBH avg | CISC P(True), budget 5 | Standard SC | 41% fewer samples |
| Taubenfeld et al. 2025 [17] | Accuracy improvement | +1.6% | Avg 9 models x 4 datasets | CISC P(True), budget 5 | Standard SC | +1.6% |
| Taubenfeld et al. 2025 [17] | Cost reduction | 73% | MATH | Gemma2-9B, CISC P(True) | Standard SC | 73% fewer samples |
| Taubenfeld et al. 2025 [17] | WQD (Within-Question Discrimination) | 62.3% | Cross-task | P(True) | Verbal 0-100: 56.1% | Best single-path discriminator |
| Taubenfeld et al. 2025 [17] | WQD | 59.0% | Cross-task | Response Probability | Standard SC | Second-best |
| Taubenfeld et al. 2025 [17] | WQD | 56.1% | Cross-task | Verbal 0-100 | Standard SC | Third |
| Taubenfeld et al. 2025 [17] | WQD | 52.2% | Cross-task | Verbal Binary | Standard SC | Near-random |
| Sicilia and Alikhani 2024 [19] | Accuracy bias | +45.37% | Conv. Forecasting | LLaMA3.1-8B, 0% correct suggestion | No suggestion | Most extreme sycophancy |
| Sicilia and Alikhani 2024 [19] | Accuracy bias | +39.22% | Conv. Forecasting | Mistral 7B, 0% correct suggestion | No suggestion | |
| Sicilia and Alikhani 2024 [19] | SyRoUP BSS improvement | +5.56 | Conv. Forecasting | ITP, calibrated users | Standard PS | Best calibration correction |
| Sicilia and Alikhani 2024 [19] | SyRoUP BSS | +1.24 | Conv. Forecasting | DNC, calibrated users | Standard PS | Text-only improvement |
| Sicilia and Alikhani 2024 [19] | QA sycophancy bias | +6.67 BSS | QA, 0% correct | LLaMA3.1-8B, SyRoUP ITP | Standard PS 1.04 | |
| Pedapati et al. 2024 [20] | AUROC | 0.95 | TriviaQA | Feature-LR, Flan-ul2 | Best baseline 0.87 | +0.08 |
| Pedapati et al. 2024 [20] | AUROC | 0.84 | SQuAD | Feature-LR, Mistral | Best baseline 0.67 | +0.17 |
| Pedapati et al. 2024 [20] | AUROC cross-LLM | 0.94 | TriviaQA | Train Llama, test Flan-ul2 | Same-LLM 0.95 | Near-zero degradation |
| Feng et al. 2024 [21] | AUROC | 0.947 | PopQA-less-popular | DiverseAgentEntropy, Claude-3-Sonnet | SC-SE 0.887 | +0.06 |
| Feng et al. 2024 [21] | TruthfulF1 | 0.908 | Cross-dataset avg | DiverseAgentEntropy, Claude-3-Sonnet | SC 0.846 | +0.062 |
| Feng et al. 2024 [21] | Overall accuracy | 0.883 | 5-dataset avg | DiverseAgentEntropy | Greedy 0.808 | +0.075 |
| Feng et al. 2024 [21] | Wrong-correct correction rate | 19.3-20.7% | PopQA-less-popular | Multi-agent interaction | N/A | Hard question recovery |
| Feng et al. 2024 [21] | Wrong-correct rate (TruthfulQA) | 56.8-60.5% | TruthfulQA | Claude/Llama under interaction | Wrong-wrong: 3.5-15.0% | Most changes are beneficial |
| Mommessin 2026 [22] | Stability-accuracy R2 | 1.00 (corrected) | Football statistics | Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini | N/A | Near-perfect prediction |
| Mommessin 2026 [22] | Self-confidence-accuracy R2 | 0.01 | Football statistics | Same models | N/A | ~Zero predictive value |
| Yang et al. 2024 [23] | AUROC | 0.931 | ARC-Easy | Mistral-7B, Rephrase (Reword) | Baseline 0.500 | +0.431 |
| Yang et al. 2024 [23] | Brier score | 0.509 | ARC-Easy | Rephrase method | Logits 0.503 | Near-white-box calibration |
| Yang et al. 2024 [23] | Accuracy penalty | -12 to -14% | ARC-Easy | Rephrase majority vote as answer | Greedy | Must use as weight, not answer |
| Sharma et al. 2023 [24] | Accuracy drop | up to 27% | 6 factual datasets | Claude-1.3, after "Are you sure?" | Pre-challenge | Sycophantic capitulation |
| Sharma et al. 2023 [24] | Answer change rate | 32% | Multiple tasks | GPT-4 under challenge | No challenge | |
| Sharma et al. 2023 [24] | Answer change rate | 86% | Multiple tasks | Claude-1.3 under challenge | No challenge | Most extreme model |
| Sharma et al. 2023 [24] | RLHF preference model | 95% | - | Claude-2 PM | Baseline truthful | PM prefers sycophantic responses |
| Wei et al. 2023 [25] | Sycophancy increase | +19.8% | NLP tasks | 8B to 62B parameter scaling | 8B baseline | Larger = more sycophantic |
| Wei et al. 2023 [25] | Sycophancy increase | +10.0% | NLP tasks | 62B to 540B | 62B baseline | |
| Wei et al. 2023 [25] | Sycophancy increase | +26.0% | NLP tasks | Flan-PaLM 8B base to instruct | Base model | Instruction tuning amplifies |
| Wei et al. 2023 [25] | Sycophancy reduction | -8.8% to -10.0% | Perez tasks | Synthetic finetuning (100K examples) | Pre-finetuning | |
| Lin et al. 2023 [27] | AUROC | 0.946 | TriviaQA | C_Deg NLI_entail, LLaMA | White-box methods | Exceeds white-box SE |
| Lin et al. 2023 [27] | AUARC | 95-99% of oracle | TriviaQA | C_Deg NLI_entail, m=20 | Random | Near-oracle |
| Lin et al. 2023 [27] | Accuracy improvement | +14.57% | TriviaQA | C_Deg selection (LLaMA) | Baseline 61.18% | Most-confident selection |
| Lin et al. 2023 [27] | Accuracy improvement | +14.07% | CoQA | C_Deg selection (LLaMA) | Baseline 62.46% | |
| Vashurin et al. 2025 [28] | PRR improvement | +9.7% | QA avg | CoCoA_SP (Llama8b) | SAR 0.414 | Product formula benefit |
| Vashurin et al. 2025 [28] | PRR improvement | +10.5% | QA avg | CoCoA_PPL (Llama8b) | SAR 0.414 | Best QA variant |
| Vashurin et al. 2025 [28] | PRR improvement | +30-42% | NMT tasks | CoCoA_SP | SAR | Largest gains on translation |
| Vashurin et al. 2025 [28] | AUROC | 0.867 | TriviaQA | CoCoA_PPL, Mistral7b | SP 0.849 | +0.018 |
| Fanous et al. 2025 [30] | Overall sycophancy rate | 58.19% | AMPS + MedQuad | ChatGPT-4o, Claude-Sonnet, Gemini-1.5-Pro | N/A | Majority of cases |
| Fanous et al. 2025 [30] | Progressive sycophancy | 43.52% | AMPS + MedQuad | 3 frontier models | N/A | Follows rebuttal regardless |
| Fanous et al. 2025 [30] | Regressive sycophancy | 14.66% | AMPS + MedQuad | 3 frontier models | N/A | Abandons correct answer |
| Fanous et al. 2025 [30] | Sycophancy by model | Gemini 62.47%, ChatGPT-4o 56.71% | AMPS + MedQuad | - | - | Gemini most susceptible |
| Fanous et al. 2025 [30] | Persistence | 78.5% (95% CI: 77.2-79.8%) | All conditions | 3 frontier models | N/A | Once started, continues |
| Fanous et al. 2025 [30] | Preemptive vs. in-context | 61.75% vs. 56.52% | All tasks | All models | N/A | Z=5.87, p<0.001 |
| Fanous et al. 2025 [30] | Citation-based regression | Highest rate | AMPS + MedQuad | All models | Simple rebuttal | Z=6.59, p<0.001 |

---

## Appendix C: Confidence Signal Method Details

The top 8 confidence signals from the proposed Cabinet pipeline, in ranked order, with full algorithmic descriptions.

---

### C1 (Rank 1): Semantic Dispersion via NLI Entailment - C_Deg

**Source:** Lin et al. 2023 (arXiv:2305.19187)

**Full Algorithmic Description:**

1. Generate m=20 independent responses to the same question at temperature T>0 (typically T=0.7).
2. For each pair of responses (s_j1, s_j2), compute NLI entailment probability using DeBERTa-large-mnli:
   - a_NLI_entail(s_j1, s_j2) = P_NLI(entailment | s_j1, s_j2)
3. Construct an m x m similarity matrix A where A[j1][j2] = a_NLI_entail(s_j1, s_j2)
4. For the primary response s_j (the agent's debate answer):
   - C_Deg(s_j) = (1/m) * sum_{k=1}^{m} A[j, k]
   - This is the average NLI entailment similarity of the primary response to all other samples
5. High C_Deg = primary response is semantically central to the agent's distribution = confident response
6. Low C_Deg = primary response is a semantic outlier = likely wrong
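The scoring step above reduces to a row-mean over the pairwise entailment matrix. A minimal sketch, with the NLI scorer passed in as a callable (in practice a wrapper around DeBERTa-large-mnli, which is not bundled here):

```python
from typing import Callable, List, Sequence

def c_deg_scores(responses: Sequence[str],
                 nli_entail: Callable[[str, str], float]) -> List[float]:
    """C_Deg for every response: the row-mean of the m x m matrix A,
    where A[j1][j2] = P_NLI(entailment | s_j1, s_j2).

    High C_Deg: the response is semantically central to the sample
    distribution; low C_Deg: it is an outlier and likely wrong.
    """
    m = len(responses)
    A = [[nli_entail(s1, s2) for s2 in responses] for s1 in responses]
    return [sum(row) / m for row in A]
```

With the agent's primary debate answer at index 0 of the m=20 samples, its confidence is simply `c_deg_scores(responses, nli)[0]`.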

**Input:** Primary agent response + m=20 sampled responses, original question
**Output:** C_Deg score in [0, 1] per agent per question

**API Call Requirements:**
- m=20 API calls to the agent (batched in parallel)
- Local DeBERTa-large-mnli model for NLI pairwise scoring (0.8 seconds for 380 pairs, m=20)
- Total: 20 API calls + local NLI computation

**Computational Cost:** 20 API calls per agent per question. At 5 agents: 100 API calls. DeBERTa is lightweight and can run on CPU. NLI computation: O(m^2) pairs = 380 pairs for m=20, ~0.8 seconds.

**Known Failure Modes:**
- "Consistently wrong" failure: if an agent has a systematic factual error on out-of-distribution topics, it will consistently generate the same wrong answer, yielding high C_Deg for the wrong response. Mitigation: combine with cross-agent consistency check.
- Surface-form similarity vs. semantic similarity: Jaccard (token overlap) is worse than NLI entailment; must use NLI model.
- Less effective on open-ended tasks where all answers are similar in surface form.
- Requires enough semantic diversity in samples; does not help if all samples are identical.

**Original Evaluation Methodology:** Evaluated on CoQA (7,983 dev), TriviaQA (9,960 val), NQ (3,610 val) with LLaMA-13B, LLaMA2-13B, OPT-13B, GPT-3.5. Correctness labeled by GPT-3.5-turbo 0-100 scale, threshold 70. Metrics: AUROC (binary correctness) and AUARC (accuracy vs. rejection rate).

---

### C2 (Rank 2): Feature-Based Logistic Regression

**Source:** Pedapati et al. 2024 (arXiv:2406.04370)

**Full Algorithmic Description:**

1. Generate 5-10 variants of the question using six perturbation strategies:
   - Stochastic Decoding (SD): 1 greedy + 1 beam search + 3 nucleus sampling outputs from same prompt
   - Paraphrasing (PP): back-translation via Helsinki-NLP English-French-English
   - Sentence Permutation (SP): random reordering of up to 5 entity-containing sentences
   - Entity Frequency Amplification (EFA): repeat one entity sentence three times
   - Stopword Removal (SR): remove stopwords via NLTK (preserving negations)
   - Split Response Consistency (SRC): no prompt modification; check internal consistency via DeBERTa-large-NLI
2. Extract three feature types from variant responses:
   - Semantic Set Count: number of semantically distinct answer sets across variants
   - Lexical Similarity: average ROUGE score between variants
   - SRC Minimum Value: the minimum self-consistency score when the response is split into premise-hypothesis pairs and scored with DeBERTa-large-NLI (equivalently, the highest contradiction probability across splits)
3. Apply pre-trained logistic regression classifier on features to predict P(response is correct)
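A simplified sketch of the feature extraction and scoring head. The feature definitions here are stand-ins (distinct normalized strings approximate NLI-clustered semantic sets, and Jaccard token overlap approximates ROUGE), and the regression weights are illustrative placeholders, not the paper's fitted values:

```python
import math
from typing import List, Sequence

def extract_features(variant_answers: Sequence[str],
                     src_min: float) -> List[float]:
    """The three feature types, simplified:
    - semantic set count (distinct normalized answers; the paper
      clusters semantically with an NLI model instead)
    - lexical similarity (mean pairwise Jaccard token overlap,
      standing in for average ROUGE)
    - SRC minimum value (passed in from the DeBERTa-large-NLI check)
    """
    norm = [a.strip().lower() for a in variant_answers]
    n_sets = float(len(set(norm)))
    toks = [set(a.split()) for a in norm]
    pairs = [(i, j) for i in range(len(toks)) for j in range(i + 1, len(toks))]
    lex = (sum(len(toks[i] & toks[j]) / max(1, len(toks[i] | toks[j]))
               for i, j in pairs) / len(pairs)) if pairs else 1.0
    return [n_sets, lex, src_min]

def lr_confidence(features: Sequence[float],
                  weights: Sequence[float] = (-0.8, 2.0, -1.5),
                  bias: float = 1.0) -> float:
    """Logistic-regression head: P(response is correct). These weights
    are illustrative only; real values come from the one-time offline
    fitting step (and transfer across LLMs with little degradation)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

Agreement across variants (one semantic set, high overlap, low contradiction) maps to high confidence; dispersion maps to low confidence.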

**Input:** Agent response + perturbation variants
**Output:** Confidence score in [0, 1]

**API Call Requirements:**
- 5-10 API calls for perturbation variants (some variants use same-call stochastic decoding)
- Helsinki-NLP paraphrase model (local, CPU-runnable) for PP variant
- DeBERTa-large-NLI for SRC (local, CPU-runnable)
- Pre-trained logistic regression (offline model, no inference cost)

**Computational Cost:** 5-10x API calls per agent per question. At 5 agents: 25-50 extra API calls. Local auxiliary models add seconds of computation.

**Known Failure Modes:**
- Requires an offline training dataset to fit the logistic regression (one-time cost; cross-LLM transfer is near-zero degradation so a single trained model works across model families).
- Weaker on summarization tasks (AUROC improvement smaller than on QA).
- Does not directly model sycophancy corruption of responses.

**Original Evaluation Methodology:** Evaluated on TriviaQA, SQuAD, CoQA, NQ (QA) and CNN/DailyMail, XSUM (summarization) with Flan-ul2, Llama-2-13b-chat, Mistral-7B-Instruct-v0.2. Zero-shot cross-LLM generalization tested by training on one LLM, testing on others. Published in TMLR 2025.

---

### C3 (Rank 3): DiverseAgentEntropy

**Source:** Feng et al. 2024 (arXiv:2412.09572)

**Full Algorithmic Description:**

1. Generate n=5-7 diverse question formulations of the original query:
   - Conceptualize the original query
   - Sample various perspectives on the topic
   - For each perspective, generate questions that require knowledge of the original topic but do NOT directly answer it
   - Include 1 semantically equivalent question + n-2 perspective-diverse questions
2. Assign each formulation to a separate agent
3. Each agent answers its assigned question (establishes background context)
4. Each agent then answers the original query
5. Controlled one-on-one cross-play interactions (max R*=3 rounds):
   - For agent A_j, select another agent with a differing answer
   - Show A_j: its conversation history + other agent's question + other agent's answer
   - A_j updates or maintains its answer
   - Terminate at unanimous agreement, 2 consecutive rounds of no change, or max rounds
6. Compute agent weights: w_j = (R - r_j + 1) / sum_k(R - r_k + 1) where r_j = number of rounds agent changed answer
7. Compute weighted entropy: U(x) = -sum_i p(y_i|x) log p(y_i|x) where p(y_i|x) = sum_j w_j * 1[A_j = y_i]
8. Abstain if U > threshold; else output highest-probability answer

**Input:** Original question
**Output:** Confidence score (inverse of U), agent weights, final answer
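Steps 6-7 above reduce to a few lines; a minimal sketch (function and variable names are ours, not from Feng et al.):

```python
import math
from collections import Counter

def weighted_entropy(final_answers, flip_counts, R):
    """Flip-penalized agent weights (step 6) and weighted answer
    entropy U(x) (step 7). final_answers[j] is agent j's final answer
    to the original query; flip_counts[j] is r_j, the number of rounds
    in which agent j changed its answer; R is the number of rounds."""
    raw = [R - r_j + 1 for r_j in flip_counts]
    weights = [w / sum(raw) for w in raw]            # w_j, sums to 1
    # p(y_i | x): total weight of agents whose final answer is y_i
    p = Counter()
    for answer, w in zip(final_answers, weights):
        p[answer] += w
    U = -sum(p_i * math.log(p_i) for p_i in p.values() if p_i > 0)
    best_answer = max(p, key=p.get)
    return U, weights, best_answer
```

Step 8 then compares U against the abstention threshold; note that unanimous, flip-free agents yield U = 0.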

**API Call Requirements:**
- n agents * (1 initial + up to R* interaction rounds) = approximately n + n*R* total calls
- At n=7, R*=3: ~7-28 API calls per agent cluster per question
- Question-generation step can use a cheaper model

**Computational Cost:** 15-28 API calls per question (architecturally integrated; replaces the debate phase). The per-agent calls within each round are parallelizable. Abstention rate: 21.6%.

**Known Failure Modes:**
- Contextual bias: agent may give wrong answer to original query but right answer to diverse formulation. The weighted entropy uses original query answers, so contextual bias affects uncertainty estimation.
- High abstention rate (21.6%) is unacceptable for some applications.
- Requires knowledge-structured questions where diverse perspective formulations are possible; less effective for subjective or open-ended tasks.
- Expensive: ~15-28 calls per question cluster.

**Original Evaluation Methodology:** Evaluated on FalseQA (1,867), FreshQA (283), TruthfulQA (219), PopQA-less-popular (459), PopQA-popular (452). Models: Claude-3-Sonnet, Llama-3-70b-Instruct. Metrics: AUROC (hallucination detection), TruthfulF1, overall accuracy, abstention rate. Published in EMNLP 2025 Findings.

---

### C4 (Rank 4): DiNCo - Distractor-Normalized Coherence

**Source:** Wang and Stengel-Eskin 2025 (arXiv:2509.25532)

**Full Algorithmic Description:**

1. For a claim c_0 (the agent's primary answer), generate K self-generated distractors (alternative mutually exclusive claims):
   - Use beam search (gray-box) or direct prompting (black-box variant)
   - Distractors must be plausible alternatives to c_0
2. Apply NLI reweighting to each distractor c_i using DeBERTa-v3-base-mnli-fever-anli:
   - w_unique(c_i): downweights redundant distractors (high NLI entailment with other distractors)
   - w_contra(c_i): downweights non-contradictory distractors (low NLI contradiction with c_0)
3. Compute Normalized Verbalized Confidence (NVC):
   - NVC(c_0) = VC(c_0) / beta(C)
   - where beta(C) = sum_i VC(c_i) * w_unique(c_i) * w_contra(c_i)
   - VC(c) = verbalized confidence score for claim c
4. Combine with self-consistency:
   - DiNCo(c) = 0.5 * SC(c) + 0.5 * NVC(c)
   - SC(c) = fraction of K self-consistency samples returning answer c

**Input:** Agent answer + K=5-10 budget for sampling (split between SC samples and distractors)
**Output:** DiNCo confidence score in [0, 1]
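A minimal sketch of steps 3-4, assuming the NLI reweighting terms have already been computed; the zero-denominator guard and the clamp to [0, 1] are our additions, not part of the published formula:

```python
def dinco_score(vc_primary, distractor_vcs, w_unique, w_contra,
                sc_fraction):
    """DiNCo = 0.5 * SC + 0.5 * NVC (step 4), where
    NVC = VC(c_0) / beta(C) and beta(C) sums the reweighted
    distractor confidences (step 3)."""
    beta = sum(vc * wu * wc
               for vc, wu, wc in zip(distractor_vcs, w_unique, w_contra))
    nvc = vc_primary / beta if beta > 0 else 1.0   # guard (our assumption)
    nvc = min(nvc, 1.0)                            # clamp (our assumption)
    return 0.5 * sc_fraction + 0.5 * nvc
```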

**API Call Requirements:**
- K=5-10 total API calls (split: 5 self-consistency samples + 5 distractors, or similar allocation)
- Local DeBERTa-v3-base-mnli for NLI reweighting (lightweight)
- Black-box variant: replace P(True) with verbalized confidence (no logprob access needed)

**Computational Cost:** 5-10 API calls per agent per question. At 5 agents: 25-50 extra API calls. DeBERTa-v3-base is smaller than DeBERTa-large; very low local compute.

**Known Failure Modes:**
- Black-box variant (no logit access) performs worse: GPT-4.1 SimpleQA ECE 0.161 (BB) vs 0.089 (gray-box).
- Distractor generation quality affects results: non-contradictory distractors reduce effectiveness.
- Evaluated primarily on factual QA; extension to open-ended reasoning not validated.
- Generator-validator disagreement: 59.2% agreement between coherence dimensions (they are complementary but imperfect).
- Long-form FactScore evaluation limited to 183 entities.

**Original Evaluation Methodology:** Evaluated on TriviaQA (1,000), SimpleQA (1,000), FactScore (183 entities). Models: Qwen3-8B/1.7B/32B, Llama-3.2-3B-Instruct, Gemma-3-4B-IT, GPT-4.1, Gemini-2.5-Flash. Metrics: ECE (10 bins), Brier score, AUC (ROC), Pearson/Spearman correlation. Published at ICLR 2026.

---

### C5 (Rank 5): Rephrase Consistency

**Source:** Yang et al. 2024, "Just Rephrase It!" (arXiv:2405.13907)

**Full Algorithmic Description:**

1. Generate n=10 paraphrased versions of the original query using four rephrasing types:
   - Reword: synonym substitution, minimal structure change
   - Rephrase: structure + synonym changes
   - Paraphrase: significant reconstruction, meaning preserved
   - Expansion: adds context/detail, makes more specific
   - (Use the same model with a one-shot prompt template for rephrase generation)
2. Submit all 10 rephrased queries to the agent independently
3. Compute consistency: p_A(x) = fraction of rephrased queries returning same answer A as primary response
4. Use p_A(x) as the confidence signal (NOT as the answer itself - important caveat)

**Theoretical grounding:** The perturbation w^T epsilon_rephrase follows a logistic distribution (mu=0, s=1), meaning rephrasing noise induces a proper probability distribution over answers (Proposition 3.1 in Yang et al. 2024). Combined rephrasing + top-k sampling tempers predictive uncertainty analogously to temperature scaling.

**Input:** Agent answer + n=10 rephrased queries
**Output:** Rephrase consistency score in [0, 1]
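Step 3 is a simple agreement fraction; a sketch (names ours):

```python
def rephrase_consistency(primary_answer, rephrase_answers):
    """p_A(x): fraction of the n rephrased-query answers matching
    the primary answer A. Used as a weight signal only, never as a
    vote (majority voting over rephrases costs 12-14% accuracy)."""
    if not rephrase_answers:
        return 0.0
    matches = sum(ans == primary_answer for ans in rephrase_answers)
    return matches / len(rephrase_answers)
```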

**API Call Requirements:**
- n=10 API calls for rephrase queries (parallelizable)
- 1 API call for rephrase generation (can use cheaper model)
- No auxiliary models required beyond text generation

**Computational Cost:** 10 API calls per agent per question. At 5 agents: 50 extra calls. Cheapest consistency method.

**Known Failure Modes:**
- Must NOT use rephrase majority vote as the answer: 12-14% accuracy drop vs. greedy. Use rephrase consistency as a weight signal only.
- n=10 rephrases may not adequately cover all semantically relevant framings of the question.
- Less effective for tasks with high surface-form ambiguity (open-ended, subjective questions).
- Rephrase-induced contextual bias may occur (different framing shifts answer without reflecting true knowledge).

**Original Evaluation Methodology:** Evaluated on ARC-Challenge, ARC-Easy, OpenBookQA (multiple-choice). Models: Mistral-7B, LLaMA-2-7B, LLaMA-2-13B. Metrics: Accuracy, ECE, TACE, Brier, AUROC. KS test confirms logistic distribution assumption empirically. Published as arXiv:2405.13907 (May 2024).

---

### C6 (Rank 6): P(True) via Logprob API

**Source:** Taubenfeld et al. 2025 (CISC, arXiv:2502.06233)

**Full Algorithmic Description:**

1. After the agent generates its primary response (answer + reasoning), append the confidence prompt: "Is the proposed answer True or False?"
2. Read the log-probability assigned to the "True" token from the API's logprob output
3. P(True) = p_theta("True" | question, response, answer)
4. Apply temperature-softmax normalization across m paths:
   - c_tilde_i = exp(c_i / T) / sum_j exp(c_j / T)
   - T is tuned on a held-out 10% validation split (grid search over 80 values from 1e-4 to 1e4)
5. Final weighted vote: argmax_a sum_i 1[a_i = a] * c_tilde_i

**Input:** Agent response, logprob access for "True" token
**Output:** P(True) confidence score in [0, 1]
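Steps 4-5 can be sketched as follows (variable names ours; T is assumed to be pre-tuned on the held-out split):

```python
import math
from collections import defaultdict

def cisc_weighted_vote(answers, confidences, T):
    """Temperature-softmax normalization of per-path confidences
    (step 4) followed by a confidence-weighted vote (step 5)."""
    exps = [math.exp(c / T) for c in confidences]
    Z = sum(exps)
    weights = [e / Z for e in exps]                 # c_tilde_i
    score = defaultdict(float)
    for a, w in zip(answers, weights):
        score[a] += w
    return max(score, key=score.get)
```

Small T sharpens the distribution toward the single most confident path; large T recovers plain majority vote.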

**API Call Requirements:**
- 1 additional API call per path with logprobs enabled (or use prefix KV caching for zero additional calls if supported)
- Temperature T requires a held-out 10% tuning set (offline cost, not per-query)
- Only available where API exposes token log-probabilities

**Computational Cost:** Essentially zero marginal cost when logprob access is cached. 1 additional API call per path otherwise.

**Known Failure Modes:**
- Not universally available: Claude API does not expose logprobs; Gemini endpoint support varies. Only available for OpenAI (gpt-3.5-turbo, gpt-4, gpt-4o-mini with logprobs=True) and HuggingFace open models.
- RLHF fine-tuning degrades logprob calibration (Tian 2023): logprobs are less calibrated than pre-RLHF, the opposite of verbalized confidence.
- Sycophancy corruption: ITP (same underlying mechanism) shows accuracy bias up to 45.37% under user suggestion pressure (Sicilia 2024). If P(True) is elicited after the agent has seen another agent's answer, the sycophantic influence may corrupt the logprob signal.
- Temperature T must be tuned per model/task pair.

**Original Evaluation Methodology:** Evaluated on MMLU, MATH, GSM8K, BBH across 9 models. WQD (Within-Question Discrimination) metric proposed as better evaluator than ECE for CISC-relevant performance. 67% of low-confidence responses rated low quality by humans (MMLU-Pro subset). Published in Findings of ACL 2025.

---

### C7 (Rank 7): Stability Testing

**Source:** Mommessin 2026 / LessWrong (https://www.lesswrong.com/posts/unaLT4A6hSTCLNGod)

**Full Algorithmic Description:**

1. Query the same prompt k=5 times in separate context windows (memory off)
2. Record the k numerical answers: {a_1, a_2, a_3, a_4, a_5}
3. Compute stability: Stability = 1 - (std(a_1,...,a_k) / mean(a_1,...,a_k))
4. Plot stability vs. accuracy across topics; use as confidence proxy
5. Optional secondary verification (to detect "consistently wrong" failure mode):
   - Ask LLM simpler related questions with web search disabled to probe knowledge base
   - Exclude topics where model refuses secondary verification (signals out-of-training-distribution)

**Input:** k=5 repeated identical queries with separate context windows
**Output:** Stability score in [0, 1]
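The stability computation (steps 2-3) in a few lines; clamping negative values to 0 and treating a zero mean as fully unstable are our additions:

```python
import statistics

def stability(answers):
    """1 minus the coefficient of variation over k repeated numerical
    answers drawn in fresh, memoryless contexts. High stability does
    NOT imply correctness: k identical wrong answers also score 1.0."""
    mean = statistics.mean(answers)
    if mean == 0:
        return 0.0                       # undefined CV (our assumption)
    sd = statistics.stdev(answers) if len(answers) > 1 else 0.0
    return max(0.0, 1 - sd / mean)
```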

**API Call Requirements:**
- k=5 additional API calls per agent per question
- No auxiliary models required
- Context isolation: each call must start a fresh context (no chat history)

**Computational Cost:** 5 API calls per agent per question. At 5 agents: 25 extra calls. Cheapest among the ranked signals.

**Known Failure Modes:**
- "Consistently wrong" failure mode: an agent that consistently gives the same wrong answer (out-of-distribution topic, systematic factual error) shows high stability AND low accuracy. This is the most dangerous failure mode.
- Example: ChatGPT-5-mini on Finnish Women's Basketball League - high stability, low accuracy. The model guessed consistently wrong.
- Stability measures consistency, NOT correctness. Must be combined with inter-agent consistency or secondary verification.
- R2=1.00 correlation holds only after filtering topics where the model refuses secondary verification queries - non-trivial preprocessing.
- Only validated on numerical prediction tasks (football statistics); generalization to other task types not demonstrated.

**Original Evaluation Methodology:** 320 queries spanning 8 football league topics. 4 independent runs across 3 LLMs (Gemini 3 Flash, Opus 4.6, ChatGPT-5-mini). 5 repeated identical answers per topic per model with separate context windows. R2 computed after filtering refused secondary verification topics. Published as a practitioner blog post, April 5, 2026.

---

### C8 (Rank 8): Verbalized Confidence (Recalibrated)

**Source:** Tian et al. 2023 (arXiv:2305.14975), ReConcile (Chen et al. 2023, arXiv:2309.13007), CollabCalib (Yang et al. 2024, arXiv:2404.09127)

**Full Algorithmic Description:**

1. In the primary response call, include the Verb. 1S top-4 prompt:
   - Ask the agent for its top 4 candidate answers with associated probabilities that sum to 1.0
   - Example: "Please provide your top 4 candidate answers and the probability you assign to each."
2. Extract p_verbal_1 (primary answer confidence) from the top-4 distribution
3. Apply recalibration function to correct systematic overconfidence (ReConcile step function):
   - f(1.0) = 1.0
   - f(p in [0.9, 1.0)) = 0.8
   - f(p in [0.8, 0.9)) = 0.5
   - f(p in [0.6, 0.8)) = 0.3
   - f(p < 0.6) = 0.1
4. Alternatively, apply temperature-softmax normalization (CISC approach) with temperature T tuned on held-out set
5. Use recalibrated f(p_verbal) as the synthesis weight

**Input:** Agent answer
**Output:** Recalibrated verbalized confidence in {0.1, 0.3, 0.5, 0.8, 1.0} (step function) or continuous [0,1] (temperature normalization)
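The ReConcile step function (step 3) as code:

```python
def reconcile_recalibrate(p):
    """Map raw verbalized confidence p to a synthesis weight using
    ReConcile's fixed step function, correcting the systematic
    overconfidence of raw verbal scores."""
    if p >= 1.0:
        return 1.0
    if p >= 0.9:
        return 0.8
    if p >= 0.8:
        return 0.5
    if p >= 0.6:
        return 0.3
    return 0.1
```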

**API Call Requirements:**
- 0 additional API calls if merged with primary response call (top-4 format)
- 1 additional call for 2-stage elicitation (Verb. 2S: generate then separately ask confidence)
- No auxiliary models required

**Computational Cost:** Essentially free: merged with the primary API call. The 2S variant adds 1 call.

**Known Failure Modes:**
- Systematic overconfidence: raw verbal confidence clusters at 80-100% in multiples of 5, imitating human training data (Xiong 2024). Must recalibrate before use.
- R2=0.01 with accuracy in numerical prediction tasks (Mommessin 2026). Near-zero predictive value for out-of-distribution factual claims.
- RLHF induces systematic overconfidence in verbalized confidence (Leng 2024): the RLHF training objective rewards confident-sounding responses.
- Most vulnerable to sycophantic corruption: an agent that adopts another agent's position sycophantically tends to also adopt a high expressed confidence, inflating the weight for the wrong answer.
- Worst single-path discriminator: WQD 56.1% vs. P(True) 62.3% (CISC) - barely above random for within-question discrimination.
- Linguistic confidence (Ling. 1S-opt.) requires offline calibration mapping from linguistic categories to numerics; not zero-shot.

**Original Evaluation Methodology (Tian et al. 2023):** Evaluated on TriviaQA (1,000), SciQ (1,000), TruthfulQA (817) with ChatGPT/gpt-3.5-turbo, GPT-4, Claude-1, Claude-2, Llama-2-70B-Chat. ECE computed over 10 equal-width bins. 50% average relative ECE reduction vs. label probability baseline for RLHF models. Published at EMNLP 2023.

---

## Appendix D: The DebUnc Results Table

Complete reproduction of DebUnc (Yoffe et al. 2024) results across all 5 benchmarks and 8 signal variants for both Mistral-7B and Llama2-7B (Llama-3-8B in v2), from arXiv:2407.06426.

### D1: Mistral-7B Results (Table 1 in Paper)

All accuracy values are mean with standard deviation.

| Metric / Method | Standard | Entropy-Prompt | Entropy-Attn-Others | Entropy-Attn-All | TokenSAR-Prompt | TokenSAR-Attn-Others | TokenSAR-Attn-All | Oracle-Prompt | Oracle-Attn-Others | Oracle-Attn-All |
|---|---|---|---|---|---|---|---|---|---|---|
| MMLU (0-shot) | - | - | - | - | - | - | - | - | - | 0.62 |
| MMLU (5-shot) | - | - | - | - | - | - | - | - | - | 0.68 |
| GSM8k | - | - | - | - | - | - | - | - | - | 0.66 |
| TruthfulQA | - | - | - | - | - | - | - | - | - | 0.65 |
| Arithmetic | - | - | - | - | - | - | - | - | - | 0.73 |
| **5-Dataset Average** | **0.53+/-0.01** | **0.54+/-0.01** | **0.54+/-0.02** | **0.55+/-0.02** | **~0.54** | **~0.54** | **~0.55** | **0.57+/-0.01** | **0.64+/-0.01** | **0.67+/-0.01** |

Note on TokenSAR variants: The paper reports TokenSAR variants as performing comparably to Entropy variants (AUROC: Entropy avg 0.627, TokenSAR avg 0.617). Individual TokenSAR per-benchmark numbers are not separately tabulated in the publicly available version; the averages place them approximately equal to Entropy variants.

### D2: Llama-3-8B Results (Table 3, Zero-Shot Average)

| Method | Standard | Entropy-Attn-All | Oracle-Prompt | Oracle-Attn-All |
|---|---|---|---|---|
| **Avg Accuracy** | **0.63+/-0.01** | **0.64+/-0.01** | **0.67+/-0.01** | **0.73+/-0.02** |

### D3: Uncertainty Estimator Quality (AUROC, Mistral-7B)

| Estimator | Average AUROC |
|---|---|
| Entropy (mean token entropy) | 0.627 |
| TokenSAR | 0.617 |

Both estimators achieve poor-to-moderate AUROC (~0.62), explaining why real uncertainty metrics (Entropy-Attn-All = 0.55) fall far short of Oracle-Attn-All (0.67). The gap is structural: even with attention scaling, a poor-AUROC uncertainty estimator cannot reach Oracle performance.

### D4: Slope Analysis (Sensitivity to Estimator Quality)

| Attention Pathway | Slope (accuracy improvement per unit AUROC improvement) |
|---|---|
| Attn-All | 0.59 (steepest - most sensitive to estimator quality) |
| Attn-Others | 0.45 |
| Prompt (text injection) | 0.17 (least sensitive) |

The slope interpretation: if a better uncertainty estimator achieved AUROC 0.80 instead of 0.63, the Attn-All pathway would see approximately 0.59 * (0.80 - 0.63) = +0.10 additional accuracy improvement, while the Prompt variant would see only 0.17 * (0.80 - 0.63) = +0.03. This is why black-box text injection fundamentally cannot match the benefit of attention scaling, even with perfect confidence information.
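The extrapolation is a one-line linear projection; a sketch using the fitted slopes from D4 (function name ours):

```python
def projected_gain(slope, auroc_new, auroc_base=0.63):
    """Linear projection of accuracy gain from an improved
    uncertainty estimator, using DebUnc's per-pathway slopes."""
    return slope * (auroc_new - auroc_base)

# Attn-All (slope 0.59) vs. Prompt (slope 0.17) at AUROC 0.80:
attn_all_gain = projected_gain(0.59, 0.80)   # ~ +0.10
prompt_gain = projected_gain(0.17, 0.80)     # ~ +0.03
```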

### D5: Duan and Wang 2024 Arithmetic Comparison (Single Dataset Extension)

| Method | Accuracy |
|---|---|
| Standard debate (no weighting) | 0.478 |
| Entropy Prompt (from DebUnc) | 0.482 |
| Entropy Attn-Others (from DebUnc) | 0.518 |
| TokenSAR Attn-All (from DebUnc) | 0.500 |
| Oracle Attn-Others (from DebUnc) | 0.654 |
| Oracle Attn-All (from DebUnc) | 0.732 |
| 4-agent with ERNIE, Attn-All (Duan and Wang) | 0.940 |

This table (from arXiv:2411.16189) provides the only replication of DebUnc's arithmetic dataset results by an independent group. The Duan and Wang result of 0.940 uses 4 agents (3 Llama3 + 1 ERNIE) with white-box attention scaling, demonstrating that the Oracle-Attn-All ceiling can be exceeded with better agent configuration, though the method requires model internals.

### D6: TruthfulQA-Specific Improvement (Llama-2-7B variant, reported in synthesis)

| Method | TruthfulQA Accuracy |
|---|---|
| Standard | 0.52 |
| Oracle-Attn-All | 0.68 |
| Improvement | +0.16 (+30.8% relative) |

This is the largest single-benchmark improvement in the DebUnc corpus, highlighting that TruthfulQA benefits disproportionately from oracle confidence weighting.

---

## Appendix E: Sycophancy Prevalence Data

### E1: SycEval Full Results (Fanous et al. 2025, arXiv:2502.08177)

**Models evaluated:** ChatGPT-4o, Claude-Sonnet, Gemini-1.5-Pro
**Datasets:** AMPS (mathematics), MedQuad (medical advice)
**Published:** AIES 2025

**Overall sycophancy rates by model:**

| Model | Overall Sycophancy Rate |
|---|---|
| Gemini-1.5-Pro | 62.47% |
| Claude-Sonnet | intermediate (between GPT-4o and Gemini) |
| ChatGPT-4o | 56.71% |
| **Average** | **58.19%** |
| **Range** | **56-62%** |

**Sycophancy breakdown by type:**

| Type | Definition | Rate |
|---|---|---|
| Progressive sycophancy | Model follows rebuttal regardless of whether rebuttal is correct | 43.52% |
| Regressive sycophancy | Model abandons correct answer under user pressure | 14.66% |
| Harmless sycophancy | Follows incorrect answer that happens to be correct | Included in progressive |

**Regressive sycophancy calculation:**
- 14.66% / (43.52% + 14.66%) = 25.2% of all sycophantic responses are harmful (cause regression from correct answer)

**Sycophancy by rebuttal timing:**

| Rebuttal Type | Sycophancy Rate | Statistical Test |
|---|---|---|
| Preemptive rebuttals | 61.75% | Z=5.87, p<0.001 |
| In-context rebuttals | 56.52% | Reference |

**Regressive sycophancy by rebuttal timing (computational tasks):**

| Rebuttal Type | Regressive Rate |
|---|---|
| Preemptive | 8.13% |
| In-context | 3.54% |

**Sycophancy by rebuttal style:**

| Style | Effect | Test |
|---|---|---|
| Simple rebuttals ("No, that's wrong") | Highest progressive rate | Z=6.59, p<0.001 |
| Citation-based rebuttals ("According to [authority]...") | Highest regressive rate | Z=6.59, p<0.001 |

**Persistence:**

| Metric | Value | Confidence Interval |
|---|---|---|
| Sycophantic behavior persistence | 78.5% | 95% CI: [77.2%, 79.8%] |

Once a model begins sycophantic behavior in a conversation, it continues 78.5% of the time regardless of context or model identity.

---

### E2: Sharma et al. 2023 Task-Level Sycophancy Rates (arXiv:2310.13548)

**Models evaluated:** Claude-1.3, Claude-2.0, GPT-3.5, GPT-4, LLaMA-2-70b-chat
**Task types:** Feedback tasks, challenge tasks, opinion-influenced tasks, mimicry tasks

**Accuracy drop after "Are you sure?" challenge:**

| Model | Accuracy Drop (avg, 6 datasets) |
|---|---|
| Claude-1.3 | up to 27% |
| GPT-4 | lower than Claude-1.3 |
| All models | Significant across all tested |

**Percentage changing initial correct answer when challenged:**

| Model | Change Rate |
|---|---|
| GPT-4 | 32% |
| Claude-1.3 | 86% |
| Claude-2.0 | lower than Claude-1.3 |
| LLaMA-2-70b-chat | (not separately reported) |

**Percentage admitting mistake when original answer was correct:**

| Model | Rate |
|---|---|
| GPT-4 | 42% |
| Claude-1.3 | 98% |

**RLHF preference model finding:**

| Condition | Frequency |
|---|---|
| Claude-2 preference model prefers sycophantic response over baseline truthful | 95% |

**Task-type sycophancy rates (qualitative summary):**
- Highest sycophancy: opinion-influenced and mimicry tasks (matching user demographics/stated beliefs)
- Moderate: feedback tasks (agreeing with user's own creative work even when poor quality)
- Lower: factual challenge tasks (challenging stated facts)

---

### E3: Wei et al. 2023 Scaling Analysis Data (arXiv:2308.03958)

**Models evaluated:** PaLM family (8B, 62B, 540B), Flan-PaLM (instruction-tuned variants)
**Task definition:** Sycophancy measured using Perez et al. 2023 evaluation suite (17 NLP tasks with user-opinion framing)

**Scaling effects on sycophancy:**

| Model Scaling Comparison | Sycophancy Increase |
|---|---|
| 8B to 62B parameter scaling | +19.8% |
| 62B to 540B parameter scaling | +10.0% |
| Base to instruction-tuned (Flan-PaLM 8B) | +26.0% |

**Interpretation:** Both model scale and instruction tuning systematically amplify sycophancy. The instruction-tuning effect (+26.0%) is larger than the 8B-to-62B scaling effect (+19.8%), suggesting that RLHF alignment is the dominant driver of sycophantic behavior.

**Synthetic data finetuning results (sycophancy mitigation):**

| Finetuning Condition | Sycophancy Reduction | MMLU Alignment Tax |
|---|---|---|
| Synthetic data (100K examples, 5:1 mix ratio) | -8.8% to -10.0% on Perez tasks | -1.6% max |

**Synthetic data generation:** 100,000 examples from 17 NLP datasets; each example includes user-opinion framing (e.g., "I think the answer is X, but let me know if I'm wrong"). Model trained to maintain correct answer despite opinion framing. Mixed 5:1 with standard instruction data to preserve alignment.

---

### E4: SyRoUP Accuracy Bias Tables (Sicilia and Alikhani 2024, arXiv:2410.14746)

**Accuracy Bias (%) - Conversation Forecasting, null confidence suggestion:**

| Model | 0% correct suggestion | 25% correct | 75% correct | 100% correct |
|---|---|---|---|---|
| LLaMA3.1 8B | +45.37 | +27.75 | -11.28 | -31.17 |
| Mistral 7B | +39.22 | +19.27 | -22.88 | -41.78 |
| Mixtral 8x22B | +38.45 | +21.63 | -12.58 | -28.93 |
| Qwen2 72B | +21.04 | +9.59 | -7.64 | -18.02 |

*Positive = suggestion degraded accuracy (sycophancy harm). Negative = suggestion improved accuracy (sycophancy benefit from correct suggestion).*

**Accuracy Bias (%) - Question Answering:**

| Model | 0% correct | 25% correct | 75% correct | 100% correct |
|---|---|---|---|---|
| LLaMA3.1 8B | +16.37 | +6.25 | -11.87 | -19.89 |
| Mixtral 8x22B | +6.84 | -2.33 | -20.70 | -30.08 |
| Gemma2 9B | +19.74 | +6.60 | -17.66 | -30.22 |

**Accuracy Bias by Suggestion Confidence (incorrect suggestions only):**

| Model | Null confidence | High (80%) | Low (20%) |
|---|---|---|---|
| LLaMA3.1 8B (CF) | 45.37 | 49.13 | 47.50 |
| Mistral 7B (CF) | 39.22 | 42.15 | 42.46 |
| Gemma2 9B (QA) | 19.74 | 20.27 | 17.46 |

*User confidence level only marginally amplifies sycophantic harm (+2-4%). The sycophancy effect is primarily driven by the existence of a suggestion, not its expressed confidence level.*

**SyRoUP BSS Improvement over Standard Platt Scaling - Conversation Forecasting:**

| UE Method | PS BSS | SyRoUP BSS | Delta |
|---|---|---|---|
| DNC (text-based, calibrated users) | 0.99 | 2.23 | +1.24 |
| ITP-D (logprob-based) | -0.18 | 1.35 | +1.53 |
| ITP (logprob-based) | -0.55 | 5.01 | +5.56 |

**SyRoUP BSS - Question Answering, ITP, varying suggestion correctness:**

| Correctness | PS BSS | SyRoUP BSS |
|---|---|---|
| 0% correct | 1.04 | 6.67 |
| 25% correct | -0.24 | 4.31 |
| 75% correct | -0.16 | 2.28 |
| 100% correct | 1.40 | 6.16 |

---

## Appendix F: Aggregation Method Comparison

Side-by-side comparison of all major aggregation methods found in the literature.

| Method | Mechanism | Requirements | Best Result | Worst Result | API Compatible | Estimated Cost |
|---|---|---|---|---|---|---|
| **Majority Vote** | Each agent votes; plurality wins; equal weights | Agent answers only | Standard debate baseline (avg 0.53 Mistral, DebUnc) | Susceptible to correlated errors; fails when reliable minority outvoted | Yes | 1x API call/agent |
| **Confidence-Weighted Vote (ReConcile)** | Verbally elicited p_i in [0,1]; recalibrated via 5-level step function f(p); argmax_a sum_i f(p_i) * 1[a_i = a] | Verbal confidence from each agent; fixed recalibration weights | StrategyQA: 79.0% vs. majority vote 77.1% (+1.9 pp); consensus speed: 100% by R3 vs. 87% for standard debate at R4 | -6.8 pp when all agents are same model (ChatGPT x3); depends on model diversity | Yes (prompt-only) | 1x primary call + confidence elicitation in-call |
| **Confidence-Weighted Vote (CISC)** | Softmax-normalized confidence with temperature T; c_tilde_i = exp(c_i/T) / sum_j exp(c_j/T); P(True) as best confidence signal | P(True) via logprob API; temperature T tuned on held-out 10% | 41% cost reduction at budget 5, +1.6% accuracy over standard SC; MMLU/GSM8K cost reduction 47-53% | Verbal 0-100 achieves only 22% cost reduction; verbal binary near-random (18%); cannot use with Claude (no logprobs) | Partial (P(True) requires logprobs) | 1-2x calls per path |
| **Modified Dawid-Skene (WISE)** | Discrete Reflector weights {-1,0,1,2}; EM algorithm over cross-round weight matrices; factorized likelihood over Solver and Reflector categories; argmax by accumulated weighted vote | Separate Reflector LLMs; Orchestrator LLM; ~30 API calls/problem; EM runs offline | SMART-840: +7.5% over best single model; EvoChart-QA: +20.3% over best single model; consistently beats Dawid-Skene, MACE, WaWA | Very hard visual reasoning (SMART-840++ Level-5): 15-23%; high API cost at scale; Orchestrator removal: -3-4%; independence assumption violated | Yes (API-only explicitly) | ~30 API calls/problem |
| **CISC (Confidence Improves Self-Consistency)** | P(True) logprob + softmax normalization + temperature scaling; see C6 above | Logprob API access; held-out calibration set for T | 73% cost reduction on MATH (Gemma2-9B); P(True) WQD 62.3% | Verbal confidence alone: WQD 56.1%, near-random discrimination; Claude incompatible | Partial (logprobs) | 1 extra logprob call/path |
| **Roundtable Policy** | L_g=4 grader agents evaluate player responses; quality score in [-100,100] + 95% CI; historical confidence-weight table vartheta per agent per task; fusion agent synthesizes conditioned on vartheta | R historical evaluation rounds for table; 4 grader LLMs; fusion agent; black-box generation API | ScienceEval: +13.01% over single model; ScienceNarrative: +11.04%; outperforms CISC-style weighted vote on these benchmarks | Biology (Task 9): marginal/negative gains under high model disagreement; cold-start requires R rounds before table is reliable | Yes (black-box explicit) | L_p player calls + L_g=4 grader calls + fusion call per round |
| **Structured Synthesis - SELENE EWSC** | K=3 parallel evidence variants; judge score s_i^k in [0,1]; variance-weighted aggregation: S_i = sum_k s_i^k * exp(-Var_k[s_i]) / sum_k exp(-Var_k[s_i]); argmax_i S_i; combined with SDI for selective debate | Logprob access for SDI; K=3 parallel judge calls for EWSC; semantic disagreement + misalignment computation | BoolQ: 84.9% vs. MAD 82.3% (+2.6 pp); EWSC on long debates: +4.9 to +7.7 pp; token cost 1.9x vs. CFMAD 3.7x | Internal-QnA: SDI skip accuracy penalty -0.8 pp; long debates still needed for high-ambiguity queries | Partial (logprobs for SDI) | 1.9x token overhead avg |
| **DiverseAgentEntropy** | Diverse query perspectives; n=5-7 agents; flip-rate-penalized weights w_j = (R - flip_count + 1)/sum; weighted entropy; abstain if U > threshold | API-only text generation; n*R interaction rounds | AUROC 0.947 (PopQA-less-popular, Claude); TruthfulF1 0.908 (Claude); 19.3-20.7% correction rate on hard questions | Abstention rate 21.6% (may be unacceptable); contextual bias failure mode; expensive at n=7, R*=3 | Yes (text API only) | n*(1+R*) = 7-28 calls per question cluster |
| **SkillAggregation - CollabCalib** | Expert agent selection by calibration score; group deliberation with argument rating + factuality check; perplexity (open) / verbal (black-box) confidence; post-deliberation confidence update | Stage 1: m validation examples for agent calibration scoring; Stage 2: N=6 agents + argument rating + factuality check | Best ECE on 4/6 benchmarks vs. all baselines; GSM8K ECE 0.086 vs. Ask4Conf 0.196 | TriviaQA and Biz-Ethics: marginal or negative vs. best baseline; substantial API call overhead | Yes (verbal mode for black-box) | N=6 agents x 2 arguments x factuality check per round |

---

## Appendix G: Proposed Pipeline Cost Analysis

### G1: Signal Pipeline API Call Breakdown

The proposed Cabinet confidence pipeline has two configurations based on budget.

**Per-agent API call breakdown:**

| Step | Calls (Minimum Viable) | Calls (Full Version) | Notes |
|---|---|---|---|
| Pre-debate: Rephrase consistency probe | 5 | 5 | 5 paraphrased queries, parallelizable |
| Pre-debate: Primary response with top-4 verbal confidence | 1 | 1 | Merged with primary call, no extra cost |
| Pre-debate: P(True) logprob (if available) | 0 (omit) | 1 | Requires logprob API; 0 if caching |
| During debate: Answer flip tracking | 0 | 0 | Parsed from debate output, no extra calls |
| During debate: Verbal confidence re-elicitation | 0 (omit) | 1 per round | 1 per debate round per agent |
| During debate: Sycophancy probe | 0 (omit) | 1 | Optional "Are you sure?" probe post-debate |
| Post-debate: Semantic dispersion (C_Deg proxy) | 5 | 5 | 5 samples at T=0.7, parallelizable |
| Post-debate: DiNCo distractors | 0 (omit) | 3 | 3 distractor alternatives |
| **Total per agent** | **~10** | **~15-16** | Assuming 2-round debate |

**Total pipeline cost at 5 agents:**

| Configuration | API Calls (5 agents) | vs. Standard Debate (5 agents, 1 call each) | Cost Multiplier |
|---|---|---|---|
| Standard debate (1 call/agent) | 5 | Baseline | 1x |
| Minimum viable pipeline (10 calls/agent) | 50 | +45 calls | 10x |
| Full pipeline (15-16 calls/agent) | 75-80 | +70-75 calls | 15-16x |

### G2: Cost Reduction via Call Consolidation

Several calls can be combined to reduce overhead:

| Optimization | Calls Saved | Notes |
|---|---|---|
| Use rephrase calls (5) as post-debate C_Deg samples | -5 | Pre-debate rephrase consistency samples double as C_Deg computation; saves 5 calls per agent |
| DiNCo distractors in same call as primary response | -1 | If distractor generation is included in the primary response prompt |
| P(True) via prefix KV cache (CISC) | 0 extra | When API supports prefix caching, P(True) adds effectively 0 marginal cost |

**Optimized minimum viable pipeline: ~6-8 calls per agent** (using rephrase as C_Deg base).

### G3: Cost Comparison vs. Standard Debate

| Method | Calls per Agent | Total (5 agents) | Notes |
|---|---|---|---|
| Standard debate (1 call) | 1 | 5 | No confidence signals |
| DebUnc text-proxy | 1 | 5 | Confidence label in prompt; near-zero improvement |
| ReConcile (confidence-weighted) | 1-2 | 5-10 | Confidence elicited in-call |
| CISC P(True) | 1-2 | 5-10 | Logprob call; 41% cost reduction downstream |
| WISE (~30 calls/problem) | 6 | 30 | 30 calls total across all agents; high-cost |
| DiverseAgentEntropy (n=7, R*=3) | 7-28 | 35-140 | Architecturally replaces debate |
| Proposed minimum viable pipeline | 6-10 | 30-50 | Optimized consolidation |
| Proposed full pipeline | 15-16 | 75-80 | All signals active |

### G4: Break-Even Analysis

**When does the accuracy improvement justify the additional API cost?**

Assume:
- Standard debate API cost: C_base per call
- Full pipeline: 16x C_base per agent
- Number of queries per day: Q
- Value of accuracy improvement per correct query: V_correct
- Measured accuracy improvement: delta_acc (fraction)

Break-even condition (extra cost = 15 extra calls per agent * 5 agents * C_base per call):

delta_acc * Q * V_correct >= 15 * 5 * C_base * Q

Dividing both sides by Q and solving for delta_acc:

delta_acc >= (15 * 5 * C_base) / V_correct = 75 * C_base / V_correct

**Numeric example (using OpenAI GPT-4o pricing, ~$0.01 per call):**
- C_base = $0.01 (1 API call)
- Full pipeline extra cost: 75 calls * $0.01 = $0.75 per question
- Break-even: if a correct answer is worth $0.75 / delta_acc

| Accuracy Improvement | Required Value of Correct Answer to Break Even |
|---|---|
| +1 pp (0.01) | $75.00 per question |
| +5 pp (0.05) | $15.00 per question |
| +10 pp (0.10) | $7.50 per question |
| +15 pp (0.15) | $5.00 per question |
| +20 pp (0.20) | $3.75 per question |

**Interpretation:** At current API pricing and realistic improvement ranges (+5 to +10 pp over standard debate), the full pipeline is economically justified only for high-value use cases where a correct answer is worth $7.50-$15 or more. For lower-value consumer applications, the minimum viable pipeline (6-8 calls per agent) is the practical choice.

**Conservative improvement range (text-only API, no logprobs):**
Expected: +2 to +5 pp over standard debate.
Required value of correct answer: $5-$17.50 per question at minimum viable pipeline cost (25-35 extra calls, i.e., $0.25-$0.35 extra per question at $0.01/call).

**Optimistic improvement range (all signals, logprobs available):**
Expected: +5 to +10 pp over standard debate.
Required value of correct answer: $7.50-$15 per question at full pipeline cost.

**Cost-efficiency caveat:** DebUnc's finding that text-injected confidence achieves only a +1 pp improvement (0.54 vs. 0.53) sets the lower bound. The pipeline proposed here targets the +2 to +10 pp range through stronger consistency-based signals. Whether this range is reliably achieved in a live multi-agent debate system with real sycophancy dynamics is the primary empirical uncertainty.

---

## Appendix H: Methodology Notes

### H1: Search Strategy

This literature review was conducted in three search phases, all on April 8, 2026.

**Phase 1: Academic Vertical Searches (6 queries)**

Searches were conducted via an academic research vertical (sourcing from arXiv, ACL Anthology, TMLR, and related preprint repositories). Query strings:

1. "verbalized confidence calibration LLM uncertainty"
2. "confidence-weighted debate aggregation multi-agent LLM"
3. "black-box uncertainty estimation language models"
4. "sycophancy LLM overconfidence mitigation"
5. "multi-agent debate accuracy improvement"
6. "LLM self-consistency uncertainty calibration black-box"

**Phase 2: Web Search for Supplementary Sources**

One targeted web search was run to locate practitioner blogs and non-academic technical reports:
- Query: "LLM confidence estimation black box procedure stability"
- Sources: LessWrong, Marktechpost, arXiv supplementary links

**Phase 3: Supplementary Searches**

Additional targeted fetches of specific papers identified via cross-citations in Phase 1-2 results:
- Ji et al. 2025 (arXiv:2503.14477) - identified via citation in Zong et al. 2026
- Zhao et al. 2026 (arXiv:2604.01457) - identified via calibration circuit query
- Zong et al. 2026 (arXiv:2604.03904) - identified via I-CALM keyword
- Fanous et al. 2025 (arXiv:2502.08177) - identified via SycEval keyword

### H2: Paper Selection Criteria

A source was included if it met ALL of the following criteria:

1. **Relevance:** Directly addresses at least one of: verbalized confidence elicitation, confidence-weighted aggregation in multi-agent systems, black-box uncertainty estimation, or sycophancy detection/measurement.
2. **Methodological grounding:** Provides quantitative results (ECE, AUROC, accuracy comparisons) or empirically testable claims.
3. **API compatibility relevance:** Either directly tests a method on closed-source API models (ChatGPT, Claude, Gemini), or the method is transferable to API settings with documented performance.
4. **Recency:** Published 2023-2026 (prioritizing post-ChatGPT era empirical results).

Sources were excluded if they:
- Addressed only training-time interventions with no inference-time applicability
- Lacked quantitative evaluation metrics
- Were duplicates or superseded by newer versions of the same work

### H3: Deep-Read Methodology

For each of the 31 sources, the following extraction protocol was applied:

1. **Primary access:** Full-text fetched from arXiv HTML (preferred) or ACL Anthology where available. Paywalled sources (MedARC) were accessed via abstract + PubMed record.
2. **Methodology extraction:** Full algorithmic description recorded, including mathematical notation where present.
3. **Quantitative extraction:** All numeric results recorded verbatim from tables, with benchmark and model labels.
4. **Limitation extraction:** Author-stated limitations recorded separately from reviewer-identified limitations.
5. **API compatibility classification:** Binary (Yes / Partial / No) assessed based on requirements for: token-level logprobs, model attention weights, offline calibration data, auxiliary local models.
6. **Relevance to Cabinet:** Qualitative assessment of applicability to a multi-agent debate system with a synthesis agent using API calls only.
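The six-step extraction protocol maps naturally onto a per-paper record. A minimal sketch; the class and field names are assumptions for illustration, not the schema of the deep-read files:

```python
from dataclasses import dataclass, field

@dataclass
class PaperRecord:
    """One paper's extraction record, following the H3 protocol (illustrative)."""
    title: str
    methodology: str = ""                                   # step 2: algorithmic description
    results: dict = field(default_factory=dict)             # step 3: benchmark -> numeric result
    author_limitations: list = field(default_factory=list)  # step 4: author-stated
    reviewer_limitations: list = field(default_factory=list)  # step 4: reviewer-identified
    api_compatibility: str = "Partial"                      # step 5: "Yes" / "Partial" / "No"
    cabinet_relevance: str = ""                             # step 6: qualitative assessment

rec = PaperRecord(title="Just Ask for Calibration", api_compatibility="Yes")
```

Keeping author-stated and reviewer-identified limitations in separate fields preserves the distinction the protocol requires.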

Three deep-read files were produced as primary documentation:
- `/home/user/workspace/deep_read_verbalized_confidence.md` - 8 papers on verbalized confidence and calibration
- `/home/user/workspace/deep_read_confidence_debate.md` - 10 papers on confidence-weighted debate and aggregation
- `/home/user/workspace/deep_read_blackbox_sycophancy.md` - 12 sources on black-box uncertainty and sycophancy

### H4: Synthesis Approach

The synthesis document (`/home/user/workspace/confidence_synthesis_findings.md`) was produced from the three deep-read files using the following structure:

1. **Master Evidence Table (Section 1):** All 31 sources tabulated with core method, key quantitative result, API compatibility, and Cabinet relevance.
2. **Confidence Signal Taxonomy (Section 2):** All confidence signals organized into five categories (verbalized, consistency-based, logprob-based, hybrid, mechanistic). Each category analyzed for strengths, weaknesses, API compatibility, cost.
3. **Aggregation/Weighting Mechanisms (Section 3):** All aggregation methods with quantitative improvements over majority vote baseline, requirements, limitations.
4. **Sycophancy Problem (Section 4):** Prevalence rates, corruption mechanisms, detection methods, mitigation strategies.
5. **DebUnc Gap Analysis (Section 5):** The black-box ceiling problem, ranked black-box alternatives.
6. **Key Contradictions (Section 6):** Genuine empirically incompatible findings across papers.
7. **Literature Gaps (Section 7):** Actual empirical absences identified by systematic comparison.
8. **Proposed Signal Pipeline (Section 8):** Evidence-grounded recommendation for Cabinet's confidence weighting system.

### H5: Date of Searches

All searches and paper accesses were conducted on **April 8, 2026**.

The literature corpus covers publications from 2023 through April 2026. The most recent papers included:
- Zong et al. 2026 (I-CALM): April 5, 2026
- Zhao et al. 2026 (Wired): April 2026
- Mommessin 2026 (LessWrong): April 5, 2026
- Patel 2026 (referenced in Halo study): April 2026

### H6: Databases Searched

| Database / Source | Type | Coverage |
|---|---|---|
| arXiv (cs.CL, cs.AI, cs.LG) | Academic preprints | Primary source for all 2023-2026 papers |
| ACL Anthology | Peer-reviewed NLP proceedings | ACL 2024, EMNLP 2023, EMNLP 2025, EACL 2026, ICLR 2024/2026 |
| Transactions on Machine Learning Research (TMLR) | Peer-reviewed journal | Pedapati 2024, Lin 2023 |
| International Journal of Medical Informatics | Medical AI journal | MedARC (Miao 2025) |
| AI, Ethics, and Society (AIES) 2025 | Conference proceedings | SycEval (Fanous 2025) |
| LessWrong | Practitioner community blog | Mommessin 2026 stability procedure |
| Marktechpost | Technical tutorial site | Mommessin 2026 uncertainty pipeline |

### H7: Limitations of This Review

1. **Cross-benchmark non-comparability:** No paper in the corpus compares all confidence signals head-to-head on the same benchmark with the same models. Rankings in Section 5.3 of the synthesis document and Appendix C are constructed from cross-paper evidence, not controlled comparisons.

2. **Paywall gap:** The full methodology of MedARC (Miao et al. 2025) was not accessible. The confidence-aware aggregation mechanism is documented only in the abstract and PubMed record.

3. **Recency of some findings:** Several 2026 papers (Zhao, Zong, Mommessin) were published within days of the search date. These have not been through extended peer review cycles.

4. **No empirical validation:** This review is a literature synthesis, not an empirical study. The proposed Cabinet pipeline (Appendix C) is based on cross-paper evidence extrapolation, not a controlled experiment. The expected improvement ranges (+2 to +10 pp) are derived from individual paper results and may not transfer to a combined multi-signal pipeline.

5. **Sycophancy-confidence interaction gap:** The most important unresolved gap identified in this review (Gap 1 in Section 7 of the synthesis): no paper has jointly studied confidence-weighted debate aggregation AND sycophancy detection in a single experimental framework. The proposed pipeline's sycophancy discounting mechanisms are therefore untested in combination with confidence weighting.

---

*This appendix accompanies the confidence-weighted synthesis study compiled at Sparse Halo Research, April 2026. Primary source data is preserved in the three deep-read files and synthesis document at `/home/user/workspace/`. Date of compilation: April 8, 2026.*


### Supplementary Reference Entries (Papers 32-33)

**[32] Wu et al. 2026**
- **Authors:** Wu, N., et al.
- **Year:** 2026
- **Title:** Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
- **Venue:** arXiv preprint
- **arXiv ID:** 2604.02923
- **URL:** https://arxiv.org/abs/2604.02923
- **Key Results:** Council Mode achieves a 10.7% hallucination rate on HaluEval (a 35.9% relative reduction vs. the best individual model at 16.7%). TruthfulQA: 82.6% truthful (+7.8 pp over the best individual). Bias variance: 0.003 vs. 0.021-0.028 for individuals (85-89% reduction). Structured synthesis vs. majority vote: 10.7% vs. 14.2% hallucination rate. Heterogeneous vs. homogeneous agents: 10.7% vs. 15.6%. High-complexity reasoning: 71.2% vs. 50.8%.

**[33] Li et al. 2025**
- **Authors:** Li, W., Lin, Y., Xia, M., Jin, C.
- **Year:** 2025
- **Title:** Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
- **Venue:** arXiv preprint
- **arXiv ID:** 2502.00674
- **URL:** https://arxiv.org/abs/2502.00674
- **Key Results:** Self-MoA outperforms standard MoA by 6.6% on AlpacaEval 2.0 and by 3.8% on average across MMLU, CRUX, and MATH. Demonstrates that mixing different LLMs often lowers average output quality. Task-dependent: diversity helps consensus tasks but may hurt pure generation aggregation.