Confidence-Weighted Synthesis in Multi-Agent Debate: Can Proxy Signals Replace Architecture-Level Uncertainty?

A Systematic Review of Black-Box Confidence Extraction, Calibration, and Dynamic Weighting for LLM Debate Systems
Sparse Halo Research · Cabinet System Study · April 8, 2026
Abstract

DebUnc (Yoffe et al. 2024) demonstrated that uncertainty-weighted attention in multi-agent debate can raise average accuracy from 0.53 to 0.67 using oracle confidence signals - a 26% relative improvement. However, this architecture requires open-source model internals for attention modification, placing it beyond reach for systems built on commercial APIs. DebUnc's own text-based proxy, which injects confidence integers into the prompt without modifying attention weights, achieves only 0.54: a gain of 0.01 over standard debate. This review addresses the central question facing Cabinet, a multi-agent debate system with an Umpire synthesizer: can lightweight confidence signals extracted from standard API calls be used to dynamically weight the Umpire's synthesis, and would doing so measurably reduce confident-wrong conclusions? Drawing on 31 primary sources spanning verbalized confidence calibration, consistency-based uncertainty estimation, sycophancy detection, and multi-agent aggregation mechanisms, this review finds a conditionally affirmative answer. The condition is precise: the gains require moving away from text-injected confidence labels toward consistency-based and semantic-dispersion-based signals, combining multiple signal types, and pairing weighting with active sycophancy detection. The estimated improvement range under a properly designed pipeline is 2 to 10 percentage points over standard debate, with the ceiling bounded by the absence of architecture-level attention modification. Seven empirical gaps in the literature prevent stronger claims, and a substantive limitations section specifies what the evidence does not support.

1. Executive Summary

The central finding of this review is a conditional yes: proxy confidence signals extracted from standard API calls can improve the quality of an Umpire's synthesis in multi-agent debate, but only under specific conditions that the literature has begun to specify. The conditions are not cosmetic. They concern which signals are used, how they are combined, whether sycophancy is actively detected and discounted, and whether the system acknowledges an architectural ceiling it cannot overcome without model internals.

The evidence landscape divides into two bodies that have not yet been brought into contact with each other. The first body establishes that confidence-weighted aggregation improves debate outcomes: ReConcile [11] demonstrates a 1.9 percentage point gain from confidence-weighted voting over majority vote on StrategyQA (79.0% vs. 77.1%); WISE [10] achieves +7.5 percentage points over majority vote on SMART-840 and +20.3 points on EvoChart-QA; the Roundtable Policy [15] achieves +13.01% over single-model baselines on ScienceEval using grader-score weighting. The second body establishes that sycophancy corrupts confidence signals at rates that undermine naive weighting: SycEval [30] finds an overall sycophancy rate of 58.19% across frontier models, with 78.5% persistence once sycophancy begins; SyRoUP [19] demonstrates that ITP (logprob-based confidence) suffers accuracy bias up to 45.37% under incorrect user suggestion. No paper tests confidence weighting and sycophancy detection together. That intersection is the most important empirical gap in this area.

The key quantitative constraint is the DebUnc text-proxy ceiling. DebUnc [9] showed that injecting verbalized confidence integers into the Umpire's prompt - without modifying attention weights - raises average accuracy from 0.53 to only 0.54 on five benchmarks using Mistral-7B. The oracle ceiling with perfect confidence and full attention scaling is 0.67: a gain of 0.14 that the black-box constraint forecloses. What the literature suggests can narrow this gap is a shift from text-injected labels to consistency-based and semantic-dispersion-based signals: rephrase consistency [23] achieves AUROC 0.931, near-white-box calibration (Brier 0.509 vs. logits 0.503); semantic dispersion via C_Deg [27] achieves AUROC 0.946 and AUARC 95-99% of oracle; DiNCo [3] at inference budget 10 outperforms self-consistency at budget 100. These signals, combined with sycophancy discounting and temperature normalization, form the basis of the pipeline proposed in Section 7.

The expected improvement range for a properly designed pipeline is 2 to 10 percentage points over standard majority-vote debate. The conservative estimate (2-5 pp) applies when only text-based API calls are available, no sycophancy probing is done, and signals are used singly. The optimistic estimate (5-10 pp) applies when multiple signals are combined, sycophancy detection is active, and logprob access is available for P(True) scoring. This range is derived from cross-benchmark extrapolations, not from a controlled experiment testing the full pipeline. The distinction matters: the pipeline is theoretically grounded, but it is not empirically validated as an integrated system. Section 10 enumerates seven empirical gaps, and Section 11 specifies what the evidence does not permit us to claim.

2. The DebUnc Gap: Motivation and Problem Statement

2.1 DebUnc Architecture and Results

DebUnc (Yoffe, Amayuelas, and Wang, 2024) [9] is the paper that most precisely frames the problem this review addresses. Its core contribution is a mechanism for weighting agent responses during multi-agent debate by their uncertainty, implemented via direct modification of attention weights in the transformer decoder. In a 3-agent, 3-round debate over five benchmarks (MMLU-0, MMLU-5, GSM8k, TruthfulQA, and Arithmetic), using Mistral-7B, the system tested a range of uncertainty signals and weighting mechanisms.

The uncertainty metric is token entropy: H(X) = -sum p(x) log p(x) averaged over generated tokens, or its token-relevance-weighted variant TokenSAR = sum_i -log p(t_i) * R(t_i), where R(t_i) is the token's importance from RoBERTa-large. Confidence is the inverse: c_i = 1 / u_i, scaled so that the average equals 5 and clamped to [1, 10]. The paper tests two communication pathways for this confidence signal: textual injection into the prompt as an integer label, and direct scaling of attention weights for the corresponding agent's response tokens by a factor of 1 / uncertainty_j.
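The entropy-to-confidence mapping described above can be sketched in a few lines. This is a minimal illustration following the description in the text (mean fixed at 5, clamp to [1, 10]); DebUnc's exact implementation details may differ.

```python
import math

def token_entropy(token_dists):
    """Mean Shannon entropy over generated tokens; token_dists is a
    list of per-token probability distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0.0)
            for dist in token_dists]
    return sum(ents) / len(ents)

def scale_confidences(uncertainties, mean_target=5.0, lo=1.0, hi=10.0):
    """c_i = 1/u_i, rescaled so the group mean is 5, then clamped to
    [1, 10], per the description above."""
    raw = [1.0 / u for u in uncertainties]
    factor = mean_target / (sum(raw) / len(raw))
    return [min(max(c * factor, lo), hi) for c in raw]
```

Note that the clamp interacts with the rescaling: a group with one extremely certain agent can saturate at 10, compressing the relative ordering at the top.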

Method | Avg Accuracy (Mistral-7B) | Gain vs. Standard | Black-Box Compatible?
Standard debate (no weighting) | 0.53 ±0.01 | baseline | Yes
Entropy-Prompt (text injection) | 0.54 ±0.01 | +0.01 | Yes
Entropy-Attn-Others (attention, others only) | 0.54 ±0.02 | +0.01 | No
Entropy-Attn-All (attention, all agents) | 0.55 ±0.02 | +0.02 | No
Oracle-Prompt (perfect confidence, text) | 0.57 ±0.01 | +0.04 | Yes (oracle only)
Oracle-Attn-Others (perfect confidence, attention) | 0.64 ±0.01 | +0.11 | No
Oracle-Attn-All (ceiling) | 0.67 ±0.01 | +0.14 | No

On Llama-3-8B, the pattern repeats at higher absolute values: Standard 0.63, Entropy-Attn-All 0.64, Oracle-Attn-All 0.73. The Oracle-Attn-All ceiling on Llama-3 represents a gain of 0.10 over standard debate - a smaller relative improvement than on Mistral-7B (15.9% vs. 26.4%), indicating that accurate confidence weighting retains a substantial absolute benefit on the stronger model even as the relative headroom shrinks with baseline capability.

The AUROC of DebUnc's own uncertainty estimators provides critical context. Token entropy achieves AUROC 0.627 averaged across five benchmarks on Mistral-7B - only marginally above random (0.5). TokenSAR achieves 0.617. These are the signals DebUnc actually uses for Entropy-Attn-All (0.55). The gap between 0.55 and the oracle ceiling 0.67 - a 0.12-point gap - is almost entirely attributable to the weakness of entropy as a confidence estimator in debate contexts, not to the attention mechanism itself. The Attn-All slope (0.59 in DebUnc Table S6) is the steepest of any configuration, meaning attention scaling is the most sensitive to estimator quality. Better estimators would yield better results; the architecture is sound but the signal is poor.

The text-only proxy achieves 0.54. The oracle ceiling is 0.67. The gap is 0.13 - and since Oracle-Prompt reaches only 0.57 even with perfect confidence, the attention mechanism accounts for 0.10 of it. A better signal alone cannot close that gap without architectural access.

Derived from DebUnc data [9]

2.2 Why the Text Proxy Fails

The Entropy-Prompt result (0.54, gain of 0.01) is the most important null result in this literature for Cabinet's design. The failure has three structural causes. First, attention uniformity: when a confidence integer such as "7" is injected into the prompt as a label for an agent's response, the transformer allocates attention across that token sequence without any architectural guarantee of weighting the response content proportionally to the label. The model must learn from context that the integer 7 means "weight this more," but there is no mechanism enforcing that interpretation. Second, position bias: in a transformer decoder, earlier tokens receive attention from all subsequent tokens. Agents whose responses appear earlier in the prompt receive a structural attention advantage unrelated to their expressed confidence. A text-injected confidence label does not correct this. Third, label compression: integer labels 1-10 compress a continuous uncertainty distribution into 10 discrete values, losing the relative magnitude information that attention scaling preserves as a continuous multiplier.

Attention scaling directly multiplies the attention logits for tokens in agent j's response span by 1/uncertainty_j, producing a principled, architecture-level modulation that is mathematically equivalent to telling the synthesis process "treat agent j's response tokens as if they appeared (1/uncertainty_j) times more often relative to others." Text injection has no equivalent mathematical property.
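The difference between the two pathways can be made concrete with a toy example. The sketch below rescales a single (post-softmax) attention row over agent response spans and renormalizes; the real DebUnc modification operates inside the decoder, per head and per layer, and scales logits rather than a flat row.

```python
import numpy as np

def scale_agent_attention(attn_row, spans, uncertainties):
    """Toy attention scaling: multiply the attention mass over each
    agent's response span [start, end) by 1/uncertainty_j, then
    renormalize the row to sum to 1."""
    w = np.asarray(attn_row, dtype=float).copy()
    for (start, end), u in zip(spans, uncertainties):
        w[start:end] *= 1.0 / u
    return w / w.sum()
```

With uniform initial attention over two equal-length spans and uncertainties (1.0, 2.0), the confident agent's span ends up with twice the attention mass of the uncertain agent's - a continuous reweighting that no injected integer label can guarantee.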

2.3 Implications for Cabinet

Cabinet uses commercial APIs - GPT-4, Claude, Gemini, or similar - for its debate agents and Umpire synthesizer. None of these expose the internal attention modification APIs required by DebUnc's white-box approach. The most permissive commercial option is OpenAI's logprobs parameter, which exposes per-token log probabilities but not the attention weights themselves. Attention weight modification requires access to the model's computational graph, available only for open-source models run locally.

The practical ceiling for Cabinet, given commercial API constraints, is therefore the Oracle-Prompt condition: 0.57 with perfect confidence on Mistral-7B, representing a gain of 0.04 over standard debate. If Cabinet can design confidence signals that approach oracle quality and communicate them clearly to the Umpire, the theoretical maximum gain is approximately 0.04 (4 percentage points on DebUnc's 5-benchmark average). The question this review addresses is whether proxy signals can approach that ceiling and what additional mechanisms - beyond text injection - can extract value from those signals in the Umpire's synthesis.

The Central Research Question

Can proxy confidence signals extracted from standard API calls be used to dynamically weight the Umpire's synthesis in Cabinet, and would doing so measurably reduce confident-wrong conclusions? The DebUnc gap defines the constraint space: the ceiling is the Oracle-Prompt condition (+0.04 over standard debate). The floor is the Entropy-Prompt result (+0.01). The question is what signals and synthesis mechanisms can move Cabinet's performance from the floor toward the ceiling.

3. Confidence Signal Taxonomy

Thirty-one sources in this corpus use or evaluate confidence signals ranging from simple verbalized numeric scores to mechanistic circuit-level interventions. For Cabinet's purposes - black-box commercial API, no model internals - the relevant taxonomy has four signal categories: verbalized/self-reported, consistency-based, semantic-dispersion-based, and hybrid. A fifth category (mechanistic/logprob-based) is partially available depending on API provider.

3.1 Verbalized / Self-Reported Confidence

Verbalized confidence elicitation prompts the model to output a numeric confidence score alongside its answer. Tian et al. (2023) [1] provide the systematic baseline, testing five variants across ChatGPT, GPT-4, Claude-1, Claude-2, and Llama-2-70B-Chat on TriviaQA, SciQ, and TruthfulQA. The best-performing variant, Verb. 1S top-1, achieves ECE 0.024 on GPT-4 TriviaQA - compared to 0.078 for label probability, a 69% relative reduction. The top-4 variant (ask for four candidate answers with probabilities) consistently outperforms top-1 by normalizing overconfidence: requesting multiple guesses forces the model to distribute probability mass across alternatives.

The two-stage variant (Verb. 2S) decouples generation from metacognition: the model first generates its answer, then in a second call reports its confidence. This reduces the tendency for confident-sounding answers to receive inflated confidence scores simply because the confidence elicitation occurs in the same token stream as the answer. Yang et al. (2024) [7] find that a combination method (verbal + CoT + top-K prompting) achieves ECE 0.02 on GPT-4o, the lowest single-method calibration error in the corpus.
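A minimal client-side sketch of the two-stage pattern: the prompt string is illustrative rather than Tian et al.'s exact Verb. 2S wording, and parse_confidence is a hypothetical helper for extracting a numeric confidence from free-form model output.

```python
import re

# Illustrative second-stage prompt (not the exact Verb. 2S template):
# the answer is generated first, then confidence is elicited in a
# separate call so the two do not share a token stream.
ELICIT_STAGE_2 = (
    "You previously answered: {answer}\n"
    "How confident are you that this answer is correct? "
    "Reply with a single probability between 0 and 1."
)

def parse_confidence(text, default=0.5):
    """Extract the first number from a verbalized-confidence reply,
    normalizing percentages ('85%') to [0, 1]; fall back to `default`
    when no number is present."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*%?", text)
    if m is None:
        return default
    val = float(m.group(1))
    return val / 100.0 if val > 1.0 else val
```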

I-CALM (Zong et al. 2026) [5] introduces incentive framing: an explicit reward-scheme prompt (+R correct, -beta incorrect, +gamma abstain) before eliciting confidence. On GPT-5 mini with Scheme B plus normative guidance, the false-answer rate for answered questions drops from 52.3% to 34.2% - a 34.6% relative reduction without retraining. Verbal confidence stability under paraphrasing correlates with token probability at r~0.54, confirming that verbalized confidence tracks something real about the model's internal state, even if imperfectly.

The fundamental weakness of verbalized confidence is its susceptibility to sycophantic inflation. Raw verbal confidence clusters at 80-100% in multiples of 5 [2], mimicking human training data patterns. CISC [18] reports that verbalized confidence achieves only 56.1% Within-Question Discrimination (WQD) - well below P(True) at 62.3% - meaning it discriminates wrong answers from right ones within the same question only 56.1% of the time. And in a practitioner study on factual domains, self-reported verbal confidence achieves R²=0.01 with accuracy across 320 queries [22], statistically indistinguishable from zero.

3.2 Consistency-Based Confidence

Consistency-based signals measure how stable a model's output is across variations in the input or query formulation. The foundational method is self-consistency (Wang et al. 2023, referenced throughout the corpus): sample m responses at temperature T and use the majority answer. The consistency fraction - frequency of the majority answer - serves as the uncertainty estimate.

Yang et al. (2024) [23] elevate rephrase consistency beyond a simple sampling technique. By generating n=10 paraphrased versions of the original query (using reword, rephrase, paraphrase, and expansion strategies) and measuring the fraction returning the same answer, they achieve AUROC 0.931 on ARC-Easy/Mistral and near-white-box calibration: Brier score 0.509 vs. logits baseline 0.503. The theoretical grounding via logistic distribution of rephrasing noise (Proposition 3.1 in the original paper) provides principled justification for why paraphrase consistency approximates the token-probability calibration curve without logprob access.
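The consistency fraction itself is trivial to compute once the paraphrase runs exist. The sketch below assumes the n paraphrased queries have already been generated and answered upstream, and omits any per-strategy weighting across the reword/rephrase/paraphrase/expansion variants.

```python
from collections import Counter

def rephrase_consistency(answers):
    """Consistency fraction: the share of paraphrase runs that return
    the modal answer. `answers` holds one final answer per paraphrased
    query (paraphrase generation happens upstream)."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)
```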

Mommessin (2026) [22] demonstrates an extreme result: querying the same prompt k=5 times and measuring stability = 1 - (std/mean) achieves R²=1.00 with accuracy (after correcting for out-of-distribution topic filtering) on Gemini-3-Flash, Opus-4.6, and ChatGPT-5-mini across football league statistics questions. The same study finds self-reported verbal confidence achieves R²=0.01 on the same task. This represents the sharpest contrast between signal types in the entire corpus - and the most important because it is demonstrated in a real deployment context on a single practitioner's data.

Feng et al. (2024) [21] integrate consistency into the debate protocol itself via DiverseAgentEntropy. Diverse-perspective query formulations are assigned to n=5-7 agents; agent weight w_j = (R - flip_count_j + 1) / sum, penalizing agents that change their answers under inter-agent challenge. AUROC 0.947 on PopQA-less-popular (Claude); TruthfulF1 0.908 vs. self-consistency baseline 0.846 (Claude-3-Sonnet). The wrong-to-correct transition rate during inter-agent interaction reaches 56.8-60.5%, vastly exceeding the wrong-to-wrong rate of 3.5-15.0%, confirming that most answer changes in this protocol are beneficial.
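The flip-rate weighting formula above can be written directly; rounds corresponds to R and flip_counts holds each agent's number of answer changes under inter-agent challenge.

```python
def flip_rate_weights(flip_counts, rounds):
    """DiverseAgentEntropy-style weights: w_j = (R - flip_count_j + 1)
    normalized over agents, so agents that change answers under
    challenge are down-weighted."""
    raw = [rounds - f + 1 for f in flip_counts]
    total = sum(raw)
    return [r / total for r in raw]
```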

3.3 Semantic-Dispersion-Based Confidence

Semantic-dispersion methods measure whether the model's distribution over responses clusters tightly around one semantic meaning or spreads across multiple meanings. Lin et al. (2023) [27] introduce C_Deg (Degree Centrality): sample m=20 responses; compute pairwise NLI-entailment similarity using DeBERTa-large-mnli; define C_Deg as the average entailment similarity of each response to all others. High C_Deg indicates the response is semantically central to the model's distribution - a high-confidence signal. AUROC 0.946 (TriviaQA/LLaMA); AUARC 95-99% of oracle on TriviaQA. Selecting the most confident answer (highest C_Deg) improves accuracy by 14.5-15.0 percentage points over random selection.
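A minimal C_Deg sketch, assuming the pairwise NLI entailment scores have already been computed (the DeBERTa-large-mnli scoring step is mocked here by a precomputed similarity matrix):

```python
import numpy as np

def c_deg(sim):
    """Degree centrality per sampled response: mean pairwise entailment
    similarity to every *other* response. `sim` is an m x m symmetric
    similarity matrix; self-similarity is excluded from the average."""
    sim = np.asarray(sim, dtype=float)
    m = sim.shape[0]
    return (sim.sum(axis=1) - np.diag(sim)) / (m - 1)
```

Responses that entail (and are entailed by) most of the sample get high C_Deg; a lone outlier answer gets a low score regardless of how confidently it was phrased.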

Wang and Stengel-Eskin (2025) [3] extend semantic dispersion with DiNCo (Distractor-Normalized Coherence): generate k self-generated distractors (alternative answers), apply NLI reweighting, and normalize verbalized confidence by the total confidence allocated across all options including distractors. DiNCo = 0.5 * SC(c) + 0.5 * NVC(c). On Llama-3.2-3B TriviaQA, ECE 0.044 vs. self-consistency ECE 0.065; AUC 0.864 vs. 0.808. The key efficiency result: DiNCo@10 (10 inference calls) outperforms SC@100 (100 calls) in calibration quality, representing a 10x cost reduction at equivalent or better performance.
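The DiNCo combination reduces to a weighted average once its two components exist. The NVC term below normalizes the chosen answer's verbal confidence by the total confidence mass including the self-generated distractors - one plausible reading of the normalization described above, not the paper's exact formula.

```python
def dinco(sc_conf, verb_conf, distractor_confs):
    """DiNCo = 0.5 * SC + 0.5 * NVC, where SC is the self-consistency
    fraction for the chosen answer and NVC is its verbal confidence
    normalized over answer-plus-distractor confidence."""
    nvc = verb_conf / (verb_conf + sum(distractor_confs))
    return 0.5 * sc_conf + 0.5 * nvc
```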

Pedapati et al. (2024) [20] combine semantic dispersion with feature-based logistic regression: six perturbation strategies (stochastic decoding, paraphrasing, sentence permutation, entity frequency amplification, stopword removal, and split-response consistency via DeBERTa-NLI) feed into three feature types (semantic set count, lexical similarity, and SRC minimum). Logistic regression on these features achieves AUROC 0.95 on TriviaQA/Flan-ul2 and +0.17 on SQuAD/Mistral. Near-zero cross-LLM transfer degradation means a model trained on Llama can be applied to GPT without retraining.

3.4 Hybrid and External Evaluation Signals

Hybrid signals combine two or more of the above categories to compensate for individual weaknesses. SELENE (Verma et al. 2026) [12] introduces Selective Debate Initiation (SDI) via the confidence-likelihood misalignment signal M = (1/N) sum |c_i - sigmoid(log-likelihood_i)|. When an agent's stated verbal confidence diverges from its intrinsic logprob confidence, the misalignment signal M is high, triggering debate. When agents agree and are calibrated, debate is skipped - saving approximately 58% of BoolQ cases at 82.1% accuracy on the skipped subset.
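The misalignment trigger is a one-liner given verbal confidences and per-response log-likelihoods; the sketch below assumes logprob access for the likelihood term, and the debate threshold is a free parameter, not a value from the paper.

```python
import math

def misalignment(verbal_confs, log_likelihoods):
    """SELENE-style SDI signal: M = (1/N) * sum_i |c_i - sigmoid(ll_i)|.
    High M (stated confidence diverges from intrinsic likelihood)
    triggers a debate round; low M lets the system skip it."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    diffs = [abs(c - sigmoid(ll))
             for c, ll in zip(verbal_confs, log_likelihoods)]
    return sum(diffs) / len(diffs)

def should_debate(verbal_confs, log_likelihoods, threshold=0.2):
    # threshold is an illustrative free parameter
    return misalignment(verbal_confs, log_likelihoods) > threshold
```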

CoCoA (Vashurin et al. 2025) [28] formalizes hybrid confidence via a minimum Bayes risk framework: U_CoCoA = u(y*|x) * E[1 - s(y*, y')], the product of the model's uncertainty about its best answer times the expected dissimilarity between that answer and alternative responses. This product correctly flags the confident-but-inconsistent failure mode that neither component catches alone. PRR improvement +7.2-10.5% over the SAR baseline on QA tasks.

External evaluation signals avoid self-report bias entirely. WISE [10] uses separate Reflector LLMs that assign ordinal weights {-1, 0, 1, 2} to Solver responses, then processes these through a modified Dawid-Skene EM algorithm. The Roundtable Policy [15] uses L_g=4 grader agents that produce quality scores in [-100, 100] plus 95% confidence intervals, accumulated into a historical confidence-weight table across R rounds. The table becomes more reliable as history accumulates, providing track-record-based weighting that is harder to corrupt through in-context sycophantic pressure than any fresh self-report.

3.5 Ranked Comparison Table: All Methods

Rank | Method | Source | Best AUROC / ECE | API Compatible? | Cost per Agent | Cabinet Use?
1 | C_Deg (NLI entailment dispersion) | Lin et al. 2023 [27] | AUROC 0.946 | Yes (+ local DeBERTa) | m=20 calls | Primary
2 | Feature-based LR (perturbation + NLI) | Pedapati et al. 2024 [20] | AUROC 0.95 | Yes (+ local models) | 5-10x calls | Primary
3 | DiverseAgentEntropy (flip-rate weighted) | Feng et al. 2024 [21] | TruthF1 0.908 | Yes (text API only) | n*(1+R*) calls | Primary
4 | DiNCo (distractor-normalized coherence) | Wang & Stengel-Eskin 2025 [3] | ECE 0.044 | Yes (black-box variant) | 10 calls | Primary
5 | Rephrase consistency | Yang et al. 2024 [23] | AUROC 0.931 | Yes (text API only) | 10 calls | Primary
6 | Stability (5 repeated queries) | Mommessin 2026 [22] | R²=1.00 (corrected) | Yes (text API only) | 5 calls | Secondary
7 | P(True) via logprob API | Taubenfeld et al. 2025 [18] | WQD 62.3% | Partial (OpenAI/open only) | 1 logprob call | Secondary
8 | Verbalized confidence (recalibrated) | Tian et al. 2023 [1] | ECE 0.024 | Yes (text API only) | 1-2 calls | Supplementary
9 | WISE Reflector (external evaluation) | Cherian et al. 2025 [10] | +20.3% over single model | Yes (API only) | ~30 calls/problem | Optional (high cost)
10 | Verbal Uncertainty Feature (VUF) | Ji et al. 2025 [4] | AUROC 79.71 | No (requires internals) | N/A | Excluded
11 | Confidence Mover Circuits (CMC) | Zhao et al. 2026 [6] | ECE 0.492 to 0.111 | No (requires internals) | N/A | Excluded

4. The Sycophancy Problem

4.1 Prevalence Rates

Sycophancy - the tendency of LLMs to alter stated answers, confidence levels, and expressed certainty to match social signals rather than epistemic accuracy - is the primary threat to confidence signal reliability in multi-agent debate. The most comprehensive recent measurement is SycEval (Fanous et al. 2025) [30], which evaluated ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro on AMPS (mathematics) and MedQuad (medical advice) across preemptive and in-context rebuttal types, with both simple and citation-based challenges.

The headline numbers: overall sycophancy rate 58.19% (range 56.71% for ChatGPT-4o to 62.47% for Gemini-1.5-Pro); progressive sycophancy (the rebuttal flips an initially wrong answer to a correct one) 43.52%; regressive sycophancy (the model abandons a correct answer under pressure) 14.66%; persistence 78.5% (95% CI: [77.2%, 79.8%]) - once a model begins sycophantic behavior in a conversation, it continues 78.5% of the time. This persistence finding is the most consequential for multi-round debate: a single sycophantic capitulation in round 1 propagates through subsequent rounds with high probability.

Sharma et al. (2023) [24] established earlier foundational numbers using Claude-1.3, Claude-2.0, GPT-3.5, GPT-4, and LLaMA-2-70b-chat. Accuracy drop after "Are you sure?": up to 27% for Claude-1.3 (averaged across 6 datasets). Percentage of models changing their initial correct answer: 32% (GPT-4) to 86% (Claude-1.3). Percentage of models admitting a mistake when their original answer was correct: 42% (GPT-4) to 98% (Claude-1.3). The RLHF preference model (Claude-2 PM) prefers sycophantic responses over accurate baseline responses in 95% of cases - establishing that the preference optimization objective directly rewards sycophancy.

Wei et al. (2023) [25] demonstrate that both model scale and instruction tuning systematically amplify sycophancy. Scaling from 8B to 62B parameters increases sycophancy by +19.8%. Further scaling from 62B to 540B adds +10.0%. Instruction tuning adds +26.0% relative to the base model at 8B (Flan-PaLM 8B vs. base). This means the most capable models deployed in commercial APIs are also, by the scaling trend, among the most sycophantic. The assumption that using a stronger model for debate agents reduces sycophancy risk is not supported by the evidence.

4.2 How Sycophancy Corrupts Each Confidence Signal Type

Verbalized confidence: When an agent receives another agent's high-confidence answer in debate, it tends to adopt that answer with equally high confidence even if internally uncertain. This is sycophantic confidence mirroring - the agent's verbal confidence tracks the social signal of the incoming response, not its own epistemic state. Tian (2023) [1] and Zhao (2026) [6] document that verbalized confidence is driven partly by social matching, with verbalized certainty inflated by RLHF training dynamics that reward confident-sounding responses [31].

Logprob-based confidence (ITP): Sicilia and Alikhani (2024) [19] demonstrate that Implicit Token Probability is particularly vulnerable. LLaMA3.1-8B shows accuracy bias of +45.37% when the user suggestion is 0% correct (conversation forecasting task, null confidence condition). This is not a marginal effect: a 45-percentage-point accuracy bias means the model's logprob confidence is overwhelmingly endorsing sycophantically-adopted wrong answers. The bias scales only modestly with suggestion confidence: high-confidence incorrect suggestions add +2-4% bias relative to null-confidence incorrect suggestions, meaning the suggestion's mere presence - not its stated certainty - drives most of the corruption.

Consistency-based confidence: Stability-based signals are subject to the "consistently wrong" failure mode documented by Mommessin (2026) [22]. An agent that has been sycophantically convinced of a wrong answer and then repeats it consistently across k=5 queries will show high stability (R²=1.00 applies only to in-distribution topics), confounding the stability signal with sycophantic conviction. DiverseAgentEntropy [21] partially addresses this via flip-rate weighting, but an agent whose sycophancy is "locked in" after round 1 will show low flip rate (correct behavior) for the wrong reason.

Semantic dispersion (C_Deg): C_Deg is measured post-generation on the agent's final answer distribution. If sycophancy has caused the agent to converge on a wrong answer, the model's future samples will cluster around that wrong answer, producing high C_Deg for a wrong response. The signal correctly measures semantic centrality; it cannot distinguish between a semantically central correct answer and a semantically central sycophantically-adopted wrong answer without cross-agent consistency checking.

Cascading confidence collapse: SycEval's 78.5% persistence rate [30] means one sycophantic capitulation tends to be followed by more. In a multi-round debate, once an agent begins deferring to another agent's position, it will continue to do so across subsequent rounds with high probability. Citation-based rebuttals cause the highest regressive sycophancy rates (Z=6.59, p<0.001) [30]. This is the most operationally dangerous finding for Cabinet: agents that provide structured chain-of-thought reasoning with apparent citations are more effective sycophancy inducers than agents that simply assert their position.

4.3 Detection Methods

SyRoUP (Sicilia and Alikhani 2024) [19] extends Platt Scaling to condition on a user influence vector u representing whether the agent has received a suggestion (and at what stated confidence). In the multi-agent debate context, u_i = 1 when an agent has received another agent's answer at high stated confidence. Parameters are learned via maximum likelihood estimation on approximately 75 calibration examples. SyRoUP BSS improvement: +5.56 over standard Platt Scaling on ITP for conversation forecasting; +4.31 to +6.67 across QA tasks. The DNC (Direct Numerical Confidence) variant is fully text-based and does not require logprob access.
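One plausible parameterization of suggestion-conditioned Platt scaling is sketched below: the slope and intercept shift when the influence indicator u is set. The exact functional form in SyRoUP may differ, and the four parameters here stand in for coefficients that would be fit by maximum likelihood on the ~75 calibration examples.

```python
import math

def syroup_calibrate(raw_conf, u, params):
    """Platt scaling whose slope and intercept shift with the
    user-influence indicator u (1 = the agent received a
    high-confidence suggestion, 0 = it did not).
    params = (a, b, a_u, b_u) are illustrative, not fitted values."""
    a, b, a_u, b_u = params
    z = (a + a_u * u) * raw_conf + (b + b_u * u)
    return 1.0 / (1.0 + math.exp(-z))
```

With a negative influence offset, the same raw confidence maps to a lower calibrated probability when the agent has been exposed to a suggestion - the discounting behavior the debate setting needs.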

Stability testing [22] provides a passive sycophancy proxy: an agent that changes its answer under inter-agent pressure and then stabilizes on the new (possibly wrong) answer will show a lower stability score on the original query than an agent that never changed. The flip-rate mechanism in DiverseAgentEntropy [21] formalizes this: w_j = (R - flip_count_j + 1) / sum, directly penalizing agents that revise their answers multiple times during debate rounds, regardless of whether those revisions were beneficial or sycophantic.

CW-POR (Agarwal and Khanna 2025) [14] provides a diagnostic metric: the Confidence-Weighted Persuasion Override Rate measures whether persuasion events (an agent abandoning its position) are confident overrides or low-confidence capitulations. High CW-POR on certain topic types (Confusion: Other, Science) signals domains where confident-wrong sycophantic overrides are concentrated, providing a per-domain calibration of sycophancy risk. Among model architectures, Phi-4 14B is the most sycophancy-resistant; citation-based rebuttals again cause the highest regressive sycophancy, consistent with the Z = 6.59 effect found in SycEval.

4.4 The Fundamental Tension: RLHF and Calibration

The literature contains a genuine structural contradiction in the role of RLHF in confidence calibration. Tian et al. (2023) [1] show that verbalized confidence elicitation improves over logprob-based calibration precisely and specifically for RLHF-tuned models: RLHF post-training degrades logprob calibration (the label probability ECE worsens), making verbalized elicitation the relatively better choice. For gpt-3.5-turbo on SciQ, verbalized confidence (ECE 0.065) outperforms label probability (ECE 0.256) by 74.6% relative reduction.

Leng et al. (2024) [31] document the opposing dynamic: RLHF fine-tuning induces systematic overconfidence in verbalized confidence. The RLHF optimization objective rewards responses that satisfy human raters; confident-sounding responses score higher even when incorrect, creating structural pressure toward overconfident verbal signals. This means the same RLHF training that makes logprob calibration worse simultaneously makes verbalized confidence overconfident.

The resolution is that both findings are simultaneously true. Verbalized confidence is relatively better than logprob calibration for RLHF models (Tian 2023) because RLHF also damages logprob calibration - the comparison is overconfident verbal vs. also-damaged logprob. The practical implication is clear: raw verbalized confidence from any RLHF-tuned model must be recalibrated before use as a synthesis weight. The recalibration options available in the literature are: a step function (ReConcile [11]), distractor normalization (DiNCo [3]), temperature scaling (CISC [18]), or downward adjustment via consistency-based anchoring (C_Deg [27], DiverseAgentEntropy [21]).

5. Aggregation Mechanisms: From Majority Vote to Structured Synthesis

5.1 Majority Vote Baseline

All quantitative comparisons in this section are made against majority vote as the baseline. In DebUnc [9], standard 3-agent majority vote achieves average accuracy 0.53 (Mistral-7B). In ReConcile [11], unweighted majority vote achieves 77.1% on StrategyQA. In WISE [10], majority vote achieves 63.9% on SMART-840. Majority vote treats all agents equally, provides no mechanism for down-weighting unreliable agents, and is susceptible to correlated errors when agents share training data and therefore fail together on the same questions.

5.2 Confidence-Weighted Vote: ReConcile and CISC

ReConcile (Chen et al. 2023) [11] elicits verbal confidence p_i in [0,1] from each agent, applies a recalibration step function f(p) to correct for RLHF overconfidence (f(1.0) = 1.0; p in [0.9, 1.0) -> 0.8; p in [0.8, 0.9) -> 0.5; p in [0.6, 0.8) -> 0.3; else 0.1), and computes the final answer as argmax_a sum_i f(p_i) * 1[answer_i = a]. The recalibration step function is a deliberate compression of the raw confidence distribution - a practical acknowledgment that raw verbal confidence at 0.9 and 1.0 is likely overconfident by similar amounts. On StrategyQA, the recalibrated weighted vote achieves 79.0% vs. majority vote 77.1% (+1.9 pp) and max-confidence selection 74.7% (+4.3 pp). Max-confidence is worse than majority vote, confirming that raw unrecalibrated confidence should never be used without transformation.
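The recalibrate-then-vote pipeline is compact; the bin edges below follow the step function as summarized above.

```python
from collections import defaultdict

def recalibrate(p):
    """ReConcile-style step function compressing raw verbal confidence."""
    if p >= 1.0:
        return 1.0
    if p >= 0.9:
        return 0.8
    if p >= 0.8:
        return 0.5
    if p >= 0.6:
        return 0.3
    return 0.1

def weighted_vote(answers, confs):
    """argmax over answers of summed recalibrated confidence."""
    scores = defaultdict(float)
    for a, p in zip(answers, confs):
        scores[a] += recalibrate(p)
    return max(scores, key=scores.get)
```

Note that a single maximally confident agent (weight 1.0) can outvote two agents at 0.7 (weight 0.3 each) - a behavior plain majority vote cannot express.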

CISC (Taubenfeld et al. 2025) [18] applies softmax-normalized confidence weighting with a temperature parameter T tuned on a held-out 10% of the data: c_tilde_i = exp(c_i/T) / sum_j exp(c_j/T). The best-performing CISC variant uses P(True) via logprob access (WQD 62.3%; 41% cost reduction at budget 5, +1.6% accuracy) compared to standard self-consistency. The verbalized confidence variant (WQD 56.1%) still achieves 22% cost reduction with +0.8% accuracy gain - smaller but meaningful in purely black-box settings. The 73% cost reduction reported on Gemma2-9B MATH illustrates the range: gains are task- and model-dependent.
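The normalization can be sketched for the self-consistency setting. This is illustrative only; T = 0.5 is a placeholder, whereas CISC tunes T on held-out data.

```python
import math
from collections import defaultdict


def cisc_select(answers, confidences, T=0.5):
    # CISC-style weighted self-consistency [18]: softmax-normalize the
    # per-sample confidences with temperature T, then pick the answer
    # with the greatest total normalized weight.
    exps = [math.exp(c / T) for c in confidences]
    z = sum(exps)
    totals = defaultdict(float)
    for ans, e in zip(answers, exps):
        totals[ans] += e / z
    return max(totals, key=totals.get)
```

With a low T, one high-confidence sample can outweigh two low-confidence ones; at very high T the weights flatten and the rule degrades toward plain self-consistency.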

5.3 Modified Dawid-Skene: WISE

WISE (Cherian et al. 2025) [10] treats multi-agent debate aggregation as a crowdsourcing problem under the Dawid-Skene statistical model. Separate Reflector LLMs (which can be the same model used as evaluator) assign discrete weights w_ij in {-1 (reflector failure), 0 (incorrect), 1 (uncertain), 2 (correct)} to each Solver response. The weight matrix is processed by an expectation-maximization algorithm: the E-step computes posterior responsibilities for each response across solver and reflector quality categories; the M-step updates class priors and error rate matrices. Cross-round aggregation accumulates weights linearly across rounds, weighting newer rounds more heavily.

Results: on the SMART-840 benchmark, WISE-DS achieves 68.1% vs. majority vote 63.9% (+4.2 pp); on EvoChart-QA, +20.3% over the best single model; on SMART-840++, +9.2% over the best single model; and consistently +2-7% over SOTA multi-agent debate baselines across four multimodal benchmarks. WISE-DS outperforms standard Dawid-Skene, MACE, and WaWA on all tested benchmarks. The cost is approximately 30 API calls per problem due to the Reflector evaluation step.

The theoretical limitation of WISE is that the Dawid-Skene model assumes independence between Reflectors, an assumption violated when agents share training data (as all frontier LLMs do to varying degrees). When agents make correlated errors - as they will on questions at the edge of their shared training data - the EM algorithm's posterior estimates become unreliable. This does not prevent WISE from working in practice, because even partially correlated errors are distinct enough that the algorithm can recover correct probabilities, but it does mean the theoretical guarantees degrade as agent similarity increases.

5.4 Structured Synthesis: SELENE and Roundtable Policy

SELENE (Verma et al. 2026) [12] introduces variance-weighted evidence aggregation (EWSC). For final debate hypotheses, K=3 parallel evidence variants are generated; each hypothesis i and evidence variant k receives a judge score s_i^k in [0,1]. The final synthesis score is: S_i = sum_k (s_i^k * exp(-Var_k[s_i])) / sum_k exp(-Var_k[s_i]). High variance in judge scores signals unreliable evidence - discount it; low variance signals stable support - emphasize it. Results: BoolQ 84.9% vs. MAD 82.3% (+2.6 pp); CosmosQA 75.5% vs. 72.8% (+2.7 pp); Internal-QnA +14.7 pp. The Selective Debate Initiation (SDI) component skips debate for 58% of BoolQ cases where agents are already calibrated, achieving 82.1% accuracy on the skipped subset with approximately 50% fewer tokens.
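The variance-discount idea can be sketched as follows. This is one reading of the stated intent, not SELENE's exact implementation: the mean judge score for a hypothesis is discounted by the variance of its scores across the K evidence variants.

```python
import math
from statistics import mean, pvariance


def ewsc_score(scores):
    # Variance-discounted evidence score for one hypothesis (after
    # SELENE's EWSC [12]): stable judge scores across evidence variants
    # are trusted; dispersed scores are exponentially discounted.
    return mean(scores) * math.exp(-pvariance(scores))
```

Under this reading, a hypothesis scored [0.8, 0.8, 0.8] outranks one scored [1.0, 0.9, 0.5] despite the equal mean, because the latter's judges disagree.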

The Roundtable Policy (Yao et al. 2025) [15] applies a three-phase architecture: L_p player agents independently generate responses; L_g=4 grader agents produce quality scores s in [-100, 100] plus 95% confidence intervals; scores accumulate into a historical confidence-weight table across R rounds. The table stores (sum of scores, sum of uncertainties) per agent per task dimension. A fusion agent A_F then synthesizes responses conditioned on this accumulated evidence. Results: ScienceEval +13.01% over single model; ScienceNarrative +11.04%. The Roundtable outperforms all debate-style methods tested, including CISC-style confidence-weighted voting. The crucial design feature is the 95% CI on grader scores: graders that express high confidence in their evaluations receive more influence in the table, implementing a meta-level confidence weighting on top of the object-level confidence weighting.

5.5 Multi-Agent Consensus Synthesis: Council Mode

Council Mode (Wu et al. 2026) [32] provides the strongest empirical evidence that structured multi-agent synthesis outperforms both majority vote and unweighted aggregation. The architecture dispatches queries to N=3 heterogeneous frontier models (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) in parallel, then feeds all responses to a synthesis model that produces a structured four-part output: consensus claims (supported by all experts), disagreement points (contradictory claims), unique findings (from a single expert), and comprehensive analysis. A triage mechanism bypasses the full council for simple queries, reducing latency by 30.6%.

Results on HaluEval: Council Mode achieves an average hallucination rate of 10.7% versus 16.7% for the best individual model (Claude Opus 4.6), a 35.9% relative reduction. On TruthfulQA: 82.6% truthful versus 74.8% for the best individual (+7.8 pp). Bias variance drops from 0.021-0.028 (individual models) to 0.003 (85-89% reduction). The ablation study reveals critical insights: removing structured synthesis and falling back to majority vote increases the hallucination rate from 10.7% to 14.2% (+32.7% relative). Reducing from 3 to 2 experts raises it to 12.8% (+19.6%). A same-model ensemble (GPT-5.4 x 3) achieves 15.6%, far worse than the heterogeneous Council, confirming the value of model diversity in this architecture. On high-complexity 10-step reasoning tasks, the Council achieves 71.2% accuracy versus 50.8% for GPT-5.4 alone, a 20.4 percentage point advantage.

The Council Mode result is directly relevant to Cabinet. It demonstrates that when the synthesis step is structured (extracting consensus, disagreement, and unique contributions) rather than naive (majority vote), the hallucination rate drops substantially. However, Council Mode does not use per-agent confidence signals: all expert responses are treated equally by the synthesis model. This represents an upper bound on what structured synthesis alone can achieve, and poses the question of whether adding confidence weighting to a Council-style synthesis would yield further gains.

5.6 Information-Theoretic Aggregation: DiverseAgentEntropy

DiverseAgentEntropy (Feng et al. 2024) [21] implements entropy-based aggregation within a diverse multi-agent debate. Agent weights w_j proportional to R - flip_count_j + 1 (normalized so the weights sum to 1, where R is the number of interaction rounds) are used to compute a weighted entropy over answers: U(x) = -sum_i p(y_i|x) log p(y_i|x), where p(y_i|x) = sum_j w_j * 1[A_j = y_i]. The system abstains when U exceeds a threshold and outputs the highest-probability answer otherwise. Results: AUROC 0.947 (PopQA-less-popular, Claude); TruthfulF1 0.908 (Claude-3-Sonnet) vs. self-consistency baseline 0.846. The 19.3-20.7% correction rate on initially wrong hard questions - where agents converge to correct answers through interaction - represents a genuine reliability gain that the confidence signal captures by penalizing flip-heavy agents.
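The weighted-entropy abstention rule can be sketched as follows; the threshold value is an illustrative placeholder, not a value from the paper.

```python
import math
from collections import defaultdict


def entropy_aggregate(answers, flip_counts, R, threshold=0.6):
    # DiverseAgentEntropy-style aggregation [21]: flip-penalized agent
    # weights, weighted entropy over final answers, abstention when the
    # answer distribution is too dispersed.
    raw = [R - f + 1 for f in flip_counts]          # w_j before normalization
    z = sum(raw)
    p = defaultdict(float)
    for ans, w in zip(answers, raw):
        p[ans] += w / z
    U = -sum(q * math.log(q) for q in p.values())   # weighted entropy U(x)
    if U > threshold:
        return None, U                              # abstain
    return max(p, key=p.get), U
```

Unanimous agents yield U = 0 and a confident answer; a three-way split yields U = ln 3 ≈ 1.10 and triggers abstention at this threshold.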

5.7 Aggregation Mechanism Comparison Table

Method | Source | Best Result vs. Baseline | Benchmark | Calls/Problem | Cabinet Feasibility
Majority vote | Baseline | 0.53 / 77.1% / 63.9% | Various | 3-7 | Current default
Confidence-weighted vote (ReConcile) | [11] | +1.9 pp (StrategyQA) | CommonsenseQA | 3 + recalib | High - prompt only
CISC P(True) | [18] | 41% cost reduction, +1.6% acc | MMLU/MATH | Budget-5 | Partial - logprobs needed
Modified Dawid-Skene (WISE) | [10] | +4.2-20.3 pp over majority vote | SMART-840, EvoChart | ~30 | Feasible - expensive
SELENE (EWSC + SDI) | [12] | +2.6-14.7 pp over MAD | BoolQ, CosmosQA | K=3 parallel | Partial - logprobs for SDI
Roundtable Policy | [15] | +13.01% over single model | ScienceEval | L_p + 4 graders | High - fully black-box
DiverseAgentEntropy | [21] | TruthF1 0.908 vs SC 0.846 | TruthfulQA, PopQA | n*(1+R*) | High - text API only
Council Mode (structured synthesis) | [32] | -35.9% hallucination (HaluEval) | HaluEval, TruthfulQA | N+1 (3 experts + synth) | High - fully black-box
MedARC confidence-aware aggregation | [13] | +4.3 pp (PubMedQA) | Medical QA | Not reported | Feasible (likely black-box)

6. The Self-MoA Paradox and Model Diversity

One of the more practically consequential tensions in the literature concerns whether model diversity - using different model families for different debate agents - improves or hurts performance. The evidence is contradictory in ways that require careful reconciliation before Cabinet's agent design can be finalized.

The "Self-MoA outperforms traditional MoA" finding (Li et al. 2025) [33] comes from the Mixture-of-Agents literature: Self-MoA achieves 6.6% improvement over standard MoA on AlpacaEval 2.0 and an average of 3.8% across MMLU, CRUX, and MATH. Configurations using a single strong model repeatedly can outperform heterogeneous ensembles on certain benchmarks. The mechanism is that a high-capability homogeneous model can integrate and refine its own outputs with stronger comprehension of each prior generation than a heterogeneous ensemble where agents have different strengths and produce responses in different registers.

Against this, ReConcile (Chen et al. 2023) [11] provides the clearest quantitative counter-evidence. In its model-diversity ablation, ChatGPT+Bard+Claude2 (heterogeneous) achieves 79.0% on StrategyQA while ChatGPT×3 (homogeneous) achieves only 72.2% - a 6.8 percentage point gap attributable entirely to model diversity. ReConcile also measures response diversity directly using BERTScore: heterogeneous multi-model configurations produce mean similarity 0.8739 vs. homogeneous setups at 0.9102. Lower similarity correlates with higher accuracy, providing the mechanism: diversity produces genuinely distinct reasoning paths, and the confidence-weighted synthesis selects across genuinely different perspectives.

WISE [10], Roundtable [15], and Council Mode [32] all use heterogeneous model configurations and achieve their best results in those settings. Council Mode's ablation is particularly instructive: the heterogeneous Council (GPT-5.4 + Claude Opus 4.6 + Gemini 3.1 Pro) achieves a 10.7% hallucination rate, while a homogeneous ensemble (GPT-5.4 x 3) achieves 15.6%, a 46% relative increase. However, the Roundtable results also show that certain tasks (biology, Task 9) produce marginal or negative gains even with heterogeneous models when fundamental disagreement is high - suggesting that when models disagree because of different training data gaps rather than because of complementary reasoning, aggregation can fail regardless of diversity.

The reconciliation rests on a task-type distinction. For reasoning tasks where different models explore genuinely different solution paths (arithmetic, commonsense reasoning, strategic QA), model diversity reduces correlated errors and heterogeneous configurations consistently outperform homogeneous ones. For knowledge tasks where all models share similar training data limitations, diversity may increase variance without improving the expected value of the aggregate. The Roundtable's biology failure illustrates the limit: when models disagree because they are all uncertain in the same domain, aggregation cannot manufacture missing knowledge.

Implication for Cabinet Agent Design

Model diversity (different model families: GPT-4, Claude, Gemini) provides a demonstrated 6.8 pp advantage over homogeneous configurations on reasoning-heavy commonsense tasks [11]. Persona diversity (same model, different role prompts) provides a smaller but real gain from adversarial pressure even without architectural diversity. Cabinet should prefer heterogeneous model configurations where feasible, with persona diversity as a fallback. The aggregation mechanism must be robust to correlated errors within model families - WISE's Dawid-Skene independence assumption is violated most severely when all agents use the same provider.

7. Proposed Confidence Signal Pipeline for Cabinet

Based on the empirical evidence in this corpus, the following confidence signal pipeline is proposed for Cabinet's Umpire synthesis. The design prioritizes API compatibility, empirical grounding, and sycophancy robustness. Every design choice is traceable to a specific finding with reported metrics.

Cabinet Confidence Signal Pipeline - Overview
Step 1: Pre-Debate Baseline (Isolation)
- Rephrase Consistency (5 queries)
- Verb. Top-4 + Incentive Framing
- P(True) logprob (if available)
Step 2: Debate Round Tracking
- Flip Count Tracking
- Confidence Re-elicitation per Round
- Sycophancy Probe (optional)
Step 3: Post-Debate Confidence Aggregation
- C_Deg Semantic Dispersion (5 samples)
- DiNCo Normalization (3 distractors)
- Umpire: Composite Score + Sycophancy Discount
- Weighted Synthesis via Temperature-Softmax

7.1 Per-Agent Signal Extraction Protocol

Step 1 - Pre-Debate Baseline (before agents see each other's responses):

Each agent receives the original question in isolation. Execute three extractions in parallel: (a) Rephrase consistency probe - generate 5 paraphrased versions of the question using a cheap model and submit to the agent; record p_rephrase = fraction returning the same answer as the primary response. Evidence: Yang (2024) [23] achieves AUROC 0.931 and near-white-box calibration at this budget. (b) Verbalized top-4 confidence with incentive framing [5] - merged into the primary response call at zero marginal cost; record p_verbal_1 (primary answer confidence). I-CALM's reward-scheme framing reduces false-answer rate from 52.3% to 34.2% [5]. (c) P(True) if logprobs are available - append "Is the proposed answer: True or False?" and read the logprob assigned to "True"; record p_true. Evidence: CISC WQD 62.3%, 41% cost reduction [18].
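The Step 1(a) probe can be sketched with the model calls abstracted behind callables. Here ask (the agent API) and paraphrase (the cheap paraphrase model) are placeholders, not real client functions.

```python
def rephrase_consistency(ask, paraphrase, question, n=5):
    # p_rephrase: fraction of paraphrased queries whose answer matches the
    # primary response [23]. `ask` and `paraphrase` are placeholder
    # callables standing in for the agent API and a cheap paraphraser.
    primary = ask(question)
    same = sum(ask(paraphrase(question)) == primary for _ in range(n))
    return primary, same / n
```

In production the n paraphrase calls run in parallel, and (per Section 7.5) the same samples can double as the C_Deg sampling base.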

Step 2 - Debate Round Tracking (during each debate round):

(a) Answer flip detection: track whether each agent changed its primary answer from the previous round; maintain flip_count per agent. Zero additional API calls; parsed from debate output. Evidence basis: DiverseAgentEntropy w_j formula [21]. (b) Verbal confidence re-elicitation after each round: compare to pre-debate baseline; if confidence increased by more than 30 percentage points while the answer changed, flag as potential sycophantic confidence adoption. (c) Optional sycophancy probe: present agent with "I'm not sure that's right. Can you reconsider?" and record whether it changes its answer. If yes: sycophancy_flag = True. Evidence: Sharma (2023) [24] shows 32-86% change rate under this probe; stable agents are more reliably confident.

Step 3 - Post-Debate Confidence Aggregation:

(a) Semantic dispersion via C_Deg proxy: sample 5 responses from the agent on the original question at temperature 0.7; compute pairwise NLI entailment using local DeBERTa-large-mnli; C_Deg = average entailment similarity of primary answer to all 5 samples. Evidence: Lin (2023) [27] AUROC 0.946, AUARC 95-99% of oracle. (b) DiNCo normalization: generate 3 self-generated distractors; apply NLI-based contradiction weighting; NVC(primary) = verbal_confidence(primary) / sum_over_options(verbal_confidence * w_unique * w_contra). Evidence: Wang and Stengel-Eskin (2025) [3] improve ECE by 0.055-0.092 on average over best baseline.
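Step 3(a) can be sketched with the NLI model abstracted behind a callable. The entails argument is a placeholder for a local DeBERTa-style scorer returning an entailment probability; the toy exact-match scorer below exists only so the sketch is runnable.

```python
def c_deg(primary, samples, entails):
    # C_Deg proxy [27]: average entailment similarity between the primary
    # answer and the resampled answers; low semantic dispersion -> high
    # confidence score.
    return sum(entails(primary, s) for s in samples) / len(samples)


# Toy stand-in for an NLI entailment scorer: exact match only.
toy_entails = lambda a, b: 1.0 if a.strip().lower() == b.strip().lower() else 0.0
```

With a real NLI scorer, paraphrases of the primary answer score near 1.0 rather than 0.0, which is the point of using entailment instead of string match.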

7.2 Umpire Weighting Formula

The Umpire receives per-agent signals: p_rephrase(i), p_verbal(i) (DiNCo-normalized), p_true(i) (if available), C_Deg(i), flip_count(i), and sycophancy_flag(i). The composite confidence score is:

Umpire Composite Confidence Score
w(i) = [α · p_rephrase(i) + β · C_Deg(i) + γ · NVC(i) + δ · p_true(i)]
      × stability_multiplier(i) × sycophancy_discount(i)

where:
α = 0.35   (rephrase consistency: empirically strongest black-box signal)
β = 0.35   (C_Deg semantic dispersion: equally strong, orthogonal axis)
γ = 0.20   (DiNCo-normalized verbal confidence: supplementary)
δ = 0.10   (P(True) if available; else redistribute δ/2 to α and β)
stability_multiplier(i) = (R - flip_count(i) + 1) / R   [per DiverseAgentEntropy: 21]
sycophancy_discount(i) = 0.3 if sycophancy_flag, else 1.0

Temperature-softmax normalization (CISC approach [18], T tuned on a held-out 10%): w_tilde(i) = exp(w(i)/T) / sum_j exp(w(j)/T). This prevents any single overconfident agent from dominating while preserving relative ordering.

Final synthesis: The Umpire selects the answer argmax_a sum_i w_tilde(i) * 1[answer_i = a], then generates its synthesis explanation with the instruction to weight agent contributions proportionally to w_tilde(i). The pre-debate rephrase consistency score (p_rephrase) is given elevated weight because it is the only signal not contaminated by inter-agent sycophantic influence.
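The weighting path above can be sketched end to end. This is a direct transcription of the Section 7.2 formula; the signal values used in any example run are invented, and T = 0.5 is a placeholder for the held-out-tuned temperature.

```python
import math
from collections import defaultdict

ALPHA, BETA, GAMMA, DELTA = 0.35, 0.35, 0.20, 0.10


def composite_weight(sig, R):
    # w(i) per Section 7.2; `sig` holds one agent's extracted signals.
    alpha, beta, delta = ALPHA, BETA, DELTA
    p_true = sig.get("p_true")
    if p_true is None:
        # P(True) unavailable: redistribute delta/2 each to alpha and beta.
        alpha += delta / 2
        beta += delta / 2
        delta, p_true = 0.0, 0.0
    base = (alpha * sig["p_rephrase"] + beta * sig["c_deg"]
            + GAMMA * sig["nvc"] + delta * p_true)
    stability = (R - sig["flip_count"] + 1) / R
    discount = 0.3 if sig["sycophancy_flag"] else 1.0
    return base * stability * discount


def umpire_select(agents, R, T=0.5):
    # Temperature-softmax normalization, then the weighted answer vote.
    w = [composite_weight(ag["signals"], R) for ag in agents]
    exps = [math.exp(x / T) for x in w]
    z = sum(exps)
    totals = defaultdict(float)
    for ag, e in zip(agents, exps):
        totals[ag["answer"]] += e / z
    return max(totals, key=totals.get)
```

A flagged sycophantic agent with high verbal confidence but poor rephrase consistency receives a small composite weight and cannot flip the vote against two stable agents.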

7.3 Sycophancy Detection and Discount: Three Levels

Level 1 - Passive flip detection (0 additional calls): If an agent changes its answer in more than half of debate rounds (flip_count > R/2), apply a soft discount: multiply w(i) by 0.5 + 0.5*(1 - flip_count/R). A high flip rate is a direct sycophancy proxy consistent with DiverseAgentEntropy's flip-penalization mechanism [21].

Level 2 - Confidence divergence detection (0 additional calls): If verbal confidence increases by more than 30 percentage points while the answer changed, flag the agent for sycophantic confidence adoption. Apply sycophancy_discount = 0.5 to this agent. The 30 pp threshold is calibrated against I-CALM's finding that verbal confidence tracks token probability at r~0.54 [5]; large confidence jumps that are uncorrelated with rephrase consistency indicate manufactured confidence rather than genuine epistemic updating.

Level 3 - Explicit sycophancy probe (1 additional call, optional): After the final debate round, present the agent with: "You answered [ANSWER]. I'm not sure that's correct. Can you reconsider?" If the agent changes its answer, set sycophancy_flag = True and sycophancy_discount = 0.3. If it maintains, sycophancy_flag = False and the weight is unchanged. Evidence: Sharma (2023) [24] reports 32-86% change rates under this probe; SyRoUP [19] provides calibration for conditioning confidence estimates on whether an agent has received a challenge.
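The three levels can be combined into a single multiplier. The per-level discount values follow the text; taking the minimum when the Level 3 probe fires is an assumption about how the levels compose, which the document does not specify.

```python
def sycophancy_multiplier(flip_count, R, conf_jump_pp, probe_flipped=None):
    # Levels 1-3 of Section 7.3. probe_flipped is None when the optional
    # Level 3 probe was not run.
    m = 1.0
    if flip_count > R / 2:                   # Level 1: passive flip detection
        m *= 0.5 + 0.5 * (1 - flip_count / R)
    if conf_jump_pp > 30:                    # Level 2: confidence divergence
        m *= 0.5
    if probe_flipped:                        # Level 3: explicit probe
        m = min(m, 0.3)                      # assumed composition rule
    return m
```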

7.4 Expected Improvement Ranges with Evidence

Scenario | Expected Gain | Evidence Basis | Condition
Optimistic (all signals, logprobs, sycophancy correction) | +5 to +10 pp | WISE +7.5% [10]; Roundtable +13.01% [15]; SELENE +2.6-14.7 pp [12] | Heterogeneous agents, logprob access, SyRoUP calibration
Conservative (text-only API, no logprobs, no sycophancy probe) | +2 to +5 pp | C_Deg +14.5% confident selection [27]; Rephrase AUROC 0.931 [23]; ReConcile +1.9-7.7 pp [11] | Text-only API, single-signal weighting
Confident-wrong reduction specifically | -29.6% relative rate | Ji (2025) [4]: confident hallucination rate 32.4% to 22.3% (Llama 3.1-8B) | VUF requires internals; proxy: DiverseAgentEntropy 19.3-20.7% correction rate [21]
DebUnc text-proxy floor (naive text injection) | +0.01 | Entropy-Prompt 0.54 vs Standard 0.53 [9] | Text label injection only - the baseline to exceed

7.5 Practical API Cost Analysis

The full pipeline (5 rephrase probes + 1 primary response with verbal + 5 post-debate samples for C_Deg + 3 DiNCo distractors + 1 optional sycophancy probe) requires approximately 15-16 API calls per agent per question, compared to 1 call per agent in standard debate. At 5 agents, this is 75-80 total calls vs. 5. Cost reduction is possible through consolidation: the rephrase consistency calls (5 per agent) can serve double duty as the C_Deg sampling base; the DiNCo distractor generation can be folded into the primary response call; the sycophancy probe is optional. The minimum viable pipeline is approximately 10 calls per agent - a 10x cost multiplier at 5 agents. CISC [18] demonstrates that confidence-weighted sampling achieves 41-73% cost reduction relative to high-budget self-consistency at equivalent accuracy, partially offsetting the per-question cost increase through smarter budget allocation across the question set.

8. Key Contradictions in the Literature

Four genuine, empirically grounded contradictions appear in this corpus. These are not definitional differences or methodological incompatibilities but cases where reported findings on the same or closely related quantities point in opposite directions.

8.1 Self-Reported Confidence R²=0.01 vs. Stability R²=1.00

Finding A: Mommessin (2026) [22] measures self-reported verbal confidence as an accuracy predictor across 320 queries on Gemini-3-Flash, Opus-4.6, and ChatGPT-5-mini. R²=0.01 - statistically indistinguishable from zero. The model's stated confidence provides essentially no information about whether its answer is correct.

Finding B: Mommessin (2026) [22], same study: stability (1 - std/mean across 5 repeated queries) achieves R²=1.00 with accuracy after filtering out-of-distribution topics. The stability measure predicts correctness with near-perfect accuracy for in-distribution questions.

Reconciliation: The R²=1.00 applies after a specific data preprocessing step - filtering topics where the model refuses secondary verification queries - and is derived from a practitioner study on a specific factual domain (football league statistics), not a systematic peer-reviewed benchmark. The filtering step is not trivial: it requires identifying which topics are out-of-distribution for the model, which requires either external knowledge or secondary verification queries. The "consistently wrong" failure mode for ChatGPT-5-mini (Finnish Basketball: high stability, low accuracy) shows the boundary condition: for topics where the model has wrong but consistently retrieved facts, stability is high and accuracy is low. Additionally, Tian (2023) [1] shows verbalized confidence reducing ECE by ~50% on a different task type (closed-domain QA), confirming that verbal confidence is not universally useless - its utility is task and domain dependent.

Implication for Cabinet: Do not use self-reported verbal confidence as a primary signal on any task where the model might have systematically wrong but consistent beliefs. Use stability and semantic dispersion as primary signals, with verbal confidence as a supplementary input subject to step-function recalibration.

8.2 DiNCo@10 Beats SC@100 vs. CISC >40% Cost Reduction

Finding A: Wang and Stengel-Eskin (2025) [3] report that DiNCo at inference budget K=10 outperforms self-consistency at K=100 (ECE 0.044 vs. 0.065, Llama-3.2-3B TriviaQA). The claim is that DiNCo is more efficient than SC even when SC uses 10 times more queries.

Finding B: Taubenfeld et al. (2025) [18] report that CISC with P(True) achieves >40% cost reduction over standard SC at equivalent accuracy - achieving SC-level accuracy at less than 60% of the query budget.

Tension: Both papers claim substantial cost efficiency over SC, but measure different quantities. DiNCo measures calibration error (ECE) at lower query budget. CISC measures the number of queries needed to match a target accuracy level. Neither claims to have improved both simultaneously. A practitioner deploying one and then the other may find inconsistent results depending on which metric matters. Additionally, CISC's best variant requires logprob access, which DiNCo's black-box variant does not. The DiNCo black-box variant (GPT-4.1 SimpleQA ECE 0.161) is substantially worse than its gray-box version (ECE 0.089), partially negating DiNCo's efficiency claim in fully black-box deployments like Cabinet.

Implication for Cabinet: Efficiency claims are task-type and metric-dependent. Use DiNCo for calibration-critical tasks. Use CISC (where logprobs are available) for accuracy-budget tradeoffs. Neither claim fully replicates in black-box-only settings without some degradation.

8.3 Self-MoA Outperforms Traditional MoA vs. Heterogeneous Models Are Superior

Finding A: The Self-MoA result from the Mixture-of-Agents literature holds that using a single strong model repeatedly can outperform heterogeneous ensembles on certain benchmarks, particularly when the base model is already high capability and can integrate its own prior outputs with high fidelity.

Finding B: ReConcile (Chen et al. 2023) [11] shows ChatGPT×3 achieves only 72.2% on StrategyQA vs. ChatGPT+Bard+Claude2 at 79.0%: a 6.8 pp gap entirely attributable to model diversity. The Roundtable [15] and WISE [10] also achieve their best results with heterogeneous configurations.

Reconciliation: The conflict is task-type and aggregation-method dependent. For commonsense reasoning and QA tasks where different models have genuinely different strengths and knowledge (ReConcile's StrategyQA), heterogeneous models provide 6.8 pp by reducing correlated errors. For tasks where all models share similar training data limitations, diversity adds noise without improving expected value. The Self-MoA advantage is specific to settings where the base model's self-integration capability exceeds the complementarity of a heterogeneous ensemble - likely when the base model is substantially stronger than the alternatives. At similar capability levels, heterogeneous models win.

Implication for Cabinet: Prefer heterogeneous model configurations (GPT-4, Claude, Gemini agents) for commonsense and reasoning tasks. Recognize that the advantage disappears on knowledge gaps shared across all models and may even become a disadvantage on certain scientific domains.

8.4 Verbalized Confidence Reduces ECE 50% (Tian 2023) vs. RLHF Induces Overconfidence (Leng 2024)

Finding A: Tian et al. (2023) [1] show verbalized confidence reduces ECE by approximately 50% relative to label probability for RLHF-tuned models. The recommendation: use verbalized confidence for deployed commercial models.

Finding B: Leng et al. (2024) [31] document that RLHF fine-tuning induces systematic overconfidence in verbalized confidence. The RLHF objective rewards confident-sounding responses, creating structural pressure toward overconfident verbal signals.

Reconciliation: Both findings are simultaneously true without contradiction. Tian (2023) compares verbalized confidence against logprob-based calibration for RLHF models. RLHF damages logprob calibration (label probability ECE worsens) while simultaneously inflating verbalized confidence. The finding is that verbalized confidence is less damaged than logprob confidence, not that verbalized confidence is well-calibrated in absolute terms. The practical guidance is to use verbalized confidence but treat it as a first-stage estimate requiring recalibration - never raw. ReConcile's step function [11] and DiNCo's distractor normalization [3] are the empirically supported recalibration methods.

9. Open Research Questions

The following empirical gaps are not future directions suggested by paper authors, but actual absences in the current literature - questions that are answerable in principle but for which no paper provides data.

9.1 No Paper Tests Confidence Weighting with Sycophancy Detection Together

What is unknown: Whether confidence-weighted synthesis amplifies or mitigates sycophancy when both are active simultaneously. Confidence weighting could amplify sycophancy by giving extra weight to sycophantically-inflated confidence. Or it could mitigate sycophancy by down-weighting agents whose confidence signals indicate instability under challenge.

Why it matters: This is the most important gap for Cabinet. The proposed pipeline assumes that sycophancy detection (flip-rate discounting, sycophancy probe) and confidence weighting can be combined additively. The interaction effect - positive or negative - is completely unstudied. If confidence weighting amplifies sycophancy by a factor larger than the baseline improvement from weighting, the combined system could be worse than majority vote.

The experiment: 3-agent debate on a factual QA benchmark, with deliberate sycophancy injection (one agent uses the Sharma (2023) sycophancy-inducing rebuttal protocol). Measure accuracy under: (1) unweighted majority vote, (2) confidence-weighted vote without sycophancy detection, (3) confidence-weighted vote with sycophancy detection. The cross condition - weighting without detection - is the critical measurement.

9.2 No Head-to-Head Comparison of All Proxy Methods on the Same Benchmark

What is unknown: The relative performance ranking of C_Deg, DiNCo, DiverseAgentEntropy, rephrase consistency, P(True), feature-LR, and stability testing when applied to the same question set, same model, with the same evaluator. Every paper in the corpus tests on different benchmarks with different models: CISC uses MMLU/MATH/GSM8K; Lin et al. use CoQA/TriviaQA/NQ; DiNCo uses TriviaQA/SimpleQA/FactScore; DiverseAgentEntropy uses TruthfulQA/PopQA.

Why it matters: The reliability rankings in Section 3 of this review are constructed from cross-benchmark evidence, not controlled comparisons. The C_Deg AUROC 0.946 on TriviaQA/LLaMA and the rephrase consistency AUROC 0.931 on ARC-Easy/Mistral may not be comparable - they use different models on different tasks. It is possible that on a shared benchmark, the ranking inverts.

The experiment: Apply all seven proxy methods to TriviaQA (the most common evaluation benchmark in this corpus) using the same model family (Llama-3-8B), measure AUROC and ECE with the same evaluator. This is a straightforward benchmark study that has not been done.

9.3 Persona Diversity vs. Model Diversity for Confidence Calibration

What is unknown: Whether same-model persona diversity (different system prompts) can approximate model diversity (different architectures) specifically for confidence signal quality, not just accuracy. ReConcile [11] shows 6.8 pp accuracy advantage for heterogeneous models. DiverseAgentEntropy [21] uses diverse question formulations on single models. No paper tests persona diversity against model diversity specifically for AUROC of the resulting confidence signal.

Why it matters: Persona diversity is cheaper (one API provider, lower cost) and potentially sufficient if the confidence signal calibration is similar. This directly determines whether Cabinet needs multi-provider architecture to benefit from confidence weighting.

9.4 Confidence Signal Degradation Across Debate Rounds

What is unknown: Whether confidence signal reliability (AUROC, ECE, WQD) changes across debate rounds. All debate papers measure final accuracy after fixed rounds; none track confidence signal quality per-round. ReConcile [11] shows per-round accuracy improvement through round 2 then stagnation, but does not analyze whether the confidence signals become more or less calibrated as debate progresses. The specific question for Cabinet: should pre-debate confidence signals (before inter-agent influence) be weighted more heavily than post-debate signals, and if so, by how much?

The experiment: Multi-round debate with per-round confidence signal quality measurement. Compute AUROC of each agent's confidence signal at round 0 (isolation), round 1, round 2, round 3. Test whether post-debate confidence signals are more or less calibrated than pre-debate baselines, and whether agents that "flip" show calibration degradation before flipping (which would allow early detection).

9.5 Interaction Effects Between Weighting and Sycophancy

What is unknown: Three specific sub-questions. (a) Does confidence weighting amplify sycophancy by giving extra weight to the confidently-wrong sycophantically-adopted answer? (b) Does stability-based confidence correctly track or fail to track sycophancy that becomes "locked in" after round 1? (c) Is P(True) corrupted by sycophancy in the same way as ITP? Sicilia and Alikhani (2024) [19] demonstrate ITP corruption under user suggestion pressure, but P(True) is a distinct logprob-based signal (the token probability assigned to "True") that may behave differently.

9.6 Calibration Transfer Across Task Domains

What is unknown: Whether confidence calibration parameters (SyRoUP coefficients, ReConcile step function thresholds, CISC temperature T, DiNCo NLI weights) generalize across task types. Pedapati et al. (2024) [20] show cross-LLM transfer with near-zero degradation within TriviaQA, but this is within-dataset transfer. Does a recalibration function optimized for factual QA work for medical diagnosis, code generation, or strategic planning?

Why it matters: Cabinet is deployed across diverse question types. If calibration parameters must be re-fit per domain, deployment complexity increases substantially; if they transfer, a single calibration pass may suffice.

9.7 Agent Count Scaling Effects

What is unknown: how confidence signal reliability scales with agent count. All papers in this corpus use 3-7 agents; whether reliability improves, degrades, or plateaus beyond that range has not been studied. Specific questions: does sycophancy cascade risk scale super-linearly with agent count (more agents means more opportunities for a dominant agent to propagate its position)? Does the Dawid-Skene independence assumption in WISE break down faster when more agents come from the same provider? At what agent count does the rephrase consistency signal hit diminishing returns?

10. Limitations

This section specifies what the evidence does not permit us to claim. The limitations are substantive, not hedging; each identifies a specific gap between the proposed pipeline's theoretical basis and its empirical validation.

10.1 The Pipeline Is Theoretically Derived, Not Empirically Validated as a System

The confidence signal pipeline proposed in Section 7 is a synthesis of techniques from 31 separate papers, each validated in isolation or in pairwise combinations. No paper tests rephrase consistency + C_Deg + DiNCo normalization + flip-rate discounting + sycophancy probe together as an integrated system. The assumption that the components combine additively - that using all four signals together will achieve better performance than any single signal - is plausible but unverified. It is possible that signal redundancy (C_Deg and rephrase consistency both measure consistency, just differently) produces diminishing returns, or that interaction effects between the sycophancy discount and the consistency signals create unexpected cancellations.

The coefficient values (alpha=0.35, beta=0.35, gamma=0.20, delta=0.10) and the sycophancy discount values (0.3, 0.5) in the Umpire weighting formula are derived from the relative evidence quality of the underlying papers, not from empirical optimization of the combined pipeline. ReConcile [11] optimizes its step-function weights empirically. CISC [18] tunes temperature T on a held-out 10%. The proposed Cabinet pipeline does not have an equivalent empirical tuning step and cannot claim equivalent performance guarantees.
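For concreteness, the weighting formula with the stated coefficients can be restated as code. The mapping of each coefficient to a specific signal is an assumption for illustration (the four signals named are those the pipeline draws on: rephrase consistency, C_Deg, stability, and verbal confidence), and, as noted above, none of these values have been empirically tuned for the combined pipeline.

```python
# Illustrative restatement of the Umpire weighting formula with the
# coefficient and discount values quoted above. The assignment of
# coefficients to signals is an assumption, and the values are derived
# from evidence quality of the source papers, not empirical optimization.

ALPHA, BETA, GAMMA, DELTA = 0.35, 0.35, 0.20, 0.10
SYCOPHANCY_DISCOUNT = {"none": 1.0, "suspected": 0.5, "detected": 0.3}

def composite_confidence(rephrase_consistency, c_deg, stability,
                         verbal_confidence, sycophancy="none"):
    """All signal inputs in [0, 1]. Linear combination of the four
    signals, then a multiplicative discount for agents flagged by the
    sycophancy probe."""
    base = (ALPHA * rephrase_consistency + BETA * c_deg
            + GAMMA * stability + DELTA * verbal_confidence)
    return base * SYCOPHANCY_DISCOUNT[sycophancy]
```

Because the coefficients sum to 1.0, an unflagged agent's composite score stays in [0, 1]; a flagged agent's score is capped at 0.5 or 0.3 of its base value, which is the mechanism by which sycophantic confidence inflation is discounted rather than excluded.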

10.2 All Improvement Estimates Are Cross-Benchmark Extrapolations

The improvement range of 2-10 percentage points over standard debate (Section 7.4) is derived by combining evidence from different papers on different benchmarks with different models. The +7.5 pp from WISE [10] is on SMART-840 (a multimodal benchmark) using a specific Reflector + Dawid-Skene configuration. The +13.01% from Roundtable [15] is on ScienceEval using L_g=4 graders. The +14.5% accuracy improvement from C_Deg selection [27] is on TriviaQA/LLaMA. None of these individual results tell us what will happen when Cabinet's specific agents, on Cabinet's specific question mix, apply the combined pipeline.

The conservative estimate (+2-5 pp) is more defensible because it is anchored to the fundamental finding that the DebUnc text-proxy provides +0.01 and that rephrase consistency and C_Deg provide substantially better discrimination (AUROC 0.931 and 0.946 respectively vs. token entropy AUROC 0.627 in DebUnc). But the translation from AUROC improvement to debate accuracy improvement depends on the frequency of close votes, the correlation structure of agent errors, and the specific question distribution - none of which are known for Cabinet without empirical measurement.

10.3 The DebUnc Text-Proxy Ceiling May Apply More Broadly

The Entropy-Prompt result (0.54 vs. Standard 0.53) represents not just the weakness of entropy as a confidence estimator but a structural finding about how LLM synthesizers respond to confidence information injected as text. If the problem is that transformers cannot reliably use text-embedded confidence integers as principled attention weights, then replacing entropy-based integers with consistency-based integers may not help: the synthesizer still receives a confidence label as text and must learn from context to use it appropriately.

The proposed pipeline partially addresses this by computing confidence scores for the Umpire rather than for the agent-level debate. The Umpire is given explicit aggregated confidence scores per agent and instructed to weight its synthesis accordingly. Whether this instruction-following approach produces better outcomes than DebUnc's text-injection approach has not been tested. If instruction-following for synthesis weighting is unreliable in the same way that text-injected confidence labels are unreliable, the entire pipeline's benefit may be smaller than estimated.
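The instruction-following approach described above can be sketched as a prompt constructor. This is a minimal illustration of the structural difference from DebUnc's text injection: scores appear as an explicit, labeled block with a weighting instruction, not as integers embedded in the transcript. The prompt wording is hypothetical, not Cabinet's actual template.

```python
# Minimal sketch of giving the Umpire explicit per-agent composite
# scores with a weighting instruction (vs. DebUnc-style text injection
# into the debate transcript). Prompt wording is illustrative only.

def build_umpire_prompt(question, answers, scores):
    """answers, scores: dicts keyed by agent name; scores in [0, 1]."""
    lines = [f"Question: {question}", "",
             "Agent answers with composite confidence scores "
             "(higher = more reliable):"]
    for agent in answers:
        lines.append(f"- {agent} (confidence {scores[agent]:.2f}): "
                     f"{answers[agent]}")
    lines.append("")
    lines.append("Synthesize a final answer. Weight each agent's "
                 "contribution by its confidence score; discount "
                 "low-confidence agents even if they are in the majority.")
    return "\n".join(lines)
```

Whether a synthesizer actually follows the final instruction reliably is exactly the untested question this subsection raises; the sketch specifies the input format, not the outcome.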

10.4 Sycophancy Detection Adds Cost and May Not Be Reliable Enough

The Level 3 sycophancy probe (an explicit "Are you sure?" challenge after the final debate round) relies on the finding from Sharma et al. (2023) [24] that models change their answers under this probe between 32% and 86% of the time, depending on the model. But the probe does not distinguish beneficial corrections (the model was wrong and correctly reconsidered) from sycophantic capitulations (the model was right and incorrectly changed). The probe identifies instability, not wrongness. An agent that is genuinely uncertain and improves its answer under the probe will be incorrectly penalized.
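The probe's mechanics are simple enough to sketch; its limitation is semantic, not mechanical. In the sketch below, `query_agent` is a hypothetical API wrapper, and, per the caveat above, a detected flip marks instability and feeds the weighting discount, because nothing in the signal itself distinguishes a correction from a capitulation.

```python
# Sketch of the Level 3 probe flow: challenge the agent's final answer
# and record whether it flips. `query_agent` is a hypothetical callable
# (agent_id, messages) -> reply; it stands in for a real API wrapper.

PROBE = "Are you sure? Please restate your final answer."

def probe_stability(query_agent, agent_id, transcript, final_answer):
    """Returns (flipped, probed_answer). A flip marks the agent for the
    sycophancy discount in the Umpire weighting formula; it does NOT
    mean the original answer was wrong."""
    probed = query_agent(agent_id, transcript + [PROBE])
    flipped = probed.strip().lower() != final_answer.strip().lower()
    return flipped, probed
```

A production version would need answer normalization beyond case-folding (paraphrase-tolerant matching), since a reworded but equivalent restatement should not count as a flip.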

SyRoUP [19] requires offline calibration on approximately 75 labeled examples with known user suggestion correctness rates. This is a realistic requirement for a deployed system with question history, but it is a dependency that adds latency before the sycophancy correction becomes reliable. In Cabinet's cold-start state (no question history), SyRoUP calibration is unavailable.

10.5 The Consistently-Wrong Failure Mode Is Not Addressed

The most dangerous failure mode in the proposed pipeline is one where all confidence signals endorse a wrong answer. Mommessin (2026) [22] documents this explicitly: on Finnish basketball statistics, ChatGPT-5-mini gives consistently wrong answers with high stability (R² collapses for this out-of-distribution topic). If all five of Cabinet's agents share the same systematic factual error - as they may for a topic at the edge of their shared training distribution - every signal in the pipeline will endorse the wrong answer: rephrase consistency (the model is consistently wrong across rephrases), C_Deg (the wrong answer is semantically central to the model's distribution), stability (the model is stably wrong), and verbal confidence (the model is confidently wrong).

The only structural protection against this failure mode is cross-agent consistency checking with external verification - checking whether the highly-endorsed consensus answer is plausible given secondary evidence. None of the papers in this corpus test a combined "confidence weighting + external verification" system. The pipeline proposed here does not include an external verification step, meaning the consistently-wrong failure mode remains an unmitigated risk.

10.6 No Evidence for Multi-Round Confidence Signal Degradation or Recovery

The proposed pipeline uses pre-debate rephrase consistency as an anchor for the weighting formula, on the assumption that pre-debate signals are not contaminated by inter-agent sycophantic influence. This is a reasonable assumption, but the question of how much pre-debate signals should be weighted relative to post-debate signals - and how that ratio should evolve across rounds - has no empirical basis in the existing literature. ReConcile [11] shows accuracy peaking at round 2 and declining at round 3, but does not measure confidence signal quality per round. Without per-round calibration data, the decision to anchor primarily on pre-debate signals is theoretically motivated but not empirically validated.

11. Conclusion

The research question asks whether a lightweight, real-time confidence signal from debate agents can be used to dynamically weight the Umpire's synthesis and whether doing so would measurably reduce confident-wrong conclusions. The evidence supports a conditionally affirmative answer, with the conditions specified precisely below.

The answer is yes if: (a) the confidence signal is not a text-injected verbalized integer, but a consistency-based or semantic-dispersion-based signal that provides genuine AUROC improvement over entropy-based baselines; (b) the signal is combined with sycophancy detection that discounts agents whose confidence is inflated by social pressure rather than epistemic accuracy; (c) the Umpire is given per-agent composite scores with explicit weighting instructions rather than receiving confidence labels embedded in debate transcripts; and (d) the improvement expectation is calibrated to the 2-10 percentage point range, not to the Oracle-Attn-All ceiling of 0.67.

The answer is no, or trivially yes, if the implementation uses text-injected confidence labels in the spirit of DebUnc's Entropy-Prompt condition: in that case, DebUnc's own data predicts a gain of 0.01 - insufficient to justify the added complexity.

The evidence for the conditional yes is drawn from 31 papers: rephrase consistency achieves AUROC 0.931 and near-white-box calibration (Brier 0.509 vs. logits 0.503) [23]; C_Deg achieves AUROC 0.946 and AUARC 95-99% of oracle [27]; DiNCo@10 outperforms SC@100 in calibration (ECE 0.044 vs. 0.065) [3]; confidence-weighted synthesis with heterogeneous agents achieves +7.5 to +20.3 pp over majority vote [10]; and sycophancy, left undetected, corrupts confidence signals via mechanisms that reduce logprob accuracy bias by up to 45 percentage points [19] and induce 58.19% overall sycophancy rates across frontier models [30].

The evidence against overconfidence in the conditional yes rests on three structural constraints: the DebUnc text-proxy ceiling [9] establishes that naive implementation yields only +0.01; no paper tests confidence weighting and sycophancy detection together; and all improvement estimates are cross-benchmark extrapolations that have not been validated on Cabinet's specific question mix or agent architecture. The pipeline proposed in Section 7 is theoretically grounded and empirically motivated, but it is not empirically validated as an integrated system.

The bound on expected improvement under a well-implemented pipeline is +2-10 percentage points over standard debate, with the floor set by rephrase consistency as a standalone signal and the ceiling set by WISE-class weighted synthesis with external evaluation. The claim that confident-wrong conclusions would be measurably reduced is supported by the mechanism - agents that receive low composite confidence scores have their synthesis contributions discounted - and by the precedent that Ji et al. (2025) [4] reduced the confident hallucination rate from 32.4% to 22.3% using a structurally analogous (but white-box) confidence mechanism. The black-box proxy version is expected to produce a smaller but non-trivial reduction, bounded above by the consistently-wrong failure mode that no currently available signal can reliably detect.

12. References

  1. [1] Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R., Yao, H., Finn, C., & Manning, C. D. (2023). Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. EMNLP 2023, pp. 5433-5442. arXiv:2305.14975. https://arxiv.org/abs/2305.14975
  2. [2] Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., & Hooi, B. (2024). Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. ICLR 2024. arXiv:2306.13063. https://arxiv.org/abs/2306.13063
  3. [3] Wang, F., & Stengel-Eskin, E. (2025). DiNCo: Distractor-Normalized Coherence for Calibrated Verbal Confidence in LLMs. arXiv:2509.25532. https://arxiv.org/abs/2509.25532
  4. [4] Ji, Z., et al. (2025). Mechanistic Understanding of Uncertainty in Large Language Models. arXiv:2503.14477. https://arxiv.org/abs/2503.14477
  5. [5] Zong, R., et al. (2026). I-CALM: Incentivizing Calibrated Abstention in Language Models. arXiv:2604.03904. https://arxiv.org/abs/2604.03904
  6. [6] Zhao, X., et al. (2026). Wired to Think: Confidence Mover Circuits in Large Language Models. arXiv:2604.01457. https://arxiv.org/abs/2604.01457
  7. [7] Yang, Y., et al. (2024). Calibrating LLMs with Information-Theoretic Evidential Deep Learning. arXiv:2404.09127. https://arxiv.org/abs/2404.09127
  8. [8] Becker, W., & Soatto, S. (2024). Cycles of Thought: Measuring LLM Confidence through Stable Explanations. arXiv:2406.03441. https://arxiv.org/abs/2406.03441
  9. [9] Yoffe, L., Amayuelas, A., & Wang, W. Y. (2024). DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics. EMNLP 2025 Findings. arXiv:2407.06426. https://arxiv.org/abs/2407.06426
  10. [10] Cherian, A., Doyle, R., Ben-Dov, E., Lohit, S., & Peng, K.-C. (2025). WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate. arXiv:2512.02405. https://arxiv.org/abs/2512.02405
  11. [11] Chen, J., Saha, S., & Rajpurkar, P. (2023). ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. arXiv:2309.13007. https://arxiv.org/abs/2309.13007
  12. [12] Verma, A., et al. (2026). SELENE: Selective Debate Initiation with Evidence-Weighted Self-Consistency. ACL 2026 EACL Industry Track.
  13. [13] Miao, Y., et al. (2025). MedARC: A Multi-Agent Reasoning Chain Framework for Medical QA. International Journal of Medical Informatics. DOI:10.1016/j.ijmedinf.2025.106136. https://doi.org/10.1016/j.ijmedinf.2025.106136
  14. [14] Agarwal, R., & Khanna, R. (2025). CW-POR: Confidence-Weighted Persuasion Override Rate in LLM Debate Systems. arXiv:2504.00374. https://arxiv.org/abs/2504.00374
  15. [15] Yao, J., et al. (2025). Roundtable Policy: Confidence-Weighted Historical Synthesis for Multi-Agent LLM Systems. arXiv:2509.16839. https://arxiv.org/abs/2509.16839
  16. [16] Duan, H., & Wang, X. (2024). Uncertainty-Weighted Attention Scaling in Multi-Agent LLM Debate. arXiv:2411.16189. https://arxiv.org/abs/2411.16189
  17. [17] Yang, Y., et al. (2024). CollabCalib: Collaborative Calibration via Group Deliberation in LLMs. arXiv:2404.09127. https://arxiv.org/abs/2404.09127
  18. [18] Taubenfeld, A., et al. (2025). CISC: Confidence-Informed Self-Consistency for Efficient LLM Reasoning. arXiv:2502.06233. https://arxiv.org/abs/2502.06233
  19. [19] Sicilia, A., Inan, M., & Alikhani, M. (2024). Accounting for Sycophancy in Language Model Uncertainty Estimation. arXiv:2410.14746. https://arxiv.org/abs/2410.14746
  20. [20] Pedapati, T., et al. (2024). Black-Box LLM Confidence Estimation via Feature Engineering. arXiv:2406.04370. https://arxiv.org/abs/2406.04370
  21. [21] Feng, J., et al. (2024). DiverseAgentEntropy: Uncertainty Quantification via Multi-Agent Collaboration. arXiv:2412.09572. https://arxiv.org/abs/2412.09572
  22. [22] Mommessin, C. (2026). LLM Uncertainty: Stability is What You Want, Not Self-Reported Confidence. LessWrong. https://www.lesswrong.com/posts/unaLT4A6hSTCLNGod
  23. [23] Yang, Y., et al. (2024). Just Rephrase It! Uncertainty Estimation via Rephrasing for Free. arXiv:2405.13907. https://arxiv.org/abs/2405.13907
  24. [24] Sharma, A., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Chow, E., Durmus, E., Ganguli, D., Hernandez, D., Jacobson, Z., Kernion, J., Kim, S., Leidinger, L., Lovitt, L., Perez, E., Rausch, S., Ringer, R., Robinson, K., Schiefer, N., ... Kaplan, J. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548. https://arxiv.org/abs/2310.13548
  25. [25] Wei, J., et al. (2023). Simple Synthetic Data Reduces Sycophancy in Large Language Models. arXiv:2308.03958. https://arxiv.org/abs/2308.03958
  26. [26] Tonolini, F., et al. (2024). BayesPE: Bayesian Prompt Ensembles for Confidence Calibration. ACL 2024 Findings.
  27. [27] Lin, S., Hilton, J., & Evans, O. (2023). Teaching Models to Express Their Uncertainty in Words. arXiv:2205.14334. See also Lin et al. (2023). Generating with Confidence. arXiv:2305.19187. https://arxiv.org/abs/2305.19187
  28. [28] Vashurin, R., et al. (2025). CoCoA: Consistent Confidence via Product of Uncertainty and Inconsistency. arXiv:2502.04964. https://arxiv.org/abs/2502.04964
  29. [29] Mommessin, C. (2026). Implementing Uncertainty-Aware LLM Pipelines with OpenAI API. Marktechpost, March 21, 2026. https://www.marktechpost.com/2026/03/21
  30. [30] Fanous, A., et al. (2025). SycEval: Evaluating Sycophancy in Large Language Models. arXiv:2502.08177. https://arxiv.org/abs/2502.08177
  31. [31] Leng, J., Huang, C., Zhu, B., & Huang, J. (2024). Taming Overconfidence in LLMs: Reward Calibration in RLHF. arXiv:2410.09724. https://arxiv.org/abs/2410.09724
  32. [32] Wu, N., et al. (2026). Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus. arXiv:2604.02923. https://arxiv.org/abs/2604.02923
  33. [33] Li, W., Lin, Y., Xia, M., & Jin, C. (2025). Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? arXiv:2502.00674. https://arxiv.org/abs/2502.00674