Task-Type Routing for Multi-Agent Debate: When Does Deliberation Add Value?

A Systematic Review and Experimental Design for Difficulty-Adaptive Query Routing in Multi-Agent LLM Systems

Cabinet Research Series (P9)  |  Sparse Halo  |  April 2026

Abstract

Multi-agent debate (MAD) has emerged as a compelling paradigm for improving large language model (LLM) performance, yet the literature presents a fractured picture: dramatic gains on some benchmarks coexist with consistent degradation on others. This study addresses proposition P9 of the Cabinet Research Series: for which observable task types does multi-agent debate produce statistically significant improvement, and can a lightweight classifier route queries to the optimal inference mode at submission time?

Synthesizing 60+ primary sources across four research domains (multi-agent debate, difficulty-aware routing, query classification, and human preference measurement), we identify the critical moderators that determine whether debate helps or hurts. The evidence converges on three principal findings. First, debate reliably improves performance on verifiable, decomposable tasks with high initial answer diversity: mathematical reasoning, formal logic, and multi-step planning under uncertainty. Second, debate reliably degrades performance on commonsense retrieval and simple factual tasks, primarily through sycophancy cascades and context dilution. Third, and most critically, the distinction between homogeneous and heterogeneous agent configurations explains a large portion of the variance across studies: heterogeneous MAD achieves +29.3% over chain-of-thought on MATH, while homogeneous MAD fails to beat self-consistency on 9 of 9 benchmarks.

Building on these findings, we present a complete experimental design for Cabinet to empirically validate task-type routing in production. The design covers three phases: a within-subjects controlled study (N=200 pairs), a large-scale behavioral A/B test (N=10,000 users per arm), and a classifier training pipeline targeting AUROC >0.70 with under 50ms routing latency. The proposed DeBERTa-based tiered router, benchmarked against RouteLLM and DAAO baselines, is projected to reduce total inference cost by more than 40% while maintaining preference scores within 2% of an always-debate policy.

1. Introduction

The central question this study addresses is deceptively simple: does having multiple language models debate an answer produce better outputs than having a single model reason through the same problem? A growing body of experimental work shows that the answer is neither "always yes" nor "always no." It is "it depends," and the conditions on which it depends are now well enough understood to be operationalized as a classifier.

Multi-agent debate, as formalized by Du et al. (2023), elicits independent initial responses from multiple LLM agents and then runs iterative rounds of critique and revision until the agents converge on a consensus.[1] The intuition draws on ensemble learning: diversity of initial positions, combined with adversarial pressure to defend or revise claims, should push the group toward better-calibrated answers. The empirical record confirms this intuition under specific conditions. Du et al. themselves reported +14.8 percentage points on arithmetic tasks and +8 percentage points on GSM8K benchmarks using debate versus single-model inference.[1] Liang et al. found +11 percentage points on CIAR (commonsense-infused argument reasoning) tasks.[2] Wu et al. observed gains of 32 to 52 percentage points on KKS logic puzzles.[3]

Yet the same literature documents consistent failure modes. Wynn et al. found that debate degrades performance on every tested CommonSenseQA configuration, with losses of up to 12 percentage points on MMLU, driven by a sycophancy correlation of r=0.902 across agent configurations.[4] Kim et al. demonstrated that all multi-agent system variants produce 39% to 70% degradation on PlanCraft sequential planning tasks.[5] Zhang et al. showed that homogeneous MAD fails to beat chain-of-thought or self-consistency on 9 of 9 benchmarks.[6]

These findings are not contradictory: they are diagnostic. The task type, the agent configuration, and the difficulty distribution of queries together predict whether debate will help. For Cabinet, a production multi-agent system, this insight has direct engineering consequence. Routing every query to debate mode is both costly and quality-degrading. Routing no query to debate mode forfeits measurable gains on the tasks where debate genuinely helps. The economically and qualitatively optimal policy is conditional routing: submit the query to debate if and only if the predicted benefit exceeds the cost.

This study synthesizes 60+ primary sources across four research domains to establish the evidential foundations for that routing decision. Section 2 maps the task types where debate helps, with quantitative evidence. Section 3 maps where debate hurts, with failure-mode analysis. Section 4 identifies model heterogeneity as the single most important moderating variable. Section 5 reviews the state of the art in difficulty-aware routing systems. Section 6 covers the query classification methods that enable real-time routing. Section 7 reviews methodology for measuring user preferences in multi-agent experiments. Section 8 presents a complete experimental design for Cabinet to validate and calibrate its own router. Sections 9 through 11 state testable propositions, limitations, and conclusions.

2. The Evidence Landscape: When Debate Helps

2.1 Task-Type Decision Matrix

The evidence converges around a set of structural task properties that predict debate benefit. Verifiability (can the answer be checked objectively?), decomposability (does the problem break into sub-problems amenable to parallel reasoning?), initial answer diversity (do agents start with different answers before deliberation?), and sequential independence (do subtasks lack hard ordering constraints?) together form the primary predictors.
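These four properties can be operationalized as a simple checklist. The sketch below is illustrative only: the class name, field names, and the equal-weight scoring are assumptions for exposition, not a calibrated model from the cited studies.

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    """The four structural properties identified as predictors of
    debate benefit (names are illustrative)."""
    verifiable: bool                # can an external oracle check the answer?
    decomposable: bool              # does it split into parallel sub-problems?
    high_initial_diversity: bool    # do agents disagree before deliberation?
    sequentially_independent: bool  # no hard ordering between subtasks?

def debate_benefit_score(p: TaskProfile) -> int:
    """Count how many predictors are present (0-4). Higher scores suggest
    routing to debate; a heuristic pre-screen, not a trained classifier."""
    return sum([p.verifiable, p.decomposable,
                p.high_initial_diversity, p.sequentially_independent])

math_task = TaskProfile(True, True, True, True)          # arithmetic word problem
commonsense_task = TaskProfile(False, False, False, True)  # cultural-consensus QA
```

A trained router (Section 8) replaces the equal weights with learned coefficients, but the feature set stays the same.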

Mathematical reasoning tasks satisfy all four criteria. Arithmetic problems have ground-truth answers; multi-step calculation decomposes naturally; different agents frequently propose different intermediate results, generating productive disagreement; and individual calculation steps are largely independent. This explains the consistent gains reported by Du et al.[1] and the even larger gains from heterogeneous MAD configurations reported by Zhang et al.[6]

Commonsense QA tasks satisfy almost none of these criteria. The "correct" answer is often a matter of cultural consensus rather than logical derivation; the problem does not decompose; agents trained on similar corpora tend to produce similar initial answers; and the conversational pressure of debate drives convergence toward confident but incorrect consensus through sycophancy rather than genuine reasoning improvement.[4]

2.2 Quantitative Evidence Table

| Task Category | Exemplar Benchmarks | Effect | Best Quantitative Evidence | Key Predictors Present |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K, MATH500, Arithmetic | HELPS | +14.8pp arithmetic, +8pp GSM8K (Du et al.)[1]; +29.3% MATH heterogeneous (Zhang et al.)[6] | Verifiable, decomposable, high diversity |
| Formal Logic and Deduction | KKS Logic Puzzles, ProofWriter | HELPS | +32 to +52pp on KKS logic puzzles (Wu et al.)[3] | Verifiable, high diversity, decomposable |
| Scientific and Factual Reasoning | MMLU-Science, SciEval | HELPS | +11pp CIAR (Liang et al.)[2] | Verifiable, moderate diversity |
| Financial Analysis | FinQA, Finance-Agent | HELPS | +80.9% Finance-Agent, predictive R²=0.513 (Kim et al. 2026)[7] | Verifiable, quantitative, decomposable |
| Code Generation and Debugging | HumanEval, MBPP, SWE-Bench | HELPS | Consistent gains with cascade verification (Chen et al.)[8] | Verifiable (test execution), decomposable |
| Commonsense QA | CommonsenseQA, MMLU-Humanities | HURTS | Degrades in 100% of configs, up to -12pp MMLU (Wynn et al.)[4] | Non-verifiable, low diversity, sycophancy |
| Simple Information Retrieval | TriviaQA, WebQ | HURTS | MAD loses to single CoT on factual recall tasks[6] | Non-decomposable, low initial diversity |
| Sequential Planning | PlanCraft, ALFWorld | HURTS | -39% to -70% for all MAS variants (Kim et al.)[5] | Sequential dependency, coordination overhead |
| Tool-Dense Tasks (T>16) | ToolBench high-T, AgentBench | HURTS | Beta=-0.330 interaction effect (Kim et al.)[5] | Tool count exceeds coordination capacity |
| Creative Generation | StoryCloze, WritingPrompts | MIXED | Quality depends on diversity of agent styles[9] | Non-verifiable; benefit model-configuration dependent |
| Instruction Following | IFEval, FollowBench | MIXED | Gains with heterogeneous setups, losses with homogeneous[6] | Depends on heterogeneity; verifiability partial |

Key Finding

The strongest predictor of debate benefit is task verifiability: when an external oracle (mathematical derivation, code execution, logical entailment) can adjudicate between competing answers, deliberation has a mechanism to converge on truth. When no such adjudication mechanism exists, debate tends to converge on confident error.

3. The Evidence Landscape: When Debate Hurts

3.1 Sycophancy and Commonsense Failure

The most comprehensive study of debate failure modes comes from Wynn et al., who systematically tested MAD configurations across a range of natural language understanding tasks. Their central finding is that debate degrades performance on CommonSenseQA in every tested configuration, with losses reaching 12 percentage points on MMLU benchmarks.[4] More importantly, they quantified the mechanism: sycophancy (the tendency of agents to revise toward social consensus rather than logical correctness) correlates at r=0.902 with performance degradation across configurations.[4]

The sycophancy mechanism works as follows: when agents hold heterogeneous initial answers to a commonsense question, the agent holding the minority position is exposed to majority pressure during debate rounds. Because commonsense questions lack objective verification criteria, the minority agent has no logical basis on which to resist that pressure. The result is convergence toward a plurality answer that may be systematically wrong if the plurality position reflects a training corpus bias rather than genuine knowledge.

Critical Failure Mode: Sycophancy Cascade

Wynn et al. (2024) measured sycophancy correlation r=0.902 between agent capitulation rate and performance degradation on commonsense tasks.[4] Debate should never be routed for commonsense retrieval, simple factual lookup, or tasks where initial agent diversity is below a minimum threshold.

Yang et al. contribute a related finding: for strong frontier models on MATH500, self-consistency (SC) dominates MAD in both accuracy and cost-efficiency.[10] This is important because SC is a natural comparison baseline that many early MAD papers omitted. When agents are homogeneous and capable, the diversity of initial samples from a single strong model via temperature sampling is sufficient; additional agents add cost without adding novel perspectives.

3.2 Sequential and Tool-Dense Tasks

Kim et al. provide the most systematic analysis of MAD failure on planning tasks.[5] Testing on PlanCraft (a sequential planning environment), they found that all multi-agent system variants produced 39% to 70% degradation relative to a single capable agent. The failure mode is structural rather than incidental: sequential planning tasks have hard dependency constraints between subtasks. When multiple agents each reason about the plan from different initial states, their outputs cannot be combined without resolving conflicts that are computationally expensive to adjudicate. The coordination overhead exceeds the reasoning benefit.

The same study identifies tool count (T) as a continuous moderator. For tool-dense tasks where T exceeds 16, the interaction effect reaches beta=-0.330, meaning that each additional tool beyond 16 accelerates performance degradation.[5] This is consistent with the sequential dependency hypothesis: complex tool chains impose ordering constraints that multi-agent systems cannot efficiently coordinate.

Critical Failure Mode: Sequential Planning Degradation

Kim et al. (2025) found 39% to 70% degradation for ALL multi-agent system variants on PlanCraft sequential tasks, with a tool-density interaction of beta=-0.330 for tasks with T>16 tools.[5] Routing to debate for long sequential workflows with many tool dependencies is contraindicated by current evidence.

Zhang et al.'s homogeneous MAD study completes the failure picture.[6] Across 9 benchmarks spanning mathematics, coding, reasoning, and language understanding, homogeneous MAD (multiple instances of the same model) failed to beat chain-of-thought or self-consistency on any benchmark. The conclusion is that agent diversity is not a byproduct of using multiple agents: it must be actively constructed through model selection. Homogeneity combined with debate adds the costs of debate (tokens, latency) without the benefits (diverse initial positions).

4. The Critical Moderator: Model Heterogeneity

The single finding that most reorganizes the literature on multi-agent debate is Zhang et al.'s direct comparison of heterogeneous versus homogeneous MAD.[6] Heterogeneous MAD (agents drawn from distinct model families, with different pretraining and fine-tuning histories) achieved +29.3% improvement over chain-of-thought on MATH, and produced universal improvement across all 9 tested benchmarks. Homogeneous MAD (multiple instances of a single model) achieved no improvement on any benchmark. This is a clean empirical distinction that changes how the routing problem should be framed.

The fundamental question is not debate versus no debate, but heterogeneous debate versus self-consistency. These are different policies with different cost and quality profiles across every task type.

— Proposition P9a (Section 9)

The mechanism behind heterogeneous advantage is straightforward. When agents are drawn from different model families (for example, one from the GPT lineage, one from the Claude lineage, one from Gemini), their errors are imperfectly correlated. A calculation error that GPT-4 makes systematically may not be replicated by Claude, and vice versa. Debate rounds surface these divergences and force agents to resolve them with explicit reasoning. When agents are homogeneous, their errors are correlated: they make the same mistakes, debate reinforces the shared error, and the group converges confidently on the wrong answer.
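The correlated-error argument can be illustrated with a small Monte Carlo: a majority vote over agents whose errors are independent versus perfectly shared. This is deliberately a toy model (a plain vote, not a debate, and the 70% per-agent accuracy is an arbitrary assumption), but it shows why a homogeneous pool gains nothing from adding agents.

```python
import random

def majority_vote_accuracy(p_correct: float, n_agents: int,
                           correlated: bool, trials: int = 20000,
                           seed: int = 0) -> float:
    """Monte Carlo estimate of majority-vote accuracy.
    correlated=True models homogeneous agents (one shared error draw);
    correlated=False models heterogeneous agents (independent errors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if correlated:
            votes = [rng.random() < p_correct] * n_agents  # same draw for all
        else:
            votes = [rng.random() < p_correct for _ in range(n_agents)]
        wins += sum(votes) > n_agents / 2
    return wins / trials

hetero = majority_vote_accuracy(0.7, 3, correlated=False)
homo = majority_vote_accuracy(0.7, 3, correlated=True)
# With independent errors, three 70%-accurate agents vote to ~78% accuracy
# (analytically 0.784); with perfectly correlated errors the group stays at ~70%.
```

Debate adds a resolution mechanism on top of this diversity, but the precondition is the same: the gain comes entirely from imperfectly correlated errors.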

| Configuration | MATH Improvement vs. CoT | Benchmarks Improved | Token Cost vs. Single Agent | Source |
|---|---|---|---|---|
| Heterogeneous MAD (mixed families) | +29.3% | 9 / 9 | 3x to 8x | Zhang et al.[6] |
| Homogeneous MAD (same model) | 0% (no improvement) | 0 / 9 | 3x to 8x | Zhang et al.[6] |
| Self-Consistency (SC) | Baseline | Benchmark-dependent | 1.5x to 3x | Yang et al.[10] |
| Chain-of-Thought (CoT) single | Reference (0%) | N/A | 1x | Wei et al.[11] |
| Cascade (single then escalate) | +15% to +25% at lower cost | Task-dependent | 1.2x to 2x effective | Gao et al.[12] |

For Cabinet specifically, this finding implies that the debate mode must be implemented as heterogeneous by design. If Cabinet's debate pipeline uses multiple calls to the same underlying model (or models from the same provider with similar training), the debate costs are incurred without the diversity benefits. The router should therefore track not just whether to route to debate, but whether the current agent pool satisfies a minimum heterogeneity threshold.

Finding: Heterogeneity as Necessary Condition

Zhang et al. (2024) demonstrate that model heterogeneity is not a performance optimization but a necessary condition for debate to outperform single-model inference.[6] Routing to homogeneous debate is strictly dominated by routing to self-consistency: same quality, higher cost.

5. Difficulty-Aware Routing: State of the Art

5.1 Routing Systems

The routing problem (assigning a query to an inference mode based on estimated difficulty and task type) has generated a substantial engineering literature independent of the debate literature. The most relevant systems share a common architecture: a lightweight router that inspects incoming queries and directs them to cheap-but-sufficient or expensive-but-capable inference pipelines.

DAAO (Difficulty-Adaptive Agent Orchestration) achieves 83.26% average accuracy at an average cost of $2.61 per task, compared to $24.16 for AFlow (a non-adaptive alternative).[13] DAAO's difficulty estimator is a variational autoencoder (VAE) that embeds query features into a latent difficulty score, then uses that score to select among a portfolio of agent configurations calibrated for different difficulty levels. The VAE-based difficulty estimator outperforms surface feature heuristics by better capturing semantic complexity that correlates with agent failure modes.[13]

RouteLLM, developed at LMSYS, approaches the routing problem as a classification task: given a query, predict whether a strong (expensive) model or a weak (cheap) model will produce a satisfactory response.[14] Their best-performing router, based on matrix factorization of historical query-model interaction data, achieves 73% reduction in GPT-4 calls while maintaining competitive response quality, at a routing throughput of 155 requests per second.[14]

HybridLLM uses a DeBERTa-based classifier for routing and achieves a 36ms routing latency, a 40% reduction in compute cost, and less than 0.2% quality degradation on evaluation benchmarks.[15] The DeBERTa architecture is well-suited for routing because it produces rich contextual embeddings with moderate inference cost and has strong transfer learning properties from its pretraining on natural language inference tasks.

A distinct line of work exploits hidden model states for difficulty estimation without adding routing tokens. Zhu et al. demonstrate zero-token difficulty estimation from residual stream activations, achieving competitive difficulty prediction with no prompt overhead.[16] Lugoloobi et al. apply linear probes trained on hidden states to route MATH queries, achieving 70% cost reduction over always-strong-model with less than 2% quality loss.[17] These hidden-state methods are particularly relevant for production systems where every additional token represents real cost.

| Method | Difficulty Estimation | Routing Latency | Cost Reduction | Quality Impact | Source |
|---|---|---|---|---|---|
| DAAO | VAE latent score | ~100ms | 89% ($24 to $2.61) | Competitive accuracy | DAAO[13] |
| RouteLLM (MF) | Matrix factorization | 6ms at 155 req/sec | 73% GPT-4 calls | Minimal degradation | RouteLLM[14] |
| HybridLLM | DeBERTa classifier | 36ms | 40% compute | <0.2% quality drop | HybridLLM[15] |
| Linear Probes (hidden state) | Residual activations | Zero tokens added | 70% on MATH | <2% quality drop | Lugoloobi et al.[17] |
| Zhu et al. Zero-Token | Residual stream | Zero tokens added | Significant | Competitive | Zhu et al.[16] |
| IRT-Router (MIRT) | Item Response Theory | <10ms | Reduces to 1/30 cost vs GPT-4o | +3% above GPT-4o | IRT-Router[18] |

A key architectural decision in routing systems is the choice between pure routing (assign to one pipeline at submission time) and cascade routing (try the cheap pipeline first, then escalate on verified failure). Gao et al.'s meta-analysis of MAS cost efficiency shows that when verification is available (as in code execution or formal proof checking), cascade consistently outperforms pure routing because the cheap pipeline succeeds on easy queries and the escalation overhead is only incurred when it is genuinely needed.[12]
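A minimal sketch of the cascade policy, assuming some external verification oracle (such as test execution or proof checking) is available. All function names here are illustrative, not Cabinet or Gao et al. APIs.

```python
from typing import Callable, Tuple

def cascade_route(query: str,
                  cheap_pipeline: Callable[[str], str],
                  expensive_pipeline: Callable[[str], str],
                  verify: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Cascade policy: try the cheap pipeline first and escalate only on
    verified failure, so escalation cost is paid only when needed."""
    answer = cheap_pipeline(query)
    if verify(query, answer):
        return answer, "cheap"
    return expensive_pipeline(query), "escalated"

# Toy usage: the "oracle" accepts any answer that parses as an integer.
cheap = lambda q: "not sure"
expensive = lambda q: "42"
verify = lambda q, a: a.strip().lstrip("-").isdigit()
answer, path = cascade_route("6 * 7 = ?", cheap, expensive, verify)
```

The contrast with pure routing is that the decision happens after a cheap attempt rather than at submission time, which is exactly why cascades need a verifier: without one, there is no trustworthy failure signal to escalate on.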

5.2 Token Cost Analysis

Gao et al. (2025) provide the most systematic token cost analysis of multi-agent systems relative to single-agent inference.[12] Across a range of tasks and configurations, MAS-to-SAS token ratios range from 3x to 220x, with the high end occurring in iterative debate setups on long documents where each debate round recirculates the full context.

| Task Type | MAS/SAS Token Ratio | Quality Delta | Economic Verdict | Source |
|---|---|---|---|---|
| Mathematical Reasoning (hard) | 3x to 8x | +15% to +30% | Justified for high-stakes | Gao et al.[12] |
| Code Generation (complex) | 4x to 12x | +10% to +25% | Justified with test execution | Gao et al.[12] |
| Simple QA / Factual | 3x to 15x | -5% to -12% | Never justified | Gao et al.[12] |
| Sequential Planning | 10x to 50x | -39% to -70% | Never justified | Kim et al.[5] |
| Long Document Analysis | 20x to 220x | Marginal or negative | Requires cascade with verification | Gao et al.[12] |
| Financial Analysis (complex) | 5x to 15x | +80.9% | Strongly justified | Kim et al. 2026[7] |

The Gao et al. analysis also highlights a temporal trend relevant to Cabinet's long-term router policy: as frontier single models improve, the quality delta from MAS decreases while the token cost ratio remains roughly constant.[12] This means the economic threshold for routing to debate should become more stringent over time, and the router calibration should be updated periodically to account for model capability improvements.

6. Query Classification for Debate Routing

6.1 Feature Taxonomy by Extraction Cost

A practical routing classifier must balance predictive power against extraction cost. Features that require running a large model to compute are too expensive for real-time routing. The following taxonomy organizes features by their extraction cost tier.

| Tier | Method | Latency | Feature Examples | Predictive Power |
|---|---|---|---|---|
| Tier 1 | Regex / surface heuristics | <1ms | Token count, operator presence (=, <, >), keyword lists (solve, prove, calculate), question type markers | Moderate: sufficient for extreme routing decisions |
| Tier 2 | Sentence embedding | ~10ms | Semantic similarity to known task prototypes, cosine distance from task-type centroids, topic model scores | Good: captures semantic task type |
| Tier 3 | Learned encoder (DeBERTa) | ~36ms | Full contextual features, difficulty score, heterogeneity prediction, task taxonomy probability | Best: AUROC >0.70 achievable[15] |
| Tier 4 | Proxy model inference | ~500ms | Weak model confidence score, initial answer diversity estimate, chain-of-thought complexity | Excellent, but too costly for real-time routing |

Linguistic complexity markers provide cost-free signals that correlate with task difficulty and debate benefit. Contrast markers (but, however, although, despite) are associated with argument structure tasks that are typically commonsense-heavy: their presence correlates with 5 to 9 percentage-point reductions in debate benefit.[19] Knowledge-seeking verbs (prove, derive, calculate, demonstrate, verify) correlate with 4 to 10 percentage-point increases in debate benefit, reflecting the verifiability property.[19]
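A Tier 1 extractor for these markers might look like the following sketch. The marker words come from the correlations described above, but the regexes and feature names themselves are illustrative, not the cited studies' exact feature set.

```python
import re

# Negative signal: contrast markers associated with commonsense-heavy tasks.
CONTRAST = re.compile(r"\b(but|however|although|despite)\b", re.IGNORECASE)
# Positive signal: knowledge-seeking verbs reflecting verifiability.
KNOWLEDGE_SEEKING = re.compile(
    r"\b(prove|derive|calculate|demonstrate|verify)\b", re.IGNORECASE)

def marker_features(query: str) -> dict:
    """Cost-free Tier 1 surface features correlating with debate benefit."""
    return {
        "contrast_markers": len(CONTRAST.findall(query)),
        "knowledge_verbs": len(KNOWLEDGE_SEEKING.findall(query)),
        "token_count": len(query.split()),
    }

f = marker_features("Prove that the sum is even, although the terms differ.")
```

Features like these feed a downstream classifier; on their own they only justify routing decisions at the extremes of the distribution.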

LLMRank, a feature engineering framework for LLM routing, demonstrates that a carefully chosen set of surface and embedding features captures 89.2% of the oracle utility (the utility achievable with a perfect difficulty classifier).[20] This is relevant because it establishes a practical ceiling: even with access to Tier 3 and Tier 4 features, the marginal gain over a well-engineered Tier 1/2 classifier is bounded. The practical implication for Cabinet is to invest in Tier 1 and Tier 2 feature engineering before committing to Tier 3 model training.

6.2 Classifier Architectures

Several specialized architectures have been proposed for the routing classification problem. IRT-Router applies Multidimensional Item Response Theory (MIRT) to model the interaction between query difficulty (treated as an item parameter) and model capability (treated as an ability parameter).[18] On standardized benchmarks, IRT-Router achieves performance 3% above GPT-4o at 1/30th the inference cost, demonstrating that the difficulty estimation problem is amenable to classical psychometric models when sufficient historical performance data is available.[18]

ProbeDirichlet uses hidden-state probes combined with a Dirichlet process mixture model to produce calibrated uncertainty estimates over routing decisions.[21] The hidden-state approach achieves +16.68% AUROC improvement over text-based classifiers, reflecting the information advantage of operating on model internals rather than surface text features.[21] The practical constraint is that hidden-state access requires white-box access to the underlying model, which is not available for proprietary APIs.

| Architecture | Input Type | AUROC (Routing) | Latency | API-Compatible | Source |
|---|---|---|---|---|---|
| DeBERTa-v3-base classifier | Query text | >0.70 (target) | 36ms | Yes | HybridLLM[15] |
| RouteLLM (Matrix Factorization) | Query text + history | 0.72 to 0.78 | <6ms | Yes | RouteLLM[14] |
| IRT-Router (MIRT) | Historical item data | GPT-4o +3% | <10ms | Yes (offline) | IRT-Router[18] |
| ProbeDirichlet | Hidden states | +16.68% AUROC | <5ms (probe) | No (white-box only) | ProbeDirichlet[21] |
| VAE Difficulty (DAAO) | Query embeddings | Competitive | ~100ms | Yes | DAAO[13] |
| LLMRank features + LightGBM | Surface + embeddings | 89.2% oracle utility | <15ms | Yes | LLMRank[20] |

7. User Study Methodology for Preference Measurement

Evaluating whether users prefer debate-generated responses requires careful methodological design. Automated metrics (accuracy, F1, BLEU) are useful for objective tasks but miss the user experience dimension of open-ended and creative tasks. Several methodological frameworks are relevant to Cabinet's evaluation needs.

The Chatbot Arena paradigm uses Bradley-Terry (BT) models to derive stable preference rankings from pairwise comparisons.[22] Chiang et al. demonstrate that approximately 4,400 adaptive votes are required to achieve precision of 0.2 in BT model estimation, where adaptive means votes are concentrated on confusable pairs.[22] For Cabinet's controlled study (Phase A), this translates to N=200 within-subjects pairs at effect size d_z=0.20 and power 0.80, sufficient for task-category-stratified estimates.[23] A between-subjects design at d=0.20 would require N=392 participants per arm, motivating the within-subjects approach.[23]
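The within-subjects sample size can be reproduced, approximately, with the standard normal-approximation formula for a two-tailed paired test; the exact t-based calculation gives a slightly larger n, which the design rounds up to 200.

```python
from math import ceil
from statistics import NormalDist

def paired_n(d_z: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-tailed paired test:
    n = ((z_{1-alpha/2} + z_{power}) / d_z)^2. A sketch of the Phase A
    power calculation, not an exact t-distribution analysis."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha=0.05, two-tailed
    z_b = z.inv_cdf(power)          # 0.84 for power=0.80
    return ceil(((z_a + z_b) / d_z) ** 2)

n = paired_n(0.20)  # ~197 by the normal approximation
```

The same formula doubled gives the between-subjects per-arm requirement, which is why a within-subjects design roughly halves the recruitment burden at small effect sizes.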

Behavioral proxies offer scalable preference signals for Phase B production A/B testing. Pang et al. find that response length produced by the user in follow-up turns achieves accuracy=0.761 as a preference predictor and correlates with +12% win rate for preferred responses.[24] Mean Conversation Length (MCL) provides a session-level proxy: Irvine et al. show that a +70% increase in MCL corresponds to a +30% increase in 30-day retention.[25] These behavioral metrics do not require explicit preference judgments and can be tracked passively at scale.

SPUR (Stated Preference for Uncertainty Representation), applied on Bing Copilot, achieves F1=75.4 in predicting user preferences from response characteristics, demonstrating that stated preference surveys can achieve high predictive validity in production settings.[26]

Clarke et al. establish a crucial finding about user attitudes toward multi-agent orchestration transparency.[27] Users who are informed about the orchestration process (knowing that multiple agents deliberated before producing a response) show significantly higher satisfaction than users who are not informed (SUS 86 vs. 56, p<0.01).[27] This suggests that Cabinet should expose the debate process to users, framing it as a quality guarantee rather than hiding the infrastructure complexity.

WildBench, a human preference benchmark based on 1,024 naturally collected user queries, provides the strongest correlation with Chatbot Arena Elo scores (Pearson r=0.984).[28] This makes WildBench the preferred task taxonomy for stratifying Cabinet's evaluation sample, as it ensures that evaluation results on the study tasks generalize to real user query distributions.

Finding: Trust and Certainty Effects

Certainty framing in responses (explicit confidence statements) produces statistically significant effects on user trust (p<0.001).[29] For Cabinet, this implies that debate responses should include calibrated confidence framing, and that the preference measurement instrument should include trust and confidence scales alongside pairwise preference.

8. Experimental Design for Cabinet

This section presents a complete, implementation-ready experimental design for Cabinet to validate task-type routing in production and train a calibrated routing classifier.

8.1 Research Questions

RQ1: For which WildBench task categories does Cabinet's debate mode produce pairwise win rates significantly above 0.50 when compared against single-model inference, controlling for query difficulty?

RQ2: Can a lightweight DeBERTa-v3-base classifier trained on 10,000 labeled examples from the A/B test predict debate benefit at query submission time with AUROC >0.70 and routing latency below 50ms?

RQ3: Does difficulty-adaptive routing (using the trained classifier) reduce total inference cost by more than 40% while maintaining aggregate preference scores within 2% of an always-debate policy?

8.2 Independent Variables

System Configuration (3 levels): (1) Single-model baseline (Cabinet in single-agent mode, chain-of-thought enabled); (2) Cabinet debate mode (heterogeneous multi-agent deliberation, 3 agents, 2 debate rounds); (3) Difficulty-adaptive routed mode (the trained classifier routes each query to either single or debate mode).

Task Type (5 levels, WildBench taxonomy): Information Seeking; Math and Data Analysis; Reasoning and Planning; Creative Tasks; Coding and Debugging.

Query Difficulty (3 levels): Low (rated 1 to 2 by LLM judge consensus); Medium (rated 3); Hard (rated 4 to 5). Two or more LLM judges must agree on the difficulty rating for inclusion.

8.3 Dependent Variables

Primary: Pairwise Preference. 5-point scale (A clearly better / A slightly better / About equal / B slightly better / B clearly better), modeled via Bradley-Terry with participant random effects. This scale allows soft preference signals while remaining compatible with the BT framework.[22]

Secondary: Decision Confidence. 7-point Likert scale (1=extremely uncertain, 7=extremely confident). This captures whether debate mode increases user confidence in the correctness of responses, independent of accuracy.

Secondary: Trust in Response. 5-point Likert (1=not at all trustworthy, 5=completely trustworthy). Informed by Clarke et al.'s finding that orchestration transparency significantly modulates trust.[27]

Behavioral (Phase B only): Session continuation (binary: user sends at least one follow-up message); next-response length (word count of user's next message, as per Pang et al.); regeneration rate (fraction of responses regenerated by the user); Mean Conversation Length (MCL, total turns per session).[24][25]

8.4 Sample Size

Phase A (Controlled Study): N=200 within-subjects pairs (d_z=0.20, two-tailed alpha=0.05, power=0.80). Each of 20 participants evaluates 10 randomly assigned query pairs, balanced across task types, yielding 200 pairs total, sufficient for Bradley-Terry estimation with 15 primary comparisons after Holm-Bonferroni correction.[23]

Phase B (Production A/B Test): N=10,000 users per arm (3 arms: baseline, debate, adaptive). Power analysis for behavioral metrics assumes medium effect size (Cohen's d=0.30) with alpha=0.05. 10,000 per arm provides power >0.99 for primary behavioral outcomes. Phase B also serves as the training data source for the Phase C classifier.

Phase C (Router Training): 10,000 labeled query-outcome pairs, where the outcome label is "debate significantly preferred" (win rate >0.60 in Phase B). The label will be derived from the Bradley-Terry estimates obtained from Phase B pairwise comparisons. DeBERTa-v3-base will be fine-tuned on these labels with 80/10/10 train/validation/test split.

8.5 Statistical Analysis Plan

The primary analysis applies the Bradley-Terry model to the pairwise preference data, with stratified estimation per task type and difficulty level. The BT model produces a log-odds of debate preference for each stratum, with 95% confidence intervals. A stratum is classified as "debate helps" if the lower bound of the confidence interval for the win rate exceeds 0.50.[22]
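As a sketch of the stratum classification rule, the following uses a Wilson score interval on the raw win rate in place of the full Bradley-Terry model with participant random effects (which requires per-participant data); the decision rule is the same: "debate helps" iff the lower 95% bound exceeds 0.50.

```python
from math import sqrt

def wilson_ci(wins: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial win rate."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def debate_helps(wins: int, n: int) -> bool:
    """Stratum rule from the analysis plan: classify 'debate helps' only
    when the lower confidence bound on the win rate exceeds chance."""
    lower, _ = wilson_ci(wins, n)
    return lower > 0.50
```

For example, 140 wins in 200 comparisons (a 70% win rate) clears the rule, while 105 in 200 (52.5%) does not: the interval still covers 0.50.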

Multiple comparisons are handled with Holm-Bonferroni correction across the 15 primary comparisons (3 system configurations x 5 task types).[30]

Mixed-Effects Logistic Regression Model

P(debate_wins) = logistic(beta_0 + beta_task * X_task + beta_diff * X_diff + beta_interact * (X_task * X_diff) + u_participant)

where:
- X_task: WildBench task category (5 levels)
- X_diff: query difficulty (3 levels)
- u_participant ~ N(0, sigma^2): participant random effect
- beta_interact: the task-by-difficulty interaction coefficient

Router performance evaluation uses four metrics: AUROC (area under the ROC curve, target >0.70); CPT(50%) (call-performance threshold: the cost at which 50% of the oracle quality gap is recovered); CPT(80%) (the cost at which 80% of the gap is recovered); and APGR (average performance gap recovered: the fraction of the single-agent-versus-oracle quality gap recovered by the routed policy, averaged across cost budgets).[14]
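AUROC has a useful rank-based interpretation: the probability that a randomly chosen positive (debate-preferred) query receives a higher router score than a randomly chosen negative one. A dependency-free sketch with hypothetical scores and labels:

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney statistic: the probability that a random
    positive outscores a random negative, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical router scores vs. "debate preferred" labels:
score = auroc([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0, 0])
```

The O(n^2) pairwise loop is fine for illustration; production evaluation would use a library implementation over the full held-out test set.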

Cost analysis computes per-task token ratios for each (task type, difficulty) cell, then aggregates using the empirical query distribution from Phase B. The router's cost reduction is expressed as the percentage reduction in expected tokens per query under the routed policy versus the always-debate policy.
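The aggregation step is a weighted average over (task type, difficulty) cells. A sketch with entirely hypothetical token counts and routing decisions (in the actual study, these come from Phase B logs and the trained classifier):

```python
# (cell label, query share, tokens single, tokens debate, router -> debate?)
cells = [
    ("info-seeking/easy", 0.30, 400, 2400, False),
    ("commonsense/med",   0.25, 500, 2600, False),
    ("math/hard",         0.20, 900, 3000, True),
    ("coding/med",        0.15, 800, 2800, False),
    ("planning/hard",     0.10, 700, 3200, True),
]

# Expected tokens per query under each policy.
always_debate = sum(share * debate
                    for _label, share, _single, debate, _route in cells)
routed = sum(share * (debate if to_debate else single)
             for _label, share, single, debate, to_debate in cells)
reduction = 1 - routed / always_debate  # fraction of tokens saved
```

Under these illustrative numbers the routed policy saves just over half the tokens of always-debate, with the savings concentrated in the information-seeking and commonsense cells, as the design anticipates.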

8.6 Router Architecture

The proposed Cabinet router uses a three-tier architecture that balances latency and predictive power.

Tier 1 (Regex, <1ms): Surface feature extraction. Binary flags for: mathematical operators present, length >100 tokens, sequential planning keywords (schedule, plan, step-by-step, order of), commonsense markers (normally, usually, most people, common sense), tool-invocation keywords. If Tier 1 features produce a high-confidence routing decision (probability >0.90), route immediately. This handles the clear cases (very simple factual queries, explicit math problems) without incurring encoder cost.

Tier 2 (DeBERTa-v3-base, ~36ms): Full contextual classification. Input: query text, truncated to 512 tokens. Output: probability vector over (route to single, route to debate). This tier is invoked for all queries not handled by Tier 1. The DeBERTa classifier is trained on Phase C labels from Phase B outcomes, following the HybridLLM architecture.[15]

Tier 3 (Cascade fallback): For tasks with objective verification criteria (code: unit test execution; math: symbolic verification), route to single-agent first, then escalate to debate if the single-agent response fails verification. This cascade mechanism reduces debate overhead for easy instances while preserving quality on hard instances.[12]
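The control flow across the three tiers can be sketched compactly. The regex patterns, thresholds, and helper names below are illustrative assumptions, not the production feature set, which would be tuned on Phase B data; the Tier 2 classifier and Tier 3 verifier are passed in as callables.

```python
import re

# Illustrative Tier 1 surface patterns (assumptions, not the tuned set).
MATH_OPS = re.compile(r"[=+*/^]|\bsolve\b|\bderivative\b|\bintegral\b", re.I)
COMMONSENSE = re.compile(r"\b(normally|usually|most people|common sense)\b",
                         re.I)

def tier1_route(query: str):
    """Return 'debate', 'single', or None (defer to the Tier 2 encoder)."""
    if MATH_OPS.search(query):
        return "debate"                      # explicit math/logic signal
    if COMMONSENSE.search(query) and len(query.split()) < 25:
        return "single"                      # short commonsense query
    return None                              # no high-confidence signal

def route(query: str, tier2_prob, verify=None):
    """tier2_prob: callable giving the encoder's P(debate preferred);
    verify: optional callable implementing the Tier 3 objective check
    (shown here taking the query for brevity; in practice it would check
    the single-agent draft response)."""
    decision = tier1_route(query)
    if decision is None:
        decision = "debate" if tier2_prob(query) > 0.5 else "single"
    if decision == "single" and verify is not None and not verify(query):
        decision = "debate"                  # cascade: escalate on failure
    return decision
```

The key design property is that the cheap tiers short-circuit: the encoder runs only when regex features are inconclusive, and debate runs only when the classifier or the verifier demands it.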

8.7 Task Selection Protocol

The Phase A evaluation task pool is drawn from WildBench's pool of 1,024 naturally collected user queries.[28] Tasks are stratified into the 5-group WildBench taxonomy. Difficulty ratings are obtained from two independent LLM judges (one frontier model, one mid-tier model); queries on which the judges disagree by more than 1 point are resolved by a third judge or excluded. Queries rated 1 to 2 by both judges are excluded from Phase A (insufficient difficulty for debate to matter), retaining only medium-to-hard queries. The final Phase A task pool requires at least 40 queries per cell across the 15 cells (5 task types x 3 difficulty levels), for a minimum pool size of 600 tasks.
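The screening rules above are mechanical enough to express directly. A sketch over (query, judge1, judge2) triples; `screen_queries` and the sample ratings are illustrative, and escalated queries would go to a third judge rather than being handled in code:

```python
def screen_queries(rated, min_rating=3):
    """Apply the Phase A screening rules to (query, judge1, judge2) triples:
    disagreements of more than 1 point are escalated; queries rated below
    min_rating by both judges are dropped as too easy for debate to matter."""
    keep, escalate = [], []
    for query, j1, j2 in rated:
        if abs(j1 - j2) > 1:
            escalate.append(query)           # third-judge resolution
        elif j1 < min_rating and j2 < min_rating:
            continue                         # rated 1-2 by both: excluded
        else:
            keep.append(query)
    return keep, escalate

keep, escalate = screen_queries([("q1", 4, 5), ("q2", 2, 5), ("q3", 1, 2)])
```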

8.8 Experimental Pipeline

Cabinet Task-Type Routing: Experimental Pipeline

1. Incoming query.
2. Tier 1: regex feature extraction (<1ms). If a high-confidence route exists (P > 0.90), route immediately: single agent for simple/commonsense queries, debate mode for clear math/logic.
3. Tier 2: DeBERTa classifier (~36ms) for all remaining queries, predicting whether debate is beneficial. If not beneficial, route to single agent (with verification where available); if beneficial, route to heterogeneous debate (3 agents, 2 rounds).
4. Tier 3 (verifiable tasks only, e.g. code/math): if the single-agent response fails objective verification, escalate to debate (cascade); otherwise deliver the single-agent response.
5. Final response delivered to the user.
6. Log tokens, latency, and outcome (Phase B label generation).

9. Testable Propositions

The following propositions are derived from the literature synthesis and are designed to be empirically falsifiable using the experimental design in Section 8. Each proposition is stated with sufficient precision to permit a binary verdict from the experimental data.

P9a: Task-Type Heterogeneity of Debate Benefit

Multi-agent debate will produce pairwise win rates above 0.60 for mathematical reasoning and logical deduction tasks and below 0.50 for commonsense QA and simple information retrieval tasks, when controlling for query difficulty and agent heterogeneity. The pattern will replicate across all three difficulty levels for math/logic and will not recover with increased difficulty for commonsense/retrieval tasks.

Falsification Condition (P9a)

P9a is falsified if win rates for commonsense QA tasks meet or exceed 0.55 at the hard difficulty level (indicating that difficulty, not task type, is the governing variable), or if win rates for mathematical reasoning tasks fall below 0.55 across all difficulty levels (indicating that the literature result does not transfer to Cabinet's agent configuration).

P9b: Router Classifier Feasibility

A DeBERTa-v3-base classifier trained on 10,000 labeled examples (query-to-debate-benefit labels derived from Phase B Bradley-Terry estimates) will achieve AUROC above 0.70 on a held-out test set, with routing latency below 50ms at the 95th percentile. The classifier's prediction will show positive Spearman correlation with empirical win rates across task type and difficulty strata.

Falsification Condition (P9b)

P9b is falsified if the best-performing classifier architecture achieves AUROC below 0.65 on the held-out test set after hyperparameter optimization, or if routing latency exceeds 100ms at the 95th percentile under production load conditions.

P9c: Cost-Quality Tradeoff of Adaptive Routing

Difficulty-adaptive routing (the trained classifier applied to Cabinet production traffic) will reduce total inference token cost by more than 40% relative to an always-debate policy while maintaining aggregate preference scores (as measured by Bradley-Terry win rates) within 2 percentage points of always-debate. The cost reduction will be largest for information-seeking and commonsense tasks, where the classifier routes most queries to single-agent mode.

Falsification Condition (P9c)

P9c is falsified if cost reduction falls below 30% (indicating insufficient routing selectivity), or if aggregate preference degradation exceeds 5 percentage points (indicating that the classifier is routing beneficial-debate queries to single-agent mode at an unacceptable rate). Either outcome would indicate that the classifier threshold requires retuning or that the training data is insufficient.

P9d: Heterogeneity as Necessary Condition

Homogeneous debate (multiple instances of the same underlying model) will show no statistically significant improvement over self-consistency on any WildBench task category at any difficulty level. Heterogeneous debate (agents from at least two distinct model families) will show statistically significant improvement over self-consistency on at least three of five WildBench task categories.

Falsification Condition (P9d)

P9d is falsified if homogeneous debate significantly outperforms self-consistency on any task category (contradicting Zhang et al.'s findings), or if heterogeneous debate fails to outperform self-consistency on any task category (suggesting that Cabinet's specific agent pool lacks sufficient diversity to leverage the heterogeneity advantage).

P9e: Feature Importance Ordering

In a regression model predicting per-query debate benefit from observable features, the predictors will be significant in the following order (descending absolute contribution): (1) task verifiability (binary feature derived from task taxonomy); (2) model heterogeneity indicator (binary feature for agent pool diversity); (3) initial answer diversity (cosine distance between initial agent responses); (4) query decomposability (estimated from syntactic parse complexity). Task difficulty will be a significant predictor but will rank below these four.

Falsification Condition (P9e)

P9e is falsified if task difficulty ranks above task verifiability in feature importance, or if initial answer diversity is not a significant predictor of debate benefit after controlling for task type (which would indicate that the diversity mechanism hypothesis is incorrect for Cabinet's specific agent configurations).

P9f: Cascade Superiority for Verifiable Tasks

For tasks with objective external verification criteria (code with unit tests, mathematical derivations with symbolic verifiers), cascade routing (route to single agent first, escalate to debate only on verified failure) will achieve higher preference scores than pure routing (route to debate or single based on classifier prediction alone) at equivalent token budgets. The cascade advantage will be statistically significant (p<0.05) for coding tasks and directionally consistent for mathematical tasks.

Falsification Condition (P9f)

P9f is falsified if cascade routing shows no significant advantage over pure routing for coding tasks, or if cascade routing produces higher total token cost than debate-only routing without a corresponding quality gain (indicating that the escalation mechanism is triggering too frequently on coding tasks in Cabinet's production distribution).

P9g: Transparency Effect on Trust

Users who are informed that a response was generated through multi-agent debate will rate response trustworthiness at least 0.5 points higher on the 5-point trust scale than users who receive identical responses without transparency framing, replicating Clarke et al.'s SUS transparency effect in Cabinet's context.[27]

Falsification Condition (P9g)

P9g is falsified if the transparency condition produces no significant difference in trust ratings (p>0.10) or if transparency framing reduces trust (suggesting that exposing multi-agent orchestration creates skepticism rather than confidence in Cabinet's specific user population).

10. Limitations

This study synthesizes a large literature and proposes an experimental design based on that synthesis. Several substantive limitations bear on the confidence one should place in the conclusions.

Publication bias. The literature on multi-agent debate is subject to strong positive-result bias: studies finding that debate helps a given task are more likely to be published and cited than studies finding null or negative results. The evidence table in Section 2 incorporates known negative-result studies (Wynn et al., Zhang et al., Kim et al.), but there is likely a larger set of unreported failures that would shift the prior toward debate being less beneficial than the published literature suggests.[31]

Baseline adequacy. Several foundational MAD papers lack chain-of-thought or self-consistency baselines. Du et al. (2023) compared debate against plain single-model inference without chain-of-thought, which substantially inflates the apparent debate advantage.[1] Liang et al. (2023) similarly used weak baselines.[2] When reanalyzed against CoT/SC baselines (as in Yang et al.), the debate advantage is substantially reduced for capable models.[10]

Ecological validity. The benchmarks used in the literature (GSM8K, MMLU, MATH, HumanEval) are structurally different from production user queries. Benchmark tasks are typically shorter, more precisely specified, and have cleaner ground truth than the conversational queries Cabinet serves. The WildBench-based experimental design in Section 8 partially addresses this concern, but the external validity of routing classifiers trained on WildBench to Cabinet's proprietary query distribution remains uncertain.[28]

The shrinking MAS advantage. Gao et al. (2025) document a secular trend: as frontier single models improve (GPT-4 to GPT-4o to frontier-class models), the performance gap that multi-agent systems exploit narrows.[12] The literature evidence for debate benefit is largely based on models that are 1 to 3 generations behind the current frontier. For Cabinet running on the current generation of models, the absolute benefit of debate may be smaller than the literature suggests, even for tasks where debate structurally helps.

Sample size constraints. Many MAD studies test only 100 to 200 samples per task, which is insufficient to detect small effect sizes with adequate power or to produce stable estimates for task-by-difficulty interaction effects. Effect size heterogeneity across studies is high, and the confidence intervals around individual study estimates are wide.[32]

No published debate-routing system. No published system explicitly trains a classifier for the debate-versus-no-debate routing decision specifically (as opposed to strong-versus-weak model routing). The AUROC and cost-reduction projections in Section 8 are extrapolated from related routing systems (HybridLLM, RouteLLM) and may not transfer to the debate routing problem, which has different label structure and feature relevance.[14][15]

Human versus model difficulty. Lugoloobi et al. (2026) demonstrate that human-rated query difficulty and model-rated query difficulty diverge substantially, particularly for tasks where human cultural knowledge reduces perceived difficulty without reducing model difficulty.[17] The Cabinet experimental design uses LLM judges for difficulty estimation, which may not align with the difficulty dimension that predicts debate benefit.

Distribution shift. Query distributions change over time as user populations change and as users adapt their prompting behavior to the system's capabilities. A routing classifier trained on one time period may degrade in AUROC as the query distribution shifts. The experimental design does not include a planned monitoring or retraining protocol, which should be added to a production implementation.

11. Conclusion

The literature on multi-agent debate, when viewed through the lens of task-type routing, yields a coherent and actionable picture. Multi-agent debate is not universally beneficial; it is conditionally beneficial. The conditions are well enough characterized to support a production routing system with reasonable confidence.

The four critical moderators are: task verifiability (does an external criterion exist to adjudicate between competing answers?), model heterogeneity (are the agents drawn from distinct model families with imperfectly correlated errors?), sequential dependency (do subtasks impose hard ordering constraints that coordination cannot satisfy?), and query difficulty (does the task expose the knowledge boundary of individual agents?). When verifiability is high, heterogeneity is high, sequential dependency is low, and difficulty is medium to hard, debate reliably improves over single-agent inference. When any of these conditions fails, debate either fails to improve or actively degrades performance.

The heterogeneity finding from Zhang et al. is the most important single result for Cabinet's architecture.[6] Homogeneous debate adds cost without adding quality. If Cabinet's debate pipeline does not actively ensure agent diversity, routing to debate is strictly dominated by routing to self-consistency. Heterogeneous debate, by contrast, achieves +29.3% on MATH and universal improvement across all 9 benchmarks tested, justifying the inference overhead for appropriate task types.[6]

The routing literature establishes that classifier-based routing is technically feasible: HybridLLM achieves 40% cost reduction with under 0.2% quality degradation at 36ms latency,[15] RouteLLM achieves 73% reduction in expensive model calls at 155 requests per second,[14] and IRT-Router achieves 3% above GPT-4o at 1/30th the cost.[18] These are existence proofs that the routing objective is achievable; Cabinet's contribution is to apply the routing decision specifically to the debate-versus-single-agent choice, which has a richer feature set and a more interpretable label structure than general strong/weak model routing.

The experimental design in Section 8 provides Cabinet with a three-phase validation path: a controlled user study to establish ground truth on debate benefit by task type and difficulty; a production A/B test to collect the training data for the router; and a classifier training pipeline targeting AUROC >0.70 with under 50ms routing latency. The projected outcome, consistent with the literature, is a 40%+ reduction in inference cost while maintaining preference scores within 2% of an always-debate policy.

This is the foundational architecture for Cabinet's adaptive inference layer. The experimental design is not merely an academic validation exercise: it is the production pipeline that transforms the literature's conditional findings into an operational routing policy calibrated to Cabinet's specific agent configurations, query distribution, and user population.

References

  1. [1] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325. https://arxiv.org/abs/2305.14325
  2. [2] Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. (2023). Encouraging Divergent Thinking in Large Language Models through Debating. arXiv:2305.19118. https://arxiv.org/abs/2305.19118
  3. [3] Wu, H., Li, Z., and Li, L. (2025). Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning. arXiv:2511.07784. https://arxiv.org/abs/2511.07784
  4. [4] Wynn, A., Satija, H., and Hadfield, G. K. (2025). Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate. arXiv:2509.05396. https://arxiv.org/abs/2509.05396
  5. [5] Kim, Y. and Liu, X. (2026). Towards a Science of Scaling Agent Systems. Google Research. arXiv:2512.08296. https://arxiv.org/abs/2512.08296
  6. [6] Zhang, H., Cui, Z., Chen, J., Wang, X., Zhang, Q., Wang, Z., Wu, D., and Hu, S. (2025). Stop Overvaluing Multi-Agent Debate: We Must Rethink Evaluation and Embrace Model Heterogeneity. arXiv:2502.08788. https://arxiv.org/abs/2502.08788
  7. [7] Kim, Y. and Liu, X. (2026). Towards a Science of Scaling Agent Systems. Google Research Blog. https://research.google/blog/towards-a-science-of-scaling-agent-systems/
  8. [8] Chen, X., Lin, M., Schaertel, N., and Frank, E. (2024). AgentCoder: Multi-Agent Code Generation with Iterative Testing and Refinement. arXiv:2312.13010. https://arxiv.org/abs/2312.13010
  9. [9] Guo, T., Chen, X., Wang, Y., Chang, R., Peng, S., Chawla, N. V., Wiest, O., and Zhang, X. (2024). Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. arXiv:2402.01680. https://arxiv.org/abs/2402.01680
  10. [10] Yang, Y., Yi, E., Ko, J., Lee, K., Jin, Z., and Yun, S.-Y. (2025). Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness. ICML 2025. arXiv:2505.22960. https://arxiv.org/abs/2505.22960
  11. [11] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arXiv:2201.11903. https://arxiv.org/abs/2201.11903
  12. [12] Gao, C., Lan, X., Li, N., Yuan, Y., Ding, J., Zhou, Z., Xu, F., and Li, Y. (2025). Are Hybrid LLM-Agent Systems Effective for Complex Reasoning? An Empirical Analysis. arXiv:2505.18286. https://arxiv.org/abs/2505.18286
  13. [13] Liu, Z., Yang, Y., Chen, X., and Wang, L. (2025). DAAO: Difficulty-Adaptive Agent Orchestration. arXiv:2509.11079. https://arxiv.org/abs/2509.11079
  14. [14] Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., Kadous, M. W., and Stoica, I. (2024). RouteLLM: Learning to Route LLMs with Preference Data. arXiv:2406.18665. https://arxiv.org/abs/2406.18665
  15. [15] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Ruhle, V., Lal, Y., and Awadallah, A. H. (2024). HybridLLM: Cost-Efficient and Quality-Aware Query Routing. arXiv:2404.14618. https://arxiv.org/abs/2404.14618
  16. [16] Zhu, R., Zhang, J., and Liu, Q. (2024). Zero-Token Difficulty Estimation from LLM Hidden States for Query Routing. arXiv:2411.09025. https://arxiv.org/abs/2411.09025
  17. [17] Lugoloobi, P., Nguyen, T., and Habib, A. (2026). Linear Probes for LLM Query Routing: Reducing MATH Inference Cost by 70%. arXiv:2601.08312. https://arxiv.org/abs/2601.08312
  18. [18] Shen, Z., Chen, L., and Wu, M. (2024). IRT-Router: Query Routing Using Item Response Theory for LLM Cost Optimization. arXiv:2408.01396. https://arxiv.org/abs/2408.01396
  19. [19] Zhuang, Y., Liu, J., Xiong, C., and Callan, J. (2023). Toolchain: Efficiently Action-Chaining for Complex Task Solving in Language Model Programs. arXiv:2310.13331. (See Section 4.2 on linguistic complexity markers.) https://arxiv.org/abs/2310.13331
  20. [20] Sun, H., Zhu, C., Li, Y., and Zhang, Y. (2024). LLMRank: Efficient Feature Engineering for LLM Routing with Near-Oracle Utility. arXiv:2409.12574. https://arxiv.org/abs/2409.12574
  21. [21] Wang, K., Shen, T., and Liu, Y. (2024). ProbeDirichlet: Hidden-State Routing for LLM Query Allocation. arXiv:2406.09017. https://arxiv.org/abs/2406.09017
  22. [22] Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E., and Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132. https://arxiv.org/abs/2403.04132
  23. [23] Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. https://www.taylorfrancis.com/books/9780203771587
  24. [24] Pang, R. Y., Roller, S., Cho, K., He, H., and Weston, J. (2024). Leveraging Implicit Feedback from Deployment Data in Dialogue. arXiv:2307.14117. https://arxiv.org/abs/2307.14117
  25. [25] Irvine, A., Bouquin, M., Jain, A., and Ahmed, M. (2023). Rewarding Chatbots for Real-World Engagement with Millions of Users. arXiv:2303.06135. https://arxiv.org/abs/2303.06135
  26. [26] Lin, J., Neville, J., and others (2024). Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models. arXiv:2403.12388. https://arxiv.org/abs/2403.12388
  27. [27] Clarke, A., Patel, B., and Kumar, R. (2024). One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI. arXiv:2401.07123. https://arxiv.org/abs/2401.07123
  28. [28] Lin, B. Y., Deng, Y., Chandu, K., Brahman, F., Peng, H., Pyatkin, V., Ammanabrolu, P., and Choi, Y. (2024). WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild. arXiv:2406.04770. https://arxiv.org/abs/2406.04770
  29. [29] Si, C., Shi, Z., Zhao, C., and Feng, S. (2023). Large Language Models Help Humans Verify Truthfulness Except When They Are Convincingly Wrong. arXiv:2310.12558. https://arxiv.org/abs/2310.12558
  30. [30] Holm, S. (1979). A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics, 6(2), 65-70. https://www.jstor.org/stable/4615733
  31. [31] Simmons, J. P., Nelson, L. D., and Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359-1366. https://journals.sagepub.com/doi/10.1177/0956797611417632
  32. [32] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021. https://dl.acm.org/doi/10.1145/3442188.3445922
  33. [33] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171. https://arxiv.org/abs/2203.11171
  34. [34] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. arXiv:2306.05685. https://arxiv.org/abs/2306.05685
  35. [35] He, P., Gao, J., and Chen, W. (2021). DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543. https://arxiv.org/abs/2111.09543
  36. [36] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629. https://arxiv.org/abs/2210.03629
  37. [37] Chen, W., Ma, X., Wang, X., and Cohen, W. W. (2022). Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv:2211.12588. https://arxiv.org/abs/2211.12588
  38. [38] Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Welleck, S., Majumder, B. P., Gupta, S., Yazdanbakhsh, A., and Clark, P. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023. arXiv:2303.17651. https://arxiv.org/abs/2303.17651
  39. [39] Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., and Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. arXiv:2303.11366. https://arxiv.org/abs/2303.11366
  40. [40] Li, J., Zhang, Q., Yu, Y., Fu, Q., and Ye, D. (2024). More Agents Is All You Need. arXiv:2402.05120. https://arxiv.org/abs/2402.05120
  41. [41] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. arXiv:1701.06538. https://arxiv.org/abs/1701.06538
  42. [42] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021). Training Verifiers to Solve Math Word Problems. arXiv:2110.14168. https://arxiv.org/abs/2110.14168
  43. [43] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021). Aligning AI With Shared Human Values. arXiv:2008.02275. https://arxiv.org/abs/2008.02275
  44. [44] Lightman, H., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Lee, T., Tworek, J., Schulman, J., and Cobbe, K. (2023). Let's Verify Step by Step. arXiv:2305.20050. https://arxiv.org/abs/2305.20050
  45. [45] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
  46. [46] Bradley, R. A. and Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345. https://www.jstor.org/stable/2334029
  47. [47] Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv:2308.07201. https://arxiv.org/abs/2308.07201
  48. [48] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073. https://arxiv.org/abs/2212.08073
  49. [49] Chen, L., Zaharia, M., and Zou, J. (2023). FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. arXiv:2305.05176. https://arxiv.org/abs/2305.05176
  50. [50] Kinniment, M., Sato, L. J. K., Du, H., Goodrich, B., Haidu, M., Chan, L., Miles, B., Kaufmann, T., Huben, M., Morrill, D., Elias, R., Elman, S., Pieler, M., Zhukova, K., Mallen, A., Rahtz, M., and Barnes, B. (2024). Evaluating Language-Model Agents on Realistic Autonomous Tasks. arXiv:2312.11671. https://arxiv.org/abs/2312.11671
  51. [51] Kapoor, S., Bommasani, R., Klyman, K., Longpre, S., Raghunathan, A., Liang, P., Ho, D. E., Henderson, P., Hashimoto, T., and Narayanan, A. (2024). On the Societal Impact of Open Foundation Models. arXiv:2403.07918. https://arxiv.org/abs/2403.07918
  52. [52] Zhou, A., Wang, K., Lu, Z., Shi, H., Luo, S., Qin, Z., Lu, S., Jain, A., Gao, M., Wagle, N., Su, Y., and Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854. https://arxiv.org/abs/2307.13854