# Appendix: Task-Type Routing for Multi-Agent Debate

**Sparse Halo Research, April 2026**

*Supplementary material for: "Task-Type Routing for Multi-Agent Debate: When Does Deliberation Add Value?"*

---

## Appendix A: Complete Reference List

All 52 sources reviewed in this study, organized by thematic category.

---

### A1: Multi-Agent Debate: Task-Type Evidence (Papers 1-14)

**[1] Du et al. 2023**
- **Authors:** Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch
- **Institutions:** MIT, Google DeepMind
- **Year:** 2023 (ICML 2024)
- **Title:** Improving Factuality and Reasoning in Language Models through Multiagent Debate
- **Venue:** ICML 2024
- **arXiv ID:** 2305.14325
- **DOI:** 10.48550/arXiv.2305.14325
- **URL:** https://arxiv.org/abs/2305.14325
- **Key Finding:** Multi-agent debate improves accuracy across arithmetic (+14.8 pp), GSM8K (+8.0 pp), chess (+31.5 delta pawn score), and MMLU (+7.2 pp) relative to single-agent baselines using GPT-3.5-Turbo; debate outperforms self-reflection on all benchmarks tested. Note: paper exclusively tests tasks expected to benefit from debate and does not evaluate commonsense recall or sequential planning.

**[2] Liang et al. 2023**
- **Authors:** Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, Zhaopeng Tu
- **Institutions:** The Chinese University of Hong Kong (Shenzhen), Tencent AI Lab
- **Year:** 2023 (EMNLP 2024)
- **Title:** Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- **Venue:** EMNLP 2024
- **arXiv ID:** 2305.19118
- **URL:** https://arxiv.org/abs/2305.19118
- **ACL URL:** https://aclanthology.org/2024.emnlp-main.992/
- **Key Finding:** Introduces the Degeneration-of-Thought (DoT) failure mode for self-reflection and shows that MAD resolves it on counter-intuitive arithmetic reasoning (+11 pp, 26% -> 37%) and culturally contextual machine translation (+0.64/5.0 on human evaluation); adaptive debate stopping outperforms fixed-round protocols.

**[3] Smit et al. 2023**
- **Authors:** Andries Smit, Nathan Grinsztajn, Paul Duckworth, Thomas D. Barrett, Arnu Pretorius
- **Institution:** InstaDeep
- **Year:** 2023 (NeurIPS 2023 Workshop)
- **Title:** Should We Be Going MAD? A Look at Multi-Agent Debate Strategies for LLMs
- **Venue:** NeurIPS 2023 Workshop on Foundation Models for Decision Making
- **arXiv ID:** 2311.17371
- **URL:** https://arxiv.org/abs/2311.17371
- **Key Finding:** Benchmarks 5 MAD strategies across 7 datasets (MedQA, GPQA, CosmosQA, CIAR, Chess, MMLU, PubMedQA); MAD gains are marginal across most benchmarks but a single hyperparameter finding stands out -- Multi-Persona with 90% agreement instruction yields +15 pp on a USMLE subset, the largest effect in the paper; MAD is more sensitive to hyperparameter tuning than single-agent methods.

**[4] Wynn et al. 2025**
- **Authors:** Andrea Wynn, Harsh Satija, Gillian K. Hadfield
- **Institution:** University of Toronto
- **Year:** 2025
- **Title:** Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate
- **Venue:** ICML 2025 Multi-Agent Systems Workshop
- **arXiv ID:** 2509.05396
- **DOI:** 10.48550/arXiv.2509.05396
- **URL:** https://arxiv.org/abs/2509.05396
- **Key Finding:** CommonsenseQA consistently degrades under debate across all 10 agent configurations (GPT-4o-mini, LLaMA, Mistral combinations); debate hurts MMLU accuracy in 7 of 10 configurations; GSM8K shows mixed results (+2.8 to -6.8 pp); the primary mechanism is social pressure -- the probability of a Correct-to-Incorrect flip is highest when an agent is isolated (no peers agree), and sycophancy score correlates with negative answer rate at r = 0.902 (p < 0.001).

**[5] Zhang et al. 2025**
- **Authors:** Hangfan Zhang, Zhiyao Cui, Jianhao Chen, Xinrun Wang, Qiaosheng Zhang, Zhen Wang, Dinghao Wu, Shuyue Hu
- **Year:** 2025 (revised June 2025)
- **Title:** Stop Overvaluing Multi-Agent Debate -- We Must Rethink Evaluation and Embrace Model Heterogeneity
- **Venue:** arXiv preprint (position paper)
- **arXiv ID:** 2502.08788
- **DOI:** 10.48550/arXiv.2502.08788
- **URL:** https://arxiv.org/abs/2502.08788
- **Key Finding:** MAD fails to beat Self-Consistency on most benchmarks for most models; the single most reliable intervention is model heterogeneity -- heterogeneous MAD (GPT-4o-mini + Llama3.1-70B) achieves +29.3% over CoT average on MATH, +11.4% on MMLU-Pro, and consistently outperforms homogeneous MAD on all 9 tested benchmarks; MBPP is the one benchmark where heterogeneous MAD also fails (-2.0%), indicating procedural code generation is resistant to debate gains.

**[6] Wu et al. 2025**
- **Authors:** Haolun Wu, Zhenkun Li, Lingyao Li
- **Year:** 2025
- **Title:** Can LLM Agents Really Debate? A Controlled Study of Multi-Agent Debate in Logical Reasoning
- **Venue:** arXiv preprint
- **arXiv ID:** 2511.07784
- **DOI:** 10.48550/arXiv.2511.07784
- **URL:** https://arxiv.org/abs/2511.07784
- **Key Finding:** Controlled study on Knight-Knave-Spy (KKS) logic puzzles (1,800 total, sizes 4-9) isolates six debate factors via regression; initial agent accuracy (beta = 0.600, p < 0.001) and initial disagreement among agents (beta = 0.085, p < 0.001) are the dominant predictors of debate success (R² = 0.393); structural parameters (debate order, confidence visibility, debate depth) are non-significant; debate improves instance accuracy by +32 to +52 pp across puzzle sizes on a purely deductive task with verifiable ground truth.
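
The factor-isolation design in [6] can be illustrated with a small synthetic regression: regress per-instance debate success on standardized candidate factors and compare the resulting beta coefficients. All data, factor names, and effect sizes below are invented stand-ins, not the paper's data.

```python
# Sketch of a factor-isolation regression (synthetic data, illustrative only).
import numpy as np

rng = np.random.default_rng(1)
n = 1000
initial_acc  = rng.uniform(0, 1, n)        # initial agent accuracy
disagreement = rng.uniform(0, 1, n)        # initial disagreement among agents
depth        = rng.integers(1, 5, n)       # debate depth (structural factor)

# Synthetic outcome: only the first two factors matter, mirroring the finding
# that structural parameters are non-significant.
success = 0.6 * initial_acc + 0.1 * disagreement + rng.normal(0, 0.2, n)

X = np.column_stack([initial_acc, disagreement, depth]).astype(float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize -> comparable betas
X = np.column_stack([np.ones(n), X])       # add intercept column

beta, *_ = np.linalg.lstsq(X, success, rcond=None)
print(dict(zip(["intercept", "initial_acc", "disagreement", "depth"],
               np.round(beta, 3))))
```

Under this setup, the coefficient on initial accuracy dominates and the structural factor's coefficient is indistinguishable from zero, which is the qualitative pattern the paper reports.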

**[7] Kim et al. 2026**
- **Authors:** Yubin Kim, Xin Liu
- **Institutions:** Google Research, MIT
- **Year:** 2025 (blog January 28, 2026)
- **Title:** Towards a Science of Scaling Agent Systems
- **Venue:** arXiv preprint + Google Research Blog
- **arXiv ID:** 2512.08296
- **URL:** https://arxiv.org/abs/2512.08296
- **Blog URL:** https://research.google/blog/towards-a-science-of-scaling-agent-systems-when-and-why-agent-systems-work/
- **Key Finding:** Cross-validates 180 agent configurations across 4 benchmarks (Finance-Agent, PlanCraft, BrowseComp-Plus, Workbench); centralized MAS achieves +80.9% on Finance-Agent but every MAS variant degrades PlanCraft (-39 to -70%); a mixed-effects predictive model achieves R²_CV = 0.513 (87% accuracy); key predictors are tool diversity (beta = +0.535), single-agent baseline difficulty (beta = +0.319), and the interaction of baseline difficulty × agent count (beta = -0.408, the "baseline paradox"); independent MAS amplifies errors 17.2× vs. single-agent baseline 1.0×.

**[8] Yang et al. 2025**
- **Authors:** Yongjin Yang, Euiin Yi, Jongwoo Ko, Kimin Lee, Zhijing Jin, Se-Young Yun
- **Institutions:** KAIST, ETH Zurich, Max Planck Institute
- **Year:** 2025
- **Title:** Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
- **Venue:** ICML 2025
- **arXiv ID:** 2505.22960
- **URL:** https://arxiv.org/abs/2505.22960
- **Key Finding:** At matched compute budgets (N = 16 generations), Self-Consistency outperforms MAD on MATH500 for all models 7B and above; MAD advantages are confined to weaker models (Qwen2.5-1.5B, 3B) on harder tasks (AIME); for safety-alignment tasks, heterogeneous MAD uniquely reduces attack success rate escalation that occurs under Self-Refinement -- establishing safety as a domain where MAD adds value beyond accuracy.

**[9] Zeng et al. 2026 (M3MAD-Bench)**
- **Authors:** Jiaqi Zeng et al.
- **Year:** 2026
- **Title:** M3MAD-Bench: Are Multi-Agent Debates Really Effective Across Multi-Domain, Multi-Modal Tasks?
- **Venue:** arXiv preprint
- **arXiv ID:** 2601.02854
- **URL:** https://arxiv.org/abs/2601.02854
- **Key Finding:** Benchmarks 3 MAD methods across 13 datasets (7 text, 6 multimodal) spanning 9 models; LLM Debate (standard collaborative) outperforms CoT on MATH (+3.8 pp), MMLU-Pro (+3.2 pp), and GPQA (+2.4 pp) with GPT-4o-mini, while showing no benefit on standard MMLU (-0.8 pp); multimodal tasks show consistent debate gains for visual reasoning (VisualPuzzles: +14 pp for InternVL3-14B); adversarial Div-MAD consistently underperforms and degrades weaker models (-12.8 pp on average with LLaMA3.1-8B).

**[10] Cui et al. 2025 (Sycophancy in MAD)**
- **Authors:** Zhiyao Cui et al.
- **Year:** 2025
- **Title:** How Sycophancy Shapes Multi-Agent Debate
- **Venue:** arXiv preprint
- **arXiv ID:** 2509.23055
- **URL:** https://arxiv.org/abs/2509.23055
- **Key Finding:** Sycophancy score (SS) and Negative Answer Rate (NAR) are correlated at Pearson r = 0.902 (p < 0.001) across CommonsenseQA settings -- agents that abandon correct answers do so via superficial agreement rather than genuine reasoning; decentralized 2-agent homogeneous debate (Llama-Llama on CommonsenseQA) shows 86.36% debate confusion rate (DCR); centralized architecture with judge reduces DCR from 81.71% to 41.27% for the same agent configuration.

**[11] Chan et al. 2023 (ChatEval)**
- **Authors:** Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, Zhiyuan Liu
- **Institutions:** Tsinghua University, Carnegie Mellon University
- **Year:** 2023 (ICLR 2024)
- **Title:** ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
- **Venue:** ICLR 2024
- **arXiv ID:** 2308.07201
- **URL:** https://arxiv.org/abs/2308.07201
- **Key Finding:** MAD with diverse role prompts achieves superior alignment with human evaluation vs. single-agent scoring on open-ended text quality tasks; same-role agents degrade performance, establishing that diverse personas are essential for evaluation tasks; open-ended text evaluation is identified as a task category where MAD is particularly suited because multiple quality dimensions can be assessed in parallel by different agents.

**[12] Wang et al. 2025 (Role Allocation)**
- **Authors:** (First authors as listed on ICML 2025)
- **Year:** 2025
- **Title:** Enhancing Effective Scaling through Role Allocation Strategies
- **Venue:** ICML 2025
- **URL:** https://icml.cc/virtual/2025/49295
- **Key Finding:** "Truth Last" role allocation -- placing the truth-seeker agent at the end of debate order -- yields +22% improvement in reasoning tasks; the MADC (Multi-Agent Debate Consistency) framework simulates the truth agent as the highest-consistency role and validates across 9 LLMs including DeepSeek-R1 Distilled; role position matters for reasoning tasks but not for factual recall, further stratifying the task-type dependency of MAD design choices.

**[13] "Impact of Multi-Agent Debate Protocols on Debate Quality" 2026**
- **Year:** 2026
- **Title:** Impact of Multi-Agent Debate Protocols on Debate Quality
- **Venue:** arXiv preprint
- **arXiv ID:** 2603.28813
- **URL:** https://arxiv.org/abs/2603.28813
- **Key Finding:** Rank-Adaptive Cross-Round (RA-CR) protocol achieves faster convergence than standard Cross-Round (CR); Within-Round (WR) maximizes peer-referencing rate but slows convergence; No-Interaction (NI) maximizes argument diversity; the paper explicitly motivates complexity-triggered invocation of debate protocols, stating "interaction-rich and convergence-oriented protocols should be selected conditionally" -- directly supporting the routing hypothesis.

**[14] Clarke et al. 2024 (One Agent Too Many)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2024
- **Title:** One Agent Too Many: User Perspectives on Approaches to Multi-agent Conversational AI
- **Venue:** arXiv preprint
- **arXiv ID:** 2401.07123
- **URL:** https://arxiv.org/abs/2401.07123
- **Key Finding:** Within-subjects user study (N = 19, >= 10 tasks each across 10 domains) comparing transparent multi-agent orchestration (One For All, OFA) vs. explicit agent selection; OFA achieves SUS score 86 (SD = 8.9) vs. 56 (SD = 21.7) for explicit selection (p < 0.01); task accuracy 71% vs. 57%; non-desirable response rate 3.75% vs. > 29% for commercial assistants -- establishing that users strongly prefer transparent orchestration and that evaluating MAS configurations should present the orchestrated system rather than exposing agent structure.

---

### A2: Difficulty-Aware Routing and Adaptive Inference (Papers 15-28)

**[15] Su et al. 2025 (DAAO)**
- **Authors:** Jing Su, Qian Lan, Ye Xia, Longfei Sun, Wei Tian, Tianmin Shi, Xin Song, Lihui He, Jian Yang
- **Year:** 2025 (accepted WWW 2026)
- **Title:** Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows
- **Venue:** WWW 2026
- **arXiv ID:** 2509.11079
- **DOI:** 10.48550/arXiv.2509.11079
- **URL:** https://arxiv.org/abs/2509.11079
- **Key Finding:** DAAO uses a VAE-based query difficulty estimator to dynamically generate query-specific multi-agent workflows; the self-adjusting policy -- updating difficulty estimates online from workflow success/failure -- is essential (ablation without adaptation drops HumanEval from 93.37% to 90.21%); DAAO achieves 83.26% average accuracy across 6 benchmarks (best published) at $2.61 total cost vs. $24.16 for AFlow, and +17.97% on GAIA-level agentic tasks over the next best baseline.

**[16] Ong et al. 2024 (RouteLLM)**
- **Authors:** Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, Ion Stoica
- **Institution:** UC Berkeley Sky Computing Lab
- **Year:** 2024
- **Title:** RouteLLM: Learning to Route LLMs with Preference Data
- **Venue:** arXiv preprint
- **arXiv ID:** 2406.18665
- **DOI:** 10.48550/arXiv.2406.18665
- **URL:** https://arxiv.org/abs/2406.18665
- **Key Finding:** Trains routers between strong (GPT-4) and weak (Mixtral-8x7B) models using 65K Chatbot Arena preference battles; matrix factorization router with augmented LLM-judge data achieves 73% fewer GPT-4 calls at 95% of GPT-4 quality on MT-Bench (3.66× cost reduction); routers trained on one model pair generalize to other pairs without retraining, suggesting partial model-independence of the difficulty concept; router training data augmentation via LLM judge (~$700 for 120K labels) provides the largest performance gain (+60% APGR).

**[17] Ding et al. 2024 (HybridLLM)**
- **Authors:** (First authors as listed on arXiv, Microsoft Research)
- **Institutions:** Microsoft Research
- **Year:** 2024
- **Title:** Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- **Venue:** arXiv preprint
- **arXiv ID:** 2404.14618
- **URL:** https://arxiv.org/abs/2404.14618
- **Key Finding:** DeBERTa-v3-large (300M parameters) router runs at 36 ± 2 ms -- 222× faster than Llama-2-7B (7,990 ms) and 12.8× faster than FLAN-T5 (460 ms) -- while achieving 40% cost advantage at < 0.2% quality drop on MixInstruct; the transformed label variant (r_trans, maximizing training label variance) is most effective in high-difficulty routing scenarios; establishes the 36ms DeBERTa latency as the practical reference benchmark for production-grade query routing.

**[18] Chen et al. 2023 (FrugalGPT)**
- **Authors:** Lingjiao Chen, Matei Zaharia, James Zou
- **Institution:** Stanford University
- **Year:** 2023
- **Title:** FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
- **Venue:** arXiv preprint
- **arXiv ID:** 2305.05176
- **DOI:** 10.48550/arXiv.2305.05176
- **URL:** https://arxiv.org/abs/2305.05176
- **Key Finding:** Defines and instantiates the LLM cascade strategy (try cheap model first; escalate to expensive model only if quality is insufficient); achieves 98% cost reduction matching GPT-4 performance and +4% accuracy at the same cost as GPT-4-only deployment; documents a 100× API price differential between cheapest (Mixtral ~$0.24/M tokens) and most expensive (GPT-4 ~$24.7/M tokens) models; establishes the empirical and theoretical foundation for all subsequent routing work.
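
The cascade strategy defined in [18] reduces to a few lines: accept the cheap model's answer when a quality scorer is confident, otherwise escalate. The models, scorer, and threshold below are toy placeholders, not FrugalGPT's actual components.

```python
# Minimal two-tier LLM cascade sketch (illustrative placeholders throughout).
from typing import Callable

def cascade(query: str,
            cheap_model: Callable[[str], str],
            strong_model: Callable[[str], str],
            score: Callable[[str, str], float],
            threshold: float = 0.8) -> tuple[str, str]:
    """Return (answer, model_used): try the cheap model, escalate if needed."""
    answer = cheap_model(query)
    if score(query, answer) >= threshold:
        return answer, "cheap"             # quality sufficient: stop here
    return strong_model(query), "strong"   # escalate to the expensive model

# Toy stand-ins: the cheap model only "knows" short queries.
cheap  = lambda q: "42" if len(q) < 20 else "unsure"
strong = lambda q: "42"
scorer = lambda q, a: 0.0 if a == "unsure" else 1.0

print(cascade("2*21?", cheap, strong, scorer))                            # ('42', 'cheap')
print(cascade("a much longer, harder question", cheap, strong, scorer))  # ('42', 'strong')
```

The cost saving comes from the fraction of traffic that terminates at the cheap tier; the scorer's calibration determines how much quality is traded away.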

**[19] Gao et al. 2025 (Hybrid Agentic)**
- **Authors:** Mingjie Gao, Yudong Li, Bo Yu, Yilong Wang, Fangming Lai
- **Year:** 2025
- **Title:** Single-agent or Multi-agent Systems? Why Not Both?
- **Venue:** arXiv preprint
- **arXiv ID:** 2505.18286
- **DOI:** 10.48550/arXiv.2505.18286
- **URL:** https://arxiv.org/abs/2505.18286
- **Key Finding:** Systematically documents MAS token overhead ratios (GSM8K: 34.66× prefill; AIME: 220.22× prefill; HumanEval: 5.14× prefill) and introduces a hybrid cascade (SAS first; escalate to MAS on failure); cascade achieves +1.1-12% accuracy over best individual system while reducing token cost up to 88%; identifies three MAS failure types (node defect, edge defect, path defect); documents that MAS advantage over SAS has shrunk from +10-16% (ChatGPT era) to +1-3% (Gemini 2.0 Flash era) on code generation tasks.

**[20] Yue et al. 2025 (MasRouter)**
- **Authors:** Yue Yue, Guanyu Zhang, Bing Liu, Gongwei Wan, Kai Wang, Da Cheng, Yue Qi
- **Year:** 2025
- **Title:** MasRouter: Learning to Route LLMs for Multi-Agent Systems
- **Venue:** arXiv preprint
- **arXiv ID:** 2502.11133
- **DOI:** 10.48550/arXiv.2502.11133
- **URL:** https://arxiv.org/abs/2502.11133
- **Key Finding:** First paper to formalize Multi-Agent System Routing (MASR) as a unified problem jointly deciding collaboration mode, role allocation, and LLM assignment; cascaded controller achieves +1.8%-8.2% accuracy improvement over SOTA on MBPP and up to -52.07% overhead reduction on HumanEval; used as the primary baseline in DAAO, where DAAO further improves HumanEval from 90.62% to 94.65%.

**[21] Zhu et al. 2025 (Hidden-State Difficulty)**
- **Authors:** Yixin Zhu, Di Liu, Zitao Lin, Weijie Tong, Siyuan Zhong, Jingyi Shao
- **Year:** 2025 (EMNLP 2025)
- **Title:** The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
- **Venue:** EMNLP 2025
- **arXiv ID:** 2509.12886
- **DOI:** 10.48550/arXiv.2509.12886
- **URL:** https://arxiv.org/abs/2509.12886
- **Key Finding:** Difficulty estimation from the initial hidden state (h_0) -- one forward pass, zero tokens generated -- outperforms all baseline difficulty estimators; enables pre-generation routing with zero additional inference cost; applied to gate Self-Consistency, Best-of-N, and Self-Refine for higher inference efficiency; demonstrates that LLM internal representations encode sufficient information to predict generation success before generation begins.

**[22] Lugoloobi et al. 2026 (Linear Probe Routing)**
- **Authors:** William Lugoloobi, Tom Foster, Wendy Bankes, Chris Russell
- **Year:** 2026
- **Title:** LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations
- **Venue:** arXiv preprint
- **arXiv ID:** 2602.09924
- **DOI:** 10.48550/arXiv.2602.09924
- **URL:** https://arxiv.org/abs/2602.09924
- **Key Finding:** Linear probes on pre-generation activations predict model success with up to 70% inference cost reduction on MATH while exceeding best single-model accuracy; critically establishes that model-perceived difficulty diverges from human difficulty, and this divergence increases with extended reasoning -- human difficulty labels are therefore unreliable proxies for routing; training on model-specific outcomes (did the model succeed?) is required for effective routing classifiers.
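
The probe-based routing in [21] and [22] can be sketched as a logistic-regression probe trained on whether the model later solved each instance. The activations and labels below are synthetic; in the papers' setting they would come from the LLM's hidden state before generation.

```python
# Sketch of success prediction from pre-generation activations (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                    # hidden-state dimension (toy)
w_true = rng.normal(size=d)               # pretend "success direction"

X = rng.normal(size=(2000, d))            # pre-generation activations
y = (X @ w_true + rng.normal(scale=2.0, size=2000) > 0).astype(int)

# Train on model-specific outcomes (did the model succeed?), per [22].
probe = LogisticRegression(max_iter=1000).fit(X[:1500], y[:1500])
acc = probe.score(X[1500:], y[1500:])
print(f"held-out probe accuracy: {acc:.2f}")

# Routing rule: escalate to a stronger model only when predicted success is low.
p_success = probe.predict_proba(X[1500:])[:, 1]
escalate = p_success < 0.5
print(f"fraction escalated: {escalate.mean():.2f}")
```

Because the probe fires before any tokens are generated, the routing decision itself adds essentially zero inference cost.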

**[23] Liu et al. 2026 (CASTER)**
- **Authors:** Shuo Liu, Xiao Yuan, Tianhua Chen, Zhixuan Zhan, Zhaoyu Han, Dashi Zheng, Wei Zhang, Shanshan Cao
- **Year:** 2026
- **Title:** CASTER: Breaking the Cost-Performance Barrier in Multi-Agent Orchestration via Context-Aware Strategy for Task Efficient Routing
- **Venue:** arXiv preprint
- **arXiv ID:** 2601.19793
- **DOI:** 10.48550/arXiv.2601.19793
- **URL:** https://arxiv.org/abs/2601.19793
- **Key Finding:** Lightweight dual-signal router combining semantic task embeddings and structural meta-features of the task's position in a workflow graph; self-optimizes via on-policy negative feedback; achieves up to 72.4% inference cost reduction while matching the success rate of always-strong-model baselines; consistently outperforms FrugalGPT and heuristic routing across Software Engineering, Data Analysis, Scientific Discovery, and Cybersecurity domains.

**[24] Cheng et al. 2025 (AdaptiveLLM)**
- **Authors:** Jingxuan Cheng, Fuchen Liu, Cheng Wu, Lei Zhang
- **Year:** 2025
- **Title:** AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length
- **Venue:** Proceedings of the 16th International Conference on Internetware, ACM
- **DOI:** 10.1145/3755881.3755925
- **URL:** https://dl.acm.org/doi/10.1145/3755881.3755925
- **Key Finding:** Chain-of-Thought length generated by a capable reasoning model correlates with task difficulty as perceived by LLMs; CoT-based difficulty labels are more reliable for model selection than human-annotated difficulty labels; AdaptiveLLM achieves +7.86% pass@1 improvement vs. ComplexityNet baseline with -88.9% resource consumption reduction at equivalent cost to a single model.

**[25] Li et al. 2025 (RoutingGen)**
- **Authors:** Shuzheng Li, Luyu Huang, Siyuan Zhan, Wei Sun, Tingting Yin, Zheng Liu, Meng Yan
- **Year:** 2025
- **Title:** Intention Chain-of-Thought Prompting with Dynamic Routing for Code Generation
- **Venue:** arXiv preprint
- **arXiv ID:** 2512.14048
- **DOI:** 10.48550/arXiv.2512.14048
- **URL:** https://arxiv.org/abs/2512.14048
- **Key Finding:** Routes between few-shot prompting (simple tasks) and Intention Chain-of-Thought prompting (complex tasks) based on difficulty; reduces token usage by 46.37% average across 6 code generation benchmarks and 3 LLMs while achieving state-of-the-art performance on challenging benchmarks; operationalizes the "cognitive economy principle" -- engage structured reasoning only when necessary.

**[26] Wang et al. 2026 (AgentConductor)**
- **Authors:** Shuo Wang, Ruizhi Lu, Zegang Yang, Yuchuan Wang, Yanpeng Zhang, Linbo Xu, Qikun Xu, Guowei Yin, Congcong Chen, Xi Guan
- **Year:** 2026
- **Title:** AgentConductor: Topology Evolution for Multi-Agent Competition-Level Code Generation
- **Venue:** arXiv preprint
- **arXiv ID:** 2602.17100
- **DOI:** 10.48550/arXiv.2602.17100
- **URL:** https://arxiv.org/abs/2602.17100
- **Key Finding:** RL-optimized MAS that infers task difficulty and constructs density-aware DAG topologies; more difficult tasks receive denser communication graphs while simpler tasks receive sparser graphs; achieves +14.6% pass@1 improvement and -68% token cost reduction vs. fixed topology baselines across 3 competition-level and 2 foundational code datasets; shows that even within the multi-agent path, difficulty-adaptive topology dominates fixed-topology deployment.

**[27] Wang and Tong 2025 (DAGP)**
- **Authors:** Shuo Wang, Guoqiang Tong
- **Year:** 2025
- **Title:** DAGP: Difficulty-Aware Graph Pruning for LLM-Based Multi-Agent System
- **Venue:** ACM CIKM 2025
- **DOI:** 10.1145/3746252.3760954
- **URL:** https://dl.acm.org/doi/10.1145/3746252.3760954
- **Key Finding:** Integrates a difficulty estimation module with a sparsity control mechanism to selectively activate or prune agent communication edges; instance-easy queries receive sparser graphs, hard queries receive denser communication; achieves 45% average token usage reduction while maintaining state-of-the-art accuracy across diverse benchmarks.

**[28] Dai et al. 2025 (Aragog)**
- **Authors:** Yuqi Dai, Zhibo Chen, Aditya Iyer, Ravi Netravali
- **Year:** 2025
- **Title:** Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows
- **Venue:** arXiv preprint
- **arXiv ID:** 2511.20975
- **DOI:** 10.48550/arXiv.2511.20975
- **URL:** https://arxiv.org/abs/2511.20975
- **Key Finding:** Adapts model selection per-stage throughout workflow execution rather than only at request start; decouples accuracy-preserving configuration identification (one-time) from per-stage scheduling (cheap, done at runtime); achieves 50.0%-217.0% maximum serving throughput increase and 32.5%-78.9% median latency reduction at peak load, with accuracy comparable to the most expensive configuration.

---

### A3: User Study Methodology and Preference Measurement (Papers 29-40)

**[29] Chiang et al. 2024 (Chatbot Arena)**
- **Authors:** Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica
- **Institutions:** UC Berkeley, UC San Diego
- **Year:** 2024
- **Title:** Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
- **Venue:** ICML 2024
- **arXiv ID:** 2403.04132
- **URL:** https://arxiv.org/abs/2403.04132
- **Key Finding:** Establishes the canonical crowdsourced pairwise preference evaluation platform; 243,329 total comparisons across 90,051 users and 50 models as of January 2024; Bradley-Terry model with sandwich confidence intervals provides robust ranking; adaptive sampling achieves win-matrix precision <= 0.2 with 4,400 votes (vs. 6,800 for random sampling, 35% efficiency gain); win rates vary from 53% to 97% across topic clusters for the same model pair -- stratified sampling across domains is essential.
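
The Bradley-Terry ranking used by Chatbot Arena can be sketched with standard minorization-maximization (Zermelo) updates on a win matrix. The data below is synthetic, and the paper's sandwich confidence intervals are omitted.

```python
# Sketch of Bradley-Terry score fitting via MM updates (synthetic win matrix).
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = times model i beat model j.
    Returns strengths normalized to sum to 1."""
    m = wins.shape[0]
    p = np.ones(m)
    for _ in range(iters):
        for i in range(m):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(m) if j != i)
            p[i] = total_wins / denom      # MM update for model i
        p /= p.sum()                       # fix the scale indeterminacy
    return p

# Three models; model 0 is clearly strongest in the pairwise record.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]], dtype=float)
scores = fit_bradley_terry(wins)
print(np.argsort(-scores))   # ranking indices, strongest first
```

The normalization step reflects the scale indeterminacy that [37] analyzes formally: only score ratios are identified, so absolute score uncertainties are ill-posed without a constraint.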

**[30] Pang et al. 2024 (BlenderBot Behavioral Proxies)**
- **Authors:** Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
- **Year:** 2024
- **Title:** Leveraging Implicit Feedback from Deployment Data in Dialogue
- **Venue:** arXiv preprint
- **arXiv ID:** 2307.14117
- **URL:** https://arxiv.org/abs/2307.14117
- **Key Finding:** Rigorous validation of implicit behavioral proxies using 3.1M utterances from BlenderBot deployment; next-response length >= 20 words achieves the highest classifier accuracy (0.761) and the strongest win rate improvement (+12%, p < 0.05); Joy & length achieves +9.5% (p < 0.05) with better content safety profile; raw engagement optimization can increase controversial content -- Joy & length is the recommended proxy; expert-LLM agreement is only 64.5% vs. expert-expert 86%, indicating LLM judges are noisier than human experts.

**[31] Irvine et al. 2023 (Chai Research)**
- **Authors:** Jack Irvine et al.
- **Institution:** Chai Research
- **Year:** 2023
- **Title:** Rewarding Chatbots for Real-World Engagement with Millions of Users
- **Venue:** arXiv preprint
- **arXiv ID:** 2303.06135
- **URL:** https://arxiv.org/abs/2303.06135
- **Key Finding:** Production-scale A/B test with N = 10,000 new daily chatbot users per arm; sample-and-rerank with reward model achieves up to 70% Mean Conversation Length (MCL) increase and > 30% user retention increase for GPT-J 6B; establishes MCL as the canonical deployable behavioral engagement proxy, validated by the co-occurrence of MCL and retention improvements; sets the standard for group size (10,000/arm) needed to detect behavioral metric changes in deployed LLM chat products.

**[32] Lin et al. 2024 (SPUR)**
- **Authors:** Zhankui He, Zhouhang Xie, Rahul Jha, Harald Steck, Dawen Liang, Yesu Feng, Bodhisattwa Prasad Majumder, Nathan Kallus, Julian McAuley
- **Institutions:** UC San Diego, Netflix, Microsoft (Bing Copilot team)
- **Year:** 2024
- **Title:** Interpretable User Satisfaction Estimation for Conversational Systems with Large Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2403.12388
- **URL:** https://arxiv.org/abs/2403.12388
- **Key Finding:** SPUR achieves 75.4 weighted F1 on Bing Copilot (> 5 billion conversations) -- the best across 4 datasets and all baselines; 20 satisfaction and dissatisfaction rubric items are all significantly correlated with thumb feedback (chi-square, p < 0.05); dataset-specific rubrics improve average F1 by +13% over general rubrics; establishes rubric-based LLM evaluation as a production-viable alternative to explicit user ratings at billion-conversation scale.

**[33] Clarke et al. 2024**
- **Note:** Same paper as [14], listed here as well because it belongs to both A1 (multi-agent debate task-type evidence) and A3 (user study methodology). See entry [14] for full metadata and the key finding.

**[34] Zheng et al. 2023 (LMSYS-Chat-1M)**
- **Authors:** Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zeng, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
- **Institutions:** UC Berkeley, CMU
- **Year:** 2023
- **Title:** LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
- **Venue:** arXiv preprint
- **arXiv ID:** 2309.11998
- **URL:** https://arxiv.org/abs/2309.11998
- **Key Finding:** 1,000,000 multi-turn conversations across 25 LLMs and 210,479 unique users; k-means clustering (k = 20) on 100K English prompts reveals coding and software as the dominant category (~25-30%), with significant unsafe-topic clusters (~5.4% flagged); average user prompt is 69.5 tokens with 2.0 average turns per conversation; provides the empirical foundation for understanding the actual distribution of production LLM queries used in WildBench task stratification.

**[35] Lin et al. 2024 (WildBench)**
- **Authors:** Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, Yejin Choi
- **Institution:** Allen Institute for AI (Ai2)
- **Year:** 2024
- **Title:** WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
- **Venue:** arXiv preprint
- **arXiv ID:** 2406.04770
- **URL:** https://arxiv.org/abs/2406.04770
- **Key Finding:** 1,024 tasks from WildChat filtered for difficulty and diversity across 12 task categories (Information Seeking, Reasoning, Planning, Editing, Coding & Debugging, Math, Role Playing, Data Analysis, Creative Writing, Advice Seeking, Brainstorming, Others); WB-Reward metric achieves Pearson correlation 0.984 with Chatbot Arena Elo (vs. 0.909 for ArenaHard, 0.892 for AlpacaEval2-LC), the highest published correlation to human preference; provides the standard task taxonomy used for query stratification in this study.

**[36] Zhao et al. 2024 (WildChat)**
- **Authors:** Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
- **Year:** 2024
- **Title:** WildChat: 1M ChatGPT Interaction Logs in the Wild
- **Venue:** arXiv preprint
- **arXiv ID:** 2405.01470
- **URL:** https://arxiv.org/abs/2405.01470
- **Key Finding:** 1,039,785 full conversations from real ChatGPT users (April 2023 - May 2024) across 68 languages; average user prompt is 295.58 tokens (4.3× longer than LMSYS-Chat-1M); ~3.7% of conversations exceed 10 turns; > 10% of user turns flagged as potentially toxic; distinctive WildChat-specific cluster (Midjourney prompt engineering) absent from LMSYS -- demonstrates that production datasets have meaningfully different query distributions and cannot be used interchangeably for routing classifier training.

**[37] Ameli et al. 2024 (Statistical Framework)**
- **Authors:** Maryam Ameli, Kaizhao Liang, Ion Stoica, Michael Mahoney
- **Year:** 2024
- **Title:** A Statistical Framework for Ranking LLM-Based Chatbots
- **Venue:** arXiv preprint
- **arXiv ID:** 2412.18407
- **URL:** https://arxiv.org/abs/2412.18407
- **Key Finding:** Analyzes 129 competitor models, 3,455 unique model pairs, and 1,374,996 total comparisons; the Fisher Information Matrix of Bradley-Terry scores is singular at rank <= m-1, making individual score uncertainties ill-posed without normalization; including tie data via factored models reduces RMSE from 35.1×10^-2 (ties excluded) to 17.4×10^-2; Kendall's tau rank correlation ranges 0.96-1.00 across BT model variants, indicating robust rank ordering despite parameter uncertainty.

**[38] Rieder et al. 2026 (SimAB)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2026
- **Title:** SimAB: Simulating A/B Tests with Persona-Conditioned AI Agents
- **Venue:** arXiv preprint
- **arXiv ID:** 2603.01024
- **URL:** https://arxiv.org/abs/2603.01024
- **Key Finding:** AI-agent-simulated A/B testing validated against 47 historical A/B tests with known outcomes; overall accuracy 67%, rising to 83% for high-confidence cases; demonstrates that simulation can pre-screen LLM configuration variants before production deployment, reducing the cost of live traffic experiments when privacy or traffic constraints apply.

**[39] Wan et al. 2024 (TnT-LLM)**
- **Authors:** Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W. White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan
- **Institution:** Microsoft Research
- **Year:** 2024
- **Title:** TnT-LLM: Text Mining at Scale with Large Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2403.12173
- **URL:** https://arxiv.org/abs/2403.12173
- **Key Finding:** Proposes a two-stage pipeline (taxonomy generation + label assignment) for scalable text mining at production scale; achieves human-comparable accuracy in taxonomy generation and label assignment tasks; directly relevant to building query intent taxonomies for routing classifiers -- demonstrates that LLM-generated taxonomies and labels are viable alternatives to human annotation for large-scale classification of LLM conversation logs.

**[40] Frick et al. 2025 (Prompt-to-Leaderboard)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2025
- **Title:** Prompt-to-Leaderboard
- **Venue:** arXiv preprint
- **arXiv ID:** 2502.14855
- **URL:** https://arxiv.org/abs/2502.14855
- **Key Finding:** Trains an LLM to output a vector of Bradley-Terry coefficients conditioned on the input prompt, enabling prompt-specific leaderboards rather than aggregate rankings; the resulting router ranked #1 on Chatbot Arena in January 2025, outperforming fixed-aggregate rankings for query routing; demonstrates power-law scaling of performance with training data; establishes that routing each query to its predicted winner outperforms any single fixed model.

---

### A4: Query Classification and Feature Engineering (Papers 41-48)

**[41] Reicherts et al. 2024 (Intent Taxonomy)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2024
- **Title:** User Intent Recognition and Satisfaction with Large Language Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2402.02136
- **URL:** https://arxiv.org/abs/2402.02136
- **Key Finding:** Develops a multilevel taxonomy with 24 fine-grained intent categories under 6 top-level types (Informational, Problem-Solving, Creative, Social, Technical & Professional, Transactional); GPT-4-Turbo achieves 89.64% accuracy and F1 = 88.84% on this taxonomy (vs. GPT-3.5-Turbo at 75.28% / F1 = 74.28%); GPT-4 improves most on factual queries (+20.5 pp) and explanatory inquiries (+23.17 pp); confirms Shah et al. 5-category schema as a viable coarser-grained baseline.

**[42] Hu et al. 2024 (RouterBench)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2024 (ICML 2024)
- **Title:** RouterBench: A Benchmark for Multi-LLM Routing System
- **Venue:** ICML 2024
- **arXiv ID:** 2403.12031
- **URL:** https://arxiv.org/abs/2403.12031
- **Key Finding:** Benchmark of 405,467 inference outcomes covering 11 LLMs and 8 benchmarks (64 tasks in total); simple KNN/MLP routers achieve GPT-4 performance at 10-13× lower cost; cascade routers significantly outperform at error rate <= 0.1 (approaching oracle at epsilon = 0); time-sensitive feature detection (keywords "2024", "latest", "current") enables direct routing to retrieval-augmented models; establishes RouterBench as the standard multi-LLM routing evaluation framework.
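The KNN-router result can be sketched in a few lines of Python. The embeddings, model names, and `k` below are illustrative placeholders, not RouterBench's actual implementation:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_route(query_emb, train_embs, train_best_model, k=5):
    """Route to whichever model won most often on the k nearest
    training queries (a minimal KNN router sketch)."""
    ranked = sorted(range(len(train_embs)),
                    key=lambda i: cosine(query_emb, train_embs[i]),
                    reverse=True)
    votes = Counter(train_best_model[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Despite its simplicity, this family of routers is what achieves GPT-4-level performance at 10-13× lower cost in the benchmark above.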

**[43] Agrawal and Gupta 2025 (LLMRank)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2025
- **Title:** LLMRank: Understanding LLM Strengths for Model Routing
- **Venue:** arXiv preprint
- **arXiv ID:** 2510.01234
- **URL:** https://arxiv.org/abs/2510.01234
- **Key Finding:** Most detailed published feature engineering pipeline for routing; extracts features across 7 dimensions (Difficulty, Task Type, Knowledge Requirements, Output Format, Scenario Complexity, Routing Hints, Quality Indicators + Proxy Model Signals); trained on RouterBench (36,497 prompts, 11 benchmarks, 11 models); LLMRank-Perf achieves 89.2% of oracle utility at 3.3× cheaper than GPT-4-only; LLMRank-Balanced achieves 84.0% oracle utility at 12.2× cheaper than GPT-4.

**[44] IRT-Router 2025**
- **Year:** 2025
- **Title:** IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory
- **Venue:** ACL 2025
- **arXiv ID:** 2506.01048
- **URL:** https://arxiv.org/abs/2506.01048
- **ACL URL:** https://aclanthology.org/2025.acl-long.761/
- **Key Finding:** Applies Multidimensional Item Response Theory (MIRT) from psychometrics -- each LLM has a latent ability vector; each query has discrimination and difficulty parameters; achieves 3% higher accuracy than GPT-4o alone at 1/30 of the cost across 20 LLMs and 12 datasets; strong OOD generalization; online warm-up via semantic similarity to training queries addresses the cold-start problem for new query types.
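A minimal sketch of the response model this approach builds on, assuming a standard MIRT-style logistic form (the exact parameterization in IRT-Router may differ):

```python
import math

def p_correct(ability, discrimination, difficulty):
    """MIRT-style item response curve: probability that an LLM with
    latent ability vector `ability` answers a query with discrimination
    vector `discrimination` and scalar `difficulty` correctly."""
    logit = sum(a * t for a, t in zip(discrimination, ability)) - difficulty
    return 1.0 / (1.0 + math.exp(-logit))
```

Routing then reduces to evaluating this curve for every candidate LLM and picking the cheapest model whose predicted success probability clears a target.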

**[45] Vashurin et al. 2026 (RouterXBench / ProbeDirichlet)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2026
- **Title:** Towards Fair and Comprehensive Evaluation of Routers
- **Venue:** arXiv preprint
- **arXiv ID:** 2602.11877
- **URL:** https://arxiv.org/abs/2602.11877
- **Key Finding:** ProbeDirichlet uses the small LLM's internal cross-layer hidden states to predict whether it can answer correctly -- available before any response is generated; aggregates cross-layer representations via learnable Dirichlet distributions for probabilistic training with deterministic inference; achieves +16.68% relative improvement in router AUROC and +18.86% relative improvement in high-accuracy scenario metric over best text-based baseline; consistent across model families, scales, and agentic workflows.

**[46] Zhang et al. 2024 (TREACLE)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2024
- **Title:** TREACLE: Thrifty Reasoning via Context-Aware LLM and Prompt Selection
- **Venue:** arXiv preprint
- **arXiv ID:** 2404.13082
- **URL:** https://arxiv.org/abs/2404.13082
- **Key Finding:** RL policy for joint model + prompt selection with budget constraints; achieves up to 85% cost savings on GSM8K, CommonsenseQA, and LLC benchmarks; demonstrates that joint optimization of model selection and prompting strategy outperforms optimizing each independently, establishing an important design principle for routing systems that must account for both model capabilities and prompt format.

**[47] Shen et al. 2025 (DAST)**
- **Authors:** Yucheng Shen, Jingling Zhang, Junxi Huang, Shuai Shi, Weitai Zhang, Jun Yan, Nianbin Wang, Kun Wang, Shuya Lian
- **Year:** 2025
- **Title:** DAST: Difficulty-Adaptive Slow Thinking for Large Reasoning Models
- **Venue:** arXiv preprint
- **arXiv ID:** 2503.04472
- **DOI:** 10.48550/arXiv.2503.04472
- **URL:** https://arxiv.org/abs/2503.04472
- **Key Finding:** Introduces Token Length Budget (TLB) as a difficulty quantification metric; budget-aware reward shaping trains models to self-regulate reasoning depth; achieves > 30% token usage reduction on average across datasets while preserving reasoning accuracy on complex problems; TLB provides an extrapolatable difficulty signal from similar historical queries, enabling pre-generation routing estimates without requiring a separate classifier.

**[48] Qu et al. 2025 (MoPPS)**
- **Authors:** Yicheng Qu, Qing Chen, Corbin Mao, Yifan Hu, Bjorn Ommer, Xiaojuan Ji
- **Year:** 2025
- **Title:** Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
- **Venue:** arXiv preprint
- **arXiv ID:** 2507.04632
- **DOI:** 10.48550/arXiv.2507.04632
- **URL:** https://arxiv.org/abs/2507.04632
- **Key Finding:** Bayesian framework for estimating prompt difficulty online via streaming Bayesian inference, with no LLM calls required; models each prompt's success rate as a latent variable with posterior sampling in a multi-armed bandit; applicable to inference-time routing -- estimating difficulty from observed success rates for queries of the same type and using posterior estimates for real-time routing; addresses the online routing problem under distribution shift that static classifiers cannot handle.
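The core streaming update can be sketched with a Beta-Bernoulli model; the class and method names are ours, not from the paper:

```python
import random

class BetaDifficulty:
    """Streaming Beta-Bernoulli estimate of a prompt type's success
    rate, usable for Thompson-sampling-style routing decisions."""

    def __init__(self, alpha=1.0, beta=1.0):
        # Beta(1, 1) = uniform prior over the success rate
        self.alpha, self.beta = alpha, beta

    def update(self, success):
        """Fold in one observed outcome (no LLM call needed)."""
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        """Posterior mean success rate."""
        return self.alpha / (self.alpha + self.beta)

    def sample(self):
        """Posterior draw, for bandit-style exploration."""
        return random.betavariate(self.alpha, self.beta)
```

A router would keep one such estimator per query type and escalate when the sampled success rate of the cheap path falls below a threshold.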

---

### A5: Statistical Methods and Foundational References (Papers 49-52)

**[49] Hamilton et al. 2023 (Bradley-Terry)**
- **Authors:** Mark Hamilton, Scott Tawn, Alan Firth
- **Year:** 2023
- **Title:** The Many Routes to the Ubiquitous Bradley-Terry Model
- **Venue:** arXiv preprint
- **arXiv ID:** 2312.13619
- **URL:** https://arxiv.org/abs/2312.13619
- **Key Finding:** Comprehensive derivation of the Bradley-Terry model from multiple perspectives (maximum likelihood, Bayesian, axiomatic); establishes the identifiability conditions (normalization required), Fisher Information Matrix singularity, and appropriate confidence interval approaches (bootstrap vs. sandwich standard errors); provides the statistical foundation for the pairwise preference evaluation framework used in this study's Phase 3 experimental design.
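For reference, Bradley-Terry strengths can be fit with Hunter's (2004) MM iteration; this sketch renormalizes the scores every step, reflecting the identifiability condition noted above (scores are defined only up to a common scale):

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry strengths by the MM algorithm.
    wins[i][j] = number of times item i beat item j."""
    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            w_i = sum(wins[i][j] for j in range(n_items) if j != i)
            # total games between i and j, weighted by current strengths
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        # normalization constraint: without it the FIM is singular
        p = [x * n_items / s for x in new_p]
    return p
```

The normalization line is exactly the constraint that makes per-item uncertainties well-posed.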

**[50] Petrova et al. 2026 (HUMAINE)**
- **Authors:** (First authors as listed on Semantic Scholar)
- **Year:** 2026
- **Title:** Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
- **Venue:** arXiv preprint (Semantic Scholar, submitted February 3, 2026)
- **URL:** https://www.semanticscholar.org/paper/bcc44ffd86cd85f469c9bbccb9cb6b0336da93ba
- **Key Finding:** 23,404 participants across 22 demographic groups; hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification; finds user age is the primary demographic axis of preference disagreement, with model rankings shifting substantially across age groups; tie rate for "Trust, Ethics & Safety" dimension is 65% (very high ambiguity), while "Overall Winner" tie rate is 10% (high discriminability) -- evaluating multi-agent systems on trustworthiness dimensions will yield highly ambiguous results.

**[51] Frauen et al. 2026 (Nonparametric Preference)**
- **Authors:** (First authors as listed on arXiv)
- **Year:** 2026
- **Title:** Nonparametric LLM Evaluation from Preference Data
- **Venue:** arXiv preprint
- **arXiv ID:** 2601.21816
- **URL:** https://arxiv.org/abs/2601.21816
- **Key Finding:** Proposes DMLEval with Generalized Average Ranking Scores (GARS) that subsume both Bradley-Terry and PageRank under debiased machine learning; provides uncertainty quantification without parametric assumptions and suggests optimal data collection policies under budget constraints; directly applicable to the study's A/B testing methodology for comparing routing configurations under limited comparison budget.

**[52] Lakens 2013 (Effect Sizes)**
- **Author:** Daniel Lakens
- **Institution:** Eindhoven University of Technology
- **Year:** 2013
- **Title:** Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-tests and ANOVAs
- **Venue:** Frontiers in Psychology, 4, Article 863
- **URL:** https://pmc.ncbi.nlm.nih.gov/articles/PMC3840331/
- **Key Finding:** Establishes Cohen's d_z = mu_D / sigma_D as the appropriate effect size for paired/within-subjects designs (d_z = d_s / sqrt(2(1-r)), where r is the within-person correlation); required sample sizes for paired preference designs: d_z = 0.20 -> N = 199 pairs; d_z = 0.50 -> N = 34 pairs; d_z = 0.80 -> N = 15 pairs (all at 80% power, alpha = 0.05); provides the statistical framework for the study's power analysis and sample size justification.
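The tabulated pair counts can be reproduced with the normal approximation plus Guenther's (1981) small-sample correction for the paired t-test:

```python
import math
from statistics import NormalDist

def paired_n(dz, alpha=0.05, power=0.80):
    """Required number of pairs for a paired t-test at effect size d_z,
    two-tailed, via normal approximation + Guenther's correction."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical z for alpha/2
    z_b = NormalDist().inv_cdf(power)           # z for target power
    return math.ceil(((z_a + z_b) / dz) ** 2 + z_a ** 2 / 2)
```

This reproduces the values above: 199 pairs at d_z = 0.20, 34 at 0.50, and 15 at 0.80.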

---

## Appendix B: Key Quantitative Anchors

A consolidated reference table of the most important numbers from the literature, organized by research question.

---

### B1: MAD Task-Type Performance Delta (When Does Debate Help?)

| Paper | Task/Benchmark | Effect | Delta | Condition |
|---|---|---|---|---|
| Du et al. 2023 [1] | Arithmetic | HELPS | +14.8 pp (67.0 -> 81.8%) | GPT-3.5, 3 agents, 2 rounds |
| Du et al. 2023 [1] | GSM8K | HELPS | +8.0 pp (77.0 -> 85.0%) | GPT-3.5, 3 agents, 2 rounds |
| Du et al. 2023 [1] | MMLU | HELPS | +7.2 pp (63.9 -> 71.1%) | GPT-3.5, 3 agents, 2 rounds |
| Liang et al. 2023 [2] | Counter-Intuitive AR | HELPS | +11.0 pp (26.0 -> 37.0%) | GPT-3.5-Turbo |
| Liang et al. 2023 [2] | Common MT (HUMAN) | HELPS | +0.64/5.0 (3.14 -> 3.78) | GPT-3.5-Turbo |
| Smit et al. 2023 [3] | MMLU clinical | NO BENEFIT | ~0 pp | GPT-3.5/4 family |
| Smit et al. 2023 [3] | Multi-Persona 90% agree | HELPS | +15 pp | USMLE subset (N=376) |
| Wynn et al. 2025 [4] | CommonSenseQA | ALWAYS HURTS | -0.8 to -8.0 pp | All 10 agent configurations |
| Wynn et al. 2025 [4] | MMLU | USUALLY HURTS | -1.6 to -12.0 pp | 7/10 configurations hurt |
| Wynn et al. 2025 [4] | GSM8K (3x Mistral) | HURTS | -6.8 pp | Weak homogeneous agents |
| Zhang et al. 2025 [5] | MATH (Hetero-MAD) | STRONGLY HELPS | +29.3% over CoT avg | GPT-4o-mini + Llama3.1-70B |
| Zhang et al. 2025 [5] | MMLU-Pro (Hetero-MAD) | HELPS | +11.4% over CoT avg | GPT-4o-mini + Llama3.1-70B |
| Zhang et al. 2025 [5] | MBPP (Hetero-MAD) | NO BENEFIT | -2.0% over CoT avg | Procedural code fails |
| Wu et al. 2025 [6] | KKS Logic (size 4) | STRONGLY HELPS | +52.0 pp instance acc. | N=300 puzzles, R²=0.393 |
| Wu et al. 2025 [6] | KKS Logic (size 8) | HELPS | +32.7 pp instance acc. | Harder puzzles, smaller gain |
| Kim et al. 2026 [7] | Finance-Agent (Centralized) | STRONGLY HELPS | +80.9% over SAS | Anthropic Claude, avg +127.5% |
| Kim et al. 2026 [7] | PlanCraft (Independent) | SEVERELY HURTS | -70.0% below SAS | Sequential planning task |
| Kim et al. 2026 [7] | PlanCraft (best MAS) | HURTS | -39.0% below SAS | Every MAS variant hurts |
| Kim et al. 2026 [7] | Overall MAS (180 configs) | NO OVERALL EFFECT | -3.5% (CI: -18.6%, +25.7%) | sigma = 45.2% variance |
| Yang et al. 2025 [8] | MATH500 (SC vs. MAD, 32B) | SC BETTER | SC 84.0 vs. best MAD 83.6 | N=16 generations budget |
| Yang et al. 2025 [8] | AIME (MAD 4x4, 3B) | MAD BETTER | SC 8.9 vs. MAD 11.1 | Weak model + hard task |
| Yang et al. 2025 [8] | Safety (heterogeneous MAD) | MAD BETTER | ASR escalation suppressed | vs. Self-Refinement (SR) |
| Zeng et al. 2026 [9] | MATH (GPT-4o-mini) | HELPS | +3.8 pp (77.6 -> 81.4%) | LLM Debate vs. CoT |
| Zeng et al. 2026 [9] | Standard MMLU | NO BENEFIT | -0.8 pp (82.4 -> 81.6%) | LLM Debate vs. CoT |
| Zeng et al. 2026 [9] | VisualPuzzles (multimodal) | STRONGLY HELPS | +14.0 pp | InternVL3-14B |

---

### B2: MAD Failure Mechanisms -- Key Quantitative Evidence

| Paper | Mechanism | Metric | Value |
|---|---|---|---|
| Wynn et al. 2025 [4] | Sycophancy -- NAR/SS correlation | Pearson r | 0.902 (p < 0.001) |
| Cui et al. 2025 [10] | Sycophancy -- NAR/SS correlation | Pearson r | 0.902 (p < 0.001) |
| Cui et al. 2025 [10] | Decentralized homogeneous (CSQA) | Debate Confusion Rate | 86.36% |
| Cui et al. 2025 [10] | Centralized vs. decentralized (Qwen) | DCR reduction | 81.71% -> 41.27% |
| Kim et al. 2026 [7] | Error amplification (Independent MAS) | Amplification factor | 17.2× vs. SAS 1.0× |
| Kim et al. 2026 [7] | Error amplification (Centralized MAS) | Amplification factor | 4.4× vs. SAS 1.0× |
| Kim et al. 2026 [7] | Overhead % (Decentralized) | Communication overhead | 263% vs. SAS 0% |
| Kim et al. 2026 [7] | Overhead % (Hybrid) | Communication overhead | 515% vs. SAS 0% |
| Wu et al. 2025 [6] | Initial accuracy dominance | beta coefficient | 0.600 (p < 0.001, R²=0.393) |
| Wu et al. 2025 [6] | Majority wrong -- correction rate | Range | 3.6%-34.4% (model-dependent) |

---

### B3: Routing and Cost Efficiency -- Key Quantitative Evidence

| Paper | System | Metric | Value |
|---|---|---|---|
| Gao et al. 2025 [19] | MAS/SAS token ratio (AIME) | Prefill overhead | 220.22× |
| Gao et al. 2025 [19] | MAS/SAS token ratio (GSM8K) | Prefill overhead | 34.66× |
| Gao et al. 2025 [19] | MAS/SAS token ratio (HumanEval) | Prefill overhead | 5.14× |
| Gao et al. 2025 [19] | Cascade vs. MAS (GSM8K) | Token ratio | ~12% of MAS tokens |
| Gao et al. 2025 [19] | Cascade accuracy gain | Over best individual system | +1.1-12% accuracy |
| Chen et al. 2023 [18] | FrugalGPT cascade | Cost reduction vs. GPT-4 | 98% at matched performance |
| Chen et al. 2023 [18] | FrugalGPT cascade | Accuracy gain same cost | +4% over GPT-4 |
| Ong et al. 2024 [16] | RouteLLM (MF + judge data) | GPT-4 call reduction | 73% at 95% GPT-4 quality |
| Ong et al. 2024 [16] | RouteLLM (MF) | Cost multiplier | 3.66× cheaper than GPT-4-only |
| Ong et al. 2024 [16] | RouteLLM (cross-model transfer) | APGR improvement | +56.6% (Claude), +49.8% (Llama) |
| Ding et al. 2024 [17] | HybridLLM (DeBERTa) | Router latency | 36 ± 2 ms |
| Ding et al. 2024 [17] | HybridLLM (DeBERTa) | Cost advantage | 40% at < 0.2% quality drop |
| Su et al. 2025 [15] | DAAO | Average accuracy (6 benchmarks) | 83.26% (best published) |
| Su et al. 2025 [15] | DAAO | Total training + inference cost | $2.61 vs. $24.16 (AFlow) |
| Su et al. 2025 [15] | DAAO | GAIA improvement over MaAS | +8.33% |
| Liu et al. 2026 [23] | CASTER | Inference cost reduction | 72.4% at matched success rate |
| Lugoloobi et al. 2026 [22] | Linear probe routing | Inference cost reduction | 70% on MATH |
| Wang et al. 2026 [26] | AgentConductor | Accuracy gain | +14.6% pass@1 |
| Wang et al. 2026 [26] | AgentConductor | Token cost reduction | -68% vs. fixed topology |
| Wang and Tong 2025 [27] | DAGP | Token usage reduction | -45% average |
| Agrawal and Gupta 2025 [43] | LLMRank-Balanced | Cost vs. GPT-4-only | 12.2× cheaper at 84% oracle |

---

### B4: User Study Methodology -- Key Quantitative Anchors

| Metric | Value | Source |
|---|---|---|
| Chatbot Arena total votes (Jan 2024) | 243,329 | Chiang et al. 2024 [29] |
| Average votes per model (Arena) | ~8,000 | Chiang et al. 2024 [29] |
| Adaptive sampling for win-matrix precision <= 0.2 | 4,400 votes | Chiang et al. 2024 [29] |
| Random sampling for win-matrix precision <= 0.2 | 6,800 votes | Chiang et al. 2024 [29] |
| Expert-expert agreement (Arena) | 79%-90% | Chiang et al. 2024 [29] |
| Crowd-expert agreement (Arena) | 73%-83% | Chiang et al. 2024 [29] |
| Win rate range across topic clusters (same model pair) | 53%-97% | Chiang et al. 2024 [29] |
| WildBench -- Arena Elo Pearson correlation | 0.984 | Lin et al. 2024 [35] |
| Next-response length classifier accuracy | 0.761 | Pang et al. 2024 [30] |
| Length signal win-rate improvement | +12.0% (p < 0.05) | Pang et al. 2024 [30] |
| MCL improvement (behavioral reranking) | +70% | Irvine et al. 2023 [31] |
| User retention improvement (MCL proxy) | +30% | Irvine et al. 2023 [31] |
| A/B test group size for behavioral metrics | 10,000/arm | Irvine et al. 2023 [31] |
| SPUR weighted F1 on Bing Copilot | 75.4 | Lin et al. 2024 [32] |
| SPUR rubric -- satisfaction items correlated with thumb feedback | 20/20 (all p < 0.05) | Lin et al. 2024 [32] |
| SUS advantage (OFA vs. Agent Select) | 86 vs. 56 (+30 pts) | Clarke et al. 2024 [14] |
| Task accuracy advantage (OFA vs. Agent Select) | 71% vs. 57% (+14 pp) | Clarke et al. 2024 [14] |
| "Trust, Ethics & Safety" tie rate (HUMAINE) | 65% | Petrova et al. 2026 [50] |
| "Overall Winner" tie rate (HUMAINE) | 10% | Petrova et al. 2026 [50] |
| Sample size (d_z = 0.2, within-subjects, 80% power) | 199 pairs | Lakens 2013 [52] |
| Sample size (d_z = 0.5, within-subjects, 80% power) | 34 pairs | Lakens 2013 [52] |
| Sample size (d = 0.2, between-subjects, 80% power) | 392/arm (784 total) | Cohen/Lakens framework |
| Holm correction alpha for k = 10 comparisons (rank 1) | 0.005 | Statistical standard |
| GPT-4-Turbo intent classification accuracy (24-class) | 89.64%, F1 = 88.84 | Reicherts et al. 2024 [41] |

---

### B5: Classifier Architecture Benchmarks

| Paper | Architecture | Latency | Key Result |
|---|---|---|---|
| Ding et al. 2024 [17] | DeBERTa-v3-large (HybridLLM) | 36 ± 2 ms | 40% cost advantage at < 0.2% quality drop |
| Ong et al. 2024 [16] | Matrix Factorization (RouteLLM) | 6.5 ms (155 req/s) | 73% GPT-4 call reduction, +60.4% APGR |
| Ong et al. 2024 [16] | BERT_BASE classifier (RouteLLM) | 14 ms (70 req/s) | +50.2% APGR with augmented data |
| Agrawal and Gupta 2025 [43] | LLMRank (neural ranker) | ~50 ms | 89.2% oracle utility, 3.3× cheaper |
| IRT-Router 2025 [44] | MIRT + semantic init | -- | +3% vs. GPT-4o at 1/30 cost |
| Vashurin et al. 2026 [45] | ProbeDirichlet (hidden states) | -- | +16.68% AUROC vs. best text-based |
| Zhu et al. 2025 [21] | Hidden-state value function | 1 forward pass | Zero-token difficulty estimation |
| Lugoloobi et al. 2026 [22] | Linear probe (activations) | 1 forward pass | 70% cost reduction on MATH |

---

## Appendix C: Experimental Design Parameters

Detailed parameters for the three-phase experimental design proposed in this study.

---

### C1: Phase 1 -- Task-Type Evidence Collection

**Objective:** Determine which of the 5 WildBench task strata show consistent MAD benefit vs. harm vs. neutral effect.

**Benchmark Suite:**
- *Tasks expected to benefit:* MATH500, AIME 2024/2025, KKS Logic (size 4-6), HumanEval (heterogeneous agents), Finance-domain multi-step reasoning
- *Tasks expected to harm:* CommonsenseQA, MMLU (homogeneous configurations), PlanCraft-style sequential planning
- *Tasks to test (ambiguous prior):* Medical QA (MedQA), GPQA, coding (MBPP), open-ended text evaluation

**MAD Configurations Tested:**
- Baseline: Single agent with CoT (GPT-4o-mini)
- MAD-Homo: 3× same model (GPT-4o-mini × 3)
- MAD-Hetero: GPT-4o-mini + Llama-3.1-70B + Claude-3.5-Haiku
- MAD-Weak: 3× Llama-3.1-8B
- Self-Consistency (SC): 16-sample majority vote (compute-matched to MAD)

**Key Regression Variables (following Wu et al. 2025 [6] and Kim et al. 2026 [7] methodology):**
- Initial agent accuracy (primary predictor, expected beta approx. 0.60)
- Initial inter-agent disagreement (prerequisite for debate benefit)
- Task domain complexity D (Kim et al. sequential dependency metric)
- Model heterogeneity (binary: homo vs. hetero configuration)
- Tool density T (for agentic tasks only)

**Statistical Standards:**
- N >= 300 instances per task per configuration
- 3 random seeds for stochastic tasks; mean +/- SE reported
- Effect thresholds: >= 3 pp improvement = "benefit"; <= -3 pp = "harm"; [-3, +3] pp = "neutral" (following Wynn et al. 2025 [4] precision at N = 100 × 5 seeds)
- Holm-Bonferroni correction for k = 5 configurations × 10 task strata = 50 total comparisons within Phase 1

**Compute Budget Reference (from Gao et al. 2025 [19]):**
- MAD-Hetero generates approximately 5-35× the tokens of SAS depending on task type
- Budget allocation: prioritize hard math and sequential planning tasks where the routing decision has the highest expected value
- Cost break-even: only escalate to MAD if expected accuracy gain exceeds the token cost ratio threshold (rule of thumb: > 5% expected gain if MAS/SAS token ratio > 10×)
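The break-even rule reduces to a two-line predicate; the thresholds are the rule-of-thumb defaults from the text, and the function name is ours:

```python
def escalate_to_mad(expected_gain_pp, token_ratio,
                    gain_floor_pp=5.0, ratio_threshold=10.0):
    """Escalate to MAD only when the expected accuracy gain (in
    percentage points) justifies the token overhead: if MAD costs
    more than `ratio_threshold`x the SAS tokens, demand at least
    `gain_floor_pp` points of expected gain."""
    if token_ratio > ratio_threshold:
        return expected_gain_pp > gain_floor_pp
    return expected_gain_pp > 0.0
```

In practice the expected gain would come from the Phase 2 classifier and the token ratio from per-task measurements like those of Gao et al. [19].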

---

### C2: Phase 2 -- Routing Classifier Development

**Objective:** Train a lightweight binary classifier predicting P(MAD adds value | query features) and evaluate routing performance on held-out query distributions.

**Training Data Collection Protocol (following RouteLLM [16] methodology):**

1. *Query pool:* 50,000 queries sampled from WildChat [36] distribution, stratified by WildBench 12-category taxonomy [35]
2. *Ground truth labeling:* For each query, compare SAS (GPT-4o-mini, CoT) vs. MAD-Hetero response quality
   - Factual tasks: LLM-judge evaluation (GPT-4 as judge, 0-100 score); label = 1 if MAD_score - SAS_score > 10 points
   - Verifiable tasks (math/code): exact match / pass@1; label = 1 if MAD passes and SAS fails
   - Estimated labeling cost: ~$700-1,400 for 50K labels (based on RouteLLM [16] benchmark of ~$700/120K labels at 2024 GPT-4 pricing)
3. *Active learning:* Entropy-based uncertainty sampling (following FutureSearch 2026 methodology); select 20 most uncertain queries per iteration for LLM oracle labeling; expect 30-70% label reduction vs. random sampling
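Step 2's labeling rule, as a sketch (the dict fields and the `judge_margin` parameter name are illustrative, with the 10-point margin taken from the protocol above):

```python
def mad_benefit_label(task_type, mad, sas, judge_margin=10):
    """Binary ground-truth routing label: 1 iff MAD adds value.
    `mad`/`sas` carry 'passed' (verifiable tasks: exact match or
    pass@1) or 'score' (factual tasks: LLM-judge 0-100)."""
    if task_type == "verifiable":
        return int(mad["passed"] and not sas["passed"])
    return int(mad["score"] - sas["score"] > judge_margin)
```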

**Classifier Architectures (following HybridLLM [17] and RouteLLM [16]):**

| Tier | Architecture | Training | Latency Target |
|---|---|---|---|
| Primary | DeBERTa-v3-large | 10K queries, 5 epochs, 1× A100 | <= 40 ms |
| Lightweight | BERT_BASE + CLS sigmoid | 10K queries, ~2,000 steps | <= 15 ms |
| Zero-shot baseline | SW Ranking (cosine similarity to training queries) | No training; inference only | <= 100 ms |
| Feature-based | LLMRank-style feature extraction + MLP | Extracted features on 10K queries | <= 50 ms |

**Feature Engineering (7 dimensions, following LLMRank [43]):**

*Tier 1 -- zero-cost (< 1 ms, regex/rule-based):*
- Query length (tokens)
- Contains code block (``` or inline code)
- Time-sensitive keywords ("latest", "current", specific year)
- Multi-part structure (numbered list, "first... second...")
- Conditional phrasing ("If X, then Y") -- sequential dependency signal
- Contains math operators or LaTeX
- Anaphora / pronouns referencing prior context
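Most Tier-1 features above can be extracted with plain regexes; the patterns below are illustrative starting points, not tuned production rules (anaphora detection is omitted, as it needs more than a regex):

```python
import re

def tier1_features(query):
    """Zero-cost lexical features for routing (illustrative sketch)."""
    return {
        "n_tokens_approx": len(query.split()),  # crude whitespace token count
        "has_code": "```" in query or bool(re.search(r"`[^`]+`", query)),
        "time_sensitive": bool(
            re.search(r"\b(latest|current|20\d{2})\b", query, re.I)),
        "multi_part": bool(
            re.search(r"(?m)^\s*\d+[.)]\s", query)            # numbered list
            or re.search(r"\bfirst\b[\s\S]*\bsecond\b", query, re.I)),
        "conditional": bool(                                   # sequential signal
            re.search(r"\bif\b[\s\S]*\bthen\b", query, re.I)),
        "has_math": bool(                                      # digit-op-digit or LaTeX
            re.search(r"\d\s*[-+*/^=]\s*\d|\\(frac|sum|int)\b", query)),
    }
```

All of these run in well under the 1 ms budget on typical queries.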

*Tier 2 -- fast embedding (5-10 ms, sentence encoder):*
- Domain classification (WildBench 5-group taxonomy via zero-shot DeBERTa-zeroshot)
- Semantic similarity to training queries with known routing decisions (KNN signal)
- Factual vs. generative intent (zero-shot NLI entailment)

*Tier 3 -- learned encoder (DeBERTa, approx. 36 ms):*
- Difficulty score (DeBERTa-v3-large, following HybridLLM architecture)
- Reasoning steps estimate (0 = lookup, 1 = single inference, 2+ = multi-hop)
- Ambiguity score (KL-divergence between query language model and corpus model)

**Routing Evaluation Metrics (following RouterBench [42] framework):**
- AUROC (threshold-independent discriminative power)
- CPT(50%): minimum MAD calls needed to achieve 50% of the performance gap between always-SAS and always-MAD
- APGR: Area under the Performance-GPT4-fraction curve (following RouteLLM [16])
- Cost reduction at matched accuracy (primary production metric)

**Cross-domain Generalization Test:**
- Train on WildChat distribution; evaluate on LMSYS-Chat-1M [34] distribution
- Measure AUROC degradation; following RouteLLM [16] finding that benchmark-dataset similarity score is the strongest predictor of router generalization

**Routing Decision Thresholds:**
- Production setting: threshold tuned to maximize APGR at <= 5% accuracy degradation vs. always-MAD
- Conservative setting: threshold tuned to 95% precision on "MAD benefits" label (minimize wasted MAD calls)
- Recall setting: threshold tuned to 95% recall on "MAD benefits" label (minimize missed benefit opportunities)
- Default recommendation: production setting, with sensitivity analysis across three threshold regimes
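The conservative regime amounts to finding the lowest score threshold that still meets the precision target on held-out labels. A minimal sweep, assuming classifier scores and binary "MAD benefits" labels (a production system would cross-validate):

```python
def threshold_at_precision(scores, labels, target=0.95):
    """Lowest score threshold whose cumulative precision on the
    positive label still meets `target`; None if unattainable."""
    best = None
    tp = fp = 0
    # sweep candidate thresholds from highest score downward
    for s, y in sorted(zip(scores, labels), reverse=True):
        tp += y
        fp += 1 - y
        if tp / (tp + fp) >= target:
            best = s  # keep lowering while precision still holds
    return best
```

The recall regime is symmetric: sweep until cumulative recall on the positive label reaches 95%.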

---

### C3: Phase 3 -- User Preference Study

**Objective:** Measure human preference between routed (adaptive SAS/MAD) vs. always-SAS vs. always-MAD configurations under realistic query conditions.

**Study Design: Between-Subjects A/B Test with Task Stratification**

Following recommendations from Chiang et al. 2024 [29], Irvine et al. 2023 [31], and Clarke et al. 2024 [14]:

- *Primary design:* Between-subjects (different users assigned to each condition)
- *Conditions:* (A) Always-SAS, (B) Always-MAD-Hetero, (C) Routed (classifier-selected SAS or MAD)
- *Rationale for between-subjects:* Avoids carryover effects from learning the style of one configuration; matches natural production deployment where users organically receive one configuration

**Sample Size Justification:**

Using Cohen's d framework (Lakens 2013 [52]) for a between-subjects design detecting a small effect (d = 0.2) at 80% power, alpha = 0.05 (two-tailed):
- Required per arm: N = 392 (total N = 1,176 across 3 arms)
- With Holm-Bonferroni correction for k = 3 pairwise comparisons: alpha_corrected = 0.017 for first comparison -> required per arm: N = 520 (total N = 1,560)
- With 5 task strata × 3 pairwise comparisons = 15 total tests: Holm correction applied within strata

**Task Pool Construction:**

Following WildBench [35] selection methodology:
1. Sample 1,500 queries from WildChat [36] (500 per arm)
2. Filter: length 10-3,000 tokens; <= 5 turns; English only; non-toxic
3. Stratify by 5-group WildBench taxonomy (target proportions: Info Seeking 20%, Math & Data 15%, Reasoning & Planning 20%, Creative Tasks 25%, Coding & Debugging 20%)
4. Difficulty filter: exclude queries rated <= 2/5 on 5-point difficulty scale by two or more of three LLM judges (GPT-4-Turbo, Claude-3.5-Sonnet, Gemini-1.5-Pro)
5. Final: 1,200 study queries (400 per arm, 80 per stratum per arm)

**Dependent Variables:**

*Layer 1 -- Primary preference signal (explicit):*
- Post-response pairwise preference: "Which response better addressed your need?" (5-point: A clearly better / A slightly better / About equal / B slightly better / B clearly better)
- Post-session overall satisfaction: "How satisfied were you with this interaction overall?" (CSAT 1-5)

*Layer 2 -- Behavioral signals (implicit, following Pang et al. 2024 [30]):*
- Session continuation rate (did user send a follow-up message?)
- Follow-up response length >= 20 words (highest-validity proxy, acc = 0.761)
- Regeneration rate (user explicitly requested a new response)
- Mean Conversation Length (MCL, following Irvine et al. 2023 [31])

*Layer 3 -- Trust and decision confidence (following LLM Rationale Trust Study, N=68):*
- "How confident are you in acting on this response?" (7-point Likert, Vereschak et al. 2021 scale)
- "How much do you trust this response?" (5-point Likert)
- "Overall, how much do you trust this system?" (5-point Likert, post-session)

**Response Presentation Protocol:**
Following Clarke et al. 2024 [14], the orchestration mechanism (SAS vs. MAD) is NOT disclosed to participants. All responses are presented in a uniform interface. This matches the OFA (One For All) paradigm that achieved SUS = 86 vs. SUS = 56 for explicit agent-selection interfaces. Participants interact with the system as a general assistant without knowledge of the underlying routing.

**Randomization and Blinding:**
- Users randomly assigned to condition A/B/C via block randomization (blocks of 6 to ensure equal allocation)
- Experimenters analyzing primary outcome data are blinded to condition assignment until after statistical analysis

**Statistical Analysis Plan:**
1. *Primary outcome:* CSAT (1-5) -- one-way ANOVA with Holm-Bonferroni post-hoc tests for k = 3 pairwise comparisons
2. *Secondary outcomes:* Session continuation rate, follow-up length (binary), trust ratings -- same ANOVA framework; Holm correction applied within each dependent variable
3. *Stratified analysis:* Repeat primary analysis within each of the 5 task strata; apply Benjamini-Hochberg FDR correction across strata for exploratory stratum-by-condition interaction
4. *Bradley-Terry model:* Fit BT model to pairwise preference data (5-point scale collapsed to ordered choices); report BT strength scores with sandwich confidence intervals per Chiang et al. 2024 [29] and Ameli et al. 2024 [37]
5. *Effect size reporting:* Report Cohen's d_s for between-subjects comparisons; report both uncorrected and Holm-corrected p-values
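The Holm-Bonferroni step-down procedure used throughout this plan, as a short reference implementation:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down: returns, per hypothesis, whether it is
    rejected while controlling family-wise error rate at alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # rank-r p-value is compared against alpha / (m - r)
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

For k = 10 comparisons this gives the rank-1 threshold of 0.05/10 = 0.005 quoted in Appendix B4.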

**Pre-registration:** Study design, hypotheses, primary analysis plan, and sample size calculation to be pre-registered on OSF before data collection begins. Primary hypothesis: the Routed condition (C) achieves non-inferior CSAT to Always-MAD (B) at lower per-query compute cost. Secondary hypothesis: Routed achieves superior CSAT to Always-SAS (A) on the Reasoning & Planning and Math & Data strata, with no significant difference on the Creative Tasks and Info Seeking strata.

**Manipulation Check:**
Post-study survey (subset, N >= 60): "Did you notice any difference between responses you received at different times?" -- to verify blinding integrity and detect carryover concerns in the production deployment context.
