The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their ability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To guide learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward, opening a pathway toward optimizing the performance-cost trade-off via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
(a) Single-round Routing: A conventional router assigns each query to a single LLM in isolation via a one-shot decision, without internal reasoning or multi-model coordination. (b) Multi-round Routing (ours): Router-R1 casts multi-LLM routing as a sequential decision process, leveraging an LLM-based router that interleaves internal reasoning with external LLM routing and integrates each returned response into its evolving context. This enables adaptive multi-model coordination for complex tasks and yields better performance than single-round routing.
Router-R1 Model Architecture
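To make the interleaved routing process described above concrete, below is a minimal Python sketch of one "think"/"route"/"answer" rollout. It is not the official Router-R1 implementation: the helper callables (policy_generate, call_candidate_llm), the round budget, and the exact tag names are hypothetical placeholders that only illustrate the sequential decision loop.

import re

MAX_ROUNDS = 4  # assumed cap on routing rounds per query (illustrative)

def route_rollout(policy_generate, call_candidate_llm, query):
    """Run one multi-round routing episode for a single user query."""
    context = query
    for _ in range(MAX_ROUNDS):
        # The router LLM continues the trajectory: it may deliberate internally,
        # emit a routing action, or produce a final answer.
        step = policy_generate(context)
        context += step

        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:  # final outcome reached; stop the episode
            return context, answer.group(1).strip()

        route = re.search(r"<route>(.*?):(.*?)</route>", step, re.S)
        if route:  # dynamic model invocation: query a candidate LLM from the pool
            model_name, sub_query = route.group(1).strip(), route.group(2).strip()
            response = call_candidate_llm(model_name, sub_query)
            # Aggregate the external response into the evolving context.
            context += "<information>" + response + "</information>"

    return context, None  # no final answer within the round budget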
We evaluate Router-R1 on seven QA benchmarks covering both general and multi-hop QA. As shown in the table below, several key findings emerge: (1) Router-R1 consistently outperforms all basic baselines, including Direct, CoT, SFT, and RAG, particularly on knowledge-intensive tasks. It also surpasses Search-R1, despite the latter's multi-turn retrieval ability, highlighting the benefits of Router-R1's dynamic multi-round routing. Notably, Router-R1-Llama and Router-R1-Qwen achieve the highest average EM scores of 0.409 and 0.416, respectively. (2) Router-R1 outperforms all LLM router baselines, such as Prompt LLM, KNN Router, and their enhanced variants. Its core strength lies in using an LLM as the router, enabling interleaved reasoning and routing, which improves coordination across models and tasks. (3) Router-R1 generalizes well to unseen data, achieving strong performance on five out-of-domain datasets despite being trained only on NQ and HotpotQA, demonstrating robust and transferable routing strategies.
Experimental results on seven QA datasets w.r.t. Exact Match.
We conduct an extensive study of the effect of different cost coefficients α in Router-R1 and compare against various baselines in terms of EM and average raw cost rewards. The results in the table below reveal clear trade-offs: with α=0.0, Router-R1 achieves the highest EM on almost all datasets, reflecting a purely performance-oriented routing strategy. As α increases, we observe a substantial reduction in cost accompanied by a decrease in EM, which highlights the inherent trade-off between answer accuracy and computational efficiency. Notably, α=0.6 provides a favorable balance, consistently achieving strong EM while substantially lowering costs compared to α=0.0. With moderate α values, Router-R1 outperforms the baselines in both accuracy and efficiency, validating the flexibility and effectiveness of our cost-aware reward design.
Analysis of cost rewards on NQ, PopQA, HotpotQA, and 2WikiMultiHopQA w.r.t. Exact Match and raw (unnormalized) cost rewards.
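As a rough illustration of how the coefficient α trades accuracy against cost, the following is a hedged Python sketch. The exact reward formula and cost normalization used by Router-R1 may differ; cost_aware_reward, max_cost, and the convex combination below are illustrative assumptions, not the paper's definition.

def cost_aware_reward(exact_match, total_api_cost, max_cost, alpha=0.6):
    """Blend the outcome reward with a normalized cost reward via alpha.

    exact_match    : 1.0 if the final answer matches the gold answer, else 0.0
    total_api_cost : accumulated price of all candidate-LLM calls in the rollout
    max_cost       : normalization constant (illustrative choice)
    alpha          : trade-off coefficient; alpha=0.0 is purely performance-oriented
    """
    cost_reward = 1.0 - min(total_api_cost / max_cost, 1.0)  # cheaper rollouts score higher
    return (1.0 - alpha) * exact_match + alpha * cost_reward

With alpha=0.0 the function reduces to the outcome reward alone, matching the performance-oriented behavior reported above, while larger alpha values increasingly favor cheaper routing decisions.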
LLM API Call Count Analysis. To assess the adaptability of Router-R1 to tasks of varying difficulty, we analyze the average number of LLM API calls (i.e., the number of times candidate LLMs within the routing pool are invoked) made by Router-R1-Qwen across seven QA benchmarks. As shown in Figure (a) below, Router-R1-Qwen makes significantly more LLM API calls on average on multi-hop QA benchmarks (i.e., HotpotQA, 2WikiMultiHopQA, Musique, and Bamboogle) than on general QA benchmarks (i.e., NQ, TriviaQA, and PopQA). This indicates that Router-R1 adaptively assesses task difficulty and decides whether external LLM routing is necessary, selectively drawing on external resources when tasks are more complex.

Convergence Analysis of Router-R1 Training. To evaluate the convergence behavior of Router-R1, we track two key curves during policy training: the reward and the policy LLM's action entropy. Figures (b) and (c) below show that Router-R1 converges within 100 training steps, as evidenced by rising rewards and decreasing policy entropy, indicating rapid and robust convergence. Occasional formatting errors may cause brief drops in reward, but our hierarchical reward design swiftly corrects them, ensuring stable and accelerated learning. In particular, without format rewards, Router-R1 exhibits greater training instability, frequently generating meaningless or nonsensical text that leads to severe formatting breakdowns in the output.
Analysis of LLM API call count and Router-R1 training convergence.
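The hierarchical reward design mentioned above can be pictured as a format check that gates the outcome and cost signals. The sketch below is a simplified, hypothetical version: the actual format checks and penalty values in Router-R1 may differ, and the tag names mirror the illustrative rollout sketch earlier on this page.

import re

def hierarchical_reward(rollout_text, exact_match, cost_term):
    """Gate the outcome/cost signal behind a format check (simplified)."""
    # Format check: exactly one well-formed answer span and balanced route tags.
    well_formed = (
        len(re.findall(r"<answer>.*?</answer>", rollout_text, re.S)) == 1
        and rollout_text.count("<route>") == rollout_text.count("</route>")
    )
    if not well_formed:
        return -1.0  # flat format penalty dominates everything else
    # Only well-formed trajectories receive the outcome and cost rewards.
    return exact_match + cost_term

A malformed rollout therefore receives an immediate penalty regardless of its answer, which is why the occasional formatting errors noted above show up as brief reward drops that are quickly corrected during training.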
@article{Router-R1,
author = {Haozhen Zhang and Tao Feng and Jiaxuan You},
title = {Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning},
journal = {arXiv preprint arXiv:2506.09033},
year = {2025},
}