IRanker: Towards Ranking Foundation Model

Tao Feng1, Zhigang Hua3, Zijie Lei1, Yan Xie3, Shuang Yang3, Bo Long3, Jiaxuan You1
1University of Illinois Urbana-Champaign
3Meta Monetization AI
{taofeng2,jiaxuan,harrylei}@illinois.edu, {zhua,yanxie,shuangyang,bolong}@meta.com

Abstract

Ranking tasks are ubiquitous, encompassing applications such as recommendation systems, LLM routing, and item re-ranking. We propose to unify these tasks using a single ranking foundation model (FM), which eliminates the need to design a different model for each specific ranking task. However, unlike general supervision tasks in LLMs, ranking tasks do not have clear labels for supervision, which poses great challenges to developing a ranking FM. To overcome these challenges, we propose IRanker, a ranking FM framework with reinforcement learning (RL) and iterative decoding. Our insight is to decompose the complex ranking task into an iterative decoding process that eliminates the worst candidate from the candidate pool step by step, which significantly reduces the output combinatorial space and better utilizes the limited context length during RL training. We meticulously train and comprehensively evaluate an IRanker-3B model on nine datasets across three scenarios: recommendation, routing, and passage ranking. The results show that a single IRanker-3B achieves state-of-the-art results on several datasets compared to models of similar size, and even surpasses larger models on certain datasets. We further demonstrate the effectiveness of our RL design and the robustness of the iterative mechanism across different LLM sizes. Moreover, we conducted both in-domain and out-of-domain zero-shot generalization experiments, which showed that IRanker-3B outperforms the base LLM by at least 5% on in-domain ranking tasks. Surprisingly, on out-of-domain generic LLM tasks, IRanker-3B outperforms the base model by at least 9% on GSM8K, IFEval, and MathQA. In addition, the thoughts generated by IRanker-3B during training can further enhance zero-shot LLM performance.

1. Introduction

Example ranking tasks that the proposed ranking FM can solve.


(a) The recommendation task models the user’s preferences based on their historical behaviors. It ranks the current candidate items and predicts which items the user is most likely to prefer.

(b) The routing task recommends suitable LLMs to respond to different user queries. The routing process takes into account the effectiveness and cost of each LLM’s response and ranks the candidate LLMs to produce the final recommendation list.

(c) Passage ranking retrieves a set of passages from a pool of candidates for a given user query, e.g., for retrieval-augmented generation. It ranks the passages by modeling the relevance between the query and each passage to produce the final list of passages.
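Although the three scenarios differ in their candidates and context, they can all be cast as the same text-in, ranking-out problem. The snippet below is a minimal, hypothetical sketch of how such inputs might be serialized into a single prompt format; the function and field names are illustrative and are not taken from the paper.

```python
# Hypothetical sketch: serializing the three scenarios into one ranking prompt.
# Field names ("context", "candidates") are illustrative, not from the paper.

def build_ranking_prompt(scenario: str, context: str, candidates: list[str]) -> str:
    """Render user/query context and candidate items as a single text prompt."""
    lines = [
        f"Scenario: {scenario}",
        f"Context: {context}",
        "Candidates:",
    ]
    lines += [f"  [{i}] {c}" for i, c in enumerate(candidates)]
    lines.append("Rank the candidates from most to least suitable.")
    return "\n".join(lines)

# Recommendation: context = the user's interaction history, candidates = items.
# Routing:        context = the user query, candidates = available LLMs (with cost notes).
# Passage ranking: context = the query, candidates = retrieved passages.
print(build_ranking_prompt(
    scenario="routing",
    context="Explain the proof of the Cauchy-Schwarz inequality.",
    candidates=["gpt-small (cheap)", "math-specialist-70B (expensive)", "code-model-7B"],
))
```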

2. Framework of our proposed ranking FMs.


Both DRanker and IRanker are RL-enhanced LLM frameworks. They take as input the candidate information in text form, along with user information (such as user history or a query), and use the LLM’s reasoning capabilities to produce a final candidate ranking. This ranking is then scored by an evaluator to generate a corresponding reward, which is used to optimize the decision-making of both rankers. The key distinctions are: 1) DRanker generates the full ranking in a single step, whereas IRanker iteratively excludes the least likely item from the candidates to arrive at the final ranking. 2) The reward in DRanker is a ranking reward based on the final candidate ranking, while the reward in IRanker is an exclusion reward given for each individual exclusion decision. 3) DRanker always receives the full, fixed-size set of candidates as input, whereas IRanker’s input candidates are dynamically updated as items are excluded.
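To make the iterative design concrete, here is a minimal sketch of an exclusion-style decoding loop of the kind described above. The LLM call and the reward shaping are placeholders (`llm_pick_worst` and the toy reward are assumptions for illustration, not the paper's exact implementation); the point is that each step emits only a single index rather than a full permutation, and each exclusion decision can receive its own reward.

```python
# Minimal sketch of IRanker-style iterative exclusion decoding.
# `llm_pick_worst` is a hypothetical stand-in for a call to the ranker LLM that
# returns the index of the least suitable remaining candidate.

from typing import Callable, List

def iterative_exclusion_rank(
    context: str,
    candidates: List[str],
    llm_pick_worst: Callable[[str, List[str]], int],
) -> List[str]:
    """Return candidates ranked best-to-worst by repeatedly excluding the worst one."""
    remaining = list(candidates)
    excluded_in_order: List[str] = []                     # worst ... best, in exclusion order
    while len(remaining) > 1:
        worst_idx = llm_pick_worst(context, remaining)    # one small decision per step
        excluded_in_order.append(remaining.pop(worst_idx))
    excluded_in_order.append(remaining[0])                # last survivor is ranked first
    return list(reversed(excluded_in_order))              # best-to-worst ranking

def exclusion_reward(excluded_item: str, ground_truth_best: str) -> float:
    """Toy per-step reward: penalize excluding the ground-truth item, otherwise reward."""
    return -1.0 if excluded_item == ground_truth_best else 1.0
```

Because every exclusion step gets its own reward, the RL signal stays dense even though the quality of the final ranking is only fully determined at the end of the loop.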

4. Experiments

4.1 Comparison with general baselines

Model performance comparison with general baselines across nine ranking tasks in three scenarios, measured by MRR. Bold and underline denote the best and second-best results. We can observe the following: 1) Compared to the baselines, IRanker-3B achieves state-of-the-art performance on almost all tasks. 2) The comparison between methods with and without RL validates the enhancement effect of RL on ranking tasks. 3) The comparison between iterative ranking and direct ranking demonstrates the suitability of the iterative design for models of different sizes.
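All tables report MRR (mean reciprocal rank). For reference, a minimal implementation of the metric, independent of the paper's evaluation code:

```python
def mean_reciprocal_rank(rankings: list[list[str]], ground_truths: list[str]) -> float:
    """MRR: average of 1 / (1-based rank of the ground-truth item in each ranked list)."""
    total = 0.0
    for ranked, gt in zip(rankings, ground_truths):
        # rank is 1-based; items missing from the list contribute 0 by convention
        total += 1.0 / (ranked.index(gt) + 1) if gt in ranked else 0.0
    return total / len(rankings)

# Example: ground truth ranked 2nd and 1st -> MRR = (1/2 + 1/1) / 2 = 0.75
print(mean_reciprocal_rank([["b", "a", "c"], ["x", "y"]], ["a", "x"]))
```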


4.2 Comparison with task-specific baselines

IRanker-3B matches the performance of domain-specific methods across multiple tasks with fewer training samples and a smaller model size. We compared the performance of IRanker-3B against three representative SOTA methods and Qwen2.5-3B-Instruct-iter across three scenarios. SOTA-1, SOTA-2, and SOTA-3 correspond to SASRec, BPR, and R1-Rec in the recommendation (Rec) scenario; GraphRouter, RouterBert, and RouterKNN in the routing (Router) scenario; and RankLLaMA-8B, RankBERT, and MonoT5 in the passage ranking (Passage) scenario.


4.3 Zero-shot performance

Zero-shot performance comparison across different ranking tasks on MRR. Bold and underline denote the best and second-best results. The results for each ranking scenario were obtained by training on the data from the other two ranking scenarios and then performing zero-shot testing on the target scenario.


4.4 Effects of thoughts generated by IRanker

Thoughts generated by IRanker during training can enhance the zero-shot performance of the base model. IRanker-COT-3B is an iterative framework that, for each test query, retrieves similar queries and the corresponding thoughts produced during the training of IRanker, using them as thought templates to guide zero-shot responses. We evaluate IRanker-COT-3B on nine tasks and compare its performance with IRanker-3B and Qwen2.5-3B-Instruct-iter. The results show that IRanker-COT-3B consistently outperforms Qwen2.5-3B-Instruct-iter and even surpasses IRanker-3B on the Rec-Game task.
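A minimal sketch of how such thought templates might be retrieved at test time is shown below. The embedding function, similarity measure, and prompt wiring are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch of IRanker-COT-style prompting: retrieve thoughts produced on
# similar training queries and prepend them as templates for a zero-shot query.
# `embed` is any text-embedding function (e.g., a sentence encoder); names are illustrative.

import numpy as np
from typing import Callable, List, Tuple

def retrieve_thought_templates(
    test_query: str,
    train_pairs: List[Tuple[str, str]],          # (training query, thought emitted during training)
    embed: Callable[[List[str]], np.ndarray],    # any text encoder returning an (n, d) array
    k: int = 2,
) -> List[str]:
    """Return the thoughts attached to the k training queries most similar to the test query."""
    queries = [q for q, _ in train_pairs]
    vecs = embed(queries + [test_query])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)   # cosine similarity via dot product
    sims = vecs[:-1] @ vecs[-1]
    top = np.argsort(-sims)[:k]
    return [train_pairs[i][1] for i in top]

def build_cot_prompt(test_query: str, thoughts: List[str]) -> str:
    """Prepend retrieved thoughts as reasoning templates for the zero-shot base model."""
    templates = "\n\n".join(f"Example reasoning:\n{t}" for t in thoughts)
    return f"{templates}\n\nNow answer the new query:\n{test_query}"
```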


4.5 Generalization on out-of-domain generic LLM tasks

IRanker outperformed the base model on three out-of-domain generic LLM tasks. Bolded values indicate higher performance. This table compares the performance of IRanker-3B and Qwen2.5-3B-Instruct across eight widely-used benchmarks. IRanker-3B leads in five out of eight tasks, especially on math and reasoning-heavy datasets like GSM8K, IFEval, and MathQA. Qwen2.5-3B-Instruct performs better on code generation tasks, including MBPP and HumanEval. The models are nearly tied on general QA tasks like OpenBookQA and HellaSwag. These results highlight IRanker-3B’s strength in structured reasoning, while Qwen2.5-3B-Instruct maintains a slight edge in coding ability.


BibTeX

@misc{feng2025iranker,
  title        = {IRanker: Towards Ranking Foundation Model},
  author       = {Tao Feng and Zhigang Hua and Zijie Lei and Yan Xie and Shuang Yang and Bo Long and Jiaxuan You},
  year         = {2025},
  howpublished = {https://github.com/ulab-uiuc/IRanker},
  note         = {Accessed: 2025-06-03}
}