PersonalizedRouter: Personalized LLM Routing via Graph-based User Preference Modeling

Zhongjie Dai*, Tao Feng*, Jiaxuan You
University of Illinois Urbana-Champaign, Urbana, IL, USA
*Indicates Equal Contribution

Abstract

The growing number of Large Language Models (LLMs) with diverse capabilities and response styles provides users with a wider range of choices, but it also makes selecting an appropriate LLM challenging, as user preferences vary in terms of performance, cost, and response style. Current LLM selection methods typically optimize for a single fixed objective, such as performance, cost, or a trade-off between them, and fail to learn individual user preferences from interaction data. To address these limitations, we propose PersonalizedRouter, a graph-based framework that models diverse user profiles and performs personalized LLM selection by leveraging interaction data that includes task context, queries, candidate LLMs, and user decisions. To capture the contextual information between user queries and optimal LLMs, PersonalizedRouter converts the interaction data into a heterogeneous graph, where edges represent the relationships between different types of nodes. To further assess adaptability across multiple users, we design two strategies for simulating user interaction data: a multi-cost-efficiency simulation strategy and an LLM-as-a-Judge strategy. Experimental results under both simulation settings demonstrate that PersonalizedRouter outperforms existing LLM selection methods, surpassing the strongest baseline by large margins of 15.38% and 9.83%. In a larger-scale setting with 1,000 users, it consumes less time while still outperforming all baselines, exceeding the best method by 16.19% and 59.69% in the two scenarios, respectively. Moreover, PersonalizedRouter exhibits few-shot learning capabilities: it adapts effectively to new users and new LLMs, achieving 64.81% and 85.80% of the fully trained model's performance, respectively.

Introduction

In recent years, the rapid growth of model scale and advances in training techniques have fueled the explosive emergence of large language models (LLMs), offering users diverse choices such as ChatGPT, Gemini, and LLaMA. Although large-scale models show remarkable performance on many tasks, they tend to be inefficient on simple problems, where small-scale models can achieve comparable performance while requiring fewer resources. Moreover, different LLMs excel at different tasks, exhibiting varying performance and cost efficiency on specific applications, and some domain-specific expert models achieve superior results on specialized tasks. Beyond differences in response quality and cost, LLMs also exhibit diverse response styles, which influence how users understand the answers to their queries. In multi-user scenarios, users often have distinct preferences that are difficult to model directly, making it challenging for a single LLM to serve all users consistently. Our paper therefore aims to raise attention to this pressing research question: given multiple user preferences, how can we design an LLM router that is personalized for each individual user?

To comprehensively evaluate the adaptability of LLM selection methods in multi-user scenarios, we design two simulation strategies that model diverse user behaviors and generate the corresponding interaction data. The first is a multi-cost-efficiency simulation strategy, which calculates a reward score for each response based on a user's trade-off between performance and inference cost. The second is an LLM-as-a-Judge strategy, which uses a set of system prompts to instruct an LLM to simulate user groups with different subjective preferences and select the best response among the candidates.
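To make the first strategy concrete, below is a minimal sketch of how a per-user reward could be computed. The linear performance/cost trade-off and the user-specific weight `alpha` are our illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of the multi-cost-efficiency reward.
# The linear trade-off and the per-user weight `alpha` are
# illustrative assumptions, not the paper's exact formula.

def reward(performance: float, cost: float, alpha: float) -> float:
    """Score one response for a user with preference weight alpha.

    performance: response quality in [0, 1] (e.g., task accuracy).
    cost: normalized inference cost in [0, 1].
    alpha: 1.0 cares only about quality, 0.0 only about cost.
    """
    return alpha * performance - (1.0 - alpha) * cost

# Simulated users differ only in their trade-off weight.
users = {"quality_first": 0.9, "balanced": 0.5, "cost_first": 0.1}

# Candidate responses as (performance, cost) pairs; each simulated
# user "selects" the LLM whose response maximizes their reward.
candidates = {"llm_a": (0.92, 0.80), "llm_b": (0.85, 0.30)}
for name, alpha in users.items():
    best = max(candidates, key=lambda m: reward(*candidates[m], alpha))
    print(f"{name} -> {best}")
```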

PersonalizedRouter

Framework overview

We first use the candidate LLMs to generate responses on the multi-task dataset and, under the two simulation strategies, obtain the corresponding interaction data. PersonalizedRouter then transforms the interaction data into a heterogeneous graph, where nodes represent users, tasks, queries, and LLMs, and edges capture the relationships between the different node types. A GNN embeds the node and edge features, updating and capturing each user's hidden features. Finally, we select the optimal LLM from the predicted probability distribution.
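The following sketch illustrates this pipeline with PyTorch Geometric, assuming toy dimensions, random features, and illustrative node/edge type names; the paper's exact graph schema, edge features, and scoring head may differ.

```python
import torch
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData
from torch_geometric.nn import SAGEConv, to_hetero

# Toy heterogeneous interaction graph. Node/edge type names and
# feature dimensions are illustrative; edge features are omitted.
data = HeteroData()
data['user'].x = torch.randn(9, 32)      # 9 simulated users
data['task'].x = torch.randn(4, 32)      # task-context nodes
data['query'].x = torch.randn(100, 32)   # query embeddings
data['llm'].x = torch.randn(10, 32)      # 10 candidate LLMs

q = torch.arange(100)
data['user', 'issues', 'query'].edge_index = torch.stack(
    [torch.randint(0, 9, (100,)), q])
data['query', 'belongs_to', 'task'].edge_index = torch.stack(
    [q, torch.randint(0, 4, (100,))])
data['query', 'answered_by', 'llm'].edge_index = torch.stack(
    [q, torch.randint(0, 10, (100,))])
data = T.ToUndirected()(data)  # add reverse edges so all node types update

class GNN(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.conv1 = SAGEConv((-1, -1), hidden)
        self.conv2 = SAGEConv((-1, -1), hidden)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

# Lift the homogeneous GNN to the heterogeneous graph.
model = to_hetero(GNN(64), data.metadata())
out = model(data.x_dict, data.edge_index_dict)

# Score every (query, LLM) pair and pick the most likely LLM per
# query; a learned scoring head could replace the dot product.
scores = out['query'] @ out['llm'].t()    # [num_queries, num_llms]
probs = F.softmax(scores, dim=-1)
best_llm = probs.argmax(dim=-1)
```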

Experiments

We compare PersonalizedRouter with other representative baseline methods under two different simulation strategies. All models are trained under the general experimental setting, which involves 10 LLMs and 9 users, aiming to evaluate their ability to adapt to new queries from existing users.

Large-scale multi-user evaluation

To further assess the scalability of the router, we conduct experiments in a large-scale scenario. Under the multi-cost-efficiency simulation strategy, we evaluate 10 LLMs across 1,000 users. Under the LLM-as-a-Judge strategy, to mitigate the bias of relying on a single LLM as a judge, we consider three different LLMs and apply two types of instruction prompts to each, resulting in six distinct judge configurations. Each of 200 user profiles is evaluated under all six configurations, yielding a total of 1,200 simulated user samples. Our model outperforms all baselines under both simulation strategies, surpassing the strongest baseline by 16.19% and 59.69%, respectively, while requiring less computation time.
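For illustration, the snippet below enumerates the judge grid described above; the judge model and prompt names are placeholders, not the paper's actual choices.

```python
from itertools import product

# Hypothetical enumeration of the judge grid: 3 judge LLMs x 2
# instruction prompts = 6 judge configurations, each paired with
# 200 simulated user profiles. All names below are placeholders.
judge_llms = ["judge_llm_1", "judge_llm_2", "judge_llm_3"]
prompt_styles = ["prompt_a", "prompt_b"]
num_profiles = 200

judge_configs = list(product(judge_llms, prompt_styles))
samples = [(profile, judge, prompt)
           for profile in range(num_profiles)
           for judge, prompt in judge_configs]

assert len(judge_configs) == 6
assert len(samples) == 1200   # 200 profiles x 6 configurations
```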

Scalability analysis

To demonstrate the scalability of PersonalizedRouter more intuitively, we compare the results of the two experiments conducted at different scales. The results in Table 5 indicate that PersonalizedRouter maintains stable performance, with only a 5.2% drop relative to the Oracle.

Generalization to new users

To evaluate the ability of different LLM selection methods to generalize to new users, we also train all models under the new user experimental setting. Specifically, we treat the first three users as new users and the remaining six as known users for model training.

Generalization to new LLMs

To evaluate the generalization capability of PersonalizedRouter to new LLMs, we conduct experiments under the new-LLM experimental setting. Similar to the new-user setting, the model is trained on data from the first 10 LLMs, while the remaining 5 LLMs are treated as an auxiliary dataset for evaluation.

Evaluation with real users

Although simulated users can partially validate whether the router is capable of modeling latent user preferences, they still differ from real users with diverse preferences. To further verify that the router is effective in real multi-user settings, we collected a small-scale human interaction dataset. Specifically, we recruited 40 users and provided them with 80 queries. Each query was answered by 10 LLMs, and users selected the response that best aligned with their preferences.


BibTeX

TODO: BibTeX code