Evaluation workflow

This page is a practical checklist for comparing routers on the same query set. For the underlying concepts (metrics, supervision, trade-offs), see Training and evaluation.

1) Prepare an evaluation file

Use JSONL and include stable IDs if you plan to join with labels later:

{"query_id":"q1","query":"What is machine learning?"}
{"query_id":"q2","query":"Explain transformers."}

2) Run routing-only for each router

Route-only runs are fast and avoid API costs:

llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --input eval.jsonl --output knn.jsonl --output-format jsonl --route-only
llmrouter infer --router svmrouter --config configs/model_config_test/svmrouter.yaml --input eval.jsonl --output svm.jsonl --output-format jsonl --route-only
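
When comparing more than a couple of routers, it can be convenient to drive the same CLI call from a small script. This is only a sketch that mirrors the commands above; the router names, config paths, and output file names are the ones used in this walkthrough:

import subprocess

# Router name -> route-only output file, matching the commands above.
routers = {
    "knnrouter": "knn.jsonl",
    "svmrouter": "svm.jsonl",
}

for router, output in routers.items():
    subprocess.run(
        [
            "llmrouter", "infer",
            "--router", router,
            "--config", f"configs/model_config_test/{router}.yaml",
            "--input", "eval.jsonl",
            "--output", output,
            "--output-format", "jsonl",
            "--route-only",
        ],
        check=True,
    )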

3) Aggregate results

At minimum, you can compute:

- routing distribution (how often each model is selected)
- disagreement between routers

If you also have routing labels (for example, a best_model per query), you can compute simple accuracy by comparing each router's predicted model_name against that label.
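
A sketch of this aggregation in Python, assuming each output line carries the input's query_id plus the selected model_name (check your router outputs for the exact schema); the labels.jsonl file with a best_model field is hypothetical and only needed for the accuracy check:

import json
from collections import Counter

def load_jsonl(path):
    """Load a JSONL file into a dict keyed by query_id."""
    with open(path, encoding="utf-8") as f:
        return {row["query_id"]: row for row in map(json.loads, f)}

knn = load_jsonl("knn.jsonl")
svm = load_jsonl("svm.jsonl")

# Routing distribution: how often each model is selected by each router.
for name, results in [("knn", knn), ("svm", svm)]:
    print(name, dict(Counter(row["model_name"] for row in results.values())))

# Disagreement between the two routers on the shared query set.
shared = knn.keys() & svm.keys()
disagreements = sum(knn[q]["model_name"] != svm[q]["model_name"] for q in shared)
print(f"disagreement: {disagreements}/{len(shared)}")

# Optional: simple accuracy against routing labels (hypothetical labels.jsonl).
try:
    labels = load_jsonl("labels.jsonl")
except FileNotFoundError:
    labels = {}
for name, results in [("knn", knn), ("svm", svm)]:
    scored = results.keys() & labels.keys()
    if scored:
        hits = sum(results[q]["model_name"] == labels[q]["best_model"] for q in scored)
        print(f"{name} accuracy: {hits}/{len(scored)}")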

4) (Optional) Full inference

Full inference can be useful for end-to-end validation, but it is slower and requires API credentials.

llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --input eval.jsonl --output knn_full.jsonl --output-format jsonl
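
A quick consistency check for end-to-end validation is that the model selected in the full run matches the route-only run for each query. A sketch, again assuming query_id and model_name appear in both output files:

import json

def load_jsonl(path):
    with open(path, encoding="utf-8") as f:
        return {row["query_id"]: row for row in map(json.loads, f)}

route_only = load_jsonl("knn.jsonl")
full = load_jsonl("knn_full.jsonl")

# Queries where the full run routed to a different model than the route-only run.
mismatches = [
    q for q in route_only.keys() & full.keys()
    if route_only[q]["model_name"] != full[q]["model_name"]
]
print(f"routing mismatches between modes: {len(mismatches)}")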
