Deployment and integration¶
This page covers common ways to run LLMRouter in production-like settings.
Two common deployment patterns¶
Pattern A: route-only + your own model gateway (recommended)¶
Use LLMRouter to pick model_name, and then call your model gateway/service yourself. This keeps secrets, retries, logging, and observability in one place.
- Use llmrouter infer --route-only ... for a CLI workflow, or
- Load a router in Python and call route_single directly (see the sketch below).
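A minimal sketch of Pattern A, using the route_query helper described later on this page; call_my_gateway is a hypothetical stand-in for your own gateway call:

from llmrouter.cli.router_inference import load_router, route_query

def call_my_gateway(model: str, prompt: str) -> str:
    # Hypothetical stand-in: your own gateway/service call, where secrets, retries, and logging live.
    raise NotImplementedError

# Load once at startup, reuse across requests.
router_name = "knnrouter"
router = load_router(router_name, "configs/model_config_test/knnrouter.yaml")

def answer(query: str) -> str:
    # 1) Route only: LLMRouter picks the model; no external API call is made here.
    model_name = route_query(query, router, router_name)["model_name"]
    # 2) Call the selected model through your own gateway.
    return call_my_gateway(model=model_name, prompt=query)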
Pattern B: full inference through llmrouter infer¶
Let LLMRouter route and call the selected model via an OpenAI-compatible endpoint. This is the simplest end-to-end path, but you still need to manage API_KEYS and api_endpoint.
CLI-based inference¶
For a single query:
llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --query "Hello"
For batch inference:
llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --input queries.jsonl --output results.jsonl --output-format jsonl
Use --route-only to skip API calls and return routing decisions only.
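For example, the single-query command above becomes:
llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --query "Hello" --route-only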
Python integration (load once, reuse)¶
If you are embedding LLMRouter in a service, load the router once at startup and reuse it across requests:
from llmrouter.cli.router_inference import load_router, route_query
router_name = "knnrouter"
router = load_router(router_name, "configs/model_config_test/knnrouter.yaml")
result = route_query("Explain transformers.", router, router_name)
print(result["model_name"])
To perform full inference inside Python, use infer_query (requires API_KEYS and api_endpoint):
from llmrouter.cli.router_inference import infer_query
result = infer_query("Explain transformers.", router, router_name)
print(result["response"])
Chat UI¶
LLMRouter includes a Gradio-based chat UI.
API credentials and endpoints¶
- Configure api_endpoint in your YAML config to point to your LLM endpoint.
- Set API_KEYS in the environment for call_api.
- Single key: API_KEYS=your-key
- Multiple keys: API_KEYS='["key1", "key2"]'
call_api rotates across keys in a round-robin fashion.
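For example, in a POSIX shell, before running full inference:
export API_KEYS='["key1", "key2"]'
llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --query "Hello"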
Scaling notes¶
- Multi-process deployments: each process maintains its own API key rotation counters.
- Route-only is the safest way to scale quickly since it avoids external calls.
Service integration¶
If you are embedding LLMRouter in a service:
- Load the router once at startup and reuse it across requests (see the sketch below).
- Consider using llmrouter.cli.router_inference.load_router to load a router from its config.
- Keep config, data, and model artifacts versioned together.
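A minimal sketch of this pattern, assuming a FastAPI app purely for illustration (any web framework works the same way); the endpoint is route-only and returns the selected model name:

from fastapi import FastAPI
from llmrouter.cli.router_inference import load_router, route_query

ROUTER_NAME = "knnrouter"
CONFIG_PATH = "configs/model_config_test/knnrouter.yaml"

app = FastAPI()
router = load_router(ROUTER_NAME, CONFIG_PATH)  # load once at startup

@app.get("/route")
def route(query: str):
    # Route-only endpoint: return the chosen model; callers hit their own model gateway.
    result = route_query(query, router, ROUTER_NAME)
    return {"model_name": result["model_name"]}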
Operational tips¶
- Validate configs in CI by running llmrouter list-routers and a small --route-only test (see the example below).
- Prefer batch inference for large offline jobs.
- Tune --temp and --max-tokens for your latency and cost targets.
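For example, a minimal CI smoke test could run:
llmrouter list-routers
llmrouter infer --router knnrouter --config configs/model_config_test/knnrouter.yaml --query "smoke test" --route-only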