Routing Policies¶
This guide covers how BLIS distributes incoming requests across instances in cluster mode. For single-instance simulation, routing is not applicable. For instance-level request ordering, see Scheduling & Priority.
```bash
# Quick example: compare round-robin vs weighted routing
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500 \
  --routing-policy weighted --trace-level decisions --summarize-trace
```
Available Policies¶
| Policy | Flag Value | Strategy |
|---|---|---|
| Round-robin | `round-robin` | Cyclic assignment — request N goes to instance N % k |
| Least-loaded | `least-loaded` | Send to the instance with the lowest `EffectiveLoad` |
| Weighted | `weighted` | Composable multi-scorer pipeline (default: llm-d parity) |
| Always-busiest | `always-busiest` | Pathological template — sends to the most loaded instance (for testing) |
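To make the first two strategies concrete, here is a minimal Python sketch (illustrative only; this is not BLIS source, and the function names are hypothetical):

```python
from itertools import count

def round_robin(counter, k):
    """Cyclic assignment: request N goes to instance N % k."""
    return next(counter) % k

def least_loaded(loads):
    """Pick the index of the instance with the lowest effective load."""
    return min(range(len(loads)), key=lambda i: loads[i])

counter = count()
assert [round_robin(counter, 4) for _ in range(6)] == [0, 1, 2, 3, 0, 1]
assert least_loaded([5, 2, 7, 2]) == 1  # ties break toward the lower index
```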
Weighted Scoring (Composable Pipeline)¶
The weighted routing policy is the most flexible. It combines multiple scoring dimensions, each evaluating instances on a [0, 1] scale.
Available Scorers¶
| Scorer | What It Measures | llm-d Equivalent |
|---|---|---|
| `prefix-affinity` | Proportional prefix match ratio via router-side block hash cache | `prefix-scorer` |
| `queue-depth` | Effective load: `QueueDepth + BatchSize + InFlightRequests` (min-max normalized) | `queue-scorer` |
| `kv-utilization` | Inverse KV utilization: `1 - KVUtilization` | `kv-cache-utilization-scorer` |
| `load-balance` | Inverse transform: `1 / (1 + effectiveLoad)` | BLIS-native (no llm-d equivalent) |
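The load-oriented scorers are simple transforms of instance state. A sketch of how they might be computed (not BLIS source; in particular, the direction of the min-max normalization is an assumption here, chosen so that 1.0 means least loaded):

```python
def queue_depth_scores(effective_loads):
    """Min-max normalize effective loads, then invert so that
    lower load scores higher (assumed direction)."""
    lo, hi = min(effective_loads), max(effective_loads)
    if hi == lo:
        return [1.0] * len(effective_loads)
    return [1 - (x - lo) / (hi - lo) for x in effective_loads]

def kv_utilization_score(kv_utilization):
    """Inverse KV utilization: 1 - KVUtilization."""
    return 1.0 - kv_utilization

def load_balance_score(effective_load):
    """Inverse transform: 1 / (1 + effectiveLoad)."""
    return 1.0 / (1.0 + effective_load)

assert queue_depth_scores([2, 6, 10]) == [1.0, 0.5, 0.0]
assert kv_utilization_score(0.25) == 0.75
assert load_balance_score(1) == 0.5
```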
Prefix-affinity is a scorer, not a standalone policy
The prefix-affinity scorer operates within the weighted routing pipeline, composed with load-balancing scorers. It uses a router-side PrefixCacheIndex with proportional block hash matching and LRU eviction. Always pair it with at least one load-aware scorer (queue-depth or kv-utilization) to prevent cold-start pile-on.
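A toy illustration of proportional prefix matching (strings stand in for block hashes; the real index matches block hashes and applies LRU eviction, which this sketch omits):

```python
def prefix_affinity(request_blocks, cached_blocks):
    """Fraction of the request's leading blocks already cached,
    matched in order. Illustrative, not BLIS source."""
    matched = 0
    for req, cached in zip(request_blocks, cached_blocks):
        if req != cached:
            break
        matched += 1
    return matched / len(request_blocks) if request_blocks else 0.0

assert prefix_affinity(["a", "b", "c", "d"], ["a", "b", "x"]) == 0.5
assert prefix_affinity(["a", "b"], ["a", "b", "c"]) == 1.0
```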
Default Profile¶
When --routing-scorers is not specified, the default profile is:
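The default weights are not reproduced inline in this extract; from the composite profile quoted under Staleness Impact below, the default is equivalent to passing (exact flag syntax assumed):

```shell
--routing-scorers "prefix-affinity:3,queue-depth:2,kv-utilization:2"
```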
This matches the llm-d Endpoint Picker scoring pipeline. Weights are relative — only ratios matter. [3, 2, 2] behaves identically to [0.43, 0.29, 0.29].
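The "only ratios matter" claim follows from dividing by the weight sum; a minimal sketch (hypothetical helper names and illustrative per-instance scores, not the BLIS implementation):

```python
def weighted_score(scores, weights):
    """Combine per-scorer [0, 1] scores; normalizing by the weight
    sum means only the ratios between weights matter."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# Illustrative scores from three scorers for two instances.
scores = {"inst-a": [0.9, 0.4, 0.6], "inst-b": [0.1, 0.8, 0.7]}

def pick(weights):
    return max(scores, key=lambda name: weighted_score(scores[name], weights))

# Proportionally scaled weights select the same instance.
assert pick([3, 2, 2]) == pick([0.43, 0.29, 0.29]) == "inst-a"
```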
Signal Freshness¶
Canonical source: Signal freshness tiers are specified in `docs/contributing/standards/invariants.md` (INV-7). The descriptions below provide additional user-facing context; `invariants.md` is authoritative if they diverge.
Not all routing signals are equally fresh
In production inference serving systems (e.g., llm-d), the router is a separate process from the inference engines. Some signals are maintained at the router level, while others require periodic reporting from instances. BLIS models this asymmetry.
Why Signals Have Different Staleness¶
Different signals originate from different places in the system:
- Router-local signals are maintained by the router itself — they're always current because the router controls them directly.
- Instance-internal signals live on the inference engine and must be communicated to the router — they're inherently stale by the reporting interval.
BLIS models three signal freshness tiers:
| Tier | Signals | Source | Freshness |
|---|---|---|---|
| Router-local | `InFlightRequests`, prefix cache index | Router increments `InFlightRequests` at dispatch, decrements at completion; prefix cache updated after each routing decision | Always fresh — router owns this state |
| Instance-reported (Immediate/Periodic) | `QueueDepth`, `BatchSize`, `KVUtilization`, `FreeKVBlocks`, `CacheHitRate` | Instance-internal state (scheduler queue, running batch, KV cache) | When `--snapshot-refresh-interval=0` (default): Immediate (read from instance at routing time). When `>0`: all Prometheus-sourced signals share the same Periodic refresh interval, matching real vLLM's single `/metrics` endpoint. |
DES semantics of 'Immediate' mode
"Immediate" means "re-read from the instance object at query time" — NOT "perfectly synchronized with the simulation clock." At the same clock tick, cluster events are processed before instance events (determinism rule). So a routing decision at time T sees QueueDepth that hasn't yet processed instance events at time T. This is a determinism mechanism (INV-6), not a freshness guarantee.
Staleness Impact¶
At high request rates, many routing decisions occur between KV utilization updates (step time varies by model — ~6ms for Qwen3-14B / H100 / TP=1 at low load, longer under batch saturation). If using kv-utilization:1 alone, all decisions within one step see the same stale utilization — this can cause severe load imbalance.
Safe zone for --snapshot-refresh-interval
Below 5ms (~1 step time): no degradation. At 10ms: +14% TTFT p99. At 100ms: +354%. The default composite profile (`prefix-affinity:3, queue-depth:2, kv-utilization:2`) is inherently resilient — queue-depth's Immediate signal corrects stale KV signals, mitigating ~99% of the effect.
When to Use Which Policy¶
| Workload | Recommended Policy | Why |
|---|---|---|
| Uniform traffic, no prefix sharing | `least-loaded`, or `weighted` with `queue-depth:1` | Load balance is the only signal that matters |
| RAG with shared system prompts | `weighted` with `prefix-affinity:3,queue-depth:1` | Prefix affinity maximizes KV cache reuse |
| Mixed SLO classes | `weighted` default + priority scheduling | Routing distributes load; scheduling prioritizes critical requests |
| Low traffic (< 10 req/s) | Any | All policies produce equivalent results within 5% |
Example: Comparing Policies¶
BLIS includes a routing comparison script:
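The invocation itself is not shown in this extract; assuming the script is run from the repository root, it would be:

```shell
./examples/routing-comparison.sh
```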
This runs 5 configurations and shows TTFT p99, target distribution, and throughput for each. See examples/routing-comparison.sh for the full script.
Further Reading¶
- Scheduling & Priority — instance-level request ordering
- Admission Control — the gate before routing
- Cluster Architecture — how the routing pipeline works internally
- Configuration Reference — all routing flags
- Metrics & Results — understanding trace summaries and regret analysis