Routing Policies¶
This guide covers how BLIS distributes incoming requests across instances in cluster mode. For single-instance simulation, routing is not applicable. For instance-level request ordering, see Scheduling & Priority.
```bash
# Quick example: compare round-robin vs weighted routing
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500 \
  --routing-policy weighted --trace-level decisions --summarize-trace
```
Available Policies¶
| Policy | Flag Value | Strategy |
|---|---|---|
| Round-robin | `round-robin` | Cyclic assignment — request N goes to instance N % k |
| Least-loaded | `least-loaded` | Send to the instance with the lowest `EffectiveLoad` |
| Weighted | `weighted` | Composable multi-scorer pipeline (default: llm-d parity) |
| Always-busiest | `always-busiest` | Pathological template — sends to the most loaded instance (for testing) |
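To make the first two strategies concrete, here is a minimal Python sketch (illustrative only; this is not BLIS source, and the function names are hypothetical):

```python
from itertools import count

def round_robin(counter, k):
    """Cyclic assignment: request N goes to instance N % k."""
    return next(counter) % k

def least_loaded(loads):
    """Pick the index of the instance with the lowest effective load."""
    return min(range(len(loads)), key=lambda i: loads[i])

counter = count()
assert [round_robin(counter, 4) for _ in range(6)] == [0, 1, 2, 3, 0, 1]
assert least_loaded([5, 2, 7, 2]) == 1  # ties break toward the lower index
```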
Weighted Scoring (Composable Pipeline)¶
The weighted routing policy is the most flexible. It combines multiple scoring dimensions, each evaluating instances on a [0, 1] scale.
Available Scorers¶
| Scorer | What It Measures | llm-d Equivalent |
|---|---|---|
| `prefix-affinity` | Proportional prefix match ratio via router-side block hash cache | `prefix-scorer` |
| `queue-depth` | Effective load: `QueueDepth + BatchSize + InFlightRequests` (min-max normalized) | `queue-scorer` |
| `kv-utilization` | Inverse KV utilization: `1 - KVUtilization` | `kv-cache-utilization-scorer` |
| `load-balance` | Inverse transform: `1 / (1 + effectiveLoad)` | BLIS-native (no llm-d equivalent) |
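The load-oriented scorers are simple transforms of instance state. A sketch of how they might be computed (not BLIS source; in particular, the direction of the min-max normalization is an assumption here, chosen so that 1.0 means least loaded):

```python
def queue_depth_scores(effective_loads):
    """Min-max normalize effective loads, then invert so that
    lower load scores higher (assumed direction)."""
    lo, hi = min(effective_loads), max(effective_loads)
    if hi == lo:
        return [1.0] * len(effective_loads)
    return [1 - (x - lo) / (hi - lo) for x in effective_loads]

def kv_utilization_score(kv_utilization):
    """Inverse KV utilization: 1 - KVUtilization."""
    return 1.0 - kv_utilization

def load_balance_score(effective_load):
    """Inverse transform: 1 / (1 + effectiveLoad)."""
    return 1.0 / (1.0 + effective_load)

assert queue_depth_scores([2, 6, 10]) == [1.0, 0.5, 0.0]
assert kv_utilization_score(0.25) == 0.75
assert load_balance_score(1) == 0.5
```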
Prefix-affinity is a scorer, not a standalone policy
The prefix-affinity scorer operates within the weighted routing pipeline, composed with load-balancing scorers. It uses a router-side PrefixCacheIndex with proportional block hash matching and LRU eviction. Always pair it with at least one load-aware scorer (queue-depth or kv-utilization) to prevent cold-start pile-on.
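A toy illustration of proportional prefix matching (strings stand in for block hashes; the real index matches block hashes and applies LRU eviction, which this sketch omits):

```python
def prefix_affinity(request_blocks, cached_blocks):
    """Fraction of the request's leading blocks already cached,
    matched in order. Illustrative, not BLIS source."""
    matched = 0
    for req, cached in zip(request_blocks, cached_blocks):
        if req != cached:
            break
        matched += 1
    return matched / len(request_blocks) if request_blocks else 0.0

assert prefix_affinity(["a", "b", "c", "d"], ["a", "b", "x"]) == 0.5
assert prefix_affinity(["a", "b"], ["a", "b", "c"]) == 1.0
```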
Default Profile¶
When --routing-scorers is not specified, the default profile is:
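The default weights are not reproduced inline in this extract; from the composite profile quoted under Staleness Impact below, the default is equivalent to passing (exact flag syntax assumed):

```shell
--routing-scorers "prefix-affinity:3,queue-depth:2,kv-utilization:2"
```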
This matches the llm-d Endpoint Picker scoring pipeline. Weights are relative — only ratios matter. [3, 2, 2] behaves identically to [0.43, 0.29, 0.29].
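The "only ratios matter" claim follows from dividing by the weight sum; a minimal sketch (hypothetical helper names and illustrative per-instance scores, not the BLIS implementation):

```python
def weighted_score(scores, weights):
    """Combine per-scorer [0, 1] scores; normalizing by the weight
    sum means only the ratios between weights matter."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

# Illustrative scores from three scorers for two instances.
scores = {"inst-a": [0.9, 0.4, 0.6], "inst-b": [0.1, 0.8, 0.7]}

def pick(weights):
    return max(scores, key=lambda name: weighted_score(scores[name], weights))

# Proportionally scaled weights select the same instance.
assert pick([3, 2, 2]) == pick([0.43, 0.29, 0.29]) == "inst-a"
```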
Signal Freshness¶
Canonical source: Signal freshness tiers are specified in `docs/contributing/standards/invariants.md` (INV-7). The descriptions below provide additional user-facing context; `invariants.md` is authoritative if they diverge.
Not all routing signals are equally fresh
In production inference serving systems (e.g., llm-d), the router is a separate process from the inference engines. Some signals are maintained at the router level, while others require periodic reporting from instances. BLIS models this asymmetry.
Why Signals Have Different Staleness¶
Different signals originate from different places in the system:
- Router-local signals are maintained by the router itself — they're always current because the router controls them directly.
- Instance-internal signals live on the inference engine and must be communicated to the router — they're inherently stale by the reporting interval.
BLIS models three signal freshness tiers:
| Tier | Signals | Source | Freshness |
|---|---|---|---|
| Router-local | `InFlightRequests`, prefix cache index | Router increments `InFlightRequests` at dispatch, decrements at completion; prefix cache updated after each routing decision | Always fresh — router owns this state |
| Instance-reported (Immediate/Periodic) | `QueueDepth`, `BatchSize`, `KVUtilization`, `FreeKVBlocks`, `CacheHitRate` | Instance-internal state (scheduler queue, running batch, KV cache) | When `--snapshot-refresh-interval=0` (default): Immediate (read from instance at routing time). When `>0`: all Prometheus-sourced signals share the same Periodic refresh interval, matching real vLLM's single `/metrics` endpoint. |
DES semantics of 'Immediate' mode
"Immediate" means "re-read from the instance object at query time" — NOT "perfectly synchronized with the simulation clock." At the same clock tick, cluster events are processed before instance events (determinism rule). So a routing decision at time T sees QueueDepth that hasn't yet processed instance events at time T. This is a determinism mechanism (INV-6), not a freshness guarantee.
Staleness Impact¶
At high request rates, many routing decisions occur between KV utilization updates (step time varies by model — ~6ms for Qwen3-14B / H100 / TP=1 at low load, longer under batch saturation). If using kv-utilization:1 alone, all decisions within one step see the same stale utilization — this can cause severe load imbalance.
Safe zone for --snapshot-refresh-interval
Below 5ms (~1 step time): no degradation. At 10ms: +14% TTFT p99. At 100ms: +354%. The default composite profile (`prefix-affinity:3, queue-depth:2, kv-utilization:2`) is inherently resilient — queue-depth's Immediate signal corrects stale KV signals, mitigating ~99% of the effect.
When to Use Which Policy¶
| Workload | Recommended Policy | Why |
|---|---|---|
| Uniform traffic, no prefix sharing | `least-loaded`, or `weighted` with `queue-depth:1` | Load balance is the only signal that matters |
| RAG with shared system prompts | `weighted` with `prefix-affinity:3,queue-depth:1` | Prefix affinity maximizes KV cache reuse |
| Mixed SLO classes | `weighted` default + priority scheduling | Routing distributes load; scheduling prioritizes critical requests |
| Low traffic (< 10 req/s) | Any | All policies produce equivalent results within 5% |
Example: Comparing Policies¶
BLIS includes a routing comparison script:
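The invocation itself is not shown in this extract; assuming the script is run from the repository root, it would be:

```shell
./examples/routing-comparison.sh
```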
This runs 5 configurations and shows TTFT p99, target distribution, and throughput for each. See examples/routing-comparison.sh for the full script.
Further Reading¶
- Scheduling & Priority — instance-level request ordering
- Admission Control — the gate before routing
- Cluster Architecture — how the routing pipeline works internally
- Configuration Reference — all routing flags
- Metrics & Results — understanding trace summaries and regret analysis