Cluster Architecture¶
This page describes how BLIS simulates multi-instance inference serving clusters. For single-instance simulation internals, see Core Engine.
Canonical sources: Signal freshness (INV-7) is defined in
docs/contributing/standards/invariants.md. If signal freshness descriptions here diverge,invariants.mdis authoritative.
Overview¶
A BLIS cluster consists of N independent inference instances orchestrated by a shared-clock event loop. Each incoming request passes through a three-stage pipeline — admission, routing, and per-instance processing — before metrics are aggregated across all instances.
flowchart TD
WG["Workload<br/>Generator"] -->|Requests| Adm{"Admission<br/>Policy"}
Adm -->|rejected| Rej["Rejected<br/>(counted)"]
Adm -->|admitted| GQ["Gateway Queue<br/>(opt: --flow-control)"]
GQ -->|dispatch| Rtr["Routing Policy<br/>(WeightedScoring)<br/>PA:3 QD:2 KV:2"]
style GQ fill:#e7f5ff,stroke:#90caf9
Snap["Routing Snapshots<br/>QueueDepth, BatchSize,<br/>KVUtil, InFlightRequests"] -.-> Rtr
Rtr --> I0
Rtr --> I1
Rtr --> IN
subgraph I0 ["Instance 0"]
I0WQ["WaitQueue → Batch Formation<br/>KV Cache → Latency Model"]
end
subgraph I1 ["Instance 1"]
I1Int[" · · · "]
end
subgraph IN ["Instance N"]
INInt["WaitQueue → Batch Formation<br/>KV Cache → Latency Model"]
end
I0 -.->|snapshots| Snap
I1 -.->|snapshots| Snap
IN -.->|snapshots| Snap
I0 --> Met["Metrics Aggregation<br/>+ Fitness Evaluation"]
I1 --> Met
IN --> Met
Met --> Out1["stdout (JSON)<br/>Deterministic"]
Met --> Out2["stderr (logrus)<br/>Diagnostics"]
style WG fill:#a5d8ff
style Adm fill:#ffec99
style Rej fill:#ffc9c9
style Rtr fill:#b2f2bb
style Snap fill:#e7f5ff
style I0 fill:#d0bfff
style IN fill:#d0bfff
style Met fill:#ffd8a8
style Out1 fill:#d0bfff
style Out2 fill:#f1f3f5
Shared-Clock Event Loop¶
The cluster simulator maintains a single global clock shared across all instances. At each iteration, it compares the earliest cluster-level event (request arrival, admission decision, routing decision) against the earliest per-instance event (step completion, queueing), and processes whichever is earlier.
Ordering rules:
- Cluster events at time T are processed before instance events at time T (cluster-first priority)
- When multiple instances have events at the same time, the instance with the lowest index goes first
- Within a single instance, events are ordered by (timestamp, priority, seqID) — the same three-key scheme used at the cluster level
The simulation terminates when the clock exceeds the configured horizon or no events remain.
Admission Pipeline¶
Admission is the first gate in the online routing pipeline. Every incoming request is evaluated by the admission policy before being passed to routing.
Built-in Admission Policies¶
| Policy | Behavior |
|---|---|
always-admit |
Accept all requests (default) |
token-bucket |
Rate-limiting via a token bucket with configurable capacity and refill rate |
reject-all |
Reject all requests (for pathological testing) |
Token bucket rate-limits by consuming tokens proportional to input length: each request consumes tokens equal to its input token count, tokens refill at a constant rate, and requests are rejected when the bucket has insufficient tokens. Capacity and refill rate are configured via --token-bucket-capacity and --token-bucket-refill-rate.
Rejected requests are counted in the output metrics but do not enter the routing pipeline. To add a new admission policy, see Extension Recipes. See Configuration Reference for flag details.
Routing Pipeline¶
Routing selects which instance receives an admitted request. The routing policy sees a RouterState containing routing snapshots for all instances and the current simulation clock.
Simple Routing Policies¶
| Policy | Selection Rule |
|---|---|
round-robin |
Cyclic instance assignment |
least-loaded |
Instance with minimum effective load |
always-busiest |
Instance with maximum load (for pathological testing) |
Effective load is defined as QueueDepth + BatchSize + InFlightRequests, where InFlightRequests counts requests that have been dispatched to an instance but not yet completed. This tracks the full dispatch-to-response lifecycle, matching real HTTP router behavior (llm-d, Envoy).
Weighted Scoring Policy¶
The weighted routing policy composes multiple scorers into a single routing decision. This is the recommended policy for production-like simulations and matches the architecture of real-world inference routers like llm-d's Endpoint Picker.
The routing decision follows this pipeline:
- Score: Each scorer produces a per-instance score in [0, 1]
- Clamp: Scores are clamped to [0, 1] (scorers should already produce values in this range)
- Weight: Each score is multiplied by its configured weight
- Sum: Weighted scores are summed across scorers for each instance
- Select: The instance with the highest total score is chosen (argmax)
Default weights: prefix-affinity:3, queue-depth:2, kv-utilization:2 (llm-d parity). Note: weights are normalized to sum to 1.0 before scoring, so only weight ratios matter — prefix-affinity:3,queue-depth:2 is identical to prefix-affinity:30,queue-depth:20. To add a new scorer, see Extension Recipes.
flowchart TD
Req["Request<br/>+ Input Tokens"] --> S1
Snap["Routing Snapshots<br/>(per instance)<br/>QD, BS, KVUtil"] --> S2
Snap --> S3
S1["prefix-affinity<br/>(prefix overlap proportion)"] --> W1["× weight (3)"]
S2["queue-depth<br/>(min-max norm of load)"] --> W2["× weight (2)"]
S3["kv-utilization<br/>(1 − KVUtilization)"] --> W3["× weight (2)"]
W1 --> Sum["Weighted Sum<br/>(per instance)<br/>3×PA + 2×QD + 2×KV"]
W2 --> Sum
W3 --> Sum
Sum --> Sel["Argmax<br/>→ Selected Instance"]
Sel -->|observer callback| Obs["Update prefix<br/>cache index"]
Obs -.->|stateful feedback| S1
style Req fill:#a5d8ff
style Snap fill:#e7f5ff
style S1 fill:#fff3e0
style S2 fill:#fff3e0
style S3 fill:#fff3e0
style Sel fill:#c8e6c9
style Obs fill:#ffecb3
Scorer Composition¶
Scorers are the building blocks of the weighted routing policy. Each scorer evaluates one signal dimension across all instances.
Built-in Scorers¶
| Scorer | Signal | Score Computation | Notes |
|---|---|---|---|
prefix-affinity |
Prefix cache overlap | Proportion of request's block hashes found in instance's cache index | Stateful: updates cache index after routing via observer |
precise-prefix-cache |
Actual KV cache state | Min-max normalization of cached prefix block count per instance | Stateless: queries instance KV cache directly via CacheQueryFn |
no-hit-lru |
Routing history | Cold requests (zero cache hits): LRU positional scoring; warm requests: neutral 0.5 | Stateful: observer tracks LRU order on cold routing only |
queue-depth |
Instance load | Min-max normalization of effective load (lower load = higher score) | Stateless |
kv-utilization |
Memory pressure | 1 - KVUtilization (lower utilization = higher score) |
Stateless |
load-balance |
Instance load | 1 / (1 + EffectiveLoad) (decreasing function of load) |
Stateless |
Stateful vs. Stateless Scorers¶
Most scorers are stateless — they compute scores purely from the current routing snapshot. Two scorers are stateful: prefix-affinity updates its router-side prefix cache index after each routing decision via an observer callback, and no-hit-lru tracks LRU ordering of cold request routing to distribute cache-miss traffic. The precise-prefix-cache scorer queries actual instance KV cache state directly (via CacheQueryFn) rather than maintaining a router-side approximation.
Router-Side Prefix Cache Index¶
The prefix-affinity scorer maintains a lightweight approximate cache of per-instance block hash history. This is separate from the actual per-instance KV cache and serves as a routing-time estimate of cache hit probability.
Key properties: - Per-instance LRU with bounded capacity (default: 10,000 blocks) - Hierarchical block hashing: each block's hash chains with the prior block's hash for semantic prefix matching - Updated synchronously after each routing decision (synchronous freshness) - Score = proportion of request's block hashes found in the instance's cache index
Signal Freshness¶
Routing decisions depend on instance state signals with different freshness guarantees. Understanding freshness tiers is important for interpreting simulation results under high load.
Freshness Tiers¶
| Tier | Signals | Update Mechanism | Staleness |
|---|---|---|---|
| Synchronous (router-local) | InFlightRequests, prefix cache index | Router increments InFlightRequests at dispatch, decrements at completion; prefix cache updated after each routing decision | None — router owns this state |
| Immediate/Periodic (instance-reported) | QueueDepth, BatchSize, KVUtilization, FreeKVBlocks, CacheHitRate | When --snapshot-refresh-interval=0: Immediate (read from instance at routing time). When >0: all Prometheus-sourced signals share the same Periodic refresh interval, matching real vLLM's single /metrics endpoint (#463). |
Immediate: current within tick. Periodic: stale up to interval. |
The --snapshot-refresh-interval flag controls how frequently Prometheus-sourced signals (QueueDepth, BatchSize, KVUtilization) are re-read from instances. Setting it to 0 (default) makes all signals Immediate. Non-zero values introduce realistic staleness matching real vLLM Prometheus scrape intervals.
Counterfactual Regret¶
When decision tracing is enabled (--trace-level decisions), BLIS computes counterfactual regret for each routing decision. This measures how much better an alternative routing choice could have been.
Computation:
1. Score all candidate instances using the routing policy's scoring function
2. Rank candidates by score (descending)
3. Compute regret = best_score - chosen_score (clamped to >= 0)
4. Record the top-k candidates with their scores and instance state
Interpretation:
- For the weighted policy, regret is typically zero because the chosen instance IS the highest-scored candidate
- For round-robin, regret is non-zero because the policy ignores load signals
- Higher regret does not necessarily imply worse performance — round-robin can achieve lower tail latency than least-loaded through perfect distribution uniformity
Configure counterfactual analysis via --counterfactual-k (number of candidates to record per decision).
Metrics Aggregation¶
After simulation completes, per-instance metrics are aggregated into a unified cluster result:
| Metric Category | Aggregation |
|---|---|
| TTFT, E2E | Combined across all instances. JSON output: mean, p90, p95, p99. Internal Distribution: also p50, min, max (used by fitness evaluation). ITL: mean, p90, p95, p99 in JSON output. |
| Throughput | Total output tokens / simulation time; total requests / simulation time |
| Request counts | Sum of completed, queued, running, preempted, dropped across instances |
| Per-SLO-class | Separate distributions per SLO class (for multi-tenant analysis) |
| Fairness | Jain Fairness Index across tenant throughputs |
Fitness Evaluation¶
When --fitness-weights are configured, BLIS computes a single fitness score from the aggregated metrics. This enables automated policy comparison:
- Latency metrics (TTFT, E2E) are normalized via
1/(1 + value/1000)wherevalueis in ticks (microseconds) and 1000 ticks = 1ms is the reference point (lower latency = higher score). For example, TTFT of 50,000 ticks (50ms) maps to1/(1+50) = 0.0196. - Throughput metrics are normalized via
value/(value + reference)wherereferenceRPS = 100.0andreferenceTPS = 10000.0(higher throughput = higher score) - Normalized scores are multiplied by their configured weights and summed
- Higher fitness = better performance
Note: the normalization compresses raw metric differences significantly. A 38% TTFT improvement might map to only an 8% fitness score difference. Always examine raw metrics alongside fitness scores.
Instance Isolation¶
Each instance in the cluster is a fully independent single-instance simulator with its own: - Event queue and simulation state - Wait queue and running batch - KV cache (block allocation, prefix caching, LRU eviction) - Latency model - Scheduling and priority policies
Instances share the global clock but have no direct communication. All inter-instance coordination happens through the routing layer (via routing snapshots). This matches the architecture of real inference serving clusters where instances are independent processes.
Online Routing Pipeline Walkthrough¶
A complete request lifecycle through the cluster pipeline:
- Generation: The workload generator creates a request with arrival time, input tokens, and output tokens
- ClusterArrivalEvent: Scheduled at the request's arrival time
- AdmissionDecisionEvent: Admission policy evaluates the request
- If rejected: request counted, pipeline ends
- If admitted: proceed to routing (with optional
--admission-latencydelay) - Gateway Queue (optional,
--flow-control): When flow control is enabled, admitted requests enter a priority-ordered gateway queue instead of routing immediately. A saturation detector evaluates cluster capacity on each completion; when saturation < 1.0, the highest-priority request is dequeued and routed with freshRouterState(late binding). When disabled (default), this step is skipped. - RoutingDecisionEvent: Routing policy selects target instance
- InFlightRequests for target instance incremented
- Request injected into target instance's wait queue (with optional
--routing-latencydelay) - QueuedEvent: Fired by target instance when request enters its queue
- If no StepEvent exists, one is scheduled (work-conserving)
- Per-instance processing: Request follows the single-instance lifecycle (see Core Engine)
- Completion/Drop: InFlightRequests decremented when request completes or is dropped as unservable
- Metrics: Request metrics (TTFT, E2E, ITL) recorded at instance level, aggregated at cluster level
Observe / Replay / Calibrate Pipeline¶
Alongside the DES simulation pipeline described above, BLIS provides an offline validation workflow for comparing simulator predictions against real server behavior. This pipeline has three stages, of which only Replay engages the DES event loop. Observe is an HTTP workload dispatcher that sends requests to a real inference server and records per-request timing. Calibrate is a statistical comparison tool that computes per-metric MAPE, Pearson R, and quality grades.
flowchart LR
WS["WorkloadSpec<br/>or --rate flags"] --> Obs
SU["Server URL"] --> Obs
Obs["blis observe<br/>(HTTP client)"] --> TV["TraceV2<br/>(YAML + CSV)"]
TV --> Rep
MC["--model + sim flags"] --> Rep
Rep["blis replay<br/>(DES engine)"] --> SR["SimResult JSON<br/>(--results-path)"]
TV --> Cal
SR --> Cal
Cal["blis calibrate<br/>(statistical comparison)"] --> Rpt["Calibration<br/>Report"]
style Obs fill:#a5d8ff
style Rep fill:#d0bfff
style Cal fill:#b2f2bb
style TV fill:#fff3e0
style SR fill:#fff3e0
style Rpt fill:#c8e6c9
For the full pipeline guide with worked examples, see Observe / Replay / Calibrate. For flag details, see Configuration Reference.