Cluster Architecture¶

This page describes how BLIS simulates multi-instance inference serving clusters. For single-instance simulation internals, see Core Engine.

Canonical sources: Signal freshness (INV-7) is defined in docs/contributing/standards/invariants.md. If signal freshness descriptions here diverge, invariants.md is authoritative.

Overview¶

A BLIS cluster consists of N independent inference instances orchestrated by a shared-clock event loop. Each incoming request passes through a three-stage pipeline — admission, routing, and per-instance processing — before metrics are aggregated across all instances.

flowchart TD
    WG["Workload<br/>Generator"] -->|Requests| Adm{"Admission<br/>Policy"}
    Adm -->|rejected| Rej["Rejected<br/>(counted)"]
    Adm -->|admitted| GQ["Gateway Queue<br/>(opt: --flow-control)"]
    GQ -->|dispatch| Rtr["Routing Policy<br/>(WeightedScoring)<br/>PA:3 QD:2 KV:2"]

    style GQ fill:#e7f5ff,stroke:#90caf9

    Snap["Routing Snapshots<br/>QueueDepth, BatchSize,<br/>KVUtil, InFlightRequests"] -.-> Rtr

    Rtr --> I0
    Rtr --> I1
    Rtr --> IN

    subgraph I0 ["Instance 0"]
        I0WQ["WaitQueue → Batch Formation<br/>KV Cache → Latency Model"]
    end

    subgraph I1 ["Instance 1"]
        I1Int[" · · · "]
    end

    subgraph IN ["Instance N"]
        INInt["WaitQueue → Batch Formation<br/>KV Cache → Latency Model"]
    end

    I0 -.->|snapshots| Snap
    I1 -.->|snapshots| Snap
    IN -.->|snapshots| Snap

    I0 --> Met["Metrics Aggregation<br/>+ Fitness Evaluation"]
    I1 --> Met
    IN --> Met

    Met --> Out1["stdout (JSON)<br/>Deterministic"]
    Met --> Out2["stderr (logrus)<br/>Diagnostics"]

    style WG fill:#a5d8ff
    style Adm fill:#ffec99
    style Rej fill:#ffc9c9
    style Rtr fill:#b2f2bb
    style Snap fill:#e7f5ff
    style I0 fill:#d0bfff
    style IN fill:#d0bfff
    style Met fill:#ffd8a8
    style Out1 fill:#d0bfff
    style Out2 fill:#f1f3f5

Shared-Clock Event Loop¶

The cluster simulator maintains a single global clock shared across all instances. At each iteration, it compares the earliest cluster-level event (request arrival, admission decision, routing decision) against the earliest per-instance event (step completion, queueing), and processes whichever is earlier.

Ordering rules: - Cluster events at time T are processed before instance events at time T (cluster-first priority) - When multiple instances have events at the same time, the instance with the lowest index goes first - Within a single instance, events are ordered by (timestamp, priority, seqID) — the same three-key scheme used at the cluster level

The simulation terminates when the clock exceeds the configured horizon or no events remain.

Admission Pipeline¶

Admission is the first gate in the online routing pipeline. Every incoming request is evaluated by the admission policy before being passed to routing.

Built-in Admission Policies¶

Policy	Behavior
`always-admit`	Accept all requests (default)
`token-bucket`	Rate-limiting via a token bucket with configurable capacity and refill rate
`reject-all`	Reject all requests (for pathological testing)

Token bucket rate-limits by consuming tokens proportional to input length: each request consumes tokens equal to its input token count, tokens refill at a constant rate, and requests are rejected when the bucket has insufficient tokens. Capacity and refill rate are configured via --token-bucket-capacity and --token-bucket-refill-rate.

Rejected requests are counted in the output metrics but do not enter the routing pipeline. To add a new admission policy, see Extension Recipes. See Configuration Reference for flag details.

Routing Pipeline¶

Routing selects which instance receives an admitted request. The routing policy sees a RouterState containing routing snapshots for all instances and the current simulation clock.

Simple Routing Policies¶

Policy	Selection Rule
`round-robin`	Cyclic instance assignment
`least-loaded`	Instance with minimum effective load
`always-busiest`	Instance with maximum load (for pathological testing)

Effective load is defined as QueueDepth + BatchSize + InFlightRequests, where InFlightRequests counts requests that have been dispatched to an instance but not yet completed. This tracks the full dispatch-to-response lifecycle, matching real HTTP router behavior (llm-d, Envoy).

Weighted Scoring Policy¶

The weighted routing policy composes multiple scorers into a single routing decision. This is the recommended policy for production-like simulations and matches the architecture of real-world inference routers like llm-d's Endpoint Picker.

The routing decision follows this pipeline:

Score: Each scorer produces a per-instance score in [0, 1]
Clamp: Scores are clamped to [0, 1] (scorers should already produce values in this range)
Weight: Each score is multiplied by its configured weight
Sum: Weighted scores are summed across scorers for each instance
Select: The instance with the highest total score is chosen (argmax)

Default weights: prefix-affinity:3, queue-depth:2, kv-utilization:2 (llm-d parity). Note: weights are normalized to sum to 1.0 before scoring, so only weight ratios matter — prefix-affinity:3,queue-depth:2 is identical to prefix-affinity:30,queue-depth:20. To add a new scorer, see Extension Recipes.

flowchart TD
    Req["Request<br/>+ Input Tokens"] --> S1
    Snap["Routing Snapshots<br/>(per instance)<br/>QD, BS, KVUtil"] --> S2
    Snap --> S3

    S1["prefix-affinity<br/>(prefix overlap proportion)"] --> W1["× weight (3)"]
    S2["queue-depth<br/>(min-max norm of load)"] --> W2["× weight (2)"]
    S3["kv-utilization<br/>(1 − KVUtilization)"] --> W3["× weight (2)"]

    W1 --> Sum["Weighted Sum<br/>(per instance)<br/>3×PA + 2×QD + 2×KV"]
    W2 --> Sum
    W3 --> Sum

    Sum --> Sel["Argmax<br/>→ Selected Instance"]
    Sel -->|observer callback| Obs["Update prefix<br/>cache index"]
    Obs -.->|stateful feedback| S1

    style Req fill:#a5d8ff
    style Snap fill:#e7f5ff
    style S1 fill:#fff3e0
    style S2 fill:#fff3e0
    style S3 fill:#fff3e0
    style Sel fill:#c8e6c9
    style Obs fill:#ffecb3

Scorer Composition¶

Scorers are the building blocks of the weighted routing policy. Each scorer evaluates one signal dimension across all instances.

Built-in Scorers¶

Scorer	Signal	Score Computation	Notes
`prefix-affinity`	Prefix cache overlap	Proportion of request's block hashes found in instance's cache index	Stateful: updates cache index after routing via observer
`precise-prefix-cache`	Actual KV cache state	Min-max normalization of cached prefix block count per instance	Stateless: queries instance KV cache directly via `CacheQueryFn`
`no-hit-lru`	Routing history	Cold requests (zero cache hits): LRU positional scoring; warm requests: neutral 0.5	Stateful: observer tracks LRU order on cold routing only
`queue-depth`	Instance load	Min-max normalization of effective load (lower load = higher score)	Stateless
`kv-utilization`	Memory pressure	`1 - KVUtilization` (lower utilization = higher score)	Stateless
`load-balance`	Instance load	`1 / (1 + EffectiveLoad)` (decreasing function of load)	Stateless

Stateful vs. Stateless Scorers¶

Most scorers are stateless — they compute scores purely from the current routing snapshot. Two scorers are stateful: prefix-affinity updates its router-side prefix cache index after each routing decision via an observer callback, and no-hit-lru tracks LRU ordering of cold request routing to distribute cache-miss traffic. The precise-prefix-cache scorer queries actual instance KV cache state directly (via CacheQueryFn) rather than maintaining a router-side approximation.

Router-Side Prefix Cache Index¶

The prefix-affinity scorer maintains a lightweight approximate cache of per-instance block hash history. This is separate from the actual per-instance KV cache and serves as a routing-time estimate of cache hit probability.

Key properties: - Per-instance LRU with bounded capacity (default: 10,000 blocks) - Hierarchical block hashing: each block's hash chains with the prior block's hash for semantic prefix matching - Updated synchronously after each routing decision (synchronous freshness) - Score = proportion of request's block hashes found in the instance's cache index

Signal Freshness¶

Routing decisions depend on instance state signals with different freshness guarantees. Understanding freshness tiers is important for interpreting simulation results under high load.

Freshness Tiers¶

Tier	Signals	Update Mechanism	Staleness
Synchronous (router-local)	InFlightRequests, prefix cache index	Router increments InFlightRequests at dispatch, decrements at completion; prefix cache updated after each routing decision	None — router owns this state
Immediate/Periodic (instance-reported)	QueueDepth, BatchSize, KVUtilization, FreeKVBlocks, CacheHitRate	When `--snapshot-refresh-interval=0`: Immediate (read from instance at routing time). When `>0`: all Prometheus-sourced signals share the same Periodic refresh interval, matching real vLLM's single `/metrics` endpoint (#463).	Immediate: current within tick. Periodic: stale up to interval.

The --snapshot-refresh-interval flag controls how frequently Prometheus-sourced signals (QueueDepth, BatchSize, KVUtilization) are re-read from instances. Setting it to 0 (default) makes all signals Immediate. Non-zero values introduce realistic staleness matching real vLLM Prometheus scrape intervals.

Counterfactual Regret¶

When decision tracing is enabled (--trace-level decisions), BLIS computes counterfactual regret for each routing decision. This measures how much better an alternative routing choice could have been.

Computation: 1. Score all candidate instances using the routing policy's scoring function 2. Rank candidates by score (descending) 3. Compute regret = best_score - chosen_score (clamped to >= 0) 4. Record the top-k candidates with their scores and instance state

Interpretation: - For the weighted policy, regret is typically zero because the chosen instance IS the highest-scored candidate - For round-robin, regret is non-zero because the policy ignores load signals - Higher regret does not necessarily imply worse performance — round-robin can achieve lower tail latency than least-loaded through perfect distribution uniformity

Configure counterfactual analysis via --counterfactual-k (number of candidates to record per decision).

Metrics Aggregation¶

After simulation completes, per-instance metrics are aggregated into a unified cluster result:

Metric Category	Aggregation
TTFT, E2E	Combined across all instances. JSON output: mean, p90, p95, p99. Internal `Distribution`: also p50, min, max (used by fitness evaluation). ITL: mean, p90, p95, p99 in JSON output.
Throughput	Total output tokens / simulation time; total requests / simulation time
Request counts	Sum of completed, queued, running, preempted, dropped across instances
Per-SLO-class	Separate distributions per SLO class (for multi-tenant analysis)
Fairness	Jain Fairness Index across tenant throughputs

Fitness Evaluation¶

When --fitness-weights are configured, BLIS computes a single fitness score from the aggregated metrics. This enables automated policy comparison:

Latency metrics (TTFT, E2E) are normalized via 1/(1 + value/1000) where value is in ticks (microseconds) and 1000 ticks = 1ms is the reference point (lower latency = higher score). For example, TTFT of 50,000 ticks (50ms) maps to 1/(1+50) = 0.0196.
Throughput metrics are normalized via value/(value + reference) where referenceRPS = 100.0 and referenceTPS = 10000.0 (higher throughput = higher score)
Normalized scores are multiplied by their configured weights and summed
Higher fitness = better performance

Note: the normalization compresses raw metric differences significantly. A 38% TTFT improvement might map to only an 8% fitness score difference. Always examine raw metrics alongside fitness scores.

Instance Isolation¶

Each instance in the cluster is a fully independent single-instance simulator with its own: - Event queue and simulation state - Wait queue and running batch - KV cache (block allocation, prefix caching, LRU eviction) - Latency model - Scheduling and priority policies

Instances share the global clock but have no direct communication. All inter-instance coordination happens through the routing layer (via routing snapshots). This matches the architecture of real inference serving clusters where instances are independent processes.

Online Routing Pipeline Walkthrough¶

A complete request lifecycle through the cluster pipeline:

Generation: The workload generator creates a request with arrival time, input tokens, and output tokens
ClusterArrivalEvent: Scheduled at the request's arrival time
AdmissionDecisionEvent: Admission policy evaluates the request
If rejected: request counted, pipeline ends
If admitted: proceed to routing (with optional --admission-latency delay)
Gateway Queue (optional, --flow-control): When flow control is enabled, admitted requests enter a priority-ordered gateway queue instead of routing immediately. A saturation detector evaluates cluster capacity on each completion; when saturation < 1.0, the highest-priority request is dequeued and routed with fresh RouterState (late binding). When disabled (default), this step is skipped.
RoutingDecisionEvent: Routing policy selects target instance
InFlightRequests for target instance incremented
Request injected into target instance's wait queue (with optional --routing-latency delay)
QueuedEvent: Fired by target instance when request enters its queue
If no StepEvent exists, one is scheduled (work-conserving)
Per-instance processing: Request follows the single-instance lifecycle (see Core Engine)
Completion/Drop: InFlightRequests decremented when request completes or is dropped as unservable
Metrics: Request metrics (TTFT, E2E, ITL) recorded at instance level, aggregated at cluster level

Observe / Replay / Calibrate Pipeline¶

Alongside the DES simulation pipeline described above, BLIS provides an offline validation workflow for comparing simulator predictions against real server behavior. This pipeline has three stages, of which only Replay engages the DES event loop. Observe is an HTTP workload dispatcher that sends requests to a real inference server and records per-request timing. Calibrate is a statistical comparison tool that computes per-metric MAPE, Pearson R, and quality grades.

flowchart LR
    WS["WorkloadSpec<br/>or --rate flags"] --> Obs
    SU["Server URL"] --> Obs
    Obs["blis observe<br/>(HTTP client)"] --> TV["TraceV2<br/>(YAML + CSV)"]

    TV --> Rep
    MC["--model + sim flags"] --> Rep
    Rep["blis replay<br/>(DES engine)"] --> SR["SimResult JSON<br/>(--results-path)"]

    TV --> Cal
    SR --> Cal
    Cal["blis calibrate<br/>(statistical comparison)"] --> Rpt["Calibration<br/>Report"]

    style Obs fill:#a5d8ff
    style Rep fill:#d0bfff
    style Cal fill:#b2f2bb
    style TV fill:#fff3e0
    style SR fill:#fff3e0
    style Rpt fill:#c8e6c9

For the full pipeline guide with worked examples, see Observe / Replay / Calibrate. For flag details, see Configuration Reference.