Skip to content

BLIS System Invariants

Invariants are properties that must hold at all times during and after simulation. They are verified by invariant tests (see R7) and checked during self-audit (Step 4.75).

Hypothesis family mapping: INV-1 through INV-3, INV-5, and INV-6 belong to the Scheduler invariants (safety/liveness) family. INV-4 (KV cache conservation), INV-7 (signal freshness), INV-8 (work-conserving property), and INV-9 (oracle knowledge boundary) belong to the Structural model family. See docs/contributing/standards/experiments.md for hypothesis family definitions.

INV-1: Request Conservation

Statement: injected_requests == completed_requests + still_queued + still_running + dropped_unservable + timed_out at simulation end (all levels).

Cluster-level extension (issues #882, #1190, #1228, #1193, #1264): At cluster level, routing rejections, gateway queue, in-flight gateway eviction, TTL expiry, and encode-pool routing add additional buckets: injected_requests == completed_requests + still_queued + still_running + dropped_unservable + timed_out + routing_rejections + gateway_queue_depth + gateway_queue_shed + gateway_queue_rejected + gateway_evicted + gateway_expired + encode_routing_rejections. gateway_queue_depth counts requests still in the gateway queue at horizon. gateway_queue_shed counts requests shed (evicted victims) from the gateway queue due to capacity limits. gateway_queue_rejected counts requests rejected from the gateway queue when no displaceable victim exists (either no sheddable entries, or all sheddable entries have equal/higher priority than the incoming request). gateway_evicted counts requests evicted in-flight from instances when the system is saturated and higher-priority requests are waiting (Phase 4, #1228). gateway_expired counts requests removed from the gateway queue when their TTL expires (Phase 6, #1193); default off (TTL=0). encode_routing_rejections (GAP-4, #1264) counts requests rejected at the encode routing stage when the encode pool has zero routable instances; the term is always zero when --encode-instances 0 (default). Each terminal bucket is mutually exclusive — a request lands in exactly one. Single-instance simulations have no gateway queue and no encode pool; all six terms are always zero there.

Full pipeline: num_requests == injected_requests + rejected_requests (from anomaly counters).

Verification: sim/cluster/cluster_test.go — conservation tests. Conservation fields (still_queued, still_running, injected_requests) are included in CLI JSON output.

Evidence: Issue #183 — a silently-dropped request violated conservation for months.

Experimental validation: H12 confirmed conservation across 10 policy configurations (67 invariant checks) — including round-robin, least-loaded, weighted (multiple scorer configs), SJF, priority-FCFS, token-bucket admission, and always-busiest. H8 confirmed conservation under extreme KV pressure (15 configurations). Full preemption-path validation is blocked by the panic bug (#293).

Additional evidence (hardening wave): Issue #498, fix #504 — InjectArrival silently accepted requests with ArrivalTime > Horizon, registering them in Metrics.Requests but never firing the arrival event. This broke conservation accounting (LHS included the request, RHS never completed it). Fix: log warning on beyond-horizon injection.


INV-2: Request Lifecycle

Statement: Requests transition queued -> running -> completed. No invalid transitions. Requests not completed before horizon remain in current state.

Verification: State machine assertions in request processing code.


INV-3: Clock Monotonicity

Statement: Simulation clock never decreases. Every event's timestamp >= the previous event's timestamp.

Verification: Clock is advanced in the event loop only via min-heap extraction, which guarantees non-decreasing order.


INV-4: KV Cache Conservation

Statement: allocated_blocks + free_blocks = total_blocks at all times.

Verification: Checked after every allocation/deallocation. Check-then-act pre-check gate before any state mutation (vLLM parity); post-pre-check popFreeBlock() == nil panics (structurally unreachable in single-threaded DES). FreeBlockCnt maintained in lockstep by appendToFreeList/removeFromFreeList. verifyBlockConservation() provides independent free-list walk for debug-mode assertions.

Operational note (H8): KV cache pressure exhibits a sharp cliff, not gradual degradation. In H8's workload, performance was identical above ~2200 blocks and collapsed below it (4.7x TTFT P99 increase with just 4.5% fewer blocks). Below ~1000 blocks, the preempt-requeue cycle can livelock (see R19). Capacity planning formula: threshold ≈ rate / num_instances × (input_tokens + output_tokens) / block_size.

Additional evidence (hardening wave): Two KV conservation bugs discovered in March 2026: (1) Issue #492, fix #502 — prefill capacity pre-check over-estimated by up to 1 block (partial last-block fill not accounted for), causing false allocation failures that triggered unnecessary preemptions. (2) Issue #501, fix #506 — TieredKVCache CPU→GPU reload could produce an inverted range (newStart >= endIndex), causing a slice-bounds panic in block allocation. Both bugs directly affected the allocation/deallocation balance that INV-4 protects. (See also #519 in INV-8 — the range-loop livelock primarily violated the work-conserving property, not block-level conservation.)


INV-5: Causality

Statement: arrival_time <= enqueue_time <= schedule_time <= completion_time for every request.

Verification: Per-request metric timestamps recorded at each lifecycle stage. Invariant tests verify ordering for all completed requests.


INV-6: Determinism

Statement: Same seed must produce byte-identical stdout across runs.

Verification: Run same configuration twice with same seed; diff stdout. Wall-clock timing goes to stderr (not stdout).

Common violation sources: - Go map iteration feeding output ordering (R2) - Floating-point accumulation order dependencies - Wall-clock-dependent randomness (must use PartitionedRNG) - Stateful scorers with non-deterministic internal state


INV-7: Signal Freshness Hierarchy

Statement: Routing snapshot signals have tiered freshness due to DES event ordering and configurable staleness.

Signal Owner Freshness (interval=0) Freshness (interval>0) Updated By
InFlightRequests Cluster Synchronous Synchronous RoutingDecisionEvent.Execute() (increment), completion detection (decrement)
PreemptionCount Instance Immediate Periodic CachedSnapshotProvider.Snapshot() — routed through ObservabilityConfig.PreemptionCount like other instance signals
QueueDepth Instance Immediate Periodic QueuedEvent.Execute()
BatchSize Instance Immediate Periodic StepEvent.Execute()
KVUtilization Instance Immediate Periodic FormBatch()AllocateKVBlocks()
CacheHitRate Instance Immediate Periodic FormBatch()
cacheQueryFn (precise-prefix-cache, no-hit-lru) ¹ Instance (via CachedSnapshotProvider) Ground truth (synchronous) Periodic (CacheBlocks interval, default 50ms) CachedSnapshotProvider.RefreshCacheIfNeeded() in buildRouterState()
LoadingSnapshots (TotalKvCapacityTokens) ² Instance (direct accessor) Synchronous Synchronous buildRouterState() reads inst.TotalKvCapacityTokens() directly — bypasses snapshotProvider entirely; unaffected by --snapshot-refresh-interval or --cache-signal-delay

¹ cacheQueryFn freshness is governed by --cache-signal-delay (default 50ms), which maps to ObservabilityConfig.CacheBlocks. The "interval=0" / "interval>0" columns for this row refer to --cache-signal-delay. Cache block staleness is now managed by CachedSnapshotProvider alongside other signals (#1060).

² LoadingSnapshots carries Model, GPUType, TPDegree, CostPerHour, and TotalKvCapacityTokens; this row covers TotalKvCapacityTokens freshness because it is the only autoscaler-relevant field that is derived at runtime (from KVCache), though stable after construction (the others are literal config constants set at NewInstanceSimulator). All fields are always synchronous regardless of observability configuration. Used by DefaultCollector to populate ModelSignals.PendingTotalKvCapacityTokens for pending supply accounting in V2SaturationAnalyzer (#1109).

Design implication: When --snapshot-refresh-interval > 0 (default: 50000µs = 50ms, llm-d parity), all Prometheus-sourced signals (QueueDepth, BatchSize, KVUtilization) share the same scrape interval — matching real vLLM deployments where all three are exposed via the same /metrics endpoint. Set --snapshot-refresh-interval 0 for oracle/immediate mode. InFlightRequests remains synchronous (gateway-local counter, not Prometheus-sourced). When --cache-signal-delay > 0 (default: 50ms), prefix cache query closures use periodic snapshots of each instance's HashToBlock map, managed by CachedSnapshotProvider alongside other signal snapshots. The 50ms default models aggregate signal staleness from production llm-d. Set --cache-signal-delay 0 for oracle mode (live cache state).

EffectiveLoad() = QueueDepth + BatchSize + InFlightRequests. The synchronous InFlightRequests term compensates for Periodic staleness in the other two terms. The queue-depth scorer reads QueueDepth only (GIE parity); EffectiveLoad() is used by load-balance, least-loaded, always-busiest, and admission policies. The active-requests scorer reads InFlightRequests only (synchronous). The running-requests scorer reads BatchSize (Periodic/Immediate). The load-aware scorer reads QueueDepth only (Periodic/Immediate), with a linear threshold at 128.

Verification: H3 hypothesis experiment, H29 snapshot-staleness experiment (see hypothesis-archive branch).

Evidence: Issues #282, #283. At rate=5000, kv-utilization-only routing produces 200x worse distribution uniformity than queue-depth. Issue #463: unified Prometheus staleness model.


INV-8: Work-Conserving Property

Statement: After every step completion, if WaitQ.Len() > 0, a StepEvent must exist in the event queue. The simulator must not idle while there is work waiting.

Verification: sim/simulator_test.goTestWorkConserving_StepRestartsWhenWaitQNonEmpty. Deterministic test with MaxRunningReqs=1, two requests arriving simultaneously. Without the property, the second request is stranded forever (no arrival to trigger a new StepEvent). With the property, both complete.

Evidence: H-MMK experiment (PR #325) — without the work-conserving fix, W_q error was 151,000% at ρ=0.3. After fix, error dropped to 47% (remaining gap is discrete step processing, not a bug).

Additional evidence (hardening wave): Issue #349, fix #519 — Go range over mutable RunningBatch.Requests during FormBatch Phase 1 visited evicted requests, triggering 102K+ cascading preemptions with zero completions. The simulator never made forward progress (zero completed requests = INV-8 violation). See R21.

Code location: Search for // Work-conserving: comment in sim/simulator.go — the else branch of len(remaining) > 0 checks WaitQ.Len() > 0 and schedules a new StepEvent.

Hypothesis family: Structural model (same as INV-4, INV-7).


INV-9: Oracle Knowledge Boundary

Statement: Servability decisions — enqueue guard (EnqueueRequest), admission control (AdmissionPolicy), routing (RoutingPolicy) — must not read Request.OutputTokens or len(Request.OutputTokens). The control plane uses Request.MaxOutputLen (client-declared output budget) for sequence-length checks against MaxModelLen. When MaxOutputLen == 0 (no budget), only input length is checked; the proactive MaxModelLen cap in FormBatch (clamping to maxModelLen-1-ProgressIndex) and the completion boundary in processCompletions (PI >= maxModelLen-1) enforce output growth limits. Only the execution engine (executeBatchStep, processCompletions, recordRequestCompletion, FormBatch step planning) may access OutputTokens for token generation, completion detection, and per-step resource allocation.

Rationale: In real inference serving (vLLM), the engine does not know actual output length at admission time — only the client's declared max_tokens budget. BLIS's Request.OutputTokens is oracle knowledge (pre-determined for simulation). Using it for servability decisions would make the simulator's control plane behave differently from a real system, invalidating capacity planning results. See issue #567 ("Architectural Principle: Oracle Knowledge Boundary").

Scope: The boundary applies to servability decisions (admit/reject/route), not to all scheduler operations. FormBatch legitimately reads OutputTokens for decode-phase step planning (whether to allocate a decode token), which mirrors vLLM's scheduler reading sequence state for per-step execution. The distinction: "should this request enter the system?" (servability — no oracle) vs. "what should this request do in the current step?" (execution — oracle allowed).

Verification: sim/simulator_test.goTestEnqueueRequest_MaxOutputLen_OracleKnowledgeBoundary: a request with OutputTokens=1000 but MaxOutputLen=0 and MaxModelLen=512 is NOT rejected (input=200 < 512 passes input-only check), proving the enqueue guard does not peek at OutputTokens. Grep-based verification: admission.go, routing.go, routing_scorers.go, routing_prefix_scorer.go, scheduler.go, slo_priority.go contain zero references to OutputTokens.

Evidence: Issue #567 — the original implementation's BC-4 fallback (effectiveMaxOutput = len(r.OutputTokens)) violated this boundary. Fixed in the same PR after convergence review caught it.

Hypothesis family: Structural model (same as INV-4, INV-7, INV-8).


INV-10: Session Causality

Statement: For all rounds N in a closed-loop session: round[N+1].ArrivalTime >= round[N].CompletionTime + ThinkTimeUs. Boundary: ThinkTimeUs = 0 produces equality.

Verification: sim/workload/session_test.goTestSession_RoundGeneration_CorrectArrivalTime verifies the arrival time formula. The ThinkTimeUs=0 boundary is inherent in the formula.

Evidence: Design doc docs/plans/2026-03-13-client-behavior-model-design.md — INV-10 definition. Guaranteed by construction in SessionManager.OnComplete.

Hypothesis family: Scheduler invariants (safety/liveness) — causality chain for session rounds.


INV-11: Session Completeness

Statement: Every session reaches exactly one terminal state: completed (all rounds done), cancelled (a round timed out or was dropped), horizon-interrupted (simulation ended mid-session), or budget-exhausted (concurrency mode: global follow-up request cap reached). No session is silently abandoned.

Verification: sim/workload/session_test.go — tests cover all terminal paths: TestSession_TimeoutCancels_NoMoreRounds (cancelled), TestSession_FinalRound_Completes (completed), TestSession_BeyondHorizon_NotGenerated (horizon-interrupted), TestSession_DroppedFollowUp_CancelsSession (cancelled via drop). Budget-exhausted path verified via TestConcurrencyMode_EndToEnd_SessionFollowUps (budget exhaustion stops follow-up generation).

Evidence: Design doc INV-11 definition. The SessionManager.OnComplete method transitions sessions to exactly one terminal state before returning nil. The budget_exhausted state is reached when the shared follow-up budget (set via SetFollowUpBudget for --concurrency mode) is depleted — the session's unlimited-rounds flag would otherwise continue generating follow-ups, but the global cap takes precedence.

Hypothesis family: Structural model — session lifecycle completeness.


INV-12: Phase 1 Completeness Under Priority Preemption

Statement: After Phase 1 of FormBatch completes, every non-preempted running request in decode phase has NumNewTokens > 0, provided the token budget was not exhausted and MaxModelLen did not cap the request. No running request is silently skipped due to index drift from non-tail eviction.

Context: With --preemption-policy priority, the preemption victim may be at any index in the running batch (not just the tail). Removing an element at index i < reqIndex shifts subsequent elements left by one. Without the reqIndex -= adjustment correction (analog of vLLM scheduler.py:853 req_index -= 1), the Phase 1 loop skips the shifted element.

Verification: sim/batch_formation_test.goTestPreemption_Priority_Phase1Completeness: verifies that after non-tail eviction where victimIdx < reqIndex, ALL remaining running requests receive decode tokens (NumNewTokens > 0). The index adjustment is tested with [bg, crit, std] batch where bg is evicted at index 0 while processing crit at index 1.

Trivially satisfied for FCFS: With --preemption-policy fcfs (default), victims are always at the batch tail (victimIdx == len-1 >= reqIndex), so adjustment == 0 and no element skipping is possible.

Hypothesis family: Structural model (same as INV-4, INV-7, INV-8, INV-9).


INV-13: Run/Replay Parity

Statement: For any configuration supported by both blis run and blis replay, a trace exported via blis run --trace-output and replayed via blis replay --session-mode fixed with identical flags MUST produce identical per-request TTFT, E2E, and aggregate metrics. Features not yet supported by replay (autoscaler, node pools) MUST cause a logrus.Fatalf at startup when those flags are explicitly set or when a policy bundle containing those features is passed — never silent degradation or a mere warning.

Verification: cmd/replay_test.goTestINV13_RunReplayParity_PD verifies that a PD-disaggregated cluster produces matching RequestTTFTs and RequestE2Es maps when the same requests are run directly vs. through the trace-export-then-replay path. TestReplayCmd_AutoscalerBundleFatal and TestReplayCmd_NodePoolsBundleFatal verify fatal exit for unsupported features.

Evidence: cmd/replay.go::replayCmd.Run wires all 12 PD disaggregation fields to cluster.DeploymentConfig (same set as runCmd): PrefillInstances, DecodeInstances, SharedInstances, PDDecider, PDPrefixThreshold, PDTransferBandwidthGBps, PDTransferBaseLatencyMs, PDTransferContention, PrefillScorerConfigs, DecodeScorerConfigs, PrefillOverrides, DecodeOverrides. Autoscaler and node-pool checks use logrus.Fatalf to prevent silent divergence.

Hypothesis family: Scheduler invariants (safety/liveness) — same as INV-1, INV-5, INV-6.


PD Disaggregation Invariants

INV-PD-1: KV Completeness

Statement: For every disaggregated request, decode_enqueue_time >= kv_transfer_completion_time. A decode sub-request must not be enqueued before its KV transfer completes.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_RequestCompletesFullPath checks DecodeEnqueueTime >= TransferCompleteTime for every parent request. Runtime defensive check in KVTransferCompletedEvent.Execute().

Evidence: Both TransferCompleteTime and DecodeEnqueueTime are set in KVTransferCompletedEvent.Execute() at the same simulation tick (e.time), so the invariant holds by construction.

INV-PD-2: Pool Exclusivity

Statement: Prefill sub-requests route only to prefill pool instances; decode sub-requests route only to decode pool instances.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_PrefillRoutedToPrefillPool and TestDisaggregation_DecodeRoutedToDecodePool verify pool role for every parent request's prefill and decode instance assignments.

Evidence: buildPoolFilteredSnapshots(role) filters routing snapshots to only include instances of the specified pool role before passing to the routing policy.

INV-PD-3: Transfer Conservation

Statement: initiated_transfers == completed_transfers at simulation end, provided all transfers complete within the simulation horizon. At bounded horizons, the difference (initiated - completed) represents in-flight transfers accounted for in the pdInTransfer conservation correction (see INV-1 PD correction in cluster.go).

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_TransferConservation asserts equality and expected count (uses unbounded horizon).

Evidence: transfersInitiated incremented on every entry to KVTransferStartedEvent.Execute() — both the happy path (via scheduleTransferCompletion) and the drop-at-start path (via dropAtStart, issue #1343). transfersCompleted incremented on every entry to KVTransferCompletedEvent.Execute(). Every started event schedules exactly one completed event — successful reservations schedule a real completion at start + duration; drop-at-start schedules a zero-duration degenerate completion at the same tick so the counters stay paired even when decode-side reservation fails.

INV-PD-4: Phase Causality

Statement: For every disaggregated request: arrival <= prefill_enqueue <= prefill_complete <= transfer_start <= transfer_complete <= decode_enqueue <= completion.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_PhaseCausality checks the full causal chain for every parent request.

Evidence: Each phase transition is enforced by DES event ordering: earlier phases schedule later-phase events at time >= current_time.

INV-PD-5: Pool Stability

Statement: Pool membership is fixed at construction time and never changes during simulation.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_PoolStability compares PoolMembership() before and after Run().

Evidence: BuildPoolMembershipFromIndices is called once in NewClusterSimulator and stored in cs.poolMembership. No code path in Run() modifies this map.

INV-PD-6: Metric Map Parent Granularity

Statement: After Run() completes on a disaggregated cluster, every per-request metric map (RequestE2Es, RequestTTFTs, RequestITLs, RequestSchedulingDelays, RequestCompletionTimes, Requests) contains only parent-level request IDs. No key may have a _prefill or _decode suffix. Completed parents contribute exactly one entry per map (keyed by parent ID); dropped or incomplete parents contribute no entry.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_MetricProjection_NoSubRequestKeys checks all six maps for suffix-free keys; TestDisaggregation_MetricProjection_DroppedParent_NoSubRequestKeys verifies the invariant holds when decode KV allocation fails. TestDisaggregation_MetricProjection_NoOp verifies the projection is a no-op for non-disaggregated clusters.

Evidence: projectPDMetrics() in sim/cluster/cluster.go is called after aggregateMetrics() and the conservation correction. It unconditionally deletes the pfx and dec keys for every parent request, and conditionally inserts a parent-keyed entry only for completed parents (CompletionTime > 0 && DecodeInstanceID != "").

INV-PD-6b: CompletionTime Includes PostDecodeFixedOverhead

Statement: For all successfully decoded parent requests (DecodeInstanceID != ""), parent.CompletionTime equals the cluster clock at decode completion plus the decode instance's PostDecodeFixedOverhead(). For roofline, overhead is 0, so CompletionTime equals the raw cluster clock tick. For trained-physics, overhead is α₁ (≈ 777 µs), so CompletionTime exceeds the raw clock by that amount. This ensures that projectPDMetrics() computes RequestE2Es[parentID] = CompletionTime - ArrivalTime consistently with how recordRequestCompletion computes non-PD E2E (which also adds PostDecodeFixedOverhead). Note: the non-PD path applies the overhead conditionally when len(req.OutputTokens) > 0; the PD path applies it unconditionally, which is safe because decode sub-requests always inherit OutputTokens from the original request via KVTransferCompletedEvent.Execute.

Verification: sim/cluster/disaggregation_test.goTestDisaggregation_CompletionTime_IncludesNonZeroOverhead verifies that E2E_with_overhead − E2E_without_overhead == overhead exactly when overhead is non-zero (trained-physics, directly exercises the bug-fix site). TestDisaggregation_CompletionTime_GeqAllPriorPhaseTimestamps verifies CompletionTime >= DecodeEnqueueTime and CompletionTime >= TransferCompleteTime (phase causality preserved). TestDisaggregation_E2E_IncludesOverhead_ZeroOverheadRegression verifies RequestE2Es[parentID] == CompletionTime − ArrivalTime and E2E >= TTFT for zero-overhead backend (roofline).

Evidence: detectDecodeCompletions() in sim/cluster/cluster.go stamps parent.CompletionTime = c.clock + inst.PostDecodeFixedOverhead(). Fixed in issue #846.

INV-P2-1: Pool-Config Consistency

Statement: Per-pool hardware overrides produce a valid SimConfig for each pool role: zero-valued PoolOverrides is a no-op (backward-compatible), non-nil fields override only the specified fields, and the global SimConfig is never mutated.

Verification: sim/cluster/resolve_test.goTestINV_P2_1_PoolConfigConsistency verifies observable KV capacity differences between pools pre-simulation via FreeKVBlocks(); TestINV_P2_1_RequestConservation verifies INV-1 holds under heterogeneous pool configuration.

Evidence: ResolvePoolConfig performs a struct copy and applies only non-nil/non-zero overrides. resolveConfigForRole is called in the instance construction loop in NewClusterSimulator, before any simulation state is created.


INV-P2-2: Fair-Share KV Transfer Bandwidth

Statement: When --pd-transfer-contention is enabled, the effective bandwidth available to each concurrent KV transfer is total_bandwidth / active_transfers, where active_transfers is the count of transfers in flight at the moment the new transfer starts (inclusive of the new transfer). With a single transfer in flight, the full bandwidth is used (active_transfers == 1, divisor == 1). This invariant gates the transfer duration formula in KVTransferStartedEvent.Execute().

Verification: sim/cluster/transfer_contention_test.go: - TestTransferContention_INVP22_EffectiveBandwidthFormula — golden test for the N=1 duration (9 µs with 10 blocks at 10 GB/s) - TestTransferContention_INVP22_N2FormulaExact — golden test for the N=2 duration (17 µs with same payload at 5 GB/s effective) - TestTransferContention_INVP22_DivisorLaw — invariant test: duration(N) / duration(1) ≈ N for N ∈ {1,2,3,4,5,8,10} with monotonicity - TestTransferContention_INVP22_FairShareBandwidth — end-to-end: concurrent transfers record peak >= 1 when multiple requests arrive simultaneously

Evidence: PR9 (sim/cluster/pd_events.go, KVTransferStartedEvent.Execute()). Gated behind PDTransferContention flag (off by default for backward compatibility). The activeTransfers counter is incremented before the divisor is applied, ensuring the new transfer receives a fair share of the bandwidth with every other transfer currently in flight.