BLIS System Invariants¶
Invariants are properties that must hold at all times during and after simulation. They are verified by invariant tests (see R7) and checked during self-audit (Step 4.75).
Hypothesis family mapping: INV-1 through INV-3, INV-5, and INV-6 belong to the Scheduler invariants (safety/liveness) family. INV-4 (KV cache conservation), INV-7 (signal freshness), INV-8 (work-conserving property), and INV-9 (oracle knowledge boundary) belong to the Structural model family. See docs/contributing/standards/experiments.md for hypothesis family definitions.
INV-1: Request Conservation¶
Statement: injected_requests == completed_requests + still_queued + still_running + dropped_unservable + timed_out at simulation end (all levels).
Cluster-level extension (Phase 1B-1b, issue #882): At cluster level, the deferred queue, routing rejections, and gateway queue add additional buckets: injected_requests == completed_requests + still_queued + still_running + dropped_unservable + timed_out + deferred_horizon_interrupted + routing_rejections + gateway_queue_depth + gateway_queue_shed. deferred_horizon_interrupted counts Batch/Background requests still parked in the deferred queue when the simulation horizon is reached. gateway_queue_depth counts requests still in the gateway queue at horizon. gateway_queue_shed counts requests shed from the gateway queue due to capacity limits. Single-instance simulations have no deferred queue or gateway queue; all three terms are always zero there.
Full pipeline: num_requests == injected_requests + rejected_requests (from anomaly counters).
Verification: sim/cluster/cluster_test.go and sim/cluster/cluster_deferred_test.go — conservation tests. Conservation fields (still_queued, still_running, injected_requests) are included in CLI JSON output. DeferredHorizonInterrupted is included in RawMetrics and printed by the CLI anomaly block when non-zero.
Evidence: Issue #183 — a silently-dropped request violated conservation for months.
Experimental validation: H12 confirmed conservation across 10 policy configurations (67 invariant checks) — including round-robin, least-loaded, weighted (multiple scorer configs), SJF, priority-FCFS, token-bucket admission, and always-busiest. H8 confirmed conservation under extreme KV pressure (15 configurations). Full preemption-path validation is blocked by the panic bug (#293).
Additional evidence (hardening wave): Issue #498, fix #504 — InjectArrival silently accepted requests with ArrivalTime > Horizon, registering them in Metrics.Requests but never firing the arrival event. This broke conservation accounting (LHS included the request, RHS never completed it). Fix: log warning on beyond-horizon injection.
INV-2: Request Lifecycle¶
Statement: Requests transition queued -> running -> completed. No invalid transitions. Requests not completed before horizon remain in current state.
Verification: State machine assertions in request processing code.
INV-3: Clock Monotonicity¶
Statement: Simulation clock never decreases. Every event's timestamp >= the previous event's timestamp.
Verification: Clock is advanced in the event loop only via min-heap extraction, which guarantees non-decreasing order.
INV-4: KV Cache Conservation¶
Statement: allocated_blocks + free_blocks = total_blocks at all times.
Verification: Checked after every allocation/deallocation. Transactional allocation with rollback on mid-loop failure (R5).
Operational note (H8): KV cache pressure exhibits a sharp cliff, not gradual degradation. In H8's workload, performance was identical above ~2200 blocks and collapsed below it (4.7x TTFT P99 increase with just 4.5% fewer blocks). Below ~1000 blocks, the preempt-requeue cycle can livelock (see R19). Capacity planning formula: threshold ≈ rate / num_instances × (input_tokens + output_tokens) / block_size.
Additional evidence (hardening wave): Two KV conservation bugs discovered in March 2026: (1) Issue #492, fix #502 — prefill capacity pre-check over-estimated by up to 1 block (partial last-block fill not accounted for), causing false allocation failures that triggered unnecessary preemptions. (2) Issue #501, fix #506 — TieredKVCache CPU→GPU reload could produce an inverted range (newStart >= endIndex), causing a slice-bounds panic in block allocation. Both bugs directly affected the allocation/deallocation balance that INV-4 protects. (See also #519 in INV-8 — the range-loop livelock primarily violated the work-conserving property, not block-level conservation.)
INV-5: Causality¶
Statement: arrival_time <= enqueue_time <= schedule_time <= completion_time for every request.
Verification: Per-request metric timestamps recorded at each lifecycle stage. Invariant tests verify ordering for all completed requests.
INV-6: Determinism¶
Statement: Same seed must produce byte-identical stdout across runs.
Verification: Run same configuration twice with same seed; diff stdout. Wall-clock timing goes to stderr (not stdout).
Common violation sources: - Go map iteration feeding output ordering (R2) - Floating-point accumulation order dependencies - Wall-clock-dependent randomness (must use PartitionedRNG) - Stateful scorers with non-deterministic internal state
INV-7: Signal Freshness Hierarchy¶
Statement: Routing snapshot signals have tiered freshness due to DES event ordering and configurable staleness.
| Signal | Owner | Freshness (interval=0) | Freshness (interval>0) | Updated By |
|---|---|---|---|---|
| InFlightRequests | Cluster | Synchronous | Synchronous | RoutingDecisionEvent.Execute() (increment), completion detection (decrement) |
| QueueDepth | Instance | Immediate | Periodic | QueuedEvent.Execute() |
| BatchSize | Instance | Immediate | Periodic | StepEvent.Execute() |
| KVUtilization | Instance | Immediate | Periodic | FormBatch() → AllocateKVBlocks() |
| CacheHitRate | Instance | Immediate | Periodic | FormBatch() |
| cacheQueryFn (precise-prefix-cache, no-hit-lru) ¹ | Instance (via StaleCacheIndex) | Ground truth (synchronous) | Demand-triggered (CacheSignalDelay interval, checked at routing decisions) | StaleCacheIndex.RefreshIfNeeded() in buildRouterState() |
¹ cacheQueryFn freshness is governed by --cache-signal-delay (default 2s), not --snapshot-refresh-interval. The "interval=0" / "interval>0" columns for this row refer to --cache-signal-delay.
Design implication: When --snapshot-refresh-interval > 0, all Prometheus-sourced signals (QueueDepth, BatchSize, KVUtilization) share the same scrape interval — matching real vLLM deployments where all three are exposed via the same /metrics endpoint. InFlightRequests remains synchronous (gateway-local counter, not Prometheus-sourced). When --cache-signal-delay > 0 (default: 2s), prefix cache query closures use a separate periodic snapshot of each instance's HashToBlock map, modeling asynchronous KV event propagation from production llm-d. The 2s default matches llm-d's defaultSpeculativeTTL — the blind spot between routing decision and KV event arrival via ZMQ. Set --cache-signal-delay 0 for oracle mode (live cache state).
EffectiveLoad() = QueueDepth + BatchSize + InFlightRequests. The synchronous InFlightRequests term compensates for Periodic staleness in the other two terms.
Verification: H3 hypothesis experiment, H29 snapshot-staleness experiment (see hypothesis-archive branch).
Evidence: Issues #282, #283. At rate=5000, kv-utilization-only routing produces 200x worse distribution uniformity than queue-depth. Issue #463: unified Prometheus staleness model.
INV-8: Work-Conserving Property¶
Statement: After every step completion, if WaitQ.Len() > 0, a StepEvent must exist in the event queue. The simulator must not idle while there is work waiting.
Verification: sim/simulator_test.go — TestWorkConserving_StepRestartsWhenWaitQNonEmpty. Deterministic test with MaxRunningReqs=1, two requests arriving simultaneously. Without the property, the second request is stranded forever (no arrival to trigger a new StepEvent). With the property, both complete.
Evidence: H-MMK experiment (PR #325) — without the work-conserving fix, W_q error was 151,000% at ρ=0.3. After fix, error dropped to 47% (remaining gap is discrete step processing, not a bug).
Additional evidence (hardening wave): Issue #349, fix #519 — Go range over mutable RunningBatch.Requests during FormBatch Phase 1 visited evicted requests, triggering 102K+ cascading preemptions with zero completions. The simulator never made forward progress (zero completed requests = INV-8 violation). See R21.
Code location: Search for // Work-conserving: comment in sim/simulator.go — the else branch of len(remaining) > 0 checks WaitQ.Len() > 0 and schedules a new StepEvent.
Hypothesis family: Structural model (same as INV-4, INV-7).
INV-9: Oracle Knowledge Boundary¶
Statement: Servability decisions — enqueue guard (EnqueueRequest), admission control (AdmissionPolicy), routing (RoutingPolicy), and priority scoring (PriorityPolicy) — must not read Request.OutputTokens or len(Request.OutputTokens). The control plane uses Request.MaxOutputLen (client-declared output budget) for sequence-length checks against MaxModelLen. When MaxOutputLen == 0 (no budget), only input length is checked; the proactive MaxModelLen cap in FormBatch (clamping to maxModelLen-1-ProgressIndex) and the completion boundary in processCompletions (PI >= maxModelLen-1) enforce output growth limits. Only the execution engine (executeBatchStep, processCompletions, recordRequestCompletion, FormBatch step planning) may access OutputTokens for token generation, completion detection, and per-step resource allocation.
Rationale: In real inference serving (vLLM), the engine does not know actual output length at admission time — only the client's declared max_tokens budget. BLIS's Request.OutputTokens is oracle knowledge (pre-determined for simulation). Using it for servability decisions would make the simulator's control plane behave differently from a real system, invalidating capacity planning results. See issue #567 ("Architectural Principle: Oracle Knowledge Boundary").
Scope: The boundary applies to servability decisions (admit/reject/route), not to all scheduler operations. FormBatch legitimately reads OutputTokens for decode-phase step planning (whether to allocate a decode token), which mirrors vLLM's scheduler reading sequence state for per-step execution. The distinction: "should this request enter the system?" (servability — no oracle) vs. "what should this request do in the current step?" (execution — oracle allowed).
Verification: sim/simulator_test.go — TestEnqueueRequest_MaxOutputLen_OracleKnowledgeBoundary: a request with OutputTokens=1000 but MaxOutputLen=0 and MaxModelLen=512 is NOT rejected (input=200 < 512 passes input-only check), proving the enqueue guard does not peek at OutputTokens. Grep-based verification: admission.go, routing.go, routing_scorers.go, routing_prefix_scorer.go, scheduler.go, priority.go contain zero references to OutputTokens.
Evidence: Issue #567 — the original implementation's BC-4 fallback (effectiveMaxOutput = len(r.OutputTokens)) violated this boundary. Fixed in the same PR after convergence review caught it.
Hypothesis family: Structural model (same as INV-4, INV-7, INV-8).
INV-10: Session Causality¶
Statement: For all rounds N in a closed-loop session: round[N+1].ArrivalTime >= round[N].CompletionTime + ThinkTimeUs. Boundary: ThinkTimeUs = 0 produces equality.
Verification: sim/workload/session_test.go — TestSession_RoundGeneration_CorrectArrivalTime verifies the arrival time formula. The ThinkTimeUs=0 boundary is inherent in the formula.
Evidence: Design doc docs/plans/2026-03-13-client-behavior-model-design.md — INV-10 definition. Guaranteed by construction in SessionManager.OnComplete.
Hypothesis family: Scheduler invariants (safety/liveness) — causality chain for session rounds.
INV-11: Session Completeness¶
Statement: Every session reaches exactly one terminal state: completed (all rounds done), cancelled (a round timed out or was dropped), horizon-interrupted (simulation ended mid-session), or budget-exhausted (concurrency mode: global follow-up request cap reached). No session is silently abandoned.
Verification: sim/workload/session_test.go — tests cover all terminal paths: TestSession_TimeoutCancels_NoMoreRounds (cancelled), TestSession_FinalRound_Completes (completed), TestSession_BeyondHorizon_NotGenerated (horizon-interrupted), TestSession_DroppedFollowUp_CancelsSession (cancelled via drop). Budget-exhausted path verified via TestConcurrencyMode_EndToEnd_SessionFollowUps (budget exhaustion stops follow-up generation).
Evidence: Design doc INV-11 definition. The SessionManager.OnComplete method transitions sessions to exactly one terminal state before returning nil. The budget_exhausted state is reached when the shared follow-up budget (set via SetFollowUpBudget for --concurrency mode) is depleted — the session's unlimited-rounds flag would otherwise continue generating follow-ups, but the global cap takes precedence.
Hypothesis family: Structural model — session lifecycle completeness.
PD Disaggregation Invariants¶
INV-PD-1: KV Completeness¶
Statement: For every disaggregated request, decode_enqueue_time >= kv_transfer_completion_time. A decode sub-request must not be enqueued before its KV transfer completes.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_RequestCompletesFullPath checks DecodeEnqueueTime >= TransferCompleteTime for every parent request. Runtime defensive check in DecodeRoutingEvent.Execute().
Evidence: By event priority ordering: KVTransferCompletedEvent (priority 6) schedules DecodeRoutingEvent (priority 7) at the same timestamp. DecodeEnqueueTime is set in DecodeRoutingEvent which fires after transfer completion.
INV-PD-2: Pool Exclusivity¶
Statement: Prefill sub-requests route only to prefill pool instances; decode sub-requests route only to decode pool instances.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_PrefillRoutedToPrefillPool and TestDisaggregation_DecodeRoutedToDecodePool verify pool role for every parent request's prefill and decode instance assignments.
Evidence: buildPoolFilteredSnapshots(role) filters routing snapshots to only include instances of the specified pool role before passing to the routing policy.
INV-PD-3: Transfer Conservation¶
Statement: initiated_transfers == completed_transfers at simulation end, provided all transfers complete within the simulation horizon. At bounded horizons, the difference (initiated - completed) represents in-flight transfers accounted for in the pdInTransfer conservation correction (see INV-1 PD correction in cluster.go).
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_TransferConservation asserts equality and expected count (uses unbounded horizon).
Evidence: transfersInitiated incremented in KVTransferStartedEvent.Execute(), transfersCompleted incremented in KVTransferCompletedEvent.Execute(). Every started event schedules exactly one completed event.
INV-PD-4: Phase Causality¶
Statement: For every disaggregated request: arrival <= prefill_enqueue <= prefill_complete <= transfer_start <= transfer_complete <= decode_enqueue <= completion.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_PhaseCausality checks the full causal chain for every parent request.
Evidence: Each phase transition is enforced by DES event ordering: earlier phases schedule later-phase events at time >= current_time.
INV-PD-5: Pool Stability¶
Statement: Pool membership is fixed at construction time and never changes during simulation.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_PoolStability compares PoolMembership() before and after Run().
Evidence: BuildPoolMembershipFromIndices is called once in NewClusterSimulator and stored in cs.poolMembership. No code path in Run() modifies this map.
INV-PD-6: Metric Map Parent Granularity¶
Statement: After Run() completes on a disaggregated cluster, every per-request metric map (RequestE2Es, RequestTTFTs, RequestITLs, RequestSchedulingDelays, RequestCompletionTimes, Requests) contains only parent-level request IDs. No key may have a _prefill or _decode suffix. Completed parents contribute exactly one entry per map (keyed by parent ID); dropped or incomplete parents contribute no entry.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_MetricProjection_NoSubRequestKeys checks all six maps for suffix-free keys; TestDisaggregation_MetricProjection_DroppedParent_NoSubRequestKeys verifies the invariant holds when decode KV allocation fails. TestDisaggregation_MetricProjection_NoOp verifies the projection is a no-op for non-disaggregated clusters.
Evidence: projectPDMetrics() in sim/cluster/cluster.go is called after aggregateMetrics() and the conservation correction. It unconditionally deletes the pfx and dec keys for every parent request, and conditionally inserts a parent-keyed entry only for completed parents (CompletionTime > 0 && DecodeInstanceID != "").
INV-PD-6b: CompletionTime Includes PostDecodeFixedOverhead¶
Statement: For all successfully decoded parent requests (DecodeInstanceID != ""), parent.CompletionTime equals the cluster clock at decode completion plus the decode instance's PostDecodeFixedOverhead(). For backends where overhead is 0 (blackbox, roofline, cross-model), CompletionTime equals the raw cluster clock tick. For trained-roofline (overhead ≈ 1850 µs), CompletionTime exceeds the raw clock by the overhead amount. This ensures that projectPDMetrics() computes RequestE2Es[parentID] = CompletionTime - ArrivalTime consistently with how recordRequestCompletion computes non-PD E2E (which also adds PostDecodeFixedOverhead). Note: the non-PD path applies the overhead conditionally when len(req.OutputTokens) > 0; the PD path applies it unconditionally, which is safe because decode sub-requests always inherit OutputTokens from the original request via KVTransferCompletedEvent.Execute.
Verification: sim/cluster/disaggregation_test.go — TestDisaggregation_CompletionTime_IncludesNonZeroOverhead verifies that E2E_with_overhead − E2E_without_overhead == overhead exactly for trained-roofline clusters (directly exercises the bug-fix site). TestDisaggregation_CompletionTime_GeqAllPriorPhaseTimestamps verifies CompletionTime >= DecodeEnqueueTime and CompletionTime >= TransferCompleteTime (phase causality preserved). TestDisaggregation_E2E_IncludesOverhead_ZeroOverheadRegression verifies RequestE2Es[parentID] == CompletionTime − ArrivalTime and E2E >= TTFT for blackbox (overhead=0) clusters.
Evidence: detectDecodeCompletions() in sim/cluster/cluster.go stamps parent.CompletionTime = c.clock + inst.PostDecodeFixedOverhead(). Fixed in issue #846.
INV-P2-1: Pool-Config Consistency¶
Statement: Per-pool hardware overrides produce a valid SimConfig for each pool role: zero-valued PoolOverrides is a no-op (backward-compatible), non-nil fields override only the specified fields, and the global SimConfig is never mutated.
Verification: sim/cluster/resolve_test.go — TestINV_P2_1_PoolConfigConsistency verifies observable KV capacity differences between pools pre-simulation via FreeKVBlocks(); TestINV_P2_1_RequestConservation verifies INV-1 holds under heterogeneous pool configuration.
Evidence: ResolvePoolConfig performs a struct copy and applies only non-nil/non-zero overrides. resolveConfigForRole is called in the instance construction loop in NewClusterSimulator, before any simulation state is created.
INV-P2-2: Fair-Share KV Transfer Bandwidth¶
Statement: When --pd-transfer-contention is enabled, the effective bandwidth available to each concurrent KV transfer is total_bandwidth / active_transfers, where active_transfers is the count of transfers in flight at the moment the new transfer starts (inclusive of the new transfer). With a single transfer in flight, the full bandwidth is used (active_transfers == 1, divisor == 1). This invariant gates the transfer duration formula in KVTransferStartedEvent.Execute().
Verification: sim/cluster/transfer_contention_test.go:
- TestTransferContention_INVP22_EffectiveBandwidthFormula — golden test for the N=1 duration (9 µs with 10 blocks at 10 GB/s)
- TestTransferContention_INVP22_N2FormulaExact — golden test for the N=2 duration (17 µs with same payload at 5 GB/s effective)
- TestTransferContention_INVP22_DivisorLaw — invariant test: duration(N) / duration(1) ≈ N for N ∈ {1,2,3,4,5,8,10} with monotonicity
- TestTransferContention_INVP22_FairShareBandwidth — end-to-end: concurrent transfers record peak >= 1 when multiple requests arrive simultaneously
Evidence: PR9 (sim/cluster/pd_events.go, KVTransferStartedEvent.Execute()). Gated behind PDTransferContention flag (off by default for backward compatibility). The activeTransfers counter is incremented before the divisor is applied, ensuring the new transfer receives a fair share of the bandwidth with every other transfer currently in flight.