Core Engine¶
This page describes BLIS's single-instance discrete event simulation engine. For multi-instance cluster orchestration, see Cluster Architecture.
Canonical sources: System invariants (INV-1 through INV-11) are defined in
docs/contributing/standards/invariants.md. If invariant descriptions here diverge,invariants.mdis authoritative.
Overview¶
Each BLIS instance is a self-contained discrete event simulator. The engine maintains an event queue (min-heap), processes events in timestamp order, and advances a simulation clock. The core loop is:
while events remain and clock < horizon:
event = pop earliest event from queue
advance clock to event.timestamp
execute event (may produce new events)
The engine models the full vLLM inference pipeline: request arrival, queueing, batch formation, step execution, KV cache management, and latency estimation — all without real GPU hardware.
Event Queue¶
The event queue is a min-heap ordered by event timestamp. Events represent state transitions in the simulation:
| Event Type | Trigger | Effect |
|---|---|---|
ArrivalEvent |
Request enters system | Computes queueing delay (alpha overhead), schedules QueuedEvent |
QueuedEvent |
Request enters wait queue | Adds request to wait queue; if no StepEvent exists, schedules one (work-conserving) |
StepEvent |
Batch ready for execution | Runs the 4-phase Step() cycle (see below) |
ScheduledEvent |
Request moves to running batch | Timeline marker for tracing (scheduling delay recorded in scheduleBatch) |
RequestLeftEvent |
Request completes | Timeline marker for tracing (E2E metrics recorded in processCompletions) |
Clock monotonicity (INV-3): The simulation clock never decreases. Events are processed in strictly non-decreasing timestamp order.
Work-conserving (INV-8): After every step completion, if the wait queue is non-empty, a StepEvent must exist in the event queue. The simulator never idles while work is waiting.
flowchart TD
EQ["Event Queue (min-heap)<br/>Ordered by timestamp"] -->|Pop min| Clock["Advance clock<br/>to event.timestamp"]
Clock --> Exec["Execute Event Handler<br/>Arrival → Queued → WaitQ<br/>Step → Step() phases<br/>Completion → metrics"]
Exec -->|produces new events| EQ
Exec -->|clock ≥ horizon<br/>or queue empty| Done["Simulation<br/>Complete"]
style EQ fill:#fff9db
style Clock fill:#a5d8ff
style Exec fill:#b2f2bb
style Done fill:#c8e6c9
Step Phases¶
The Step() function is a 4-line orchestrator that delegates to four phases:
flowchart TD
P1["1. scheduleBatch<br/>Priority assignment<br/>Queue reordering<br/>Batch formation"]
P2["2. executeBatchStep<br/>Latency estimation<br/>Prefill/decode progress<br/>TTFT at boundary"]
P3["3. processCompletions<br/>Release KV blocks<br/>Record E2E metrics<br/>Schedule RequestLeft"]
P4["4. scheduleNextStep<br/>Batch has work → Step<br/>Queue has work → Step<br/>Both empty → idle"]
P1 --> P2 --> P3 --> P4
style P1 fill:#d0bfff
style P2 fill:#d0bfff
style P3 fill:#d0bfff
style P4 fill:#d0bfff
Phase 1: Schedule Batch (scheduleBatch)¶
- Assign priority scores to all queued requests via the priority policy
- Reorder the wait queue via the scheduling policy (e.g., FCFS, SJF, priority-based)
- Invoke batch formation to select which requests enter the running batch
- Record preemption metrics for any evicted requests
- Schedule
ScheduledEventfor any newly admitted requests
Phase 2: Execute Batch Step (executeBatchStep)¶
- Compute step time via the latency model based on batch composition
- For each request in the batch:
- If in prefill phase: advance the progress index through input tokens (respecting chunked prefill limits)
- If in decode phase: advance the progress index by one output token
- Record TTFT at the prefill-to-decode boundary
- Return the computed step time to Phase 4 for scheduling the next
StepEvent(the clock itself advances only at the event-loop level when the next event is popped)
Phase 3: Process Completions (processCompletions)¶
Identify completed requests (all output tokens generated), release their KV blocks, record E2E metrics, and schedule RequestLeftEvent. Note: TTFT is already recorded in Phase 2 at the prefill-to-decode boundary, so Phase 3 only handles E2E and completion bookkeeping. This separation (TTFT in Phase 2, E2E in Phase 3) is the "two-pass" design that ensures TTFT is recorded before E2E.
Phase 4: Schedule Next Step (scheduleNextStep)¶
- If the running batch still has requests, schedule the next
StepEventatnow + stepTime - If the running batch is empty but the wait queue has requests, schedule the next
StepEventatnow + stepTime(work-conserving) - If both are empty, do nothing (next event will be a future arrival)
Request Lifecycle¶
Requests follow a linear state machine with one exception (preemption):
stateDiagram-v2
[*] --> Arrival
Arrival --> Queued : alpha queueing delay
Arrival --> DroppedUnservable : exceeds MaxModelLen or KV capacity
state "Queued" as Queued
note right of Queued
EnqueueTime recorded
Scheduler orders queue
end note
Queued --> Running : BatchFormation selects
state "Running" as Running
note right of Running
ScheduleTime recorded
TTFT at prefill→decode boundary
KV blocks allocated
ITL recorded per decode step
end note
Running --> Completed
Running --> Queued : Preemption (KV eviction, re-enqueue front)
state "Completed" as Completed
note right of Completed
CompletionTime recorded
KV blocks released
E2E = CompletionTime − ArrivalTime
end note
Completed --> RequestLeft
RequestLeft --> [*]
DroppedUnservable --> [*]
INV-5 (Causality): ArrivalTime ≤ EnqueueTime ≤ ScheduleTime ≤ CompletionTime
States¶
| State | Description |
|---|---|
| Queued | In wait queue, not yet in running batch. Ordered by scheduling policy. |
| Running | In running batch, actively being processed. Progressing through prefill or decode. |
| Completed | All output tokens generated. KV blocks released. Metrics recorded. |
Key Timestamps¶
These are the conceptual timestamps in a request's lifecycle. Some are stored as struct fields (e.g., ArrivalTime, FirstTokenTime), while others are computed at metric recording time.
| Timestamp | When Recorded | Used For |
|---|---|---|
| Arrival time | Request creation | E2E and scheduling delay baseline |
| Enqueue time | After alpha queueing delay | Conceptual start of wait queue residence |
| Schedule time | Batch formation selects request | Scheduling delay = time in wait queue |
| First token time | End of prefill phase | TTFT = FirstTokenTime (stored on Request) |
| Completion time | All tokens generated | E2E = FirstTokenTime + sum(ITLs) |
Causality invariant (INV-5): arrival_time <= enqueue_time <= schedule_time <= completion_time
Preemption¶
When KV cache pressure forces eviction, running requests are preempted:
1. Request is removed from the running batch (tail-first eviction)
2. Its KV blocks are freed
3. It is re-enqueued at the front of the wait queue (not the back)
4. A debug log is emitted and PreemptionCount is incremented
Preempted requests reset to the beginning of prefill (ProgressIndex = 0) and their KV blocks are freed. However, the freed blocks' prefix hashes are preserved in the KV cache's free list — when the request is re-scheduled, prefix caching may find these blocks (if not yet evicted by LRU), reducing recomputation. This matches vLLM's "recompute" preemption mode.
Dropped Requests¶
Requests are dropped as unservable at enqueue time (incrementing DroppedUnservable) via two guards:
- MaxModelLen guard — when
--max-model-lenis set, requests whose total sequence length exceeds the context window are rejected. When the request declares an output budget (MaxOutputLen > 0), the check isinput + budget > maxModelLen. Otherwise, input length alone is checked (vLLM defaultsmax_tokenstomax_model_len - seq_len; the runtime stop inprocessCompletionshandles output growth). - KV capacity guard — requests whose input tokens require more KV blocks than the total cache capacity are rejected. This prevents livelock where the simulator would endlessly preempt and re-enqueue a request that can never fit.
Both guards fire before the request enters the wait queue, mirroring vLLM's pre-engine rejection. Additionally, when --max-model-len is set, a proactive cap in FormBatch clamps token scheduling to maxModelLen - 1 - ProgressIndex (matching vLLM scheduler.py:773-774), a decode guard in executeBatchStep prevents phantom token generation when 0 tokens are allocated, and processCompletions force-completes any request whose ProgressIndex reaches MaxModelLen - 1. Output tokens per length-capped request: MaxModelLen - 1 - len(InputTokens).
Batch Formation¶
BLIS models vLLM's continuous batching algorithm through the VLLMBatchFormation implementation.
Two-Phase Algorithm¶
Phase 1: Continuing Requests
Process requests already in the running batch:
- Apply chunked prefill limits (--long-prefill-token-threshold)
- Allocate token budget for decode tokens
- If KV allocation fails for a continuing request, preempt requests from the batch tail to free blocks
Phase 2: New Requests
Dequeue requests from the wait queue:
- Compute cached prefix blocks (prefix caching reduces allocation needs)
- Allocate KV blocks for uncached prefix tokens being processed this step (bounded by chunked prefill threshold and remaining token budget)
- Stop dequeuing when: max batch size reached (--max-num-running-reqs), allocation fails (cache full), token budget exhausted, or a preemption occurred during Phase 1
Constraints¶
| Constraint | Flag | Effect |
|---|---|---|
| Max batch size | --max-num-running-reqs |
Limits number of concurrent requests in the running batch |
| Token budget | --max-num-scheduled-tokens |
Limits total new tokens across all running requests per step |
| Chunked prefill | --long-prefill-token-threshold |
Splits long prefills across multiple steps |
Preemption Strategy¶
When KV allocation fails for a continuing request: 1. Evict requests from the batch tail (reverse order of admission) 2. Free their KV blocks 3. Re-enqueue evicted requests at the front of the wait queue 4. Retry allocation for the original request 5. Circuit breaker: Stop if allocation still fails after exhausting the batch, or if the request itself is unservable
KV Cache Management¶
The KV cache simulates GPU memory organized as fixed-size blocks. Each block holds --block-size-in-tokens tokens (default: 16).
Single-Tier Cache¶
The default KV cache operates entirely in GPU memory:
- Block allocation: Requests are allocated blocks proportional to their token count
- Prefix caching: Hierarchical block hashing enables prefix sharing across requests. Each block's hash chains with the prior block's hash, creating semantic prefix signatures. When a new request shares a prefix with an existing request, the cached blocks are reused rather than reallocated.
- LRU eviction: Free blocks are managed via a doubly-linked list with LRU ordering
- Reference counting: Shared blocks (prefix caching) are reference-counted and exempt from eviction while any request references them
- Transactional allocation: Multi-block allocations are rolled back on failure (no partial allocation)
Conservation invariant (INV-4): allocated_blocks + free_blocks = total_blocks at all times.
Tiered Cache (GPU + CPU)¶
When --kv-cpu-blocks is set to a positive value, BLIS enables a two-tier cache:
- GPU tier: Full KV cache with prefix caching and LRU eviction
- CPU tier: Simple capacity store for offloaded blocks
- Offload trigger: When GPU utilization exceeds
--kv-offload-threshold(default: 0.9), blocks are offloaded to CPU - Reload: On GPU allocation failure, blocks are reloaded from CPU with a transfer latency penalty
- Transfer latency: Per reloaded block:
base_latency + ceil(block_size_tokens / bandwidth). Accumulated across all reloaded blocks. Non-blocking (added to step time). - Thrashing detection: Blocks offloaded and reloaded within 1000 ticks (1ms) increment a thrashing counter
Latency Models¶
BLIS predicts GPU step time through one of four latency model backends. The choice is made via the --latency-model flag or automatically based on available configuration.
Blackbox Model¶
Uses trained regression coefficients to predict step time:
StepTime = beta0 + beta1 * cache_miss_tokens + beta2 * decode_tokens
QueueingTime = alpha0 + alpha1 * input_length
OutputTokenProcessingTime = alpha2
- Beta coefficients model GPU execution time as a linear function of batch composition
- Alpha coefficients model non-GPU overhead (tokenization, API serialization, output processing)
- Coefficients are trained offline via Bayesian optimization against real vLLM measurements
- Pre-trained coefficients for common model/GPU combinations are shipped in
defaults.yaml
See Configuration Reference: Coefficient Calibration for the training process.
Roofline Model (Default)¶
Uses analytical FLOPs/bandwidth estimation when no trained coefficients are available:
Phase Time = max(total_FLOPs / peak_compute, total_bytes / peak_bandwidth)
Step Time = Prefill Phase Time + Decode Phase Time
- Requires HuggingFace
config.json(model architecture: layers, heads, hidden dim) - Requires
hardware_config.json(GPU specs: peak TFLOPS, peak bandwidth, MFU) - Accounts for Tensor Parallelism, All-Reduce latency, and per-layer overheads
- No training data needed — works for any supported model immediately
See Roofline Estimation for implementation details.
Alpha Overhead¶
Alpha overhead models non-GPU processing time:
- Queueing time (alpha0 + alpha1 * input_length): Delays request enqueue but does not block the server. The simulation clock is not advanced by this overhead.
- Output token processing time (alpha2): Added to per-request ITL/TTFT metrics but does not block the next step.
This is architecturally correct for vLLM, where CPU post-processing (tokenization, output serialization) runs concurrently with GPU execution.
Scheduling Policies¶
Scheduling policies control the order in which queued requests are selected for batch formation. They operate per-instance. To add a new scheduling policy, see Extension Recipes.
| Policy | Ordering Rule | Use Case |
|---|---|---|
fcfs |
No reordering (arrival order) | Default, fair |
priority-fcfs |
Priority descending, then arrival ascending | SLO-aware scheduling |
sjf |
Input token count ascending, then arrival ascending | Shortest-job-first for TTFT optimization |
reverse-priority |
Priority ascending (starves high-priority) | Pathological testing only |
Priority Policies¶
Priority policies assign a numeric score to each request before scheduling:
| Policy | Score Computation | Behavior |
|---|---|---|
constant |
Fixed score for all requests | No differentiation (default) |
slo-based |
base + age_weight * (clock - arrival) |
Favors older requests, prevents starvation. Note: despite the name, does not currently use per-request SLO metadata. |
inverted-slo |
base - age_weight * age |
Starves older requests (pathological) |
Metrics¶
BLIS records per-request and aggregate metrics throughout the simulation.
Per-Request Metrics¶
| Metric | Definition |
|---|---|
| TTFT | Time from arrival to first token: includes queueing delay, prefill step times, and output processing overhead (alpha2) |
| E2E | FirstTokenTime + sum(ITLs), where each ITL includes step time + alpha2 |
| ITL | Observed time between consecutive decode steps (includes alpha2 per token) |
| Scheduling Delay | Time from request arrival to entering the running batch (includes alpha queueing overhead + wait queue residence) |
Aggregate Metrics¶
| Metric | Aggregation |
|---|---|
| TTFT, E2E, ITL distributions | Mean, p90, p95, p99 in JSON output. Cluster-internal Distribution type also computes p50, min, max for fitness evaluation. |
| Throughput | Output tokens per second, requests per second |
| Preemption count | Total KV cache evictions |
| KV allocation failures | Failed block allocations |
| Dropped unservable | Requests too large for cache |
Conservation Invariant (INV-1)¶
At simulation end: injected_requests == completed_requests + still_queued + still_running + dropped_unservable
This is the fundamental accounting invariant that ensures no requests are silently lost.
Determinism¶
BLIS guarantees deterministic output: the same seed produces byte-identical stdout across runs (INV-6).
Key mechanisms: - Partitioned RNG: Each subsystem (workload generation, scheduling, etc.) uses an independent random stream derived from the seed, so adding randomness to one subsystem doesn't perturb others - Sorted map iteration: Where map iteration order affects output, keys are sorted first - Stdout/stderr separation: Deterministic results go to stdout; wall-clock timing and diagnostics go to stderr via logrus