The Physics of High-Fidelity Distributed Inference Platform Simulation¶
Production LLM inference platforms are distributed systems where routing policies, admission control, autoscaling, and engine-level scheduling all interact to determine latencies and throughput. How do you explore how different policies and configurations affect these KPIs before deploying to production? Testing a new routing policy or autoscaling threshold on live traffic risks cascading bugs across the fleet, while building separate test environments burns GPU-hours and still cannot predict interactions between cluster-level policies and engine-level batch dynamics.
The answer is end-to-end simulation: model the entire distributed inference stack to explore how policies and configurations affect latencies and throughput for your workloads. What does it take to build a simulator accurate enough to guide these decisions? The challenge lies in capturing the right mechanisms. At the engine level, batches process together — all requests wait for the slowest operation to finish, so KV cache fills trigger preemptions and long prompts stall short decodes. At the cluster level, routing policies operate on stale cache state, admission control gates overload, and prefill/decode disaggregation trades utilization for latency. At the control plane, autoscalers react to lagged metrics, creating oscillations. When these couplings are not modeled, predictions diverge: a back-of-the-envelope model might predict 50ms time-to-first-token while production measures 200ms.
Building Fidelity from First Principles¶
BLIS (Blackbox Inference Simulator) models inference serving through discrete-event simulation, advancing from event to event rather than stepping through continuous time. This approach runs orders of magnitude faster than real-time, requires no GPUs, and evaluates hours of production traffic in seconds.
BLIS uses discrete-event simulation to model the full stack. This full-stack fidelity enables capacity planning (instance count, GPU type, TP degree) and configuration search (routing weights, admission thresholds). Without modeling distributed system couplings, planners predict linear scaling where production saturates, miss SLO violations from routing pile-on, or deploy autoscalers that oscillate.
By modeling the behavior of production systems at the server (vLLM) and platform (llm-d) layers, BLIS enables safe experimentation before deployment:
- Routing policies — Test new scorer combinations and weights
- Admission control — Explore saturation thresholds and flow control strategies
- Capacity planning — Compare model/GPU/TP configurations
- Workload analysis — Test how switching from TP=2 to TP=4 affects tail latency under production traffic patterns
This article walks through what it takes to build that level of fidelity — from token batching physics to distributed orchestration, by following a request's end-to-end journey through the system to see where every millisecond of complexity originates.
A Request's Journey: The Hidden Complexity¶
A user hits enter, and 50ms later the first token appears. What happened in between? Three architectural layers working together: the inference engine (vLLM), the data plane (cluster orchestration), and the control plane (autoscaling), all of which high-fidelity simulation must model.
flowchart TB
Request([Request])
subgraph Layer2["Layer 2: Data Plane"]
Admit[Admission]
Route[Routing]
PD{P/D Split?}
Admit --> Route
Route --> PD
end
subgraph Layer1["Layer 1: Engine"]
Sched[Scheduling]
KV[KV Cache]
Batch[Batch Formation]
Step[Forward Pass]
Sched --> KV --> Batch --> Step
end
subgraph Disagg["Prefill/Decode Disaggregation"]
PF[Prefill Pool]
Dec[Decode Pool]
PF -->|KV Transfer| Dec
end
Response([Response])
subgraph Layer3["Layer 3: Control Plane"]
Monitor[Monitor Metrics]
Scale[Autoscaling Decisions]
Monitor --> Scale
end
Request --> Admit
PD -->|Aggregate| Sched
PD -->|Disaggregated| PF
Step --> Response
Dec --> Response
Scale -.->|add/remove instances| Admit
Layer 1: The Engine (vLLM)¶
TL;DR: Batched execution couples requests together - a heavy prompt in the batch slows down fast decodes running alongside it. BLIS models the full vLLM pipeline (continuous batching, request scheduling and preemption, KV cache pressure, chunked prefill) and predicts forward pass timing using a generalizable model that runs on CPUs without needing real GPUs.
The inference engine does not process requests individually. It processes them in continuously evolving batches. A step is one GPU forward pass that advances every request in the batch, either processing prompt tokens (prefill) or generating the next output token (decode). The slowest operation determines when the step completes.
Why does this matter? Consider a batch with three requests decoding single tokens (fast, memory-bound) and one request processing a 512-token prompt (slow, compute-bound). Everyone waits for the slowest. This is not an edge case — batch composition constantly shifts as new requests arrive and completed ones leave.
What BLIS captures. vLLM's complexity: continuous batching (requests join and leave mid-flight), mixed prefill-decode execution, block-level KV cache management (prefix reuse, preemption, CPU offloading), and chunked prefill. BLIS models these mechanisms because they determine when requests complete.
How BLIS predicts step time without GPUs. BLIS combines physics-based latency models with learned corrections:
flowchart LR
subgraph Inputs["📊 <b>Inputs</b>"]
Batch["Batch State<br/>(size, sequences)"]
LLM["Model Architecture<br/>(layers, params)"]
HW["Hardware Specs<br/>(TFLOPs, bandwidth)"]
end
subgraph Physics["⚡ <b>Physics-Based Latency Model</b>"]
Compute["Compute Bound<br/>Compute Operations / GPU Speed"]
Memory["Memory Bound<br/>Bytes Transferred / GPU Bandwidth"]
end
subgraph Learned["🎯 <b>Learned Corrections</b>"]
Coeffs["Trained on<br/>Real vLLM Traces"]
end
Batch --> Physics
LLM --> Physics
HW --> Physics
Compute --> StepTime["⏱️ Predicted Step Time"]
Memory --> StepTime
Coeffs --> StepTime
This approach is intended to generalize across LLM architectures, hardware configurations, and tensor parallelism degrees, enabling seamless experimentation with any model-GPU-TP combination without per-configuration calibration. Accurate forward pass predictions drive accurate end-to-end latency metrics.
Layer 2: The Data Plane (Cluster Orchestration)¶
TL;DR: Production clusters run multiple vLLM instances behind a routing gateway. BLIS models saturation-based admission control, composable weighted routing with in-flight tracking, configurable cache signal staleness, and prefill/decode disaggregation. Pluggable interfaces enable algorithm discovery — test new serving policies without writing production code.
flowchart TB
A[Request Arrives]
B{Admission Control}
C[Routing]
D{Disaggregation?}
E[Aggregated vLLM Processing]
F[Prefill Pool]
G[Decode Pool]
X[Drop]
A --> B
B -- Rejected --> X
B -- Admitted --> C
C --> D
D -- No --> E
D -- Yes --> F
F -- KV Transfer --> G
Admission control determines whether requests enter the system. BLIS models saturation-based admit/reject decisions, the default behavior in llm-d: when cluster load exceeds thresholds, incoming requests are rejected or queued rather than overwhelming instances. This prevents queue explosion during traffic spikes and avoids pile-on where burst arrivals flood the same "best" instance.
Routing assigns each request to an instance by scoring on weighted signals - prefix cache hits, queue depth, KV utilization. The challenge: burst arrivals cause all routing decisions to see the same stale state and pick the same "best" instance. BLIS models in-flight tracking (counting already-dispatched requests) and signal staleness (cache state queries a 2-second-old snapshot, matching llm-d's ZMQ propagation delay).
Prefill/decode disaggregation separates compute-bound prefill from memory-bound decode onto dedicated GPU pools, allowing each to be sized for its bottleneck. Requests process prefill first, then transfer their KV cache over the network to a decode instance. BLIS models the full pipeline: prefill routing, KV transfer, decode routing, and fair-share bandwidth contention when multiple transfers run concurrently.
Layer 3: The Control Plane (Autoscaling)¶
TL;DR: Real autoscaling experiments are expensive—feedback loops spanning minutes (HPA scrapes, pod scheduling, VM provisioning, model loading) require 30+ minutes and 10+ GPU replicas per test. BLIS models llm-d's WVA (Workload Variant Autoscaler) four-stage pipeline with pluggable Collector/Analyzer/Optimizer/Actuator interfaces, compressing experiments to seconds on a laptop.
Autoscaling dynamically adjusts instance count to match demand. In production, this happens through feedback loops where HPA scrapes, Kubernetes scheduling, VM provisioning, and model loading all add latency before a new replica serves traffic.
What BLIS captures. BLIS models llm-d's WVA (Workload Variant Autoscaler) four-stage pipeline — Collect, Analyze, Optimize, Actuate, with pluggable interfaces. Collector observes per-replica metrics, Analyzer detects saturation and emits scaling signals, Optimizer decides which GPU types to add/remove respecting multi-model inventory constraints, and Actuator applies decisions with configurable delay.
What BLIS enables. Researchers can sweep scaling thresholds, compare analyzer strategies under identical workloads, and test multi-model scenarios where scaling one model steals GPUs from another. Each pluggable interface becomes a research hook: swap in a cost-aware Optimizer or an Analyzer that predicts load spikes. The result: discover and validate better autoscaling policies before deployment, with full control over feedback delays and provisioning latencies.
BLIS in Action: Simulating a Configuration Decision¶
Consider a configuration decision: you are deploying Qwen3-14B for chatbot workloads at 50 req/s with 8 instances. Does routing policy matter? What about hardware choice?
Testing this in production means provisioning separate GPU pools, running 30+ minutes of traffic per setup and burning GPU-hours to discover the answer. With BLIS, you can simulate these configurations in seconds on a laptop:
# Install and build BLIS
git clone https://github.com/inference-sim/inference-sim.git
go build -o blis main.go
# H100 with round-robin routing
./blis run --model qwen/qwen3-14b --workload chatbot --rate 50 \
--num-instances 8 --tp 2 --hardware H100 --routing-policy round-robin
# H100 with prefix-aware routing
./blis run --model qwen/qwen3-14b --workload chatbot --rate 50 \
--num-instances 8 --tp 2 --hardware H100 --routing-policy weighted \
--routing-scorers "prefix-affinity:2,queue-depth:1"
# A100-80 with prefix-aware routing
./blis run --model qwen/qwen3-14b --workload chatbot --rate 50 \
--num-instances 8 --tp 2 --hardware A100-80 --routing-policy weighted \
--routing-scorers "prefix-affinity:2,queue-depth:1"
Simulated Results:
| Configuration | Predicted P99 TTFT | Key Finding |
|---|---|---|
| H100 (round-robin) | 12.1ms | Baseline with naive routing |
| H100 (prefix-aware) | 11.3ms | 7% improvement from KV cache reuse |
| A100-80 (prefix-aware) | 45.8ms | 4× slower than H100 — hardware choice dominates |
What the simulation predicts: Prefix-aware routing delivers measurable gains on H100, but hardware choice has far greater impact. These simulated predictions guide configuration decisions without provisioning real GPUs.
From Modeling to Validation¶
We have covered what it takes to build a high-fidelity distributed platform simulator: modeling engine physics, data plane coordination, and control plane feedback loops. But how do we know this modeling is accurate?
We have validated BLIS against production workloads and compared its accuracy to commercial simulators. The methodology and results — cross-system benchmarks and achievable accuracy without per-configuration tuning — are covered in a subsequent article.
Next in this series: Validating Against Ground Truth — Quantifying BLIS accuracy on real workloads, how validation catches regressions, and the methodology behind single-digit percent error.