# BLIS — Blackbox Inference Simulator
A discrete-event simulator for LLM inference serving systems. BLIS models multi-instance clusters with configurable admission control, request routing, KV-cache dynamics (including tiered GPU+CPU offloading), scheduling policies, and token generation — all driven by pluggable latency models (data-driven coefficients, analytical roofline, or custom backends).
The simulator is CPU-only, deterministic, and designed for capacity planning, policy optimization research, and performance prediction across model/GPU/TP configurations without requiring real GPUs.
## Quick Start

```sh
git clone https://github.com/inference-sim/inference-sim.git
cd inference-sim
go build -o blis main.go
./blis run --model qwen/qwen3-14b
```
## Key Features
- Discrete-event simulation for prefill, decode, and request scheduling
- Deterministic execution — same seed produces byte-identical output across runs
- KV-cache modeling with prefix caching and tiered GPU+CPU offload
- Chunked prefill and preemption-aware batch formation
- Pluggable latency models — roofline (default, analytical FLOPs/bandwidth), blackbox (data-driven coefficients), cross-model (physics-informed, MoE-aware), and trained-roofline (roofline × learned corrections), with an extensible interface for custom backends
- Multi-instance cluster simulation with shared-clock event loop
- Pluggable routing policies — round-robin, least-loaded, and composable weighted-scoring with prefix-affinity, queue-depth, and KV-utilization scorers
- Admission control, priority policies, and instance schedulers — each a pluggable policy axis
- Canonical workload specification — multi-client YAML DSL with Poisson/Gamma/Weibull/constant arrival processes, 5 distribution types, SLO classes (critical/standard/sheddable/batch/background), prefix groups, cohort dynamics, multimodal and multi-turn reasoning workloads, and composable specs via `blis compose`
- Rich metrics pipeline — per-request, per-instance, and cluster-level metrics including TTFT/ITL/E2E distributions, KV-cache diagnostics, anomaly detection (priority inversions, HOL blocking), SLO attainment, Jain fairness index, and multi-objective fitness evaluation
- Decision tracing and counterfactual analysis with top-k regret computation
- Hypothesis experimentation framework for rigorous, reproducible experiments
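As a concrete example of one metric listed above: Jain's fairness index over per-client allocations x₁…xₙ is (Σxᵢ)² / (n·Σxᵢ²). A minimal standalone Go sketch (not BLIS's actual implementation):

```go
package main

import "fmt"

// jainIndex computes Jain's fairness index over non-negative
// allocations (e.g. per-client throughput). It ranges from 1/n
// (one client receives everything) to 1 (perfectly fair).
func jainIndex(xs []float64) float64 {
	var sum, sumSq float64
	for _, x := range xs {
		sum += x
		sumSq += x * x
	}
	if sumSq == 0 {
		return 0 // undefined for all-zero input; report 0 here
	}
	n := float64(len(xs))
	return sum * sum / (n * sumSq)
}

func main() {
	fmt.Println(jainIndex([]float64{10, 10, 10, 10})) // perfectly fair: 1
	fmt.Println(jainIndex([]float64{40, 0, 0, 0}))    // maximally unfair: 0.25
}
```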
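To illustrate the composable weighted-scoring routing idea, here is a hypothetical Go sketch: the `instance`, `scorer`, and `weightedRouter` names are invented for illustration and do not reflect BLIS's real API. Each scorer maps an instance to a score in [0, 1]; the router picks the instance with the highest weighted sum.

```go
package main

import "fmt"

// Hypothetical shapes for illustration — not BLIS's real types.
type instance struct {
	name       string
	queueDepth int
	kvUtil     float64 // fraction of KV cache in use, 0..1
}

// scorer returns a score in [0, 1]; higher is better.
type scorer func(instance) float64

// weightedRouter returns the instance with the highest weighted score sum.
func weightedRouter(insts []instance, scorers []scorer, weights []float64) instance {
	best, bestScore := insts[0], -1.0
	for _, in := range insts {
		s := 0.0
		for i, sc := range scorers {
			s += weights[i] * sc(in)
		}
		if s > bestScore {
			best, bestScore = in, s
		}
	}
	return best
}

func main() {
	insts := []instance{
		{"a", 8, 0.9}, // deep queue, nearly full KV cache
		{"b", 2, 0.4}, // shallow queue, plenty of KV headroom
	}
	scorers := []scorer{
		func(in instance) float64 { return 1.0 / float64(1+in.queueDepth) }, // queue-depth scorer
		func(in instance) float64 { return 1.0 - in.kvUtil },                // KV-utilization scorer
	}
	fmt.Println(weightedRouter(insts, scorers, []float64{0.5, 0.5}).name) // prints "b"
}
```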
## Architecture Overview

```
Request Arrival → Admission → Routing → WaitQueue → Batch Formation → Step Execution → Completion
                                                           ↓                  ↓
                                                     KV Allocation    Latency Estimation
```
Admission and Routing apply only in cluster mode (multi-instance); in single-instance mode, requests go directly to the WaitQueue.
## Documentation Guide
| Section | What You'll Find |
|---|---|
| Getting Started | What is BLIS, installation, quick start, capacity planning tutorial |
| Concepts | System architecture, core engine, glossary, roofline estimation |
| User Guide | Task-oriented guides: routing, admission, scheduling, latency models, KV cache, workloads, cluster, metrics, experimentation |
| Reference | Configuration reference, supported models, workload spec schema |
| Contributing | Extension recipes, PR workflow, standards, templates |
## Reading Order for Newcomers

1. What is BLIS? — understand the problem BLIS solves
2. Quick Start — run your first simulation
3. Tutorial: Capacity Planning — end-to-end walkthrough
4. Glossary — learn BLIS-specific terminology
5. User Guide — task-oriented how-to guides
## License
This project is licensed under the Apache License, Version 2.0. See LICENSE for details.