# BLIS — Blackbox Inference Simulator
A discrete-event simulator for LLM inference serving systems. BLIS models multi-instance clusters with configurable admission control, request routing, KV-cache dynamics (including tiered GPU+CPU offloading), scheduling policies, and token generation — all driven by pluggable latency models (data-driven coefficients, analytical roofline, or custom backends).
The simulator is CPU-only, deterministic, and designed for capacity planning, policy optimization research, and performance prediction across model/GPU/TP configurations without requiring real GPUs.
## Quick Start

```sh
git clone https://github.com/inference-sim/inference-sim.git
cd inference-sim
go build -o blis main.go
./blis run --model qwen/qwen3-14b
```
## Key Features
- Discrete-event simulation for prefill, decode, and request scheduling
- Deterministic execution — same seed produces byte-identical output across runs
- KV-cache modeling with prefix caching and tiered GPU+CPU offload
- Chunked prefill and preemption-aware batch formation
- Pluggable latency models — roofline (default, analytical FLOPs/bandwidth), blackbox (data-driven coefficients), cross-model (physics-informed, MoE-aware), and trained-roofline (roofline × learned corrections), with an extensible interface for custom backends
- Multi-instance cluster simulation with shared-clock event loop
- Pluggable routing policies — round-robin, least-loaded, and composable weighted-scoring with prefix-affinity, queue-depth, and KV-utilization scorers
- Admission control, priority policies, and instance schedulers — each a pluggable policy axis
- Canonical workload specification — multi-client YAML DSL with Poisson/Gamma/Weibull/constant arrival processes, 5 distribution types, SLO classes (critical/standard/sheddable/batch/background), prefix groups, cohort dynamics, multimodal and multi-turn reasoning workloads, and composable specs via `blis compose`
- Rich metrics pipeline — per-request, per-instance, and cluster-level metrics including TTFT/ITL/E2E distributions, KV-cache diagnostics, anomaly detection (priority inversions, HOL blocking), SLO attainment, Jain fairness index, and multi-objective fitness evaluation
- Decision tracing and counterfactual analysis with top-k regret computation
- Hypothesis experimentation framework for rigorous, reproducible experiments
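As a concrete example of one metric listed above: Jain's fairness index over per-client allocations x₁…xₙ is (Σxᵢ)² / (n·Σxᵢ²). A minimal standalone Go sketch (not BLIS's actual implementation):

```go
package main

import "fmt"

// jainIndex computes Jain's fairness index over non-negative
// allocations (e.g. per-client throughput). It ranges from 1/n
// (one client receives everything) to 1 (perfectly fair).
func jainIndex(xs []float64) float64 {
	var sum, sumSq float64
	for _, x := range xs {
		sum += x
		sumSq += x * x
	}
	if sumSq == 0 {
		return 0 // undefined for all-zero input; report 0 here
	}
	n := float64(len(xs))
	return sum * sum / (n * sumSq)
}

func main() {
	fmt.Println(jainIndex([]float64{10, 10, 10, 10})) // perfectly fair: 1
	fmt.Println(jainIndex([]float64{40, 0, 0, 0}))    // maximally unfair: 0.25
}
```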
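To illustrate the composable weighted-scoring routing idea, here is a hypothetical Go sketch: the `instance`, `scorer`, and `weightedRouter` names are invented for illustration and do not reflect BLIS's real API. Each scorer maps an instance to a score in [0, 1]; the router picks the instance with the highest weighted sum.

```go
package main

import "fmt"

// Hypothetical shapes for illustration — not BLIS's real types.
type instance struct {
	name       string
	queueDepth int
	kvUtil     float64 // fraction of KV cache in use, 0..1
}

// scorer returns a score in [0, 1]; higher is better.
type scorer func(instance) float64

// weightedRouter returns the instance with the highest weighted score sum.
func weightedRouter(insts []instance, scorers []scorer, weights []float64) instance {
	best, bestScore := insts[0], -1.0
	for _, in := range insts {
		s := 0.0
		for i, sc := range scorers {
			s += weights[i] * sc(in)
		}
		if s > bestScore {
			best, bestScore = in, s
		}
	}
	return best
}

func main() {
	insts := []instance{
		{"a", 8, 0.9}, // deep queue, nearly full KV cache
		{"b", 2, 0.4}, // shallow queue, plenty of KV headroom
	}
	scorers := []scorer{
		func(in instance) float64 { return 1.0 / float64(1+in.queueDepth) }, // queue-depth scorer
		func(in instance) float64 { return 1.0 - in.kvUtil },                // KV-utilization scorer
	}
	fmt.Println(weightedRouter(insts, scorers, []float64{0.5, 0.5}).name) // prints "b"
}
```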
## Architecture Overview

```
Request Arrival → Admission → Routing → WaitQueue → Batch Formation → Step Execution → Completion
                                                           ↓                  ↓
                                                     KV Allocation    Latency Estimation
```
Admission and Routing apply only in cluster mode (multi-instance); in single-instance mode, requests go directly to the WaitQueue.
## Documentation Guide
| Section | What You'll Find |
|---|---|
| Getting Started | What is BLIS, installation, quick start, capacity planning tutorial |
| Concepts | System architecture, core engine, glossary, roofline estimation |
| User Guide | Task-oriented guides: routing, admission, scheduling, latency models, KV cache, workloads, cluster, metrics, experimentation |
| Reference | Configuration reference, supported models, workload spec schema |
| Contributing | Extension recipes, PR workflow, standards, templates |
## Reading Order for Newcomers

1. What is BLIS? — understand the problem BLIS solves
2. Quick Start — run your first simulation
3. Tutorial: Capacity Planning — end-to-end walkthrough
4. Glossary — learn BLIS-specific terminology
5. User Guide — task-oriented how-to guides
## License
This project is licensed under the Apache License, Version 2.0. See LICENSE for details.