Latency Models¶

The LatencyModel interface determines how BLIS estimates GPU step time for each batch iteration. BLIS ships two backends -- trained-physics (default, physics-informed roofline with MoE-aware corrections) and roofline (pure analytical) -- and the pluggable architecture supports adding custom backends.

Migration note: Three legacy backends have been removed (blackbox, crossmodel, trained-roofline). Use --latency-model trained-physics instead, which supersedes all three with improved accuracy and MoE support.

# Trained-physics mode (default) — roofline × architecture-aware basis functions × learned corrections
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500

# Roofline mode — pure analytical estimation from model architecture (explicit flag)
./blis run --model qwen/qwen3-14b \
  --latency-model roofline --hardware H100 --tp 1 \
  --num-instances 4 --rate 100 --num-requests 500

Trained-Physics Mode (Default)¶

Trained-physics mode combines roofline basis functions with learned correction coefficients. It is more accurate out of the box than pure roofline because it captures architecture-specific overheads (MoE routing, memory access patterns) that analytical models miss.

Benefits:

Better generalization across model architectures and TP configurations
Lower MAPE in practice compared to pure roofline
No per-model calibration needed

Use this for capacity planning and what-if analysis unless you specifically need pure analytical estimates.

Roofline Mode¶

Roofline mode computes step time analytically from model architecture (FLOPs, parameter count) and hardware specifications (compute throughput, memory bandwidth). It does not require pre-trained coefficients, making it suitable for new models.
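To make the idea concrete, here is a minimal sketch of a textbook roofline estimate: the step takes as long as the slower of its compute time and its memory-transfer time. This is not the BLIS implementation. The `GPU` struct, its field names, and `rooflineMicros` are illustrative assumptions, and the sketch leaves out the MFU factors (mfuPrefill, mfuDecode) that the bundled hardware config carries.

```go
package main

import (
	"fmt"
	"math"
)

// GPU holds the two hardware numbers a basic roofline estimate needs.
// Field names are illustrative, not the hardware_config.json schema.
type GPU struct {
	PeakTFLOPS float64 // peak compute throughput in TFLOP/s
	PeakTBps   float64 // peak memory bandwidth in TB/s
}

// rooflineMicros returns max(compute time, memory time) in microseconds for
// work that performs `flops` floating-point operations and moves `bytes` bytes.
func rooflineMicros(flops, bytes float64, gpu GPU) float64 {
	computeUs := flops / (gpu.PeakTFLOPS * 1e12) * 1e6
	memoryUs := bytes / (gpu.PeakTBps * 1e12) * 1e6
	return math.Max(computeUs, memoryUs)
}

func main() {
	h100 := GPU{PeakTFLOPS: 989.5, PeakTBps: 3.35}
	// Hypothetical decode step: 14e9 parameters in BF16 (2 bytes each) are read
	// once, with roughly 2 FLOPs per parameter per token for 8 tokens.
	flops := 2 * 14e9 * 8
	bytes := 2 * 14e9
	fmt.Printf("estimated step time: %.0f us\n", rooflineMicros(flops, bytes, h100))
}
```

With these hypothetical inputs the memory term is far larger than the compute term, which is the usual picture for small-batch decode.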

The `--latency-model roofline` Flag¶

The simplest way to use roofline mode:

./blis run --model qwen/qwen3-14b \
  --latency-model roofline --hardware H100 --tp 1

This auto-resolves both required inputs:

Model config -- checks model_configs/ for a cached config.json, fetches from HuggingFace on miss
Hardware config -- uses the bundled hardware_config.json

Supported hardware: The bundled hardware_config.json includes specs for H100 (80 GB HBM3, 989.5 TFLOPS BF16, 3.35 TB/s), A100-SXM (80 GB HBM2e, 312 TFLOPS BF16, 2.04 TB/s), and A100-80 (alias for A100-SXM). To use a different GPU, add an entry to hardware_config.json with the required fields (TFlopsPeak, BwPeakTBs, mfuPrefill, mfuDecode, MemoryGiB) and reference it via --hardware <name>.

Validated models: Any dense or MoE transformer with a HuggingFace config.json works. The following have been validated end-to-end:

Set HF_TOKEN to access gated models (e.g., Llama-2) and avoid rate limits:

export HF_TOKEN=your_token_here
./blis run --model meta-llama/llama-3.1-8b-instruct \
  --latency-model roofline --hardware H100 --tp 1

Manual Configuration¶

For full control, provide configs explicitly:

./blis run --model my-custom-model \
  --model-config-folder ./my-model-configs/ \
  --hardware-config ./my-hardware-config.json \
  --hardware H100 --tp 4

Adding Support for New Models¶

Any model with a HuggingFace config.json can use roofline mode:

Download config.json from HuggingFace
Place it in model_configs/<model-name>/config.json
Run with --latency-model roofline --hardware <GPU> --tp <N>

Or let BLIS fetch it automatically with --latency-model roofline.

Tensor Parallelism and Roofline¶

The --tp flag divides FLOPs and memory traffic across TP ranks:

Higher TP reduces per-GPU step time (more parallelism)
Higher TP reduces KV blocks per GPU (memory split across ranks)

When choosing between TP and replication (more instances), note that TP reduces per-request latency, whereas replication increases throughput. For capacity planning, simulate both configurations.

Automatic KV block calculation

For both latency backends (roofline, trained-physics), --total-kv-blocks is automatically derived from model architecture and GPU memory if not explicitly set. The auto-calculated value accounts for TP (KV heads are sharded across ranks; total GPU memory scales with GPU count). Override with --total-kv-blocks <N> for non-standard deployments. The auto-calculation uses reference constants (90% GPU utilization, standard activation/overhead budgets matching the llm-d-benchmark capacity planner) and requires SwiGLU-family activations.
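The derivation can be pictured with the following simplified sketch. It assumes the usual KV-cache sizing rule (keys and values for every layer, with KV heads sharded across TP ranks) and uses a single placeholder overhead budget. The `KVInputs` struct, its field names, and `totalKVBlocks` are hypothetical, and the real calculation's activation and overhead budgets are not reproduced here.

```go
package main

import "fmt"

// KVInputs lists the quantities this sketch needs. Names are illustrative.
type KVInputs struct {
	Layers       int     // transformer layers
	KVHeads      int     // key/value heads before TP sharding
	HeadDim      int     // dimension of each head
	BytesPerElem int     // 2 for BF16/FP16
	BlockSize    int     // tokens per KV block
	TP           int     // tensor-parallel degree
	GPUMemGiB    float64 // memory per GPU
	WeightGiB    float64 // total model weights, split evenly across ranks
	OverheadGiB  float64 // placeholder per-GPU activation/runtime budget
	Utilization  float64 // fraction of GPU memory usable, e.g. 0.90
}

// totalKVBlocks estimates how many KV blocks fit in the memory left on each
// rank after weights and overhead are subtracted.
func totalKVBlocks(in KVInputs) int {
	const gib = 1 << 30
	free := (in.GPUMemGiB*in.Utilization - in.WeightGiB/float64(in.TP) - in.OverheadGiB) * gib
	if free <= 0 {
		return 0
	}
	// KV heads are sharded across ranks; keep at least one head per rank.
	headsPerRank := in.KVHeads / in.TP
	if headsPerRank < 1 {
		headsPerRank = 1
	}
	// Factor 2 covers keys and values.
	bytesPerToken := 2 * in.Layers * headsPerRank * in.HeadDim * in.BytesPerElem
	bytesPerBlock := bytesPerToken * in.BlockSize
	return int(free / float64(bytesPerBlock))
}

func main() {
	// Illustrative values for a 14B-class dense model on one 80 GiB GPU.
	in := KVInputs{
		Layers: 40, KVHeads: 8, HeadDim: 128, BytesPerElem: 2,
		BlockSize: 16, TP: 1, GPUMemGiB: 80, WeightGiB: 28,
		OverheadGiB: 6, Utilization: 0.90,
	}
	fmt.Println("estimated KV blocks:", totalKVBlocks(in))
}
```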

Automatic MaxModelLen derivation

When using roofline or trained-physics mode and --max-model-len is not explicitly set, BLIS auto-derives it from max_position_embeddings in the HuggingFace config.json. For models with rope_scaling, the scaling factor is applied based on vLLM's blacklist approach: types linear, dynamic, yarn, default, and mrope apply the factor; types su, longrope, and llama3 are excluded (these encode the full context in max_position_embeddings). For yarn, original_max_position_embeddings is used as the base when present. gemma3 models skip rope_scaling entirely (max_position_embeddings is pre-scaled). The derived value is then capped at the KV-feasible maximum (total_kv_blocks * block_size) to prevent context windows from exceeding GPU memory capacity. Override with --max-model-len <N> when needed.
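The rules above can be read as the following sketch. It follows the text literally: an exclusion list for su, longrope, and llama3, a yarn-specific base, a gemma3 bypass, and a final cap at the KV-feasible maximum. The `RopeScaling` struct and `deriveMaxModelLen` are hypothetical names for illustration, not the BLIS source.

```go
package main

import "fmt"

// RopeScaling mirrors the rope_scaling fields this sketch reads.
// OriginalMaxPos is 0 when original_max_position_embeddings is absent.
type RopeScaling struct {
	Type           string
	Factor         float64
	OriginalMaxPos int
}

// deriveMaxModelLen applies the rope-scaling rules and then caps the result
// at totalKVBlocks * blockSize. Pass rs == nil when rope_scaling is absent.
func deriveMaxModelLen(modelType string, maxPos int, rs *RopeScaling, totalKVBlocks, blockSize int) int {
	derived := maxPos
	if rs != nil && rs.Factor > 0 && modelType != "gemma3" {
		switch rs.Type {
		case "su", "longrope", "llama3":
			// Excluded: max_position_embeddings already encodes the full context.
		default:
			base := maxPos
			if rs.Type == "yarn" && rs.OriginalMaxPos > 0 {
				base = rs.OriginalMaxPos
			}
			derived = int(float64(base) * rs.Factor)
		}
	}
	if kvMax := totalKVBlocks * blockSize; derived > kvMax {
		derived = kvMax
	}
	return derived
}

func main() {
	yarn := &RopeScaling{Type: "yarn", Factor: 4, OriginalMaxPos: 32768}
	fmt.Println(deriveMaxModelLen("qwen3", 40960, yarn, 1000000, 16))
}
```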

How Trained-Physics Works¶

Trained-physics mode applies learned correction factors to analytical roofline basis functions, combining the physical grounding of roofline with the accuracy of data-driven fitting. Coefficients are fitted from real vLLM measurements and generalize across model architectures, workloads, and TP configurations.

StepTime formula (10 beta coefficients in bundled defaults):

StepTime = β₁ₐ × T_pf_compute                  # prefill compute only
         + β₁ᵦ × T_pf_kv                       # prefill memory (typically ~0)
         + β₂ₐ × T_dc_compute                  # decode compute (typically ~0)
         + β₂ᵦ × T_dc_kv                       # decode memory only
         + β₃ × T_weight                       # weight loading × correction
         + β₄ × T_tp                           # TP communication × correction
         + β₅ × L                              # per-layer overhead (µs/layer)
         + β₆ × batch_size                     # per-request scheduling (µs/req)
         + β₇                                  # per-step fixed overhead (µs)
         + β₈ × nMoE                           # per-MoE-layer overhead (µs/layer)

The model supports 7-10 beta coefficients. Bundled defaults use 10 coefficients with prefill/decode split.

Beta coefficients:

β₁ₐ (prefill compute, ~0.15): Corrects analytical FlashAttention + MLP FLOP estimates for kernel efficiency, memory access patterns.
β₁ᵦ (prefill memory, ~0): Prefill KV cache write bandwidth correction (typically near zero).
β₂ₐ (decode compute, ~0): Decode compute correction (typically near zero, decode is memory-bound).
β₂ᵦ (decode memory, ~1.9): Corrects KV cache read bandwidth. Primary decode bottleneck.
β₃ (weight loading, ~1.4): Corrects model weight bandwidth for cache effects, prefetching, HBM contention.
β₄ (TP communication, ~0.75): Corrects tensor-parallel All-Reduce overhead.
β₅ (per-layer, ~32 µs/layer): Fixed overhead per transformer layer: kernel launch, CUDA graph, residual connections.
β₆ (per-request, ~4 µs/request): Scheduling overhead per request: queue management, attention mask construction.
β₇ (per-step, ~126 µs/step): Fixed overhead per step: CUDA synchronization, sampler invocation.
β₈ (MoE-layer, ~482 µs/layer): Per-MoE-layer overhead for router gating, token permutation. Architecture-aware: applies only to interleaved MoE architectures (InterleaveMoELayerStep > 0). Zero for uniform MoE and dense models.
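As a worked illustration of the formula, the sketch below evaluates StepTime as a plain weighted sum. The coefficient values are the approximate figures quoted above, not the exact bundled defaults, and the basis-function inputs are made up for the example. The `Betas` and `Basis` types and `stepTimeMicros` are hypothetical names.

```go
package main

import "fmt"

// Basis holds the roofline basis-function values for one step, in microseconds.
type Basis struct {
	PrefillCompute, PrefillKV, DecodeCompute, DecodeKV, Weight, TP float64
}

// Betas holds the ten coefficients of the StepTime formula.
type Betas struct {
	B1a, B1b, B2a, B2b, B3, B4 float64 // dimensionless corrections
	B5, B6, B7, B8             float64 // us per layer, request, step, MoE layer
}

// stepTimeMicros evaluates the StepTime formula term by term.
func stepTimeMicros(b Betas, x Basis, layers, batchSize, nMoE int) float64 {
	return b.B1a*x.PrefillCompute +
		b.B1b*x.PrefillKV +
		b.B2a*x.DecodeCompute +
		b.B2b*x.DecodeKV +
		b.B3*x.Weight +
		b.B4*x.TP +
		b.B5*float64(layers) +
		b.B6*float64(batchSize) +
		b.B7 +
		b.B8*float64(nMoE)
}

func main() {
	// Approximate coefficients from the list above; near-zero terms left at 0.
	approx := Betas{B1a: 0.15, B2b: 1.9, B3: 1.4, B4: 0.75, B5: 32, B6: 4, B7: 126, B8: 482}
	// Hypothetical decode-only step on a dense 40-layer model, TP=1, batch of 8.
	x := Basis{DecodeKV: 500, Weight: 8000}
	fmt.Printf("StepTime: %.0f us\n", stepTimeMicros(approx, x, 40, 8, 0))
}
```

For these inputs the terms are 1.9 × 500 = 950, 1.4 × 8000 = 11,200, 32 × 40 = 1,280, 4 × 8 = 32, and 126, for a total of about 13,588 µs. In this made-up example the corrected weight-loading term accounts for most of the step.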

Alpha coefficients (3 terms, API/framework overheads in µs):

α₀ (QueueingTime, ~15,563 µs): Fixed per-request API processing (HTTP parsing, request validation, queue insertion).
α₁ (PostDecodeFixedOverhead, ~777 µs): Fixed per-request post-decode overhead (detokenization setup, finish reason determination).
α₂ (OutputTokenProcessingTime, ~46 µs/token): Per-output-token overhead (streaming token transmission, incremental detokenization).

Pre-trained coefficients are stored in trained_physics_coefficients in defaults.yaml. No per-model calibration needed -- the model generalizes across architectures, workloads, and TP configurations.
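For a rough sense of scale, the sketch below adds up the three alpha terms for a single request. It assumes the terms simply accumulate over the request's lifetime (α₀ once, α₁ once, α₂ per output token), which is one plausible reading of the definitions above rather than a documented formula. `Alphas` and `requestOverheadMicros` are hypothetical names.

```go
package main

import "fmt"

// Alphas holds the three API/framework overhead coefficients in microseconds.
type Alphas struct {
	QueueingUs       float64 // alpha0: fixed per-request API processing
	PostDecodeUs     float64 // alpha1: fixed per-request post-decode overhead
	PerOutputTokenUs float64 // alpha2: per-output-token overhead
}

// requestOverheadMicros sums the overheads for one request, assuming the
// three terms are additive (an assumption of this sketch).
func requestOverheadMicros(a Alphas, outputTokens int) float64 {
	return a.QueueingUs + a.PostDecodeUs + a.PerOutputTokenUs*float64(outputTokens)
}

func main() {
	approx := Alphas{QueueingUs: 15563, PostDecodeUs: 777, PerOutputTokenUs: 46}
	fmt.Printf("overhead for 100 output tokens: %.0f us\n", requestOverheadMicros(approx, 100))
}
```

Under that assumption, a request producing 100 output tokens carries 15,563 + 777 + 46 × 100 = 20,940 µs of framework overhead on top of GPU step time.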

Generalization Scope¶

The trained-physics model is designed to generalize without per-model calibration:

Supported hardware:

H100 (80 GB HBM3, 989.5 TFLOPS BF16 / 1979 TFLOPS FP8, 3.35 TB/s)
A100-SXM (80 GB HBM2e, 312 TFLOPS BF16, 2.04 TB/s)
A100-80 (alias for A100-SXM)
L40S (48 GB GDDR6, 362 TFLOPS BF16 / 1466 TFLOPS FP8, 0.864 TB/s)

Coefficients were trained on H100 traces but the roofline basis functions automatically scale to each GPU's compute/bandwidth specifications via hardware config. This enables the model to generalize across hardware without GPU-specific calibration.

Model architectures:

Dense transformers (Llama-2, Qwen3, GPT, etc.): Standard attention + MLP layers
Uniform MoE (Mixtral): All layers are MoE with top-k expert routing
Interleaved MoE (Scout): Alternating MoE and dense layers with architecture-specific β₈ overhead

The model automatically detects MoE configuration from config.json (num_local_experts, num_experts_per_tok, interleave_moe_layer_step) and adjusts basis functions accordingly.

Workload types:

Prefill-heavy (large input, short output): Chatbot prompts, document Q&A
Decode-heavy (small input, long output): Content generation, code completion
Mixed batches (concurrent prefill/decode): Production serving with heterogeneous requests

TP configurations:

TP=1, TP=2, TP=4, TP=8 (All-Reduce overhead scales via β₄)

Why trained-physics over roofline:

Trained-physics uses 13 coefficients (10 beta: prefill compute/memory split, decode compute/memory split, weight, TP, layer overhead, batch overhead, step overhead, MoE overhead; 3 alpha: queueing, post-decode, per-token), whereas pure roofline applies no learned corrections. The prefill/decode split (β₁ₐ/β₁ᵦ, β₂ₐ/β₂ᵦ) and the MoE-specific overhead (β₈) improve generalization to unseen model architectures (especially interleaved MoE) and to batch compositions with mixed prefill/decode.

MoE architecture detection

β₈ applies conditionally based on InterleaveMoELayerStep from the model's config.json: 0 = uniform MoE (β₈ skipped), 1 = alternating MoE/dense (β₈ × 24 layers for Scout's 48 total), 2 = every 3rd layer is MoE, etc. This prevents over-penalizing uniform MoE models like Mixtral where expert routing overhead is amortized across all layers.
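The mapping described above can be sketched as a small helper that returns how many layers the β₈ term multiplies. It follows the text's reading of InterleaveMoELayerStep (a step of s means one MoE layer in every s+1 layers). The function name is hypothetical, and rounding down when the layer count is not evenly divisible is an assumption of this sketch.

```go
package main

import "fmt"

// moeOverheadLayers returns the number of layers the per-MoE-layer overhead
// applies to, given the total layer count and the interleave step.
func moeOverheadLayers(totalLayers, interleaveStep int) int {
	if interleaveStep <= 0 {
		// Uniform MoE or dense model: the overhead term is skipped.
		return 0
	}
	// One MoE layer in every (interleaveStep + 1) layers, rounded down.
	return totalLayers / (interleaveStep + 1)
}

func main() {
	fmt.Println(moeOverheadLayers(48, 1)) // alternating MoE/dense
	fmt.Println(moeOverheadLayers(48, 2)) // every 3rd layer is MoE
	fmt.Println(moeOverheadLayers(32, 0)) // uniform MoE: term skipped
}
```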

When to Use Which¶

Aspect	Roofline	Trained-Physics (default)
When to use	Quick analytical estimate	Default (generalizes across architectures, workloads, TP)
Data required	HF `config.json` + `--hardware` + `--tp`	HF `config.json` + `--hardware` + `--tp` (global coefficients bundled)
GPU step time accuracy	Good (analytical)	Better (13 global params, physics-informed basis functions)
MoE support	Yes (per-expert FLOPs + effective expert count)	Yes (per-expert FLOPs + effective expert count + β₈ per-MoE-layer overhead)
Alpha model	α₀ + α₁·inputLen (constant + per-token queueing)	α₀ (constant), α₁ (post-decode fixed), α₂ (per-token)
PostDecodeFixedOverhead	0	α₁ (~777 µs)

Choosing the right mode

Trained-physics is the default for any model with a HuggingFace config.json: it generalizes across architectures, workloads, and TP configurations without per-model calibration. Use roofline when you want pure analytical estimates with no learned corrections.

Current limitations

All analytical latency models support tensor parallelism (TP). Scheduling overheads for data parallelism (DP) and expert parallelism (EP) are not yet modeled. Quantized weight precision (GPTQ, AWQ, FP8, compressed-tensors) is auto-detected from quantization_config, model name conventions (e.g., w4a16, FP8), or torch_dtype fallback, and is used for weight bandwidth and KV capacity calculations. MFU calibration values are still derived from FP16/BF16 measurements.

Pluggable Architecture¶

The LatencyModel interface (defined in sim/latency_model.go) has four methods:

Method	Purpose
`StepTime(batch)`	Duration of one batch step given the running batch
`QueueingTime(req)`	Arrival-to-queue delay for a request
`OutputTokenProcessingTime()`	Per-token post-processing time
`PostDecodeFixedOverhead()`	Fixed per-request overhead at completion (0 for roofline, non-zero for trained-physics)

All time estimates are in microseconds (ticks).

New backends register via the NewLatencyModelFunc variable in sim/latency_model.go. The sim/latency/register.go file uses init() to wire the factory, breaking the import cycle between sim/ (interface owner) and sim/latency/ (implementation). To add a custom backend, implement the four methods and register your factory via init() in a sub-package. See Extension Recipes for a step-by-step guide.
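The shape of a custom backend might look like the self-contained sketch below. Everything in it is a stand-in: the `Request` and `Batch` types, the `int64` microsecond return values, and the `newLatencyModel` variable are assumptions made so the example compiles on its own. The real types and the signature of NewLatencyModelFunc are defined in sim/latency_model.go and may differ.

```go
package main

import "fmt"

// Stand-in types for this sketch only; the real ones live in the sim package.
type Request struct{ InputTokens, OutputTokens int }
type Batch struct{ Requests []Request }

// LatencyModel lists the four methods named in the table above.
// Return values are microseconds; the concrete types are assumed.
type LatencyModel interface {
	StepTime(batch Batch) int64
	QueueingTime(req Request) int64
	OutputTokenProcessingTime() int64
	PostDecodeFixedOverhead() int64
}

// newLatencyModel is a hypothetical factory hook standing in for the
// NewLatencyModelFunc variable.
var newLatencyModel func() LatencyModel

// constantModel is a toy backend: a fixed cost per step plus a cost per request.
type constantModel struct{ perStepUs, perReqUs int64 }

func (m constantModel) StepTime(b Batch) int64 {
	return m.perStepUs + m.perReqUs*int64(len(b.Requests))
}
func (m constantModel) QueueingTime(Request) int64       { return 0 }
func (m constantModel) OutputTokenProcessingTime() int64 { return 0 }
func (m constantModel) PostDecodeFixedOverhead() int64   { return 0 }

// init wires the factory at package load time, so the interface owner never
// has to import the implementation package.
func init() {
	newLatencyModel = func() LatencyModel {
		return constantModel{perStepUs: 1000, perReqUs: 10}
	}
}

func main() {
	m := newLatencyModel()
	fmt.Println(m.StepTime(Batch{Requests: make([]Request, 4)}), "us")
}
```

Assigning the factory inside init() keeps the dependency pointing one way: the implementation package imports the interface package, and the interface package only ever sees a function value.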