Latency Models¶
The LatencyModel interface determines how BLIS estimates GPU step time for each batch iteration. BLIS ships four backends -- roofline (default, analytical), blackbox (data-driven), cross-model (physics-informed), and trained-roofline (roofline × learned corrections) -- and the pluggable architecture supports adding custom backends.
```bash
# Roofline mode (default) — analytical estimation from model architecture
./blis run --model qwen/qwen3-14b \
    --num-instances 4 --rate 100 --num-requests 500

# Blackbox mode — uses pre-trained per-model coefficients
./blis run --model qwen/qwen3-14b \
    --latency-model blackbox \
    --num-instances 4 --rate 100 --num-requests 500

# Cross-model mode — physics-informed with hand-engineered features
./blis run --model qwen/qwen3-14b \
    --latency-model crossmodel --hardware H100 --tp 1 \
    --num-instances 4 --rate 100 --num-requests 500

# Trained-roofline mode — roofline basis functions × learned corrections (7% MAPE)
./blis run --model qwen/qwen3-14b \
    --latency-model trained-roofline --hardware H100 --tp 1 \
    --num-instances 4 --rate 100 --num-requests 500
```
Blackbox Mode¶
Blackbox mode uses trained regression coefficients from defaults.yaml, fit offline via Bayesian optimization against real vLLM measurements.
Beta coefficients [beta0, beta1, beta2] estimate GPU step time:
- `beta0` -- fixed per-step overhead (microseconds)
- `beta1` -- cost per prefill token (cache miss)
- `beta2` -- cost per decode token
Alpha coefficients [alpha0, alpha1, alpha2] estimate CPU-side overhead (the mode comparison table below shows how each backend uses them).
All alpha and beta coefficients must be non-negative. Negative values are rejected at construction time (INV-5: causality). Pre-trained coefficient sets exist in defaults.yaml for common model/GPU/TP combinations (e.g., qwen/qwen3-14b on H100 with TP=1).
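As a sketch of how the beta coefficients plausibly combine (the exact formula lives in BLIS itself; the coefficient values below are made up for illustration):

```go
package main

import "fmt"

// Illustrative sketch only: the coefficient roles described above
// suggest a linear step-time model of this shape. The real formula
// and fitted values live in BLIS / defaults.yaml, not here.
func stepTimeMicros(beta [3]float64, prefillTokens, decodeTokens int) float64 {
	return beta[0] + // beta0: fixed per-step overhead (µs)
		beta[1]*float64(prefillTokens) + // beta1: per prefill token (cache miss)
		beta[2]*float64(decodeTokens) // beta2: per decode token
}

func main() {
	beta := [3]float64{250, 1.0, 0.5} // made-up values, not real coefficients
	fmt.Println(stepTimeMicros(beta, 2048, 32)) // 250 + 2048 + 16
}
```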
Alpha overhead is non-blocking
Alpha coefficients model CPU post-processing (tokenization, output serialization) that runs concurrently with GPU execution. Alpha time inflates TTFT and ITL metrics but does not block step scheduling -- the next batch step is scheduled at now + stepTime regardless of alpha overhead. This matches real vLLM's asynchronous post-processing pipeline.
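A minimal sketch of that scheduling rule, assuming a simple event-loop shape (not BLIS's actual scheduler code):

```go
package main

import "fmt"

// Sketch of the non-blocking alpha rule: alpha overhead inflates
// per-request metrics but never delays the next batch step.
// All values are in microseconds and purely illustrative.
func schedule(now, stepTime, alpha float64) (nextStep, itl float64) {
	nextStep = now + stepTime // alpha is NOT added here
	itl = stepTime + alpha    // but it does inflate the observed ITL
	return nextStep, itl
}

func main() {
	next, itl := schedule(0, 2500, 400)
	fmt.Println(next, itl)
}
```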
Roofline Mode (Default)¶
Roofline mode computes step time analytically from model architecture (FLOPs, parameter count) and hardware specifications (compute throughput, memory bandwidth). It does not require pre-trained coefficients, making it suitable for new models.
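The roofline principle itself fits in a few lines: a step is bounded by whichever of compute or memory traffic is slower. The H100 peaks below are the hardware_config.json specs quoted later on this page; the model numbers are hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// Minimal roofline sketch (not BLIS's exact implementation): step time
// is the max of compute time and memory-traffic time, in microseconds.
func rooflineMicros(flops, bytes, peakFlops, peakBytesPerSec float64) float64 {
	compute := flops / peakFlops * 1e6
	memory := bytes / peakBytesPerSec * 1e6
	return math.Max(compute, memory)
}

func main() {
	// H100 specs as listed in hardware_config.json: 989.5 TFLOPS BF16, 3.35 TB/s.
	peakFlops, peakBW := 989.5e12, 3.35e12
	// Hypothetical 14B-parameter dense model, BF16 weights, decode batch of 1:
	flops := 2.0 * 14e9 // ~2 FLOPs per parameter per generated token
	bytes := 2.0 * 14e9 // every weight streamed once, 2 bytes each
	// Decode is memory-bound: streaming weights dominates the step.
	fmt.Printf("%.0f µs\n", rooflineMicros(flops, bytes, peakFlops, peakBW))
}
```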
The --latency-model roofline Flag¶
Roofline is the default, so no flag is strictly required; passing `--latency-model roofline` makes the choice explicit.
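A minimal invocation, following the run examples at the top of this page:

```bash
# Roofline is the default; the flag just makes it explicit.
./blis run --model qwen/qwen3-14b \
    --latency-model roofline \
    --num-instances 4 --rate 100 --num-requests 500
```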
This auto-resolves both required inputs:
- Model config -- checks `model_configs/` for a cached `config.json`, fetches from HuggingFace on a miss
- Hardware config -- uses the bundled `hardware_config.json`
Supported hardware: The bundled hardware_config.json includes specs for H100 (80 GB HBM3, 989.5 TFLOPS BF16, 3.35 TB/s), A100-SXM (80 GB HBM2e, 312 TFLOPS BF16, 2.04 TB/s), and A100-80 (alias for A100-SXM). To use a different GPU, add an entry to hardware_config.json with the required fields (TFlopsPeak, BwPeakTBs, mfuPrefill, mfuDecode, MemoryGiB) and reference it via --hardware <name>.
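As an illustration, a new GPU entry might look like the following. The field names are the required ones listed above; the top-level nesting and the MFU values are assumptions, so check an existing entry in hardware_config.json for the exact schema:

```json
{
  "A100-PCIe": {
    "TFlopsPeak": 312.0,
    "BwPeakTBs": 1.94,
    "mfuPrefill": 0.55,
    "mfuDecode": 0.35,
    "MemoryGiB": 80
  }
}
```

Once added, select it with `--hardware A100-PCIe`.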
Validated models: Any dense or MoE transformer with a HuggingFace config.json works; several such models have been validated end-to-end.
Set HF_TOKEN to access gated models (e.g., Llama-2) and avoid rate limits:
```bash
export HF_TOKEN=your_token_here
./blis run --model meta-llama/llama-3.1-8b-instruct \
    --latency-model roofline --hardware H100 --tp 1
```
Manual Configuration¶
For full control, provide configs explicitly:
```bash
./blis run --model my-custom-model \
    --model-config-folder ./my-model-configs/ \
    --hardware-config ./my-hardware-config.json \
    --hardware H100 --tp 4
```
Adding Support for New Models¶
Any model with a HuggingFace config.json can use roofline mode:
- Download `config.json` from HuggingFace
- Place it in `model_configs/<model-name>/config.json`
- Run with `--latency-model roofline --hardware <GPU> --tp <N>`
Or let BLIS fetch it automatically with --latency-model roofline.
Tensor Parallelism and Roofline¶
The --tp flag divides FLOPs and memory traffic across TP ranks:
- Higher TP reduces per-GPU step time (more parallelism)
- Higher TP reduces KV blocks per GPU (memory split across ranks)
When choosing between TP and replication (more instances): TP reduces per-request latency, replication increases throughput. For capacity planning, simulate both configurations.
Automatic KV block calculation
When using roofline or crossmodel mode, --total-kv-blocks is automatically derived from model architecture and GPU memory if not explicitly set. The auto-calculated value accounts for TP (KV heads are sharded across ranks; total GPU memory scales with GPU count). Override with --total-kv-blocks <N> for non-standard deployments. The auto-calculation uses reference constants (90% GPU utilization, standard activation/overhead budgets matching the llm-d-benchmark capacity planner) and requires SwiGLU-family activations.
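A back-of-envelope sketch of this kind of calculation, using the 90% utilization constant above. The per-block byte formula and all model numbers are illustrative assumptions, not BLIS's exact calculator (which also budgets activations and overhead):

```go
package main

import "fmt"

// Sketch of KV-block auto-calculation: whatever GPU memory remains
// after weights (at 90% utilization) is divided into KV blocks.
// KV heads are sharded across TP ranks, as noted above.
func kvBlocks(memGiB float64, weightBytes, layers, kvHeads, headDim, blockSize, dtypeBytes, tp int) int {
	usable := 0.90*memGiB*(1<<30) - float64(weightBytes)
	// K and V, per layer, KV heads split across TP ranks.
	perBlock := 2 * layers * (kvHeads / tp) * headDim * blockSize * dtypeBytes
	return int(usable) / perBlock
}

func main() {
	// Hypothetical 14B dense model (28 GB BF16 weights, 40 layers,
	// 8 KV heads, head dim 128) on one 80 GiB GPU, block size 16:
	fmt.Println(kvBlocks(80, 28_000_000_000, 40, 8, 128, 16, 2, 1))
}
```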
Automatic MaxModelLen derivation
When using roofline or crossmodel mode and --max-model-len is not explicitly set, BLIS auto-derives it from max_position_embeddings in the HuggingFace config.json. For models with rope_scaling, the scaling factor is applied based on vLLM's blacklist approach: types linear, dynamic, yarn, default, and mrope apply the factor; types su, longrope, and llama3 are excluded (these encode the full context in max_position_embeddings). For yarn, original_max_position_embeddings is used as the base when present. gemma3 models skip rope_scaling entirely (max_position_embeddings is pre-scaled). The derived value is then capped at the KV-feasible maximum (total_kv_blocks * block_size) to prevent context windows from exceeding GPU memory capacity. Override with --max-model-len <N> when needed.
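The derivation rule above can be sketched as follows. The config plumbing is simplified and the function shape is hypothetical, but the blacklist logic mirrors the description:

```go
package main

import "fmt"

// Sketch of the rope_scaling blacklist rule described above
// (the real logic lives in BLIS; config shape simplified here).
func deriveMaxModelLen(maxPosEmb int, ropeType string, factor float64,
	origMaxPosEmb, kvBlocks, blockSize int) int {
	scaled := maxPosEmb
	switch ropeType {
	case "linear", "dynamic", "yarn", "default", "mrope":
		base := maxPosEmb
		if ropeType == "yarn" && origMaxPosEmb > 0 {
			base = origMaxPosEmb // yarn scales from the original window
		}
		scaled = int(float64(base) * factor)
	case "su", "longrope", "llama3":
		// excluded: max_position_embeddings already encodes full context
	}
	// Cap at the KV-feasible maximum (total_kv_blocks * block_size).
	if kvMax := kvBlocks * blockSize; scaled > kvMax {
		scaled = kvMax
	}
	return scaled
}

func main() {
	// yarn with factor 4 from a 32768-token original window, ample KV:
	fmt.Println(deriveMaxModelLen(131072, "yarn", 4.0, 32768, 100000, 16))
}
```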
Cross-Model Mode (Physics-Informed)¶
Cross-model mode estimates step time using 7 globally-fitted coefficients (4 beta for step time + 3 alpha for CPU overhead) that work across model architectures. Unlike blackbox (per-model coefficients) or roofline (no MoE awareness), cross-model uses architecture features from config.json to scale a single coefficient set.
StepTime formula:
```text
stepTime = β₀ × numLayers          # per-layer CUDA kernel dispatch
         + β₁ × dc × kvDimScaled   # KV cache bandwidth (decode only)
         + β₂ × (pf+dc) × isMoE    # MoE expert routing (Mixtral, etc.)
         + β₃ × isTP               # TP synchronization barrier
```
Where kvDimScaled = numLayers × numKVHeads × headDim / TP × 1e-6, isMoE = 1.0 if the model has expert routing, and isTP = 1.0 if TP > 1.
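A direct transcription of the formula, with made-up coefficient values (the fitted ones are in crossmodel_defaults):

```go
package main

import "fmt"

// Transcription of the crossmodel StepTime formula above.
// Coefficient values are illustrative only.
func crossModelStepTime(beta [4]float64, numLayers, numKVHeads, headDim, tp, pf, dc int, isMoE bool) float64 {
	kvDimScaled := float64(numLayers*numKVHeads*headDim) / float64(tp) * 1e-6
	moe, tpSync := 0.0, 0.0
	if isMoE {
		moe = 1.0 // binary MoE indicator
	}
	if tp > 1 {
		tpSync = 1.0 // TP synchronization barrier
	}
	return beta[0]*float64(numLayers) +
		beta[1]*float64(dc)*kvDimScaled +
		beta[2]*float64(pf+dc)*moe +
		beta[3]*tpSync
}

func main() {
	beta := [4]float64{100, 5000, 0.5, 200} // made-up, not the fitted set
	// Dense 32-layer model, 8 KV heads, head dim 128, TP=1, 64 decode tokens:
	fmt.Println(crossModelStepTime(beta, 32, 8, 128, 1, 0, 64, false))
}
```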
Pre-trained coefficients from real vLLM measurements across 4 architectures (7B-70B dense + 8x7B MoE) are stored in crossmodel_defaults in defaults.yaml. No per-model calibration needed.
MoE support: Cross-model correctly handles Mixture-of-Experts models. The β₂ term captures the per-token routing and expert dispatch overhead, activated when num_local_experts > 0 in the model's HuggingFace config.json. The MoE indicator is binary (MoE vs dense); the specific active expert count (num_experts_per_tok) is parsed for future refinement but not yet used in the formula.
Dense model prefill limitation
For dense models (non-MoE), step time does not scale with prefill token count — prefill compute cost is absorbed into the per-layer overhead (β₀). A batch prefilling 1 token costs the same as 2048 tokens. This is a known approximation from the training methodology (prefill KV writes overlap with compute on H100). For prefill-heavy dense-model workloads, blackbox mode with trained coefficients provides more accurate estimates because its β₁ term explicitly models per-prefill-token cost.
Automatic KV block calculation
Like roofline mode, crossmodel auto-derives --total-kv-blocks from model architecture and GPU memory when the flag is not set. Override with --total-kv-blocks <N> for non-standard deployments. The auto-calculation uses reference constants (90% GPU utilization, standard activation/overhead budgets matching the llm-d-benchmark capacity planner) and requires SwiGLU-family activations (silu, swiglu, geglu).
Trained-Roofline Mode (Recommended for New Models)¶
Trained-roofline mode applies learned correction factors to analytical roofline basis functions, combining the physical grounding of roofline with the accuracy of data-driven fitting. Coefficients are fitted from 137K real vLLM requests across 4 architectures (Llama-2-7b, Llama-2-70b, Mixtral-8x7B, CodeLlama-34b) via non-negative least squares regression.
```bash
./blis run --model meta-llama/llama-3.1-8b-instruct \
    --latency-model trained-roofline --hardware H100 --tp 2
```
Same auto-fetch chain as roofline and crossmodel (HuggingFace config + hardware config resolution).
StepTime formula (7 terms):
```text
StepTime = β₁ × max(T_pf_compute, T_pf_kv)  # prefill roofline bottleneck × correction
         + β₂ × max(T_dc_compute, T_dc_kv)  # decode roofline bottleneck × correction
         + β₃ × T_weight                    # weight loading × correction
         + β₄ × T_tp                        # TP communication × correction
         + β₅ × L                           # per-layer overhead (µs/layer)
         + β₆ × batch_size                  # per-request scheduling (µs/req)
         + β₇                               # per-step fixed overhead (µs)
```
Where each basis function (T_pf_compute, T_pf_kv, etc.) is a full analytical roofline calculation from model architecture + hardware specs. β₁-β₄ are dimensionless correction factors (near 1.0 = roofline is accurate). β₅-β₇ capture overhead not in the roofline model.
Key differences from roofline mode:
- No MFU scaling -- β₁ and β₂ ARE the MFU corrections. Applying `MfuPrefill`/`MfuDecode` would double-count.
- 3-matrix SwiGLU -- uses `6 × d × d_ff` for FFN FLOPs (gate + up + down) vs roofline's 2-matrix convention.
- MoE-aware weight loading -- `min(N, max(k, B×k))` effective experts, not all N.
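The effective-expert expression from the last bullet, sketched as code (hypothetical helper, illustrative values):

```go
package main

import "fmt"

// min(N, max(k, B×k)): how many experts' weights a batch effectively
// loads. Small batches touch only their active experts; large batches
// touch (at most) all N.
func effectiveExperts(n, k, batch int) int {
	e := k * batch
	if e < k {
		e = k // max(k, B×k)
	}
	if e > n {
		e = n // min(N, ...)
	}
	return e
}

func main() {
	// Mixtral-like: N=8 experts, k=2 active per token.
	fmt.Println(effectiveExperts(8, 2, 1))  // a single request touches few experts
	fmt.Println(effectiveExperts(8, 2, 16)) // a large batch touches all of them
}
```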
Alpha model (3 coefficients):
- `α₀` = API processing overhead (constant, added to TTFT via `QueueingTime`)
- `α₁` = fixed per-request post-decode overhead (added to E2E via `PostDecodeFixedOverhead`)
- `α₂` = per-output-token detokenization cost (added to ITL via `OutputTokenProcessingTime`)
Pre-trained coefficients (7% MAPE on GPU combined step time, test split) are stored in trained_roofline_defaults in defaults.yaml. No per-model calibration needed -- the roofline basis functions handle architecture-specific scaling.
TTFT accuracy caveat
The "7% MAPE" headline applies to GPU combined step time. The alpha model has higher error: α₀ (pre-queueing) has 93% MAPE because it is a single constant standing in for a highly variable real-world quantity, so TTFT predictions carry correspondingly more error than GPU step-time predictions. For TTFT-sensitive analysis, consider calibrating α₀ per deployment.
Chunked prefill limitation
Trained-roofline was fitted on single-step prefill data. When used with --long-prefill-token-threshold > 0 (chunked prefill), the attention FLOPs formula uses len(InputTokens) (total prompt) as context for each chunk, overestimating early-chunk step times. For chunked-prefill workloads, pure roofline mode may be more accurate until coefficients are refit on chunked data.
When to Use Which¶
| Aspect | Roofline (default) | Blackbox | Cross-Model | Trained-Roofline |
|---|---|---|---|---|
| When to use | Quick analytical estimate | Model has per-model coefficients in `defaults.yaml` | Hand-engineered physics features | Recommended for new models (best accuracy without per-model training) |
| Data required | HF `config.json` + `--hardware` + `--tp` | `defaults.yaml` entry for model/GPU/TP | HF `config.json` + `--hardware` + `--tp` | HF `config.json` + `--hardware` + `--tp` (global coefficients bundled) |
| GPU step time accuracy | Good (analytical) | Highest (per-model) | Good (7 global params) | 7% MAPE (10 global params, roofline × corrections) |
| MoE support | No (dense only) | If trained | Yes (binary indicator) | Yes (per-expert FLOPs + effective expert count) |
| Alpha model | Same as blackbox | α₀ + α₁·inputLen | Same as blackbox | α₀ (constant), α₁ (post-decode fixed), α₂ (per-token) |
| `PostDecodeFixedOverhead` | 0 | 0 | 0 | α₁ (~1.85 ms) |
Choosing the right mode
- Trained-roofline -- the recommended default for any model with a HuggingFace `config.json` (7% MAPE GPU combined, MoE-aware, no per-model calibration needed).
- Blackbox -- for models with per-model coefficients in `defaults.yaml` (slightly higher accuracy due to per-model fitting).
- Cross-model -- for backward compatibility with existing crossmodel workflows.
- Roofline -- for pure analytical estimates when no learned corrections are desired.
Current limitations
All analytical latency models support tensor parallelism (TP). Data parallelism (DP) and expert parallelism (EP) scheduling overhead are not yet modeled. Quantized weight precision (GPTQ, AWQ, FP8, compressed-tensors) is auto-detected from quantization_config, model name conventions (e.g., w4a16, FP8), or torch_dtype fallback, and is used for weight bandwidth and KV capacity calculations. MFU calibration values are still derived from FP16/BF16 measurements.
Pluggable Architecture¶
The LatencyModel interface (defined in sim/latency_model.go) has four methods:
| Method | Purpose |
|---|---|
| `StepTime(batch)` | Duration of one batch step given the running batch |
| `QueueingTime(req)` | Arrival-to-queue delay for a request |
| `OutputTokenProcessingTime()` | Per-token post-processing time |
| `PostDecodeFixedOverhead()` | Fixed per-request overhead at completion (0 for blackbox/roofline/crossmodel) |
All time estimates are in microseconds (ticks).
New backends register via the NewLatencyModelFunc variable in sim/latency_model.go. The sim/latency/register.go file uses init() to wire the factory, breaking the import cycle between sim/ (interface owner) and sim/latency/ (implementation). To add a custom backend, implement the four methods and register your factory via init() in a sub-package. See Extension Recipes for a step-by-step guide.
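A minimal custom backend might look like the sketch below. This page does not spell out the parameter types, so `Batch` and `Request` are placeholders rather than BLIS's actual definitions, and the wiring through `NewLatencyModelFunc` is only indicated in a comment:

```go
package main

import "fmt"

// Placeholder types: the real ones are defined in the sim/ package.
type Batch struct{ PrefillTokens, DecodeTokens int }
type Request struct{}

// ConstantModel is a trivial custom backend: every step costs the same.
// All four methods return microseconds, per the interface contract.
type ConstantModel struct{ stepMicros float64 }

func (m ConstantModel) StepTime(b Batch) float64           { return m.stepMicros }
func (m ConstantModel) QueueingTime(r Request) float64     { return 0 }
func (m ConstantModel) OutputTokenProcessingTime() float64 { return 5 }
func (m ConstantModel) PostDecodeFixedOverhead() float64   { return 0 }

func main() {
	// In BLIS this would be registered via NewLatencyModelFunc from an
	// init() in a sub-package; here we just exercise the methods.
	m := ConstantModel{stepMicros: 2500}
	fmt.Println(m.StepTime(Batch{PrefillTokens: 2048, DecodeTokens: 32}))
}
```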
Further Reading¶
- Roofline Estimation -- the mathematical model behind roofline step time calculation
- Configuration Reference -- all roofline-related CLI flags