Supported Models¶
BLIS supports any dense transformer model with a HuggingFace config.json. The default roofline mode auto-fetches configs on first use — no setup required. MoE (Mixture-of-Experts) models are also supported; Mixtral 8x7B has been validated end-to-end.
Blackbox Coefficient Catalog¶
The models below have pre-trained alpha/beta coefficients in defaults.yaml for blackbox mode (--latency-model blackbox). Blackbox mode offers slightly higher accuracy for these specific model/GPU/TP combinations due to per-model fitting.
Dense Models¶
| Model | Sizes |
|---|---|
| Meta LLaMA 3.1 | 8B |
| Meta LLaMA 3.3 | 70B |
| IBM Granite 3.1 | 8B |
| CodeLlama | 34B |
| Microsoft Phi-4 | 14B |
| Mistral Small (2501) | 24B |
| Mistral Small 3.1 (2503) | 24B |
| NVIDIA LLaMA 3.1 Nemotron | 70B |
| OpenAI GPT-OSS | 20B, 120B |
| Qwen 2.5 | 7B |
MoE Models¶
| Model | Architecture |
|---|---|
| LLaMA 4 Maverick (FP8) | 17B, 128 experts |
| LLaMA 4 Scout | 17B, 16 experts |
| Mixtral | 8x7B |
Quantized Variants¶
Red Hat AI (redhatai/) provides FP8, W4A16, and W8A8 quantized variants for many of the above models, including LLaMA 3.1/3.3/4, Mistral Small 3.1, Phi-4, Qwen 2.5, and SmolLM3 3B (FP8 only). See defaults.yaml for the full list.
Validated Architectures¶
The analytical modes (roofline, cross-model, trained-roofline) have been validated against real vLLM measurements on these architectures:
- Qwen 2.5 1.5B/3B, Qwen 3 14B
- LLaMA 2 7B/70B
- CodeLlama 34B
- Mixtral 8x7B (MoE)
Trained-roofline achieves 7% MAPE GPU combined step time across these architectures. Any other model with a HuggingFace config.json will work — it just hasn't been formally validated.
Parallelism and quantization
The analytical latency models (roofline, cross-model, trained-roofline) model tensor parallelism (TP). Data parallelism (DP) and expert parallelism (EP) are not yet modeled. Quantized weight precision is auto-detected and used for weight bandwidth and KV capacity calculations in all analytical backends. Supported formats: GPTQ, AWQ, FP8, and compressed-tensors (via quantization_config), plus model name conventions (e.g., w4a16, FP8).
MFU Calibration (Updated March 2026)
Hardware MFU (Model FLOPs Utilization) values in hardware_config.json were recalibrated based on empirical measurements and roofline theory. The updated values (H100: prefill=0.45/decode=0.30, A100: prefill=0.38/decode=0.18, L40S: prefill=0.32/decode=0.08) reflect conservative estimates for capacity planning. For detailed justification including evidence from FlashAttention-3, NVIDIA MLPerf, and production deployments, see Discussion #589. If you have existing capacity planning results, consider re-running simulations with the updated values for more accurate estimates.
Adding Blackbox Coefficients¶
To calibrate blackbox coefficients for a new model:
- Run live vLLM profiling (see Coefficient Calibration)
- Add the entry to
defaults.yaml