Supported Models

BLIS supports any dense transformer model with a HuggingFace config.json. The default roofline mode auto-fetches configs on first use — no setup required. MoE (Mixture-of-Experts) models are also supported; Mixtral 8x7B has been validated end-to-end.
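As a rough illustration of what a HuggingFace config.json tells a simulator, the sketch below classifies a parsed config as dense or MoE. It is not BLIS code: `describe_model` is a hypothetical helper, and it assumes the Mixtral-style key `num_local_experts`; other MoE families may name the field differently.

```python
import json

def describe_model(config: dict) -> str:
    """Summarize a parsed HuggingFace config.json (hypothetical helper)."""
    # Assumption: only the Mixtral-style expert-count key is checked here.
    experts = config.get("num_local_experts")
    layers = config.get("num_hidden_layers")
    hidden = config.get("hidden_size")
    kind = f"MoE ({experts} experts)" if experts and experts > 1 else "dense"
    return f"{kind}, {layers} layers, hidden size {hidden}"

# A trimmed, Mixtral-8x7B-like config used only for illustration.
mixtral_like = json.loads(
    '{"num_hidden_layers": 32, "hidden_size": 4096,'
    ' "num_local_experts": 8, "num_experts_per_tok": 2}'
)
print(describe_model(mixtral_like))
```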

Blackbox Coefficient Catalog

The models below have pre-trained alpha/beta coefficients in defaults.yaml for blackbox mode (--latency-model blackbox). Blackbox mode offers slightly higher accuracy for these specific model/GPU/TP combinations due to per-model fitting.

Dense Models

Model	Sizes
Meta LLaMA 3.1	8B
Meta LLaMA 3.3	70B
IBM Granite 3.1	8B
CodeLlama	34B
Microsoft Phi-4	14B
Mistral Small (2501)	24B
Mistral Small 3.1 (2503)	24B
NVIDIA LLaMA 3.1 Nemotron	70B
OpenAI GPT-OSS	20B, 120B
Qwen 2.5	7B

MoE Models

Model	Architecture
LLaMA 4 Maverick (FP8)	17B active parameters, 128 experts
LLaMA 4 Scout	17B active parameters, 16 experts
Mixtral	8x7B

Quantized Variants

Red Hat AI (redhatai/) provides FP8, W4A16, and W8A8 quantized variants for many of the above models, including LLaMA 3.1/3.3/4, Mistral Small 3.1, Phi-4, and Qwen 2.5, as well as SmolLM3 3B (FP8 only). See defaults.yaml for the full list.

Validated Architectures

The analytical modes (roofline, cross-model, trained-roofline) have been validated against real vLLM measurements on these architectures:

Qwen 2.5 1.5B/3B, Qwen 3 14B
LLaMA 2 7B/70B
CodeLlama 34B
Mixtral 8x7B (MoE)

Trained-roofline achieves 7% MAPE (mean absolute percentage error) on GPU combined step time across these architectures. Any other model with a HuggingFace config.json will work; it just has not been formally validated.
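For readers unfamiliar with the metric, here is a minimal sketch of MAPE using the standard definition. How BLIS aggregates step times before scoring is not specified here, and the numbers below are made up for illustration.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, returned in percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the measured value.
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)

# Hypothetical step times in milliseconds, not real BLIS or vLLM data.
measured_ms = [10.0, 20.0, 40.0]
predicted_ms = [11.0, 19.0, 40.0]
print(f"{mape(predicted_ms, measured_ms):.1f}% MAPE")
```

A lower value means the simulated step times track the measured ones more closely.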

Parallelism and Quantization

The analytical latency models (roofline, cross-model, trained-roofline) account for tensor parallelism (TP). Data parallelism (DP) and expert parallelism (EP) are not yet modeled. Quantized weight precision is auto-detected and used for weight bandwidth and KV capacity calculations in all analytical backends. Supported formats: GPTQ, AWQ, FP8, and compressed-tensors (detected via quantization_config in config.json), plus model name conventions (e.g., w4a16, FP8).
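The detection order described above might look roughly like the following sketch. This is an assumption-laden illustration, not BLIS's implementation: `weight_bytes` is a hypothetical helper, the key names follow common HuggingFace quantization_config layouts (`quant_method`, `bits`, and `config_groups` for compressed-tensors), and the 2-byte fallback assumes unquantized FP16/BF16 weights.

```python
import re

def weight_bytes(config: dict, model_name: str = "") -> float:
    """Guess bytes per weight element (hypothetical helper, not BLIS code)."""
    qcfg = config.get("quantization_config") or {}
    method = str(qcfg.get("quant_method", "")).lower()

    # GPTQ and AWQ configs usually carry an explicit bit width.
    if method in ("gptq", "awq") and qcfg.get("bits"):
        return qcfg["bits"] / 8
    if method == "fp8":
        return 1.0
    # compressed-tensors: assume the first group with a weight bit width applies.
    if method == "compressed-tensors":
        for group in (qcfg.get("config_groups") or {}).values():
            num_bits = (group.get("weights") or {}).get("num_bits")
            if num_bits:
                return num_bits / 8

    # Fall back to name conventions such as "w4a16" or "FP8".
    name = model_name.lower()
    match = re.search(r"w(\d+)a\d+", name)
    if match:
        return int(match.group(1)) / 8
    if "fp8" in name:
        return 1.0
    return 2.0  # assumption: unquantized FP16/BF16 weights

print(weight_bytes({"quantization_config": {"quant_method": "gptq", "bits": 4}}))
print(weight_bytes({}, "example-org/example-model-w8a8"))  # hypothetical name
```

Halving the bytes per weight halves the weight traffic per step, which is why precision feeds directly into the bandwidth and capacity terms.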

MFU Calibration (Updated March 2026)

Hardware MFU (Model FLOPs Utilization) values in hardware_config.json were recalibrated based on empirical measurements and roofline theory. The updated values (H100: prefill=0.45/decode=0.30, A100: prefill=0.38/decode=0.18, L40S: prefill=0.32/decode=0.08) reflect conservative estimates for capacity planning. For detailed justification including evidence from FlashAttention-3, NVIDIA MLPerf, and production deployments, see Discussion #589. If you have existing capacity planning results, consider re-running simulations with the updated values for more accurate estimates.
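To see how an MFU value scales an estimate, the sketch below computes a compute-bound time as FLOPs divided by (peak FLOP/s × MFU). It is a simplified illustration rather than the BLIS roofline: the MFU table copies the values quoted above, `compute_time_s` is a hypothetical helper, the peak throughput is a round placeholder to replace with the datasheet figure for your GPU, and the memory-bandwidth term of a full roofline is omitted.

```python
# MFU values quoted in the text, keyed by GPU and phase.
MFU = {
    "H100": {"prefill": 0.45, "decode": 0.30},
    "A100": {"prefill": 0.38, "decode": 0.18},
    "L40S": {"prefill": 0.32, "decode": 0.08},
}

def compute_time_s(flops: float, peak_flops: float, gpu: str, phase: str) -> float:
    """Compute-bound time: FLOPs / (peak FLOP/s * MFU). Hypothetical helper."""
    return flops / (peak_flops * MFU[gpu][phase])

# Illustrative workload: about 2 FLOPs per parameter per token,
# for an 8e9-parameter model and a 1,024-token prompt.
prefill_flops = 2 * 8e9 * 1024
placeholder_peak = 1e15  # placeholder peak FLOP/s, not a datasheet value

t = compute_time_s(prefill_flops, placeholder_peak, "H100", "prefill")
print(f"estimated prefill compute time: {t * 1e3:.1f} ms")
```

Because time is inversely proportional to MFU, lowering a value (for example, moving decode MFU from a higher figure to 0.30) lengthens the predicted compute time by the same ratio.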

Adding Blackbox Coefficients

To calibrate blackbox coefficients for a new model:

1. Run live vLLM profiling (see Coefficient Calibration).
2. Add the entry to defaults.yaml.