Configuration Reference¶
This page documents all CLI flags, configuration files, and their interactions. For architectural context on what these settings control, see Cluster Architecture and Core Engine.
Configuration Precedence¶
BLIS uses a layered configuration system where more specific sources override more general ones:
CLI flags (highest priority — explicit user input)
↓ overrides
YAML files (policy-config, workload-spec, defaults.yaml)
↓ overrides
Hardcoded defaults (lowest priority)
CLI flags only override YAML values when explicitly set. BLIS checks whether each flag was provided by the user (not just whether it has a non-default value), so default flag values do not accidentally override YAML configuration.
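The explicit-flag check can be sketched as a small merge routine (illustrative Python, not the actual implementation; all names here are hypothetical):

```python
# Hypothetical sketch of BLIS-style layered resolution: a CLI value wins only
# if the user explicitly set the flag, so untouched flag defaults never mask
# YAML configuration.

HARDCODED = {"rate": 1.0, "scheduler": "fcfs"}

def resolve(hardcoded, yaml_values, cli_values, cli_explicit):
    """Merge layers: hardcoded < YAML < explicitly-set CLI flags."""
    merged = dict(hardcoded)
    merged.update(yaml_values)
    for key, value in cli_values.items():
        if key in cli_explicit:  # only user-provided flags override YAML
            merged[key] = value
    return merged

# YAML sets rate=100; the CLI flag still holds its default (1.0) but was not
# explicitly passed, so the YAML value survives.
cfg = resolve(HARDCODED, {"rate": 100.0}, {"rate": 1.0, "scheduler": "fcfs"}, set())
print(cfg["rate"])  # -> 100.0

# Explicitly passing --scheduler sjf overrides both lower layers.
cfg = resolve(HARDCODED, {"rate": 100.0}, {"scheduler": "sjf"}, {"scheduler"})
print(cfg["scheduler"])  # -> sjf
```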
Parameter Resolution by Category¶
The general precedence (CLI → YAML → hardcoded) applies everywhere, but each parameter category has its own resolution layers. The chains below show highest-to-lowest priority for each category.
Latency coefficients (--alpha-coeffs, --beta-coeffs):
- Explicit CLI flags — if passed, used directly (no `defaults.yaml` lookup)
- `defaults.yaml` `models[]` entry — matched by model/GPU/TP/vllm-version (blackbox mode)
- Analytical computation — roofline, crossmodel, or trained-roofline backends compute from architecture + hardware specs
- Error — blackbox mode with no matching entry exits with an error
Hardware and TP (--hardware, --tp, --vllm-version):
- Explicit CLI flags
- `defaults.yaml` `defaults[]` entry — matched by `--model`
- Error — roofline/crossmodel/trained-roofline require these values; blackbox fails at coefficient lookup
KV cache blocks (--total-kv-blocks): See the detailed Resolution Process below — four layers including auto-calculation from GPU memory.
Workload parameters (--rate, --num-requests, --prompt-tokens, etc.):
- `--workload-spec` YAML file — when set, all token distribution and arrival parameters come from the YAML; CLI distribution flags are ignored
- CLI distribution flags — when `--workload distribution` (default) and no `--workload-spec`
- Named preset from `defaults.yaml` — when `--workload <name>` (e.g., `chatbot`)
- Hardcoded CLI flag defaults (e.g., `--prompt-tokens 512`, `--output-tokens 512`)
Note
--seed, --horizon, and --num-requests are exceptions — they override the workload-spec YAML values even when --workload-spec is set. --rate does NOT override aggregate_rate in the YAML (see Common Pitfalls).
Routing, admission, and scheduling (--routing-policy, --admission-policy, --scheduler, etc.):
- Explicit CLI flags
- `--policy-config` YAML bundle — loads all policy settings from one file
- Hardcoded defaults — `round-robin`, `always-admit`, `fcfs`
Batch formation (--max-num-running-reqs, --max-num-scheduled-tokens, etc.):
- Explicit CLI flags
- Hardcoded defaults — 256 running reqs, 2048 scheduled tokens
Batch formation has no YAML override path — defaults.yaml and --policy-config do not include batch settings.
Known Unit Gotchas¶
All internal timestamps in the DES (arrival time, schedule time, completion time, clock) use ticks, where 1 tick = 1 microsecond (μs). Output metrics convert to milliseconds for human readability, but several fields and flags use different units:
| Field / Flag | Unit | Notes |
|---|---|---|
| ttft_ms, e2e_ms, itl_ms (per-request JSON) | milliseconds | Converted from ticks by dividing by 1,000 |
| scheduling_delay_ms (per-request JSON) | milliseconds | Converted from ticks by dividing by 1,000. Historically was in ticks (μs) despite the _ms suffix — fixed by BC-14. Old hypothesis scripts (pre-fix) divide by 1,000 again unnecessarily. |
| scheduling_delay_p99_ms (aggregate) | milliseconds | Has always been in milliseconds |
| --horizon | ticks (μs) | Simulation time limit. 1,000,000 = 1 second |
| --admission-latency, --routing-latency | ticks (μs) | Decision latency injected into the DES event queue |
| think_time_us (workload YAML) | microseconds | Inter-round delay in multi-turn sessions. 5,000,000 = 5 seconds |
| aggregate_rate, --rate | requests/second | Not ticks — real-world time unit |
| --kv-transfer-bandwidth | blocks/tick | Transfer rate between GPU and CPU KV tiers |
| --kv-transfer-base-latency | ticks (μs) | Fixed per-transfer overhead |
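The tick convention above is easy to get wrong in post-processing scripts. A minimal pair of conversion helpers (illustrative names, not part of the BLIS codebase):

```python
# 1 tick = 1 microsecond throughout the DES.
TICKS_PER_MS = 1_000
TICKS_PER_SECOND = 1_000_000

def ticks_to_ms(ticks):
    """Convert internal DES ticks to milliseconds (what the *_ms fields report)."""
    return ticks / TICKS_PER_MS

def seconds_to_ticks(seconds):
    """Convert wall-clock seconds to ticks, e.g., for --horizon."""
    return int(seconds * TICKS_PER_SECOND)

print(ticks_to_ms(1_000_000))  # -> 1000.0 (one simulated second, in ms)
print(seconds_to_ticks(5))     # -> 5000000 (a 5-second --horizon or think time)
```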
Common Pitfalls¶
Capacity estimate mismatch (issue #390). CLI distribution mode defaults to --prompt-tokens 512, --output-tokens 512. If you estimate per-instance capacity using CLI mode, then run a workload-spec YAML with shorter sequences (e.g., mean 256/128), the YAML workload will achieve ~1.5x higher throughput than the CLI estimate predicted. Always derive capacity estimates from the actual workload you plan to run, not from CLI defaults.
--rate does NOT override workload-spec YAML. The --rate flag only applies in CLI distribution mode. When --workload-spec is set, request rate comes from aggregate_rate in the YAML file — the --rate flag is ignored. To change the rate for a YAML workload, edit the aggregate_rate field in the spec.
aggregate_rate override for inference-perf specs. When converting inference-perf specs via blis convert infperf, per-stage rates in the spec override a user-specified aggregate_rate. If the sum of stage rates differs from aggregate_rate, BLIS logs a warning and uses the stage-rate sum. This prevents silent rate scaling errors.
--total-kv-blocks phantom default. The CLI default is 1,000,000 blocks, but this value almost never takes effect. In roofline/crossmodel mode, auto-calculation from GPU memory supersedes it. In blackbox mode, defaults.yaml provides a per-model value (e.g., 17,600 for qwen3-14b/H100/TP=1). The 1M default is a last-resort fallback — if your simulation uses it, check whether your model/hardware combination has a defaults.yaml entry or whether auto-calculation is failing.
enable_multi_turn_chat semantic mismatch (issue #517). inference-perf's enable_multi_turn_chat creates one persistent session per virtual user. BLIS's closest equivalent is multi_turn.single_session: true in the workload YAML, but the session mechanics differ. When converting inference-perf specs, verify that the converted multi-turn behavior matches your intent.
Simulation Control¶
Top-level settings that control the simulation run.
| Flag | Type | Default | Description |
|---|---|---|---|
| --seed | int64 | 42 | Random seed for deterministic simulation. Same seed produces byte-identical stdout. |
| --horizon | int64 | MaxInt64 | Simulation time limit in ticks (microseconds). Simulation stops when the clock exceeds the horizon or all requests complete. |
| --log | string | "warn" | Log verbosity: trace, debug, info, warn, error, fatal, panic. Logs go to stderr. |
| --metrics-path | string | "" | File path to write MetricsOutput JSON (aggregate P50/P95/P99 TTFT, E2E, throughput stats). blis run only — blis replay uses --results-path instead. Empty = no file output. |
KV Cache Configuration¶
Controls GPU and CPU memory simulation for key-value cache blocks. Maps to KVCacheConfig.
| Flag | Type | Default | Description |
|---|---|---|---|
| --total-kv-blocks | int64 | 1000000* | Total GPU-tier KV blocks. |
| --block-size-in-tokens | int64 | 16 | Tokens per KV block. |
| --kv-cpu-blocks | int64 | 0 | CPU-tier blocks. 0 disables tiered caching. |
| --kv-offload-threshold | float64 | 0.9 | GPU utilization fraction above which blocks are offloaded to CPU. Range [0, 1]. |
| --kv-transfer-bandwidth | float64 | 100.0 | GPU-CPU transfer rate in blocks/tick. Required > 0 when CPU blocks > 0. |
| --kv-transfer-base-latency | int64 | 0 | Fixed per-transfer latency in ticks. |
* The effective value of --total-kv-blocks depends on the latency backend — see Resolution Process for the full priority chain. In blackbox mode, defaults.yaml overrides the 1,000,000 CLI default per model (e.g., qwen3-14b/H100/TP=1 uses 17,600 blocks). In roofline or crossmodel mode, the value is auto-calculated from model architecture and GPU memory via CalculateKVBlocks, which supersedes the defaults.yaml value. Explicit --total-kv-blocks always takes precedence.
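The idea behind auto-calculation can be shown with back-of-envelope arithmetic. This is an illustrative sketch only: the real CalculateKVBlocks uses full architecture parameters, and the function name, weight-memory estimate, and example numbers below are assumptions, not the actual implementation:

```python
def estimate_kv_blocks(mem_gib, weights_gib, num_layers, num_kv_heads,
                       head_dim, block_size_tokens=16, dtype_bytes=2):
    """KV blocks that fit in the GPU memory left over after model weights.

    Per token, each layer stores K and V for every KV head:
        bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    """
    free_bytes = (mem_gib - weights_gib) * 1024**3
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    bytes_per_block = bytes_per_token * block_size_tokens
    return int(free_bytes // bytes_per_block)

# e.g., an 80 GiB GPU with ~28 GiB of FP16 weights, 40 layers,
# 8 KV heads of head dim 128, 16-token blocks:
print(estimate_kv_blocks(80, 28, 40, 8, 128))  # -> 21299
```

A result in the tens of thousands of blocks (vs. the 1,000,000 phantom default) is why a simulation silently using the 1M fallback badly overestimates KV capacity.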
Batch Formation¶
Controls how requests are selected for the running batch. Maps to BatchConfig.
| Flag | Type | Default | Description |
|---|---|---|---|
| --max-num-running-reqs | int64 | 256 | Maximum requests in the running batch simultaneously. |
| --max-num-scheduled-tokens | int64 | 2048 | Maximum total new tokens across all running requests per step (token budget). |
| --long-prefill-token-threshold | int64 | 0 | Prefill length threshold for chunked prefill. 0 = disabled (all prefill in one step). |
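The interaction of the two caps can be sketched as a simple admission loop (illustrative only, not the actual scheduler; chunked prefill is omitted):

```python
def form_batch(waiting_tokens, running, max_running=256, token_budget=2048):
    """Return indices of waiting requests admitted this step.

    Each step admits new work until either the running-request cap
    (--max-num-running-reqs) or the token budget
    (--max-num-scheduled-tokens) is exhausted.
    """
    admitted, used = [], 0
    for i, tokens in enumerate(waiting_tokens):
        if running + len(admitted) >= max_running:
            break  # running-request cap reached
        if used + tokens > token_budget:
            break  # token budget exhausted; the rest wait for a later step
        admitted.append(i)
        used += tokens
    return admitted

# Three prefills of 900 tokens: only two fit inside the 2048-token budget.
print(form_batch([900, 900, 900], running=0))  # -> [0, 1]
```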
Latency Model¶
Regression Coefficients¶
Trained coefficients for the blackbox latency model. Maps to LatencyCoeffs.
| Flag | Type | Default | Description |
|---|---|---|---|
| --alpha-coeffs | float64 slice | [0, 0, 0] | Alpha coefficients [alpha0, alpha1, alpha2]. Models non-GPU overhead. Must be non-negative. |
| --beta-coeffs | float64 slice | [0, 0, 0] | Beta coefficients [beta0, beta1, beta2]. Models GPU step time. Must be non-negative. |
When --alpha-coeffs and --beta-coeffs are not explicitly provided on the CLI, BLIS automatically loads pre-trained coefficients from defaults.yaml based on the model, GPU, and TP configuration. Explicitly passing --alpha-coeffs 0,0,0 preserves zero coefficients (they are not overridden by defaults).
Model and Hardware Selection¶
Maps to ModelHardwareConfig.
| Flag | Type | Default | Description |
|---|---|---|---|
| --model | string | (required) | LLM model name (e.g., qwen/qwen3-14b). |
| --hardware | string | "" | GPU type. Bundled options: H100, A100-SXM, A100-80. If empty, loaded from defaults.yaml. Add new GPUs to hardware_config.json. |
| --tp | int | 0 | Tensor parallelism degree. If 0, loaded from defaults.yaml. |
| --vllm-version | string | "" | vLLM version string. If empty, loaded from defaults.yaml. |
| --max-model-len | int64 | 0 | Max total sequence length (input + output) in tokens. 0 = unlimited. Mirrors vLLM's --max-model-len. Auto-derived from max_position_embeddings in HuggingFace config.json for roofline/crossmodel backends. Applies rope_scaling factor for types linear, dynamic, yarn, default, mrope; excludes su, longrope, llama3; skips entirely for gemma3 models. Capped at the KV-feasible maximum. |
Roofline Mode¶
For analytical step time estimation without trained coefficients.
| Flag | Type | Default | Description |
|---|---|---|---|
| --latency-model | string | "roofline" | Latency model backend: roofline (default), blackbox, crossmodel, trained-roofline. When set to roofline, crossmodel, or trained-roofline, auto-fetches HuggingFace config.json and resolves hardware config. Requires --hardware and --tp. Set HF_TOKEN for gated models. trained-roofline is recommended for new models (7% MAPE GPU step time). |
| --model-config-folder | string | "" | Path to a folder containing HuggingFace config.json. Overrides --latency-model auto-resolution. |
| --hardware-config | string | "" | Path to hardware_config.json with GPU specifications. Overrides --latency-model auto-resolution. |
See Roofline Estimation for details on the analytical model.
Latency Mode Selection¶
The latency model mode is selected based on available configuration:
- Roofline mode (default): auto-resolves model config from HuggingFace and hardware config from bundled `hardware_config.json`. Requires `--hardware` and `--tp` (loaded from `defaults.yaml` when available).
- Blackbox mode: if `--latency-model blackbox` is set. Uses trained alpha/beta coefficients from `defaults.yaml`. Requires a matching entry for the model/GPU/TP combination.
- Cross-model mode: if `--latency-model crossmodel` is set with `--hardware` and `--tp`. Uses 7 globally-fitted coefficients (4 beta for step time + 3 alpha for CPU overhead) from `crossmodel_defaults` in `defaults.yaml`. Architecture features are derived from HuggingFace config.json. MoE-aware.
- Trained-roofline mode: if `--latency-model trained-roofline` is set with `--hardware` and `--tp`. Uses 10 globally-fitted coefficients (7 beta for roofline corrections + 3 alpha for CPU overhead) from `trained_roofline_defaults` in `defaults.yaml`. Achieves 7% MAPE on GPU combined step time.
- Error: if blackbox mode is selected and no coefficients can be resolved for the model/GPU/TP combination.
Cluster Configuration¶
With --num-instances 1 (the default), BLIS runs a single-instance simulation — requests go directly to the wait queue with no admission or routing layer. With --num-instances N (N > 1), the cluster simulation activates: requests pass through the admission and routing pipeline before reaching per-instance wait queues. See Cluster Architecture for the multi-instance pipeline and Core Engine for single-instance internals.
| Flag | Type | Default | Description |
|---|---|---|---|
| --num-instances | int | 1 | Number of inference instances. 1 = single-instance mode; > 1 = cluster mode with admission and routing. |
Admission Policy¶
Controls which requests enter the routing pipeline. See Cluster Architecture: Admission.
| Flag | Type | Default | Description |
|---|---|---|---|
| --admission-policy | string | "always-admit" | Policy name: always-admit, token-bucket, reject-all, tier-shed. |
| --admission-latency | int64 | 0 | Admission decision latency in microseconds. Must be >= 0. |
| --token-bucket-capacity | float64 | 10000 | Token bucket maximum capacity. Required > 0 when using token-bucket. |
| --token-bucket-refill-rate | float64 | 1000 | Token bucket refill rate in tokens/second. Required > 0 when using token-bucket. |
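The two token-bucket knobs can be illustrated with a minimal bucket (a sketch of the standard algorithm, not the BLIS implementation):

```python
class TokenBucket:
    """Admit requests while tokens remain; refill continuously over time."""

    def __init__(self, capacity=10_000.0, refill_rate=1_000.0):
        self.capacity = capacity        # --token-bucket-capacity
        self.refill_rate = refill_rate  # --token-bucket-refill-rate (tokens/s)
        self.tokens = capacity          # start full
        self.last_t = 0.0

    def admit(self, now_s, cost=1.0):
        """Refill for elapsed time, then admit if enough tokens remain."""
        elapsed = now_s - self.last_t
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_t = now_s
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# A tiny bucket: burst of 2 admitted, third rejected, refilled after 1 s.
bucket = TokenBucket(capacity=2.0, refill_rate=1.0)
print(bucket.admit(0.0), bucket.admit(0.0), bucket.admit(0.0))  # True True False
print(bucket.admit(1.0))  # True
```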
Tier-shed admission (--admission-policy tier-shed): Sheds lower-priority SLO tiers under overload. Configured via --policy-config YAML only:
| YAML field | Type | Default | Description |
|---|---|---|---|
| admission.tier_shed_threshold | int | 0 | Per-instance in-flight threshold above which shedding activates. 0 = shed at any load. |
| admission.tier_shed_min_priority | int | 3 | Minimum SLO tier priority admitted under overload. 3 = admit Standard+Critical, shed the rest. 0 = admit all tiers (same as always-admit — footgun). |
SLO tier priorities: critical=4, standard=3, sheddable=2, batch=1, background=0.
Per-tenant fair-share budgets (tenant_budgets): A secondary admission layer that runs after the admission policy. If the admission policy rejects a request, tenant budgets are not consulted. If the admission policy admits a request, tenant budgets then apply: over-budget tenants have Sheddable-and-below requests (SLO class priority < 3) preferentially shed while Critical and Standard traffic is always protected. Configured via --policy-config YAML only (no CLI flag):
| YAML field | Type | Default | Description |
|---|---|---|---|
| tenant_budgets | map[string]float64 | nil | Per-tenant fraction of total cluster capacity (NumInstances × MaxRunningReqs). Absent key = unlimited. 0.0 = effectively zero concurrent slots (one request may slip through per admission tick due to DES admission-before-routing event ordering; see IsOverBudget docstring). Values must be in [0, 1]. |
Example:
admission:
  policy: "tier-shed"
  tier_shed_threshold: 0
  tier_shed_min_priority: 2  # sheddable passes tier-shed; budget enforcement handles per-tenant limits
tenant_budgets:
  alice: 0.3  # alice may use at most 30% of total cluster capacity
  bob: 0.7    # bob may use at most 70% of total cluster capacity
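Working through the capacity arithmetic makes the fractions concrete. Budget fractions apply to total cluster capacity (NumInstances × MaxRunningReqs); the slot-count helper below is an illustrative sketch, not the IsOverBudget implementation:

```python
def tenant_slots(fraction, num_instances, max_running_reqs):
    """Concurrent slots a tenant's budget fraction allows, cluster-wide."""
    return int(fraction * num_instances * max_running_reqs)

# With 4 instances at the default 256 running requests each:
capacity = 4 * 256               # 1024 concurrent slots cluster-wide
print(tenant_slots(0.3, 4, 256)) # alice -> 307 slots
print(tenant_slots(0.7, 4, 256)) # bob   -> 716 slots
```

Only Sheddable-and-below traffic counts against these limits; Critical and Standard requests are never budget-shed.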
Routing Policy¶
Controls how admitted requests are assigned to instances. See Cluster Architecture: Routing.
| Flag | Type | Default | Description |
|---|---|---|---|
| --routing-policy | string | "round-robin" | Policy name: round-robin, least-loaded, weighted, always-busiest. |
| --routing-latency | int64 | 0 | Routing decision latency in microseconds. Must be >= 0. |
| --routing-scorers | string | "" | Scorer configuration for the weighted policy. Format: name:weight,name:weight,... |
| --snapshot-refresh-interval | int64 | 0 | Prometheus snapshot refresh interval for all instance metrics (QueueDepth, BatchSize, KVUtilization) in microseconds. 0 = immediate. |
Scorer Configuration¶
When using --routing-policy weighted, the --routing-scorers flag configures which scorers are used and their relative weights:
Available scorers: prefix-affinity, precise-prefix-cache, no-hit-lru, queue-depth, kv-utilization, load-balance.
Default (when --routing-scorers is empty): prefix-affinity:3, queue-depth:2, kv-utilization:2 (llm-d parity).
See Cluster Architecture: Scorer Composition for details on each scorer.
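A weighted policy of this shape can be sketched as a weighted sum over per-instance scorer outputs. The weights mirror the documented default (prefix-affinity:3, queue-depth:2, kv-utilization:2); the scorer values and the combining rule are illustrative assumptions, not the actual implementation:

```python
def pick_instance(scores_by_instance, weights):
    """Pick the instance with the highest weighted scorer sum.

    scores_by_instance: {instance: {scorer_name: score in [0, 1]}},
    where higher scores are better for every scorer.
    """
    def weighted(instance):
        scores = scores_by_instance[instance]
        return sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return max(scores_by_instance, key=weighted)

weights = {"prefix-affinity": 3.0, "queue-depth": 2.0, "kv-utilization": 2.0}
scores = {
    "inst-0": {"prefix-affinity": 0.9, "queue-depth": 0.2, "kv-utilization": 0.5},
    "inst-1": {"prefix-affinity": 0.1, "queue-depth": 0.9, "kv-utilization": 0.9},
}
# inst-0: 3*0.9 + 2*0.2 + 2*0.5 = 4.1; inst-1: 3*0.1 + 2*0.9 + 2*0.9 = 3.9
print(pick_instance(scores, weights))  # -> inst-0 (prefix hits outweigh load)
```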
Scheduling and Priority¶
Per-instance policies that control request ordering within the wait queue. Maps to PolicyConfig.
| Flag | Type | Default | Description |
|---|---|---|---|
| --scheduler | string | "fcfs" | Scheduler: fcfs, priority-fcfs, sjf, reverse-priority. |
| --priority-policy | string | "constant" | Priority policy: constant, slo-based, inverted-slo. |
See Core Engine: Scheduling for policy details.
Workload Configuration¶
Workload Modes¶
BLIS supports three workload specification modes, in order of precedence:
| Mode | Trigger | Description |
|---|---|---|
| Workload-spec YAML | --workload-spec <path> | Multi-client workload with per-client distributions. Highest priority. |
| CLI distribution | --workload distribution (default) | Single-client Gaussian distribution controlled by CLI flags. |
| Preset | --workload <name> | Named preset from defaults.yaml: chatbot, contentgen, summarization, multidoc. |
Distribution Mode Flags¶
Used when --workload distribution (the default) and no --workload-spec is set.
| Flag | Type | Default | Description |
|---|---|---|---|
| --rate | float64 | 1.0 | Request arrival rate in requests/second. |
| --num-requests | int | 100 | Total number of requests to generate. |
| --prompt-tokens | int | 512 | Mean prompt (input) token count. |
| --prompt-tokens-stdev | int | 256 | Standard deviation of prompt tokens. |
| --prompt-tokens-min | int | 2 | Minimum prompt token count. |
| --prompt-tokens-max | int | 7000 | Maximum prompt token count. |
| --output-tokens | int | 512 | Mean output token count. |
| --output-tokens-stdev | int | 256 | Standard deviation of output tokens. |
| --output-tokens-min | int | 2 | Minimum output token count. |
| --output-tokens-max | int | 7000 | Maximum output token count. |
| --prefix-tokens | int | 0 | Prefix token count for prefix caching simulation. Additive to prompt tokens. |
Workload-Spec YAML¶
The --workload-spec flag loads a YAML file defining multi-client workloads:
aggregate_rate: 100     # Total arrival rate in requests/second
num_requests: 1000
seed: 42
horizon: 1000000000     # Ticks (microseconds)
clients:
  - id: "interactive"
    rate_fraction: 0.6  # 60% of aggregate rate
    prefix_group: "chat"
    prefix_length: 512
    arrival:
      process: "poisson"
    input_distribution:
      type: "gaussian"
      params:
        mean: 256
        std_dev: 128
        min: 2
        max: 4096
    output_distribution:
      type: "exponential"
      params:
        mean: 128
  - id: "batch"
    rate_fraction: 0.4
    arrival:
      process: "gamma"
      cv: 2.0
    input_distribution:
      type: "gaussian"
      params:
        mean: 1024
        std_dev: 512
        min: 2
        max: 7000
    output_distribution:
      type: "gaussian"
      params:
        mean: 512
        std_dev: 256
        min: 2
        max: 7000
Supported arrival processes: poisson, gamma (with cv parameter), weibull (with cv parameter), constant.
Supported token distributions: gaussian, exponential, pareto_lognormal, constant, empirical.
When --workload-spec is set, CLI --seed, --horizon, and --num-requests still override the YAML values if explicitly provided.
Trace Files¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --workload-spec | string | "" | Path to workload-spec YAML. |
| --defaults-filepath | string | "defaults.yaml" | Path to defaults.yaml. |
| --trace-output | string | "" | Export workload as TraceV2 files (<prefix>.yaml + <prefix>.csv). |
Policy Bundle¶
The --policy-config flag loads admission, routing, priority, and scheduling configuration from a single YAML file:
admission:
  policy: "always-admit"
  token_bucket_capacity: 10000.0
  token_bucket_refill_rate: 1000.0
routing:
  policy: "weighted"
  scorers:
    - name: "prefix-affinity"
      weight: 3.0
    - name: "queue-depth"
      weight: 2.0
    - name: "kv-utilization"
      weight: 2.0
priority:
  policy: "constant"
scheduler: "fcfs"

# Node pool infrastructure (Phase 1A — optional; omit for backward-compatible single-pool mode)
node_pools:
  - name: "gpu-pool-1"
    gpu_type: "H100"    # pool-authoritative: overrides --gpu flag for GPU label (all backends) and hardware calibration (roofline/trained-roofline backends); see issues #892/#893
    gpus_per_node: 8
    gpu_memory_gib: 80.0
    initial_nodes: 2
    min_nodes: 1
    max_nodes: 4
    provisioning_delay:
      mean: 30.0        # seconds
      stddev: 5.0       # 0 = constant delay

# Per-GPU hardware calibration overrides for roofline/trained-roofline backends (issue #893 — optional)
# Key: GPU type string matching a pool's gpu_type. Value: HardwareCalib for that GPU.
# When a pool's gpu_type is found in this map, the matched calibration overrides the CLI
# --gpu calibration at instance construction time (both sync and deferred/NodeReadyEvent paths),
# ensuring pool-placed instances use the correct TFlopsPeak/BwPeakTBs for roofline math.
# Omitting this field (zero value) is safe: no override, backward-compatible with all callers.
# The blackbox backend does not use HWConfig and is unaffected by this field.
# Keys must exactly match the gpu_type strings used in the node_pools entries above.
hw_config_by_gpu:
  H100:
    tflops_peak: 1979.0  # FP16 TFLOPS
    bw_peak_tbs: 3.35    # HBM bandwidth in TB/s
    mfu_prefill: 0.5
    mfu_decode: 0.5
  A100:
    tflops_peak: 1248.0
    bw_peak_tbs: 2.0
    mfu_prefill: 0.5
    mfu_decode: 0.5

# Instance lifecycle (Phase 1A — all zero/empty = backward-compatible defaults)
instance_lifecycle:
  loading_delay:
    mean: 10.0             # seconds to load model weights onto GPU
    stddev: 1.0            # 0 = constant delay
  warm_up_request_count: 5 # requests served before leaving WarmingUp state
  warm_up_ttft_factor: 2.0 # TTFT multiplier applied to warm-up requests (≥ 1.0)
  drain_policy: "WAIT"     # IMMEDIATE | WAIT | REDIRECT

# Per-tenant fair-share budgets (Phase 1B — optional; omit for no tenant enforcement)
# Each value is a fraction of total cluster capacity (NumInstances × MaxRunningReqs).
# Absent key = unlimited. 0.0 = effectively zero concurrent slots (DES ordering caveat: see IsOverBudget docstring). Values must be in [0, 1].
# Critical and Standard traffic is always protected from budget shedding.
tenant_budgets:
  team-a: 0.4
  team-b: 0.4
CLI flags override policy bundle values when explicitly set. For example, --routing-policy least-loaded overrides the bundle's routing.policy setting.
Node pools and instance lifecycle are YAML-only
node_pools and instance_lifecycle have no corresponding CLI flags. They must be set via --policy-config. Omitting them is safe — the simulator falls back to single-pool, no-lifecycle mode for full backward compatibility.
Decision Tracing¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --trace-level | string | "none" | Trace verbosity: none or decisions. |
| --counterfactual-k | int | 0 | Number of counterfactual candidates per routing decision. Requires --trace-level decisions. |
| --summarize-trace | bool | false | Print trace summary after simulation. Requires --trace-level decisions. |
See Cluster Architecture: Counterfactual Regret.
Fitness Evaluation¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --fitness-weights | string | "" | Fitness function weights. Format: metric:weight,metric:weight,... |
When configured, BLIS computes a single fitness score from aggregated metrics. Latency metrics are normalized via 1/(1 + value/1000) where value is in ticks (microseconds) and 1000 = 1ms reference (lower is better); throughput metrics via value/(value + reference) where referenceRPS = 100.0 and referenceTPS = 10000.0 (higher is better). Useful for automated policy comparison across multiple simulation runs.
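The two normalization formulas, as stated above, are easy to reproduce. The weighted combination at the end is an illustrative assumption for how a composite score might be formed, not the documented implementation:

```python
REFERENCE_RPS = 100.0
REFERENCE_TPS = 10_000.0

def norm_latency(ticks):
    """Latency normalization: 1/(1 + value/1000), value in ticks (1000 = 1 ms)."""
    return 1.0 / (1.0 + ticks / 1000.0)

def norm_throughput(value, reference):
    """Throughput normalization: value/(value + reference)."""
    return value / (value + reference)

def fitness(normalized_metrics, weights):
    """Illustrative weighted sum over already-normalized metrics."""
    return sum(w * normalized_metrics[name] for name, w in weights.items())

print(norm_latency(1000))                     # -> 0.5 (a 1 ms latency)
print(norm_throughput(100.0, REFERENCE_RPS))  # -> 0.5 (throughput at reference)
score = fitness(
    {"ttft": norm_latency(2000), "rps": norm_throughput(300.0, REFERENCE_RPS)},
    {"ttft": 0.5, "rps": 0.5},
)
print(round(score, 3))  # -> 0.542
```

Both normalizations map onto (0, 1), so metrics with different units can be combined into a single comparable score across runs.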
defaults.yaml¶
The defaults.yaml file serves as a model registry and workload preset store:
# Section 1: Hardware/TP mappings (keyed by model ID)
defaults:
  qwen/qwen3-14b:
    GPU: H100
    tensor_parallelism: 1
    vllm_version: vllm/vllm-openai:v0.11.0
    hf_repo: Qwen/Qwen3-14B

# Section 2: Workload presets
workloads:
  chatbot:
    prompt_tokens: 256
    prompt_tokens_stdev: 100
    output_tokens: 256
    output_tokens_stdev: 100
    # ... min/max bounds

# Section 3: Trained coefficients (keyed by model+GPU+TP)
models:
  - id: qwen/qwen3-14b
    GPU: H100
    tensor_parallelism: 1
    vllm_version: vllm/vllm-openai:v0.11.0
    alpha_coeffs: [8888.09, 0.18, 0.0]
    beta_coeffs: [13578.19, 39.44, 27.32]
    total_kv_blocks: 17600
Resolution Process¶
When BLIS starts, it resolves latency configuration through a layered process. Explicit CLI flags always take precedence (R18).
Hardware and TP defaults resolution (all backends):
Before any backend-specific logic runs, BLIS loads hardware/TP/vLLM-version defaults from defaults.yaml for the specified --model when those flags are not explicitly provided. This ensures analytical backends (roofline, crossmodel, trained-roofline) can auto-resolve without requiring explicit --hardware and --tp for models listed in defaults.yaml.
Backend-specific resolution:
- If `--latency-model roofline` (default), `crossmodel`, or `trained-roofline`:
    - Auto-resolve model config: check `model_configs/` for an existing `config.json`, fetch from HuggingFace on miss (set `HF_TOKEN` for gated models)
    - Auto-resolve hardware config from bundled `hardware_config.json`
    - For roofline: beta coefficients are computed analytically from model architecture and hardware specs
    - For crossmodel: load global alpha + beta coefficients from `crossmodel_defaults` in `defaults.yaml`
    - For trained-roofline: load global correction coefficients from `trained_roofline_defaults` in `defaults.yaml`
    - `--model-config-folder` and `--hardware-config` override auto-resolution when explicitly set
- If `--latency-model blackbox`:
    - Look up the model in `defaults.yaml` using `--model`, `--hardware`, `--tp`, `--vllm-version`
    - Load alpha/beta coefficients from the matching entry
    - If no matching entry is found, exit with an error suggesting roofline, crossmodel, or trained-roofline
- If `--alpha-coeffs` and `--beta-coeffs` are explicitly provided via CLI:
    - Use them directly; no `defaults.yaml` lookup
--total-kv-blocks resolution (highest priority wins):
1. Explicit CLI flag — if `--total-kv-blocks` is set, that value is used regardless of backend
2. Auto-calculation (roofline/crossmodel only) — when `MemoryGiB > 0` in the hardware config, `CalculateKVBlocks` derives the block count from model architecture and GPU memory, superseding the `defaults.yaml` value. Three failure modes: (a) if `MemoryGiB` is missing from `hardware_config.json`, BLIS warns and falls back to the `defaults.yaml` value (layer 3) or hardcoded default (layer 4); (b) if model architecture params cannot be extracted from `config.json`, BLIS exits with an error; (c) if the calculation itself fails (e.g., unsupported activation function), BLIS exits with an error. Only the `MemoryGiB`-missing case is a graceful fallback — other failures are fatal. Auto-calculation currently requires SwiGLU-family activations (`silu`, `swiglu`, `geglu`); models with other activations (e.g., Falcon's `gelu`) should set `--total-kv-blocks` explicitly
3. `defaults.yaml` — per-model block count loaded for the model/GPU/TP combination (e.g., 17,600 for qwen3-14b/H100/TP=1). For roofline/crossmodel with `MemoryGiB > 0`, this value is superseded by auto-calculation (layer 2). It remains the effective value only for blackbox mode or when `MemoryGiB` is unavailable in the hardware config
4. Hardcoded default — 1,000,000 (the CLI flag default, used only when no other source provides a value)
Coefficient Calibration¶
BLIS uses a data-driven calibration strategy to ensure simulation accuracy. This process runs once per environment configuration (model, GPU, TP degree, vLLM version):
- Initialization: Define baseline estimates for alpha and beta coefficients as starting points for optimization
- Profiling: Execute training workloads on a live vLLM instance to collect ground-truth mean and P90 metrics for TTFT, ITL, and E2E
- Optimization: Run BLIS iteratively using Blackbox Bayesian Optimization to minimize the multi-objective loss:
$$\text{Loss} = \sum_{m \in \{\text{TTFT, ITL, E2E}\}} \left( |GT_{\text{mean},m} - Sim_{\text{mean},m}| + |GT_{\text{p90},m} - Sim_{\text{p90},m}| \right)$$
- Artifact generation: optimal alpha/beta coefficients are stored in `defaults.yaml` for production use
For environments where live profiling is not feasible, the Roofline model provides analytical step time estimation without any training data.
CLI Flag Summary by Sub-Config¶
| Sub-Config | Flags |
|---|---|
| KVCacheConfig | --total-kv-blocks, --block-size-in-tokens, --kv-cpu-blocks, --kv-offload-threshold, --kv-transfer-bandwidth, --kv-transfer-base-latency |
| BatchConfig | --max-num-running-reqs, --max-num-scheduled-tokens, --long-prefill-token-threshold |
| LatencyCoeffs | --alpha-coeffs, --beta-coeffs |
| ModelHardwareConfig | --model, --hardware, --tp, --vllm-version, --latency-model, --model-config-folder, --hardware-config, --max-model-len |
| PolicyConfig | --scheduler, --priority-policy |
| WorkloadConfig | --workload, --workload-spec, --defaults-filepath, --rate, --num-requests, --prompt-tokens*, --output-tokens*, --prefix-tokens |
| DeploymentConfig | --num-instances, --admission-policy, --admission-latency, --token-bucket-capacity, --token-bucket-refill-rate, --routing-policy, --routing-latency, --routing-scorers, --snapshot-refresh-interval, --trace-level, --counterfactual-k |
| Top-level | --seed, --horizon, --log, --metrics-path (run only), --trace-output, --policy-config, --fitness-weights, --summarize-trace |
blis observe¶
Dispatches a workload to a real inference server and records request-level timing into TraceV2 files for later replay and calibration.
Required¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --server-url | string | "" | Inference server URL (required). |
| --model | string | "" | Model name for API requests (required). |
| --trace-header | string | "" | Output path for TraceV2 header YAML (required). |
| --trace-data | string | "" | Output path for TraceV2 data CSV (required). |
Workload Input¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --workload-spec | string | "" | Path to WorkloadSpec YAML (alternative to --rate + distribution flags). |
| --rate | float64 | 0 | Requests per second for distribution synthesis. |
Optional¶
| Flag | Type | Default | Description |
|---|---|---|---|
--api-key |
string | "" | Bearer token for server authentication. |
--server-type |
string | "vllm" | Server type (vllm, tgi, etc.). |
--max-concurrency |
int | 256 | Maximum simultaneous in-flight requests. |
--warmup-requests |
int | 0 | Number of initial requests to exclude from trace. |
--no-streaming |
bool | false | Disable streaming (use non-streaming HTTP). |
--seed |
int64 | 42 | RNG seed for workload generation. |
--horizon |
int64 | 0 | Observation horizon in microseconds (0 = from spec or unlimited). |
--num-requests |
int | 0 | Maximum requests to generate (0 = from spec or unlimited). |
Distribution Synthesis¶
Used when --rate is set instead of --workload-spec. Same flag names as blis run but with different defaults tuned for observe workloads.
| Flag | Type | Default | Description |
|---|---|---|---|
| --prompt-tokens | int | 512 | Average prompt token count. |
| --prompt-tokens-stdev | int | 50 | Prompt token standard deviation. |
| --prompt-tokens-min | int | 1 | Minimum prompt tokens. |
| --prompt-tokens-max | int | 2048 | Maximum prompt tokens. |
| --output-tokens | int | 512 | Average output token count. |
| --output-tokens-stdev | int | 50 | Output token standard deviation. |
| --output-tokens-min | int | 1 | Minimum output tokens. |
| --output-tokens-max | int | 2048 | Maximum output tokens. |
| --prefix-tokens | int | 0 | Shared prefix token count. |
| --api-format | string | "completions" | API format: completions (/v1/completions) or chat (/v1/chat/completions). |
| --unconstrained-output | bool | false | Do not set max_tokens (let the server decide output length). |
| --rtt-ms | float64 | 0 | Measured network round-trip time in milliseconds (recorded in the trace header for calibrate). |
blis replay¶
Replays a captured TraceV2 file through the discrete-event simulator. Replay reuses the full simulation engine, so it accepts the same sim-config flags as blis run — see the sections above for Simulation Control, KV Cache Configuration, Batch Formation, Latency Model, Cluster Configuration, Admission Policy, Routing Policy, Scheduling and Priority, Decision Tracing, and Fitness Evaluation.
Replay-Specific Flags¶
| Flag | Type | Default | Description |
|---|---|---|---|
| --trace-header | string | "" | Path to TraceV2 header YAML file (required). |
| --trace-data | string | "" | Path to TraceV2 data CSV file (required). |
| --results-path | string | "" | File to write []SimResult JSON (fields: request_id, ttft_us, e2e_us, input_tokens, output_tokens) for blis calibrate consumption. |
blis calibrate¶
Compares real observed latencies (from blis observe) against simulator predictions (from blis replay) and produces a calibration report with per-metric MAPE, Pearson R, and quality grades.
| Flag | Type | Default | Description |
|---|---|---|---|
| --trace-header | string | "" | Path to TraceV2 header YAML file (from blis observe; required). |
| --trace-data | string | "" | Path to TraceV2 data CSV file (from blis observe; required). |
| --sim-results | string | "" | Path to SimResult JSON file (from blis replay --results-path; required). |
| --report | string | "" | Path to write the calibration report JSON (required). |
| --warmup-requests | int | -1 | Number of initial requests to exclude. Default: from trace header warm_up_requests; pass 0 to include all. |
| --network-rtt-us | int64 | -1 | Network RTT in microseconds added to sim-side latencies. Default: from trace header network.measured_rtt_ms. |
| --network-bandwidth-mbps | float64 | 0 | Network bandwidth in Mbps for upload/download delay calculation (0 = no delay). |
blis convert¶
Converts external workload formats into BLIS WorkloadSpec v2 YAML. Three subcommands are available.
blis convert preset¶
Generates a WorkloadSpec from a named preset in defaults.yaml.
| Flag | Type | Default | Description |
|---|---|---|---|
| --name | string | "" | Preset name (e.g., chatbot, summarization, contentgen, multidoc). |
| --rate | float64 | 1.0 | Request rate in requests/second. |
| --num-requests | int | 100 | Number of requests. |
| --defaults-filepath | string | "defaults.yaml" | Path to defaults.yaml. |
blis convert servegen¶
Converts a ServeGen data directory into WorkloadSpec format.
| Flag | Type | Default | Description |
|---|---|---|---|
| --path | string | "" | Path to ServeGen data directory. |
blis convert infperf¶
Converts an inference-perf YAML specification into WorkloadSpec format.
| Flag | Type | Default | Description |
|---|---|---|---|
| --spec | string | "" | Path to inference-perf YAML spec. |
blis compose¶
Merges multiple WorkloadSpec v2 YAML files into a single combined specification.
| Flag | Type | Default | Description |
|---|---|---|---|
| --from | string (repeatable) | (none) | Path to a v2 WorkloadSpec YAML file. Can be repeated to merge multiple specs. |