Metrics & Results¶
This guide covers how to read BLIS output — from the primary JSON metrics to anomaly counters, KV cache diagnostics, per-SLO breakdowns, fitness scores, and trace summaries.
```bash
# Quick example: run with all diagnostic output enabled
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 200 --num-requests 1000 \
  --trace-level decisions --summarize-trace \
  --fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"
```
Primary Metrics (JSON Output)¶
The JSON output on stdout contains:
| Field | Unit | Description |
|---|---|---|
| `ttft_mean_ms` | ms | Mean Time To First Token — average responsiveness |
| `ttft_p90_ms` | ms | 90th percentile TTFT |
| `ttft_p99_ms` | ms | 99th percentile TTFT — tail latency |
| `e2e_mean_ms` | ms | Mean End-to-End latency — average total request time |
| `e2e_p99_ms` | ms | 99th percentile E2E |
| `itl_mean_ms` | ms | Mean Inter-Token Latency — streaming smoothness |
| `scheduling_delay_p99_ms` | ms | 99th percentile scheduling delay — queue wait time |
| `responses_per_sec` | req/s | Throughput |
| `completed_requests` | count | Requests that finished before horizon |
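The fields above can be read straight from the JSON on stdout. A minimal sketch — the payload values here are illustrative, not real output:

```python
import json

# Illustrative payload using the field names from the table above.
raw = """{
  "ttft_mean_ms": 120.5, "ttft_p90_ms": 310.2, "ttft_p99_ms": 870.0,
  "e2e_mean_ms": 2400.0, "e2e_p99_ms": 6100.0, "itl_mean_ms": 35.1,
  "scheduling_delay_p99_ms": 95.0, "responses_per_sec": 48.7,
  "completed_requests": 1000
}"""
m = json.loads(raw)

# A p99 far above the mean is the classic tail-latency signal.
tail_ratio = m["ttft_p99_ms"] / m["ttft_mean_ms"]
print(f"TTFT p99/mean = {tail_ratio:.1f}x at {m['responses_per_sec']} req/s")
```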
Scheduling Delay¶
Scheduling delay isolates the WaitQ wait time from compute time. High scheduling delay + low preemptions = queue saturation (add instances). Low scheduling delay + high TTFT = compute saturation (reduce batch size or use chunked prefill).
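The two saturation signatures can be written as a small decision rule. The thresholds below are illustrative assumptions, not BLIS defaults — tune them per workload:

```python
def diagnose(sched_delay_p99_ms, ttft_p99_ms, preemption_rate,
             delay_high_ms=100.0, ttft_high_ms=500.0):
    # delay_high_ms / ttft_high_ms are hypothetical cutoffs.
    if sched_delay_p99_ms > delay_high_ms and preemption_rate < 0.05:
        # Requests wait in the WaitQ without being preempted.
        return "queue saturation: add instances"
    if sched_delay_p99_ms <= delay_high_ms and ttft_p99_ms > ttft_high_ms:
        # Requests enter batches quickly but take long to produce a token.
        return "compute saturation: reduce batch size or use chunked prefill"
    return "no clear saturation signal"

print(diagnose(250.0, 900.0, 0.01))
```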
Per-request units
All per-request latency fields (ttft_ms, e2e_ms, itl_ms, scheduling_delay_ms) are in milliseconds — converted from internal ticks by dividing by 1,000. Aggregate metrics (scheduling_delay_p99_ms, etc.) are also in milliseconds. See Known Unit Gotchas for the full unit reference. Note: hypothesis scripts written before BC-14 may divide scheduling_delay_ms by 1,000 unnecessarily — that field is now already in ms.
Anomaly Counters¶
When anomalies are detected, BLIS prints === Anomaly Counters ===:
| Counter | What It Means | Action |
|---|---|---|
| Priority Inversions | A lower-priority request was scheduled before a higher-priority one | Check scheduler choice — use `priority-fcfs` for SLO workloads |
| HOL Blocking Events | A long prefill blocked shorter requests | Enable chunked prefill: `--long-prefill-token-threshold 256` |
| Rejected Requests (Admission) | Admission policy rejected the request | Check token bucket capacity or admission policy |
| Rejected Requests (Routing) | No routable instances for the request's model — all instances are Loading or Draining | Increase `initial_nodes`, reduce `loading_delay.mean`, or stagger drain operations |
| Dropped Unservable | Request exceeds `--max-model-len` context window or needs more KV blocks than exist | Check `--max-model-len` setting; increase `--total-kv-blocks` or reduce max input tokens |
KV Cache Metrics¶
When KV cache activity is nonzero, BLIS prints === KV Cache Metrics ===:
| Metric | Meaning | Concern Threshold |
|---|---|---|
| Preemption Rate | Fraction of requests that were preempted (KV evicted) | > 5% indicates KV pressure |
| Cache Hit Rate | Fraction of blocks served from prefix cache | Higher is better — indicates prefix reuse |
| KV Thrashing Rate | Repeated preemption-reallocation cycles | > 0 indicates severe memory pressure |
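The concern thresholds above can be checked mechanically. A sketch — the argument names are assumptions about your parsed metrics, not a BLIS API:

```python
def kv_concerns(preemption_rate, kv_thrashing_rate):
    """Return a list of concern strings for the documented thresholds."""
    concerns = []
    if preemption_rate > 0.05:   # > 5% indicates KV pressure
        concerns.append("KV pressure: consider increasing --total-kv-blocks")
    if kv_thrashing_rate > 0:    # any thrashing indicates severe memory pressure
        concerns.append("KV thrashing: severe memory pressure")
    return concerns

print(kv_concerns(0.08, 0.0))
```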
Per-SLO-Class Metrics¶
When multiple SLO classes are present in the workload, BLIS prints per-class TTFT and E2E distributions. This lets you verify that critical requests meet SLOs even when batch traffic is heavy.
Per-Model Metrics¶
When instances serve different models (multi-model deployment), BLIS prints per-model TTFT mean/p99, E2E mean/p99, and throughput (req/s). This appears automatically when requests carry model tags. Output format:
```
=== Per-Model Metrics ===
qwen/qwen3-14b:
  TTFT: p50=1234.56 p99=5678.90 (n=250)
  E2E: p50=9876.54 p99=12345.67 (n=250)
  Throughput: 50.00 req/s, 6400.00 tok/s
```
When --metrics-path is set, the JSON output includes a per_model key (omitted when no requests carry model tags). Each entry has:
| Field | Type | Description |
|---|---|---|
| `model` | string | Model name |
| `ttft` | Distribution | TTFT distribution (p50, p99, mean, count) |
| `e2e` | Distribution | E2E latency distribution |
| `throughput_rps` | float64 | Requests per second for this model |
| `tokens_per_sec` | float64 | Output tokens per second for this model |
| `total_requests` | int | Number of completed requests for this model |
Per-Tenant Metrics¶
When requests carry tenant_id labels, BLIS prints per-tenant request counts, total output tokens served, and a Jain Fairness Index over the token distribution. This section appears automatically and is omitted when no requests carry tenant labels (backward-compatible with legacy and untenanted workloads). A workload with a single named tenant shows the section with Jain=1.0 (trivially fair).
```
=== Per-Tenant Metrics ===
alice: requests=50, tokens=12500
bob: requests=50, tokens=12480
Jain Fairness Index: 0.9999
```
Tenants are listed in lexicographic order. The Jain Fairness Index ranges from 1/N (maximally unfair — one tenant receives everything) to 1.0 (perfectly fair — all tenants receive equal tokens). A balanced two-tenant workload produces a value ≥ 0.99.
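The value printed is the standard Jain fairness index, J = (Σxᵢ)² / (N · Σxᵢ²), computed over per-tenant token counts. A minimal sketch:

```python
def jain_index(tokens):
    """Jain Fairness Index over per-tenant token counts.
    Ranges from 1/N (one tenant receives everything) to 1.0 (equal shares)."""
    n = len(tokens)
    s = sum(tokens)
    sq = sum(t * t for t in tokens)
    return (s * s) / (n * sq) if sq else 1.0

# The balanced two-tenant example from the output above.
print(round(jain_index([12500, 12480]), 4))
```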
To tag requests with tenant labels, set tenant_id in your workload spec cohort:
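As an illustration only — `tenant_id` is the documented field, but every other field name in this fragment is an assumption about the spec schema, so check your actual workload spec format:

```yaml
# Hypothetical workload-spec fragment; only tenant_id is documented above.
cohorts:
  - tenant_id: alice
    rate: 100        # assumed per-cohort field
  - tenant_id: bob
    rate: 100
```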
Fitness Evaluation¶
For automated multi-configuration comparison:
Valid metric keys: throughput, tokens_per_sec, p99_ttft, p50_ttft, mean_ttft, p99_e2e, p50_e2e, mean_e2e.
How Normalization Works¶
- **Latency metrics:** `1 / (1 + value/1000)` — lower latency → higher score. Reference: 1000 ticks = 1 ms
- **Throughput metrics:** `value / (value + reference)` — higher throughput → higher score. References: RPS=100, TPS=10,000
Normalization compresses large differences
The 1/(1+x/1000) function compresses large raw differences into small score differences. A 38% TTFT p99 improvement (64,000→39,000 ticks) maps to only a 2-8% fitness score difference. Always examine raw metrics alongside fitness scores for meaningful comparison.
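In code — the two normalization formulas and the reference values are documented above; combining them into a weighted sum is an assumption about how BLIS aggregates the final score:

```python
RPS_REF = 100.0      # documented throughput reference (req/s)
TPS_REF = 10_000.0   # documented tokens/s reference

def norm_latency(value):
    # Lower latency -> higher score; value in ticks (1000 ticks = 1 ms).
    return 1.0 / (1.0 + value / 1000.0)

def norm_throughput(value, reference):
    # Higher throughput -> higher score, saturating toward 1.0.
    return value / (value + reference)

# The compression effect: a large raw p99 gap becomes a small score gap.
better, worse = norm_latency(39_000), norm_latency(64_000)
print(f"scores: {better:.4f} vs {worse:.4f}")
```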
Common Patterns¶
Saturation Curves¶
As arrival rate increases past per-instance service capacity (μ ≈ 1/step_time), TTFT p99 grows super-linearly. The queue growth rate excess = λ/k - μ determines how quickly latency degrades.
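A quick way to compute the excess rate from the quantities above — units here are assumptions (requests per second for λ, seconds per step for `step_time`):

```python
def queue_excess(arrival_rate, num_instances, step_time):
    """excess = lambda/k - mu, with mu ~= 1/step_time per instance.
    Positive excess means the queue grows without bound."""
    mu = 1.0 / step_time
    return arrival_rate / num_instances - mu

# 20 req/s across 4 instances with 0.25 s steps:
# 5 req/s offered per instance vs 4 req/s served -> queue grows.
print(queue_excess(20, 4, 0.25))
```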
Tail Latency Spikes¶
P99 diverges from mean sharply near saturation. A workload at 90% capacity may show 2x mean TTFT but 10x P99 TTFT.
Snapshot Staleness Effects¶
With kv-utilization scorer alone at --snapshot-refresh-interval 100ms: +354% TTFT degradation. The default composite scorer mitigates ~99% of this effect.
Policy Equivalence at Low Load¶
All routing policies produce equivalent results (within 5%) at low utilization. Differentiation requires moderate-to-high load where queueing dynamics dominate.
Alpha Overhead¶
BLIS models non-GPU overhead (tokenization, API serialization) as alpha coefficients. Alpha queueing time (alpha0 + alpha1 × inputLen) delays request enqueue, creating an event gap, but does not occupy the GPU. Alpha output processing time (alpha2) adds to TTFT/E2E metrics but does not affect step scheduling. This means:
- Simulated E2E > theoretical M/M/k E2E (especially at high load)
- The divergence is 28-71% at ρ ≥ 0.5 but only 0.3-3.3% at ρ ≤ 0.3
- To compare with theoretical models, use `rho_eff = lambda × step_total`, not `lambda × E2E_total`
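The distinction as a sketch — argument names are assumed, and `step_total` here means the per-request GPU step time, which excludes alpha overhead:

```python
def rho_eff(arrival_rate, step_total):
    # Effective utilization for M/M/k comparison: GPU time only.
    # Using E2E time instead would count alpha overhead that never
    # occupies the GPU, overstating utilization.
    return arrival_rate * step_total

print(rho_eff(100, 0.005))
```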
Per-Request Results¶
For detailed analysis, save per-request data:
Each request record includes TTFT, E2E, scheduling delay, and completion status.
Further Reading¶
- Configuration Reference — fitness weight syntax
- Tutorial: Capacity Planning — applying results to capacity decisions