Metrics & Results¶
This guide covers how to read BLIS output — from the primary JSON metrics to anomaly counters, KV cache diagnostics, per-SLO breakdowns, fitness scores, and trace summaries.
```bash
# Quick example: run with all diagnostic output enabled
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 200 --num-requests 1000 \
  --trace-level decisions --summarize-trace \
  --fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"
```
Primary Metrics (JSON Output)¶
The JSON output on stdout contains:
| Field | Unit | Description |
|---|---|---|
| `ttft_mean_ms` | ms | Mean Time To First Token — average responsiveness |
| `ttft_p90_ms` | ms | 90th percentile TTFT |
| `ttft_p99_ms` | ms | 99th percentile TTFT — tail latency |
| `e2e_mean_ms` | ms | Mean End-to-End latency — average total request time |
| `e2e_p99_ms` | ms | 99th percentile E2E |
| `itl_mean_ms` | ms | Mean Inter-Token Latency — streaming smoothness |
| `scheduling_delay_p99_ms` | ms | 99th percentile scheduling delay — queue wait time |
| `responses_per_sec` | req/s | Throughput |
| `completed_requests` | count | Requests that finished before horizon |
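The fields above can be read straight from the JSON on stdout. A minimal sketch — the payload values here are illustrative, not real output:

```python
import json

# Illustrative payload using the field names from the table above.
raw = """{
  "ttft_mean_ms": 120.5, "ttft_p90_ms": 310.2, "ttft_p99_ms": 870.0,
  "e2e_mean_ms": 2400.0, "e2e_p99_ms": 6100.0, "itl_mean_ms": 35.1,
  "scheduling_delay_p99_ms": 95.0, "responses_per_sec": 48.7,
  "completed_requests": 1000
}"""
m = json.loads(raw)

# A p99 far above the mean is the classic tail-latency signal.
tail_ratio = m["ttft_p99_ms"] / m["ttft_mean_ms"]
print(f"TTFT p99/mean = {tail_ratio:.1f}x at {m['responses_per_sec']} req/s")
```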
Scheduling Delay¶
Scheduling delay isolates the WaitQ wait time from compute time. High scheduling delay + low preemptions = queue saturation (add instances). Low scheduling delay + high TTFT = compute saturation (reduce batch size or use chunked prefill).
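The two saturation signatures can be written as a small decision rule. The thresholds below are illustrative assumptions, not BLIS defaults — tune them per workload:

```python
def diagnose(sched_delay_p99_ms, ttft_p99_ms, preemption_rate,
             delay_high_ms=100.0, ttft_high_ms=500.0):
    # delay_high_ms / ttft_high_ms are hypothetical cutoffs.
    if sched_delay_p99_ms > delay_high_ms and preemption_rate < 0.05:
        # Requests wait in the WaitQ without being preempted.
        return "queue saturation: add instances"
    if sched_delay_p99_ms <= delay_high_ms and ttft_p99_ms > ttft_high_ms:
        # Requests enter batches quickly but take long to produce a token.
        return "compute saturation: reduce batch size or use chunked prefill"
    return "no clear saturation signal"

print(diagnose(250.0, 900.0, 0.01))
```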
Per-request units
All per-request latency fields (ttft_ms, e2e_ms, itl_ms, scheduling_delay_ms) are in milliseconds — converted from internal ticks by dividing by 1,000. Aggregate metrics (scheduling_delay_p99_ms, etc.) are also in milliseconds. See Known Unit Gotchas for the full unit reference. Note: hypothesis scripts written before BC-14 may divide scheduling_delay_ms by 1,000 unnecessarily — that field is now already in ms.
Anomaly Counters¶
When anomalies are detected, BLIS prints === Anomaly Counters ===:
| Counter | What It Means | Action |
|---|---|---|
| Priority Inversions | A lower-priority request was scheduled before a higher-priority one | Check scheduler choice — use `priority-fcfs` for SLO workloads |
| HOL Blocking Events | A long prefill blocked shorter requests | Enable chunked prefill: `--long-prefill-token-threshold 256` |
| Rejected Requests (Admission) | Admission policy rejected the request | Check token bucket capacity or admission policy |
| Rejected Requests (Routing) | No routable instances for the request's model — all instances are Loading or Draining | Increase `initial_nodes`, reduce `loading_delay.mean`, or stagger drain operations |
| Dropped Unservable | Request exceeds `--max-model-len` context window or needs more KV blocks than exist | Check `--max-model-len` setting; increase `--total-kv-blocks` or reduce max input tokens |
KV Cache Metrics¶
When KV cache activity is nonzero, BLIS prints === KV Cache Metrics ===:
| Metric | Meaning | Concern Threshold |
|---|---|---|
| Preemption Rate | Fraction of requests that were preempted (KV evicted) | > 5% indicates KV pressure |
| Cache Hit Rate | Fraction of blocks served from prefix cache | Higher is better — indicates prefix reuse |
| KV Thrashing Rate | Repeated preemption-reallocation cycles | > 0 indicates severe memory pressure |
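The concern thresholds above can be checked mechanically. A sketch — the argument names are assumptions about your parsed metrics, not a BLIS API:

```python
def kv_concerns(preemption_rate, kv_thrashing_rate):
    """Return a list of concern strings for the documented thresholds."""
    concerns = []
    if preemption_rate > 0.05:   # > 5% indicates KV pressure
        concerns.append("KV pressure: consider increasing --total-kv-blocks")
    if kv_thrashing_rate > 0:    # any thrashing indicates severe memory pressure
        concerns.append("KV thrashing: severe memory pressure")
    return concerns

print(kv_concerns(0.08, 0.0))
```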
Per-SLO-Class Metrics¶
When multiple SLO classes are present in the workload, BLIS prints per-class TTFT and E2E distributions. This lets you verify that critical requests meet SLOs even when batch traffic is heavy.
Per-Model Metrics¶
When instances serve different models (multi-model deployment), BLIS prints per-model TTFT mean/p99, E2E mean/p99, and throughput (req/s). This appears automatically when requests carry model tags. Output format:
```
=== Per-Model Metrics ===
qwen/qwen3-14b:
  TTFT: p50=1234.56 p99=5678.90 (n=250)
  E2E: p50=9876.54 p99=12345.67 (n=250)
  Throughput: 50.00 req/s, 6400.00 tok/s
```
When --metrics-path is set, the JSON output includes a per_model key (omitted when no requests carry model tags). Each entry has:
| Field | Type | Description |
|---|---|---|
| `model` | string | Model name |
| `ttft` | Distribution | TTFT distribution (p50, p99, mean, count) |
| `e2e` | Distribution | E2E latency distribution |
| `throughput_rps` | float64 | Requests per second for this model |
| `tokens_per_sec` | float64 | Output tokens per second for this model |
| `total_requests` | int | Number of completed requests for this model |
Per-Tenant Metrics¶
When requests carry tenant_id labels, BLIS prints per-tenant request counts, total output tokens served, and a Jain Fairness Index over the token distribution. This section appears automatically and is omitted when no requests carry tenant labels (backward-compatible with legacy and untenanted workloads). A workload with a single named tenant shows the section with Jain=1.0 (trivially fair).
```
=== Per-Tenant Metrics ===
alice: requests=50, tokens=12500
bob: requests=50, tokens=12480
Jain Fairness Index: 0.9999
```
Tenants are listed in lexicographic order. The Jain Fairness Index ranges from 1/N (maximally unfair — one tenant receives everything) to 1.0 (perfectly fair — all tenants receive equal tokens). A balanced two-tenant workload produces a value ≥ 0.99.
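The value printed is the standard Jain fairness index, J = (Σxᵢ)² / (N · Σxᵢ²), computed over per-tenant token counts. A minimal sketch:

```python
def jain_index(tokens):
    """Jain Fairness Index over per-tenant token counts.
    Ranges from 1/N (one tenant receives everything) to 1.0 (equal shares)."""
    n = len(tokens)
    s = sum(tokens)
    sq = sum(t * t for t in tokens)
    return (s * s) / (n * sq) if sq else 1.0

# The balanced two-tenant example from the output above.
print(round(jain_index([12500, 12480]), 4))
```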
To tag requests with tenant labels, set tenant_id in your workload spec cohort:
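As an illustration only — `tenant_id` is the documented field, but every other field name in this fragment is an assumption about the spec schema, so check your actual workload spec format:

```yaml
# Hypothetical workload-spec fragment; only tenant_id is documented above.
cohorts:
  - tenant_id: alice
    rate: 100        # assumed per-cohort field
  - tenant_id: bob
    rate: 100
```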
Fitness Evaluation¶
For automated multi-configuration comparison:
Valid metric keys: throughput, tokens_per_sec, p99_ttft, p50_ttft, mean_ttft, p99_e2e, p50_e2e, mean_e2e.
How Normalization Works¶
- **Latency metrics:** `1 / (1 + value/1000)` — lower latency → higher score. Reference: 1000 ticks = 1 ms
- **Throughput metrics:** `value / (value + reference)` — higher throughput → higher score. References: RPS=100, TPS=10,000
Normalization compresses large differences
The 1/(1+x/1000) function compresses large raw differences into small score differences. A 38% TTFT p99 improvement (64,000→39,000 ticks) maps to only a 2-8% fitness score difference. Always examine raw metrics alongside fitness scores for meaningful comparison.
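In code — the two normalization formulas and the reference values are documented above; combining them into a weighted sum is an assumption about how BLIS aggregates the final score:

```python
RPS_REF = 100.0      # documented throughput reference (req/s)
TPS_REF = 10_000.0   # documented tokens/s reference

def norm_latency(value):
    # Lower latency -> higher score; value in ticks (1000 ticks = 1 ms).
    return 1.0 / (1.0 + value / 1000.0)

def norm_throughput(value, reference):
    # Higher throughput -> higher score, saturating toward 1.0.
    return value / (value + reference)

# The compression effect: a large raw p99 gap becomes a small score gap.
better, worse = norm_latency(39_000), norm_latency(64_000)
print(f"scores: {better:.4f} vs {worse:.4f}")
```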
Common Patterns¶
Saturation Curves¶
As arrival rate increases past per-instance service capacity (μ ≈ 1/step_time), TTFT p99 grows super-linearly. The queue growth rate excess = λ/k - μ determines how quickly latency degrades.
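A quick way to compute the excess rate from the quantities above — units here are assumptions (requests per second for λ, seconds per step for `step_time`):

```python
def queue_excess(arrival_rate, num_instances, step_time):
    """excess = lambda/k - mu, with mu ~= 1/step_time per instance.
    Positive excess means the queue grows without bound."""
    mu = 1.0 / step_time
    return arrival_rate / num_instances - mu

# 20 req/s across 4 instances with 0.25 s steps:
# 5 req/s offered per instance vs 4 req/s served -> queue grows.
print(queue_excess(20, 4, 0.25))
```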
Tail Latency Spikes¶
P99 diverges from mean sharply near saturation. A workload at 90% capacity may show 2x mean TTFT but 10x P99 TTFT.
Snapshot Staleness Effects¶
With kv-utilization scorer alone at --snapshot-refresh-interval 100ms: +354% TTFT degradation. The default composite scorer mitigates ~99% of this effect.
Policy Equivalence at Low Load¶
All routing policies produce equivalent results (within 5%) at low utilization. Differentiation requires moderate-to-high load where queueing dynamics dominate.
Alpha Overhead¶
BLIS models non-GPU overhead (tokenization, API serialization) as alpha coefficients. Alpha queueing time (alpha0 + alpha1 × inputLen) delays request enqueue, creating an event gap, but does not occupy the GPU. Alpha output processing time (alpha2) adds to TTFT/E2E metrics but does not affect step scheduling. This means:
- Simulated E2E > theoretical M/M/k E2E (especially at high load)
- The divergence is 28-71% at ρ ≥ 0.5 but only 0.3-3.3% at ρ ≤ 0.3
- To compare with theoretical models, use `rho_eff = lambda × step_total`, not `lambda × E2E_total`
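The distinction as a sketch — argument names are assumed, and `step_total` here means the per-request GPU step time, which excludes alpha overhead:

```python
def rho_eff(arrival_rate, step_total):
    # Effective utilization for M/M/k comparison: GPU time only.
    # Using E2E time instead would count alpha overhead that never
    # occupies the GPU, overstating utilization.
    return arrival_rate * step_total

print(rho_eff(100, 0.005))
```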
Per-Request Results¶
For detailed analysis, save per-request data:
Each request record includes TTFT, E2E, scheduling delay, and completion status.
Further Reading¶
- Configuration Reference — fitness weight syntax
- Tutorial: Capacity Planning — applying results to capacity decisions