
Metrics & Results

This guide covers how to read BLIS output — from the primary JSON metrics to anomaly counters, KV cache diagnostics, per-SLO breakdowns, fitness scores, and trace summaries.

# Quick example: run with all diagnostic output enabled
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 200 --num-requests 1000 \
  --trace-level decisions --summarize-trace \
  --fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"

Primary Metrics (JSON Output)

The JSON output on stdout contains:

| Field | Unit | Description |
|---|---|---|
| ttft_mean_ms | ms | Mean Time To First Token — average responsiveness |
| ttft_p90_ms | ms | 90th percentile TTFT |
| ttft_p99_ms | ms | 99th percentile TTFT — tail latency |
| e2e_mean_ms | ms | Mean End-to-End latency — average total request time |
| e2e_p99_ms | ms | 99th percentile E2E |
| itl_mean_ms | ms | Mean Inter-Token Latency — streaming smoothness |
| scheduling_delay_p99_ms | ms | 99th percentile scheduling delay — queue wait time |
| responses_per_sec | req/s | Throughput |
| completed_requests | count | Requests that finished before the horizon |
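As a minimal sketch, the summary can be consumed programmatically — here with made-up values pasted inline rather than a real run's stdout; the field names follow the table above:

```python
import json

# Hypothetical sketch: parse the BLIS JSON summary captured from stdout,
# e.g. `./blis run ... > summary.json`. Values below are illustrative only.
raw = ('{"ttft_mean_ms": 42.5, "ttft_p99_ms": 180.0,'
       ' "responses_per_sec": 190.4, "completed_requests": 1000}')
summary = json.loads(raw)

# Ratio of tail to mean TTFT; a large ratio suggests queueing near saturation.
tail_ratio = summary["ttft_p99_ms"] / summary["ttft_mean_ms"]
print(f"p99/mean TTFT ratio: {tail_ratio:.2f}")
```

The same pattern applies to any of the fields in the table; only the keys shown are assumed here.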

Scheduling Delay

Scheduling delay isolates the WaitQ wait time from compute time. High scheduling delay + low preemptions = queue saturation (add instances). Low scheduling delay + high TTFT = compute saturation (reduce batch size or use chunked prefill).

Per-request units

All per-request latency fields (ttft_ms, e2e_ms, itl_ms, scheduling_delay_ms) are in milliseconds — converted from internal ticks by dividing by 1,000. Aggregate metrics (scheduling_delay_p99_ms, etc.) are also in milliseconds. See Known Unit Gotchas for the full unit reference. Note: hypothesis scripts written before BC-14 may divide scheduling_delay_ms by 1,000 unnecessarily — that field is now already in ms.
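The tick-to-millisecond relationship stated above can be made explicit in analysis code. This is a sketch of the conversion, not a BLIS API; the constant simply encodes "1,000 ticks = 1 ms":

```python
TICKS_PER_MS = 1000  # internal tick resolution stated above

def ticks_to_ms(ticks: int) -> float:
    """Convert internal simulator ticks to milliseconds."""
    return ticks / TICKS_PER_MS

# A pre-BC-14 script that divided scheduling_delay_ms by 1,000 *again*
# would report values a thousand times too small.
assert ticks_to_ms(42_000) == 42.0
```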

Anomaly Counters

When anomalies are detected, BLIS prints === Anomaly Counters ===:

| Counter | What It Means | Action |
|---|---|---|
| Priority Inversions | A lower-priority request was scheduled before a higher-priority one | Check scheduler choice — use priority-fcfs for SLO workloads |
| HOL Blocking Events | A long prefill blocked shorter requests | Enable chunked prefill: --long-prefill-token-threshold 256 |
| Rejected Requests (Admission) | Admission policy rejected the request | Check token bucket capacity or admission policy |
| Rejected Requests (Routing) | No routable instances for the request's model — all instances are Loading or Draining | Increase initial_nodes, reduce loading_delay.mean, or stagger drain operations |
| Dropped Unservable | Request exceeds --max-model-len context window or needs more KV blocks than exist | Check --max-model-len setting; increase --total-kv-blocks or reduce max input tokens |
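For scripted sweeps, the counter-to-action mapping above can be encoded directly. This is a hypothetical triage helper — the counter key names are assumptions based on the printed section headers, not documented BLIS output keys:

```python
# Hypothetical triage table mirroring the anomaly counters above.
ACTIONS = {
    "priority_inversions": "use priority-fcfs for SLO workloads",
    "hol_blocking_events": "enable chunked prefill (--long-prefill-token-threshold 256)",
    "rejected_admission": "check token bucket capacity or admission policy",
    "rejected_routing": "increase initial_nodes or reduce loading_delay.mean",
    "dropped_unservable": "check --max-model-len / --total-kv-blocks",
}

def triage(counters: dict) -> list:
    """Return suggested actions for any nonzero anomaly counters."""
    return [ACTIONS[name] for name, count in counters.items()
            if count > 0 and name in ACTIONS]

print(triage({"priority_inversions": 3, "hol_blocking_events": 0}))
```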

KV Cache Metrics

When KV cache activity is nonzero, BLIS prints === KV Cache Metrics ===:

| Metric | Meaning | Concern Threshold |
|---|---|---|
| Preemption Rate | Fraction of requests that were preempted (KV evicted) | > 5% indicates KV pressure |
| Cache Hit Rate | Fraction of blocks served from prefix cache | Higher is better — indicates prefix reuse |
| KV Thrashing Rate | Repeated preemption-reallocation cycles | > 0 indicates severe memory pressure |
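The concern thresholds above are easy to apply mechanically. A sketch, assuming rates are expressed as fractions in [0, 1] (the argument names here are illustrative, not BLIS output keys):

```python
# Apply the KV cache concern thresholds from the table above.
def kv_concerns(preemption_rate: float, thrashing_rate: float) -> list:
    concerns = []
    if preemption_rate > 0.05:   # > 5% indicates KV pressure
        concerns.append("KV pressure: consider more --total-kv-blocks")
    if thrashing_rate > 0.0:     # any thrashing indicates severe memory pressure
        concerns.append("severe memory pressure: thrashing detected")
    return concerns

print(kv_concerns(0.08, 0.0))
```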

Per-SLO-Class Metrics

When multiple SLO classes are present in the workload, BLIS prints per-class TTFT and E2E distributions. This lets you verify that critical requests meet SLOs even when batch traffic is heavy.

Per-Model Metrics

When instances serve different models (multi-model deployment), BLIS prints per-model TTFT mean/p99, E2E mean/p99, and throughput (req/s). This appears automatically when requests carry model tags. Output format:

=== Per-Model Metrics ===
  qwen/qwen3-14b:
    TTFT: p50=1234.56 p99=5678.90 (n=250)
    E2E:  p50=9876.54 p99=12345.67 (n=250)
    Throughput: 50.00 req/s, 6400.00 tok/s

When --metrics-path is set, the JSON output includes a per_model key (omitted when no requests carry model tags). Each entry has:

| Field | Type | Description |
|---|---|---|
| model | string | Model name |
| ttft | Distribution | TTFT distribution (p50, p99, mean, count) |
| e2e | Distribution | E2E latency distribution |
| throughput_rps | float64 | Requests per second for this model |
| tokens_per_sec | float64 | Output tokens per second for this model |
| total_requests | int | Number of completed requests for this model |
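A sketch of reading the per_model key from a --metrics-path file; the entry layout follows the field table above, and all numeric values here are made up:

```python
import json

# Illustrative per_model payload; in practice, load the --metrics-path file.
raw = """
{"per_model": [
  {"model": "qwen/qwen3-14b",
   "ttft": {"p50": 1234.56, "p99": 5678.90, "mean": 1500.0, "count": 250},
   "e2e":  {"p50": 9876.54, "p99": 12345.67, "mean": 10000.0, "count": 250},
   "throughput_rps": 50.0, "tokens_per_sec": 6400.0, "total_requests": 250}
]}
"""
metrics = json.loads(raw)
for entry in metrics.get("per_model", []):  # key is omitted without model tags
    print(f"{entry['model']}: p99 TTFT {entry['ttft']['p99']:.2f} ms, "
          f"{entry['throughput_rps']:.1f} req/s")
```

Using .get("per_model", []) keeps the script working on untagged workloads, where the key is absent.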

Per-Tenant Metrics

When requests carry tenant_id labels, BLIS prints per-tenant request counts, total output tokens served, and a Jain Fairness Index over the token distribution. This section appears automatically and is omitted when no requests carry tenant labels (backward-compatible with legacy and untenanted workloads). A workload with a single named tenant shows the section with Jain=1.0 (trivially fair).

=== Per-Tenant Metrics ===
  alice: requests=50, tokens=12500
  bob: requests=50, tokens=12480
  Jain Fairness Index: 0.9999

Tenants are listed in lexicographic order. The Jain Fairness Index ranges from 1/N (maximally unfair — one tenant receives everything) to 1.0 (perfectly fair — all tenants receive equal tokens). A balanced two-tenant workload produces a value ≥ 0.99.
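The reported index is the standard Jain fairness formula, J = (Σxᵢ)² / (N · Σxᵢ²), over per-tenant token totals. A minimal sketch for reproducing it from the printed token counts:

```python
# Jain fairness index over per-tenant token totals:
# J = (sum x_i)^2 / (N * sum x_i^2); 1.0 is perfectly fair, 1/N is worst.
def jain_index(tokens: list) -> float:
    n = len(tokens)
    total = sum(tokens)
    return total * total / (n * sum(x * x for x in tokens))

# Two nearly balanced tenants (as in the sample output above) score ~1.0.
print(round(jain_index([12500, 12480]), 4))
```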

To tag requests with tenant labels, set tenant_id in your workload spec cohort:

cohorts:
  - name: alice-traffic
    tenant_id: alice
    ...
  - name: bob-traffic
    tenant_id: bob
    ...

Fitness Evaluation

For automated multi-configuration comparison:

--fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"

Valid metric keys: throughput, tokens_per_sec, p99_ttft, p50_ttft, mean_ttft, p99_e2e, p50_e2e, mean_e2e.

How Normalization Works

  • Latency metrics: 1 / (1 + value/1000) — lower latency → higher score. Reference: 1000 ticks = 1ms
  • Throughput metrics: value / (value + reference) — higher throughput → higher score. References: RPS=100, TPS=10,000

Normalization compresses large differences

The 1/(1+x/1000) function compresses large raw differences into small score differences. A 38% TTFT p99 improvement (64,000→39,000 ticks) maps to only a 2-8% fitness score difference. Always examine raw metrics alongside fitness scores for meaningful comparison.
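A sketch of the two normalization formulas, assuming latency values in milliseconds scored against a reference of 1000 (the reference values are taken from the list above; the function names are illustrative):

```python
# Latency: lower value -> higher score; throughput: higher value -> higher score.
def norm_latency(value_ms: float, ref: float = 1000.0) -> float:
    return 1.0 / (1.0 + value_ms / ref)

def norm_throughput(value: float, reference: float) -> float:
    return value / (value + reference)

# Compression in action: a large raw TTFT gap becomes a small score gap.
fast, slow = norm_latency(39.0), norm_latency(64.0)
print(f"scores {fast:.4f} vs {slow:.4f} "
      f"({100 * (fast - slow) / slow:.1f}% apart)")
```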

Common Patterns

Saturation Curves

As arrival rate increases past per-instance service capacity (μ ≈ 1/step_time), TTFT p99 grows super-linearly. The queue growth rate excess = λ/k - μ determines how quickly latency degrades.
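The saturation condition can be checked numerically. A sketch, using the symbols above (λ = arrival rate, k = instances, μ ≈ 1/step_time); the step time here is an assumed illustrative value:

```python
# excess = lambda/k - mu; positive excess means the queue grows without bound.
def queue_excess(rate_rps: float, instances: int, step_time_s: float) -> float:
    mu = 1.0 / step_time_s             # per-instance service rate (req/s)
    return rate_rps / instances - mu   # offered load minus capacity

# 200 req/s over 4 instances with an assumed 25 ms step:
# offered 50 req/s per instance vs mu = 40 req/s -> saturated.
print(queue_excess(200.0, 4, 0.025))
```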

Tail Latency Spikes

P99 diverges from mean sharply near saturation. A workload at 90% capacity may show 2x mean TTFT but 10x P99 TTFT.

Snapshot Staleness Effects

With kv-utilization scorer alone at --snapshot-refresh-interval 100ms: +354% TTFT degradation. The default composite scorer mitigates ~99% of this effect.

Policy Equivalence at Low Load

All routing policies produce equivalent results (within 5%) at low utilization. Differentiation requires moderate-to-high load where queueing dynamics dominate.

Alpha Overhead

BLIS models non-GPU overhead (tokenization, API serialization) as alpha coefficients. Alpha queueing time (alpha0 + alpha1 × inputLen) delays request enqueue, creating an event gap, but does not occupy the GPU. Alpha output processing time (alpha2) adds to TTFT/E2E metrics but does not affect step scheduling. This means:

  • Simulated E2E > theoretical M/M/k E2E (especially at high load)
  • The divergence is 28-71% at ρ ≥ 0.5 but only 0.3-3.3% at ρ ≤ 0.3
  • To compare with theoretical models, use rho_eff = lambda × step_total not lambda × E2E_total
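The last point can be sketched as a two-line comparison — utilization from GPU step time versus the naive version from E2E time. The numeric values here are assumptions for illustration, not BLIS defaults:

```python
# Compute utilization from GPU step time, not E2E time: E2E includes alpha
# overhead (tokenization, serialization) that never occupies the GPU.
def rho(arrival_rate_rps: float, service_time_s: float) -> float:
    return arrival_rate_rps * service_time_s

step_total = 0.004   # assumed per-request GPU step time (s)
e2e_total = 0.006    # assumed E2E time incl. alpha overhead (s)
lam = 100.0          # arrival rate (req/s)
print(f"rho_eff={rho(lam, step_total):.2f}  naive rho={rho(lam, e2e_total):.2f}")
```

The naive figure overstates GPU utilization because the alpha components queue off-GPU.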

Per-Request Results

For detailed analysis, save per-request data:

./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --metrics-path metrics.json

Each request record includes TTFT, E2E, scheduling delay, and completion status.

Further Reading