Metrics & Results

This guide covers how to read BLIS output — from the primary JSON metrics to anomaly counters, KV cache diagnostics, per-SLO breakdowns, fitness scores, and trace summaries.

# Quick example: run with all diagnostic output enabled
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 200 --num-requests 1000 \
  --trace-level decisions --summarize-trace \
  --fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"

Primary Metrics (JSON Output)

The JSON output on stdout (and in the --metrics-path file when set) contains:

| Field | Unit | Description |
|---|---|---|
| instance_id | string | Instance identifier ("cluster" for aggregate output) |
| completed_requests | count | Requests that finished before horizon |
| still_queued | count | Requests in the wait queue at horizon end |
| still_running | count | Requests in the running batch at horizon end |
| injected_requests | count | Total requests that entered the simulator |
| total_input_tokens | count | Sum of input tokens across all completed requests |
| total_output_tokens | count | Sum of output tokens across all completed requests |
| vllm_estimated_duration_s | s | Estimated vLLM wall-clock duration for the workload |
| responses_per_sec | req/s | Throughput (completed requests / effective duration) |
| tokens_per_sec | tok/s | Output token throughput |
| e2e_mean_ms | ms | Mean End-to-End latency |
| e2e_p90_ms | ms | 90th percentile E2E |
| e2e_p95_ms | ms | 95th percentile E2E |
| e2e_p99_ms | ms | 99th percentile E2E (tail latency) |
| ttft_mean_ms | ms | Mean Time To First Token (average responsiveness) |
| ttft_p90_ms | ms | 90th percentile TTFT |
| ttft_p95_ms | ms | 95th percentile TTFT |
| ttft_p99_ms | ms | 99th percentile TTFT (tail latency) |
| itl_mean_ms | ms | Mean Inter-Token Latency (streaming smoothness) |
| itl_p90_ms | ms | 90th percentile ITL |
| itl_p95_ms | ms | 95th percentile ITL |
| itl_p99_ms | ms | 99th percentile ITL |
| scheduling_delay_p99_ms | ms | 99th percentile scheduling delay (queue wait time) |
| kv_allocation_failures | count | KV cache allocation failures (omitted when zero) |
| preemption_count | count | Total preemption events (integer; see also KV Cache Metrics for the rate) |
| dropped_unservable | count | Requests dropped at enqueue due to context limit violations |
| length_capped_requests | count | Requests force-completed at MaxModelLen |
| timed_out_requests | count | Requests that exceeded their deadline |
| saturation | object | Post-hoc saturation detection result (omitted when --post-hoc-detector none; see Saturation Detection) |
| requests | array | Per-request detail records (omitted when empty; see Per-Request Fields) |

Scheduling Delay

Scheduling delay isolates the WaitQ wait time from compute time. High scheduling delay + low preemptions = queue saturation (add instances). Low scheduling delay + high TTFT = compute saturation (reduce batch size or use chunked prefill).
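
The two rules above can be sketched as a small check over the aggregate metrics. This is only an illustration: the function name diagnose and the latency thresholds are hypothetical and should be tuned per workload. The 5% preemption-rate cutoff borrows the concern threshold from KV Cache Metrics.

```python
def diagnose(
    metrics: dict,
    delay_threshold_ms: float = 1000.0,      # hypothetical cutoff for "high" scheduling delay
    ttft_threshold_ms: float = 2000.0,       # hypothetical cutoff for "high" TTFT
    preemption_rate_threshold: float = 0.05, # 5% concern threshold from KV Cache Metrics
) -> str:
    """Apply the queue-vs-compute saturation rules to aggregate metrics."""
    completed = max(metrics.get("completed_requests", 0), 1)
    preemption_rate = metrics.get("preemption_count", 0) / completed
    high_delay = metrics.get("scheduling_delay_p99_ms", 0.0) >= delay_threshold_ms
    high_ttft = metrics.get("ttft_p99_ms", 0.0) >= ttft_threshold_ms

    if high_delay and preemption_rate <= preemption_rate_threshold:
        return "queue saturation: consider adding instances"
    if not high_delay and high_ttft:
        return "compute saturation: consider a smaller batch size or chunked prefill"
    return "no clear signal from these two rules"


queue_bound = {"completed_requests": 1000, "preemption_count": 3,
               "scheduling_delay_p99_ms": 4200.0, "ttft_p99_ms": 4800.0}
compute_bound = {"completed_requests": 1000, "preemption_count": 0,
                 "scheduling_delay_p99_ms": 15.0, "ttft_p99_ms": 3500.0}
print(diagnose(queue_bound))
print(diagnose(compute_bound))
```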

Per-request units

All per-request latency fields (ttft_ms, e2e_ms, itl_ms, scheduling_delay_ms) are in milliseconds — converted from internal ticks by dividing by 1,000. Aggregate metrics (scheduling_delay_p99_ms, etc.) are also in milliseconds. See Known Unit Gotchas for the full unit reference. Note: hypothesis scripts written before BC-14 may divide scheduling_delay_ms by 1,000 unnecessarily — that field is now already in ms.
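
As a quick reference for the conversions mentioned above, here is a small sketch. It relies only on the two ratios stated in this guide (1,000 ticks per millisecond, and 1,000,000 ticks per second as used for arrived_at); the helper names are illustrative.

```python
TICKS_PER_MS = 1_000      # stated above: ticks / 1,000 = milliseconds
TICKS_PER_S = 1_000_000   # stated for arrived_at: ArrivalTime / 1,000,000 = seconds


def ticks_to_ms(ticks: float) -> float:
    """Convert internal simulator ticks to milliseconds."""
    return ticks / TICKS_PER_MS


def ticks_to_s(ticks: float) -> float:
    """Convert internal simulator ticks to seconds."""
    return ticks / TICKS_PER_S


print(ticks_to_ms(39_000))    # 39.0
print(ticks_to_s(2_500_000))  # 2.5
```

Fields whose names end in _ms are already converted, so applying ticks_to_ms to them again would understate latency by a factor of 1,000.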

Saturation Detection

The saturation field provides automated classification of simulation runs using post-hoc analysis (#1369). This is enabled via the --post-hoc-detector flag and is distinct from the real-time flow control saturation detector.

Structure:

{
  "saturation": {
    "level": "STABLE",
    "score": 0.35,
    "confidence": 0.95,
    "signals": {
      "rate_deficit": 0.0,
      "latency_trend": 0.12
    }
  }
}

| Field | Type | Description |
|---|---|---|
| level | string | Classification: STABLE, BACKLOGGED, or OVERLOADED |
| score | float | Composite score in [0, 1] range. STABLE: < 0.5, BACKLOGGED: [0.5, 0.75), OVERLOADED: ≥ 0.75 |
| confidence | float | Statistical confidence based on sample size (min(1.0, sqrt(N))) |
| signals | object | Detector-specific metrics (varies by detector type) |
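
The score bands in the table map to levels as in the sketch below. The function name level_for_score is illustrative; BLIS already reports level directly, so this is mainly useful for re-bucketing scores in your own analysis.

```python
def level_for_score(score: float) -> str:
    """Map a saturation score in [0, 1] to the documented level bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < 0.5:
        return "STABLE"
    if score < 0.75:
        return "BACKLOGGED"
    return "OVERLOADED"


for s in (0.35, 0.5, 0.75):
    print(s, level_for_score(s))
```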

Detectors:

  • composite (selected with --post-hoc-detector composite): combines two signals:
      • rate_deficit: max(0, 1 - completions/arrivals), which measures throughput saturation
      • latency_trend: (second_half_mean - first_half_mean) / first_half_mean, which detects queue buildup

  • threshold (selected with --post-hoc-detector threshold): a simple comparison of mean E2E against --saturation-threshold-ms (default 5000ms):
      • mean_e2e: mean end-to-end latency across all completed requests
      • threshold: the configured threshold value
      • STABLE when mean_e2e < threshold, OVERLOADED when mean_e2e > threshold

  • none (default): no saturation analysis is performed and the saturation field is omitted from the output
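
To make the two composite signals concrete, the sketch below evaluates the formulas listed above. Several details are assumptions rather than documented behavior: that the latencies are ordered in time, that the halves are split at the midpoint of the list, and how empty inputs are handled. How BLIS combines the signals into the final score is not covered here.

```python
from statistics import mean


def rate_deficit(completions: int, arrivals: int) -> float:
    """max(0, 1 - completions/arrivals); returns 0.0 when there were no arrivals."""
    if arrivals <= 0:
        return 0.0
    return max(0.0, 1.0 - completions / arrivals)


def latency_trend(latencies_ms: list[float]) -> float:
    """(second_half_mean - first_half_mean) / first_half_mean over an ordered series."""
    half = len(latencies_ms) // 2
    if half == 0:
        return 0.0  # assumption: too few samples means no trend
    first_half_mean = mean(latencies_ms[:half])
    second_half_mean = mean(latencies_ms[half:])
    if first_half_mean <= 0:
        return 0.0  # assumption: avoid dividing by zero
    return (second_half_mean - first_half_mean) / first_half_mean


print(rate_deficit(completions=900, arrivals=1000))
print(latency_trend([100.0, 100.0, 150.0, 150.0]))
```

A positive latency_trend means later requests were slower than earlier ones, which is the queue-buildup pattern this signal is meant to surface.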

Usage:

# Run with composite detector
./blis run --model qwen/qwen3-14b --post-hoc-detector composite

# Run with threshold detector (custom threshold)
./blis run --model qwen/qwen3-14b --post-hoc-detector threshold --saturation-threshold-ms 3000

Use cases:

  • Automated capacity planning: classify runs as under-provisioned (OVERLOADED), near-capacity (BACKLOGGED), or healthy (STABLE)
  • Experiment analysis: quickly identify which configurations saturate the system
  • CI/CD gates: fail builds if the saturation score exceeds a threshold
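
A CI/CD gate can be as small as the sketch below, which turns the saturation field into an exit code. The function name gate, the exit-code convention, and the 0.5 default cutoff (the STABLE boundary) are choices made for this example, not BLIS behavior.

```python
def gate(metrics: dict, max_score: float = 0.5) -> int:
    """Return a process exit code: 0 = pass, 1 = saturated, 2 = no saturation data."""
    saturation = metrics.get("saturation")
    if saturation is None:
        # The field is omitted when the post-hoc detector is none.
        return 2
    return 0 if saturation["score"] < max_score else 1


# Illustrative inputs; in CI you would load the --metrics-path file with json.load
# and pass the result of gate(...) to sys.exit.
healthy = {"saturation": {"level": "STABLE", "score": 0.35}}
overloaded = {"saturation": {"level": "OVERLOADED", "score": 0.8}}
print(gate(healthy), gate(overloaded), gate({}))
```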

Per-Request Fields

When the requests array is non-empty, each entry contains:

| Field | Unit | Description |
|---|---|---|
| requestID | string | Unique request identifier |
| arrived_at | s | Arrival timestamp in simulation seconds (ArrivalTime / 1,000,000) |
| num_prefill_tokens | count | Input (prompt) token count |
| num_decode_tokens | count | Output token count |
| ttft_ms | ms | Time To First Token for this request |
| itl_ms | ms | Mean Inter-Token Latency for this request |
| e2e_ms | ms | End-to-End latency for this request |
| scheduling_delay_ms | ms | Time spent in the wait queue before first scheduling |
| slo_class | string | SLO class (critical, standard, batch, etc.); omitted if empty |
| tenant_id | string | Tenant label; omitted if empty |
| handled_by | string | Instance ID that processed the request; omitted if empty |
| model | string | Model tag; omitted if empty |
| length_capped | bool | true if request was force-completed at MaxModelLen; omitted when false |
| gateway_queue_delay_ms | ms | Time spent in the gateway queue; omitted when flow control is disabled |
| session_id | string | Multi-turn session link; omitted for single-turn requests |
| round_index | int | Round within session (0 = first turn); always present, defaults to 0 for non-session requests |

Anomaly Counters

When anomalies are detected, BLIS prints === Anomaly Counters ===:

| Counter | What It Means | Action |
|---|---|---|
| Priority Inversions | A lower-priority request was scheduled before a higher-priority one | Check scheduler choice; use priority-fcfs for SLO workloads |
| HOL Blocking Events | A long prefill blocked shorter requests | Enable chunked prefill: --long-prefill-token-threshold 256 |
| Rejected Requests (Admission) | Admission policy rejected the request at cluster ingress | Check token bucket capacity or admission policy |
| Shed (tier) | Per-SLO-class breakdown of admission rejections under overload; printed as indented sub-items beneath Rejected Requests (Admission) | Adjust slo_priorities in the policy bundle or raise admission thresholds |
| Rejected Requests (Routing) | No routable instances for the request's model (all instances are Loading or Draining) | Increase initial_nodes, reduce loading_delay.mean, or stagger drain operations |
| Dropped Unservable | Request exceeds --max-model-len context window or needs more KV blocks than exist | Check --max-model-len setting; increase --total-kv-blocks or reduce max input tokens |
| Timed Out Requests | Request exceeded its client deadline before completing | Increase --timeout or reduce load |
| Length-Capped Requests | Request was force-completed when it reached MaxModelLen tokens during decode | Expected if workloads push against --max-model-len; set --max-model-len 0 (unlimited) to disable the cap |
| Gateway Queue Depth (horizon) | Requests still waiting in the gateway queue when the simulation ended; printed only when > 0 | Reduce arrival rate or increase cluster capacity |
| Gateway Queue Shed | Requests shed from the gateway queue because it was full; printed only when > 0 | Increase --max-gateway-queue-depth or enable --flow-control with a saturation detector |

Stdout-only vs JSON counters

Dropped Unservable, Timed Out Requests, and Length-Capped Requests also appear as fields in the --metrics-path JSON output (see Primary Metrics: dropped_unservable, timed_out_requests, length_capped_requests).

All other counters in this table — including Shed (tier), Gateway Queue Depth (horizon), and Gateway Queue Shed — are stdout-only and do not appear in the JSON file.

blis replay anomaly block

blis replay produces a subset of this output: Timed Out Requests, Gateway Queue Depth (horizon), and Gateway Queue Shed are currently missing from the blis replay anomaly block even when non-zero (tracked in issue #1184). PD Disaggregation Metrics are not supported in replay at all.

KV Cache Metrics

When KV cache activity is nonzero, BLIS prints === KV Cache Metrics ===:

| Metric | Meaning | Concern Threshold |
|---|---|---|
| Preemption Rate | Ratio of preemption events to completed requests; a single request can be preempted more than once | > 5% indicates KV pressure |
| Cache Hit Rate | Fraction of blocks served from prefix cache | Higher is better; indicates prefix reuse |
| KV Thrashing Rate | Repeated preemption-reallocation cycles | > 0 indicates severe memory pressure |
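
The preemption rate can also be recomputed from the JSON counters, which helps when only the --metrics-path file is available. A minimal sketch, assuming the rate is preemption_count divided by completed_requests as the table describes; the function name and the zero-completion handling are choices made for this example.

```python
def preemption_rate(metrics: dict) -> float:
    """Preemption events per completed request, from the aggregate JSON counters."""
    preemptions = metrics.get("preemption_count", 0)
    completed = metrics.get("completed_requests", 0)
    if completed == 0:
        # Assumption: with nothing completed, any preemption counts as unbounded pressure.
        return float("inf") if preemptions else 0.0
    return preemptions / completed


rate = preemption_rate({"preemption_count": 30, "completed_requests": 500})
print(f"{rate:.1%}", "KV pressure" if rate > 0.05 else "ok")
```

Because one request can be preempted several times, this value can exceed 100% under heavy memory pressure.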

Per-SLO-Class Metrics

When multiple SLO classes are present in the workload, BLIS prints per-class TTFT and E2E distributions. This lets you verify that critical requests meet SLOs even when batch traffic is heavy.

Per-Model Metrics

When instances serve different models (multi-model deployment), BLIS prints per-model TTFT and E2E percentiles (p50 and p99) and throughput (req/s and tok/s). This appears automatically when requests carry model tags. Output format:

=== Per-Model Metrics ===
  qwen/qwen3-14b:
    TTFT: p50=1234.56 p99=5678.90 (n=250)
    E2E:  p50=9876.54 p99=12345.67 (n=250)
    Throughput: 50.00 req/s, 6400.00 tok/s

Per-model metrics appear on stdout only. The --metrics-path JSON file (see Primary Metrics) contains only the aggregate MetricsOutput fields. The same applies to per-tenant, session, and PD metrics — all are stdout-only sections.

Per-Tenant Metrics

When requests carry tenant_id labels, BLIS prints per-tenant request counts, total output tokens served, and a Jain Fairness Index over the token distribution. This section appears automatically and is omitted when no requests carry tenant labels (backward-compatible with legacy and untenanted workloads). A workload with a single named tenant shows the section with Jain=1.0 (trivially fair).

=== Per-Tenant Metrics ===
  alice: requests=50, tokens=12500
  bob: requests=50, tokens=12480
  Jain Fairness Index: 0.9999

Tenants are listed in lexicographic order. The Jain Fairness Index ranges from 1/N (maximally unfair — one tenant receives everything) to 1.0 (perfectly fair — all tenants receive equal tokens). A balanced two-tenant workload produces a value ≥ 0.99.
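
For reference, the standard Jain index over allocations x_1..x_N is (Σx)² / (N · Σx²). The sketch below applies that textbook formula to per-tenant token counts; whether BLIS rounds or truncates the printed value is not specified here, so printed digits may differ slightly.

```python
def jain_fairness_index(allocations: list[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (N * sum x^2), in the range [1/N, 1]."""
    n = len(allocations)
    sum_sq = sum(x * x for x in allocations)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero allocation")
    total = sum(allocations)
    return (total * total) / (n * sum_sq)


print(jain_fairness_index([100, 100]))      # equal shares
print(jain_fairness_index([100, 0]))        # one tenant receives everything
print(jain_fairness_index([12500, 12480]))  # near-equal shares
```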

To tag requests with tenant labels, set tenant_id in your workload spec cohort:

cohorts:
  - name: alice-traffic
    tenant_id: alice
    ...
  - name: bob-traffic
    tenant_id: bob
    ...

Session Metrics

When a workload contains multi-turn sessions (requests with session_id), BLIS prints a === Session Metrics === block:

| Line | What It Means |
|---|---|
| Sessions | Number of distinct multi-turn sessions in the workload |
| TTFT cold (round 0) | TTFT distribution for the first round of each session; no KV cache warmth from prior context |
| TTFT warm (round≥1) | TTFT distribution for follow-up rounds; benefits from cached context |
| Session duration | End-to-end duration from session start (first round arrival) to last round completion |

The cold vs. warm TTFT split reveals prefix cache effectiveness: if TTFT warm ≈ TTFT cold, prefix caching is not activating for continuations; check --cache-signal-delay and scorer configuration.
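
Since session metrics are stdout-only, the same cold and warm comparison can be approximated from the per-request records. This is a sketch that uses the session_id, round_index, and ttft_ms fields from Per-Request Fields; it compares means only, whereas BLIS prints full distributions, and the function name is illustrative.

```python
from statistics import mean


def cold_warm_ttft(requests: list[dict]) -> tuple:
    """Return (mean cold TTFT, mean warm TTFT) in ms over session requests only."""
    # session_id is omitted for single-turn requests, so those are skipped.
    session_requests = [r for r in requests if r.get("session_id")]
    cold = [r["ttft_ms"] for r in session_requests if r.get("round_index", 0) == 0]
    warm = [r["ttft_ms"] for r in session_requests if r.get("round_index", 0) >= 1]
    return (mean(cold) if cold else None, mean(warm) if warm else None)


# Illustrative records, not real BLIS output.
records = [
    {"session_id": "s1", "round_index": 0, "ttft_ms": 800.0},
    {"session_id": "s1", "round_index": 1, "ttft_ms": 200.0},
    {"session_id": "s2", "round_index": 0, "ttft_ms": 1000.0},
    {"session_id": "s2", "round_index": 1, "ttft_ms": 400.0},
    {"round_index": 0, "ttft_ms": 5000.0},  # single-turn request, ignored
]
cold_ms, warm_ms = cold_warm_ttft(records)
print(cold_ms, warm_ms)
```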

PD Disaggregation Metrics

When PD disaggregation is active (--prefill-instances > 0), BLIS prints a === PD Metrics === block:

| Field | What It Means |
|---|---|
| Disaggregated Requests | Requests that completed KV transfer through the prefill→decode path |
| Dropped at Decode KV | Requests whose transferred KV blocks could not be accepted by the decode instance |
| Prefill Throughput | Sub-request completion rate on prefill instances (sub-req/s) |
| Decode Throughput | Sub-request completion rate on decode instances (sub-req/s) |
| Load Imbalance Ratio | max(prefill_load, decode_load) / min(...); 1.0 = perfectly balanced; inf = one pool has no completions |
| Parent TTFT | Client-visible TTFT (prefill TTFT + KV transfer duration + first decode step); distribution in microseconds |
| KV Transfer Duration | Time to transfer KV blocks from prefill to decode instance; distribution in microseconds |
| Peak Concurrent Transfers | Maximum simultaneous in-flight KV transfers (only with --pd-transfer-contention) |
| Mean Transfer Queue Depth | Average queue depth at the transfer bandwidth bottleneck (only with --pd-transfer-contention) |

blis run only

PD Disaggregation Metrics are produced by blis run only. blis replay does not support PD disaggregation (a warning is logged if PD flags are passed to replay). blis observe dispatches to real servers and produces no DES output.

Fitness Evaluation

For automated multi-configuration comparison:

--fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"

Valid metric keys: throughput, tokens_per_sec, p99_ttft, p50_ttft, mean_ttft, p99_e2e, p50_e2e, mean_e2e.

How Normalization Works

  • Latency metrics: 1 / (1 + value/1000) — lower latency → higher score. Reference: 1000 ticks = 1ms
  • Throughput metrics: value / (value + reference) — higher throughput → higher score. References: RPS=100, TPS=10,000

Normalization compresses large differences

The 1/(1+x/1000) function compresses large raw differences into small score differences. A 38% TTFT p99 improvement (64,000→39,000 ticks) maps to only a 2-8% fitness score difference. Always examine raw metrics alongside fitness scores for a meaningful comparison.
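
The normalization formulas and the compression effect can be checked with a short sketch. It covers only the per-metric normalization stated above, taking latency values in ticks; how the weights from --fitness-weights are combined into the final score is not shown, and the function names are illustrative.

```python
RPS_REFERENCE = 100.0     # reference for throughput, as stated above
TPS_REFERENCE = 10_000.0  # reference for tokens_per_sec, as stated above


def latency_score(value_ticks: float) -> float:
    """1 / (1 + value/1000): lower latency gives a higher score."""
    return 1.0 / (1.0 + value_ticks / 1000.0)


def throughput_score(value: float, reference: float) -> float:
    """value / (value + reference): higher throughput gives a higher score."""
    return value / (value + reference)


# The two TTFT p99 values from the note above.
slower, faster = 64_000, 39_000
raw_change = (slower - faster) / slower
score_change = latency_score(faster) - latency_score(slower)
print(f"raw latency change: {raw_change:.1%}")
print(f"normalized scores: {latency_score(slower):.4f} -> {latency_score(faster):.4f}")
print(f"absolute score change: {score_change:.4f}")
print(f"throughput score at 100 req/s: {throughput_score(100.0, RPS_REFERENCE):.2f}")
```

Both normalized latency scores sit close to zero at these magnitudes, which is why a large raw gap moves the weighted fitness score only slightly.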

Common Patterns

Saturation Curves

As arrival rate increases past per-instance service capacity (μ ≈ 1/step_time), TTFT p99 grows super-linearly. The queue growth rate excess = λ/k - μ, where λ is the arrival rate and k is the number of instances, determines how quickly latency degrades.
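
As a worked example of the excess formula, the sketch below plugs in made-up numbers. It assumes λ is the cluster arrival rate in req/s, k is the number of instances, and step_time is in seconds; the 25 ms step time is hypothetical.

```python
def service_rate(step_time_s: float) -> float:
    """Approximate per-instance service capacity: mu ~= 1 / step_time."""
    return 1.0 / step_time_s


def queue_growth_excess(arrival_rate: float, num_instances: int, step_time_s: float) -> float:
    """excess = lambda/k - mu; a positive value means the queue grows over time."""
    return arrival_rate / num_instances - service_rate(step_time_s)


# Hypothetical inputs: 200 req/s over 4 instances, 25 ms per step.
excess = queue_growth_excess(arrival_rate=200.0, num_instances=4, step_time_s=0.025)
print(f"per-instance arrival rate: {200.0 / 4:.1f} req/s")
print(f"service rate: {service_rate(0.025):.1f} req/s")
print(f"excess: {excess:.1f} req/s")
```

With these inputs each instance receives 50 req/s against a capacity of about 40 req/s, so the backlog grows by roughly 10 requests per second per instance.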

Tail Latency Spikes

P99 diverges from mean sharply near saturation. A workload at 90% capacity may show 2x mean TTFT but 10x P99 TTFT.

Snapshot Staleness Effects

With the kv-utilization scorer alone and --snapshot-refresh-interval 100ms, TTFT degrades by 354%. The default composite scorer mitigates ~99% of this effect.

Policy Equivalence at Low Load

All routing policies produce equivalent results (within 5%) at low utilization. Differentiation requires moderate-to-high load where queueing dynamics dominate.

Alpha Overhead

BLIS models non-GPU overhead (tokenization, API serialization) as alpha coefficients. Alpha queueing time (alpha0 + alpha1 × inputLen) delays request enqueue, creating an event gap, but does not occupy the GPU. Alpha output processing time (alpha2) adds to TTFT/E2E metrics but does not affect step scheduling. This means:

  • Simulated E2E > theoretical M/M/k E2E (especially at high load)
  • The divergence is 28-71% at ρ ≥ 0.5 but only 0.3-3.3% at ρ ≤ 0.3
  • To compare with theoretical models, use rho_eff = lambda × step_total, not lambda × E2E_total
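
The two formulas in this section can be written out as below. The coefficient values, time units, and function names are hypothetical; the only requirement is that arrival rate and step time use consistent units (for example, req/s and seconds).

```python
def alpha_queueing_time(alpha0: float, alpha1: float, input_len: int) -> float:
    """Pre-enqueue overhead: alpha0 + alpha1 * inputLen (same time unit as the coefficients)."""
    return alpha0 + alpha1 * input_len


def effective_utilization(arrival_rate: float, step_total: float) -> float:
    """rho_eff = lambda * step_total, using GPU step time rather than full E2E time."""
    return arrival_rate * step_total


# Hypothetical coefficients: 100 time units fixed cost plus 2 per input token.
print(alpha_queueing_time(alpha0=100.0, alpha1=2.0, input_len=512))
# Hypothetical load: 20 req/s with 25 ms of total step time per request.
print(effective_utilization(arrival_rate=20.0, step_total=0.025))
```

Using E2E time in place of step_total would count alpha overhead as GPU occupancy and overstate utilization.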

Per-Request Results

For detailed analysis, save per-request data:

./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --metrics-path metrics.json

Each request record includes TTFT, E2E, scheduling delay, and completion status.
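
Once the records are saved, per-class tail latency can be recomputed offline. The sketch below groups the requests array by slo_class and takes a nearest-rank p99 of ttft_ms. The percentile method is an assumption and may not match the interpolation BLIS uses, and the "unlabeled" bucket name is invented for requests whose slo_class is omitted.

```python
import math
from collections import defaultdict


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile of a non-empty list (p in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_p99_by_class(requests: list[dict]) -> dict:
    """Group per-request records by slo_class and return p99 ttft_ms per class."""
    groups = defaultdict(list)
    for record in requests:
        # slo_class is omitted when empty, so fall back to a placeholder bucket.
        groups[record.get("slo_class", "unlabeled")].append(record["ttft_ms"])
    return {cls: percentile(vals, 99) for cls, vals in groups.items()}


# Illustrative records; in practice use json.load on metrics.json and read ["requests"].
records = [{"slo_class": "critical", "ttft_ms": float(i)} for i in range(1, 101)]
records += [{"ttft_ms": 10.0}, {"ttft_ms": 30.0}, {"ttft_ms": 20.0}]
result = ttft_p99_by_class(records)
print(result)
```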

Further Reading