Observe / Replay / Calibrate

This guide covers the end-to-end pipeline for validating BLIS simulator accuracy against real inference servers: observe real latencies, replay the captured trace through the discrete-event simulator (DES), and calibrate by comparing the results.

# Quick example: observe a real server, replay through the simulator, compare
./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload-spec workload.yaml --trace-header trace.yaml --trace-data trace.csv
./blis replay --trace-header trace.yaml --trace-data trace.csv \
  --model qwen/qwen3-14b --results-path results.json
./blis calibrate --trace-header trace.yaml --trace-data trace.csv \
  --sim-results results.json --report calibration.json

Pipeline Overview

The observe/replay/calibrate pipeline has three stages:

| Stage | Command | Input | Output |
|---|---|---|---|
| Observe | blis observe | Workload spec or distribution params + real server | TraceV2 (header YAML + data CSV) |
| Replay | blis replay | TraceV2 files + simulator config | SimResult JSON |
| Calibrate | blis calibrate | TraceV2 + SimResult JSON | Calibration report JSON |

Data flow:

WorkloadSpec YAML ──► blis observe ──► TraceV2 (header.yaml + data.csv)
                        │                       │
                        ▼                       ▼
                   Real Server            blis replay ──► results.json
                                                │
                                                ▼
                              TraceV2 + results.json
                                                │
                                                ▼
                                        blis calibrate ──► calibration.json

Why three separate commands? Each stage is independently useful. You can observe without replaying (to collect latency baselines), replay without calibrating (to test simulator behavior on real traces), or re-calibrate with different simulator configs without re-observing.


blis observe

Dispatches requests to a real inference server, records per-request timing (TTFT, E2E latency, token counts), and exports the results as a TraceV2 file pair.

Required Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --server-url | string | "" | Inference server URL |
| --model | string | "" | Model name for API requests |
| --trace-header | string | "" | Output path for TraceV2 header YAML |
| --trace-data | string | "" | Output path for TraceV2 data CSV |

Workload Input (one required)

Four input modes are available. At least one must be provided per invocation:

| Mode | Flags | Description |
|---|---|---|
| Named preset | --workload <name> --rate <N> | Standard workload from defaults.yaml; identical token distributions to blis run --workload <name> |
| Workload spec | --workload-spec <file> | Multi-client workload from a YAML file |
| Distribution synthesis | --rate <N> | Single-client workload with custom token distributions (see Distribution Synthesis Flags) |
| Closed-loop | --concurrency <N> | Fixed pool of virtual users; arrival is response-driven (token distributions from Distribution Synthesis Flags) |

Flag reference:

| Flag | Type | Default | Description |
|---|---|---|---|
| --workload | string | "" | Preset name (chatbot, summarization, contentgen, multidoc); requires --rate |
| --workload-spec | string | "" | Path to WorkloadSpec YAML (alternative to --workload or --rate) |
| --rate | float64 | 0 | Requests per second; required for --workload preset mode and distribution synthesis |
| --concurrency | int | 0 | Number of closed-loop virtual users; mutually exclusive with --rate |

Combinations that produce an error:

| Combination | Error |
|---|---|
| --workload without --rate | preset requires a rate |
| --workload + --workload-spec | mutually exclusive |
| --workload + --concurrency | mutually exclusive |
| --rate + --concurrency | mutually exclusive |
| --workload-spec + --concurrency | use clients[].concurrency in the spec file instead |
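The rules in the table above can be sketched as a small validation function. This is an illustration only: the function name, the error wording, and the order of the checks are assumptions, not the actual behavior of the blis CLI.

```python
from typing import Optional


def check_workload_flags(
    workload: str = "",
    workload_spec: str = "",
    rate: float = 0.0,
    concurrency: int = 0,
) -> Optional[str]:
    """Return an error message for an invalid combination, or None if valid.

    Hypothetical helper: messages paraphrase the table above, and the
    order in which conflicts are reported is an assumption.
    """
    if workload and workload_spec:
        return "--workload and --workload-spec are mutually exclusive"
    if workload and concurrency > 0:
        return "--workload and --concurrency are mutually exclusive"
    if workload_spec and concurrency > 0:
        return "use clients[].concurrency in the spec file instead"
    if workload_spec:
        return None  # the spec wins; --rate is ignored in this mode
    if rate > 0 and concurrency > 0:
        return "--rate and --concurrency are mutually exclusive"
    if workload and rate <= 0:
        return "--workload preset requires --rate"
    if not (workload or rate > 0 or concurrency > 0):
        return "one workload input mode is required"
    return None
```

For example, `check_workload_flags(workload="chatbot")` returns the "requires --rate" message, while `check_workload_flags(workload="chatbot", rate=10)` returns `None`.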

Note

--workload-spec takes priority over --rate if both are provided — the spec is used and --rate is ignored. All other distribution synthesis flags (--prompt-tokens, etc.) are similarly ignored when --workload-spec is active.

Optional Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --api-key | string | "" | Bearer token for server authentication |
| --server-type | string | "vllm" | Server type (vllm, tgi, etc.) |
| --max-concurrency | int | 256 | Maximum simultaneous in-flight requests |
| --warmup-requests | int | 0 | Number of initial requests to exclude from trace |
| --no-streaming | bool | false | Disable streaming (use non-streaming HTTP) |
| --seed | int64 | 42 | RNG seed for workload generation |
| --horizon | int64 | 0 | Observation horizon in microseconds (0 = from spec or unlimited) |
| --num-requests | int | 0 | Maximum requests to generate (0 = from spec or unlimited) |
| --think-time-ms | int | 0 | Think time in ms between response and next request (concurrency mode only) |
| --api-format | string | "completions" | API format: completions or chat |
| --unconstrained-output | bool | false | Do not set max_tokens (let server decide output length) |
| --rtt-ms | float64 | 0 | Measured network round-trip time in milliseconds |
| --defaults-filepath | string | "defaults.yaml" | Path to defaults.yaml containing preset definitions (preset mode only) |

Distribution Synthesis Flags

Used when --rate or --concurrency mode is active (ignored when --workload-spec or --workload <preset> is provided):

| Flag | Type | Default | Description |
|---|---|---|---|
| --prompt-tokens | int | 512 | Average prompt token count |
| --prompt-tokens-stdev | int | 256 | Prompt token standard deviation |
| --prompt-tokens-min | int | 2 | Minimum prompt tokens |
| --prompt-tokens-max | int | 7000 | Maximum prompt tokens |
| --output-tokens | int | 512 | Average output token count |
| --output-tokens-stdev | int | 256 | Output token standard deviation |
| --output-tokens-min | int | 2 | Minimum output tokens |
| --output-tokens-max | int | 7000 | Maximum output tokens |
| --prefix-tokens | int | 0 | Shared prefix token count |
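To get a feel for what the mean, standard deviation, minimum, and maximum flags imply, the sketch below draws token counts from a normal distribution and clamps them to the allowed range. It assumes a clamped Gaussian; BLIS may truncate or resample differently, and Python's RNG will not reproduce the request sequence that blis generates for the same seed.

```python
import random


def sample_tokens(rng: random.Random, mean: int, stdev: int, lo: int, hi: int) -> int:
    """Draw a token count from a normal distribution, clamped to [lo, hi]."""
    value = round(rng.gauss(mean, stdev))
    return max(lo, min(hi, value))


# Defaults from the table above: mean 512, stdev 256, min 2, max 7000.
rng = random.Random(42)  # 42 mirrors the --seed default; the sequence itself is Python-specific
prompt_lens = [sample_tokens(rng, 512, 256, 2, 7000) for _ in range(1000)]
```

With a standard deviation of half the mean, a noticeable share of samples lands near the minimum, which is worth keeping in mind when choosing --prompt-tokens-min.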

Examples

Named preset mode — drive the server with a standard workload (same shape as blis run --workload chatbot):

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload chatbot --rate 10 --num-requests 100 \
  --trace-header trace.yaml --trace-data trace.csv

Workload-spec mode — multi-client workload from a YAML spec:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

Rate mode — quick experiment with distribution synthesis:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --rate 10 --num-requests 100 \
  --prompt-tokens 256 --output-tokens 128 \
  --trace-header trace.yaml --trace-data trace.csv

Chat completions API — use /v1/chat/completions instead of /v1/completions:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --api-format chat --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

Non-streaming with network RTT — disable SSE streaming and record network latency:

./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
  --no-streaming --rtt-ms 2.5 --workload-spec workload.yaml \
  --trace-header trace.yaml --trace-data trace.csv

Streaming and token counts

By default, observe uses streaming (SSE) and sends stream_options: {include_usage: true} to capture accurate token counts from the final SSE chunk. Non-streaming mode (--no-streaming) parses the full response body instead. Both modes extract finish_reason from server responses.
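As a rough illustration of the streaming path, the sketch below extracts the usage block and finish_reason from OpenAI-style SSE lines. The chunk layout (a final chunk with an empty choices list and a usage object, followed by `data: [DONE]`) follows the OpenAI-compatible streaming format; the function name and sample payloads are made up for this example and are not taken from the BLIS source.

```python
import json
from typing import Iterable, Optional, Tuple


def parse_sse_stream(lines: Iterable[str]) -> Tuple[Optional[dict], Optional[str]]:
    """Return (usage, finish_reason) from OpenAI-style SSE lines."""
    usage, finish_reason = None, None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        if chunk.get("usage"):
            usage = chunk["usage"]  # present only on the final usage chunk
        for choice in chunk.get("choices", []):
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return usage, finish_reason


sample = [
    'data: {"choices": [{"text": "Hi", "finish_reason": null}]}',
    'data: {"choices": [{"text": "!", "finish_reason": "stop"}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 2}}',
    "data: [DONE]",
]
usage, reason = parse_sse_stream(sample)
```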

Prefix sharing

When the workload spec defines prefix groups, observe builds deterministic prefix strings from a fixed vocabulary, seeded by the RNG seed and group name. This activates the server's prefix cache for realistic KV cache hit rates.

Before dispatching requests, observe sends a single calibration request to measure the server's tokens-per-word ratio (typically 1.5–1.7 for BPE tokenizers). Prefix word counts are then scaled so the server tokenizes them to approximately the target prefix_length in the spec — matching what blis run simulates. The calibration result is logged at startup:

INFO Prefix token calibration: 100 words → 167 server tokens (1.670 tokens/word)

If calibration fails (server unreachable, timeout, or abnormal ratio), observe falls back to 1:1 word-to-token mapping with a warning.
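The scaling step can be sketched as follows. The function name and the sanity bounds used to detect an "abnormal ratio" are hypothetical; the text above does not state which bounds observe applies.

```python
def prefix_word_count(target_tokens: int, tokens_per_word: float,
                      lo: float = 1.0, hi: float = 3.0) -> int:
    """Words needed so the server tokenizes the prefix to about target_tokens.

    lo/hi are hypothetical sanity bounds; a ratio outside them is treated as
    a failed calibration and falls back to a 1:1 word-to-token mapping.
    """
    if not (lo <= tokens_per_word <= hi):
        return target_tokens  # fallback: one word per token
    return max(1, round(target_tokens / tokens_per_word))


ratio = 167 / 100  # from the log line above: 100 words -> 167 server tokens
words = prefix_word_count(512, ratio)  # about 307 words for a 512-token prefix
```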

Session support

If the workload spec contains session clients, observe runs in closed-loop mode: each completed request may trigger follow-up requests from the session manager, interleaved with pre-generated arrivals by arrival time.


blis replay

Replays a captured TraceV2 file through the BLIS discrete-event simulator. Instead of generating synthetic requests, replay loads real request timing and token counts from the trace.

Replay-Specific Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --trace-header | string | "" | Path to TraceV2 header YAML (required) |
| --trace-data | string | "" | Path to TraceV2 data CSV (required) |
| --results-path | string | "" | File to write SimResult JSON for blis calibrate consumption |
| --model | string | "" | LLM name (required) |
| --trace-output | string | "" | Export replay results as TraceV2 files (<prefix>.yaml + <prefix>.csv); header mode: "replayed" |

Replay also accepts all shared simulation config flags (--latency-model, --total-kv-blocks, --max-num-running-reqs, etc.) — the same flags available in blis run. See Configuration for the full list.

How Replay Differs from blis run

| Aspect | blis run | blis replay |
|---|---|---|
| Request source | Generated from workload spec or CLI distributions | Loaded from TraceV2 CSV |
| Arrival times | Synthesized by arrival process (Poisson, etc.) | Exact timestamps from trace |
| Token counts | Sampled from distributions | Actual observed values |
| Horizon | From --horizon flag or spec | Auto-computed as 2x max arrival time (override with --horizon) |
| Output format | Full MetricsOutput JSON | SimResult JSON array (request_id, ttft_us, e2e_us, input_tokens, output_tokens) |
| Session support | Session manager creates follow-ups | Session structure encoded in trace (no manager needed) |
| Trace export | --trace-output (header mode: "generated") | --trace-output (header mode: "replayed") |
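The auto-computed horizon in the table above amounts to a one-line rule. The sketch below assumes arrival times are in microseconds and that a non-positive override means "not set"; how replay handles an empty trace is not documented here, so the zero default is also an assumption.

```python
from typing import Optional, Sequence


def replay_horizon_us(arrival_times_us: Sequence[int],
                      override_us: Optional[int] = None) -> int:
    """Return the simulation horizon in microseconds."""
    if override_us is not None and override_us > 0:
        return override_us  # an explicit --horizon wins
    # Twice the latest arrival leaves room for the last requests to finish.
    return 2 * max(arrival_times_us, default=0)
```

A trace whose last request arrives at 9.5 s (9,500,000 µs) would therefore be simulated for 19 s unless --horizon says otherwise.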

Latency model matters

The replay command simulates token generation using the configured latency model. For accurate calibration, choose the latency model that best matches the server's behavior. See Latency Models for guidance on selecting between roofline, blackbox, cross-model, and trained-roofline modes.


blis calibrate

Compares real observed latencies (from blis observe) against simulator predictions (from blis replay) and produces a calibration report.

Flags

| Flag | Type | Default | Description |
|---|---|---|---|
| --trace-header | string | "" | Path to TraceV2 header YAML (required) |
| --trace-data | string | "" | Path to TraceV2 data CSV (required) |
| --sim-results | string | "" | Path to SimResult JSON from blis replay (required) |
| --report | string | "" | Path to write calibration report JSON (required) |
| --warmup-requests | int | -1 | Requests to exclude from comparison (-1 = use trace header value, 0 = include all) |
| --network-rtt-us | int64 | -1 | Network RTT in microseconds added to sim-side latencies (-1 = use trace header value) |
| --network-bandwidth-mbps | float64 | 0 | Network bandwidth in Mbps for upload/download delay (0 = no delay) |

Sentinel defaults

The --warmup-requests and --network-rtt-us flags use -1 as a sentinel meaning "read the value from the trace header." This allows the calibration to automatically use the warmup count and RTT recorded during observation. Pass 0 explicitly to override (include all requests or apply no RTT correction).
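The sentinel pattern is simple enough to show in a few lines. The helper name is hypothetical; it only illustrates the rule described above, not the actual calibrate code.

```python
def resolve_sentinel(flag_value: int, header_value: int) -> int:
    """-1 means 'use the trace header'; any other value overrides it."""
    return header_value if flag_value == -1 else flag_value


warmup = resolve_sentinel(-1, header_value=5)    # 5, taken from the trace header
rtt_us = resolve_sentinel(0, header_value=2500)  # 0, explicit override: no RTT correction
```

Using -1 rather than 0 as the "unset" marker is what makes it possible to pass 0 as a meaningful value.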

Interpreting the Calibration Report

The calibration report JSON contains four sections:

trace_info — Summary of the input data:

{
  "num_requests": 100,
  "warm_up_excluded": 5,
  "matched_pairs": 95,
  "token_mismatches": 2,
  "duration": "2m30s"
}
  • matched_pairs: Requests matched by ID between trace and sim results
  • token_mismatches: Pairs where observed and simulated token counts differ (indicates potential data quality issues)
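How matched_pairs and token_mismatches could be derived is sketched below. The SimResult field names come from the replay output described earlier; using the same field names for trace records, and excluding warmup requests by position, are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def match_pairs(trace: List[dict], sim: List[dict],
                warmup: int = 0) -> Tuple[List[Tuple[dict, dict]], int]:
    """Pair trace and sim records by request_id and count token mismatches."""
    sim_by_id: Dict[int, dict] = {r["request_id"]: r for r in sim}
    pairs, mismatches = [], 0
    for rec in trace[warmup:]:  # drop the first `warmup` trace records
        other = sim_by_id.get(rec["request_id"])
        if other is None:
            continue  # no simulated counterpart: not a matched pair
        pairs.append((rec, other))
        real_tokens = (rec["input_tokens"], rec["output_tokens"])
        sim_tokens = (other["input_tokens"], other["output_tokens"])
        if real_tokens != sim_tokens:
            mismatches += 1
    return pairs, mismatches
```

`len(pairs)` corresponds to matched_pairs and the second return value to token_mismatches.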

metrics — Per-metric comparison (TTFT and E2E latency):

{
  "ttft": {
    "RealP50": 1234.5,
    "SimP50": 1200.0,
    "RealP99": 4567.8,
    "SimP99": 4500.0,
    "MAPE": 0.05,
    "PearsonR": 0.95,
    "BiasDirection": "over-predict",
    "Quality": "good",
    "Count": 95
  },
  "e2e": { ... }
}
| Field | Meaning |
|---|---|
| RealP50/P90/P95/P99 | Real (observed) latency percentiles in microseconds |
| SimP50/P90/P95/P99 | Simulated latency percentiles in microseconds |
| MAPE | Mean Absolute Percentage Error (lower is better) |
| PearsonR | Pearson correlation coefficient (closer to 1.0 is better) |
| BiasDirection | over-predict, under-predict, or neutral |
| Quality | Rating: excellent, good, fair, or poor |
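For readers who want to sanity-check a report by hand, the sketch below computes MAPE, Pearson's r, and a bias direction using their textbook definitions. BLIS's exact handling of edge cases (zero latencies, the tolerance for "neutral", the thresholds behind Quality) is not specified here, so the `tol` parameter and the function names are assumptions.

```python
import math
from typing import Sequence


def mape(real: Sequence[float], sim: Sequence[float]) -> float:
    """Mean absolute percentage error as a fraction (0.05 = 5%)."""
    return sum(abs(s - r) / r for r, s in zip(real, sim)) / len(real)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bias_direction(real: Sequence[float], sim: Sequence[float],
                   tol: float = 0.01) -> str:
    """Sign of the mean relative error; tol is a hypothetical dead band."""
    signed = sum((s - r) / r for r, s in zip(real, sim)) / len(real)
    if signed > tol:
        return "over-predict"   # simulated latencies are higher than real ones
    if signed < -tol:
        return "under-predict"  # simulated latencies are lower than real ones
    return "neutral"
```

A high PearsonR with a high MAPE usually indicates a systematic offset: the simulator ranks requests correctly but is uniformly too fast or too slow.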

config_match — Tracks which simulator config parameters matched the observed server config (currently reports matched and defaulted arrays).

known_limitations — Documents known sources of sim/real divergence (batch step granularity, synthetic prefix tokens, speculative decoding).


Worked Example

This walkthrough demonstrates the full pipeline: define a workload, observe a real vLLM server, replay through the simulator, and interpret the calibration report.

Step 1: Define a workload

Create a workload spec (workload.yaml):

rate: 5.0
num_requests: 50
clients:
  - id: "chat-user"
    rate_fraction: 1.0
    slo_class: "standard"
    arrival:
      process: poisson
    input_distribution:
      type: gaussian
      params:
        mean: 256
        std_dev: 64
        min: 32
        max: 1024
    output_distribution:
      type: gaussian
      params:
        mean: 128
        std_dev: 32
        min: 16
        max: 512

Step 2: Observe the real server

./blis observe \
  --server-url http://localhost:8000 \
  --model qwen/qwen3-14b \
  --workload-spec workload.yaml \
  --warmup-requests 5 \
  --trace-header trace.yaml \
  --trace-data trace.csv

This sends 50 requests to the server at ~5 req/s, excludes the first 5 from the trace (warmup), and writes the TraceV2 files.
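The "~5 req/s" is approximate because the spec uses a Poisson arrival process: gaps between requests are exponentially distributed with a mean of 1/rate seconds. A minimal sketch of such a generator follows; whether BLIS places the first request at time zero or after an initial gap is an assumption here, and Python's RNG will not match the timestamps blis produces.

```python
import random
from typing import List


def poisson_arrivals_us(rate_per_s: float, n: int, seed: int = 42) -> List[int]:
    """Generate n arrival timestamps (microseconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap in seconds
        arrivals.append(int(t * 1_000_000))
    return arrivals


arrivals = poisson_arrivals_us(rate_per_s=5.0, n=50)
```

With rate 5.0 and 50 requests the expected span is about 10 seconds, but any single run can be noticeably shorter or longer.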

Step 3: Replay through the simulator

./blis replay \
  --trace-header trace.yaml \
  --trace-data trace.csv \
  --model qwen/qwen3-14b \
  --latency-model roofline \
  --results-path results.json

The simulator replays the same requests (arrival times, token counts) through the DES using the roofline latency model and writes per-request results.

Step 4: Calibrate

./blis calibrate \
  --trace-header trace.yaml \
  --trace-data trace.csv \
  --sim-results results.json \
  --report calibration.json

The calibration command matches requests by ID, applies warmup exclusion and RTT normalization from the trace header, and produces the report.

Step 5: Interpret results

python3 -m json.tool calibration.json

Look for:

  • MAPE < 0.10 and Quality = "good" or "excellent" → simulator is well-calibrated for this workload
  • BiasDirection = "over-predict" → simulator latencies are higher than reality (conservative)
  • BiasDirection = "under-predict" → simulator latencies are lower than reality (optimistic — may need latency model tuning)
  • High token_mismatches → data quality issue; check if the server truncated outputs
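The checks in the list above can be automated with a short script. The sketch assumes the report's top-level keys are named after the sections described earlier (`metrics` and `trace_info`); verify this against your own calibration.json before relying on it.

```python
import json


def summarize(report: dict, mape_threshold: float = 0.10) -> list:
    """Return human-readable findings for each metric in a calibration report."""
    findings = []
    for name, m in report.get("metrics", {}).items():
        ok = m["MAPE"] < mape_threshold and m["Quality"] in ("good", "excellent")
        status = "well-calibrated" if ok else "needs attention"
        findings.append(
            f"{name}: {status} (MAPE={m['MAPE']:.2f}, bias={m['BiasDirection']})"
        )
    mismatches = report.get("trace_info", {}).get("token_mismatches", 0)
    if mismatches:
        findings.append(f"{mismatches} token mismatches: check for truncated outputs")
    return findings


# Usage (assumes calibration.json exists in the working directory):
#   with open("calibration.json") as f:
#       print("\n".join(summarize(json.load(f))))
```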

If calibration quality is poor, try:

  1. Different latency model: Switch from roofline to blackbox or crossmodel (see Latency Models)
  2. Adjust server config flags: Match --max-num-running-reqs and --max-num-scheduled-tokens to the real server's settings
  3. Increase sample size: Use more requests (--num-requests) for statistical stability

GIE Headers for llm-d

When observing an llm-d cluster with Gateway Inference Extension (GIE), blis observe automatically sends two HTTP headers that GIE's Endpoint Picker (EPP) uses for admission control:

| Header | Workload spec field | Purpose |
|---|---|---|
| x-gateway-inference-objective | slo_class | Name of an InferenceObjective CRD on the target cluster. EPP looks up this CRD and reads its spec.priority integer for queue ordering and shedding. |
| x-gateway-inference-fairness-id | tenant_id | Tenant key for per-tenant fair-share scheduling. Fairness is enforced between requests of the same priority level. |

Headers are only sent when the field is non-empty, so non-GIE servers are unaffected.
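The "only when non-empty" rule can be pictured as follows. The function is a hypothetical illustration of the mapping in the table above, not code from blis.

```python
def gie_headers(slo_class: str = "", tenant_id: str = "") -> dict:
    """Build GIE request headers, omitting any header whose field is empty."""
    headers = {}
    if slo_class:
        headers["x-gateway-inference-objective"] = slo_class
    if tenant_id:
        headers["x-gateway-inference-fairness-id"] = tenant_id
    return headers
```

A client with neither field set produces an empty dict, so the request looks the same as one sent to a plain vLLM server.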

How GIE resolves priority

GIE does not accept a priority integer directly from the client. Instead, priority is resolved server-side through a CRD lookup:

  1. Client sends x-gateway-inference-objective: critical (the slo_class value)
  2. EPP looks up InferenceObjective/critical CRD on the cluster
  3. EPP reads spec.priority (e.g. 100) from the CRD
  4. The integer priority is used for strict priority queue ordering and shedding decisions

If no matching CRD exists, EPP defaults to priority 0.
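The lookup described in the steps above behaves like a dictionary read with a default. The sketch models the cluster's InferenceObjective CRDs as a plain dict with example values; it is a mental model of the resolution rule, not EPP's implementation.

```python
# Example CRD contents, keyed by InferenceObjective name (values are illustrative).
objectives = {"critical": 100, "background": -10}


def resolve_priority(objective_header: str, crds: dict) -> int:
    """Map the x-gateway-inference-objective value to spec.priority, default 0."""
    return crds.get(objective_header, 0)


resolve_priority("critical", objectives)  # 100
resolve_priority("unknown", objectives)   # 0: no matching CRD
```

A typo in slo_class therefore fails silently: the request is still served, just at priority 0.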

Prerequisite: deploy InferenceObjective CRDs

For GIE headers to have any effect, matching InferenceObjective CRDs must exist on the target cluster. The CRD names must match the slo_class values in your workload spec:

# On your Kubernetes cluster:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: critical              # must match slo_class in workload spec
spec:
  priority: 100               # higher = more important; negative = sheddable
  poolRef:
    name: my-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: background
spec:
  priority: -10               # negative priority → shed under load (HTTP 503)
  poolRef:
    name: my-pool

Workload spec example

Set slo_class and tenant_id on your clients to activate GIE headers. The slo_class value must match an InferenceObjective CRD name on the target cluster:

clients:
  - id: "realtime-api"
    slo_class: "critical"       # → sent as x-gateway-inference-objective header
    tenant_id: "team-alpha"     # → sent as x-gateway-inference-fairness-id header
  - id: "batch-job"
    slo_class: "background"     # → GIE resolves to negative priority via CRD
    tenant_id: "team-beta"

Note: The GIE API version (v1alpha2) shown above may differ on your cluster. Check your installed CRD version with kubectl get crd inferenceobjectives.inference.networking.x-k8s.io.


Tips

  • Warmup requests: Always use --warmup-requests during observation to exclude cold-start latencies (JIT compilation, KV cache initialization) from the trace.
  • Network RTT: If observing a remote server, measure RTT with ping and pass --rtt-ms. The calibrate command uses this to normalize sim-side latencies.
  • Reproducibility: The --seed flag controls workload generation RNG. Same seed + same spec = same request sequence.
  • Graceful shutdown: Press Ctrl+C during observation to stop gracefully — in-flight requests complete and all recorded data is written to the trace files.
  • Large workloads: Use --max-concurrency to limit in-flight requests and avoid overwhelming the server.
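To illustrate what a cap on in-flight requests means in practice (the last tip above), the sketch below uses a semaphore to bound concurrency. It is a conceptual Python illustration with a sleep standing in for the HTTP call; it does not reflect how blis implements --max-concurrency.

```python
import asyncio


async def dispatch_all(n_requests: int, max_concurrency: int) -> int:
    """Run n_requests fake requests; return the peak number in flight."""
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def one_request() -> None:
        nonlocal in_flight, peak
        async with sem:  # waits while max_concurrency requests are in flight
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the HTTP call
            in_flight -= 1

    await asyncio.gather(*(one_request() for _ in range(n_requests)))
    return peak


peak = asyncio.run(dispatch_all(n_requests=20, max_concurrency=4))
```

Note that a cap lower than rate × typical latency turns an open-loop workload into a partially closed-loop one, which changes the arrival pattern the server actually sees.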