Workload Specifications¶
This guide covers how to define the traffic patterns BLIS simulates — from simple CLI flags to complex multi-client YAML workload specs.
# Quick example: workload-spec YAML
./blis run --model qwen/qwen3-14b \
--num-instances 4 --workload-spec examples/multiturn-chat-demo.yaml
Workload Modes¶
BLIS supports three modes, in order of precedence:
| Mode | Flag | Best For |
|---|---|---|
| Workload-spec YAML | --workload-spec <path> |
Multi-client workloads with custom distributions |
| CLI distribution | --rate, --num-requests, --prompt-tokens |
Quick single-client experiments |
| Named presets | --workload chatbot |
Standard workload profiles |
Migration: --workload traces removed
The --workload traces and --workload-traces-filepath flags have been removed. They performed lossy statistical approximation (averaged token lengths, constant arrival) rather than faithful replay. For trace replay, use --workload-spec with a TraceV2 YAML file instead. To export simulation results as TraceV2, use --trace-output <prefix>.
Modeling Real Workloads¶
This section maps common traffic patterns to YAML workload spec configurations. For schema details, see the Workload Spec Schema.
Interactive Chat¶
User-facing chat applications need low latency, memoryless arrivals (users arrive independently), and moderate token variance around a central prompt length.
clients:
- id: "chat-user"
rate_fraction: 1.0
slo_class: "critical" # Latency-sensitive — tracked separately in metrics
prefix_group: "system-prompt" # Shared system prompt enables prefix caching
prefix_length: 512 # 512 tokens of shared context prepended to each request
arrival:
process: poisson # Memoryless: users arrive independently of each other
input_distribution:
type: gaussian # Moderate variance around a typical prompt length
params:
mean: 256
std_dev: 128
min: 2
max: 4096
output_distribution:
type: exponential # Most replies short, occasional long answers
params:
mean: 128
Pair with prefix-affinity routing for cache reuse:
./blis run --model qwen/qwen3-14b \
--num-instances 4 --workload-spec chat.yaml \
--routing-policy weighted --routing-scorers "prefix-affinity:3,queue-depth:2"
RAG with Shared Prefixes¶
Retrieval-augmented generation workloads share a common document context across requests. The prefix_group and prefix_length fields model this shared context, and prefix-affinity routing ensures requests with the same prefix hit cached KV blocks on the same instance.
clients:
- id: "rag-query"
rate_fraction: 1.0
slo_class: "standard"
prefix_group: "doc-context" # All requests share the retrieved document context
prefix_length: 2048 # Large shared prefix (retrieved passages)
arrival:
process: poisson
input_distribution:
type: gaussian
params:
mean: 128 # Short user queries appended after the prefix
std_dev: 64
min: 2
max: 512
output_distribution:
type: exponential # Short answers mostly, occasional long explanations
params:
mean: 64
Run with aggressive prefix-affinity to maximize cache reuse:
./blis run --model qwen/qwen3-14b \
--num-instances 4 --workload-spec rag.yaml \
--routing-policy weighted --routing-scorers "prefix-affinity:3,queue-depth:1"
Batch / Offline Processing¶
Non-interactive workloads (summarization, data extraction) tolerate latency and typically have higher token counts. Use batch or background SLO classes so per-class metrics track them separately from latency-sensitive traffic.
clients:
- id: "batch-summarize"
rate_fraction: 1.0
slo_class: "batch" # Latency-tolerant — won't pollute critical-class metrics
arrival:
process: gamma # Bursty job queue patterns (jobs submitted in waves)
cv: 2.0 # CV > 1 produces bursts; CV = 1 is Poisson-equivalent
input_distribution:
type: gaussian
params:
mean: 4096 # Long documents for summarization
std_dev: 1000
min: 100
max: 8192
output_distribution:
type: gaussian
params:
mean: 512
std_dev: 150
min: 10
max: 2048
Bursty Traffic¶
For flash sales or traffic spikes, use Gamma arrivals with high CV or cohort spike patterns:
# Option 1: Sustained burstiness via Gamma CV=3.5
clients:
- id: "bursty-client"
rate_fraction: 1.0
slo_class: "critical"
arrival:
process: gamma
cv: 3.5 # High CV produces sustained burst clusters
input_distribution:
type: exponential
params:
mean: 512
output_distribution:
type: exponential
params:
mean: 256
For time-bounded traffic spikes, use cohort spike patterns instead (see Cohort Dynamics below).
DES impact of burstiness
Gamma CV=3.5 produces 1.66x worse TTFT p99 at sub-saturation because burst events arrive before the prior burst drains. The effect is load-duration dependent: visible at moderate load, drowned by queue growth at high overload.
Multi-Turn Conversations¶
For multi-round chat with context accumulation (e.g., reasoning models), use the reasoning field:
clients:
- id: "reasoning-session"
rate_fraction: 1.0
slo_class: "standard"
arrival:
process: poisson
input_distribution:
type: gaussian
params:
mean: 256
std_dev: 128
min: 2
max: 2048
output_distribution:
type: gaussian
params:
mean: 256
std_dev: 128
min: 2
max: 2048
reasoning:
reason_ratio_distribution: # Fraction of output that is "reasoning" tokens
type: constant
params:
value: 50 # 50% reasoning ratio
multi_turn:
max_rounds: 4 # Up to 4 conversation rounds per session
think_time_us: 5000000 # 5 seconds between rounds (user think time)
context_growth: accumulate # Each round prepends full prior context
Context accumulation means round N sees all prior input+output tokens as prefix, creating growing KV cache pressure across rounds.
Open-loop vs closed-loop scheduling¶
By default, multi-turn sessions use closed-loop scheduling: each round's arrival time is set by the simulator after the prior round completes — next_arrival = completion_time + think_time_us. This means actual server latency propagates into inter-round spacing; under high load the session stretches to reflect real queuing delays. This is the behaviorally correct default for capacity planning.
Setting closed_loop: false switches to open-loop scheduling: all round arrival times are pre-stamped before simulation begins, using a 1 µs/output-token heuristic as the expected completion time. Server latency does not affect inter-round spacing — useful for reproducing pre-defined workload traces (e.g., inference-perf-style exports) where you want arrival times fixed regardless of how the simulator performs.
CLI Distribution Mode (Default)¶
The simplest way to generate traffic:
./blis run --model qwen/qwen3-14b \
--rate 100 --num-requests 500 \
--prompt-tokens 512 --prompt-tokens-stdev 256 \
--output-tokens 256 --output-tokens-stdev 128
Writing a Workload-Spec YAML¶
For complex workloads, use a YAML spec:
version: "2"
seed: 42
aggregate_rate: 100 # Total arrival rate (req/s)
num_requests: 1000
clients:
- id: "interactive"
rate_fraction: 0.6 # 60% of traffic — models a dominant chat workload
slo_class: "critical" # Latency-sensitive: per-class metrics tracked separately
prefix_group: "chat" # Shared system prompt — enables prefix cache reuse
prefix_length: 512 # 512-token system prompt prepended to each request
arrival:
process: poisson # Memoryless: independent user arrivals (typical for web traffic)
input_distribution:
type: gaussian # Moderate variance around a mean prompt length
params:
mean: 256
std_dev: 128
min: 2
max: 4096
output_distribution:
type: exponential # Right-skewed: most replies short, occasional long ones
params:
mean: 128
- id: "batch"
rate_fraction: 0.4 # 40% of traffic — background processing share
slo_class: "batch" # Latency-tolerant: won't pollute critical-class TTFT metrics
arrival:
process: gamma # Bursty: jobs submitted in waves from a job queue
cv: 2.0 # CV > 1 produces clustered arrivals (CV=1 ≈ Poisson)
input_distribution:
type: gaussian
params:
mean: 1024 # Longer inputs typical for summarization/extraction
std_dev: 512
min: 2
max: 7000
output_distribution:
type: gaussian
params:
mean: 512
std_dev: 256
min: 2
max: 7000
Arrival Processes¶
| Process | Behavior | DES Impact | Use When |
|---|---|---|---|
poisson |
Memoryless, exponentially distributed inter-arrival times | Steady event stream | Default; matches typical web traffic |
gamma |
Bursty (CV > 1) or regular (CV < 1) inter-arrivals | Burst events create temporary overloads | Modeling real traffic with bursts |
weibull |
Shape-controlled inter-arrival times | Similar to gamma, different tail behavior | Specific traffic shape matching |
constant |
Fixed inter-arrival time (deterministic) | Perfectly regular event stream | Controlled experiments, debugging |
DES implication
Arrival processes directly determine the timing of ArrivalEvent injections into the event queue. Gamma CV=3.5 produces 1.66x worse TTFT p99 at sub-saturation because burst events arrive before the prior burst drains.
What the arrival process governs for multi-turn clients¶
The unit controlled by the arrival process differs by client type:
| Client type | Arrival process governs |
|---|---|
| Non-multi-turn | Inter-arrival time between individual requests |
Multi-turn, single_session: false (default) |
Inter-arrival time between session starts — one new session begins per IAT tick, producing max_rounds rounds per session |
Multi-turn, single_session: true |
Fires exactly once per client, determining when that client's single session starts |
Within any session, inter-round spacing is always controlled by think_time_us, never by the arrival process.
ClientSpec is a traffic source, not a single user¶
A ClientSpec represents a traffic source that emits sessions over time, not a single user. One client with single_session: false and a long horizon generates many sequential sessions — it models a stream of independent users arriving over time. single_session: true changes the semantics to one persistent user who has exactly one conversation.
This distinction matters for modeling concurrent users correctly:
- Wrong for 50 concurrent users: one client with
rate_fraction: 1.0andsingle_session: false— this produces sessions sequentially from a single source, not 50 simultaneous sessions. - Right for 50 concurrent users: 50 clients (or a cohort with
population: 50) each withsingle_session: true— each client draws one independent IAT and starts its own session. Per-client RNGs are seeded independently so start times are staggered even when all clients share the same arrival spec.
Token Distributions¶
| Type | Parameters | Behavior |
|---|---|---|
gaussian |
mean, std_dev, min, max |
Normal distribution, clamped to range |
exponential |
mean |
Right-skewed, long tail |
pareto_lognormal |
alpha, xm, mu, sigma, mix_weight |
Heavy-tailed (Pareto-LogNormal mixture) |
constant |
value |
Fixed token count (useful for controlled experiments) |
empirical |
params |
Inline key-value map (token count → probability) |
SLO Classes¶
Requests can be tagged with SLO classes for per-class metric tracking:
| Class | Intended Use |
|---|---|
critical |
Latency-sensitive user-facing requests |
standard |
Normal priority |
sheddable |
Can be dropped under load |
batch |
Offline processing, latency-tolerant |
background |
Lowest priority |
Estimating Capacity for Your Workload¶
CLI mode and YAML mode have different defaults
CLI mode uses --prompt-tokens 512, --output-tokens 512 by default. With the default roofline latency model (Qwen3-14B / H100 / TP=1), a saturated single instance handles ~17 req/s. YAML workloads define their own distributions — a YAML with shorter sequences (e.g., mean=256/128) will have higher per-instance throughput. Don't reuse capacity estimates across modes or models.
Parameter resolution reference
For the complete precedence chain of how CLI flags, workload-spec YAML, and defaults.yaml interact, see Parameter Resolution by Category in the Configuration Reference. See also Common Pitfalls for documented gotchas like --rate vs aggregate_rate.
Multi-Client Composition¶
Use blis compose to merge multiple workload specs into a single spec:
# Merge a chat workload and a batch workload into one combined spec
./blis compose --from chat.yaml --from batch.yaml > combined.yaml
The compose operation:
- Concatenates all client lists from each input spec
- Sums aggregate rates (e.g., 60 req/s + 40 req/s = 100 req/s total)
- Renormalizes
rate_fractionvalues proportionally: each client's merged fraction =original_fraction * (spec_rate / total_rate), preserving absolute request rates
This lets you build complex mixed workloads from reusable single-purpose specs.
Cohort Dynamics¶
Cohorts model populations of similar clients with time-varying traffic patterns. Instead of defining individual clients, you specify a population count and a traffic pattern. BLIS expands each cohort into individual ClientSpec entries.
Three traffic patterns are available:
| Pattern | Behavior | Use Case |
|---|---|---|
diurnal |
Sinusoidal rate modulation over 24 hours (peak_hour, peak_to_trough_ratio) | Day/night traffic cycles |
spike |
Clients active only during [start_time_us, start_time_us + duration_us) |
Flash sales, traffic bursts |
drain |
Linear ramp-down to zero rate over ramp_duration_us |
Graceful shutdown, load shedding |
version: "2"
aggregate_rate: 200
num_requests: 5000
cohorts:
- id: "daytime-users"
population: 50 # Expands to 50 individual clients
slo_class: "critical"
rate_fraction: 0.7
arrival:
process: poisson
input_distribution:
type: gaussian
params: { mean: 256, std_dev: 100, min: 2, max: 2048 }
output_distribution:
type: exponential
params: { mean: 128 }
diurnal:
peak_hour: 14 # Peak at 2 PM
peak_to_trough_ratio: 3.0 # 3x more traffic at peak vs trough
- id: "flash-sale"
population: 20
slo_class: "standard"
rate_fraction: 0.3
arrival:
process: gamma
cv: 2.5
input_distribution:
type: gaussian
params: { mean: 128, std_dev: 50, min: 2, max: 512 }
output_distribution:
type: exponential
params: { mean: 64 }
spike:
start_time_us: 10000000 # Spike starts at 10 seconds
duration_us: 5000000 # Lasts 5 seconds
Multi-Turn Sessions¶
Cohorts support the same advanced client features as explicit clients. The fields below propagate to every expanded member client unchanged.
| Field | Type | Description |
|---|---|---|
reasoning |
object | Enables reasoning workload; requires reason_ratio_distribution (reasoning-to-output token ratio). multi_turn.max_rounds controls conversation depth |
closed_loop |
*bool | null (omitted): true for multi-turn, false for all others. true: each round waits for the previous reply. false: all rounds pre-generated at open-loop arrival times |
timeout |
*int64 | Per-request timeout in µs. null: 300 s default when closed-loop; no deadline when open-loop. 0 = no timeout |
prefix_length |
int | Shared prefix token count prepended to every request |
network |
object | Client-side network RTT and bandwidth simulation |
multimodal |
object | Mixed-modality token generation (text + image/audio/video) |
Example — 50 agentic clients, each running 10-round reasoning sessions with a 2-second think time between rounds:
version: "2"
category: reasoning
aggregate_rate: 5.0
cohorts:
- id: agents
population: 50
rate_fraction: 1.0
arrival:
process: poisson
input_distribution:
type: gaussian
params: { mean: 512, std_dev: 128, min: 64, max: 2048 }
output_distribution:
type: exponential
params: { mean: 256 }
reasoning:
reason_ratio_distribution:
type: constant
params: { value: 50 } # 50% reasoning tokens (0-100 scale)
multi_turn:
max_rounds: 10
think_time_us: 2000000 # 2 seconds between rounds
context_growth: accumulate
closed_loop: true
timeout: 300000000 # 300-second per-request timeout
Note: Do not combine
spike(which creates a short lifecycle window) with multi-turnreasoning(which generates rounds seconds apart). Rounds whose arrival times fall outside the spike window are silently filtered by the lifecycle window filter. Use diurnal or drain patterns — or no lifecycle modifier — with multi-turn cohorts.
Advanced Features¶
Multimodal Requests¶
The multimodal field on a client generates requests with combined text, image, audio, and video tokens. Total input = text + (image tokens x image count) + (audio tokens x audio count) + (video tokens x video count).
clients:
- id: "vision-model"
# ... arrival, rate_fraction, etc.
multimodal:
text_distribution:
type: gaussian
params: { mean: 128, std_dev: 50, min: 2, max: 512 }
image_distribution:
type: constant
params: { value: 576 } # Tokens per image (e.g., ViT patch count)
image_count_distribution:
type: constant
params: { value: 1 } # One image per request
Audio and video follow the same pattern with audio_distribution/audio_count_distribution and video_distribution/video_count_distribution.
Reasoning (Multi-Turn with Context Accumulation)¶
The reasoning field generates multi-turn conversation sessions where each round can accumulate prior context. See the Multi-Turn Conversations section above for a full example. Key fields:
reason_ratio_distribution: fraction of output tokens that represent "reasoning" (sampled as integer percentage, divided by 100)multi_turn.max_rounds: number of conversation rounds per sessionmulti_turn.think_time_us: inter-round delay (user think time, in microseconds)multi_turn.context_growth: controls how each round's input is constructed."accumulate": round N's input = all prior rounds' inputs + all prior rounds' outputs + freshly sampled new user turn. Input length grows linearly with round index, creating expanding KV cache pressure and enabling prefix-affinity routing reuse across rounds — use this for realistic chat or reasoning sessions.""(omit): each round uses only freshly sampled tokens. Input length is stationary across rounds; no cross-round prefix sharing. Use this for agent workloads that do not maintain conversation history.
multi_turn.single_session: iftrue, each client creates exactly one session (useful for modeling persistent chat sessions like inference-perf'senable_multi_turn_chat). Default:false(multiple independent sessions per client)closed_loop: controls whether round scheduling adapts to actual server latency.null(omit): closed-loop by default for multi-turn clients — each subsequent round is generated after the prior round completes, scheduling its arrival atcompletion_time + think_time_us.false: open-loop — all round arrival times are pre-stamped before simulation using a1 µs/output-tokenheuristic; inter-round spacing does not adapt to server latency. Usefalseto reproduce inference-perf-style pre-generated workloads.
Client-Side Network Latency¶
The network field adds client-perspective latency to server-side metrics. Useful for modeling geographically distributed users:
clients:
- id: "remote-user"
# ... arrival, rate_fraction, etc.
network:
rtt_ms: 50 # Round-trip time in milliseconds
bandwidth_mbps: 100 # Link bandwidth (affects upload/download delay)
Client TTFT = server TTFT + RTT + upload delay. Client E2E = server E2E + RTT + upload delay + download delay. Upload/download delays are computed from token counts (4 bytes per token ID).
Built-in Presets and Examples¶
Named Presets from defaults.yaml¶
BLIS ships with preset workload profiles in defaults.yaml. Use them with blis run, blis observe, or the convert command:
# Run simulation with a named preset
./blis run --model qwen/qwen3-14b --workload chatbot --rate 10
# Observe a real server with the same preset (identical token distributions as run)
./blis observe --server-url http://localhost:8000 --model qwen/qwen3-14b \
--workload chatbot --rate 10 --num-requests 100 \
--trace-header trace.yaml --trace-data trace.csv
# Convert a preset to a v2 WorkloadSpec YAML for customization
./blis convert preset --name chatbot --rate 10 --num-requests 100 > chatbot.yaml
Using the same preset for both run and observe ensures the observe→replay→calibrate pipeline compares identical workload shapes — eliminating workload skew as a calibration variable.
Available presets from defaults.yaml:
| Preset | Prompt Mean | Output Mean | Description |
|---|---|---|---|
chatbot |
256 | 256 | Interactive chat with moderate token lengths |
contentgen |
1024 | 1024 | Content generation with balanced I/O |
summarization |
4096 | 512 | Long-document summarization (high input, moderate output) |
multidoc |
10240 | 1536 | Multi-document processing (very long inputs) |
Scenario Presets (Programmatic)¶
The sim/workload/scenarios.go module provides scenario functions for common workload patterns. These are used internally by the simulator and in hypothesis experiments:
| Scenario | Function | Key Characteristics |
|---|---|---|
| Bursty traffic | ScenarioBurstyTraffic |
Gamma CV=3.5, exponential tokens, batch SLO |
| Unfair tenants | ScenarioUnfairTenants |
90% low-priority batch + 10% high-priority critical |
| Prefix-heavy | ScenarioPrefixHeavy |
80% shared-prefix + 20% unique, tests prefix caching |
| Mixed SLO | ScenarioMixedSLO |
Equal mix of critical/standard/batch classes |
Example Files¶
BLIS ships with example workload specs in examples/:
| File | Description |
|---|---|
multiturn-chat-demo.yaml |
Multi-turn chat with prefix-affinity routing |
prefix-affinity-demo.yaml |
Shared-prefix workload for cache testing |
servegen-language.yaml |
ServeGen-derived language workload |
inference-perf-shared-prefix.yaml |
inference-perf format compatibility |
Further Reading¶
- Workload Spec Schema — complete field reference
- Configuration Reference — all workload flags