Workload Spec Schema

Complete YAML schema reference for BLIS workload specifications (--workload-spec). For a guide-level introduction, see Workload Specifications.

Top-Level Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| version | string | No | Schema version ("2" recommended; "1" auto-upgraded) |
| seed | int64 | No | RNG seed (overridden by CLI --seed if set) |
| category | string | No | language, multimodal, reasoning, or empty |
| aggregate_rate | float64 | Yes | Total arrival rate in requests/second. A value of 0 selects absolute rate mode (see Lifecycle Normalization) |
| num_requests | int64 | No | Total requests to generate (0 = unlimited; bound the run with horizon instead) |
| horizon | int64 | No | Simulation time limit in ticks (overridden by CLI --horizon if set) |
| clients | list | Yes* | Client specifications (see below) |
| cohorts | list | No | Cohort specifications with population dynamics (diurnal, spike, drain patterns) |
| servegen_data | object | No | Native ServeGen data file loading |
| inference_perf | object | No | inference-perf format compatibility |

*At least one client, cohort, or servegen_data is required.

Client Specification

Each entry in the clients list defines a traffic source:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | No | Client identifier (for metrics grouping) |
| tenant_id | string | No | Tenant identifier |
| slo_class | string | No | SLO tier: critical, standard, sheddable, batch, background, or empty |
| model | string | No | Model name override (for multi-model workloads) |
| rate_fraction | float64 | Yes | Fraction of aggregate_rate for this client (must be positive). When lifecycle windows are present, fractions are normalized per-phase (see Lifecycle Normalization) |
| arrival | object | Yes | Arrival process configuration |
| input_distribution | object | Yes | Input token length distribution |
| output_distribution | object | Yes | Output token length distribution |
| prefix_group | string | No | Prefix group name (requests in same group share prefixes) |
| prefix_length | int | No | Shared prefix token count (additive to input_distribution) |
| streaming | bool | No | Whether to simulate streaming output |
| network | object | No | Client-side network characteristics |
| lifecycle | object | No | Activity window configuration |
| multimodal | object | No | Multimodal token generation |
| reasoning | object | No | Reasoning multi-turn behavior |
| timeout | int64 | No | Per-request timeout in µs. nil = default (300s for sessions). 0 = no timeout |
| slo_target_us | int64 | No | Per-request SLO TTFT target in µs. nil/0 = no target. Used by --dispatch-order slo-deadline |
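
As a quick orientation, the following is a minimal sketch of a spec that sets the required top-level and client fields plus a few optional ones. All values are illustrative assumptions, not recommended defaults.

```yaml
# Minimal sketch: required fields plus version, num_requests, and id.
# Numbers are illustrative only.
version: "2"
aggregate_rate: 10.0          # requests/second across all clients
num_requests: 1000            # optional: stop after 1000 requests

clients:
  - id: "baseline"            # optional, used for metrics grouping
    rate_fraction: 1.0        # this client receives all of aggregate_rate
    arrival:
      process: poisson
    input_distribution:
      type: constant
      params:
        value: 256            # input tokens per request
    output_distribution:
      type: constant
      params:
        value: 64             # output tokens per request
```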

Arrival Process

| Field | Type | Values | Description |
| --- | --- | --- | --- |
| process | string | poisson, gamma, weibull, constant | Inter-arrival time distribution |
| cv | *float64 | Required for gamma and weibull | Coefficient of variation (burstiness). CV > 1 = bursty, CV < 1 = regular |
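
To make the role of cv concrete, here is a sketch of three client fragments that differ only in their arrival block. The client ids and cv values are illustrative assumptions, and the other required client fields are omitted for brevity, so this fragment is not a complete spec.

```yaml
# Fragment: rate_fraction, input_distribution and output_distribution
# are omitted here for brevity.
clients:
  - id: "steady"
    arrival:
      process: poisson        # no cv needed
  - id: "bursty"
    arrival:
      process: gamma
      cv: 2.5                 # CV > 1: bursty inter-arrival times
  - id: "regular"
    arrival:
      process: weibull
      cv: 0.5                 # CV < 1: more regular; Weibull cv must be in [0.01, 10.4]
```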

Distribution Specification

Used for input_distribution and output_distribution:

| Field | Type | Description |
| --- | --- | --- |
| type | string | gaussian, exponential, pareto_lognormal, lognormal, constant, empirical |
| params | map | Type-specific parameters (see below) |
| file | string | Reserved for future use (file-based loading not yet implemented). Use inline params instead. |

Distribution Parameters

| Type | Parameters |
| --- | --- |
| gaussian | mean, std_dev, min, max |
| exponential | mean |
| pareto_lognormal | alpha, xm, mu, sigma, mix_weight |
| lognormal | mu, sigma (mean and standard deviation of log-transformed values; fitted via method of moments) |
| constant | value |
| empirical | inline params map (key=token count, value=probability) |
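
The parameter lists above can be sketched as inline distribution blocks. The fragment below shows two hypothetical clients using four of the types; the numbers are illustrative. For the empirical type, quoting the token-count keys and making the probabilities sum to 1 are assumptions, since the table only states the key/value meaning.

```yaml
# Fragment: rate_fraction and arrival are omitted for brevity.
clients:
  - id: "long-tail"
    input_distribution:
      type: lognormal
      params:
        mu: 5.0               # mean of log(tokens); median is e^5, about 148 tokens
        sigma: 0.8            # standard deviation of log(tokens)
    output_distribution:
      type: exponential
      params:
        mean: 200
  - id: "bucketed"
    input_distribution:
      type: empirical
      params:                 # key = token count, value = probability
        "128": 0.5            # quoted keys are an assumption
        "512": 0.3
        "2048": 0.2
    output_distribution:
      type: constant
      params:
        value: 64
```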

Network Specification

| Field | Type | Description |
| --- | --- | --- |
| rtt_ms | float64 | Round-trip time in milliseconds |
| bandwidth_mbps | float64 | Bandwidth in Mbps |
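
A network block sits inside a client entry. The sketch below uses illustrative values for a hypothetical mobile client; other required client fields are omitted.

```yaml
# Fragment of one client entry.
clients:
  - id: "mobile-users"
    streaming: true
    network:
      rtt_ms: 80.0            # round-trip time in milliseconds
      bandwidth_mbps: 20.0    # link bandwidth in Mbps
```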

Reasoning Specification

| Field | Type | Description |
| --- | --- | --- |
| reason_ratio_distribution | DistSpec | Distribution of reasoning-to-output ratio |
| multi_turn | object | Multi-turn conversation configuration |
| multi_turn.max_rounds | int | Maximum conversation rounds |
| multi_turn.think_time_us | int64 | User think time between rounds (microseconds) |
| multi_turn.context_growth | string | accumulate (prepend prior context) or empty (fixed-length) |
| multi_turn.single_session | bool | If true, each client creates exactly one session instead of spawning new sessions per arrival. Used by inference-perf multi-turn expansion. Default: false |
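
As a sketch of the fixed-length, single-session variant described in the table, a client's reasoning block might look as follows. The ratio and timing values are illustrative assumptions.

```yaml
# Fragment of one client entry.
reasoning:
  reason_ratio_distribution:
    type: constant
    params:
      value: 0.5              # illustrative reasoning-to-output ratio
  multi_turn:
    max_rounds: 3
    think_time_us: 2000000    # 2 s of user think time between rounds
    single_session: true      # one session per client
    # context_growth is left empty, so inputs stay fixed-length
```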

Cohort Specification

Each entry in the cohorts list defines a population with lifecycle dynamics. Cohorts expand into individual clients with lifecycle windows derived from diurnal, spike, or drain patterns.

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | No | Cohort identifier |
| population | int | Yes | Number of clients in this cohort (max 100,000) |
| tenant_id | string | No | Tenant identifier |
| slo_class | string | No | SLO tier: critical, standard, sheddable, batch, background |
| model | string | No | Model name override |
| arrival | object | Yes | Arrival process configuration (same as Client) |
| input_distribution | object | Yes | Input token length distribution |
| output_distribution | object | Yes | Output token length distribution |
| prefix_group | string | No | Prefix group name |
| streaming | bool | No | Whether to simulate streaming output |
| rate_fraction | float64 | Yes | Fraction of aggregate_rate for each client in this cohort |
| diurnal | object | No | Sinusoidal rate modulation (see below) |
| spike | object | No | Traffic spike configuration (see below) |
| drain | object | No | Linear ramp-down to zero (see below) |
| timeout | int64 | No | Per-request timeout in µs (same as Client) |
| slo_target_us | int64 | No | Per-request SLO TTFT target in µs (same as Client) |

Diurnal Pattern

| Field | Type | Description |
| --- | --- | --- |
| peak_hour | int | Hour of peak traffic (0-23) |
| peak_to_trough_ratio | float64 | Ratio of peak to trough rate (≥ 1.0) |

Spike Pattern

| Field | Type | Description |
| --- | --- | --- |
| start_time_us | int64 | Spike start time in microseconds |
| duration_us | int64 | Spike duration in microseconds |
| trace_rate | float64 | Cohort-level arrival rate in req/s (required when aggregate_rate: 0); divided evenly across population members |

Drain Pattern

| Field | Type | Description |
| --- | --- | --- |
| start_time_us | int64 | Drain start time in microseconds |
| ramp_duration_us | int64 | Ramp-down duration in microseconds |
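
The cohort fields and the three patterns above can be sketched together as follows. Each cohort uses a single pattern because this reference does not state how patterns combine; ids, populations, and timings are illustrative assumptions.

```yaml
aggregate_rate: 200.0

cohorts:
  - id: "daytime-users"
    population: 50
    rate_fraction: 0.01       # per client in the cohort
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    diurnal:
      peak_hour: 14           # peak at 14:00
      peak_to_trough_ratio: 3.0

  - id: "launch-burst"
    population: 10
    rate_fraction: 0.02
    arrival: {process: gamma, cv: 2.0}
    input_distribution: {type: constant, params: {value: 512}}
    output_distribution: {type: constant, params: {value: 128}}
    spike:
      start_time_us: 60000000     # spike starts at 60 s
      duration_us: 30000000       # and lasts 30 s

  - id: "retiring-tenant"
    population: 5
    rate_fraction: 0.02
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    drain:
      start_time_us: 120000000    # ramp-down begins at 120 s
      ramp_duration_us: 60000000  # reaches zero 60 s later
```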

Lifecycle Specification

Activity window configuration for clients (used in the lifecycle field of Client Specification). Cohort patterns (diurnal, spike, drain) are converted into lifecycle windows internally.

| Field | Type | Description |
| --- | --- | --- |
| windows | list | List of active time windows |

Active Window

| Field | Type | Description |
| --- | --- | --- |
| start_us | int64 | Window start time in microseconds |
| end_us | int64 | Window end time in microseconds |
| trace_rate | float64 | Per-window rate override. In absolute rate mode (aggregate_rate: 0), this is the absolute arrival rate (req/s). In proportional mode, this is a weight for rate allocation. |
| arrival | ArrivalSpec | Per-window arrival process override (overrides client-level arrival) |
| input_distribution | DistSpec | Per-window input token distribution override (overrides client-level input_distribution) |
| output_distribution | DistSpec | Per-window output token distribution override (overrides client-level output_distribution) |
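
A sketch of a client with two activity windows, where the second window overrides the arrival process and the output distribution. The client id and all values are illustrative assumptions.

```yaml
aggregate_rate: 20.0

clients:
  - id: "day-night"
    rate_fraction: 1.0
    arrival: {process: poisson}                 # client-level default
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    lifecycle:
      windows:
        - start_us: 0
          end_us: 60000000                      # active 0-60 s with client-level settings
        - start_us: 120000000
          end_us: 180000000                     # active 120-180 s
          arrival:                              # override for this window only
            process: gamma
            cv: 2.0
          output_distribution:                  # longer outputs in this window
            type: constant
            params: {value: 256}
```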

Lifecycle Normalization

Proportional mode (aggregate_rate > 0): When clients have lifecycle windows, rate_fraction values are normalized per-phase rather than globally. For each client, the simulator sums the rate_fraction of all co-active clients (those whose lifecycle windows overlap, including the client itself) and divides the client's own rate_fraction by that sum. The client's rate is aggregate_rate multiplied by the resulting share. This ensures aggregate_rate is achieved during every active phase.

Absolute rate mode (aggregate_rate = 0): Each window's trace_rate is used directly as the arrival rate (requests/second) for that window, without scaling or normalization. This mode preserves time-varying aggregate load patterns from traces (e.g., ServeGen workloads) where the aggregate rate varies over time. Validation requires all rate-based clients to have explicit trace_rate on every window.

Clients without lifecycle windows are "always-on" and are counted as co-active with every phase.

Example: A two-phase workload with aggregate_rate: 40:

  • Phase 1 (0–50s): clients A (rate_fraction: 0.7) and B (rate_fraction: 0.3)
  • Phase 2 (50–100s): client C (rate_fraction: 1.0)

Each phase's fractions are normalized independently: A gets 40 × 0.7/1.0 = 28 req/s, B gets 40 × 0.3/1.0 = 12 req/s, C gets 40 × 1.0/1.0 = 40 req/s. Both phases produce the full 40 req/s.

Without per-phase normalization, the global sum would be 2.0, and every client's rate would be halved.
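
The two-phase example above could be written as the following spec sketch. The distributions are placeholders, and treating the shared 50 s boundary as non-overlapping is an assumption about how window ends are interpreted.

```yaml
aggregate_rate: 40.0

clients:
  - id: "A"                   # phase 1: 40 * 0.7 / (0.7 + 0.3) = 28 req/s
    rate_fraction: 0.7
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    lifecycle:
      windows:
        - {start_us: 0, end_us: 50000000}

  - id: "B"                   # phase 1: 40 * 0.3 / (0.7 + 0.3) = 12 req/s
    rate_fraction: 0.3
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    lifecycle:
      windows:
        - {start_us: 0, end_us: 50000000}

  - id: "C"                   # phase 2: 40 * 1.0 / 1.0 = 40 req/s
    rate_fraction: 1.0
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    lifecycle:
      windows:
        - {start_us: 50000000, end_us: 100000000}
```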

Limitation: Always-on clients compute a single rate using co-active sums across all phases they overlap with. When an always-on client coexists with multiple non-overlapping phased clients, per-phase totals may be less than aggregate_rate. For predictable results, use either all-phased or all-always-on clients.
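
For contrast, absolute rate mode as described earlier in this section might be sketched as below: aggregate_rate is 0 and every window carries an explicit trace_rate. Keeping rate_fraction is an assumption based on the client table listing it as required; the client id and rates are illustrative.

```yaml
aggregate_rate: 0             # absolute rate mode

clients:
  - id: "trace-replay"
    rate_fraction: 1.0        # listed as required; assumed not to scale trace_rate here
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
    lifecycle:
      windows:
        - start_us: 0
          end_us: 30000000
          trace_rate: 12.5    # 12.5 req/s during 0-30 s, used as-is
        - start_us: 30000000
          end_us: 60000000
          trace_rate: 40.0    # 40 req/s during 30-60 s, used as-is
```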

Multimodal Specification

Configures multimodal request generation (used in the multimodal field of Client Specification). Each distribution follows the same Distribution Specification format.

| Field | Type | Description |
| --- | --- | --- |
| text_distribution | DistSpec | Text token distribution |
| image_distribution | DistSpec | Image token distribution |
| image_count_distribution | DistSpec | Number of images per request |
| audio_distribution | DistSpec | Audio token distribution |
| audio_count_distribution | DistSpec | Number of audio segments per request |
| video_distribution | DistSpec | Video token distribution |
| video_count_distribution | DistSpec | Number of video segments per request |
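
A sketch of a text-plus-image client follows. The token counts are illustrative assumptions, and how the multimodal distributions interact with the client-level input_distribution is not covered in this reference, so both are shown.

```yaml
category: multimodal
aggregate_rate: 5.0

clients:
  - id: "image-qa"
    rate_fraction: 1.0
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 128}}
    output_distribution: {type: constant, params: {value: 64}}
    multimodal:
      text_distribution:
        type: gaussian
        params: {mean: 128, std_dev: 32, min: 16, max: 512}
      image_distribution:           # tokens per image
        type: constant
        params: {value: 576}
      image_count_distribution:     # images per request
        type: constant
        params: {value: 2}
```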

ServeGen Data Specification

Native ServeGen data file loading (used in the servegen_data top-level field):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| path | string | Yes | Path to ServeGen data directory (containing chunk-*-trace.csv and dataset.json) |
| time_window | string | No | Temporal snapshot extraction: midnight (0:00-0:30), morning (8:00-8:30), or afternoon (14:00-14:30). Filters chunks to the specified 30-minute window. |
| span_start | int64 | No | Trace span start filter (microseconds) |
| span_end | int64 | No | Trace span end filter (microseconds) |
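
A sketch of the servegen_data block alone. The directory path is hypothetical, and the remaining top-level fields are omitted.

```yaml
# Fragment: only the servegen_data block is shown.
servegen_data:
  path: "./data/servegen"     # hypothetical directory with chunk-*-trace.csv and dataset.json
  time_window: morning        # keep only the 8:00-8:30 snapshot
  # span_start / span_end (microseconds) are the other documented filters
```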

InferencePerf Specification

inference-perf format compatibility (used in the inference_perf top-level field):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| stages | list | Yes | Rate/duration stages for load patterns |
| shared_prefix | object | Yes | Shared prefix expansion configuration |

Stage

| Field | Type | Description |
| --- | --- | --- |
| rate | float64 | Requests per second for this stage |
| duration | int64 | Stage duration in seconds (note: unlike other time fields which use microseconds, this field uses seconds) |

Shared Prefix

| Field | Type | Description |
| --- | --- | --- |
| num_unique_system_prompts | int | Number of unique system prompts |
| num_users_per_system_prompt | int | Users per system prompt |
| system_prompt_len | int | System prompt length in tokens |
| question_len | int | Question length in tokens |
| output_len | int | Output length in tokens |
| enable_multi_turn_chat | bool | When true, maps to BLIS reasoning.multi_turn with SingleSession mode and fixed-length inputs (no context accumulation). Computes MaxRounds and ThinkTimeUs from stage parameters. See #514. |
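
A sketch of an inference_perf block with a two-stage ramp. All numbers are illustrative assumptions, and other top-level fields are omitted.

```yaml
# Fragment: only the inference_perf block is shown.
inference_perf:
  stages:
    - rate: 5.0               # requests/second
      duration: 60            # seconds, unlike the microsecond fields elsewhere
    - rate: 20.0
      duration: 120
  shared_prefix:
    num_unique_system_prompts: 4
    num_users_per_system_prompt: 8
    system_prompt_len: 512    # tokens
    question_len: 128         # tokens
    output_len: 64            # tokens
    enable_multi_turn_chat: false
```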

Complete Example

version: "2"
seed: 42
category: reasoning
aggregate_rate: 500.0
num_requests: 500

clients:
  - id: "multi-turn-chat"
    tenant_id: "chat-users"
    slo_class: "standard"
    rate_fraction: 1.0
    streaming: true
    arrival:
      process: poisson
    input_distribution:
      type: gaussian
      params:
        mean: 128
        std_dev: 30
        min: 32
        max: 512
    output_distribution:
      type: gaussian
      params:
        mean: 64
        std_dev: 20
        min: 16
        max: 256
    reasoning:
      reason_ratio_distribution:
        type: gaussian
        params:
          mean: 0
          std_dev: 0
          min: 0
          max: 0
      multi_turn:
        max_rounds: 5
        think_time_us: 500000
        context_growth: accumulate

Validation

BLIS validates workload specs with strict YAML parsing (KnownFields(true)), so a misspelled or unknown field name is rejected with an error instead of being silently ignored. Additional validation:

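  • aggregate_rate must be positive, or 0 to select absolute rate mode (every lifecycle window of a rate-based client then needs an explicit trace_rate)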
  • Each client's rate_fraction must be positive
  • arrival.process must be one of the valid processes
  • cv for gamma/weibull must be finite and positive
  • Weibull cv must be in [0.01, 10.4]
  • Distribution types must be recognized
  • All numeric params must be finite (no NaN or Inf)
  • At least one client, cohort, or servegen_data is required
  • Cohort population must be positive and ≤ 100,000
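
To illustrate the rules above, the following deliberately invalid sketch marks one violation per commented line. It does not show BLIS error messages, and which violation is reported first is not specified here.

```yaml
# Deliberately invalid: do not use as a template.
aggregate_rate: 10.0

clients:
  - id: "bad-client"
    rate_fracton: 1.0         # typo in field name: rejected by strict parsing
    arrival:
      process: weibull
      cv: 20.0                # outside the Weibull range [0.01, 10.4]
    input_distribution:
      type: uniform           # not a recognized distribution type
      params: {value: 256}
    output_distribution:
      type: exponential
      params: {mean: .inf}    # numeric params must be finite

cohorts:
  - id: "too-big"
    population: 200000        # exceeds the 100,000 cap
    rate_fraction: 0.001
    arrival: {process: poisson}
    input_distribution: {type: constant, params: {value: 256}}
    output_distribution: {type: constant, params: {value: 64}}
```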