
Hypothesis-Driven Experimentation

This guide covers using BLIS as a platform for rigorous, reproducible experiments. Because BLIS is a deterministic discrete-event simulator (same seed → byte-identical output), you can run controlled experiments that are impossible with real hardware.

# Quick example: compare chunked prefill vs no chunked prefill

# Baseline: no chunked prefill (threshold 0)
./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --long-prefill-token-threshold 0

# Treatment: chunked prefill enabled at a 256-token threshold
./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --long-prefill-token-threshold 256

Why Experiment with a Simulator?

Real GPU benchmarks suffer from noise: wall-clock jitter, OS scheduling, GPU thermal throttling, network variability. With BLIS:

  • Deterministic replay — change exactly one variable, attribute all output differences to that change (a determinism check is sketched after this list)
  • No hardware cost — run thousands of configurations on a laptop
  • Controlled conditions — isolate the effect of a single parameter while holding everything else constant
  • Reproducible — share your seed, workload spec, and CLI flags; anyone can reproduce your results
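
For example, deterministic replay can be verified directly. The sketch below assumes ./blis run prints its results to stdout; the redirect paths and seed are illustrative:

# Run the identical configuration twice with the same seed
./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --seed 42 > run_a.txt

./blis run --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --seed 42 > run_b.txt

# Byte-identical output confirms the deterministic-replay property
diff run_a.txt run_b.txt && echo "runs are identical"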

Capacity Planning Validation

The most common experiment workflow for platform engineers:

  1. Define your deployment: model, GPU, TP, instance count
  2. Define your workload: arrival rate, token distributions (from production logs if available)
  3. Define your SLO: TTFT p99 < 200ms, E2E p99 < 5s, etc.
  4. Run the simulation with these parameters
  5. Interpret: Does the simulated TTFT p99 meet your SLO? If not, add instances or tune routing (a check is sketched after the example below).

# Example: Will 8 instances of Qwen3 14B handle 20 req/s at TTFT p99 < 500ms?
./blis run --model qwen/qwen3-14b \
  --num-instances 8 --rate 20 --num-requests 200 \
  --routing-policy weighted --seed 42
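
A minimal sketch of step 5, assuming the run's metrics were captured as JSON (for example via the harness's blis_run, described below) and expose a ttft_p99_ms field; the results path, field name, and the use of jq and bc are all assumptions, so adapt them to the actual BLIS output schema:

# Hypothetical results path and field name; adjust to the real output format
TTFT_P99=$(jq '.ttft_p99_ms' results/capacity.json)

# Compare against the 500ms SLO from the example above (requires jq and bc)
if (( $(echo "$TTFT_P99 <= 500" | bc -l) )); then
  echo "PASS: TTFT p99 ${TTFT_P99}ms meets the 500ms SLO"
else
  echo "FAIL: TTFT p99 ${TTFT_P99}ms misses the SLO; add instances or tune routing"
fi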

The /hypothesis-experiment Skill

For structured hypothesis-driven research, BLIS includes a guided experimentation workflow via the /hypothesis-experiment Claude Code skill:

/hypothesis-experiment

This skill guides you through:

  1. Formulate — state a testable prediction (e.g., "chunked prefill reduces short-request TTFT p99 by > 30%")
  2. Classify — identify the hypothesis family (scheduler invariants, performance-regime, etc.)
  3. Design — specify parameters, controls, success criteria
  4. Implement — create run.sh (experiment script) and analyze.py (analysis script); a minimal run.sh sketch follows this list
  5. Run — execute the experiment
  6. Analyze — parse results, compute statistics
  7. Document — write FINDINGS.md with conclusions and evidence
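
As a concrete example of step 4, here is a hedged sketch of a minimal run.sh for the chunked-prefill hypothesis above. It uses the shared harness described in the next section; the output paths and seed are illustrative:

#!/usr/bin/env bash
# Minimal experiment script: baseline vs chunked prefill (sketch)
set -euo pipefail

source hypotheses/lib/harness.sh
mkdir -p results

# Baseline: no chunked prefill
blis_run 60 results/baseline.json \
  --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --seed 42 \
  --long-prefill-token-threshold 0

# Treatment: chunk long prefills at a 256-token threshold
blis_run 60 results/chunked.json \
  --model qwen/qwen3-14b \
  --rate 100 --num-requests 500 --seed 42 \
  --long-prefill-token-threshold 256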

The Experiment Harness

All experiments use a shared harness for consistency. The harness lives in the hypothesis-archive branch — copy it into your experiment branch at Step 4:

# Copy harness from archive branch into your experiment branch
git show hypothesis-archive:hypotheses/lib/harness.sh > hypotheses/lib/harness.sh
git show hypothesis-archive:hypotheses/lib/analyze_helpers.py > hypotheses/lib/analyze_helpers.py

source hypotheses/lib/harness.sh

# Run a simulation with standard setup
blis_run 60 results/baseline.json \
  --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500

The harness provides:

  • blis_run() — wrapper around the simulation binary
  • setup_experiment() — create output directories
  • preflight_kv_check() — verify KV configuration
  • hypotheses/lib/analyze_helpers.py — common analysis functions (parse_blis_output(), etc.)

Note: The archive is pinned at commit cad4191. Verify harness output format matches the current CLI before use — see Step 4 in Hypothesis Process.
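
One way to act on this note is a small smoke run before the real experiment. This is a sketch; the request count and /tmp output path are chosen only for illustration:

source hypotheses/lib/harness.sh

# Tiny run to confirm the archived harness still captures current CLI output
blis_run 60 /tmp/harness_smoke.json \
  --model qwen/qwen3-14b \
  --rate 10 --num-requests 10 --seed 42

head /tmp/harness_smoke.json   # inspect the captured metrics format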

Case Studies

Completed experiments demonstrate the power of hypothesis-driven analysis:

  • H7 (Horizontal Scaling) — Finding: TTFT p99 scales 7.4x (not 2x) when doubling instances near saturation. Impact: super-linear benefit from reducing the queue growth rate.
  • H27 (Chunked Prefill) — Finding: --long-prefill-token-threshold=256 reduces short-request TTFT p99 by 52%. Impact: ITL is unaffected; chunked prefill benefits scheduling, not decode.
  • H29 (Snapshot Staleness) — Finding: --snapshot-refresh-interval 100ms degrades TTFT p99 by +354% for the kv-utilization scorer. Impact: safe zone is < 5ms; the composite scorer mitigates ~99%.
  • H20 (Heavy-Tailed) — Finding: ParetoLogNormal produces fewer preemptions than Gaussian despite similar means. Impact: distribution median, not mean, drives KV pressure.

All findings are documented in hypotheses/*/FINDINGS.md in their respective experiment branches. Completed experiments are also collected in the hypothesis-archive branch.

Convergence Review

Experiments go through a multi-perspective review process to ensure rigor:

  1. Design Review (5 perspectives) — validates hypothesis quality and experiment design
  2. Code Review (5 perspectives) — checks run.sh/analyze.py for correctness
  3. FINDINGS Review (10 perspectives) — validates conclusions against evidence

The /convergence-review skill automates this process. An experiment is converged when the review yields zero CRITICAL and zero IMPORTANT findings.
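
The skill is invoked the same way as the experimentation skill above:

/convergence-review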

Further Reading