# Hypothesis-Driven Experimentation
This guide covers using BLIS as a platform for rigorous, reproducible experiments. Because BLIS is a deterministic discrete-event simulator (same seed → byte-identical output), you can run controlled experiments that are impossible with real hardware.
```shell
# Quick example: compare chunked prefill vs no chunked prefill
./blis run --model qwen/qwen3-14b \
    --rate 100 --num-requests 500 --long-prefill-token-threshold 0
./blis run --model qwen/qwen3-14b \
    --rate 100 --num-requests 500 --long-prefill-token-threshold 256
```
## Why Experiment with a Simulator?
Real GPU benchmarks suffer from noise: wall-clock jitter, OS scheduling, GPU thermal throttling, network variability. With BLIS:
- Deterministic replay — change exactly one variable, attribute all output differences to that change
- No hardware cost — run thousands of configurations on a laptop
- Controlled conditions — isolate the effect of a single parameter while holding everything else constant
- Reproducible — share your seed, workload spec, and CLI flags; anyone can reproduce your results
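The determinism guarantee is what makes single-variable attribution possible. As a toy illustration of the principle (not BLIS internals), a discrete-event loop whose randomness flows from one seed always replays the same trace:

```python
import random

def simulate(seed, rate=100.0, num_requests=5):
    """Toy discrete-event arrival process: same seed -> identical trace."""
    rng = random.Random(seed)           # all randomness derives from one seed
    clock, trace = 0.0, []
    for req_id in range(num_requests):
        clock += rng.expovariate(rate)  # Poisson inter-arrival times
        trace.append((req_id, round(clock, 9)))
    return trace

# Two runs with the same seed are identical; a different seed diverges.
assert simulate(seed=42) == simulate(seed=42)
assert simulate(seed=42) != simulate(seed=7)
```

Because every output byte is a pure function of the seed and configuration, any difference between two runs is attributable to the one parameter you changed.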
## Capacity Planning Validation
The most common experiment workflow for platform engineers:
1. Define your deployment: model, GPU, TP, instance count
2. Define your workload: arrival rate, token distributions (from production logs if available)
3. Define your SLO: TTFT p99 < 200ms, E2E p99 < 5s, etc.
4. Run the simulation with these parameters
5. Interpret: Does the simulated TTFT p99 meet your SLO? If not, add instances or tune routing.
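The final step boils down to comparing percentiles against thresholds. A minimal sketch, assuming per-request TTFT and E2E latencies have already been extracted from the simulator's output (the SLO numbers match the example above; nothing here is the real BLIS analysis code):

```python
import math

def p99(values):
    """Nearest-rank p99: the value at rank ceil(0.99 * n), 1-indexed."""
    ranked = sorted(values)
    return ranked[math.ceil(0.99 * len(ranked)) - 1]

def meets_slo(ttft_ms, e2e_ms, ttft_slo_ms=200.0, e2e_slo_ms=5000.0):
    """Check the example SLO: TTFT p99 < 200ms and E2E p99 < 5s."""
    return p99(ttft_ms) < ttft_slo_ms and p99(e2e_ms) < e2e_slo_ms

# Illustrative latencies, e.g. parsed from a simulation results file.
ttft = [80 + i * 0.5 for i in range(200)]   # p99 = 178.5 ms
e2e = [1200 + i * 10 for i in range(200)]   # p99 = 3170 ms
print(meets_slo(ttft, e2e))  # -> True
```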
```shell
# Example: Will 8 instances of Qwen3 14B handle 20 req/s at TTFT p99 < 500ms?
./blis run --model qwen/qwen3-14b \
    --num-instances 8 --rate 20 --num-requests 200 \
    --routing-policy weighted --seed 42
```
## The /hypothesis-experiment Skill
For structured hypothesis-driven research, BLIS includes a guided experimentation workflow via the /hypothesis-experiment Claude Code skill, which walks you through seven steps:
1. Formulate — state a testable prediction (e.g., "chunked prefill reduces short-request TTFT p99 by > 30%")
2. Classify — identify the hypothesis family (scheduler invariants, performance-regime, etc.)
3. Design — specify parameters, controls, and success criteria
4. Implement — create `run.sh` (experiment script) and `analyze.py` (analysis script)
5. Run — execute the experiment
6. Analyze — parse results, compute statistics
7. Document — write FINDINGS.md with conclusions and evidence
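The analysis step for the example hypothesis above reduces to a relative-reduction check. A minimal sketch of what an analysis script might compute (the real skill-generated `analyze.py` is more thorough; the numbers below are illustrative only):

```python
def relative_reduction(baseline, treatment):
    """Fractional reduction of treatment vs baseline (0.30 == 30%)."""
    return (baseline - treatment) / baseline

def hypothesis_holds(baseline_p99_ms, treatment_p99_ms, threshold=0.30):
    """True if short-request TTFT p99 dropped by more than the threshold."""
    return relative_reduction(baseline_p99_ms, treatment_p99_ms) > threshold

# Illustrative numbers: baseline 410 ms vs 197 ms with chunked prefill.
reduction = relative_reduction(410.0, 197.0)
print(f"reduction: {reduction:.1%}")   # ~52.0%
print(hypothesis_holds(410.0, 197.0))  # True: 52% > 30%
```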
## The Experiment Harness
All experiments use a shared harness for consistency. The harness lives in the hypothesis-archive branch — copy it into your experiment branch at Step 4:
```shell
# Copy harness from archive branch into your experiment branch
git show hypothesis-archive:hypotheses/lib/harness.sh > hypotheses/lib/harness.sh
git show hypothesis-archive:hypotheses/lib/analyze_helpers.py > hypotheses/lib/analyze_helpers.py
source hypotheses/lib/harness.sh

# Run a simulation with standard setup
blis_run 60 results/baseline.json \
    --model qwen/qwen3-14b \
    --num-instances 4 --rate 100 --num-requests 500
```
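Internally, a wrapper like blis_run needs little more than a timeout, directory setup, and output redirection. A minimal sketch of such a wrapper (the real harness adds preflight checks and error handling; `BLIS_BIN` is a hypothetical override used here for illustration, not a real harness variable):

```shell
# Sketch of a blis_run-style wrapper: blis_run_sketch TIMEOUT_S OUT_FILE [FLAGS...]
# BLIS_BIN is a hypothetical override for testing; defaults to ./blis.
blis_run_sketch() {
    timeout_s="$1"; out_file="$2"; shift 2
    mkdir -p "$(dirname "$out_file")"                         # ensure results/ exists
    timeout "$timeout_s" "${BLIS_BIN:-./blis}" run "$@" > "$out_file"
}

# Usage mirrors the harness call above:
# blis_run_sketch 60 results/baseline.json --model qwen/qwen3-14b --rate 100
```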
The harness provides:
- blis_run() — wrapper around the simulation binary
- setup_experiment() — create output directories
- preflight_kv_check() — verify KV configuration
- hypotheses/lib/analyze_helpers.py — common analysis functions (parse_blis_output(), etc.)
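In the spirit of parse_blis_output(), a results parser might pull per-request latencies out of a JSON file. The schema below (a top-level `requests` list with `ttft_ms` fields) is an assumption for illustration; check the actual output format of your BLIS version, as the pinned-archive note below advises:

```python
import json
import os
import tempfile

def parse_ttfts(path):
    """Extract per-request TTFT values from a results file.
    Assumes a top-level "requests" list with "ttft_ms" fields
    (illustrative schema, not the guaranteed BLIS format)."""
    with open(path) as f:
        results = json.load(f)
    return [req["ttft_ms"] for req in results["requests"]]

# Example with a synthetic results file:
sample = {"requests": [{"ttft_ms": 81.2}, {"ttft_ms": 95.7}, {"ttft_ms": 150.3}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
print(parse_ttfts(f.name))  # [81.2, 95.7, 150.3]
os.remove(f.name)
```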
Note: The archive is pinned at commit cad4191. Verify that the harness output format matches the current CLI before use — see Step 4 in Hypothesis Process.
## Case Studies
Completed experiments demonstrate the power of hypothesis-driven analysis:
| Experiment | Finding | Impact |
|---|---|---|
| H7 (Horizontal Scaling) | TTFT p99 scales 7.4x (not 2x) when doubling instances near saturation | Super-linear benefit from queue growth rate reduction |
| H27 (Chunked Prefill) | `--long-prefill-token-threshold=256` reduces short-request TTFT p99 by 52% | But ITL is unaffected — chunked prefill benefits scheduling, not decode |
| H29 (Snapshot Staleness) | `--snapshot-refresh-interval 100ms` degrades TTFT p99 by +354% for the kv-utilization scorer | Safe zone < 5ms; composite scorer mitigates ~99% |
| H20 (Heavy-Tailed) | ParetoLogNormal produces fewer preemptions than Gaussian despite similar means | Distribution median, not mean, drives KV pressure |
All findings are documented in hypotheses/*/FINDINGS.md in their respective experiment branches. Completed experiments are also collected in the hypothesis-archive branch.
## Convergence Review
Experiments go through a multi-perspective review process to ensure rigor:
- Design Review (5 perspectives) — validates hypothesis quality and experiment design
- Code Review (5 perspectives) — checks run.sh/analyze.py for correctness
- FINDINGS Review (10 perspectives) — validates conclusions against evidence
The /convergence-review skill automates this process. Zero CRITICAL + zero IMPORTANT findings = converged.
## Further Reading
- Hypothesis Process — full 10-step process for contributors
- Experiment Standards — rigor requirements (ED-1 through ED-6, RCV-1 through RCV-6)
- Metrics & Results — understanding the metrics your experiments produce