
BLIS Experiment Standards

Hypothesis-driven experimentation is a first-class activity in BLIS — equal in rigor to implementation and design. The experiment framework is grounded in Verification, Validation, and Uncertainty Quantification (VV&UQ) from simulation science.

VV&UQ Framing

Every hypothesis falls into one of three VV&UQ categories. This determines what kind of evidence is needed.

| Category | Question | Evidence required | Examples |
|---|---|---|---|
| Verification | Does the code implement the intended math/logic? | Exact invariant checks. Failure = bug. | H12 (conservation), H13 (determinism), H22 (input validation) |
| Validation | Does the model match expected system behavior? | Statistical comparison against analytical baselines or real data within a pre-specified accuracy interval. | Cross-validation against M/M/k; H19 (roofline vs blackbox) |
| Uncertainty Quantification | How confident are we in the region where a finding holds? | Confidence intervals on thresholds; probability statements on properties. | H8 (preemption cliff at 2100±? blocks); H10 (28% improvement at this operating point: what about others?) |

Most current experiments are Verification (invariant checking) or informal Validation (metric comparison). Future experiments should increasingly incorporate UQ — every threshold finding should include a confidence interval, every "confirmed" result should quantify the probability of holding under parameter variation.

Purposes

The VV&UQ categories above classify what kind of evidence an experiment needs. But experiments also serve broader purposes beyond their category:

  • Verification and Validation — the primary purposes, aligned with the VV&UQ categories above
  • Discovery — a secondary outcome of any experiment. Bugs, design gaps, and undocumented limitations often surface during verification or validation experiments. Discovery is not a separate VV&UQ category — it's a valuable byproduct. Example: H5 (a Validation experiment) discovered that the per-input-token cost model makes burst smoothing structurally impossible — a design limitation finding that was not the hypothesis being tested.

How to choose your VV&UQ category

Is your hypothesis about whether the CODE is correct?
  → Yes: Verification (e.g., "conservation holds," "deterministic output")
  → No: ↓
Is your hypothesis comparing the MODEL's output to expected behavior?
  → Yes: Validation (e.g., "policy A beats B," "TTFT ∝ input tokens")
  → No: ↓
Is your hypothesis about the BOUNDARIES or CONFIDENCE of a finding?
  → Yes: UQ (e.g., "preemption cliff at 2100±100 blocks," "P(stable) > 0.95")

The VV&UQ category determines what counts as evidence:

  • Verification: Exact invariant checks. One failure = bug. Single seed sufficient.
  • Validation: Statistical comparison within pre-specified accuracy interval. 3+ seeds. Formal tests (KS, Mann-Whitney U).
  • UQ: Confidence intervals on thresholds. Parameter sweeps. Sensitivity analysis.

Formal statistical rigor

Experiments involving statistical claims must use proper hypothesis tests, not ad-hoc thresholds:

  • Distribution validation (workload/arrival family): Kolmogorov-Smirnov test — compares a sample against a theoretical CDF. Reject if p < 0.05. In Python: from scipy.stats import kstest; stat, p = kstest(samples, 'expon', args=(0, 1/rate)).
  • Metric comparison (cross-policy family): Mann-Whitney U test — non-parametric comparison of two independent samples. Report effect size AND confidence interval, not just "X% better." In Python: from scipy.stats import mannwhitneyu; stat, p = mannwhitneyu(a_values, b_values).
  • Threshold estimation (performance-regime family): Report thresholds with confidence intervals. "Preemption cliff at 2100 blocks" → "Preemption cliff at 2100 ± 100 blocks (95% CI across seeds 42, 123, 456)."
  • Invariant probability (scheduler invariants family): For stochastic invariants, estimate P(invariant holds) with a confidence interval, not just "holds for 3 seeds."
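The last two bullets can be illustrated with standard-library Python only. The sketch below is one possible approach, not a prescribed BLIS method: it uses a Wilson score interval for P(invariant holds) and a Student-t interval for a threshold estimated once per seed. The function names, the per-seed cliff values, and the choice of interval types are assumptions made for illustration.

```python
import math
import statistics


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def threshold_ci(per_seed_thresholds: list[float], t_crit: float) -> tuple[float, float]:
    """Mean and half-width of a t-based interval over per-seed threshold estimates."""
    n = len(per_seed_thresholds)
    mean = statistics.mean(per_seed_thresholds)
    half = t_crit * statistics.stdev(per_seed_thresholds) / math.sqrt(n)
    return mean, half


# "Holds for 3 of 3 seeds" still leaves a wide interval on P(invariant holds).
lo, hi = wilson_interval(successes=3, trials=3)
print(f"P(invariant holds), 95% CI: [{lo:.3f}, {hi:.3f}]")

# Hypothetical per-seed cliff locations in blocks. 4.303 is the two-sided 95%
# Student-t critical value for 2 degrees of freedom (3 seeds).
cliff_mean, cliff_half = threshold_ci([2050.0, 2100.0, 2150.0], t_crit=4.303)
print(f"Preemption cliff at {cliff_mean:.0f} ± {cliff_half:.0f} blocks")
```

With only three seeds the lower bound on P(invariant holds) stays well below 1, which is the reason the bullet asks for an interval rather than a seed count.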

Note on scipy: The tests above use scipy.stats. Install with pip install scipy if needed. For experiments that only use standard-library Python, the legacy thresholds (below) remain acceptable.

Legacy thresholds (still valid for experiments without scipy):

  • >20% improvement consistent across all seeds = significant
  • <10% in any seed = inconclusive
  • Within 5% across all seeds = equivalent (for equivalence tests)

These thresholds were chosen pragmatically — 20% ensures the effect is visible above seed-to-seed variance in typical BLIS experiments; 5% accounts for floating-point and timing noise. They are not derived from formal power analysis. New experiments should prefer formal tests where scipy is available.

Cross-validation against analytical models

Where applicable, validate DES outputs against analytically-tractable models under matching assumptions. This grounds the simulator in theory.

  • M/M/k baseline: M/M/k is the standard queueing model with Markovian (Poisson) arrivals, Markovian (exponential) service times, and k servers. Under matching assumptions, compare DES queue length distribution against the M/M/k analytical solution. Caveat: BLIS uses batching and deterministic service times (alpha/beta coefficients), so exact M/M/k matching is not possible. The comparison requires configuring BLIS with --max-batch-size 1 and interpreting the service time distribution as approximately exponential. Divergence may indicate modeling errors OR fundamental architectural differences from M/M/k assumptions.
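  • Little's Law: For any stable configuration, verify L = λW (average queue length = arrival rate × average wait time). The law holds for any stable system regardless of arrival or service distribution, so a violation points to a measurement or simulation error. In BLIS terms: L = mean still_queued from per-instance metrics; λ = injected_requests / (sim_duration_us / 1e6), in requests per second; W = mean scheduling delay in seconds, approximated by scheduling_delay_p99_ms / 1000. Because a p99 is an upper-tail statistic rather than a mean, this proxy typically overstates W, so treat the comparison as approximate. Extract all three values from the JSON output.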
  • Phase structure: Verify that prefill time ∝ prompt tokens and decode time ∝ output tokens by fitting linear models and checking R² > 0.95. In BLIS terms: prefill time ≈ TTFT (time to first token); decode time ≈ E2E - TTFT. Vary input_distribution mean while holding output_distribution constant, and vice versa.
  • Replay validation: Compare BLIS simulation output against real inference server traces (e.g., vLLM experiment logs). This tests system composition — whether the simulator's component interactions produce realistic end-to-end behavior — which analytical evaluation of individual components cannot. Caveat (T_queue_obs masking): Analytical evaluation that substitutes real observed queue waits into the prediction formula (e.g., TTFT_pred = α(pre-scheduling overhead) + T_queue_obs + T_prefill_pred) never tests queueing prediction. At saturation, T_queue dominates (~120s), making the <0.02% coefficient error an artifact of masking, not evidence of good coefficients. Only full BLIS replay tests the complete prediction pipeline. In BLIS terms: Use blis convert inference-perf to convert real experiment specs (verify enable_multi_turn_chat semantics match between the real server config and BLIS conversion — see #517), then compare BLIS metrics (TTFT, E2E, throughput) against real server metrics at matching operating points. H30-H32 established this pattern, finding that both crossmodel and blackbox backends show identical TTFT underprediction — -17% to -56% depending on model and workload profile, all at sub-saturation; at actual saturation, regime transition amplification produces 2000x divergence — proving the gap is in the scheduling model (zero inter-step overhead), not coefficient accuracy.
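The Little's Law and phase-structure checks in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the inputs are assumed to have already been extracted from the BLIS JSON output, and the sample numbers are invented. The R² computation is an ordinary least-squares fit implemented by hand so that no third-party package is needed.

```python
def littles_law_gap(mean_queue_len: float, injected_requests: int,
                    sim_duration_us: float, mean_wait_ms: float) -> float:
    """Relative gap between observed L and predicted lambda * W."""
    arrival_rate = injected_requests / (sim_duration_us / 1e6)  # requests per second
    wait_s = mean_wait_ms / 1000.0                              # milliseconds to seconds
    predicted = arrival_rate * wait_s
    largest = max(mean_queue_len, predicted)
    if largest == 0:
        return 0.0  # empty queue and zero predicted load agree trivially
    return abs(mean_queue_len - predicted) / largest


def r_squared(xs: list[float], ys: list[float]) -> float:
    """Coefficient of determination for a least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary to fit a line")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values: 1000 requests over 10 s, 50 ms mean wait, observed L = 5.1.
gap = littles_law_gap(5.1, 1000, 10_000_000, 50.0)
print(f"Little's Law relative gap: {gap:.3f}")

# Hypothetical sweep: mean prompt tokens vs measured TTFT (ms).
prompt_tokens = [128.0, 256.0, 512.0, 1024.0]
ttft_ms = [266.0, 522.0, 1034.0, 2058.0]
fit_quality = r_squared(prompt_tokens, ttft_ms)
print(f"prefill fit R^2 = {fit_quality:.4f} (pass if > 0.95)")
```

What counts as an acceptable Little's Law gap is not specified in this document, so the sketch only reports the gap and leaves the tolerance to the experimenter.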

Parameter resolution

Experiment setup errors often stem from parameter precedence confusion — e.g., CLI defaults overriding intended workload values (#390), or --rate being ignored when --workload-spec is set. See Common Pitfalls and Known Unit Gotchas in the Configuration Reference.

Experiment Classification

Every hypothesis must be classified before designing the experiment. The classification determines rigor requirements.

Type 1: Deterministic Experiments

Definition: Verify exact properties — invariants, conservation laws, error handling boundaries. Same seed = same result, guaranteed.

Requirements:

  • Single seed sufficient (determinism is the point)
  • Pass/fail is exact: the invariant holds or it doesn't
  • Failure is ALWAYS a bug (never noise)
  • No statistical analysis needed

Examples:

  • H12: Request conservation (INV-1) holds across 10 policy configurations (67 checks)
  • H13: Same seed produces byte-identical output
  • H22: Zero KV blocks panics at CLI boundary, not deep in simulation

Pass criteria: The invariant holds for every configuration tested. One failure = bug.


Type 2: Statistical Experiments

Definition: Compare metrics (TTFT, throughput, distribution uniformity) across configurations. Results vary by seed.

Requirements:

  • Minimum 3 seeds (42, 123, 456) for each configuration
  • Effect size thresholds:
      • Significant: >20% improvement consistent across ALL seeds
      • Inconclusive: <10% in any seed
      • Equivalent: within 5% across all seeds (for equivalence tests)
  • Directional consistency: the predicted direction must hold across ALL seeds. One contradicting seed = hypothesis not confirmed
  • Report: mean, min, max across seeds for primary metric. Include per-seed values for transparency.
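As a rough illustration of how these requirements might be applied in an analyzer, the sketch below classifies per-seed improvements for a lower-is-better metric. The function names and TTFT values are hypothetical. The thresholds as written do not say how to label a result where every seed falls between 10% and 20%, so the sketch returns "unclassified" for that band; that label is an assumption of the sketch, not a BLIS status.

```python
from statistics import mean


def improvement(baseline: float, candidate: float) -> float:
    """Fractional improvement for a lower-is-better metric such as TTFT."""
    return (baseline - candidate) / baseline


def classify(improvements: list[float]) -> str:
    """Apply directional consistency first, then the effect-size thresholds."""
    if any(i <= 0 for i in improvements):
        return "not confirmed"   # at least one seed contradicts the predicted direction
    if all(i > 0.20 for i in improvements):
        return "significant"
    if any(i < 0.10 for i in improvements):
        return "inconclusive"
    return "unclassified"        # 10-20% band: not covered by the stated thresholds


# Hypothetical per-seed TTFT values (ms) for seeds 42, 123, 456.
baseline_ttft = [100.0, 110.0, 95.0]
candidate_ttft = [70.0, 80.0, 72.0]
per_seed = [improvement(b, c) for b, c in zip(baseline_ttft, candidate_ttft)]
verdict = classify(per_seed)

# Report mean, min, and max alongside the per-seed values.
print(verdict, [f"{i:.3f}" for i in per_seed],
      f"mean={mean(per_seed):.3f} min={min(per_seed):.3f} max={max(per_seed):.3f}")
```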

Subtypes:

Dominance

A is strictly better than B on metric M.

  • Analysis: Compare metric M for A vs B across all seeds. Compute ratio per seed.
  • Pass: A beats B on M for all seeds, with >20% effect size in every seed.
  • Examples: H3 — queue-depth TTFT is 1.7-2.8x better than kv-utilization across 3 seeds. H14 — always-busiest routing produces 4.6x worse TTFT and routes all 500 requests to a single instance.

Monotonicity

Increasing X should monotonically increase/decrease Y.

  • Analysis: Run at >=3 values of X. Verify Y changes monotonically.
  • Pass: Y is strictly monotonic in X across all seeds. No inversions.
  • Example: H8 — reducing total KV blocks increases preemption frequency. H9 — increasing prefix_length decreases TTFT.
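A monotonicity check along these lines can be written in a few lines of Python. The sketch below is illustrative only: the helper names and the preemption-rate values are hypothetical, and it assumes each seed's Y values are already ordered by increasing X.

```python
def strictly_monotonic(ys: list[float], increasing: bool) -> bool:
    """True if consecutive values always move in the predicted direction."""
    if len(ys) < 3:
        raise ValueError("need at least 3 sweep points")
    pairs = list(zip(ys, ys[1:]))
    if increasing:
        return all(later > earlier for earlier, later in pairs)
    return all(later < earlier for earlier, later in pairs)


def seeds_with_inversions(sweeps: dict[int, list[float]], increasing: bool) -> list[int]:
    """Seeds whose sweep contains an inversion or a tie."""
    return [seed for seed, ys in sweeps.items() if not strictly_monotonic(ys, increasing)]


# Hypothetical preemption rates as total KV blocks are reduced step by step.
sweeps = {
    42: [0.00, 0.02, 0.11],
    123: [0.00, 0.03, 0.10],
    456: [0.01, 0.01, 0.12],  # tie between the first two points
}
failed = seeds_with_inversions(sweeps, increasing=True)
print("seeds failing strict monotonicity:", failed)
```

Because the pass criterion is strict, a tie counts as a failure here; whether ties within noise should be tolerated is a judgment the experiment design has to make explicitly.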

Equivalence

A ~ B within tolerance (baseline sanity checks).

  • Analysis: Compare metric M for A vs B. Compute percentage difference per seed.
  • Pass: |A - B| / max(A, B) < 5% across all seeds.
  • Example: H4 — round-robin ~ least-loaded for uniform workloads at low rates. H23 — all policies equivalent at near-zero load.
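The pass criterion above translates directly into code. The following sketch assumes non-negative metrics and uses hypothetical function names and TTFT values.

```python
def relative_difference(a: float, b: float) -> float:
    """|A - B| / max(A, B), defined as 0 when both values are 0."""
    largest = max(a, b)
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


def equivalent(a_by_seed: list[float], b_by_seed: list[float], tolerance: float = 0.05) -> bool:
    """True if every seed's relative difference is under the tolerance."""
    return all(relative_difference(a, b) < tolerance
               for a, b in zip(a_by_seed, b_by_seed))


# Hypothetical mean TTFT (ms) per seed for two policies at low load.
round_robin = [20.0, 21.0, 19.5]
least_loaded = [20.4, 20.8, 19.9]
is_equivalent = equivalent(round_robin, least_loaded)
print("equivalent within 5%:", is_equivalent)
```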

Pareto

No single configuration dominates all metrics simultaneously.

  • Analysis: Run N configurations, measure multiple metrics. Identify Pareto-optimal set.
  • Pass: At least 2 configurations are Pareto-optimal (each best on >=1 metric).
  • Example: H17 — different scorer weights optimize for different objectives (TTFT vs throughput).
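One way to identify the Pareto-optimal set is the standard non-dominance test, sketched below. Note that this is slightly broader than the "each best on >=1 metric" wording in the pass criterion: a configuration can be non-dominated without being the best on any single metric. The configuration names and metric values are hypothetical, and throughput is negated so that lower is better on every axis.

```python
def dominates(p: tuple[float, ...], q: tuple[float, ...]) -> bool:
    """p dominates q if it is no worse on every metric and better on at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_set(configs: dict[str, tuple[float, ...]]) -> list[str]:
    """Names of configurations that no other configuration dominates."""
    return sorted(
        name for name, metrics in configs.items()
        if not any(dominates(other, metrics)
                   for other_name, other in configs.items() if other_name != name)
    )


# Hypothetical results: (TTFT in ms, negated throughput in req/s).
configs = {
    "weights-a": (50.0, -900.0),
    "weights-b": (80.0, -1200.0),
    "weights-c": (90.0, -800.0),  # worse than weights-a on both metrics
}
optimal = pareto_set(configs)
print("Pareto-optimal:", optimal, "| pass:", len(optimal) >= 2)
```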

Hypothesis Formation

Hypotheses must be conceptual and behavioral, not code-grounded. This is the experimental analogue of behavioral vs structural testing.

Conceptual hypotheses test system behavior

A good hypothesis is an intuitive claim about system behavior: "burst smoothing should reduce tail latency," "tiered storage should reduce preemptions," "same seed should produce identical output." These claims are based on systems thinking, not on reading the implementation.

Do NOT read the code before forming hypotheses

Reading the code before hypothesizing is like writing structural tests — you end up testing the implementation, not the behavior. The value of hypothesis-driven experimentation is that conceptual claims failing against the implementation surfaces design limitations that code-aware experiments would avoid.

Evidence: If H5 had read admission.go:45 before hypothesizing, the experimenter would have designed a "correct" experiment with cap=100K, confirmed a tiny effect, and missed the discovery that the per-input-token cost model makes burst smoothing structurally impossible at practical parameters. The conceptual hypothesis exposed a design limitation that a code-grounded hypothesis would have sidestepped.

"Mechanism not plausible" is a valid resolution

When a conceptual hypothesis fails because the implementation doesn't support the assumed mechanism, this is the resolution "Refuted — mechanism not plausible." This is a design limitation finding, not an experimenter error. The hypothesis did its job — it revealed a gap between how users think the system works and how it actually works.


Hypothesis Families

Every hypothesis belongs to a family (what domain is being tested) AND a type (how rigor is assessed). These are orthogonal — a scheduler invariant can be deterministic or statistical; a cross-policy comparison is always statistical.

The six families

| Family | Tests | Hypothesis shape | Typical type | Examples |
|---|---|---|---|---|
| Workload/arrival | Input generation: distributions, rates, burstiness, mix proportions | "Generator X produces arrivals matching distribution D within tolerance T" | Statistical | H16, H20 |
| Scheduler invariants (safety/liveness) | Conservation, determinism, lifecycle, livelock protection | "For ALL configurations, property P holds" (universally quantified) | Deterministic | H12 (conservation), H13 (determinism), H25 |
| Performance-regime (scaling laws) | Saturation curves, throughput-latency tradeoffs, horizontal scaling | "Metric M is monotonic/convex in parameter P" | Statistical/Monotonicity | H7 (scaling), H8 (KV pressure), H11 (batch formation) |
| Structural model | DES model assumptions: phase structure, KV mechanics, signal freshness, prefix caching | "Component C behaves according to model assumption A" | Mixed | H3 (signal freshness), H9 (prefix caching), H10 (tiered KV), H26 |
| Robustness/failure-mode | Overload, misconfiguration, degenerate inputs, pathological policies | "Under stress condition S, the system exhibits defined behavior B (not undefined state)" | Deterministic or Statistical | H5 (token-bucket), H14 (pathological), H21, H22, H24 |
| Cross-policy comparative | Policy ordering, Pareto frontiers, robustness to workload shifts | "There EXISTS a workload where policy A beats B on metric M" (existentially quantified) | Statistical/Dominance or Pareto | H1, H2, H4, H6, H15, H17, H18, H19, H23 |

Family-specific hypothesis sentence patterns

Use these templates when generating new hypotheses. Each family has a characteristic sentence shape that ensures testability. See also docs/contributing/hypothesis.md for the full generation guide.

| Family | Sentence pattern | Example |
|---|---|---|
| Workload/arrival | "Generator G with parameters P should produce distribution D with property X within tolerance T" | "Gamma sampler with CV=3.5 should produce inter-arrival times with CV within 10% of 3.5 over 10K samples" |
| Scheduler invariants | "For ALL configurations C, invariant I holds at simulation end" | "For all routing × scheduling × admission combinations, injected == completed + queued + running" |
| Performance-regime | "Metric M should be monotonically non-decreasing/non-increasing in parameter P across range [a, b]" | "TTFT P99 should be monotonically non-decreasing in offered load from 500 to 5000 req/s" |
| Structural model | "Component C should behave according to assumption A, verified by observable O" | "Prefill time should be proportional to input token count (R² > 0.95 for linear fit)" |
| Robustness | "Under stress condition S, the system should exhibit behavior B and NOT exhibit behavior X" | "Under 10x overload, the system should reject excess requests and NOT deadlock or panic" |
| Cross-policy | "Under workload W, policy A should produce better metric M than policy B because of mechanism Z" | "Under mixed-SLO workload, priority-FCFS should produce lower realtime TTFT than FCFS because realtime requests get scheduled first" |

Family × Type matrix

Deterministic Statistical/Dominance Statistical/Monotonicity Statistical/Equivalence Statistical/Pareto
Workload/arrival Seed reproducibility Distribution match Rate scaling
Scheduler invariants Primary (INV-1 through INV-6)
Performance-regime Primary (scaling curves) Baseline sanity Knee behavior
Structural model Phase structure Signal freshness Cache effectiveness
Robustness Input validation Overload behavior
Cross-policy Primary (A vs B) Low-load equivalence Multi-scorer tradeoffs

Family determines rigor requirements

  • Scheduler invariants: Single seed sufficient. Pass/fail is exact. One failure = bug.
  • Cross-policy comparative: 3+ seeds minimum. Must control confounding variables (ED-1, ED-6).
  • Performance-regime: Sweep points (≥3 values of the independent variable), not just pairwise comparison.
  • Workload/arrival: Statistical tests on generated distributions. Long runs for accurate rate estimation.
  • Structural model: Code-level verification (RCV-1, RCV-4) is essential — these test implementation assumptions.
  • Robustness: Must test BOTH the defined behavior AND verify no undefined states (deadlock, panic, data loss).

Relationship to existing invariants and rules

| Family | Related invariants | Related rules |
|---|---|---|
| Scheduler invariants | INV-1 (conservation), INV-2 (lifecycle), INV-3 (clock monotonicity), INV-5 (causality), INV-6 (determinism) | R1 (no silent data loss), R5 (transactional mutation), R21 (mutable slice iteration) |
| Performance-regime | INV-8 (work-conserving) | |
| Structural model | INV-4 (KV conservation), INV-7 (signal freshness) | R2 (sort map keys), R11 (guard division), R17 (signal freshness), R22 (pre-check consistency) |
| Workload/arrival | | R23 (code path parity) |
| Robustness | | R3 (validate numeric parameters), R19 (livelock protection), R20 (degenerate inputs) |
| Cross-policy | | R18 (CLI flag precedence) |

Experiment Design Rules

ED-1: Controlled comparison

Vary exactly one dimension between configurations. Everything else held constant (same model, same instances, same workload, same seed). If the experiment requires varying multiple dimensions, decompose into separate sub-experiments.

ED-2: Rate awareness

Many effects are rate-dependent (e.g., signal freshness only matters at high rates). When the hypothesis involves load-dependent behavior:

  • Run at the target rate where the effect is expected
  • Also run at a rate where the effect should vanish (to confirm the mechanism, not just the outcome)
  • Document the rate-dependent transition point if observed

ED-3: Precondition verification

Before comparing configurations, verify the experiment preconditions hold. Examples:

  • Testing SJF vs FCFS? Verify queue depth exceeds batch size (otherwise both produce identical batches).
  • Testing cache hit benefit? Verify KV blocks are large enough to hold the prefix (otherwise LRU eviction destroys it).

Document the precondition check in the experiment script (not just in prose).

ED-4: Workload seed independence

Resolved (#284): CLI --seed now overrides the workload-spec YAML seed: field when explicitly passed. Behavior:

  • --seed N --workload-spec w.yaml → workload uses seed N (CLI override)
  • --workload-spec w.yaml (no --seed) → workload uses YAML seed: value (backward compatible)
  • CLI-generated workloads (--rate, --num-requests) → --seed controls everything (unchanged)

For multi-seed experiments: simply vary --seed on the command line. No need to generate per-seed YAML copies.

Note: The YAML seed: field still serves as the default seed for the workload when --seed is not explicitly specified. This enables the "shareable workload" pattern — distributing a YAML file that always produces the same workload by default.
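The precedence described in this rule can be modeled in a few lines. This is a sketch of the documented behavior, not the BLIS implementation: the function name is hypothetical, and None stands in for "flag or field not provided."

```python
from typing import Optional


def effective_workload_seed(cli_seed: Optional[int], yaml_seed: Optional[int]) -> Optional[int]:
    """Seed used for the workload: an explicit --seed wins, else the YAML seed: field."""
    if cli_seed is not None:
        return cli_seed
    return yaml_seed


# Multi-seed experiment: the same YAML file, varying only --seed.
seeds_used = [effective_workload_seed(cli, yaml_seed=7) for cli in (42, 123, 456)]
print(seeds_used)

# Shareable-workload pattern: no --seed passed, so the YAML default applies.
default_seed = effective_workload_seed(None, yaml_seed=7)
print(default_seed)
```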

ED-5: Reproducibility

Every experiment must be reproducible from its artifacts alone:

  • run.sh must build the binary and run all variants
  • Exact seed values documented
  • Exact commit hash recorded (or the experiment is tied to a specific branch/PR)
  • No manual steps between script invocation and results

ED-6: Config diff against reference experiments

When an experiment reuses calibration data from a prior experiment (e.g., "H8 found the preemption cliff at 2100 blocks, so we use 2100"), diff every CLI flag and YAML field between the two experiments. Document any differences. Even a single changed flag (e.g., routing policy) can invalidate the calibration.

Evidence: H10 used --routing-policy least-loaded while H8 used the default round-robin. This shifted the preemption cliff, producing zero preemptions where H8 found 11%. The mismatch was not caught until post-publication code review.
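A config diff of this kind is easy to automate once each experiment's flags are collected into a dictionary. The sketch below is a minimal illustration: the function name and the dictionary representation are assumptions, and the flag values mirror the H8/H10 mismatch described above.

```python
UNSET = "<unset>"  # marker for a flag present in one experiment but not the other


def config_diff(reference: dict[str, object], current: dict[str, object]) -> dict[str, tuple]:
    """Map each differing key to its (reference, current) pair of values."""
    diff = {}
    for key in sorted(set(reference) | set(current)):
        ref_value = reference.get(key, UNSET)
        cur_value = current.get(key, UNSET)
        if ref_value != cur_value:
            diff[key] = (ref_value, cur_value)
    return diff


# H8 ran with the default round-robin routing; H10 passed least-loaded explicitly.
h8_flags = {"routing-policy": "round-robin", "seed": 42}
h10_flags = {"routing-policy": "least-loaded", "seed": 42}
differences = config_diff(h8_flags, h10_flags)
for flag, (ref_value, cur_value) in differences.items():
    print(f"--{flag}: reference={ref_value} current={cur_value}")
```

For the diff to be meaningful, defaults need to be written out explicitly on both sides; a flag that is simply omitted would otherwise appear as unset rather than as its default value.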


Root Cause Verification

After analyzing results and before finalizing FINDINGS.md, every experiment MUST verify its causal explanations. This step exists because plausible narratives can pass review without being correct.

RCV-1: Every causal claim must cite file:line

A root cause analysis that says "the tiered cache increases total capacity" without citing the code that does this is a hypothesis about the root cause, not a verified root cause. Trace the claim through the code:

  • Which function implements the claimed behavior?
  • What are the exact conditions under which it fires?
  • Does the claimed mechanism actually change the measured metric?
  • Tracing depth: The citation must trace to the code that directly modifies the measured metric, not just to the constructor or factory that creates the relevant object. Citing NewKVStore and claiming "this creates a tiered cache with more capacity" is insufficient; you must verify that the GPU block count actually changes in the created object.

Evidence: H10 claimed "CPU tier increases total effective capacity" — but NewKVStore (kv_store.go:31-36) does not change GPU block count. The actual mechanism was maybeOffload preserving prefix hashes (sim/kv/tiered.go).

RCV-2: Every "surprise" must have a first-principles calculation

Before labeling a result as "surprising," compute the expected value from the system's parameters. If the result matches the calculation, it is not a surprise — it is the expected outcome of a mechanism you didn't initially consider.

Evidence: H5 labeled 96% rejection as a "surprise." But admission.go:45 charges len(req.InputTokens) per request (mean=512). Token demand (1,024,000 tokens/s) exceeds supply (400 tokens/s) by 2,560x. The 96% rejection is the mathematically inevitable steady state.

RCV-3: Check the mechanism, not just the direction

Confirming that "A is better than B" is necessary but not sufficient. The root cause analysis must explain why through a specific code path. A correct directional result with an incorrect explanation is a ticking time bomb — the explanation will mislead future experiments.

Paradox flag: If the proposed mechanism predicts the opposite direction of what would be intuitive (e.g., "fewer cache hits improving performance"), treat this as a red flag. Before accepting a paradoxical explanation, independently verify the underlying data. In H10, the claim "fewer cache hits → better TTFT" survived two rounds because the data (from a buggy analyzer) appeared to support it. The corrected data showed cache hits increased, resolving the paradox. When mechanism and intuition disagree, verify the data first.

RCV-4: Validate causal claims with control experiments

When a mechanism is proposed (e.g., "maybeOffload causes the TTFT improvement"), design a control experiment that disables only that mechanism (e.g., --kv-offload-threshold 1.0). If the effect vanishes, the mechanism is confirmed. If it persists, the explanation is wrong.

Evidence: H10 proposed maybeOffload as the mechanism. The control experiment (threshold=1.0) produced output byte-identical to single-tier, confirming maybeOffload as the sole cause. Without this control, the mechanism question ("does maybeOffload cause the TTFT improvement?") would have remained unverified.

RCV-5: Confirmation bias guard (Devil's Advocate)

Before sending FINDINGS.md to external review, the experimenter must write a Devil's Advocate section: 2-3 sentences arguing the opposite of the conclusion. This is a pre-review self-check that forces consideration of alternative interpretations.

## Devil's Advocate

**If this is "Confirmed," argue why it might be Refuted:**
The 69x TTFT improvement could be entirely from load shedding (96% rejection)
rather than burst smoothing. A firewall that blocks all traffic also has great
latency for the requests that pass.

**If this is "Refuted," argue why it might be Confirmed:**
The calibrated bucket (cap=100K) showed a 4% improvement — small but consistent
across 2 of 3 seeds. This might be a real but tiny burst-smoothing effect masked
by workload noise.

The reviewers see both the conclusion AND the counter-argument. This prevents the failure mode where the experimenter writes "Confirmed" and the reviewers are anchored by that label.

Evidence: H5 was labeled "Confirmed" for three rounds. Nobody argued the alternative until Round 3's honest reassessment. A Devil's Advocate section in Round 1 would have surfaced "could this be load shedding?" immediately.

RCV-6: Mandatory Scope and Limitations

Every FINDINGS.md must include a Scope and Limitations section documenting:

  • Exact operating point tested (blocks, rate, seeds, instances, routing)
  • Parameters the findings depend on
  • What was NOT tested that could change the conclusion
  • Whether the finding generalizes or is specific to the tested configuration

Evidence: H10's "28% TTFT improvement" is specific to GPU=2100 blocks near the preemption cliff. Without the scope section, this number would be cited as a general property of tiered KV caching.


Iterative Review Protocol

Canonical source: docs/contributing/convergence.md. If this section diverges, convergence.md is authoritative.

Every hypothesis experiment iterates until convergence (max 10 rounds per gate) through three review gates, each using the universal convergence protocol:

  1. Design Review (5 perspectives) — after experiment design, before implementation
  2. Code Review (5 perspectives) — after implementing run.sh/analyze.py, before execution
  3. FINDINGS Review (10 perspectives) — after documenting results, before finalization

Convergence: Zero CRITICAL and zero IMPORTANT items from any reviewer perspective in the current round. No minimum round count — convergence in Round 1 is valid if no reviewer flags any CRITICAL or IMPORTANT item. SUGGESTION-level items do not block convergence. See docs/contributing/convergence.md for the full protocol and severity definitions, and docs/contributing/hypothesis.md for reviewer prompts and perspective checklists.
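The convergence criterion can be expressed as a simple predicate over the severities raised in a round. The sketch below is illustrative only; convergence.md remains the authoritative definition, and the function names and list-of-severities representation are assumptions.

```python
from typing import Optional

BLOCKING = {"CRITICAL", "IMPORTANT"}  # SUGGESTION items never block convergence
MAX_ROUNDS = 10                       # per-gate cap stated in the protocol


def round_converged(severities: list[str]) -> bool:
    """True if no reviewer perspective raised a blocking item this round."""
    return not any(severity in BLOCKING for severity in severities)


def first_converged_round(rounds: list[list[str]]) -> Optional[int]:
    """1-based index of the first converged round, or None if the cap is reached."""
    for number, severities in enumerate(rounds[:MAX_ROUNDS], start=1):
        if round_converged(severities):
            return number
    return None


# Round 1 has a CRITICAL item; Round 2 has only a SUGGESTION.
review_history = [["CRITICAL", "SUGGESTION"], ["SUGGESTION"]]
converged_at = first_converged_round(review_history)
print("converged in round:", converged_at)
```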

Why internal agents instead of external LLM reviews: Internal Task agents can read the actual source files, verify file:line citations, and cross-reference analyzer regexes against simulator output format strings — capabilities external LLM reviews lack. See docs/contributing/hypothesis.md for the full evidence and comparison table.


Hypothesis Resolution

Every hypothesis resolves to a status (did the prediction hold?) and a resolution (what do we do about it?). These are distinct — a "confirmed" hypothesis can still have a wrong-mechanism resolution that changes user guidance entirely.

Status (the prediction)

| Status | Definition | Example |
|---|---|---|
| Confirmed | The predicted directional outcome holds across all seeds | H13: same seed → byte-identical output |
| Confirmed with nuance | The prediction holds but the mechanism or practical implications differ from expected | H5: token-bucket reduces TTFT 69x but via 96% load shedding, not burst smoothing; no practical sweet spot |
| Partially confirmed | Some predictions hold, others don't, or the experiment tested something different than intended | H14: routing pathological confirmed, scheduling showed double-inversion cancellation |
| Refuted | The predicted outcome does not hold across seeds | (not yet observed; the refutation IS the value) |
| Inconclusive | Effect is within noise (<10% in any seed) or parameter-dependent | H5 exp4: calibrated bucket shows <5% TTFT improvement |

Resolution (what we learned and what to do)

| Resolution | Definition | Action | Example |
|---|---|---|---|
| Clean confirmation | Hypothesis holds, mechanism matches prediction | Document. No further action. | H13, H3, H8 |
| Confirmation with wrong mechanism | Prediction holds directionally but the underlying cause differs | Correct the explanation. May change user guidance entirely. | H5: improvement is load shedding, not burst smoothing |
| Confirmation with bug discovery | Prediction holds but experiment surfaces code defects | File issues (--label bug). Fix in separate PRs. | H12: conservation holds but preemption panics. H14: routing works but 3 detector bugs. H10: tiered KV confirmed but analyzer bug masked preemptions for 2 rounds. |
| Partial confirmation with surprise | Some predictions fail; unexpected useful insights emerge | Document surprise. May spawn new hypotheses. | (use when the experiment finds something valuable but different from what was hypothesized) |
| Refuted: mechanism not plausible | The hypothesis assumed a mechanism that the implementation doesn't support | File design issue if the mechanism should exist but doesn't. Document the actual mechanism. | H5: hypothesis assumed burst smoothing, but per-input-token cost model (admission.go:45) makes burst smoothing structurally impossible at practical parameters |
| Refuted: system design flaw | Prediction fails because system doesn't work as designed | File design issue (--label design). May require architectural change. | (not yet observed) |
| Refuted: wrong mental model | Prediction fails because experimenter's assumptions were wrong | Correct understanding. Document what the system actually does. | (not yet observed) |
| Inconclusive: parameter-dependent | Effect exists at some parameters but not others | Document the parameter boundary. May need recalibration. | H5 exp4: <5% effect with calibrated bucket |
| Converged to open question | Mechanism identified but directional explanation requires different tooling | Mark as open. Propose specific tooling needed. | (use when remaining questions require code instrumentation, not more experiment sweeps) |

Choosing status vs resolution

The status answers "did the number go the way we predicted?" The resolution answers "do we understand why and what to do about it?" Always report both:

**Status:** Confirmed with nuance
**Resolution:** Confirmation with wrong mechanism — token-bucket reduces TTFT
via load shedding (96% rejection), not burst smoothing. No practical sweet spot
under Gamma CV=3.5.

A common mistake is declaring "Confirmed" and stopping. The resolution is where the real value lives.


Findings Classification

Every experiment produces individual findings. Each finding MUST be classified independently of the hypothesis status:

| Finding Type | Definition | Action Required |
|---|---|---|
| Confirmation | The hypothesis holds; the system works as designed | Document in FINDINGS.md. No issues needed. |
| Bug discovery | The hypothesis failed due to a code defect | File GitHub issue with --label bug. Fix in separate PR. |
| New rule | The experiment revealed a pattern that should be checked in all future PRs | Add to docs/contributing/standards/rules.md with evidence. File issue with --label enhancement if code changes needed. |
| New invariant | The experiment revealed a property that must always hold | Add to docs/contributing/standards/invariants.md. |
| Design limitation | The system works as coded but has an undocumented behavioral limitation | Document in FINDINGS.md + file issue with --label design for design doc update. |
| Surprise | An unexpected result that doesn't fit other categories | Document in FINDINGS.md. May spawn new hypotheses. |
| Open question | Mechanism identified but explanation incomplete; requires different tooling to resolve | Mark explicitly in FINDINGS.md with proposed tooling/experiment. |

The Audit Step

After analyzing results, EVERY experiment MUST audit findings against docs/contributing/standards/:

  1. Do any findings reveal violations of existing rules or principles?
  2. Do any findings suggest a new rule, invariant, or principle is needed?
  3. Do any findings confirm that existing rules/invariants hold under new conditions?

This audit is what makes experiments a feedback loop into the standards. Example: H3 confirmed that the llm-d default config is robust (confirmation) AND revealed that KV utilization is stale at high rates (design limitation -> new rule R17 + new invariant INV-7 + 3 issues).


Experiment Artifacts

Each hypothesis experiment lives in hypotheses/<name>/ within its feature branch with:

| File | Purpose |
|---|---|
| run.sh | Self-contained script: builds binary, runs all variants, calls analyzer |
| analyze.py | Output parser producing formatted comparison tables |
| FINDINGS.md | Results, root cause analysis, findings classification, standards audit |
| *.yaml (optional) | Custom workload specs for this experiment |

Scripts must be reproducible — running ./run.sh on the same commit produces deterministic output.
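One simple way to check this property is to run the experiment twice and compare digests of the outputs. The sketch below shows only the comparison step, using temporary files in place of real run output; the helper names, file names, and the ttft_ms field are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def byte_identical(first: Path, second: Path) -> bool:
    """True if both files contain exactly the same bytes."""
    return sha256_of(first) == sha256_of(second)


# Stand-ins for the output of two runs on the same commit and seed.
with tempfile.TemporaryDirectory() as tmp:
    run1 = Path(tmp) / "run1.json"
    run2 = Path(tmp) / "run2.json"
    run1.write_text('{"ttft_ms": 12.5}\n')
    run2.write_text('{"ttft_ms": 12.5}\n')
    same = byte_identical(run1, run2)

    # Any change in output, however small, breaks byte identity.
    run2.write_text('{"ttft_ms": 12.6}\n')
    differs = not byte_identical(run1, run2)

print("identical:", same, "| detects change:", differs)
```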

Artifact lifecycle: Experiment artifacts are not merged to main. They stay in the feature branch; the PR itself is the permanent, linkable record. Completed experiments are collected in the hypothesis-archive branch.