Hypothesis Experiment Template

For Claude: Use this template when creating a new hypothesis experiment in hypotheses/<name>/.

FINDINGS.md Structure

Every experiment's FINDINGS.md MUST contain these sections:

# <Hypothesis Name>

**Status:** Confirmed | Confirmed with nuance | Partially confirmed | Refuted | Inconclusive
**Resolution:** <one of: Clean confirmation | Confirmation with wrong mechanism | Confirmation with bug discovery | Partial confirmation with surprise | Refuted — mechanism not plausible | Refuted — system design flaw | Refuted — wrong mental model | Inconclusive — parameter-dependent | Converged to open question>
**Family:** <one of: Workload/arrival | Scheduler invariants | Performance-regime | Structural model | Robustness/failure-mode | Cross-policy comparative>
**VV&UQ:** <one of: Verification | Validation | UQ>
**Tier:** <tier number — see hypothesis-archive branch for definitions>
**Type:** Deterministic | Statistical (<subtype>)
**Date:** YYYY-MM-DD
**Rounds:** <number of experiment-review rounds to convergence>

## Hypothesis

> <Quoted hypothesis statement — intuitive claim about system behavior>

## Experiment Design

**Classification:** <Deterministic | Statistical/Dominance | Statistical/Monotonicity | Statistical/Equivalence | Statistical/Pareto>

**Configurations compared:**
- A: <description + exact CLI flags>
- B: <description + exact CLI flags>

**Controlled variables:** <what is held constant>
**Varied variable:** <what differs between A and B>
**Seeds:** <list of seeds used>
**Preconditions verified:** <what was checked before running>

## Results

<Comparison tables with per-seed values>

## Root Cause Analysis

<Why the results are what they are — trace through the code/architecture.
Every causal claim MUST cite file:line (RCV-1).
Every "surprise" MUST include a first-principles calculation (RCV-2).
Must explain the mechanism AND its direction (RCV-3).
If a mechanism is proposed, describe the control experiment that would confirm it (RCV-4).>

## Devil's Advocate (RCV-5)

<Before sending to review, argue the OPPOSITE of your conclusion.>

**If this is "Confirmed," argue why it might be Refuted:**
<2-3 sentences>

**If this is "Refuted," argue why it might be Confirmed:**
<2-3 sentences>

## Findings Classification

| Finding | Type | Action |
|---------|------|--------|
| <finding 1> | Confirmation / Bug / New rule / New invariant / Design limitation / Surprise / Open question | <issue number or "documented here"> |

## Standards Audit

Findings checked against docs/contributing/standards/:
- [ ] Any violations of existing rules? <list or "none found">
- [ ] Any new rules needed? <list or "none">
- [ ] Any new invariants needed? <list or "none">
- [ ] Any existing rules/invariants confirmed? <list or "none">

## Scope and Limitations (RCV-6)

- **Operating point tested:** <blocks, rate, seeds, instances, routing, etc.>
- **Parameters findings depend on:** <what must be true for these results to hold>
- **What was NOT tested:** <parameter ranges, workloads, configs not covered>
- **Generalizability:** <does this finding generalize, or is it specific to this config?>
- **Uncertainty quantification:** <for any threshold or boundary finding, report confidence intervals. For any "confirmed" result, estimate the probability of holding under parameter variation. If UQ was not performed, state "UQ not performed — single operating point.">

## Evidence Quality

| Metric | Value | Confidence |
|--------|-------|------------|
| <primary metric> | <value> | High / Medium / Low — <why> |
| Sample size | <seeds × configs × requests> | <assessment> |
| Mechanism | <proposed mechanism> | <confidence + whether control confirms> |

## Implications for Users

<Practical guidance derived from this experiment>

## Reproducing

cd hypotheses/<name>
./run.sh
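
The rule that every FINDINGS.md must contain these sections can be checked mechanically. The following is a minimal sketch, not part of the harness: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the heading list is copied from the template above on the assumption that each section is written as a level-2 markdown heading.

```python
import re

# Level-2 headings copied from the FINDINGS.md template above.
# The "(RCV-n)" suffixes are omitted so a heading matches with or without one.
REQUIRED_SECTIONS = [
    "Hypothesis",
    "Experiment Design",
    "Results",
    "Root Cause Analysis",
    "Devil's Advocate",
    "Findings Classification",
    "Standards Audit",
    "Scope and Limitations",
    "Evidence Quality",
    "Implications for Users",
    "Reproducing",
]


def missing_sections(findings_text: str) -> list[str]:
    """Return the required section names that have no matching '## ' heading."""
    headings = re.findall(
        r"^##[ \t]+(.+?)[ \t]*$", findings_text, flags=re.MULTILINE
    )
    # Drop a trailing "(RCV-5)"-style tag before comparing.
    normalized = {re.sub(r"\s*\(RCV-\d+\)$", "", h) for h in headings}
    return [name for name in REQUIRED_SECTIONS if name not in normalized]
```

Passing the text of a FINDINGS.md file to `missing_sections` returns an empty list when every section is present. The check covers headings only; it does not validate the metadata block (Status, Resolution, Family, and so on) or the content of each section.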

run.sh Structure

#!/bin/bash
# <Hypothesis name>
# <One-line description>
# Usage: ./run.sh [--rebuild]

set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/../lib/harness.sh"

setup_experiment "${1:-}"

# -- Experiment sections -----------------------------------------------
# Each experiment: use blis_run with appropriate timeout tier.
# NOTE: the harness function is blis_run, not run_sim; define your own run_sim() wrapper if needed.
#
# Example (basic):
#   blis_run $TIMEOUT_STANDARD "$RESULTS_DIR/config_a.txt" \
#       --model "$MODEL" --num-instances 4 --seed 42 \
#       --workload-spec "$WORKLOAD_YAML" --log error
#
# Example (with stderr capture for robustness experiments):
#   blis_run $TIMEOUT_STANDARD "$RESULTS_DIR/config_a.txt" \
#       --stderr "$RESULTS_DIR/config_a_stderr.txt" \
#       --model "$MODEL" --num-instances 4 --seed 42 --log error
#
# Example (with per-request JSON):
#   blis_run $TIMEOUT_STANDARD "$RESULTS_DIR/config_a.txt" \
#       --model "$MODEL" --num-instances 4 --seed 42 --log error \
#       --metrics-path "$RESULTS_DIR/config_a_results.json"
#
# Example (robustness/stress — non-zero exit expected, use || true under set -e):
#   blis_run $TIMEOUT_EXTENDED "$RESULTS_DIR/stress.txt" \
#       --stderr "$RESULTS_DIR/stress_stderr.txt" \
#       --model "$MODEL" --num-instances 4 --seed 42 --log error || true
#
# For KV-constrained experiments, add pre-flight check (advisory, never aborts):
#   preflight_kv_check 800 16 512  # total_blocks, block_size (default: 16), max_input
# ----------------------------------------------------------------------
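
The template does not show what `preflight_kv_check` computes, so the following is only a back-of-envelope illustration of why its three arguments (total_blocks, block_size, max_input) matter together. It assumes a block-based KV cache in which each request occupies a whole number of blocks. The function names are hypothetical and are not the harness API.

```python
def blocks_for_input(max_input_tokens: int, block_size: int = 16) -> int:
    """Whole blocks needed to hold max_input_tokens (ceiling division)."""
    if max_input_tokens < 0 or block_size <= 0:
        raise ValueError("max_input_tokens must be >= 0 and block_size > 0")
    return -(-max_input_tokens // block_size)


def max_resident_inputs(
    total_blocks: int, max_input_tokens: int, block_size: int = 16
) -> int:
    """How many maximum-length inputs fit in the cache at the same time.

    Assumption: only input tokens are counted; blocks consumed by generated
    output tokens are ignored, so this is an optimistic upper bound.
    """
    per_request = blocks_for_input(max_input_tokens, block_size)
    if per_request == 0:
        raise ValueError("max_input_tokens must be > 0")
    return total_blocks // per_request
```

With the values from the template comment (800 blocks, block size 16, max input 512), one maximum-length input needs 32 blocks, so at most 25 such inputs fit at once. A small result here suggests the experiment is running in a KV-constrained regime.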

analyze.py Structure

#!/usr/bin/env python3
"""Analysis script for <hypothesis name>.

Parses BLIS multi-block output and produces comparison tables.
"""
import sys
from pathlib import Path

# Import shared helpers
sys.path.insert(0, str(Path(__file__).resolve().parent.parent / "lib"))
from analyze_helpers import parse_blis_output, check_for_timeout

# -- Analysis code --------------------------------------------------------
# Use parse_blis_output(filepath) to get metrics dict.
# The dict includes a 'timed_out' flag — check it before computing ratios.
#
# Example:
#   metrics = parse_blis_output(sys.argv[1])
#   if metrics["timed_out"]:
#       print("  SKIPPED (timeout)", file=sys.stderr)
#   else:
#       print(f"  TTFT mean: {metrics['ttft_mean']:.2f} ms")
# -------------------------------------------------------------------------
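
As one possible way to turn parsed metrics into the per-seed comparison table that FINDINGS.md asks for, the sketch below computes B/A ratios and skips any seed where either run timed out. It assumes each run has already been parsed into a dict with the `timed_out` flag and a numeric metric such as `ttft_mean`, as described above. The function names and the seed-keyed input layout are hypothetical.

```python
def per_seed_ratios(runs_a, runs_b, metric="ttft_mean"):
    """Compute B/A ratios per seed, skipping seeds where either run timed out.

    runs_a and runs_b map seed -> metrics dict. Each dict carries a
    'timed_out' flag plus numeric metrics. Assumes a non-zero baseline
    value for every run that did not time out.
    """
    rows = []
    for seed in sorted(runs_a.keys() & runs_b.keys()):
        a, b = runs_a[seed], runs_b[seed]
        if a["timed_out"] or b["timed_out"]:
            # Keep the seed in the table so the gap is visible to reviewers.
            rows.append((seed, None, None, None))
            continue
        rows.append((seed, a[metric], b[metric], b[metric] / a[metric]))
    return rows


def format_table(rows, metric="ttft_mean"):
    """Render the rows as a markdown table suitable for the Results section."""
    lines = [
        f"| Seed | A {metric} | B {metric} | B/A |",
        "|------|---|---|-----|",
    ]
    for seed, a, b, ratio in rows:
        if ratio is None:
            lines.append(f"| {seed} | - | - | SKIPPED (timeout) |")
        else:
            lines.append(f"| {seed} | {a:.2f} | {b:.2f} | {ratio:.2f}x |")
    return "\n".join(lines)
```

Only seeds present in both configurations are compared. Timed-out seeds stay in the table as explicit SKIPPED rows, so a reader can see how many seeds actually support the reported ratio.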