Strategy Evolution¶
A structured, iterative search methodology for discovering high-performing system configurations in complex, multi-dimensional policy spaces. The methodology structure is domain-agnostic; the specific instantiation described here was developed for LLM inference serving using BLIS.
Overview¶
In systems with multiple interacting policy layers — routing, scheduling, memory management, admission control — the optimal configuration cannot be derived analytically or guessed by experts. Interactions between layers produce non-obvious emergent behaviors: super-additive effects, signal cancellation, and regime-dependent dominance.
Strategy Evolution discovers optimal configurations through disciplined experimentation: human-guided mechanism design combined with machine-guided parameter optimization, organized into iterative cycles with rigorous measurement and cumulative principle extraction.
The central idea: a strategy is a hypothesis bundle. Every candidate mechanism is formulated as a set of testable predictions — a main hypothesis, ablation hypotheses, controls, and robustness checks — all designed before any code is written. Prediction errors, not just fitness scores, are the primary signal for learning.
```mermaid
flowchart TD
    P1["Phase 1<br/>Problem Framing"] --> P2["Phase 2<br/>Hypothesis Bundle Design"]
    P2 --> P3["Phase 3<br/>Implement & Verify"]
    P3 --> P4["Phase 4<br/>Bayesian Optimization"]
    P4 --> P5["Phase 5<br/>Principles & Iteration"]
    P5 -->|Iterate| P2
    P5 -->|Converged| Done["Definitive Strategy<br/>+ Principles"]
    style P2 fill:#e1f5fe
    style P3 fill:#e8f5e9
    style P5 fill:#fff3e0
```
For detailed examples of hypothesis bundles drawn from PR #452 and PR #447, see Hypothesis Bundles in Practice.
Phase 1: Problem Framing¶
Write a precise problem statement (problem.md) that specifies:
- The baseline to beat — exact configuration parameters, measured across 3+ seeds
- The target workload — designed to prevent shortcutting (see below)
- Quantitative success criteria — e.g., ">15% TTFT P99 improvement, >5% throughput improvement"
- Hard constraints — must be implementable, defensible to domain experts, robust across seeds
- Prior knowledge inventory — experimental findings that narrow the design space
Workload Design is Critical
Design the workload to prevent strategies from gaming the metric. In our work, we used orthogonal SLO tiers — all tiers share identical request shapes, so strategies cannot use token-length as a proxy for priority. The SLO class metadata is the only differentiator.
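A minimal sketch of this orthogonality property (the tier names, length ranges, and per-tier seeding scheme here are illustrative assumptions, not the workload actually used):

```python
import random

def make_workload(tiers, n_per_tier, seed=0):
    """Generate requests whose shapes are identical across SLO tiers.

    Every tier draws from the same length distribution with the same
    seed, so token length carries no information about priority --
    only the slo_class metadata differentiates requests.
    """
    requests = []
    for tier in tiers:
        rng = random.Random(seed)  # same seed per tier => identical shapes
        for _ in range(n_per_tier):
            requests.append({
                "slo_class": tier,
                "prompt_tokens": rng.randint(128, 2048),
                "output_tokens": rng.randint(16, 512),
            })
    return requests

reqs = make_workload(["critical", "standard", "batch"], n_per_tier=100)
shapes = {
    t: [(r["prompt_tokens"], r["output_tokens"]) for r in reqs if r["slo_class"] == t]
    for t in ["critical", "standard", "batch"]
}
# Identical shape sequences across tiers: length cannot proxy for priority.
assert shapes["critical"] == shapes["standard"] == shapes["batch"]
```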
Artifacts produced: problem.md, baseline.json
Phase 2: Hypothesis Bundle Design¶
Phase 2 is the heart of Strategy Evolution. Each iteration generates candidate strategies and formulates the winner as a hypothesis bundle — a set of testable predictions designed before any code is written.
Step 2a: Generate and review candidates¶
Generate 2–3 candidate strategies. Each must be:
- A parameterized template — the mechanism defines what the strategy does; tunable parameters control how aggressively
- Self-critiqued — identify weaknesses before external review
- Reviewed by multiple independent judges — we used Claude Opus, GPT-4o, and Gemini 2.5 Flash
```mermaid
sequenceDiagram
    participant A as Author
    participant J1 as Judge 1 (Claude)
    participant J2 as Judge 2 (GPT-4o)
    participant J3 as Judge 3 (Gemini)
    A->>A: Generate 2-3 candidates + self-critique
    par Multi-judge review
        A->>J1: Candidate designs
        A->>J2: Candidate designs
        A->>J3: Candidate designs
        J1-->>A: Design flaws + improvements
        J2-->>A: Design flaws + improvements
        J3-->>A: Design flaws + improvements
    end
    A->>A: Select winner by consensus
```
Multi-Judge Catches Real Design Flaws
- Claude Opus identified that a proposed cache score (`1 - KVUtilization`) was a free-capacity signal, not a cache-affinity signal — a subtle conflation that would have wasted an entire iteration
- GPT-4o caught a bang-bang oscillation problem in an online learning controller
- Gemini caught a numerical instability in hyperbolic starvation protection
Step 2b: Decompose winner into hypothesis bundle¶
After selecting the winning strategy, decompose it into a hypothesis bundle — a set of testable, falsifiable predictions:
```mermaid
flowchart LR
    S["Selected Strategy"]
    S --> HM["H-main<br/>Mechanism claim +<br/>predicted effect size"]
    S --> HA["H-ablation<br/>One per component:<br/>isolate contribution"]
    S --> HSA["H-super-additivity<br/>Compound effect vs<br/>sum of parts"]
    S --> HC["H-control-negative<br/>Condition where effect<br/>should vanish"]
    S --> HR["H-robustness<br/>Generalization<br/>boundaries"]
    style HM fill:#c8e6c9
    style HA fill:#ffecb3
    style HSA fill:#ffecb3
    style HC fill:#e1bee7
    style HR fill:#bbdefb
```
| Arm | What it tests | Purpose |
|---|---|---|
| H-main | Mechanism's predicted effect + causal explanation | Does the strategy work, and why? |
| H-ablation-{component} | Each component's individual contribution | Which parts matter? Are any redundant? |
| H-super-additivity | Whether compound effect exceeds sum of parts | Do components interact? |
| H-control-negative | Where the effect should vanish | Confirms mechanism specificity |
| H-robustness | Generalization across workloads, resources, scale | Where does the strategy break? |
Each arm follows the hypothesis experiment workflow: experiment design standards (ED-1 through ED-6), convergence-gated review, formal controls, FINDINGS documentation. Every arm includes a diagnostic clause ("if this fails, it indicates...") that directs investigation when predictions don't match outcomes.
Pre-commit ablation, don't bolt it on
Ablation hypotheses are designed HERE, before any code is written. This prevents confirmation bias — you predict each component's contribution before seeing whether the compound strategy works. If you can't articulate what removing a component should do, you don't understand the mechanism well enough to implement it.
For a complete worked example of a hypothesis bundle, see Hypothesis Bundles in Practice — Scheduling Example.
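As a concrete illustration, a bundle file might look like the sketch below. This is hypothetical: the field names, arm IDs, predictions, and diagnostics are assumptions for illustration, not the schema used in the repository.

```yaml
# iter<N>-bundle.yaml -- illustrative sketch, not a defined schema
strategy: "slo-gated-priority-cascade"
arms:
  - id: H-main
    prediction: "TTFT P99 improves >15% vs baseline via priority cascade"
    diagnostic: "if refuted, the priority signal is not reaching the scheduler"
  - id: H-ablation-age-weight
    prediction: "removing age weighting costs <5% of the compound effect"
    diagnostic: "if larger, starvation protection is load-bearing"
  - id: H-super-additivity
    prediction: "compound effect exceeds the sum of component ablations"
  - id: H-control-negative
    prediction: "effect vanishes on a single-tier workload"
    diagnostic: "if the effect persists, the mechanism is confounded"
  - id: H-robustness
    prediction: "effect holds at 2x request rate"
```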
Step 2c: Design Review¶
Run the 5-perspective Design Review on the hypothesis bundle using the universal convergence protocol.
The review covers the full bundle — not just H-main, but all ablation, control, and robustness arms.
Step 2d: Human approval gate¶
Present the hypothesis bundle for human approval. The human reviews: H-main predictions, ablation predictions, control designs, and robustness boundaries.
This is a hard gate. Do not proceed to implementation until the human approves.
Artifacts produced: research.md, iter<N>-bundle.yaml, Design Review convergence log
Phase 3: Implement and Verify¶
Implement the strategy code AND the experiment code for the full hypothesis bundle, then execute, analyze, and review — all before moving to parameter optimization.
```mermaid
flowchart TD
    A["3a: Implement strategy code"]
    B["3b: Implement experiment code"]
    C["3c: Code Review<br/>(5 perspectives)"]
    D["3d: Execute all arms<br/>(parallel, 3+ seeds)"]
    E["3e: Compare predictions<br/>to outcomes"]
    F["3f: Document FINDINGS.md"]
    G["3g: FINDINGS Review<br/>(10 perspectives)"]
    H["3h: Self-audit<br/>(6 dimensions)"]
    I["3i: Record in ledger"]
    A --> B --> C --> D --> E --> F --> G --> H --> I
    style C fill:#fff3e0
    style E fill:#e8f5e9
    style G fill:#fff3e0
```
3a — Implement strategy. New policy code, CLI flags, or configuration.
3b — Implement experiments. For each hypothesis arm, create run.sh (sourcing the shared harness) and analyze.py. Arms are independent of one another and can execute in parallel.
3c — Code Review. 5-perspective review on all experiment code BEFORE running experiments (/convergence-review h-code).
3d — Execute. Run all hypothesis arms across 3+ seeds.
3e — Compare predictions to outcomes. For each arm, record:
- What was predicted (from the bundle design in Phase 2)
- What was observed (from the experiment)
- Whether the prediction was confirmed (direction correct AND exceeds threshold), partially confirmed (direction correct, magnitude falls short), or refuted (direction wrong, or magnitude negligible at <5%)
- If refuted: what the discrepancy reveals about the causal model
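The confirmed / partially confirmed / refuted classification above can be sketched as a small function. The 5% noise floor comes from the text; encoding effects as fractional deltas is an assumption for the sketch.

```python
def classify_prediction(predicted_delta, observed_delta, threshold, noise_floor=0.05):
    """Classify a hypothesis arm's outcome against its prediction.

    predicted_delta / observed_delta: fractional improvements (0.15 = 15%).
    threshold: the effect size the bundle predicted.
    noise_floor: below this magnitude, the effect is treated as negligible.
    """
    same_direction = (predicted_delta > 0) == (observed_delta > 0)
    if not same_direction or abs(observed_delta) < noise_floor:
        return "refuted"       # direction wrong or magnitude negligible
    if abs(observed_delta) >= threshold:
        return "confirmed"     # direction correct and exceeds threshold
    return "partially confirmed"  # direction correct, magnitude falls short

assert classify_prediction(0.15, 0.21, threshold=0.15) == "confirmed"
assert classify_prediction(0.15, 0.08, threshold=0.15) == "partially confirmed"
assert classify_prediction(0.15, -0.04, threshold=0.15) == "refuted"
```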
3f — Document FINDINGS. Write FINDINGS.md using the hypothesis template. The prediction-vs-outcome comparison is a required section.
3g — FINDINGS Review. 10-perspective review using the convergence protocol (/convergence-review h-findings).
3h — Self-audit. Six dimensions of critical self-review.
3i — Record in ledger. One row per iteration, including prediction accuracy.
The Ledger is the Single Source of Truth
Never delete rows. Failed approaches are as valuable as successes. The ledger prevents revisiting failed approaches and makes the full exploration path auditable.
Extended ledger format:
| Iter | Strategy | TTFT P99 Δ% | Throughput Δ% | Key Mechanism | Prediction Accuracy | Status |
|---|---|---|---|---|---|---|
| 0 | Baseline | — | — | FCFS + constant priority | — | Measured |
| 1 | SLO-Gated Priority | -50.8% | ~0% | Priority cascade | H-main confirmed; H-zero-sum refuted (zero-sum) | Bundle verified |
Artifacts produced: ledger.md, iter<N>-FINDINGS.md, convergence logs
Phase 4: Bayesian Parameter Optimization¶
Once a mechanism's H-main is confirmed in Phase 3, optimize its parameters. Mechanisms whose H-main was refuted skip this phase — their prediction errors are analyzed for principles instead.
```mermaid
flowchart TD
    A["Define parameter ranges<br/>in strategy YAML"] --> B["Bayesian optimizer<br/>selects next point"]
    B --> C["Run simulator<br/>3 seeds x N params"]
    C --> D{"Constraint<br/>violation?"}
    D -->|Yes| E["Add penalty<br/>to objective"]
    D -->|No| F["Record metric"]
    E --> G["Update GP<br/>surrogate model"]
    F --> G
    G -->|Budget remaining| B
    G -->|Budget exhausted| H["Extract best<br/>parameters"]
```
This separates mechanism design (human creativity + hypothesis testing) from parameter tuning (machine search). Every strategy gets the benefit of optimization, so comparisons are fair.
Strategy YAML format:
```yaml
name: "slo-gated-priority-cascade"
parameters:
  - name: "base_critical"
    flag: "--slo-base-critical"
    type: "real"
    low: 5.0
    high: 20.0
  - name: "age_weight"
    flag: "--slo-age-weight"
    type: "real"
    low: 0.000001
    high: 0.0001
    prior: "log-uniform"
objective:
  metric: "ttft_p99_ms"
  direction: "min"
constraints:
  - metric: "throughput_tps"
    direction: "max"
    threshold: 15000
    weight: 10
```
Budget: 30–50 evaluations × 3 seeds = 90–150 simulator runs per strategy.
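One plausible reading of the objective and constraint fields is a penalized scalarization, sketched below. The linear penalty shape is an assumption; the YAML format does not pin it down.

```python
def penalized_objective(metrics, objective, constraints):
    """Scalarize a run's metrics for the Bayesian optimizer.

    Minimizes the objective metric, adding a weighted linear penalty
    for each violated constraint (the penalty shape is an assumption).
    """
    value = metrics[objective["metric"]]
    if objective["direction"] == "max":
        value = -value  # the optimizer always minimizes
    for c in constraints:
        observed = metrics[c["metric"]]
        if c["direction"] == "max" and observed < c["threshold"]:
            value += c["weight"] * (c["threshold"] - observed)
        elif c["direction"] == "min" and observed > c["threshold"]:
            value += c["weight"] * (observed - c["threshold"])
    return value

objective = {"metric": "ttft_p99_ms", "direction": "min"}
constraints = [{"metric": "throughput_tps", "direction": "max",
                "threshold": 15000, "weight": 10}]

ok = penalized_objective({"ttft_p99_ms": 420.0, "throughput_tps": 15200},
                         objective, constraints)
bad = penalized_objective({"ttft_p99_ms": 390.0, "throughput_tps": 14800},
                          objective, constraints)
assert ok == 420.0                 # constraint satisfied: raw objective
assert bad == 390.0 + 10 * 200     # lower TTFT, penalized for throughput miss
```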
Artifacts produced: optimize.py, *-optimization-results.json
Phase 5: Principle Extraction and Iteration¶
Extract principles from the iteration's results and decide whether to iterate.
Principle extraction¶
Distill findings into numbered principles — concise, falsifiable statements grounded in experimental evidence. Principles come from two sources:
- Confirmed predictions — the mechanism works as theorized (e.g., RP-1: "Orthogonal signals > pre-combined signals")
- Prediction errors — the most valuable source. When a prediction fails, the discrepancy reveals something about the system that wasn't understood (e.g., S6: "Scheduling is zero-sum at saturation")
```mermaid
flowchart TD
    subgraph "Phase 3 outputs"
        C["Confirmed predictions"]
        R["Refuted predictions"]
    end
    C --> P1["New principle<br/>(mechanism verified)"]
    R --> D["Discrepancy analysis"]
    D --> P2["New principle<br/>(causal model corrected)"]
    D --> RD["Redesign<br/>(next iteration)"]
    P1 --> PC["Principles Catalog"]
    P2 --> PC
    PC -->|Constrains| Next["Phase 2<br/>(next iteration)"]
    style R fill:#ffcdd2
    style D fill:#fff3e0
    style P2 fill:#c8e6c9
```
Principles function as hard constraints on subsequent iterations:
- RP-1 (orthogonality) prevented building a combined cache-load scorer in iterations 5–19
- RP-6 (KV-util counterproductive) eliminated KV-utilization from all subsequent strategies
- S6 (scheduling is zero-sum) redirected effort from scheduler optimization to admission control
- RP-10 (PA:QD safety rule) prevented ratio violations in Bayesian search bounds
When a new iteration proposes a mechanism that contradicts an existing principle, it must either provide experimental evidence that the principle doesn't hold in the new regime, or redesign to work within the principle's constraints.
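A lightweight way to enforce such constraints mechanically is a principle gate over a proposed strategy's signal set. This sketch is illustrative: the signal-set encoding and the RP-6 entry are assumptions, not the catalog's actual representation.

```python
# Hypothetical principle gate: maps principle IDs to signals they forbid.
FORBIDDEN_SIGNALS = {
    "RP-6": {"kv_utilization"},  # KV-utilization scorer is counterproductive
}

def violates_principles(strategy_signals):
    """Return the principle IDs that a proposed strategy's signals contradict.

    A non-empty result means the proposal must either bring evidence that
    the principle fails in the new regime, or be redesigned.
    """
    return [pid for pid, banned in sorted(FORBIDDEN_SIGNALS.items())
            if banned & set(strategy_signals)]

assert violates_principles(["prefix_affinity", "queue_depth"]) == []
assert violates_principles(["prefix_affinity", "kv_utilization"]) == ["RP-6"]
```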
Fast-fail rules¶
| Condition | Action |
|---|---|
| H-main refuted | A refutation still requires at least a 5-perspective FINDINGS Review to confirm it is genuine (not an analyzer bug). After verification, skip the ablation arms, record the refutation and diagnostic, and iterate. |
| H-main confirmed, single dominant component (>80% of effect) | Simplify strategy to that component. Iterate. |
| H-control-negative fails (effect appears when it shouldn't) | Mechanism is confounded. Redesign before continuing. |
Stopping criterion¶
Stop when multiple consecutive iterations produce null or marginal results — you have found the basin of the optimal strategy. The principles catalog is the durable output.
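The stopping rule can be sketched as a check over the ledger's per-iteration deltas. The window size and the 2% marginality cutoff are illustrative assumptions; the text only says "multiple consecutive iterations."

```python
def should_stop(ledger_deltas, window=3, marginal=0.02):
    """Stop when the last `window` iterations were all null or marginal.

    ledger_deltas: per-iteration improvement fractions read off the ledger.
    window / marginal: illustrative defaults, not prescribed by the method.
    """
    if len(ledger_deltas) < window:
        return False
    return all(abs(d) <= marginal for d in ledger_deltas[-window:])

assert not should_stop([0.50, 0.12, 0.01])           # a real win in the window
assert should_stop([0.50, 0.12, 0.01, 0.00, 0.015])  # basin reached
```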
Practical Considerations¶
Bundle size management¶
The number of hypothesis arms scales with strategy complexity (see the Bundle Size Guide for detailed breakdowns). Three mitigations keep this manageable:
1. Tiered review depth. H-main gets the full 10-perspective FINDINGS review. Ablation arms get a 5-perspective review. Byte-identical controls get a spot-check that the treatment was active, then automatic validation.
2. Parallel execution. Hypothesis arms are independent. Use parallel execution mode. All arms must use the same seed set for valid comparison.
3. Fast-fail. If H-main is refuted (after FINDINGS review confirms the refutation), skip remaining arms.
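Taken together, the independence, shared-seed, and fast-fail rules suggest an execution loop like the following sketch. `run_arm` is a hypothetical stand-in for the real per-arm run.sh harness; the confirmation logic is a stub.

```python
from concurrent.futures import ThreadPoolExecutor

SEEDS = [1, 2, 3]  # every arm uses the same seed set for valid comparison

def run_arm(arm, seed):
    """Stand-in for `run.sh <arm> <seed>`; a real harness launches the simulator."""
    return {"arm": arm, "seed": seed, "confirmed": arm != "H-control-negative"}

def run_bundle(arms):
    # H-main runs first: if it is refuted (and the refutation survives
    # FINDINGS review), the remaining arms are skipped (fast-fail).
    main = [run_arm("H-main", s) for s in SEEDS]
    if not all(r["confirmed"] for r in main):
        return {"H-main": main}  # record refutation + diagnostic, iterate
    rest = [a for a in arms if a != "H-main"]
    with ThreadPoolExecutor() as pool:  # arms are independent: run in parallel
        results = {a: list(pool.map(lambda s, a=a: run_arm(a, s), SEEDS))
                   for a in rest}
    results["H-main"] = main
    return results

out = run_bundle(["H-main", "H-ablation-age", "H-control-negative"])
assert set(out) == {"H-main", "H-ablation-age", "H-control-negative"}
assert all(len(v) == len(SEEDS) for v in out.values())
```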
When to use the full bundle vs. a lighter iteration¶
| Iteration type | Required arms | Optional arms |
|---|---|---|
| New compound mechanism (≥2 components) | H-main, all H-ablation, H-super-additivity, H-control-negative | H-robustness |
| Component removal/simplification | H-main, H-control-negative, H-ablation for removed component | H-robustness |
| Single-component mechanism | H-main, H-control-negative | H-robustness |
| Parameter-only change (same active components, only numeric values change) | H-main only | Proceed directly to Bayesian optimization |
| Robustness sweep (post-confirmation) | H-robustness arms only | — |
How Strategy Evolution connects to existing workflows¶
| Skill / Process | Role in Strategy Evolution |
|---|---|
| `/research-ideas` | Generates candidate strategies (Step 2a) |
| `/hypothesis-experiment` | Guided Steps 0-10 experiment workflow (project skill) |
| `/hypothesis-test` | Experiment scaffolding: run.sh, analyze.py templates (sdlc-plugins) |
| `/convergence-review` | Gates Design Review, Code Review, FINDINGS Review |
| `/brainstorming` | Explores solution space before committing to candidates |
| Hypothesis experiment process | Defines the detailed per-arm workflow |
Skills and Tools Inventory¶
| Skill | Phase | Purpose |
|---|---|---|
| `/research-ideas` | 2a | Structured idea generation with iterative external LLM review |
| `/brainstorming` | 2a | Explore solution space before committing to an approach |
| `/review-plan` | 2a | Send candidate designs to external LLMs for technical review |
| `/convergence-review` (h-design) | 2c | 5-perspective design review on hypothesis bundle |
| `/hypothesis-experiment` | 3b | Guided Steps 0-10 experiment workflow (project skill) |
| `/hypothesis-test` | 3b | Experiment scaffolding: run.sh, analyze.py templates (sdlc-plugins) |
| `/test-driven-development` | 3a | TDD for new policy implementations |
| `/convergence-review` (h-code) | 3c | 5-perspective code review on experiment code |
| `/convergence-review` (h-findings) | 3g | 10-perspective findings review |
| `/verification-before-completion` | 3h | Confirm results before claiming success |
| `/dispatching-parallel-agents` | 3d | Parallel hypothesis arm execution |
| `/commit-push-pr` | 5 | Clean git integration after validation |
Where to Get These Skills
These skills are Claude Code plugins. To install them:
- `/brainstorming`, `/test-driven-development`, `/verification-before-completion`, `/dispatching-parallel-agents`, `/commit-push-pr`: Install the superpowers plugin — `claude plugins add superpowers`
- `/convergence-review`, `/hypothesis-experiment`: Project-local skills defined in this repository's `.claude/skills/` directory. Available automatically when Claude Code is run from the repo root.
- `/hypothesis-test`: Install the sdlc-plugins plugin — `claude plugins add sdlc-plugins`
- `/research-ideas`, `/review-plan`: Install the research-ideas plugin — `claude plugins add research-ideas`
Non-skill tools:
| Tool | Phase | Purpose |
|---|---|---|
| `optimize.py` + scikit-optimize | 4 | Bayesian optimization harness (`gp_minimize`) |
| `strategy_template.yaml` | 4 | Parameterized strategy configuration format |
| `ledger.md` | All | Single source of truth (1 row per iteration, with prediction accuracy) |
| Git worktrees | All | Isolation per experiment track |
Applying to a New Problem¶
1. Write `problem.md` — baseline, workload, success criteria, constraints, prior knowledge. Design the workload to prevent shortcutting.
2. Build the measurement harness — deterministic simulator or benchmark that accepts parameterized configuration, produces machine-parseable metrics, and runs fast enough for 100–200 evaluations. For noisy real-system benchmarks, increase seed count and evaluation budget proportionally.
3. Start the ledger — `ledger.md` with a baseline row. One row per iteration with a prediction accuracy column. Never delete rows.
4. Run the loop — for each iteration:
    - Generate 2-3 candidates with multi-judge review (Phase 2a)
    - Decompose winner into a hypothesis bundle with predictions, ablation, controls, and robustness checks (Phase 2b)
    - Convergence-gated Design Review + human approval (Phase 2c-2d)
    - Implement strategy + experiments, Code Review, execute all arms, compare predictions to outcomes, FINDINGS Review (Phase 3)
    - Bayesian optimization for confirmed mechanisms (Phase 4)
    - Extract principles from both confirmed predictions and prediction errors (Phase 5)
5. Know when to stop — when multiple consecutive iterations produce null or marginal results, you have found the basin of the optimal strategy. The principles catalog is the durable output.
Results: How Two Tracks Converged¶
Strategy Evolution was applied in parallel on two complementary problem spaces:
| Dimension | Scheduling Track (11 iters) | Routing Track (19 iters) |
|---|---|---|
| Primary lever | Priority ordering + admission control | Scorer weights + signal selection |
| Best result | 73.7% critical TTFT improvement | 65% combined improvement (bursty) |
| Key discovery | Priority is zero-sum; admission is non-zero-sum | KV-utilization scorer is counterproductive |
| Winning strategy | SLO-tiered priority + no-chunk prefill | pa:4,qd:3 + SLO-gated admission |
Both tracks converged on SLO-gated admission control as the breakthrough "third lever" and on prefix-affinity + queue-depth as the optimal signal pair.
```mermaid
gantt
    title Two-Track Convergence Timeline
    dateFormat X
    axisFormat %s
    section Scheduling Track
    Baseline + SLO priority :s1, 0, 2
    Bayesian optimization :s2, 2, 3
    Admission control discovery :crit, s3, 3, 5
    Prefill threshold tuning :s4, 5, 7
    Robustness + multi-turn :s5, 7, 11
    section Routing Track
    P2C + dynamic weights :r1, 0, 3
    Signal orthogonality :r2, 3, 5
    KV-util counterproductive :crit, r3, 5, 8
    Compound + admission :crit, r4, 8, 14
    Bayesian + scaling sweep :r5, 14, 19
    section Convergence
    Both discover admission ctrl :milestone, 10, 10
    Both drop kv-utilization :milestone, 12, 12
```
Experimental Configurations
The winning strategies described above were discovered during Strategy Evolution experiments using custom configurations. Some components (SLO-gated admission, SLO-tiered priority as compound strategies) are not yet available as standard BLIS policy templates. The current BLIS default (pa:3,qd:2,kv:2) is maintained for llm-d parity. The regime-dependent recommendation (normal KV: pa:3,qd:2,kv:2; under pressure: pa:3,qd:2; high load with admission: pa:4,qd:3) will be documented in the principles catalog (currently in draft).