BLIS Extension Recipes
Step-by-step guides for extending BLIS. Each recipe lists the exact files to touch, the order, and examples to follow.
Adding New Policy Templates
To add a new policy template (e.g., a new routing algorithm):
- Implement the interface in the corresponding file:
  - AdmissionPolicy → sim/admission.go (cluster-level: receives *RouterState with snapshots + clock)
  - RoutingPolicy → sim/routing.go (cluster-level: receives *RouterState with snapshots + clock)
  - InstanceScheduler → sim/scheduler.go (instance-level: receives requests + clock only)
  - Note: RouterState is a bridge type in sim/ that avoids import cycles; see sim/router_state.go
- Register in two places (both required):
  - Add the policy name to the valid-names map in sim/bundle.go (e.g., validRoutingPolicies) and the corresponding IsValid* function
  - Add a case to the factory function in the same policy file (e.g., NewRoutingPolicy in sim/routing.go)
  - CLI error messages auto-derive from ValidAdmissionPolicyNames() etc., so no manual update is needed
- Add tests following BDD naming: TestMyPolicy_Scenario_Behavior
  - Test observable behavior, not internal structure
  - Include an empty-snapshots panic test for routing policies (defensive programming convention)
  - Use &RouterState{Snapshots: snapshots, Clock: clock} in test setup
- Update documentation: CLAUDE.md file organization, README policy lists
Important: For composite load signals, use snap.EffectiveLoad() — never compute QueueDepth + BatchSize + InFlightRequests inline. For queue-depth-only signals, use snap.QueueDepth directly.
Examples:
- See RejectAll in sim/admission.go for a simple admission template (constant return)
- See newPrefixAffinityScorer in sim/routing_prefix_scorer.go for a stateful scorer with observer-based state updates (the prefix-affinity scorer uses a router-side PrefixCacheIndex to track per-instance block hash history)
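The recipe above can be sketched end to end in a single file. This is a minimal illustration under stated assumptions, not the BLIS code: the Request, RoutingSnapshot, and RouterState types are simplified stand-ins, the Route signature and its string return value are guesses at the interface shape, and the least-effective-load policy is hypothetical. Only the two registration points and the use of EffectiveLoad() follow the text.

```go
package main

import "fmt"

// Stand-in types: simplified assumptions, not the real definitions in sim/.
type Request struct{ ID string }

type RoutingSnapshot struct {
	ID               string
	QueueDepth       int
	BatchSize        int
	InFlightRequests int
}

// EffectiveLoad is the single place where the composite load is computed.
func (s RoutingSnapshot) EffectiveLoad() int {
	return s.QueueDepth + s.BatchSize + s.InFlightRequests
}

type RouterState struct {
	Snapshots []RoutingSnapshot
	Clock     int64
}

// RoutingPolicy is an assumed, simplified shape of the interface.
type RoutingPolicy interface {
	Route(req *Request, state *RouterState) string
}

// leastEffectiveLoad is a hypothetical policy: pick the instance with the
// lowest composite load; the first instance wins ties.
type leastEffectiveLoad struct{}

func (leastEffectiveLoad) Route(_ *Request, state *RouterState) string {
	if len(state.Snapshots) == 0 {
		panic("leastEffectiveLoad.Route: empty snapshots")
	}
	best := state.Snapshots[0]
	for _, snap := range state.Snapshots[1:] {
		if snap.EffectiveLoad() < best.EffectiveLoad() {
			best = snap
		}
	}
	return best.ID
}

// Registration touch point 1: the valid-names map.
var validRoutingPolicies = map[string]bool{"least-effective-load": true}

// Registration touch point 2: the factory switch.
func NewRoutingPolicy(name string) (RoutingPolicy, error) {
	if !validRoutingPolicies[name] {
		return nil, fmt.Errorf("unknown routing policy %q", name)
	}
	switch name {
	case "least-effective-load":
		return leastEffectiveLoad{}, nil
	default:
		return nil, fmt.Errorf("routing policy %q is registered but has no factory case", name)
	}
}

func main() {
	policy, err := NewRoutingPolicy("least-effective-load")
	if err != nil {
		panic(err)
	}
	state := &RouterState{
		Snapshots: []RoutingSnapshot{
			{ID: "a", QueueDepth: 3, BatchSize: 2, InFlightRequests: 1},
			{ID: "b", QueueDepth: 1, BatchSize: 2, InFlightRequests: 0},
		},
		Clock: 1000,
	}
	fmt.Println(policy.Route(&Request{ID: "r1"}, state)) // prints "b"
}
```

The second error path in the factory exists to catch the case where a name was added to the map but the switch was forgotten, which is the failure mode the "both required" note warns about.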
Adding New Scorers (Weighted Routing)
To add a new scoring dimension for the weighted routing policy (e.g., predicted-latency):
- Implement the scorer function in sim/routing_scorers.go (stateless) or a new file (stateful): a scorerFunc that takes (*Request, []RoutingSnapshot) and returns map[string]float64 with scores in [0,1] per instance. Stateful scorers also return an observerFunc called after each routing decision.
- Register the scorer in sim/routing_scorers.go: add to validScorerNames map + newScorerWithObserver factory switch
- Add behavioral tests: monotonicity, boundary values, INV-1/INV-2 conformance
- Extension friction: 2 touch points (implementation + registration in newScorerWithObserver). Stateful scorers (like prefix-affinity) may use a separate file (e.g., sim/routing_prefix_scorer.go) but the registration point is the same newScorerWithObserver switch in sim/routing_scorers.go.
- Stateful scorers return an observerFunc alongside the scorerFunc from newScorerWithObserver. The observerFunc has the signature func(req *Request, targetInstance string) and is called after each routing decision to update scorer state. The scorer and observer share state via closure.
Examples:
- See scoreLoadBalance in sim/routing_scorers.go for a simple stateless scorer
- See scoreQueueDepth for a scorer with edge case handling (uniform load)
- See newPrefixAffinityScorer in sim/routing_prefix_scorer.go for a stateful scorer with observer and router-side cache
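To make the scorer/observer pairing concrete, here is a minimal sketch of one stateless and one stateful scorer registered through the same factory. The scorerFunc and observerFunc signatures follow the description above; everything else is an assumption for illustration: the stand-in types, the SessionID field, the two scorer names, and the convention that a stateless scorer returns a nil observer.

```go
package main

import "fmt"

// Stand-in types: simplified assumptions, not the real definitions in sim/.
type Request struct {
	ID        string
	SessionID string // hypothetical field used only by this sketch
}

type RoutingSnapshot struct {
	ID         string
	QueueDepth int
}

type scorerFunc func(req *Request, snaps []RoutingSnapshot) map[string]float64
type observerFunc func(req *Request, targetInstance string)

// scoreInverseQueue is a hypothetical stateless scorer: 1/(1+QueueDepth)
// always falls in (0,1] and decreases as the queue grows.
func scoreInverseQueue(_ *Request, snaps []RoutingSnapshot) map[string]float64 {
	scores := make(map[string]float64, len(snaps))
	for _, s := range snaps {
		scores[s.ID] = 1.0 / (1.0 + float64(s.QueueDepth))
	}
	return scores
}

// newSessionStickyScorer is a hypothetical stateful scorer. The scorer and
// the observer share lastTarget through the closure.
func newSessionStickyScorer() (scorerFunc, observerFunc) {
	lastTarget := map[string]string{} // session ID -> instance that served it last
	score := func(req *Request, snaps []RoutingSnapshot) map[string]float64 {
		scores := make(map[string]float64, len(snaps))
		for _, s := range snaps {
			if target, ok := lastTarget[req.SessionID]; ok && target == s.ID {
				scores[s.ID] = 1.0
			} else {
				scores[s.ID] = 0.0
			}
		}
		return scores
	}
	observe := func(req *Request, targetInstance string) {
		lastTarget[req.SessionID] = targetInstance
	}
	return score, observe
}

// Registration: the name map plus the factory switch.
var validScorerNames = map[string]bool{"inverse-queue": true, "session-sticky": true}

func newScorerWithObserver(name string) (scorerFunc, observerFunc) {
	switch name {
	case "inverse-queue":
		return scoreInverseQueue, nil // stateless: no observer (assumed convention)
	case "session-sticky":
		return newSessionStickyScorer()
	default:
		panic(fmt.Sprintf("unknown scorer %q", name))
	}
}

func main() {
	score, observe := newScorerWithObserver("session-sticky")
	snaps := []RoutingSnapshot{{ID: "a"}, {ID: "b"}}
	req := &Request{ID: "r1", SessionID: "s1"}
	fmt.Println(score(req, snaps)["b"]) // 0 before any routing decision
	observe(req, "b")
	fmt.Println(score(req, snaps)["b"]) // 1 once "b" has served the session
}
```

Keeping the state in a closure rather than a package-level variable means each call to the factory gets an independent scorer, which keeps tests isolated from one another.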
Extending KV Cache Tiers
To add a new KV tier (e.g., NVMe offloading for 3-tier GPU+CPU+NVMe):
- Implement the KVStore interface in sim/kv/ (11 methods: allocate, get cached, release, capacity queries, metrics, SetClock, ConsumePendingTransferLatency)
- Compose existing tiers: e.g., wrap TieredKVCache (GPU+CPU) with NVMe logic, following the same delegation pattern
- Update the NewKVStore factory in sim/kv/register.go to instantiate your tier based on KVCacheConfig fields (add new fields to KVCacheConfig in sim/config.go)
- Add CLI flags in cmd/root.go for new parameters (e.g., --kv-nvme-blocks) and wire them into the KVCacheConfig sub-config
- Aggregate metrics: combine hit/miss/thrashing counters from all tiers; see TieredKVCache.CacheHitRate() for the 2-tier pattern
- Add behavioral tests in sim/kv/*_test.go
- Check-then-act allocation (no rollback): KVCacheState.AllocateKVBlocks uses a pre-check gate. It computes the total blocks needed (new blocks + cached blocks leaving the free list) and compares against countFreeBlocks() before any state mutation. If insufficient, it returns false immediately with zero side effects. Once the pre-check has passed, popFreeBlock() == nil is a panic (INV-4 violation, structurally unreachable in single-threaded DES). This mirrors vLLM's kv_cache_manager.py:334-336 universal pre-check. If your tier adds mutations before delegating to gpu.AllocateKVBlocks(), ensure the inner pre-check sees the updated FreeBlockCnt (e.g., commitCachedBlocks calls removeFromFreeList which decrements FreeBlockCnt before the inner call).
- GetCachedBlocks is a pure query: it returns cached block IDs without side effects. CacheHits are counted by AllocateKVBlocks when cached blocks are committed to an allocation. The pre-check accounts for cached blocks with !InUse (on the free list) via the cachedFromFreeList budget, mirroring vLLM's num_evictable_blocks.
Examples:
- See TieredKVCache in sim/kv/tiered.go for 2-tier GPU+CPU composition
- See KVCacheState in sim/kv/cache.go for single-tier baseline (also implements KVStore)
- See docs/plans/archive/pr12-architectural-predesign.md for the design decisions behind the tiered architecture
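The check-then-act allocation described above can be illustrated with a toy block pool. This is a sketch of the pattern only, assuming a hypothetical pool type with one in-use flag per block; it is not KVCacheState, it ignores reference counting and hashing, and it assumes the cached block IDs are distinct.

```go
package main

import "fmt"

// pool is a hypothetical, much-simplified stand-in for a single-tier block store.
type pool struct {
	inUse     []bool // inUse[i] reports whether block i is held by an allocation
	freeCount int
}

func newPool(n int) *pool { return &pool{inUse: make([]bool, n), freeCount: n} }

// popFree claims the lowest-numbered free block, or returns -1 if none is left.
func (p *pool) popFree() int {
	for id, used := range p.inUse {
		if !used {
			p.inUse[id] = true
			p.freeCount--
			return id
		}
	}
	return -1
}

// allocate reserves newBlocks fresh blocks plus the given cached blocks
// (assumed distinct). It checks the whole budget first and mutates only once
// success is guaranteed, so a false return has no side effects.
func (p *pool) allocate(newBlocks int, cached []int) bool {
	cachedFromFreeList := 0
	for _, id := range cached {
		if !p.inUse[id] {
			cachedFromFreeList++ // this block leaves the free list when committed
		}
	}
	if newBlocks+cachedFromFreeList > p.freeCount {
		return false // pre-check failed: nothing has been touched
	}
	for _, id := range cached {
		if !p.inUse[id] {
			p.inUse[id] = true
			p.freeCount--
		}
	}
	for i := 0; i < newBlocks; i++ {
		if p.popFree() < 0 {
			panic("free list exhausted after a passing pre-check")
		}
	}
	return true
}

func main() {
	p := newPool(4)
	fmt.Println(p.allocate(4, []int{0}), p.freeCount) // false 4
	fmt.Println(p.allocate(3, []int{0}), p.freeCount) // true 0
}
```

Because the budget counts cached blocks that still sit on the free list, the first call fails even though four blocks are free: committing block 0 would leave only three for the four new blocks requested.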
Adding New Trace Record Types
To add a new trace record type (e.g., ScaleRecord for autoscaling events):
- Define the record struct in sim/trace/record.go (pure data, no sim/ dependency)
- Add a slice field to SimulationTrace in sim/trace/trace.go (e.g., Scales []ScaleRecord)
- Add a recording method to SimulationTrace (e.g., RecordScale(ScaleRecord))
- Hook recording into the cluster event pipeline in sim/cluster/cluster_event.go (guard with if cs.trace != nil for zero-overhead default)
- Update Summarize() in sim/trace/summary.go to aggregate the new record type
- Add behavioral tests in sim/trace/*_test.go
Examples:
- See AdmissionRecord in sim/trace/record.go for a simple record
- See RoutingRecord with CandidateScore for a record with nested counterfactual data
- See computeCounterfactual() in sim/cluster/counterfactual.go for derived computation that lives in sim/cluster/ (not sim/trace/) because it needs sim.RoutingSnapshot
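The steps for a new record type fit in a few dozen lines. The sketch below follows the ScaleRecord example from the recipe, but the record's fields, the TraceSummary shape, the Summarize signature, and the clusterSim type are all assumptions made for illustration.

```go
package main

import "fmt"

// ScaleRecord is pure data. Its fields are hypothetical.
type ScaleRecord struct {
	Clock int64
	Model string
	Delta int
}

// SimulationTrace here carries only the new slice field.
type SimulationTrace struct {
	Scales []ScaleRecord
}

// RecordScale appends one record to the trace.
func (t *SimulationTrace) RecordScale(r ScaleRecord) {
	t.Scales = append(t.Scales, r)
}

// TraceSummary and Summarize stand in for the aggregation step; the real
// summary type and signature may differ.
type TraceSummary struct {
	ScaleUps   int
	ScaleDowns int
}

func Summarize(t *SimulationTrace) TraceSummary {
	var s TraceSummary
	for _, r := range t.Scales {
		switch {
		case r.Delta > 0:
			s.ScaleUps++
		case r.Delta < 0:
			s.ScaleDowns++
		}
	}
	return s
}

// clusterSim is a hypothetical stand-in for the cluster simulator.
type clusterSim struct {
	trace *SimulationTrace // nil when tracing is disabled
}

// applyScale shows the hook: the nil guard keeps the untraced path free of
// any recording work.
func (cs *clusterSim) applyScale(clock int64, model string, delta int) {
	if cs.trace != nil {
		cs.trace.RecordScale(ScaleRecord{Clock: clock, Model: model, Delta: delta})
	}
}

func main() {
	cs := &clusterSim{trace: &SimulationTrace{}}
	cs.applyScale(100, "model-a", 2)
	cs.applyScale(200, "model-a", -1)
	fmt.Println(Summarize(cs.trace)) // {1 1}
}
```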
Adding Fields to TraceV2 Format
To add a new field to TraceV2 CSV output (e.g., observability metadata like vllm_priority):
- Add field to TraceRecord in sim/workload/tracev2.go (place logically near related fields)
- Add field to RequestRecord in cmd/observe.go (if captured during observation)
- Capture the value in RealClient.Send() in cmd/observe.go (set field on RequestRecord)
- Wire through Recorder in cmd/observe.go: Update RecordRequest() to copy field from RequestRecord to TraceRecord
- Update ExportTraceV2 in sim/workload/tracev2.go:
  - For optional columns: scan records to determine if column is needed, conditionally add to header, conditionally write values
  - For always-present columns: add to traceV2Columns slice and write in row construction
- Update LoadTraceV2/parseTraceRecord in sim/workload/tracev2.go:
  - For optional columns: detect column presence from CSV header, apply offset to subsequent column indices
  - For always-present columns: parse at fixed index, update all downstream indices
- Enforce simulation isolation (if observability-only): Verify LoadTraceV2Requests and LoadTraceV2SessionBlueprints do NOT read the field into sim.Request; add tests for this
- Add behavioral tests covering round-trip, conditional logic, and simulation isolation
Column position guidelines:
- Observability metadata: place near related fields (e.g., vllm_priority after slo_class)
- Timing data: group with other timestamps
- Optional columns: prefer conditional inclusion over always-writing zero values
Examples:
- See PR #1220 (VLLMPriority) — optional column, simulation-isolated, conditional on SLOClass presence
- See FinishReason/ErrorMessage fields — always-present optional strings (empty when not set)
- See ServerInputTokens field — observability metadata that differs from InputTokens
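The optional-column handling can be sketched with the standard encoding/csv package. The four-field traceRecord, the column names other than vllm_priority, and the rule "write the column if any record has a non-zero priority" are assumptions for illustration; the real TraceV2 schema and its inclusion condition live in sim/workload/tracev2.go.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// traceRecord is a hypothetical, cut-down stand-in for TraceRecord.
type traceRecord struct {
	RequestID    string
	SLOClass     string
	VLLMPriority int // optional column, placed after slo_class
	ArrivalUs    int64
}

// exportTrace writes the optional column only when some record needs it.
func exportTrace(recs []traceRecord) (string, error) {
	hasPriority := false
	for _, r := range recs {
		if r.VLLMPriority != 0 {
			hasPriority = true
			break
		}
	}
	header := []string{"request_id", "slo_class"}
	if hasPriority {
		header = append(header, "vllm_priority")
	}
	header = append(header, "arrival_us")

	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(header); err != nil {
		return "", err
	}
	for _, r := range recs {
		row := []string{r.RequestID, r.SLOClass}
		if hasPriority {
			row = append(row, strconv.Itoa(r.VLLMPriority))
		}
		row = append(row, strconv.FormatInt(r.ArrivalUs, 10))
		if err := w.Write(row); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

// loadTrace detects the optional column from the header and shifts the
// index of every later column by one when it is present.
func loadTrace(data string) ([]traceRecord, error) {
	rows, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return nil, err
	}
	if len(rows) == 0 {
		return nil, fmt.Errorf("trace has no header row")
	}
	offset := 0
	if len(rows[0]) > 2 && rows[0][2] == "vllm_priority" {
		offset = 1
	}
	if len(rows[0]) != 3+offset {
		return nil, fmt.Errorf("unexpected column count %d", len(rows[0]))
	}
	recs := make([]traceRecord, 0, len(rows)-1)
	for _, row := range rows[1:] {
		rec := traceRecord{RequestID: row[0], SLOClass: row[1]}
		if offset == 1 {
			if rec.VLLMPriority, err = strconv.Atoi(row[2]); err != nil {
				return nil, err
			}
		}
		if rec.ArrivalUs, err = strconv.ParseInt(row[2+offset], 10, 64); err != nil {
			return nil, err
		}
		recs = append(recs, rec)
	}
	return recs, nil
}

func main() {
	out, err := exportTrace([]traceRecord{{RequestID: "r1", SLOClass: "batch", VLLMPriority: 2, ArrivalUs: 10}})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

A round-trip test (export, load, compare) over one trace with the column and one without covers both branches of the conditional logic.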
Adding New Latency Model Backends
To add a new latency estimation backend (e.g., SGLang RadixAttention, TensorRT-LLM, neural surrogate):
- Implement the LatencyModel interface in sim/latency/latency.go (or a new file in sim/latency/ for complex models), which has 4 methods:
  - StepTime(batch []*Request) int64: estimate batch step duration from request states
  - QueueingTime(req *Request) int64: estimate arrival-to-queue delay
  - OutputTokenProcessingTime() int64: per-token post-processing overhead
  - PostDecodeFixedOverhead() int64: fixed per-request completion overhead (return 0 if not applicable)
  - All float64 → int64 conversions MUST use clampToInt64(v) (defined in sim/latency/latency.go). A direct int64(v) conversion of a float64 that is NaN or out of range does not panic in Go, but its result is implementation-dependent and can differ across platforms. clampToInt64 handles NaN and positive overflow correctly.
- Register the backend name in sim/bundle.go: add "your-backend": true to validLatencyBackends map.
- Register in NewLatencyModel factory in sim/latency/latency.go: add a case branch in the switch hw.Backend block. The backend string (e.g., "trained-physics") is set by the --latency-model CLI flag and stored in ModelHardwareConfig.Backend. The factory signature is NewLatencyModel(LatencyCoeffs, ModelHardwareConfig).
- Add CLI wiring (if needed) in cmd/root.go: add a loading block for your backend's coefficients from defaults.yaml. If your backend needs a custom defaults section, add a struct to cmd/default_config.go.
- Add behavioral tests in sim/latency/: monotonicity (more tokens → longer step time), positive output, boundary cases (empty batch)
- Extension friction: 3-5 touch points (implementation + bundle map + factory branch; optionally CLI wiring + defaults struct)
Examples:
- See RooflineLatencyModel in sim/latency/latency.go for a simple stateless analytical model (FLOPs/bandwidth roofline)
- See TrainedPhysicsModel in sim/latency/trained_physics_model.go for a physics-informed model with roofline basis functions, learned corrections, and MoE-aware overhead modeling
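A minimal backend that satisfies the four-method interface might look like the sketch below. The interface methods and the factory's parameter types are taken from the recipe; the rest is assumed: the NewTokens field, the LatencyCoeffs fields, the "linear-sketch" backend name, the factory's return type, and this file's own clampToInt64, which is written from scratch here and maps NaN to 0 as an arbitrary choice (the real helper in sim/latency/latency.go may behave differently).

```go
package main

import (
	"fmt"
	"math"
)

// Request is a stand-in; NewTokens is a hypothetical field holding the
// tokens a request contributes to the current step.
type Request struct{ NewTokens int }

// Stand-ins for the factory inputs; the fields are assumptions.
type LatencyCoeffs struct{ BaseUs, PerTokenUs float64 }
type ModelHardwareConfig struct{ Backend string }

type LatencyModel interface {
	StepTime(batch []*Request) int64
	QueueingTime(req *Request) int64
	OutputTokenProcessingTime() int64
	PostDecodeFixedOverhead() int64
}

// clampToInt64 is this sketch's own saturating conversion.
func clampToInt64(v float64) int64 {
	switch {
	case math.IsNaN(v):
		return 0 // assumption: NaN maps to 0
	case v >= float64(math.MaxInt64): // float64(MaxInt64) rounds up to 2^63
		return math.MaxInt64
	case v <= float64(math.MinInt64):
		return math.MinInt64
	}
	return int64(v)
}

// linearModel is a hypothetical backend: step time = base + perToken * tokens.
type linearModel struct{ baseUs, perTokenUs float64 }

func (m linearModel) StepTime(batch []*Request) int64 {
	tokens := 0
	for _, r := range batch {
		tokens += r.NewTokens
	}
	return clampToInt64(m.baseUs + m.perTokenUs*float64(tokens))
}

func (m linearModel) QueueingTime(*Request) int64      { return 0 }
func (m linearModel) OutputTokenProcessingTime() int64 { return 0 }
func (m linearModel) PostDecodeFixedOverhead() int64   { return 0 }

// NewLatencyModel shows the factory branch; the error return is an assumption.
func NewLatencyModel(c LatencyCoeffs, hw ModelHardwareConfig) (LatencyModel, error) {
	switch hw.Backend {
	case "linear-sketch":
		return linearModel{baseUs: c.BaseUs, perTokenUs: c.PerTokenUs}, nil
	default:
		return nil, fmt.Errorf("unknown latency backend %q", hw.Backend)
	}
}

func main() {
	m, err := NewLatencyModel(
		LatencyCoeffs{BaseUs: 100, PerTokenUs: 2.5},
		ModelHardwareConfig{Backend: "linear-sketch"},
	)
	if err != nil {
		panic(err)
	}
	batch := []*Request{{NewTokens: 8}, {NewTokens: 4}}
	fmt.Println(m.StepTime(batch)) // 130
}
```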
Adding New Batch Formation Strategies
To add a new batch formation strategy (e.g., disaggregated prefill/decode, speculative decoding, continuous batching without preemption):
- Implement the BatchFormation interface in sim/batch_formation.go (or a new file for complex strategies), which has 1 method:
  - FormBatch(ctx BatchContext) BatchResult: compose the running batch for the next step
  - The implementation receives BatchContext with: RunningBatch, WaitQ, KVCache, token budget, batch size limit, chunked prefill threshold, MaxModelLen (0 = unlimited; implementations should clamp token scheduling to maxModelLen-1-ProgressIndex when > 0), simulation time, step count, and ComputedTokens map
  - The implementation MUST update ctx.ComputedTokens[req.ID] for each request that receives new tokens (Phase 2 of Step() reads this map to advance ProgressIndex)
  - The implementation may mutate WaitQ (dequeue/prepend) and KVCache (allocate/release) during batch formation
  - The implementation MUST NOT schedule events or record metrics; it returns decisions in BatchResult, and the Simulator applies them
- Register in NewBatchFormation factory in sim/batch_formation.go: add a selection branch. The factory signature is NewBatchFormation(preemptionPolicy string). For a new batch formation strategy (not just a preemption variant), add a BatchFormation string field to PolicyConfig and a selection branch in NewBatchFormation
- Add behavioral tests: token budget enforcement, batch size limits, KV conservation, preemption behavior (if applicable), FCFS ordering
- Extension friction: 2 touch points (implementation + factory registration)
Note: Currently only VLLMBatchFormation exists (with configurable preemption via --preemption-policy fcfs|priority). Adding a second batch formation strategy will also require: (a) a BatchFormation string field in PolicyConfig or BatchConfig (in sim/config.go), (b) a CLI flag in cmd/root.go, (c) validation in sim/bundle.go, (d) selection logic in NewBatchFormation. For adding a new preemption variant (not a new strategy), add a constant to batch_formation.go, a case to the switch in preemptForTokens, and an entry in validPreemptionPolicies in bundle.go.
Examples:
- See VLLMBatchFormation in sim/batch_formation.go for the vLLM FCFS + chunked-prefill + preemption strategy
- See preemptForTokens for the KV allocation + eviction loop pattern
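As a rough illustration of the FormBatch contract, the sketch below implements a hypothetical FCFS strategy without preemption. The types are small stand-ins carrying only a subset of the fields listed above (no KV cache, no MaxModelLen), WaitQ is modelled as a pointer to a slice, and the token budget is assumed to be non-negative. The real BatchContext and BatchResult are richer.

```go
package main

import "fmt"

// Stand-in types: a small subset of what BatchContext is described as carrying.
type Request struct {
	ID            string
	InputTokens   int
	ProgressIndex int
}

type BatchContext struct {
	RunningBatch   []*Request
	WaitQ          *[]*Request // pointer so the strategy can dequeue
	TokenBudget    int         // assumed non-negative
	MaxBatchSize   int
	ComputedTokens map[string]int
}

type BatchResult struct {
	Batch          []*Request
	NewlyScheduled []*Request
}

type BatchFormation interface {
	FormBatch(ctx BatchContext) BatchResult
}

// fcfsNoPreemption is a hypothetical strategy: running requests keep their
// slots, then waiting requests are admitted in arrival order.
type fcfsNoPreemption struct{}

// tokensWanted is the rest of the prompt while prefilling, else one decode token.
func tokensWanted(r *Request) int {
	if rem := r.InputTokens - r.ProgressIndex; rem > 0 {
		return rem
	}
	return 1
}

func (fcfsNoPreemption) FormBatch(ctx BatchContext) BatchResult {
	budget := ctx.TokenBudget
	var res BatchResult
	grant := func(r *Request) {
		n := tokensWanted(r)
		if n > budget {
			n = budget // chunk the prefill to fit the remaining budget
		}
		ctx.ComputedTokens[r.ID] = n // 0 means no progress this step
		budget -= n
	}
	for _, r := range ctx.RunningBatch {
		grant(r)
		res.Batch = append(res.Batch, r)
	}
	q := *ctx.WaitQ
	for len(q) > 0 && budget > 0 && len(res.Batch) < ctx.MaxBatchSize {
		r := q[0]
		q = q[1:]
		grant(r)
		res.Batch = append(res.Batch, r)
		res.NewlyScheduled = append(res.NewlyScheduled, r)
	}
	*ctx.WaitQ = q // requests that did not fit stay queued, in order
	return res
}

func main() {
	waitQ := []*Request{{ID: "r2", InputTokens: 20}, {ID: "r3", InputTokens: 5}}
	ctx := BatchContext{
		RunningBatch:   []*Request{{ID: "r1", InputTokens: 4, ProgressIndex: 4}},
		WaitQ:          &waitQ,
		TokenBudget:    10,
		MaxBatchSize:   8,
		ComputedTokens: map[string]int{},
	}
	var bf BatchFormation = fcfsNoPreemption{}
	res := bf.FormBatch(ctx)
	fmt.Println(len(res.Batch), ctx.ComputedTokens["r1"], ctx.ComputedTokens["r2"], len(waitQ)) // 2 1 9 1
}
```

The strategy only returns decisions and updates the map and queue it was handed; it does not schedule events or record metrics, matching the division of labour the recipe requires.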
Adding New Quantization Formats
To add support for a new quantization format (e.g., GGUF, HQQ, Marlin):
- Add quantization_config parsing in sim/latency/config.go inside ParseHuggingFaceConfig(). The three-tier detection order:
  - Tier 1, quantization_config: Add a new else if branch after the existing compressed-tensors case (~line 240). Extract the weight bit-width from the format's config structure and set weightBytesPerParam = bits / 8.0. Use case-insensitive matching for quant_method (strings.EqualFold).
  - Tier 2, model name conventions: If the format has recognizable naming patterns (like w4a16 or FP8), add a regex to the compiled patterns (reWxAy, reFP8Name) or add a new pattern near line 283. Update InferWeightBytesFromModelName() accordingly.
  - Tier 3, torch_dtype fallback: No changes needed; this is automatic via BytesPerParam.
- Add tests in sim/latency/config_test.go:
  - A TestParseHuggingFaceConfig_YourFormat_* test with a synthetic config.json containing the new quantization_config structure
  - Verify WeightBytesPerParam is set correctly
  - If adding name-based detection, add cases to TestInferWeightBytesFromModelName
- No changes needed to roofline/KV capacity code: they already use EffectiveWeightBytesPerParam(), which automatically picks up the new format's weight precision.
- Extension friction: 1-2 touch points (config parsing + optional name regex)
Examples:
- See the GPTQ/AWQ bits extraction (~line 229) for formats with a top-level bits field
- See the compressed-tensors branch (~line 240) for formats with nested config structures
- See InferWeightBytesFromModelName() for regex-based name pattern detection
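The Tier 1 branch reduces to a case-insensitive match on quant_method followed by bits / 8.0. The sketch below shows that step in isolation using encoding/json. The "myformat" method name and its top-level bits field are hypothetical, and the function is a standalone illustration rather than a patch to ParseHuggingFaceConfig(); a return value of 0 stands for "not detected, fall through to the later tiers".

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// quantConfig covers only the fields this sketch reads. "myformat" and its
// top-level "bits" field are hypothetical.
type quantConfig struct {
	QuantMethod string  `json:"quant_method"`
	Bits        float64 `json:"bits"`
}

type hfConfig struct {
	TorchDtype         string       `json:"torch_dtype"`
	QuantizationConfig *quantConfig `json:"quantization_config"`
}

// weightBytesPerParam returns 0 when no known quantization is detected, so
// the caller can fall through to name-based and torch_dtype detection.
func weightBytesPerParam(raw []byte) (float64, error) {
	var cfg hfConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, err
	}
	qc := cfg.QuantizationConfig
	if qc == nil {
		return 0, nil
	}
	if strings.EqualFold(qc.QuantMethod, "myformat") {
		if qc.Bits <= 0 {
			return 0, fmt.Errorf("myformat: invalid bits value %v", qc.Bits)
		}
		return qc.Bits / 8.0, nil
	}
	return 0, nil
}

func main() {
	raw := []byte(`{"torch_dtype": "float16", "quantization_config": {"quant_method": "MyFormat", "bits": 4}}`)
	w, err := weightBytesPerParam(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(w) // 0.5
}
```

A synthetic config like the one in main doubles as the fixture for the TestParseHuggingFaceConfig_YourFormat_* test the recipe asks for.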
Adding a New Engine
To add a new autoscaler Engine implementation (e.g., a cost-minimizing MIP solver or an OpenEvolve-evolved policy):
- Implement the Engine interface in sim/cluster/engine.go (or a new file for complex engines), which has 1 method:
  - Optimize(results []AnalyzerResult, inventory GPUInventory) []ScaleDecision
  - Inputs: one AnalyzerResult per model (supply/demand signals + per-variant breakdown) and a GPUInventory snapshot (free GPU slots per variant, pre-subtracted for Loading/Active/Draining instances).
  - Output: at most one ScaleDecision per model per call. Delta > 0 = add replicas; Delta < 0 = remove replicas.
  - Must not read RouterState or ModelSignals directly; only AnalyzerResult.
- Reuse the shared helpers when appropriate (prc below is the per-replica capacity):
  - scaleUpN(requiredCapacity float64, vcs []VariantCapacity) int: exact replica count via ceil(requiredCapacity / prc); fallback to 1 when prc == 0.
  - scaleDownN(spareCapacity float64, vc VariantCapacity) int: exact replica count via floor(spareCapacity / prc), clamped to [1, ReplicaCount]; fallback to 1.
  - sortedByAscCost, sortedByDescCost: deterministic variant sort (R2: copy-and-sort to avoid mutating caller data).
  - Pass only the selected variant to scaleUpN (vcs[0:1], not the full slice) so perReplicaCapacityForScaleUp uses the chosen variant's own capacity, not a different active variant's.
- Wire the engine in sim/cluster/cluster.go: search for &UnlimitedEngine{} in the autoscaler pipeline construction and replace it with your engine (or add a config field + factory once multi-engine selection is implemented in the follow-up wiring PR).
- Add behavioral tests in sim/cluster/engine_test.go:
  - Scale-up: correct Delta, correct Variant, inventory check pass/fail.
  - Scale-down: correct Delta from spare capacity, ReplicaCount clamp.
  - Edge cases: zero RequiredCapacity, zero SpareCapacity, all variants inactive, TPDegree > 1.
  - Cross-model: multiple AnalyzerResult entries; verify decisions for each model independently.
- Extension friction: 2 touch points (implementation + wiring via &UnlimitedEngine{} in cluster.go). For research variants behind a config field, 4 touch points: implementation + AutoscalerConfig field + factory function + CLI flag.
Examples:
- See UnlimitedEngine in sim/cluster/engine.go for a simple inventory-ignoring engine
- See GreedyEngine in sim/cluster/engine.go for an inventory-respecting greedy engine with exact-N sizing
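The following sketch shows what a small scale-up-only engine could look like against the Optimize signature above. Every type here is a stand-in with assumed fields, GPUInventory is modelled as a plain map of free slots, each replica is assumed to occupy one slot (TPDegree = 1), and replicasFor re-implements the described ceil(requiredCapacity / prc) rule rather than calling the real scaleUpN.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Stand-in types with only the fields this sketch needs (all assumed).
type VariantCapacity struct {
	Variant            string
	CostPerReplica     float64
	PerReplicaCapacity float64
	ReplicaCount       int
}

type AnalyzerResult struct {
	Model            string
	RequiredCapacity float64
	SpareCapacity    float64
	Variants         []VariantCapacity
}

type GPUInventory map[string]int // variant -> free slots (assumed shape)

type ScaleDecision struct {
	Model   string
	Variant string
	Delta   int
}

type Engine interface {
	Optimize(results []AnalyzerResult, inventory GPUInventory) []ScaleDecision
}

// replicasFor applies ceil(required / prc), falling back to 1 when prc is 0.
func replicasFor(required, prc float64) int {
	if prc <= 0 {
		return 1
	}
	return int(math.Ceil(required / prc))
}

// cheapestFirst is a hypothetical scale-up-only engine: for each model that
// needs capacity, pick the cheapest variant whose free slots cover the
// exact replica count.
type cheapestFirst struct{}

func (cheapestFirst) Optimize(results []AnalyzerResult, inv GPUInventory) []ScaleDecision {
	free := make(GPUInventory, len(inv)) // local copy so the caller's snapshot is untouched
	for k, v := range inv {
		free[k] = v
	}
	var out []ScaleDecision
	for _, res := range results {
		if res.RequiredCapacity <= 0 {
			continue
		}
		vcs := append([]VariantCapacity(nil), res.Variants...) // copy before sorting
		sort.SliceStable(vcs, func(i, j int) bool {
			return vcs[i].CostPerReplica < vcs[j].CostPerReplica
		})
		for _, vc := range vcs {
			n := replicasFor(res.RequiredCapacity, vc.PerReplicaCapacity)
			if free[vc.Variant] >= n {
				free[vc.Variant] -= n
				out = append(out, ScaleDecision{Model: res.Model, Variant: vc.Variant, Delta: n})
				break // at most one decision per model
			}
		}
	}
	return out
}

func main() {
	var e Engine = cheapestFirst{}
	decisions := e.Optimize(
		[]AnalyzerResult{{
			Model:            "m",
			RequiredCapacity: 2.5,
			Variants: []VariantCapacity{
				{Variant: "big", CostPerReplica: 3, PerReplicaCapacity: 2},
				{Variant: "cheap", CostPerReplica: 1, PerReplicaCapacity: 1},
			},
		}},
		GPUInventory{"cheap": 2, "big": 4},
	)
	fmt.Println(decisions) // [{m big 2}]
}
```

In the example the cheap variant would need three replicas but only two slots are free, so the engine falls through to the larger variant, where two replicas cover the required capacity.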
Adding New Per-Request Metric Fields
To add a new field to per-request JSON output (appears in --metrics-path output):
- Add field to Request in sim/request.go (runtime state, zero-value safe). When constructing Request structs, use RequestState typed constants (StateQueued, StateRunning, StateCompleted), never bare strings.
- Add field to RequestMetrics in sim/metrics_utils.go (JSON output struct, use omitempty for backward compatibility)
- Update NewRequestMetrics() constructor in sim/metrics_utils.go to propagate the new field from Request to RequestMetrics
- Set the field at the appropriate event (e.g., RoutingDecisionEvent for cluster-level, or completion for computed metrics)
- Add behavioral tests covering multi-instance, single-instance, and standalone boundaries
Examples:
- See HandledBy (#181) — set by RoutingDecisionEvent, zero-value when used outside cluster pipeline (suppressed from JSON via omitempty)
- See SLOClass/TenantID (PR10) — set during workload generation, propagated at injection
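The effect of omitempty on backward compatibility is easy to see in a small example. The sketch below mirrors the HandledBy case, with assumptions throughout: the two structs are cut down to two fields, the JSON key names are invented, and the NewRequestMetrics signature is a guess. toJSON is a helper added only for the demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a stand-in for the runtime struct.
type Request struct {
	ID        string
	HandledBy string // stays empty when the request never passes through routing
}

// RequestMetrics is the JSON output struct; the key names are hypothetical.
type RequestMetrics struct {
	ID        string `json:"requestID"`
	HandledBy string `json:"handled_by,omitempty"`
}

// NewRequestMetrics propagates the new field from Request to RequestMetrics.
func NewRequestMetrics(req *Request) RequestMetrics {
	return RequestMetrics{ID: req.ID, HandledBy: req.HandledBy}
}

// toJSON is a demonstration helper.
func toJSON(m RequestMetrics) string {
	b, err := json.Marshal(m)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	standalone := NewRequestMetrics(&Request{ID: "r1"})
	routed := NewRequestMetrics(&Request{ID: "r2", HandledBy: "instance_0"})
	fmt.Println(toJSON(standalone)) // {"requestID":"r1"}
	fmt.Println(toJSON(routed))     // {"requestID":"r2","handled_by":"instance_0"}
}
```

Because the zero value is suppressed, output produced outside the cluster pipeline is byte-for-byte identical to what it was before the field existed, which is the property the standalone-boundary test should pin down.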