BLIS Extension Recipes¶
Step-by-step guides for extending BLIS. Each recipe lists the exact files to touch, the order, and examples to follow.
Adding New Policy Templates¶
To add a new policy template (e.g., a new routing algorithm):
- Implement the interface in the corresponding file:
  - `AdmissionPolicy` → `sim/admission.go` (cluster-level: receives `*RouterState` with snapshots + clock)
  - `RoutingPolicy` → `sim/routing.go` (cluster-level: receives `*RouterState` with snapshots + clock)
  - `PriorityPolicy` → `sim/priority.go` (instance-level: receives `req` + `clock` only)
  - `InstanceScheduler` → `sim/scheduler.go` (instance-level: receives `requests` + `clock` only)

  Note: `RouterState` is a bridge type in `sim/` to avoid import cycles — see `sim/router_state.go`.
- Register in two places (both required):
  - Add the policy name to the valid-names map in `sim/bundle.go` (e.g., `validRoutingPolicies`) and the corresponding `IsValid*` function
  - Add a `case` to the factory function in the same policy file (e.g., `NewRoutingPolicy` in `sim/routing.go`)

  CLI error messages auto-derive from `ValidAdmissionPolicyNames()` etc. — no manual update needed.
- Add tests following BDD naming (`TestMyPolicy_Scenario_Behavior`); test observable behavior, not internal structure:
  - Include an empty-snapshots panic test for routing policies (defensive programming convention)
  - Use `&RouterState{Snapshots: snapshots, Clock: clock}` in test setup
- Update documentation: CLAUDE.md file organization, README policy lists
Important: For load-based routing, use snap.EffectiveLoad() — never compute QueueDepth + BatchSize + InFlightRequests inline. This ensures all routing policies use the same formula.
Examples:
- See RejectAll in sim/admission.go for a simple admission template (constant return)
- See newPrefixAffinityScorer in sim/routing_prefix_scorer.go for a stateful scorer with observer-based state updates (the prefix-affinity scorer uses a router-side PrefixCacheIndex to track per-instance block hash history)
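Put together, a routing policy template can be sketched as follows. This is a minimal, self-contained illustration: the `RoutingSnapshot` and `RouterState` structs below are trimmed stand-ins for the real `sim/` types, and `LeastLoadedPolicy` is a hypothetical policy name (real routing policies also receive the request being routed).

```go
package main

import "fmt"

// Trimmed stand-in: the real RoutingSnapshot in sim/ carries more fields.
type RoutingSnapshot struct {
	InstanceID       string
	QueueDepth       int
	BatchSize        int
	InFlightRequests int
}

// EffectiveLoad mirrors the snap.EffectiveLoad() convention: one shared
// formula that policies must use instead of recomputing the sum inline.
func (s RoutingSnapshot) EffectiveLoad() int {
	return s.QueueDepth + s.BatchSize + s.InFlightRequests
}

// RouterState stand-in for the bridge type in sim/router_state.go.
type RouterState struct {
	Snapshots []RoutingSnapshot
	Clock     int64
}

// LeastLoadedPolicy is a hypothetical policy: pick the instance with the
// lowest EffectiveLoad.
type LeastLoadedPolicy struct{}

func (p LeastLoadedPolicy) Route(rs *RouterState) string {
	if len(rs.Snapshots) == 0 {
		// Defensive-programming convention from the recipe: routing
		// policies panic on empty snapshots (and tests cover this).
		panic("routing: empty snapshots")
	}
	best := rs.Snapshots[0]
	for _, s := range rs.Snapshots[1:] {
		if s.EffectiveLoad() < best.EffectiveLoad() {
			best = s
		}
	}
	return best.InstanceID
}

func main() {
	rs := &RouterState{Snapshots: []RoutingSnapshot{
		{InstanceID: "i0", QueueDepth: 4, BatchSize: 8},
		{InstanceID: "i1", QueueDepth: 1, BatchSize: 2, InFlightRequests: 1},
	}, Clock: 100}
	fmt.Println(LeastLoadedPolicy{}.Route(rs)) // i1
}
```

Test setup follows the `&RouterState{Snapshots: snapshots, Clock: clock}` pattern shown above.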
Adding New Scorers (Weighted Routing)¶
To add a new scoring dimension for the weighted routing policy (e.g., predicted-latency):
- Implement the scorer function in `sim/routing_scorers.go` (stateless) or a new file (stateful) — a `scorerFunc` that takes `(*Request, []RoutingSnapshot)` and returns `map[string]float64` with scores in [0,1] per instance. Stateful scorers also return an `observerFunc` called after each routing decision.
- Register the scorer in `sim/routing_scorers.go`: add to the `validScorerNames` map and the `newScorerWithObserver` factory switch.
- Add behavioral tests — monotonicity, boundary values, INV-1/INV-2 conformance.
- Extension friction: 2 touch points (implementation + registration in `newScorerWithObserver`). Stateful scorers (like prefix-affinity) may live in a separate file (e.g., `sim/routing_prefix_scorer.go`), but the registration point is the same `newScorerWithObserver` switch in `sim/routing_scorers.go`.
- Stateful scorers return an `observerFunc` alongside the `scorerFunc` from `newScorerWithObserver`. The `observerFunc` signature is `func(req *Request, targetInstance string)`; it is called after each routing decision to update scorer state. The scorer and observer share state via closure.
Examples:
- See scoreLoadBalance in sim/routing_scorers.go for a simple stateless scorer
- See scoreQueueDepth for a scorer with edge case handling (uniform load)
- See newPrefixAffinityScorer in sim/routing_prefix_scorer.go for a stateful scorer with observer and router-side cache
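A stateful scorer with its observer can be sketched like this. `Request` and `RoutingSnapshot` are simplified stand-ins, and `newAffinityScorer` is a hypothetical scorer; only the `scorerFunc`/`observerFunc` shapes and the shared-closure-state pattern follow the recipe above.

```go
package main

import "fmt"

// Simplified stand-ins for the real sim types.
type Request struct{ ID, Prefix string }

type RoutingSnapshot struct {
	InstanceID string
	QueueDepth int
}

// Signatures from the recipe: scores in [0,1] per instance; the observer is
// invoked after each routing decision.
type scorerFunc func(req *Request, snaps []RoutingSnapshot) map[string]float64
type observerFunc func(req *Request, targetInstance string)

// newAffinityScorer is hypothetical: score 1.0 for the instance that last
// served this request's prefix, 0.0 elsewhere. Scorer and observer share the
// lastServed map via closure.
func newAffinityScorer() (scorerFunc, observerFunc) {
	lastServed := map[string]string{} // prefix -> instance
	scorer := func(req *Request, snaps []RoutingSnapshot) map[string]float64 {
		scores := make(map[string]float64, len(snaps))
		for _, s := range snaps {
			if lastServed[req.Prefix] == s.InstanceID {
				scores[s.InstanceID] = 1.0
			} else {
				scores[s.InstanceID] = 0.0
			}
		}
		return scores
	}
	observer := func(req *Request, target string) {
		lastServed[req.Prefix] = target
	}
	return scorer, observer
}

func main() {
	score, observe := newAffinityScorer()
	snaps := []RoutingSnapshot{{InstanceID: "i0"}, {InstanceID: "i1"}}
	req := &Request{ID: "r1", Prefix: "sys-prompt-A"}
	observe(req, "i1")                   // a routing decision is recorded
	fmt.Println(score(req, snaps)["i1"]) // 1
}
```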
Extending KV Cache Tiers¶
To add a new KV tier (e.g., NVMe offloading for 3-tier GPU+CPU+NVMe):
- Implement the `KVStore` interface in `sim/kv/` (11 methods: allocate, get cached, release, capacity queries, metrics, `SetClock`, `ConsumePendingTransferLatency`).
- Compose existing tiers — e.g., wrap `TieredKVCache` (GPU+CPU) with NVMe logic, following the same delegation pattern.
- Update the `NewKVStore` factory in `sim/kv/register.go` to instantiate your tier based on `KVCacheConfig` fields (add new fields to `KVCacheConfig` in `sim/config.go`).
- Add CLI flags in `cmd/root.go` for new parameters (e.g., `--kv-nvme-blocks`) and wire them into the `KVCacheConfig` sub-config.
- Aggregate metrics — combine hit/miss/thrashing counters from all tiers; see `TieredKVCache.CacheHitRate()` for the 2-tier pattern.
- Add behavioral tests in `sim/kv/*_test.go`.
- Preserve rollback semantics — `KVCacheState.AllocateKVBlocks` is transactional: on mid-loop failure, `rollbackAllocation()` undoes all mutations (UsedBlockCnt, CacheMisses, CacheHits, RefCount, InUse, free list, HashToBlock, RequestMap). If your tier adds mutations beyond what delegation to `gpu.AllocateKVBlocks()` handles, you must roll those back too. See the `cachedBlockMutation` and `newBlockMutation` types in `sim/kv/cache.go`. `GetCachedBlocks` is a pure query — it returns cached block IDs without side effects. `CacheHits` are counted by `AllocateKVBlocks` when cached blocks are committed to an allocation (and rolled back on failure). This was fixed in the Phase 3 hardening PR; the previous implementation incremented `CacheHits` in `GetCachedBlocks`, causing double-counting in tiered mode.
Examples:
- See TieredKVCache in sim/kv/tiered.go for 2-tier GPU+CPU composition
- See KVCacheState in sim/kv/cache.go for single-tier baseline (also implements KVStore)
- See docs/plans/archive/pr12-architectural-predesign.md for the design decisions behind the tiered architecture
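The delegation pattern can be sketched as follows. The `KVStore` interface here is trimmed to three methods (the real one has 11), `gpuTier` is a toy inner store, and `nvmeTier` is a hypothetical third tier; only the try-inner-first, spill, and metric-aggregation structure mirrors `TieredKVCache`.

```go
package main

import "fmt"

// Trimmed stand-in for the 11-method sim/kv interface.
type KVStore interface {
	AllocateKVBlocks(reqID string, n int) bool
	CacheHits() int
	CacheMisses() int
}

// gpuTier is a toy inner store standing in for the GPU (or GPU+CPU) tier.
type gpuTier struct{ free, hits, misses int }

func (g *gpuTier) AllocateKVBlocks(reqID string, n int) bool {
	if n > g.free {
		return false // nothing mutated on failure: no rollback needed here
	}
	g.free -= n
	g.misses += n // simplification: treat all blocks as cold
	return true
}
func (g *gpuTier) CacheHits() int   { return g.hits }
func (g *gpuTier) CacheMisses() int { return g.misses }

// nvmeTier wraps an inner store, following the delegation pattern:
// try the inner tier first, spill to NVMe on failure.
type nvmeTier struct {
	inner    KVStore
	nvmeFree int
	nvmeHits int
	nvmeMiss int
}

func (t *nvmeTier) AllocateKVBlocks(reqID string, n int) bool {
	if t.inner.AllocateKVBlocks(reqID, n) {
		return true
	}
	if n > t.nvmeFree {
		return false
	}
	t.nvmeFree -= n
	t.nvmeMiss += n
	return true
}
func (t *nvmeTier) CacheHits() int   { return t.inner.CacheHits() + t.nvmeHits }
func (t *nvmeTier) CacheMisses() int { return t.inner.CacheMisses() + t.nvmeMiss }

// CacheHitRate aggregates across tiers, as the recipe requires.
func (t *nvmeTier) CacheHitRate() float64 {
	total := t.CacheHits() + t.CacheMisses()
	if total == 0 {
		return 0
	}
	return float64(t.CacheHits()) / float64(total)
}

func main() {
	g := &gpuTier{free: 8}
	tier := &nvmeTier{inner: g, nvmeFree: 64}
	tier.AllocateKVBlocks("req-1", 6) // fits on the inner tier
	tier.AllocateKVBlocks("req-2", 6) // spills to NVMe
	fmt.Println(tier.CacheMisses()) // 12
}
```

Note that this sketch never mutates before it can fail, so it sidesteps the rollback issue; a real tier that mutates mid-allocation must undo its own mutations on failure, as described above.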
Adding New Trace Record Types¶
To add a new trace record type (e.g., ScaleRecord for autoscaling events):
- Define the record struct in `sim/trace/record.go` (pure data, no `sim/` dependency).
- Add a slice field to `SimulationTrace` in `sim/trace/trace.go` (e.g., `Scales []ScaleRecord`).
- Add a recording method to `SimulationTrace` (e.g., `RecordScale(ScaleRecord)`).
- Hook recording into the cluster event pipeline in `sim/cluster/cluster_event.go` (guard with `if cs.trace != nil` for zero-overhead default).
- Update `Summarize()` in `sim/trace/summary.go` to aggregate the new record type.
- Add behavioral tests in `sim/trace/*_test.go`.
Examples:
- See AdmissionRecord in sim/trace/record.go for a simple record
- See RoutingRecord with CandidateScore for a record with nested counterfactual data
- See computeCounterfactual() in sim/cluster/counterfactual.go for derived computation that lives in sim/cluster/ (not sim/trace/) because it needs sim.RoutingSnapshot
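The record, trace, and hook wiring can be sketched end to end. All types below are trimmed stand-ins for the real `sim/trace` and `sim/cluster` types; `ScaleRecord` and its fields are hypothetical, taken from the autoscaling example above.

```go
package main

import "fmt"

// ScaleRecord: pure data, no sim/ dependency (hypothetical fields).
type ScaleRecord struct {
	Time      int64
	Instances int
	Reason    string
}

// SimulationTrace stand-in with a slice field and a recording method,
// mirroring the pattern in sim/trace/trace.go.
type SimulationTrace struct {
	Scales []ScaleRecord
}

func (t *SimulationTrace) RecordScale(r ScaleRecord) {
	t.Scales = append(t.Scales, r)
}

// clusterState stand-in: recording is guarded with a nil check so the
// default (no-trace) path pays nothing, as in cluster_event.go.
type clusterState struct{ trace *SimulationTrace }

func (cs *clusterState) onScaleEvent(r ScaleRecord) {
	if cs.trace != nil {
		cs.trace.RecordScale(r)
	}
}

func main() {
	cs := &clusterState{trace: &SimulationTrace{}}
	cs.onScaleEvent(ScaleRecord{Time: 42, Instances: 3, Reason: "queue-depth"})
	(&clusterState{}).onScaleEvent(ScaleRecord{}) // nil trace: a no-op
	fmt.Println(len(cs.trace.Scales)) // 1
}
```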
Adding New Latency Model Backends¶
To add a new latency estimation backend (e.g., SGLang RadixAttention, TensorRT-LLM, neural surrogate):
- Implement the `LatencyModel` interface in `sim/latency/latency.go` (or a new file in `sim/latency/` for complex models) — 4 methods:
  - `StepTime(batch []*Request) int64` — estimate batch step duration from request states
  - `QueueingTime(req *Request) int64` — estimate arrival-to-queue delay
  - `OutputTokenProcessingTime() int64` — per-token post-processing overhead
  - `PostDecodeFixedOverhead() int64` — fixed per-request completion overhead (return 0 if not applicable)
- All `float64` → `int64` conversions MUST use `clampToInt64(v)` (defined in `sim/latency/latency.go`). Direct `int64(v)` casts on out-of-range float64 values yield implementation-dependent results in Go; `clampToInt64` handles NaN and positive overflow correctly.
- Register the backend name in `sim/bundle.go`: add `"your-backend": true` to the `validLatencyBackends` map.
- Register in the `NewLatencyModel` factory in `sim/latency/latency.go`: add a `case` branch in the `switch hw.Backend` block. The backend string (e.g., `"trained-roofline"`) is set by the `--latency-model` CLI flag and stored in `ModelHardwareConfig.Backend`. The factory signature is `NewLatencyModel(LatencyCoeffs, ModelHardwareConfig)`.
- Add CLI wiring (if needed) in `cmd/root.go`: add a loading block for your backend's coefficients from `defaults.yaml`. If your backend needs a custom defaults section, add a struct to `cmd/default_config.go`.
- Add behavioral tests in `sim/latency/` — monotonicity (more tokens → longer step time), positive output, boundary cases (empty batch).
- Extension friction: 3-5 touch points (implementation + bundle map + factory branch; optionally CLI wiring + defaults struct).
Examples:
- See BlackboxLatencyModel in sim/latency/latency.go for a simple stateless model (alpha/beta regression)
- See RooflineLatencyModel in sim/latency/latency.go for a model that uses hardware config (FLOPs/bandwidth)
- See CrossModelLatencyModel in sim/latency/crossmodel.go for a physics-informed model that derives step time from HuggingFace architecture features (MoE-aware)
- See TrainedRooflineLatencyModel in sim/latency/trained_roofline.go for a data-driven model with roofline basis functions × learned corrections (zero-allocation hot path, 7% MAPE)
Adding New Batch Formation Strategies¶
To add a new batch formation strategy (e.g., disaggregated prefill/decode, speculative decoding, continuous batching without preemption):
- Implement the `BatchFormation` interface in `sim/batch_formation.go` (or a new file for complex strategies) — 1 method: `FormBatch(ctx BatchContext) BatchResult` — compose the running batch for the next step.
- The implementation receives `BatchContext` with: RunningBatch, WaitQ, KVCache, token budget, batch size limit, chunked prefill threshold, MaxModelLen (0 = unlimited; implementations should clamp token scheduling to `maxModelLen-1-ProgressIndex` when > 0), simulation time, step count, and ComputedTokens map.
- The implementation MUST update `ctx.ComputedTokens[req.ID]` for each request that receives new tokens (Phase 2 of `Step()` reads this map to advance `ProgressIndex`).
- The implementation may mutate `WaitQ` (dequeue/prepend) and `KVCache` (allocate/release) during batch formation.
- The implementation MUST NOT schedule events or record metrics — return decisions in `BatchResult`; the Simulator applies them.
- Register in the `NewBatchFormation` factory in `sim/batch_formation.go`: add a selection branch. The factory signature is `NewBatchFormation()` — a future PR will add a strategy selection parameter (e.g., a string field in `PolicyConfig` or `BatchConfig`).
- Add behavioral tests — token budget enforcement, batch size limits, KV conservation, preemption behavior (if applicable), FCFS ordering.
- Extension friction: 2 touch points (implementation + factory registration).
Note: Currently only VLLMBatchFormation exists. Adding a second strategy will also require: (a) a BatchFormation string field in PolicyConfig or BatchConfig (in sim/config.go), (b) a CLI flag in cmd/root.go, (c) validation in sim/bundle.go, (d) selection logic in NewBatchFormation.
Examples:
- See VLLMBatchFormation in sim/batch_formation.go for the vLLM FCFS + chunked-prefill + preemption strategy
- See preemptForTokens for the KV allocation + eviction loop pattern
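A minimal strategy honoring the `FormBatch` contract can be sketched as follows. `BatchContext` and `BatchResult` are trimmed stand-ins (no KV cache, chunked prefill, or preemption here), and `fcfsFormation` is hypothetical; what matters is that it mutates `WaitQ`, enforces the token budget and batch size limit, and records every scheduled request in `ComputedTokens`.

```go
package main

import "fmt"

type Request struct {
	ID            string
	ProgressIndex int // tokens already computed
	Total         int // total tokens the request needs
}

// Trimmed stand-in: the real BatchContext also carries RunningBatch, KVCache,
// chunked prefill threshold, MaxModelLen, clock, and step count.
type BatchContext struct {
	WaitQ          []*Request
	TokenBudget    int
	BatchSizeLimit int
	ComputedTokens map[string]int
}

type BatchResult struct{ Batch []*Request }

// fcfsFormation admits waiting requests in FCFS order while the token budget
// and batch size limit hold. It returns decisions in BatchResult and never
// schedules events or records metrics, per the contract.
func fcfsFormation(ctx *BatchContext) BatchResult {
	var batch []*Request
	budget := ctx.TokenBudget
	for len(ctx.WaitQ) > 0 && len(batch) < ctx.BatchSizeLimit {
		req := ctx.WaitQ[0]
		need := req.Total - req.ProgressIndex
		if need > budget {
			break // budget exhausted; request stays queued
		}
		ctx.WaitQ = ctx.WaitQ[1:] // dequeue: mutating WaitQ is allowed
		budget -= need
		batch = append(batch, req)
		// MUST be updated for every request that receives tokens:
		// Phase 2 of Step() reads this to advance ProgressIndex.
		ctx.ComputedTokens[req.ID] = need
	}
	return BatchResult{Batch: batch}
}

func main() {
	ctx := &BatchContext{
		WaitQ:          []*Request{{ID: "a", Total: 80}, {ID: "b", Total: 60}, {ID: "c", Total: 10}},
		TokenBudget:    100,
		BatchSizeLimit: 8,
		ComputedTokens: map[string]int{},
	}
	res := fcfsFormation(ctx)
	fmt.Println(len(res.Batch), ctx.ComputedTokens["a"]) // 1 80
}
```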
Adding New Quantization Formats¶
To add support for a new quantization format (e.g., GGUF, HQQ, Marlin):
- Add `quantization_config` parsing in `sim/latency/config.go` inside `ParseHuggingFaceConfig()`. The three-tier detection order:
  - Tier 1 — `quantization_config`: add a new `else if` branch after the existing `compressed-tensors` case (~line 240). Extract the weight bit-width from the format's config structure and set `weightBytesPerParam = bits / 8.0`. Use case-insensitive matching for `quant_method` (`strings.EqualFold`).
  - Tier 2 — model name conventions: if the format has recognizable naming patterns (like `w4a16` or `FP8`), add a regex to the compiled patterns (`reWxAy`, `reFP8Name`) or add a new pattern near line 283, and update `InferWeightBytesFromModelName()` accordingly.
  - Tier 3 — `torch_dtype` fallback: no changes needed — this tier is automatic via `BytesPerParam`.
- Add tests in `sim/latency/config_test.go`:
  - A `TestParseHuggingFaceConfig_YourFormat_*` test with a synthetic `config.json` containing the new `quantization_config` structure
  - Verify `WeightBytesPerParam` is set correctly
  - If adding name-based detection, add cases to `TestInferWeightBytesFromModelName`
- No changes needed to roofline/KV capacity code — they already use `EffectiveWeightBytesPerParam()`, which automatically picks up the new format's weight precision.
- Extension friction: 1-2 touch points (config parsing + optional name regex)
Examples:
- See the GPTQ/AWQ bits extraction (~line 229) for formats with a top-level bits field
- See the compressed-tensors branch (~line 240) for formats with nested config structures
- See InferWeightBytesFromModelName() for regex-based name pattern detection
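A Tier-1 detection branch might look like the following, assuming a GPTQ/AWQ-style layout with top-level `quant_method` and `bits` fields (a real format may nest its bit-width differently). This is a self-contained sketch, not the actual `ParseHuggingFaceConfig()` code; `weightBytesPerParam` and `hfConfig` are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// hfConfig is a hypothetical minimal view of a HuggingFace config.json.
type hfConfig struct {
	TorchDtype         string                     `json:"torch_dtype"`
	QuantizationConfig map[string]json.RawMessage `json:"quantization_config"`
}

// weightBytesPerParam applies Tier 1 (quantization_config) and falls back to
// a simplified Tier 3 (torch_dtype). Tier 2 (name regexes) is omitted.
func weightBytesPerParam(raw []byte) float64 {
	var cfg hfConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 2.0 // bf16-style default for this sketch
	}
	if qc := cfg.QuantizationConfig; qc != nil {
		var method string
		json.Unmarshal(qc["quant_method"], &method)
		// Case-insensitive matching, per the recipe (strings.EqualFold).
		if strings.EqualFold(method, "my-format") {
			var bits float64
			if err := json.Unmarshal(qc["bits"], &bits); err == nil && bits > 0 {
				return bits / 8.0 // weightBytesPerParam = bits / 8.0
			}
		}
	}
	// Tier 3: torch_dtype fallback (simplified stand-in for BytesPerParam).
	if cfg.TorchDtype == "float32" {
		return 4.0
	}
	return 2.0
}

func main() {
	raw := []byte(`{"torch_dtype":"bfloat16","quantization_config":{"quant_method":"My-Format","bits":4}}`)
	fmt.Println(weightBytesPerParam(raw)) // 0.5
}
```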
Adding New Per-Request Metric Fields¶
To add a new field to per-request JSON output (appears in --metrics-path output):
- Add the field to `Request` in `sim/request.go` (runtime state, zero-value safe). When constructing `Request` structs, use `RequestState` typed constants (`StateQueued`, `StateRunning`, `StateCompleted`) — never bare strings.
- Add the field to `RequestMetrics` in `sim/metrics_utils.go` (JSON output struct; use `omitempty` for backward compatibility).
- Update the `NewRequestMetrics()` constructor in `sim/metrics_utils.go` to propagate the new field from `Request` to `RequestMetrics`.
- Set the field at the appropriate event (e.g., `RoutingDecisionEvent` for cluster-level, or completion for computed metrics).
- Add behavioral tests covering multi-instance, single-instance, and standalone boundaries.
Examples:
- See HandledBy (#181) — set by RoutingDecisionEvent, zero-value when used outside cluster pipeline (suppressed from JSON via omitempty)
- See SLOClass/TenantID (PR10) — set during workload generation, propagated at injection