Project Structure¶
The simulator uses a discrete-event architecture with a min-heap event queue.
inference-sim/
├── .claude/commands/ # Claude Code slash commands (speckit.specify, speckit.plan, speckit.tasks, speckit.implement, speckit.clarify, speckit.checklist, speckit.analyze, speckit.constitution, speckit.taskstoissues)
├── .specify/ # Speckit feature-development toolkit
│ ├── memory/constitution.md # BLIS project constitution (principles, invariants, rules)
│ ├── templates/ # Spec, plan, tasks, checklist, agent-file templates
│ └── scripts/bash/ # Bash scripts: create-new-feature.sh, setup-plan.sh, update-agent-context.sh, check-prerequisites.sh, common.sh
├── .github/workflows/ # CI configuration (build, lint, test)
├── main.go # CLI entry point (Cobra)
├── cmd/
│ ├── root.go # CLI commands and flags (--num-instances, --policy-config, --routing-scorers, --workload-spec, --trace-level, --fitness-weights, --kv-cpu-blocks, --kv-offload-threshold, --kv-transfer-bandwidth, --kv-transfer-base-latency, --snapshot-refresh-interval, --latency-model, --max-model-len, --trace-output)
│ ├── replay.go # `blis replay` command: replays TraceV2 file through DES; flags: --trace-header, --trace-data (required), all sim config flags shared via registerSimConfigFlags(); --results-path writes []workload.SimResult (integer request_id, ttft_us/e2e_us in µs); SimResult type lives in sim/workload/calibrate.go
│ ├── calibrate.go # `blis calibrate` command: compares real observed latencies (TraceV2 from blis observe) against sim predictions ([]SimResult JSON from blis replay --results-path); flags: --trace-header, --trace-data, --sim-results, --report (required), --warmup-requests (default: from header, sentinel -1), --network-rtt-us (default: from header, sentinel -1), --network-bandwidth-mbps; writes CalibrationReport JSON with MAPE/PearsonR/percentiles per metric
│ ├── observe.go # Real mode HTTP client (RealClient with functional options: WithAPIFormat for completions/chat, stream_options for streaming usage, finish_reason extraction, configurable max_tokens); Recorder for TraceV2 output
│ ├── observe_cmd.go # `blis observe` command: flags --server-url, --model, --api-format (completions/chat), --unconstrained-output, --rtt-ms, --workload-spec/--rate; prefix string generation (buildPrefixStrings with FNV-seeded vocabulary); dispatch orchestrator with session support
│ ├── convert.go # `blis convert` subcommands (servegen, preset, inference-perf)
│ ├── compose.go # `blis compose` for merging v2 specs
│ ├── hfconfig.go # HuggingFace config resolution chain (--latency-model auto-fetch, caching)
│ └── default_config.go # defaults.yaml loading (includes GetHFRepo for HF repo name mapping)
├── sim/ # Core single-instance simulator
│ ├── config.go # Module-scoped sub-config types (KVCacheConfig, BatchConfig, LatencyCoeffs, ModelHardwareConfig, PolicyConfig, WorkloadConfig) — composed into SimConfig via embedding (R16)
│ ├── doc.go # Package reading guide: start with request.go, event.go, simulator.go
│ ├── simulator.go # SimConfig struct (composed of embedded sub-configs + Horizon/Seed), NewSimulator(SimConfig) (*Simulator, error) constructor (validates MaxModelLen vs KV capacity), event loop (Run()), batch formation (delegated to BatchFormation interface), step execution with phased metric recording, EnqueueRequest (MaxModelLen + KV capacity guards), processCompletions (proactive MaxModelLen cap at maxModelLen-1 boundary), observation methods (QueueDepth(), BatchSize(), CurrentClock(), SimHorizon()). All workload generation external via InjectArrival().
│ ├── admission.go # AdmissionPolicy interface (accepts *RouterState), AlwaysAdmit, TokenBucket, RejectAll, NewAdmissionPolicy factory
│ ├── routing.go # RoutingPolicy interface (accepts *RouterState), RoutingSnapshot (with EffectiveLoad() for canonical load calculation), RoutingDecision (with Priority hint), RoundRobin, LeastLoaded, WeightedScoring (composable scorer pipeline), AlwaysBusiest templates, NewRoutingPolicy factory
│ ├── routing_scorers.go # ScorerConfig, scorer implementations (queue-depth, kv-utilization, load-balance), ParseScorerConfigs, IsValidScorer, DefaultScorerConfigs, newScorerWithObserver factory
│ ├── routing_prefix_scorer.go # Prefix-affinity scorer + observer (proportional prefix matching)
│ ├── prefix_cache_index.go # PrefixCacheIndex: per-instance LRU of hierarchical block hashes
│ ├── priority.go # PriorityPolicy interface with ConstantPriority, SLOBasedPriority, and InvertedSLO templates, NewPriorityPolicy factory
│ ├── scheduler.go # InstanceScheduler interface with FCFSScheduler, PriorityFCFSScheduler, SJFScheduler, and ReversePriority templates, NewScheduler factory
│ ├── latency_model.go # LatencyModel interface (3 methods), NewLatencyModelFunc registration variable, MustNewLatencyModel nil-guarded wrapper
│ ├── router_state.go # RouterState bridge type (Snapshots + Clock) for cluster-level policies
│ ├── bundle.go # PolicyBundle YAML loading, LoadPolicyBundle, Validate
│ ├── event.go # Event types (Arrival, Queued, Step, Scheduled, RequestLeft, Timeout) with (timestamp, priority, seqID) ordering
│ ├── request.go # RequestState typed constants (StateQueued, StateRunning, StateCompleted, StateTimedOut), Request lifecycle and state machine, Deadline field for client timeout, Priority field for scheduler-aware ordering, AssignedInstance for cluster routing provenance (#181), workload metadata (TenantID, SLOClass, etc.), MaxOutputLen (client output budget for enqueue guard)
│ ├── kv_store.go # KVStore interface (12 methods: +SetClock, +ConsumePendingTransferLatency, +MirrorToCPU), NewKVStoreFromConfig registration variable, MustNewKVCacheState/MustNewKVStoreFromConfig nil-guarded wrappers
│ ├── batch.go # Batch struct
│ ├── batch_formation.go # BatchFormation interface, BatchContext/BatchResult types, VLLMBatchFormation (FCFS + chunked-prefill + preemption), NewBatchFormation() factory
│ ├── queue.go # FIFO wait queue
│ ├── metrics.go # TTFT, TPOT, E2E collection and SaveResults()
│ ├── metrics_utils.go # Percentile/mean calculation, MetricsOutput JSON struct, NewRequestMetrics canonical constructor
│ ├── rng.go # PartitionedRNG for deterministic multi-subsystem simulation
│ ├── model_hardware_config.go # ModelConfig, HardwareCalib structs (config types stay in sim/); HardwareCalib includes MemoryGiB (used by KV capacity auto-calculation in roofline/crossmodel modes). ModelConfig.WeightBytesPerParam (0=fallback to BytesPerParam) with EffectiveWeightBytesPerParam() accessor decouples weight storage precision from compute/KV dtype. Note: MaxModelLen is int64 (aligned with ProgressIndex, TotalKVBlocks, BlockSizeTokens).
│ └── internal/ # Shared internal packages
│ ├── hash/ # Block-level hashing for prefix cache
│ ├── testutil/ # Shared test infrastructure (golden dataset loading)
│ └── util/ # General utility functions
├── sim/kv/ # KV cache implementations (PKG-1)
│ ├── cache.go # KVCacheState (single-tier GPU)
│ ├── tiered.go # TieredKVCache (GPU+CPU mirror/reload, vLLM v1 model)
│ └── register.go # NewKVStore factory + init()-based registration into sim/
├── sim/latency/ # Latency model implementations (PKG-2)
│ ├── latency.go # RooflineLatencyModel (default, analytical FLOPs/bandwidth), BlackboxLatencyModel (alpha/beta regression), CrossModelLatencyModel (physics-informed cross-model), NewLatencyModel(LatencyCoeffs, ModelHardwareConfig) factory
│ ├── trained_roofline.go # TrainedRooflineLatencyModel: roofline basis functions × learned corrections (7β + 3α from training pipeline)
│ ├── crossmodel.go # CrossModelLatencyModel: physics-informed step time from architecture features (MoE-aware)
│ ├── roofline.go # rooflineStepTime(), calculateTransformerFlops(), calculateMemoryAccessBytes(), StepConfig/PrefillRequestConfig/DecodeRequestConfig types
│ ├── kv_capacity.go # CalculateKVBlocks: auto-derive total KV cache blocks from model architecture + GPU memory; KVCapacityParams, ExtractKVCapacityParams, computeModelWeightBytes
│ ├── config.go # HFConfig, GetHWConfig(), GetModelConfig(), ValidateRooflineConfig(), parseHWConfig(), ParseHFConfig()
│ └── register.go # init()-based registration of NewLatencyModelFunc into sim/
├── sim/cluster/ # Multi-replica cluster simulation
│ ├── instance.go # InstanceSimulator wraps sim.Simulator via NewInstanceSimulator(id, SimConfig) with run-once guard; delegates to Simulator observation methods (QueueDepth(), BatchSize(), etc.)
│ ├── cluster.go # ClusterSimulator orchestrates N instances with shared-clock event loop, online routing pipeline, and metrics aggregation; Run() returns error
│ ├── cluster_event.go # ClusterArrivalEvent, AdmissionDecisionEvent, RoutingDecisionEvent
│ ├── counterfactual.go # computeCounterfactual() for top-k candidate ranking and regret computation
│ ├── snapshot.go # CachedSnapshotProvider (returns sim.RoutingSnapshot), ObservabilityConfig
│ ├── metrics.go # RawMetrics, Distribution, FitnessResult, CollectRawMetrics (accepts priorityPolicy), ComputeFitness (returns (FitnessResult, error)), anomaly detection, ParseFitnessWeights with NaN/Inf validation, per-SLO-class metrics, JainFairnessIndex
│ ├── deployment.go # DeploymentConfig embeds sim.SimConfig + cluster-only fields; ToSimConfig() returns the embedded config
│ └── evaluation.go # EvaluationResult wrapper (RawMetrics + FitnessResult + trace + summary)
├── sim/workload/ # ServeGen-informed workload generation
│ ├── spec.go # WorkloadSpec v2, ClientSpec (with Model field), ArrivalSpec, DistSpec, YAML loading, v1→v2 auto-upgrade (UpgradeV1ToV2), IsValidSLOClass accessor
│ ├── arrival.go # ArrivalSampler: Poisson, Gamma (Marsaglia-Tsang), Weibull (bisection), Constant (fixed-interval)
│ ├── distribution.go # LengthSampler: Gaussian, Exponential, ParetoLogNormal, EmpiricalPDF, Constant
│ ├── client.go # Rate normalization, prefix group management
│ ├── generator.go # GenerateRequests pipeline with client decomposition
│ ├── servegen.go # Native ServeGen data file loading (chunk-*-trace.csv + dataset.json)
│ ├── tracev2.go # Trace v2 format (YAML header + CSV data); 26-column schema including finish_reason (backward-compat with 25-column pre-finish_reason traces)
│ ├── replay.go # Trace v2 → sim.Request with synthetic token IDs
│ ├── calibrate.go # CalibrationReport, PrepareCalibrationPairs, MAPE/Pearson r
│ ├── multimodal.go # Multimodal token generation (text+image+audio+video)
│ ├── reasoning.go # Reasoning multi-turn with context accumulation
│ ├── session.go # SessionManager: closed-loop session tracking, follow-up round generation on completion
│ ├── network.go # Client-perspective latency (RTT + bandwidth)
│ ├── inference_perf.go # inference-perf format: InferencePerfSpec, expansion, validation
│ ├── scenarios.go # Built-in presets (bursty, unfair, prefix-heavy, mixed-slo)
│ ├── convert.go # Format converters: ConvertServeGen, ConvertPreset, ComposeSpecs
│ ├── cohort.go # CohortSpec expansion: diurnal, spike, drain patterns → lifecycle windows
│ └── synthesis.go # Flag-to-spec synthesis: SynthesizeFromDistribution, SynthesizeFromPreset
├── sim/trace/ # Decision trace recording
│ ├── trace.go # TraceLevel, TraceConfig, SimulationTrace, NewSimulationTrace, recording methods
│ ├── record.go # AdmissionRecord, RoutingRecord, CandidateScore (pure data types, no sim/ dependency)
│ └── summary.go # TraceSummary, Summarize()
├── model_configs/ # Auto-fetched HuggingFace config.json files (gitignored)
├── defaults.yaml # Pre-trained coefficients, default GPU/TP/vLLM mappings, workload presets
├── hardware_config.json # GPU specifications
├── examples/ # Example configuration files
├── testdata/goldendataset.json # Golden dataset for regression tests
├── docs/
│ ├── getting-started/ # New user onboarding
│ │ ├── index.md # What is BLIS?
│ │ ├── installation.md # Build from source
│ │ ├── quickstart.md # First simulation
│ │ └── tutorial.md # Capacity planning walkthrough
│ ├── guide/ # Task-oriented user guides
│ │ ├── index.md # Guide overview
│ │ ├── routing.md # Routing policies
│ │ ├── admission.md # Admission control
│ │ ├── scheduling.md # Scheduling & priority
│ │ ├── latency-models.md # Latency models (roofline + blackbox)
│ │ ├── kv-cache.md # KV cache & memory management
│ │ ├── workloads.md # Workload specifications
│ │ ├── cluster.md # Cluster simulation
│ │ ├── results.md # Metrics & results
│ │ ├── experimentation.md # Hypothesis-driven experimentation
│ │ └── skills-and-plugins.md # Claude Code skills & plugins
│ ├── concepts/ # Architecture and design documentation
│ │ ├── index.md # Concepts overview
│ │ ├── glossary.md # Concepts glossary
│ │ ├── architecture.md # Cluster architecture
│ │ ├── core-engine.md # Core DES engine
│ │ └── roofline.md # Roofline step time estimation
│ ├── reference/ # Configuration and model reference
│ │ ├── index.md # Reference overview
│ │ ├── project-structure.md # Project file organization (this file)
│ │ ├── configuration.md # Configuration reference
│ │ ├── models.md # Supported models catalog
│ │ └── workload-spec.md # Workload spec YAML schema
│ ├── methodology/ # Research methodology documentation
│ │ ├── index.md # Methodology overview
│ │ ├── strategy-evolution.md # Strategy Evolution methodology guide
│ │ ├── hypothesis-bundles.md # Hypothesis bundle examples and writing guide
│ │ └── principles.md # Discovered principles catalog (30 principles)
│ ├── contributing/ # Contributor documentation
│ │ ├── index.md # Contributing landing page
│ │ ├── extension-recipes.md # Step-by-step extension guides
│ │ ├── pr-workflow.md # PR development workflow
│ │ ├── design-process.md # Design document process
│ │ ├── macro-planning.md # Macro-level planning process
│ │ ├── hypothesis.md # Hypothesis experiment process
│ │ ├── convergence.md # Universal Convergence Protocol
│ │ ├── standards/ # Canonical rules, invariants, principles, experiment standards
│ │ └── templates/ # Artifact templates + agent prompts
│ │ ├── design-guidelines.md # DES foundations, module architecture
│ │ ├── macro-plan.md # Multi-PR template (human-readable)
│ │ ├── macro-plan-prompt.md # Agent preamble for macro planning
│ │ ├── micro-plan.md # Single-PR template (human-readable)
│ │ ├── micro-plan-prompt.md # Agent preamble for writing-plans skill
│ │ └── hypothesis.md # Experiment FINDINGS.md template
│ └── plans/ # Active implementation plans (excluded from MkDocs)
│ └── archive/ # Completed design docs (architectural reference)
├── CONTRIBUTING.md # Contributor guide (references docs/contributing/standards/)
└── mkdocs.yml # MkDocs Material site configuration