Cluster Simulation

This guide covers running multi-instance BLIS simulations — the full pipeline from request arrival through admission, routing, scheduling, and metrics aggregation.

# Quick example: 4-instance cluster with tracing
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500 \
  --trace-level decisions --summarize-trace

Single-Instance vs Cluster Mode

Setting                       Behavior
--num-instances 1 (default)   Single-instance: requests go directly to the wait queue; no admission or routing
--num-instances N (N > 1)     Cluster mode: requests pass through admission → routing → per-instance queues

The Pipeline

Request → Admission → Routing → Instance WaitQueue → Batch Formation → Step → Completion
                                                     └─ KV Allocation + Latency Estimation

Each stage is configurable:

Stage            Controls                                     Key Flags
Admission        Whether to accept the request                --admission-policy, --token-bucket-capacity
Routing          Which instance receives it                   --routing-policy, --routing-scorers
Scheduling       The order requests run within the instance   --scheduler, --priority-policy
Batch Formation  Which requests form the next batch           --max-num-running-reqs, --max-num-scheduled-tokens
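As a concrete illustration, the admission → routing → enqueue flow can be sketched in a few lines. The `Instance` type, the always-accept admission stub, and the least-loaded routing rule here are hypothetical stand-ins, not BLIS internals:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    index: int
    wait_queue: List[str] = field(default_factory=list)

def admit(request_id: str) -> bool:
    # Admission stage: decide whether to accept the request at all
    # (stand-in for --admission-policy; here we always accept).
    return True

def route(instances: List[Instance]) -> Instance:
    # Routing stage: pick a target instance (stand-in for --routing-policy;
    # here, least-loaded with ties broken by lowest index).
    return min(instances, key=lambda inst: (len(inst.wait_queue), inst.index))

def submit(request_id: str, instances: List[Instance]) -> int:
    # Admitted requests land in the chosen instance's wait queue.
    if not admit(request_id):
        return -1  # rejected
    target = route(instances)
    target.wait_queue.append(request_id)
    return target.index
```

Scheduling and batch formation then operate per instance on each wait queue.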

Tensor Parallelism

The --tp flag sets the tensor parallelism degree for all instances. TP affects both latency (FLOPs split across GPUs) and memory (KV blocks split across GPUs):

# TP=2: 2 GPUs per instance
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --tp 2 --rate 100 --num-requests 500

# TP=4: 4 GPUs per instance (lower latency, fewer KV blocks per GPU)
./blis run --model qwen/qwen3-14b \
  --num-instances 2 --tp 4 --rate 100 --num-requests 500
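As a rough sketch of the trade-off, assuming compute-bound steps and ignoring inter-GPU communication (both simplifications; the function names are hypothetical):

```python
def cluster_gpu_count(num_instances: int, tp: int) -> int:
    # Each instance spans `tp` GPUs.
    return num_instances * tp

def approx_step_time_s(step_flops: float, per_gpu_flops_per_s: float, tp: int) -> float:
    # FLOPs are split across the instance's GPUs, so compute-bound step
    # time shrinks roughly as 1/tp (communication overhead ignored).
    return step_flops / (per_gpu_flops_per_s * tp)
```

Note that both CLI examples above hold the total GPU count constant (4 × 2 = 2 × 4 = 8), trading instance count against per-instance latency.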

Homogeneous instances

All instances share the same SimConfig (model, GPU, TP, KV blocks). BLIS does not currently model heterogeneous fleets (mixed GPU types or TP configurations).

Scaling and Saturation

Instance scaling produces super-linear TTFT improvement near saturation. With the default model (Qwen3-14B / H100 / TP=1, ~17 req/s per instance at saturation), scaling from 4→12 instances at rate=200 improves TTFT p99 from ~1,500ms to ~54ms.

This happens because the per-instance queue growth rate, excess = λ/k − μ (arrival rate λ, instance count k, per-instance service capacity μ), drops faster than linearly as k grows:

4 instances:  excess = 200/4 - 17  = 33 req/s per instance   → rapid queue growth
8 instances:  excess = 200/8 - 17  = 8 req/s per instance    → near saturation
12 instances: excess = 200/12 - 17 = -0.3 req/s per instance → balanced (sub-saturation)

At sub-saturation (excess ≤ 0), TTFT converges to the baseline (~54ms) and further scaling provides diminishing returns.
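The arithmetic above is easy to reproduce (the helper name is illustrative):

```python
def excess_rate(arrival_rate: float, num_instances: int, per_instance_capacity: float) -> float:
    # Per-instance queue growth rate: excess = lambda / k - mu
    return arrival_rate / num_instances - per_instance_capacity

# rate = 200 req/s, ~17 req/s per-instance capacity at saturation:
table = {k: round(excess_rate(200, k, 17), 1) for k in (4, 8, 12)}
# table == {4: 33.0, 8: 8.0, 12: -0.3}
```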

Admission Control

For rate-limiting and traffic shaping policies, see the Admission Control page.

Admission and Routing Latency

Model real network/processing overhead between gateway and backend:

--admission-latency 1000   # 1ms admission decision overhead
--routing-latency 500      # 0.5ms routing decision overhead

These add simulated delays to the admission and routing pipeline, modeling gRPC overhead, service mesh hops, and queue serialization in production deployments.
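Judging by the comments above, the flag values are in microseconds; the per-request overhead before a request reaches an instance queue is then simply additive (hypothetical helper):

```python
def enqueue_time_us(arrival_us: int, admission_latency_us: int, routing_latency_us: int) -> int:
    # A request reaches an instance wait queue only after paying both
    # the admission and the routing decision delays.
    return arrival_us + admission_latency_us + routing_latency_us

# --admission-latency 1000 --routing-latency 500 adds 1.5 ms per request:
# enqueue_time_us(0, 1000, 500) == 1500
```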

Decision Tracing

Log every routing decision for offline analysis:

./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500 \
  --trace-level decisions --summarize-trace --counterfactual-k 3

The trace summary shows:

  • Target Distribution: how many requests went to each instance
  • Mean/Max Regret: how much better an alternative routing decision could have been

Counterfactual regret for weighted policies

For score-based policies (weighted, least-loaded), counterfactual regret is structurally zero — the chosen instance is always the highest-scoring one. Regret is only meaningful for non-score-based policies like round-robin.
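A minimal sketch of why this is so (illustrative scores and a hypothetical helper, not BLIS output):

```python
def counterfactual_regret(scores: list, chosen: int) -> float:
    # Regret: how much better the best-scoring instance did vs. the one chosen.
    return max(scores) - scores[chosen]

scores = [2.0, 9.0, 5.0]

# A score-based policy always picks the argmax, so its regret is zero:
best = scores.index(max(scores))
# counterfactual_regret(scores, best) == 0.0

# Round-robin can land on any instance, so its regret can be positive:
# counterfactual_regret(scores, 0) == 7.0
```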

Event Ordering

The cluster uses (timestamp, priority, seqID) ordering for deterministic event processing:

  • Cluster events at time T process before instance events at time T
  • Same-time instance ties broken by lowest instance index
  • This ensures determinism (INV-6) but means results differ from a simple M/M/k queueing model
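The ordering rule can be sketched with a min-heap of (timestamp, priority, seqID) tuples; the concrete priority values and the seq-ID assignment here are assumptions for illustration:

```python
import heapq

CLUSTER, INSTANCE = 0, 1  # assumed: cluster events sort before instance events

events = []
heapq.heappush(events, (5.0, INSTANCE, 2, "instance-1 step"))
heapq.heappush(events, (5.0, INSTANCE, 1, "instance-0 step"))
heapq.heappush(events, (5.0, CLUSTER, 0, "route request"))

# Same timestamp: the cluster event runs first, then instance events by
# seq ID (which here tracks instance index).
order = [heapq.heappop(events)[3] for _ in range(3)]
# order == ["route request", "instance-0 step", "instance-1 step"]
```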

Work-Conserving Property

BLIS is work-conserving (INV-8): it never idles while requests wait. After every step completion, if the WaitQ has requests, a new StepEvent is immediately scheduled. Real systems may have scheduling delays not modeled here.
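The invariant can be sketched as a loop (a simplification: real batch formation also respects --max-num-scheduled-tokens and KV availability):

```python
from collections import deque

def run_until_empty(wait_queue: deque, max_batch: int) -> int:
    # Work-conserving: as long as requests wait, the next step is scheduled
    # immediately after the previous one completes; the instance never idles.
    steps = 0
    while wait_queue:
        for _ in range(min(max_batch, len(wait_queue))):
            wait_queue.popleft()  # form a batch and run one step
        steps += 1
    return steps

# 5 waiting requests with a batch cap of 2 drain in 3 back-to-back steps:
# run_until_empty(deque("abcde"), 2) == 3
```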

Further Reading