Tutorial: Capacity Planning¶
This tutorial walks through a complete capacity planning exercise: determining how many inference instances you need to serve a target request rate while meeting latency SLOs.
Scenario: You're deploying Qwen3 14B on H100 GPUs with TP=1. Your SLO is TTFT p99 < 500ms. You need to find the minimum number of instances for 200 requests/second.
Step 1: Estimate Instance Capacity¶
Before scaling up, measure the throughput of a single instance under load. Run enough requests at a high arrival rate to saturate the instance — this reveals the maximum throughput with continuous batching:
Check the responses_per_sec value in the output. For Qwen3 14B / H100 / TP=1 with default workload (512 input / 512 output tokens), a saturated instance handles roughly 17 requests/second.
Why measure at high load?
With continuous batching, throughput depends on batch size. At low arrival rates (e.g., --rate 2), requests trickle in one at a time and the instance processes only ~2 req/s — not because it's slow, but because there's nothing else to batch. At saturation, many concurrent requests share each decode step, amortizing overhead and reaching ~17 req/s. Always measure capacity under load.
This means for 200 req/s, you need at minimum ceil(200/17) = 12 instances. Let's verify with simulation.
--rate is the total arrival rate
The --rate flag specifies the total arrival rate across the cluster, not per-instance. With --rate 200 --num-instances 8, each instance receives roughly 200/8 = 25 req/s (distributed by the routing policy).
Step 2: Baseline — Single Instance at Low Load¶
At 2 req/s (well below capacity), TTFT p99 is around 50ms. Note this value — this is your best-case baseline that won't improve further with more instances.
Step 3: Scale Up and Find the Saturation Point¶
Run simulations at increasing instance counts for 200 req/s:
# 4 instances (50 req/s per instance vs ~17 saturated capacity → heavily overloaded)
./blis run \
--model qwen/qwen3-14b \
--num-instances 4 --rate 200 --num-requests 1000
# 8 instances (25 req/s per instance → still above capacity but batching helps)
./blis run \
--model qwen/qwen3-14b \
--num-instances 8 --rate 200 --num-requests 1000
# 12 instances (~17 req/s per instance → balanced, near baseline)
./blis run \
--model qwen/qwen3-14b \
--num-instances 12 --rate 200 --num-requests 1000
Compare the cluster-level ttft_p99_ms across runs:
- 4 instances: TTFT p99 around 1,500ms — the per-instance arrival rate (50 req/s) far exceeds capacity (~17 req/s), so requests queue up and wait
- 8 instances: TTFT p99 drops to around 60ms — per-instance arrival rate (25 req/s) is above single-request capacity but manageable with continuous batching
- 12 instances: TTFT p99 around 54ms — near baseline, with comfortable headroom
Understanding saturation
Saturation means requests arrive faster than they can be served, so the queue grows continuously. In queueing theory terms, the per-instance excess rate is excess = λ/k - μ, where λ is the total arrival rate (200 req/s), k is the instance count, and μ is the per-instance service rate (~17 req/s). When excess > 0, the queue grows at that rate and TTFT degrades.
The improvement from 4→8 instances is dramatic (1,500ms → 60ms) because the excess rate drops from ~33 req/s to ~8 req/s per instance — a much larger relative change than the 2x instance count would suggest.
Step 4: Identify the Bottleneck Type¶
When TTFT is high, there are three possible causes:
- Queue saturation — arrival rate exceeds service capacity → add instances
- Memory saturation — KV cache preemptions degrade throughput → add KV blocks or reduce batch size
- Compute saturation — step time dominates → reduce batch size or use chunked prefill
Check the output for clues:
- High
preemption_count→ memory saturation - High
scheduling_delay_p99_ms→ queue saturation (requests waiting in the WaitQ) - Low
preemption_count+ lowscheduling_delay+ high TTFT → compute saturation
Step 5: Compare Routing Policies¶
With 8 instances at 200 req/s (near saturation), compare routing strategies:
# Round-robin (baseline)
./blis run \
--model qwen/qwen3-14b \
--num-instances 8 --rate 200 --num-requests 1000 \
--routing-policy round-robin
# Weighted (default profile)
./blis run \
--model qwen/qwen3-14b \
--num-instances 8 --rate 200 --num-requests 1000 \
--routing-policy weighted
# Least-loaded
./blis run \
--model qwen/qwen3-14b \
--num-instances 8 --rate 200 --num-requests 1000 \
--routing-policy least-loaded
With uniform workloads (same prompt/output distribution), routing policies produce similar results because all instances are roughly equally loaded. Routing differentiation becomes meaningful with heterogeneous workloads — for example, prefix-heavy traffic where prefix-affinity routing concentrates same-prefix requests on the same instance for KV cache reuse:
./blis run \
--model qwen/qwen3-14b \
--num-instances 8 --rate 200 --num-requests 1000 \
--routing-policy weighted \
--routing-scorers "prefix-affinity:5,queue-depth:1" \
--prefix-tokens 512
Step 6: Evaluate with Fitness Scores¶
For automated comparison across many configurations, use fitness evaluation:
./blis run \
--model qwen/qwen3-14b \
--num-instances 12 --rate 200 --num-requests 1000 \
--routing-policy weighted \
--fitness-weights "p99_ttft:3,mean_e2e:1,throughput:2"
The fitness score is a weighted sum of normalized metrics — higher is better. Each latency metric is normalized to a [0, 1] range using 1/(1+x/1000), and throughput is normalized to [0, 1] using throughput/max_throughput. With weights p99_ttft:3, mean_e2e:1, throughput:2, the score is a weighted sum out of a theoretical maximum of 6.0 (the sum of all weights).
Fitness score normalization
The 1/(1+x/1000) normalization compresses large raw differences into small score differences. A 38% TTFT improvement may appear as only an 8% fitness score difference. Always examine raw metrics (ttft_p99_ms, e2e_mean_ms, responses_per_sec) alongside fitness scores when making capacity decisions.
Step 7: Validate Against Your SLO¶
Your SLO: TTFT p99 < 500ms at 200 req/s.
From the simulations above, 8 instances already meet the SLO (TTFT p99 ~60ms), and 12 instances provide comfortable headroom (~54ms). Add 20-30% headroom for traffic spikes (real deployments see bursty traffic that exceeds the Poisson assumption).
Key Takeaways¶
- Measure capacity under load — run at high arrival rates (e.g.,
--rate 500) to measure saturated throughput; low-load measurements underestimate capacity due to small batch sizes - Saturation is non-linear — TTFT degrades super-linearly as you approach capacity. Scaling from 4→8 instances can produce a 25x improvement, not just 2x, because the per-instance excess rate drops dramatically
- Check the bottleneck type — preemption count, scheduling delay, and raw TTFT tell you whether to add instances, add memory, or tune batch size
- Routing matters for heterogeneous workloads — with uniform traffic, routing policies produce similar results; with prefix-heavy or mixed-SLO workloads, prefix-affinity and load-aware routing provide meaningful differentiation
- Deterministic replay — use
--seedto get identical results for A/B comparisons
What's Next¶
- Routing Policies — deep dive into scorer composition and signal freshness
- KV Cache & Memory — tune KV blocks, prefix caching, and chunked prefill
- Metrics & Results — understand all output fields and common patterns
- Hypothesis Experimentation — run rigorous experiments with the
/hypothesis-experimentskill