Admission Control¶
Admission control is the first gate in the cluster pipeline. It decides whether to accept or reject incoming requests before they reach the routing stage. Admission only applies in cluster mode (--num-instances > 1) -- single-instance simulations skip directly to the wait queue.
# Rate-limit a 4-instance cluster with token bucket admission
./blis run --model qwen/qwen3-14b \
--num-instances 4 --rate 500 --num-requests 2000 \
--admission-policy token-bucket \
--token-bucket-capacity 10000 --token-bucket-refill-rate 1000
Available Policies¶
| Policy | Flag Value | Behavior |
|---|---|---|
| Always-admit | --admission-policy always-admit (default) |
Accepts all requests unconditionally. No filtering. |
| Token-bucket | --admission-policy token-bucket |
Rate-limiting. Each request consumes tokens equal to its input token count. Tokens refill at a constant rate. Rejects when the bucket is empty. |
| Tier-shed | --admission-policy tier-shed |
SLO-aware shedding. Under overload, rejects requests whose SLO tier priority is below tier_shed_min_priority. See SLO Tier Priorities below. |
| GAIE-legacy | --admission-policy gaie-legacy |
Saturation-based shedding matching production llm-d/GAIE behavior. Non-sheddable requests always pass; sheddable requests (priority < 0) rejected when pool-average saturation >= 1.0. See GAIE-Legacy Admission below. |
| Reject-all | --admission-policy reject-all |
Rejects all requests unconditionally. Pathological template for testing. |
Token Bucket Mechanics¶
The token bucket policy controls throughput by treating each request's input token count as a cost:
| Flag | Description | Default |
|---|---|---|
--token-bucket-capacity |
Maximum number of tokens the bucket can hold | 10000 |
--token-bucket-refill-rate |
Tokens added per second of simulation time | 1000 |
How it works:
- Bucket starts full. At initialization, the bucket holds
capacitytokens. - Refill is continuous. On each admission decision, the bucket refills proportionally to elapsed simulation time:
refill = elapsed_microseconds * refill_rate / 1,000,000. - Cost per request = number of input tokens. A request with 512 input tokens costs 512 tokens from the bucket.
- Admission check. If
current_tokens >= cost, the request is admitted and the cost is subtracted. Otherwise the request is rejected with reason"insufficient tokens". - Capacity cap. Tokens never accumulate beyond
capacity, even after long idle periods.
Sizing the bucket
With --token-bucket-capacity 10000 --token-bucket-refill-rate 1000 and requests averaging 512 input tokens, the sustained admission rate is roughly 1000 / 512 ~ 1.95 req/s. The bucket's capacity of 10000 tokens allows a burst of up to 10000 / 512 ~ 19 requests before rate-limiting kicks in.
Rejected requests are counted in the output anomaly counters (Rejected Requests) and in the full pipeline conservation formula (num_requests == injected_requests + rejected_requests), but they never enter the routing stage or any instance queue. Every rejection — regardless of admission policy — is also recorded in the per-SLO-class ShedByTier counter, so you can see which request classes are being rejected (e.g., {"batch": 12, "sheddable": 8}). ShedByTier is also exposed in periodic ProgressSnapshot callbacks, allowing live monitoring of per-tier shedding rates during simulation.
When to Use Admission Control¶
- Overload protection. When the arrival rate significantly exceeds service capacity, unbounded queues grow without limit. Admission shedding keeps queue depth manageable.
- Cost control. Limit total token throughput to match a token budget or downstream rate limit.
- Graceful degradation. Shed excess load to protect latency for admitted requests. Under extreme overload, routing distributes load and scheduling orders within instances, but neither can reduce total queue depth — admission is the lever that can. The tier-shed policy provides SLO-aware shedding — rejecting lower-priority classes (batch, sheddable, background) before higher-priority ones (critical, standard).
- Testing rejection paths. The
reject-allpolicy verifies that rejection counting, trace recording, and conservation invariants hold when no requests are admitted.
Admission is the third lever
Routing distributes load across instances. Scheduling orders requests within each instance. But when total arrival rate exceeds total service capacity, neither routing nor scheduling can reduce the queue -- they can only redistribute it. Admission control is the mechanism that actually reduces inbound volume.
SLO Tier Priorities¶
Every request carries an SLOClass label that determines its priority throughout the admission pipeline. Priorities follow the GAIE (Gateway API Inference Extension) convention where negative priority means sheddable.
Default Priorities¶
| SLO Class | Priority | Sheddable? |
|---|---|---|
critical |
4 | No |
standard |
3 | No (also the default for empty/unknown classes) |
batch |
-1 | Yes |
sheddable |
-2 | Yes |
background |
-3 | Yes |
The sheddable/non-sheddable boundary is priority < 0. This matches the IsSheddable contract from llm-d's gateway implementation.
Customizing Priorities¶
Override specific priorities via the slo_priorities field in a policy bundle YAML:
admission:
policy: "tier-shed"
tier_shed_min_priority: 3
slo_priorities:
batch: 0 # promote batch to non-sheddable
critical: 10 # widen the gap between critical and standard
Unspecified classes retain their GAIE defaults. The slo_priorities map merges on top of defaults — you only need to specify the classes you want to change.
Where Priorities Are Used¶
Priorities affect three components:
-
Tier-shed admission (
sim/admission.go): Under overload, rejects requests withPriority(class) < MinAdmitPriority. With the defaulttier_shed_min_priority: 3, this admits critical (4) and standard (3), while rejecting batch (-1), sheddable (-2), and background (-3). -
Tenant budget enforcement (
sim/cluster/cluster_event.go): When a tenant exceeds their capacity budget, only sheddable requests (IsSheddable = priority < 0) are shed. Critical and standard traffic is always protected regardless of budget. -
Gateway queue dispatch (
sim/cluster/gateway_queue.go): Inprioritydispatch mode, higher-priority requests are dequeued first. When the queue is at capacity, the request is rejected by default (llm-d parity). With--queue-shedding, the lowest-priority sheddable request is evicted instead (BLIS-extra experimental feature, not in llm-d).
Tier-Shed Admission¶
The tier-shed policy sheds lower-priority SLO tiers under cluster overload. It activates when the maximum per-instance in-flight load exceeds tier_shed_threshold:
./blis run --model qwen/qwen3-14b \
--num-instances 4 --rate 500 --num-requests 2000 \
--admission-policy tier-shed \
--policy-config policies.yaml
Where policies.yaml contains:
admission:
policy: "tier-shed"
tier_shed_threshold: 0 # 0 = shed at any load level
tier_shed_min_priority: 3 # admit standard(3)+critical(4), shed the rest
Under overload, any request with Priority(class) < tier_shed_min_priority is rejected. Under normal load (below threshold), all requests are admitted regardless of priority.
Choosing tier_shed_min_priority
3(default): Admits critical and standard. Sheds batch, sheddable, background.0: Admits all non-sheddable classes (priority >= 0). Sheds only negative-priority classes.-3: Admits everything (effectively disables tier-shed). Useful when you want tenant budget enforcement but not tier-level shedding.
GAIE-Legacy Admission¶
The gaie-legacy policy replicates the saturation-based admission behavior from production llm-d's Gateway API Inference Extension (GAIE). It uses a two-tier decision tree:
- Non-sheddable requests (priority >= 0: critical, standard) are always admitted, regardless of cluster saturation.
- Sheddable requests (priority < 0: batch, sheddable, background) are rejected when pool-average saturation >= 1.0.
Saturation Formula¶
The saturation formula averages per-instance utilization ratios across the cluster, taking the most constrained resource (compute queue or memory) for each instance:
This matches the production GAIE implementation in gateway-api-inference-extension/pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/detector.go:115-137.
When saturation >= 1.0, the cluster is considered overloaded and sheddable traffic is rejected. When saturation < 1.0, all requests pass.
Configuration¶
Configure via --policy-config YAML (no CLI flags for thresholds, consistent with tier-shed):
admission:
policy: "gaie-legacy"
gaie_qd_threshold: 5 # queue depth threshold per instance (default: 5)
gaie_kv_threshold: 0.8 # KV cache utilization threshold, in (0, 1.0] (default: 0.8)
| YAML Field | Type | Default | Description |
|---|---|---|---|
gaie_qd_threshold |
float | 5 | Per-instance queue depth at which the QD component reaches 1.0. Must be > 0. |
gaie_kv_threshold |
float | 0.8 | Per-instance KV cache utilization at which the KV component reaches 1.0. Must be in (0, 1.0]. |
Default Justification¶
Both defaults come directly from the GAIE production source code:
gaie_qd_threshold = 5: FromDefaultQueueDepthThresholdinsaturationdetector/utilization/config.go:31. Represents the "ideal" queue capacity for a single endpoint — at 5 queued requests per instance, the compute resource is considered at capacity.gaie_kv_threshold = 0.8: FromDefaultKVCacheUtilThresholdinsaturationdetector/utilization/config.go:33. At 80% KV cache utilization, the memory resource is considered at capacity, leaving 20% headroom for continuous batching dynamics.
Edge Cases¶
Empty cluster
When there are no instance snapshots (e.g., all instances are still loading), saturation defaults to 1.0 — a conservative choice matching GAIE's behavior where stale or missing metrics are treated as fully saturated (detector.go:116-118). Non-sheddable requests still pass; sheddable requests are rejected.
Stale metrics
GAIE production treats stale per-pod metrics (older than MetricsStalenessThreshold, default 200ms) as score=1.0. BLIS does not model per-snapshot staleness — signal freshness is controlled globally via --snapshot-refresh-interval (INV-7). This is a deliberate simplification: BLIS controls the simulator clock, so signal freshness is deterministic.
Choosing thresholds
The default thresholds (QD=5, KV=0.8) match production llm-d. Lower gaie_qd_threshold makes the policy more aggressive about shedding under queue buildup. Lower gaie_kv_threshold makes it more sensitive to KV cache pressure. Both thresholds follow the same validation as GAIE: QD > 0 (strictly positive), KV in (0, 1.0]. Extreme values (e.g., gaie_qd_threshold: 0.001) are accepted as long as they pass validation — BLIS does not clamp or warn. Use the GAIE defaults unless you have a specific reason to change them.
Comparison with Tier-Shed¶
| Aspect | tier-shed |
gaie-legacy |
|---|---|---|
| Signal | Max per-instance effective load (QueueDepth + BatchSize) | Pool-average saturation (QD and KV ratios) |
| Granularity | Configurable priority threshold (tier_shed_min_priority) |
Binary: sheddable (priority < 0) vs non-sheddable |
| Activation | When any instance exceeds tier_shed_threshold |
When pool-average saturation >= 1.0 |
| Production parity | BLIS-specific | Matches llm-d/GAIE |
Flow Control Admission¶
When --flow-control is enabled, the FlowControlAdmission policy replaces the configured
admission policy. In this mode, admission and queuing are a single step -- the queue IS the
admission decision, matching llm-d's FlowControlAdmissionController.
llm-d Parity Summary¶
BLIS's default flow control behavior matches llm-d's ShardProcessor. BLIS-extra features are off by default and require explicit flags.
| Behavior | llm-d | BLIS default | BLIS opt-in flag |
|---|---|---|---|
| Queue full → reject | Yes | Yes | — |
| Queue full → shed victim | No | No | --queue-shedding |
| Dispatch on enqueue | Yes | Yes | — |
| Dispatch via periodic tick (1ms) | Yes | Yes | --dispatch-tick-interval to tune |
| Dispatch on completion | No | No | — |
| Dispatch on eviction/TTL | No | No | — |
| Band ceilings: constant | Yes | Yes | — |
| Band ceilings: interpolated | Interface only, no impl | No | --usage-limit-threshold < 1.0 |
| Fairness: global-strict | Yes (plugin) | Yes | — |
| Fairness: round-robin | Yes (plugin) | No | --fairness-policy round-robin |
Bold = llm-d parity. BLIS-extra features are documented in code comments and only activate when explicitly enabled.
How It Works¶
- Incoming request is enqueued into a per-priority-band, per-flow queue
- Each unique (TenantID, Priority) pair gets its own FIFO queue within a priority band
- Dispatch order follows
--dispatch-order: withpriority, iterates bands highest-priority first; withfifo(default), picks the globally-earliest arrival across all bands; withslo-deadline, dispatches the request with the earliest SLO deadline within each flow (see SLO-Deadline Dispatch Ordering below) - Within a band,
--fairness-policycontrols flow selection:global-strict(default) picks the earliest arrival (lowest sequence ID);round-robincycles through tenants in sorted key order - Saturation gating: dispatch only when cluster saturation < 1.0
- Dispatch triggers (llm-d parity): on-enqueue dispatch + periodic dispatch tick (default 1ms, configurable via
--dispatch-tick-interval). The tick is demand-driven — only active while the queue is non-empty.
Fairness Policy¶
| Flag | Description | Default |
|---|---|---|
--fairness-policy |
Intra-band dispatch fairness: global-strict, round-robin |
global-strict |
- global-strict: Picks the flow head with the earliest sequence ID (arrival order). Matches GIE's GlobalStrict.
- round-robin: Cycles through tenants in sorted key order, skipping empty flows. Maintains a per-band cursor.
Per-Band Capacity¶
| Flag | Description | Default |
|---|---|---|
--per-band-capacity |
Max requests per priority band (0=unlimited) | 0 |
When a band reaches its capacity limit, incoming requests for that band are rejected
(all entries in a band share the same priority, so displacement is never possible).
The global --max-gateway-queue-depth limit applies across all bands. When at capacity,
requests are rejected by default (llm-d parity). With --queue-shedding, cross-band
shedding of sheddable entries is enabled (BLIS-extra experimental feature).
Queue Shedding and Dispatch Tick¶
| Flag | Description | Default |
|---|---|---|
--queue-shedding |
Enable cross-band victim shedding when queue is full (BLIS-extra, not in llm-d) | false |
--dispatch-tick-interval |
Microseconds between periodic dispatch ticks | 1000 (1ms, llm-d parity) |
Queue shedding (--queue-shedding): When enabled, a full queue searches all bands for the lowest-priority sheddable request and evicts it to make room. This feature is not present in llm-d and is provided as an experimental option. Without this flag, full queues simply reject incoming requests.
Dispatch tick (--dispatch-tick-interval): The periodic DES event that triggers dispatch attempts from the gateway queue. Matches llm-d's 1ms dispatchTicker. The tick is demand-driven — only active while the queue is non-empty, and only when flow control is enabled. Default is 1000µs (1ms); omitting the flag or setting 0 in YAML both use this default.
Example¶
./blis run --model qwen/qwen3-14b --flow-control --saturation-detector utilization \
--queue-depth-threshold 5 --kv-cache-util-threshold 0.8 \
--per-band-capacity 100 --max-gateway-queue-depth 500
SLO-Deadline Dispatch Ordering¶
With --dispatch-order slo-deadline, the gateway queue dispatches the request with the earliest SLO deadline first within each flow. This matches GIE's slo-deadline-ordering-policy.
How it works:
- Fairness policy picks the flow (same as priority mode)
- Within the picked flow, dequeue the request with the earliest SLO deadline
- Deadline =
GatewayEnqueueTime + SLOTargetUs - Fallback hierarchy: per-request
SLOTargetUs→ per-tier--slo-targets→ far-future (FCFS) - Equal deadlines use arrival order (seqID) as tiebreaker
Setting SLO targets:
- Per-request (workload spec):
slo_target_us: 200000on a cohort (200ms TTFT target) - Per-tier (CLI flag):
--slo-targets critical=100000,standard=500000(fallback when workload spec doesn't set per-request targets) - Per-tier (policy bundle):
admission.slo_targetsin YAML
| Flag | Description | Default |
|---|---|---|
--dispatch-order slo-deadline |
Enable SLO-deadline ordering | fifo |
--slo-targets |
Per-SLO-class TTFT targets in µs (e.g., critical=100000,standard=500000) |
none |
Example:
./blis run --model qwen/qwen3-14b --flow-control --saturation-detector utilization \
--queue-depth-threshold 5 --kv-cache-util-threshold 0.8 \
--dispatch-order slo-deadline --slo-targets "critical=100000,standard=500000"
Observe integration: When slo_target_us > 0, blis observe injects the x-slo-ttft-ms HTTP header on outgoing requests, matching GIE's header convention.
In-Flight Eviction¶
When flow control is enabled and the system is saturated, BLIS can evict sheddable requests that are already running on instances to free capacity for higher-priority waiting requests. This is a BLIS-extra feature, not present in llm-d (llm-d's RequestEvictor/EvictN exists in code but has zero non-test callers in the production dispatch path).
How it works:
- When saturation >= 1.0 and a non-sheddable request is waiting in the gateway queue
- The eviction trigger selects the most-evictable in-flight request:
- Lowest priority first (background=-3 before sheddable=-2 before batch=-1)
- Among equal priority: newest dispatch time first (minimizes wasted compute)
- The selected request is terminated on its instance (KV freed, resources released)
- The freed capacity allows the waiting higher-priority request to dispatch
Key properties:
- Only sheddable requests (priority < 0) can be evicted — critical and standard are always protected
- Disabled by default (llm-d parity). Enable with
--in-flight-eviction(requires--flow-control) - Evicted requests appear in the
Gateway Evicted (in-flight)anomaly counter - This is a terminal state — evicted requests are not requeued
- Eviction is distinct from instance-level KV preemption (
--preemption-policy), which handles memory pressure and requeues victims
Comparison with Legacy Admission¶
| Aspect | Legacy (AlwaysAdmit, TierShed, etc.) | FlowControlAdmission |
|---|---|---|
| Admission | Separate from queuing | Queue IS admission |
| Queue structure | None (admit/reject then route directly) | Per-priority-band, per-flow |
| Dispatch order | N/A (no queue) | --dispatch-order (fifo/priority/slo-deadline) |
| Capacity | N/A (no queue) | Per-band + global |
Pipeline Latency¶
The --admission-latency and --routing-latency flags model real network and processing overhead between gateway and backend (gRPC hops, service mesh serialization, queue dispatch). These are pipeline concerns that affect both admission and routing stages. See Cluster Simulation for details on configuring pipeline latency.
Further Reading¶
- Cluster Simulation -- full pipeline overview
- Routing Policies -- the next stage after admission
- Cluster Architecture -- architectural details