Admission Control

Admission control is the first gate in the cluster pipeline. It decides whether to accept or reject incoming requests before they reach the routing stage. Admission applies only in cluster mode (--num-instances > 1); single-instance simulations skip directly to the wait queue.

# Rate-limit a 4-instance cluster with token bucket admission
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 500 --num-requests 2000 \
  --admission-policy token-bucket \
  --token-bucket-capacity 10000 --token-bucket-refill-rate 1000

Available Policies

Policy	Flag Value	Behavior
Always-admit	`--admission-policy always-admit` (default)	Accepts all requests unconditionally. No filtering.
Token-bucket	`--admission-policy token-bucket`	Rate-limiting. Each request consumes tokens equal to its input token count. Tokens refill at a constant rate. Rejects when the bucket is empty.
Tier-shed	`--admission-policy tier-shed`	SLO-aware shedding. Under overload, rejects requests whose SLO tier priority is below `tier_shed_min_priority`. See SLO Tier Priorities below.
GAIE-legacy	`--admission-policy gaie-legacy`	Saturation-based shedding matching production llm-d/GAIE behavior. Non-sheddable requests always pass; sheddable requests (priority < 0) rejected when pool-average saturation >= 1.0. See GAIE-Legacy Admission below.
Reject-all	`--admission-policy reject-all`	Rejects all requests unconditionally. Pathological template for testing.

Token Bucket Mechanics

The token bucket policy controls throughput by treating each request's input token count as a cost:

Flag	Description	Default
`--token-bucket-capacity`	Maximum number of tokens the bucket can hold	10000
`--token-bucket-refill-rate`	Tokens added per second of simulation time	1000

How it works:

Bucket starts full. At initialization, the bucket holds capacity tokens.
Refill is continuous. On each admission decision, the bucket refills proportionally to elapsed simulation time: refill = elapsed_microseconds * refill_rate / 1,000,000.
Cost per request = number of input tokens. A request with 512 input tokens costs 512 tokens from the bucket.
Admission check. If current_tokens >= cost, the request is admitted and the cost is subtracted. Otherwise the request is rejected with reason "insufficient tokens".
Capacity cap. Tokens never accumulate beyond capacity, even after long idle periods.
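The five rules above can be sketched in a few lines of Python. This is a minimal illustration, not BLIS's implementation: the class name, method name, and return shape are hypothetical, and only the refill formula, the cost rule, and the rejection reason come from the description above.

```python
class TokenBucket:
    """Illustrative token bucket; class and method names are hypothetical."""

    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second of simulation time
        self.tokens = capacity          # the bucket starts full
        self.last_us = 0                # simulation time of the previous decision (microseconds)

    def admit(self, now_us, input_tokens):
        # Refill in proportion to elapsed simulation time, capped at capacity.
        elapsed_us = now_us - self.last_us
        refill = elapsed_us * self.refill_rate / 1_000_000
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last_us = now_us
        # A request costs its input token count.
        if self.tokens >= input_tokens:
            self.tokens -= input_tokens
            return True, ""
        return False, "insufficient tokens"
```

With a capacity of 1000 and a refill rate of 1000 tokens/s, a 512-token request at t=0 is admitted, a second one at the same instant is rejected (488 tokens remain), and a third one 24 ms later is admitted because 24 tokens have been refilled.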

Sizing the bucket

With --token-bucket-capacity 10000 --token-bucket-refill-rate 1000 and requests averaging 512 input tokens, the sustained admission rate is 1000 / 512 ≈ 1.95 req/s. The bucket's capacity of 10000 tokens allows a burst of up to 19 requests (10000 / 512 ≈ 19.5, rounded down) before rate-limiting kicks in.
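The same arithmetic can be reused for other flag values. The helper below is a hypothetical convenience function, not part of BLIS, and it assumes every request costs exactly the mean input token count and that the burst arrives fast enough for refill during the burst to be negligible.

```python
def bucket_sizing(capacity, refill_rate, mean_input_tokens):
    """Return (sustained requests per second, burst size in requests)."""
    # Steady state: tokens arrive at refill_rate and each request spends the mean cost.
    sustained_rps = refill_rate / mean_input_tokens
    # Burst: how many whole requests a full bucket can pay for.
    burst_requests = capacity // mean_input_tokens
    return sustained_rps, burst_requests


sustained, burst = bucket_sizing(capacity=10000, refill_rate=1000, mean_input_tokens=512)
```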

Rejected requests are counted in the output anomaly counters (Rejected Requests) and in the full pipeline conservation formula (num_requests == injected_requests + rejected_requests), but they never enter the routing stage or any instance queue. Every rejection — regardless of admission policy — is also recorded in the per-SLO-class ShedByTier counter, so you can see which request classes are being rejected (e.g., {"batch": 12, "sheddable": 8}). ShedByTier is also exposed in periodic ProgressSnapshot callbacks, allowing live monitoring of per-tier shedding rates during simulation.
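As a rough illustration of how these counters relate, the sketch below tallies a list of admission decisions and checks the conservation formula. The function name and the input shape are assumptions made for the example; they do not mirror BLIS's output format.

```python
from collections import Counter


def tally(decisions):
    """decisions: iterable of (slo_class, admitted) pairs, one per request."""
    shed_by_tier = Counter()
    injected = 0
    for slo_class, admitted in decisions:
        if admitted:
            injected += 1                 # continues to routing
        else:
            shed_by_tier[slo_class] += 1  # rejected at admission, counted per SLO class
    rejected = sum(shed_by_tier.values())
    return injected, rejected, dict(shed_by_tier)
```

For any input, the number of decisions equals injected plus rejected, which is the conservation property stated above.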

When to Use Admission Control

Overload protection. When the arrival rate significantly exceeds service capacity, unbounded queues grow without limit. Admission shedding keeps queue depth manageable.
Cost control. Limit total token throughput to match a token budget or downstream rate limit.
Graceful degradation. Shed excess load to protect latency for admitted requests. Under extreme overload, routing distributes load and scheduling orders within instances, but neither can reduce total queue depth — admission is the lever that can. The tier-shed policy provides SLO-aware shedding — rejecting lower-priority classes (batch, sheddable, background) before higher-priority ones (critical, standard).
Testing rejection paths. The reject-all policy verifies that rejection counting, trace recording, and conservation invariants hold when no requests are admitted.

Admission is the third lever

Routing distributes load across instances. Scheduling orders requests within each instance. But when total arrival rate exceeds total service capacity, neither routing nor scheduling can reduce the queue; they can only redistribute it. Admission control is the mechanism that actually reduces inbound volume.

SLO Tier Priorities

Every request carries an SLOClass label that determines its priority throughout the admission pipeline. Priorities follow the GAIE (Gateway API Inference Extension) convention where negative priority means sheddable.

Default Priorities

SLO Class	Priority	Sheddable?
`critical`	4	No
`standard`	3	No (also the default for empty/unknown classes)
`batch`	-1	Yes
`sheddable`	-2	Yes
`background`	-3	Yes

The sheddable/non-sheddable boundary is priority < 0. This matches the IsSheddable contract from llm-d's gateway implementation.

Customizing Priorities

Override specific priorities via the slo_priorities field in a policy bundle YAML:

admission:
  policy: "tier-shed"
  tier_shed_min_priority: 3
  slo_priorities:
    batch: 0       # promote batch to non-sheddable
    critical: 10   # widen the gap between critical and standard

Unspecified classes retain their GAIE defaults. The slo_priorities map merges on top of defaults — you only need to specify the classes you want to change.
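The merge semantics can be sketched as follows. The default values come from the table above; the function names are hypothetical, and the choice to resolve empty or unknown classes through the (possibly overridden) standard entry is an assumption of this sketch.

```python
DEFAULT_PRIORITIES = {
    "critical": 4,
    "standard": 3,
    "batch": -1,
    "sheddable": -2,
    "background": -3,
}


def resolve_priorities(overrides=None):
    # Overrides are layered on top of the defaults; unspecified classes keep their default.
    merged = dict(DEFAULT_PRIORITIES)
    merged.update(overrides or {})
    return merged


def priority(slo_class, priorities):
    # Assumption: empty or unknown classes are treated like "standard".
    return priorities.get(slo_class, priorities["standard"])


def is_sheddable(slo_class, priorities):
    # The sheddable boundary is priority < 0.
    return priority(slo_class, priorities) < 0
```

Applying the overrides from the YAML above (batch: 0, critical: 10) makes batch non-sheddable while sheddable and background keep their negative defaults.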

Where Priorities Are Used

Priorities affect three components:

Tier-shed admission (sim/admission.go): Under overload, rejects requests with Priority(class) < MinAdmitPriority (set by tier_shed_min_priority). With the default tier_shed_min_priority: 3, this admits critical (4) and standard (3), while rejecting batch (-1), sheddable (-2), and background (-3).
Tenant budget enforcement (sim/cluster/cluster_event.go): When a tenant exceeds their capacity budget, only sheddable requests (IsSheddable = priority < 0) are shed. Critical and standard traffic is always protected regardless of budget.
Gateway queue dispatch (sim/cluster/gateway_queue.go): In priority dispatch mode, higher-priority requests are dequeued first. When the queue is at capacity, the request is rejected by default (llm-d parity). With --queue-shedding, the lowest-priority sheddable request is evicted instead (BLIS-extra experimental feature, not in llm-d).

Tier-Shed Admission

The tier-shed policy sheds lower-priority SLO tiers under cluster overload. It activates when the maximum per-instance in-flight load (QueueDepth + BatchSize) exceeds tier_shed_threshold:

./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 500 --num-requests 2000 \
  --admission-policy tier-shed \
  --policy-config policies.yaml

Where policies.yaml contains:

admission:
  policy: "tier-shed"
  tier_shed_threshold: 0        # 0 = shed at any load level
  tier_shed_min_priority: 3     # admit standard(3)+critical(4), shed the rest

Under overload, any request with Priority(class) < tier_shed_min_priority is rejected. Under normal load (below threshold), all requests are admitted regardless of priority.

Choosing tier_shed_min_priority

3 (default): Admits critical and standard. Sheds batch, sheddable, background.
0: Admits all non-sheddable classes (priority >= 0). Sheds only negative-priority classes.
-3: Admits everything (effectively disables tier-shed). Useful when you want tenant budget enforcement but not tier-level shedding.
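The decision rule can be summarized in a short sketch. The function name and input shape are hypothetical; the load signal (maximum per-instance QueueDepth + BatchSize) and the strict "exceeds" comparison follow the description in this section, and the threshold is left as a required argument because its default is not stated here.

```python
def tier_shed_admit(priority, instances, threshold, min_priority=3):
    """instances: list of (queue_depth, batch_size) pairs, one per instance."""
    # Effective load is taken from the busiest instance.
    max_load = max((qd + bs for qd, bs in instances), default=0)
    if max_load <= threshold:
        return True  # normal load: every class is admitted
    # Overload: only classes at or above the minimum priority pass.
    return priority >= min_priority
```

With instances at loads 5 and 1 and a threshold of 4, a batch request (-1) is rejected and a standard request (3) is admitted; raising the threshold to 10 admits both.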

GAIE-Legacy Admission

The gaie-legacy policy replicates the saturation-based admission behavior from production llm-d's Gateway API Inference Extension (GAIE). It uses a two-tier decision tree:

Non-sheddable requests (priority >= 0: critical, standard) are always admitted, regardless of cluster saturation.
Sheddable requests (priority < 0: batch, sheddable, background) are rejected when pool-average saturation >= 1.0.

Saturation Formula

The saturation formula averages per-instance utilization ratios across the cluster, taking the most constrained resource (compute queue or memory) for each instance:

saturation = avg across instances of max(queueDepth / qdThreshold, kvUtilization / kvThreshold)

This matches the production GAIE implementation in gateway-api-inference-extension/pkg/epp/framework/plugins/flowcontrol/saturationdetector/utilization/detector.go:115-137.

When saturation >= 1.0, the cluster is considered overloaded and sheddable traffic is rejected. When saturation < 1.0, all requests pass.
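A minimal sketch of the formula and the two-tier decision follows. Function names and the input shape are hypothetical and this is not a port of the GAIE detector; the thresholds default to the documented values, and an empty instance list is treated as saturated, as described under Edge Cases.

```python
def pool_saturation(instances, qd_threshold=5.0, kv_threshold=0.8):
    """instances: list of (queue_depth, kv_utilization) pairs, one per instance."""
    if not instances:
        return 1.0  # no snapshots: treated as fully saturated
    # Per instance, take the most constrained resource, then average across the pool.
    scores = [max(qd / qd_threshold, kv / kv_threshold) for qd, kv in instances]
    return sum(scores) / len(scores)


def gaie_legacy_admit(priority, saturation):
    if priority >= 0:
        return True          # non-sheddable requests always pass
    return saturation < 1.0  # sheddable requests pass only below saturation
```

For example, two instances with (queue depth 5, KV 0.4) and (queue depth 0, KV 0.4) score 1.0 and 0.5, giving a pool saturation of 0.75, so sheddable traffic is still admitted even though one instance is at its queue threshold.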

Configuration

Configure via --policy-config YAML (no CLI flags for thresholds, consistent with tier-shed):

admission:
  policy: "gaie-legacy"
  gaie_qd_threshold: 5     # queue depth threshold per instance (default: 5)
  gaie_kv_threshold: 0.8   # KV cache utilization threshold, in (0, 1.0] (default: 0.8)

YAML Field	Type	Default	Description
`gaie_qd_threshold`	float	5	Per-instance queue depth at which the QD component reaches 1.0. Must be > 0.
`gaie_kv_threshold`	float	0.8	Per-instance KV cache utilization at which the KV component reaches 1.0. Must be in (0, 1.0].

Default Justification

Both defaults come directly from the GAIE production source code:

gaie_qd_threshold = 5: From DefaultQueueDepthThreshold in saturationdetector/utilization/config.go:31. Represents the "ideal" queue capacity for a single endpoint — at 5 queued requests per instance, the compute resource is considered at capacity.
gaie_kv_threshold = 0.8: From DefaultKVCacheUtilThreshold in saturationdetector/utilization/config.go:33. At 80% KV cache utilization, the memory resource is considered at capacity, leaving 20% headroom for continuous batching dynamics.

Edge Cases

Empty cluster

When there are no instance snapshots (e.g., all instances are still loading), saturation defaults to 1.0 — a conservative choice matching GAIE's behavior where stale or missing metrics are treated as fully saturated (detector.go:116-118). Non-sheddable requests still pass; sheddable requests are rejected.

Stale metrics

GAIE production treats stale per-pod metrics (older than MetricsStalenessThreshold, default 200ms) as score=1.0. BLIS does not model per-snapshot staleness — signal freshness is controlled globally via --snapshot-refresh-interval (INV-7). This is a deliberate simplification: BLIS controls the simulator clock, so signal freshness is deterministic.

Choosing thresholds

The default thresholds (QD=5, KV=0.8) match production llm-d. Lower gaie_qd_threshold makes the policy more aggressive about shedding under queue buildup. Lower gaie_kv_threshold makes it more sensitive to KV cache pressure. Both thresholds follow the same validation as GAIE: QD > 0 (strictly positive), KV in (0, 1.0]. Extreme values (e.g., gaie_qd_threshold: 0.001) are accepted as long as they pass validation — BLIS does not clamp or warn. Use the GAIE defaults unless you have a specific reason to change them.
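The validation rules amount to two range checks. The sketch below is an illustration with a hypothetical function name and error messages; it only encodes the stated constraints (QD > 0, KV in (0, 1.0]) and, like BLIS, does not clamp or warn on extreme but valid values.

```python
def validate_gaie_thresholds(qd_threshold, kv_threshold):
    # Queue depth threshold must be strictly positive.
    if not qd_threshold > 0:
        raise ValueError("gaie_qd_threshold must be > 0")
    # KV utilization threshold must lie in the half-open interval (0, 1.0].
    if not 0 < kv_threshold <= 1.0:
        raise ValueError("gaie_kv_threshold must be in (0, 1.0]")
    return qd_threshold, kv_threshold
```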

Comparison with Tier-Shed

Aspect	`tier-shed`	`gaie-legacy`
Signal	Max per-instance effective load (QueueDepth + BatchSize)	Pool-average saturation (QD and KV ratios)
Granularity	Configurable priority threshold (`tier_shed_min_priority`)	Binary: sheddable (priority < 0) vs non-sheddable
Activation	When any instance exceeds `tier_shed_threshold`	When pool-average saturation >= 1.0
Production parity	BLIS-specific	Matches llm-d/GAIE

Flow Control Admission

When --flow-control is enabled, the FlowControlAdmission policy replaces the configured admission policy. In this mode, admission and queuing are a single step: the queue is the admission decision, matching llm-d's FlowControlAdmissionController.

llm-d Parity Summary

BLIS's default flow control behavior matches llm-d's ShardProcessor. BLIS-extra features are off by default and require explicit flags.

Behavior	llm-d	BLIS default	BLIS opt-in flag
Queue full → reject	Yes	Yes	—
Queue full → shed victim	No	No	`--queue-shedding`
Dispatch on enqueue	Yes	Yes	—
Dispatch via periodic tick (1ms)	Yes	Yes	`--dispatch-tick-interval` to tune
Dispatch on completion	No	No	—
Dispatch on eviction/TTL	No	No	—
Band ceilings: constant	Yes	Yes	—
Band ceilings: interpolated	Interface only, no impl	No	`--usage-limit-threshold < 1.0`
Fairness: global-strict	Yes (plugin)	Yes	—
Fairness: round-robin	Yes (plugin)	No	`--fairness-policy round-robin`

Rows marked Yes in both the llm-d and BLIS default columns are llm-d parity. BLIS-extra features are documented in code comments and only activate when explicitly enabled.

How It Works

Incoming request is enqueued into a per-priority-band, per-flow queue
Each unique (TenantID, Priority) pair gets its own FIFO queue within a priority band
Dispatch order follows --dispatch-order: with priority, iterates bands highest-priority first; with fifo (default), picks the globally-earliest arrival across all bands; with slo-deadline, dispatches the request with the earliest SLO deadline within each flow (see SLO-Deadline Dispatch Ordering below)
Within a band, --fairness-policy controls flow selection: global-strict (default) picks the earliest arrival (lowest sequence ID); round-robin cycles through tenants in sorted key order
Saturation gating: dispatch only when cluster saturation < 1.0
Dispatch triggers (llm-d parity): on-enqueue dispatch + periodic dispatch tick (default 1ms, configurable via --dispatch-tick-interval). The tick is demand-driven — only active while the queue is non-empty.
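The enqueue and dispatch steps above can be sketched as a small queue model. This is a simplified illustration with hypothetical names: it covers only the fifo and priority dispatch orders, uses global-strict selection within a band, and leaves out capacity limits and the dispatch tick.

```python
from collections import defaultdict, deque


class GatewayQueue:
    """Illustrative flow-control queue; not BLIS's data structure."""

    def __init__(self):
        # One FIFO per (tenant_id, priority) flow; entries are (seq_id, request).
        self.flows = defaultdict(deque)
        self.next_seq = 0

    def enqueue(self, tenant_id, priority, request):
        self.flows[(tenant_id, priority)].append((self.next_seq, request))
        self.next_seq += 1

    def dispatch(self, saturation, order="fifo"):
        # Saturation gating: nothing is dispatched while the cluster is saturated.
        if saturation >= 1.0:
            return None
        heads = [(key, q[0][0]) for key, q in self.flows.items() if q]
        if not heads:
            return None
        if order == "priority":
            # Restrict to the highest non-empty priority band.
            top = max(key[1] for key, _ in heads)
            heads = [h for h in heads if h[0][1] == top]
        # Earliest arrival (lowest sequence ID) among the candidate flow heads.
        key, _ = min(heads, key=lambda h: h[1])
        return self.flows[key].popleft()[1]
```

With fifo order the oldest request is dispatched regardless of band; with priority order the highest band drains first, and flows inside that band are served in arrival order.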

Fairness Policy

Flag	Description	Default
`--fairness-policy`	Intra-band dispatch fairness: `global-strict`, `round-robin`	`global-strict`

global-strict: Picks the flow head with the earliest sequence ID (arrival order). Matches GAIE's GlobalStrict.
round-robin: Cycles through tenants in sorted key order, skipping empty flows. Maintains a per-band cursor.
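The two policies differ only in how the next flow is chosen within a band. The sketch below is illustrative: the function names are hypothetical, and representing the per-band cursor as "the tenant served last" is an assumption about how the cursor might be kept.

```python
def global_strict_pick(flows):
    """flows: dict tenant_id -> list of sequence IDs in arrival order (one band)."""
    heads = {tenant: q[0] for tenant, q in flows.items() if q}
    # The flow whose head has the lowest sequence ID wins.
    return min(heads, key=heads.get) if heads else None


def round_robin_pick(flows, cursor):
    """cursor: tenant served last in this band, or None before the first dispatch."""
    tenants = sorted(t for t, q in flows.items() if q)  # skip empty flows
    if not tenants:
        return None
    for tenant in tenants:
        if cursor is None or tenant > cursor:
            return tenant
    return tenants[0]  # wrap around to the first non-empty flow
```

Given flows b (head sequence 3), a (head sequence 5), and an empty c, global-strict picks b, while round-robin serves a, then b, then wraps back to a.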

Per-Band Capacity

Flag	Description	Default
`--per-band-capacity`	Max requests per priority band (0=unlimited)	0

When a band reaches its capacity limit, incoming requests for that band are rejected (all entries in a band share the same priority, so displacement is never possible). The global --max-gateway-queue-depth limit applies across all bands. When the global queue is at capacity, requests are rejected by default (llm-d parity). With --queue-shedding, cross-band shedding of sheddable entries is enabled (BLIS-extra experimental feature).

Queue Shedding and Dispatch Tick

Flag	Description	Default
`--queue-shedding`	Enable cross-band victim shedding when queue is full (BLIS-extra, not in llm-d)	`false`
`--dispatch-tick-interval`	Microseconds between periodic dispatch ticks	`1000` (1ms, llm-d parity)

Queue shedding (--queue-shedding): When enabled, a full queue searches all bands for the lowest-priority sheddable request and evicts it to make room. This feature is not present in llm-d and is provided as an experimental option. Without this flag, full queues simply reject incoming requests.
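The full-queue behavior described above can be sketched as a single decision function. Names and the return shape are hypothetical. Two points are assumptions of the sketch rather than documented behavior: when shedding is enabled but no sheddable entry exists, the incoming request is rejected, and ties between equally low-priority entries resolve to whichever appears first in the list.

```python
def on_queue_full(queued, queue_shedding=False):
    """queued: list of (request_id, priority) pairs across all bands."""
    if not queue_shedding:
        return "reject", None  # default (llm-d parity): reject the incoming request
    # Only sheddable entries (priority < 0) are eviction candidates.
    sheddable = [entry for entry in queued if entry[1] < 0]
    if not sheddable:
        return "reject", None  # assumption: nothing evictable means reject
    victim = min(sheddable, key=lambda entry: entry[1])  # lowest priority
    return "evict", victim
```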

Dispatch tick (--dispatch-tick-interval): The periodic DES event that triggers dispatch attempts from the gateway queue. Matches llm-d's 1ms dispatchTicker. The tick is demand-driven — only active while the queue is non-empty, and only when flow control is enabled. Default is 1000µs (1ms); omitting the flag or setting 0 in YAML both use this default.

Example

./blis run --model qwen/qwen3-14b --flow-control --saturation-detector utilization \
  --queue-depth-threshold 5 --kv-cache-util-threshold 0.8 \
  --per-band-capacity 100 --max-gateway-queue-depth 500

SLO-Deadline Dispatch Ordering

With --dispatch-order slo-deadline, the gateway queue dispatches the request with the earliest SLO deadline first within each flow. This matches GAIE's slo-deadline-ordering-policy.

How it works:

Fairness policy picks the flow (same as priority mode)
Within the picked flow, dequeue the request with the earliest SLO deadline
Deadline = GatewayEnqueueTime + SLOTargetUs
Fallback hierarchy: per-request SLOTargetUs → per-tier --slo-targets → far-future (FCFS)
Equal deadlines use arrival order (seqID) as tiebreaker
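The deadline computation and its fallback chain can be sketched as follows. Function names and the tuple layout are hypothetical; the far-future fallback is modeled as infinity so that requests without any target sort after every request with a deadline and are ordered among themselves by arrival.

```python
import math


def slo_deadline(enqueue_us, request_target_us, slo_class, tier_targets):
    # 1. A per-request target takes precedence.
    if request_target_us > 0:
        return enqueue_us + request_target_us
    # 2. Otherwise fall back to the per-tier target, if one is configured.
    if slo_class in tier_targets:
        return enqueue_us + tier_targets[slo_class]
    # 3. No target at all: far-future deadline, which degenerates to FCFS.
    return math.inf


def pick_next(flow, tier_targets):
    """flow: list of (seq_id, enqueue_us, request_target_us, slo_class) tuples."""
    # Earliest deadline first; equal deadlines are broken by arrival order (seq_id).
    return min(flow, key=lambda r: (slo_deadline(r[1], r[2], r[3], tier_targets), r[0]))
```

With tier targets critical=100000 and standard=500000, a standard request enqueued at 10 ms with a 200 ms per-request target (deadline 210 ms) is dispatched before a standard request enqueued at 0 with no per-request target (deadline 500 ms).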

Setting SLO targets:

Per-request (workload spec): slo_target_us: 200000 on a cohort (200ms TTFT target)
Per-tier (CLI flag): --slo-targets critical=100000,standard=500000 (fallback when workload spec doesn't set per-request targets)
Per-tier (policy bundle): admission.slo_targets in YAML

Flag	Description	Default
`--dispatch-order slo-deadline`	Enable SLO-deadline ordering	`fifo`
`--slo-targets`	Per-SLO-class TTFT targets in µs (e.g., `critical=100000,standard=500000`)	none

Example:

./blis run --model qwen/qwen3-14b --flow-control --saturation-detector utilization \
  --queue-depth-threshold 5 --kv-cache-util-threshold 0.8 \
  --dispatch-order slo-deadline --slo-targets "critical=100000,standard=500000"

Observe integration: When slo_target_us > 0, blis observe injects the x-slo-ttft-ms HTTP header on outgoing requests, matching GAIE's header convention.

In-Flight Eviction

When flow control is enabled and the system is saturated, BLIS can evict sheddable requests that are already running on instances to free capacity for higher-priority waiting requests. This is a BLIS-extra feature, not present in llm-d (llm-d's RequestEvictor/EvictN exists in code but has zero non-test callers in the production dispatch path).