Scheduling & Priority

Routing decides which instance receives a request. Scheduling decides what order requests are processed within an instance. These are independent policy axes — you can combine any routing policy with any scheduler.

# Priority-FCFS scheduling with SLO-class-based priority in a 4-instance cluster
./blis run --model qwen/qwen3-14b \
  --num-instances 4 --rate 100 --num-requests 500 \
  --scheduler priority-fcfs
# SLO class is set in the workload spec (slo_class: critical/standard/batch/etc.)

How Priority Works

Request priority is static — set once when the request enters the instance, and never changed.

At enqueue time (EnqueueRequest and EnqueueDecodeSubRequest), the simulator converts the request's SLOClass to a vLLM-convention priority value via SLOPriorityMap.InvertForVLLM(SLOClass):

critical   → Priority 0.0  (most urgent)
standard   → Priority 1.0
batch      → Priority 5.0
sheddable  → Priority 6.0
background → Priority 7.0  (least urgent)

This follows vLLM's convention: a lower value means more urgent. The cluster layer (admission, routing, gateway queue) continues to use the llm-d convention (higher = more urgent); the inversion happens at the cluster→instance dispatch boundary.
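The mapping above can be sketched as a plain lookup table. This is a minimal illustration, not the simulator's actual `SLOPriorityMap` code: the names `vllmPriority` and `priorityFor` are hypothetical, and the fallback to the standard priority for unrecognized classes is an assumption (the scheduler table below only states that requests without an SLO class get Priority=1.0).

```go
package main

import "fmt"

// vllmPriority is a hypothetical lookup table holding the instance-level
// priorities listed above (vLLM convention: lower value = more urgent).
var vllmPriority = map[string]float64{
	"critical":   0.0,
	"standard":   1.0,
	"batch":      5.0,
	"sheddable":  6.0,
	"background": 7.0,
}

// priorityFor returns the instance-level priority for an SLO class.
// Falling back to the standard priority for an empty or unknown class
// is an assumption of this sketch.
func priorityFor(sloClass string) float64 {
	if p, ok := vllmPriority[sloClass]; ok {
		return p
	}
	return vllmPriority["standard"]
}

func main() {
	for _, c := range []string{"critical", "batch", "background", ""} {
		fmt.Printf("%-12q -> %.1f\n", c, priorityFor(c))
	}
}
```

Because the value is looked up once at enqueue time, nothing in the scheduler has to recompute it per step.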

SLO class is set in your workload spec:

clients:
  - slo_class: critical
    rate: 10
  - slo_class: standard
    rate: 50

The default priority values can be overridden via the policy bundle:

admission:
  slo_priorities:
    batch: 0   # make batch non-sheddable

Available Schedulers

Each scheduler implements the InstanceScheduler interface: a single OrderQueue method called every step to reorder the wait queue before batch formation.

Scheduler	Flag Value	Strategy	Notes
FCFS	`--scheduler fcfs`	First-Come-First-Served. No reordering — requests are processed in arrival order.	Default. Fair and predictable.
Priority-FCFS	`--scheduler priority-fcfs`	Sort by priority ascending (lower value = more urgent, vLLM convention), then by arrival time ascending within the same priority. Ties broken by request ID for determinism.	Useful when `SLOClass` is set in the workload spec. Without SLO classes, all requests get Priority=1.0 (standard) and this degrades to FCFS by arrival tiebreak.
SJF	`--scheduler sjf`	Shortest Job First. Sort by input token count ascending, then by arrival time, then by ID.	Optimizes TTFT for short requests but can starve long ones under sustained load. Ignores `Request.Priority` entirely.
Reverse-priority	`--scheduler reverse-priority`	Sort by priority descending (highest value = least urgent scheduled first).	Pathological template for testing only — deliberately causes priority inversions.

All schedulers use sort.SliceStable for deterministic ordering (INV-6).
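As a rough illustration of the sort keys in the table, the sketch below orders a small queue with `sort.SliceStable` in the Priority-FCFS and SJF styles. The `Request` struct and its field names are hypothetical stand-ins, not the simulator's types.

```go
package main

import (
	"fmt"
	"sort"
)

// Request is a hypothetical, pared-down request record for this sketch.
type Request struct {
	ID          string
	Priority    float64 // vLLM convention: lower = more urgent
	ArrivalTime int64   // arrival timestamp in arbitrary ticks
	InputTokens int
}

// orderPriorityFCFS sorts by priority, then arrival time, then ID.
func orderPriorityFCFS(q []*Request) {
	sort.SliceStable(q, func(i, j int) bool {
		a, b := q[i], q[j]
		if a.Priority != b.Priority {
			return a.Priority < b.Priority
		}
		if a.ArrivalTime != b.ArrivalTime {
			return a.ArrivalTime < b.ArrivalTime
		}
		return a.ID < b.ID
	})
}

// orderSJF sorts by input token count, then arrival time, then ID.
// Priority is not consulted at all.
func orderSJF(q []*Request) {
	sort.SliceStable(q, func(i, j int) bool {
		a, b := q[i], q[j]
		if a.InputTokens != b.InputTokens {
			return a.InputTokens < b.InputTokens
		}
		if a.ArrivalTime != b.ArrivalTime {
			return a.ArrivalTime < b.ArrivalTime
		}
		return a.ID < b.ID
	})
}

// ids extracts request IDs in queue order for printing.
func ids(q []*Request) []string {
	out := make([]string, len(q))
	for i, r := range q {
		out[i] = r.ID
	}
	return out
}

func main() {
	q := []*Request{
		{ID: "r1", Priority: 7, ArrivalTime: 1, InputTokens: 50},
		{ID: "r2", Priority: 1, ArrivalTime: 2, InputTokens: 900},
		{ID: "r3", Priority: 0, ArrivalTime: 3, InputTokens: 200},
		{ID: "r4", Priority: 1, ArrivalTime: 2, InputTokens: 10},
	}
	orderPriorityFCFS(q)
	fmt.Println(ids(q)) // [r3 r2 r4 r1]
	orderSJF(q)
	fmt.Println(ids(q)) // [r4 r1 r3 r2]
}
```

Note that `r2` and `r4` share both priority and arrival time, so only the ID tiebreak separates them under Priority-FCFS; that final key is what keeps the order reproducible across runs.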

How Scheduling and Priority Interact

The scheduler and priority are composed at each simulation step:

1. Priority is already set: it is static, assigned at enqueue via SLOPriorityMap.InvertForVLLM and not recomputed per step.
2. Queue reordering: call InstanceScheduler.OrderQueue() on the wait queue.
3. Batch formation: dequeue requests from the front of the reordered queue into the running batch.
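The per-step composition can be sketched as follows. This is a simplified model under stated assumptions: the `Scheduler` interface signature, the `step` function, and the `maxBatch` cap are all hypothetical, and real batch formation is presumably bounded by token and KV-cache budgets rather than a simple request count.

```go
package main

import (
	"fmt"
	"sort"
)

// Request is a hypothetical, pared-down request record.
type Request struct {
	ID       string
	Priority float64 // lower = more urgent
	Arrival  int64
}

// Scheduler mirrors the idea of a single-method ordering interface.
// The exact signature is an assumption of this sketch.
type Scheduler interface {
	OrderQueue(q []*Request)
}

// fcfs leaves the queue in arrival order.
type fcfs struct{}

func (fcfs) OrderQueue(q []*Request) {}

// priorityFCFS sorts by priority, then arrival, then ID.
type priorityFCFS struct{}

func (priorityFCFS) OrderQueue(q []*Request) {
	sort.SliceStable(q, func(i, j int) bool {
		if q[i].Priority != q[j].Priority {
			return q[i].Priority < q[j].Priority
		}
		if q[i].Arrival != q[j].Arrival {
			return q[i].Arrival < q[j].Arrival
		}
		return q[i].ID < q[j].ID
	})
}

// step reorders the wait queue, then moves requests from its front into
// the running batch until maxBatch (a simplifying assumption) is reached.
// It returns the updated batch and the remaining wait queue.
func step(s Scheduler, waitQ, running []*Request, maxBatch int) ([]*Request, []*Request) {
	s.OrderQueue(waitQ)
	for len(waitQ) > 0 && len(running) < maxBatch {
		running = append(running, waitQ[0])
		waitQ = waitQ[1:]
	}
	return running, waitQ
}

// newQueue builds the same three-request wait queue for each run.
func newQueue() []*Request {
	return []*Request{
		{ID: "bg", Priority: 7, Arrival: 1},
		{ID: "std", Priority: 1, Arrival: 2},
		{ID: "crit", Priority: 0, Arrival: 3},
	}
}

func main() {
	for name, s := range map[string]Scheduler{"fcfs": fcfs{}, "priority-fcfs": priorityFCFS{}} {
		batch, rest := step(s, newQueue(), nil, 2)
		fmt.Println(name, "batch:", batch[0].ID, batch[1].ID, "waiting:", rest[0].ID)
	}
}
```

With a batch cap of two, FCFS admits `bg` and `std` and leaves `crit` waiting, while Priority-FCFS admits `crit` and `std` and leaves `bg` waiting.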

Common combinations:

Combination	Effective Behavior
`priority-fcfs` + mixed SLO classes	Critical requests scheduled first, background last.
`priority-fcfs` + uniform SLO class	All priorities equal — degrades to FCFS by arrival time.
`sjf` + any SLO class	SJF by input length (priority ignored).
`fcfs` + any SLO class	FCFS by arrival time (priority computed but reordering skipped).

Preemption and Re-enqueueing

BLIS models vLLM's two-queue architecture (WaitQ + RunningBatch). When a request is preempted from the running batch due to KV cache pressure:

1. The request is placed at the front of the WaitQ (not the back).
2. Its progress is reset to zero (recompute mode, matching vLLM's recompute preemption).
3. On the next step, the scheduler reorders the full queue including the preempted request.

This means preempted requests get implicit priority over fresh arrivals in FCFS mode. With priority-fcfs and mixed SLO classes, the preempted request's static priority determines its position relative to other waiting requests.
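A small sketch of the re-enqueue path makes the difference between the two modes concrete. The `Request` fields, `requeuePreempted`, and `orderPriorityFCFS` are hypothetical names chosen for illustration; only the behavior (front insertion, progress reset, reordering on the next step) comes from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// Request is a hypothetical, pared-down request record.
type Request struct {
	ID             string
	Priority       float64 // lower = more urgent
	Arrival        int64
	ComputedTokens int // progress so far; discarded on preemption
}

// requeuePreempted resets the victim's progress (recompute mode) and
// places it at the front of the wait queue.
func requeuePreempted(victim *Request, waitQ []*Request) []*Request {
	victim.ComputedTokens = 0
	return append([]*Request{victim}, waitQ...)
}

// orderPriorityFCFS sorts by priority, then arrival, then ID.
func orderPriorityFCFS(q []*Request) {
	sort.SliceStable(q, func(i, j int) bool {
		if q[i].Priority != q[j].Priority {
			return q[i].Priority < q[j].Priority
		}
		if q[i].Arrival != q[j].Arrival {
			return q[i].Arrival < q[j].Arrival
		}
		return q[i].ID < q[j].ID
	})
}

// ids extracts request IDs in queue order for printing.
func ids(q []*Request) []string {
	out := make([]string, len(q))
	for i, r := range q {
		out[i] = r.ID
	}
	return out
}

func main() {
	victim := &Request{ID: "v", Priority: 7, Arrival: 1, ComputedTokens: 120}
	waitQ := []*Request{
		{ID: "w1", Priority: 0, Arrival: 10},
		{ID: "w2", Priority: 1, Arrival: 11},
	}

	// FCFS performs no reordering, so the victim stays at the front.
	waitQ = requeuePreempted(victim, waitQ)
	fmt.Println(ids(waitQ), victim.ComputedTokens) // [v w1 w2] 0

	// Priority-FCFS reorders on the next step; a background-class victim
	// drops behind more urgent waiting requests.
	orderPriorityFCFS(waitQ)
	fmt.Println(ids(waitQ)) // [w1 w2 v]
}
```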

By default (--preemption-policy fcfs), the tail of the running batch is evicted. When --preemption-policy priority is set, the least-urgent running request is evicted — the one with the highest Priority value (vLLM convention: background=7 is least urgent and evicted first). Among equal-priority requests, the most recently arrived is evicted first. This matches vLLM's --scheduling-policy priority preemption behavior (scheduler.py:1086).
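The two victim-selection rules can be sketched as one function. This is an illustrative reading of the paragraph above, not the simulator's code: `pickVictim` and the `Request` fields are hypothetical, and keeping the earlier batch slot when both priority and arrival time are equal is an assumption.

```go
package main

import "fmt"

// Request is a hypothetical, pared-down request record.
type Request struct {
	ID       string
	Priority float64 // vLLM convention: higher value = less urgent
	Arrival  int64
}

// pickVictim returns the index of the running request to evict, or -1
// if the batch is empty.
//   - policy "priority": the highest Priority value (least urgent) is
//     evicted; among equals, the most recently arrived goes first.
//   - any other policy (the default "fcfs"): the tail of the batch.
func pickVictim(running []*Request, policy string) int {
	if len(running) == 0 {
		return -1
	}
	if policy != "priority" {
		return len(running) - 1
	}
	v := 0
	for i := 1; i < len(running); i++ {
		r, cur := running[i], running[v]
		lessUrgent := r.Priority > cur.Priority
		sameButNewer := r.Priority == cur.Priority && r.Arrival > cur.Arrival
		if lessUrgent || sameButNewer {
			v = i
		}
	}
	return v
}

func main() {
	running := []*Request{
		{ID: "a", Priority: 7, Arrival: 1},
		{ID: "b", Priority: 0, Arrival: 2},
		{ID: "c", Priority: 7, Arrival: 3},
		{ID: "d", Priority: 1, Arrival: 4},
	}
	fmt.Println(running[pickVictim(running, "fcfs")].ID)     // d
	fmt.Println(running[pickVictim(running, "priority")].ID) // c
}
```

In the example, the default policy evicts `d` simply because it sits at the tail, whereas the priority policy evicts `c`: it ties with `a` on Priority=7 but arrived later.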

When to Use Which

Workload	Recommended Configuration	Why
Uniform traffic, no SLO differentiation	`--scheduler fcfs` (default)	No reordering needed. All requests are equivalent.
Mixed SLO classes (critical vs background)	`--scheduler priority-fcfs` with `slo_class` in workload spec	Critical requests get Priority=0, scheduled before background (Priority=7).
Latency-sensitive short requests	`--scheduler sjf`	Short prompts get processed first. Watch for starvation of long requests under sustained load.
Low load (< ~10 req/s)	Any	Batch sizes are small enough that all schedulers pick the same requests, so results agree to within ~5%.

SJF starvation risk

Under sustained high load, SJF can indefinitely delay long-prompt requests as short ones keep arriving. BLIS does not currently implement aging or starvation guards for SJF. If your workload has a mix of short and long prompts at high utilization, prefer --scheduler priority-fcfs with SLO classes instead.