Session-based Routing
SkyRL's Tinker server forwards a stable routing key with every sample request so vllm-router can pin a logical request — including all of its num_samples expansions and any retries — to a single inference backend. This keeps the prefix cache warm across the turns of one rollout and across retries of one request, while still spreading distinct requests across backends.
Routing key
For each /api/v1/asample call, the server computes:
X-Session-ID = "<sampling_session_id>:<seq_id>"
- sampling_session_id is per-client, assigned by service_client.create_sampling_client() (or training_client.save_weights_and_get_sampling_client()).
- seq_id is per-request within the session. The Tinker SDK auto-bumps it on every SamplingClient.sample(...) call.
If either sampling_session_id or seq_id is absent (e.g. direct base-model sampling without an SDK session), the header is omitted and vllm-router falls back to plain load-balancing.
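The rule above can be sketched as a small helper. This is an illustration only: session_headers is a hypothetical name, not the server's actual function, and it assumes a missing value is represented as None. Note that seq_id 0 is a valid value, so the check is against None rather than truthiness.

```python
from typing import Optional


def session_headers(
    sampling_session_id: Optional[str], seq_id: Optional[int]
) -> dict[str, str]:
    """Build the routing header, or return no header when either part is absent.

    Hypothetical helper for illustration; not the SkyRL implementation.
    """
    # seq_id == 0 is legitimate, so test for None explicitly.
    if sampling_session_id is None or seq_id is None:
        return {}  # header omitted: the router falls back to plain load-balancing
    return {"X-Session-ID": f"{sampling_session_id}:{seq_id}"}


print(session_headers("sampling_33680dc2", 0))
print(session_headers(None, 1))
```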
When you'd care
- Multi-turn rollouts. Reuse the same seq_id across all turns of one trajectory so they land on the same backend and the second turn hits the prefix cache from the first.
- Retries. A retried sample request inherits the original seq_id and therefore the original routing key, so it returns to the same backend.
- num_samples > 1. All N samples of one request already share a routing key (same seq_id on the server), so they co-locate on one backend automatically.
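Why an identical key keeps returning to the same backend can be illustrated with a generic consistent-hash ring. This is a minimal sketch, not vllm-router's implementation: the HashRing class, the SHA-256 hash, the virtual-node count, and the worker URLs are all assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring (illustrative; not vllm-router's code)."""

    def __init__(self, workers, vnodes=64):
        # Place several virtual nodes per worker on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Stable across processes, unlike Python's built-in hash().
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def pick(self, key):
        # First ring point clockwise of the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


workers = ["http://worker-a:8000", "http://worker-b:8000", "http://worker-c:8000"]
ring = HashRing(workers)

# Three turns (or retries) that share one routing key resolve to one worker.
turns = [ring.pick("sampling_demo:0") for _ in range(3)]
print(turns)
```

Distinct keys are spread over the ring, but nothing guarantees that two particular keys land on different workers.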
Driving it explicitly from a client
The standard SamplingClient.sample(...) API auto-bumps seq_id on each call and doesn't expose it to the user. To pin a trajectory's turns to one backend you need caller-controlled seq_id. This means bypassing SamplingClient.sample and dispatching a raw SampleRequest via the low-level client.sampling.asample API.
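The difference between the two modes can be sketched without touching the SDK. The functions below only model how the routing key evolves; they are hypothetical helpers, and the assumption that the auto-bumped counter starts at 0 is made for the example only.

```python
import itertools


def auto_bumped_keys(session_id, num_calls):
    """Model SamplingClient.sample: each call gets a fresh seq_id.

    Assumes, for illustration, that the counter starts at 0.
    """
    counter = itertools.count()
    return [f"{session_id}:{next(counter)}" for _ in range(num_calls)]


def pinned_keys(session_id, trajectory_idx, num_turns):
    """Model caller-controlled seq_id: every turn reuses the trajectory index."""
    return [f"{session_id}:{trajectory_idx}" for _ in range(num_turns)]


# Three turns via the standard API produce three different routing keys...
print(auto_bumped_keys("sampling_demo", 3))
# ...while three turns with a pinned seq_id share one key, hence one backend.
print(pinned_keys("sampling_demo", trajectory_idx=2, num_turns=3))
```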
A working example is at examples/tinker/session_based_routing/. It starts a SkyRL Tinker server, runs N parallel trajectories of T turns each, with every turn reusing seq_id=trajectory_idx, and prints the routing key on each dispatch:
# Terminal 1
bash examples/tinker/session_based_routing/run_tinker_server.sh
# Terminal 2
TINKER_API_KEY=tml-dummy uv run --extra tinker --with torch --with transformers \
python examples/tinker/session_based_routing/sample_session_routing_demo.py \
--num-trajectories 4 --turns 3
You should see the client and server agree 1:1 on the routing key for every dispatch:
client: dispatch routing-key=sampling_<id>:<seq_id>
server: [sticky-routing] dispatch idx=<i> model=<...> session_id=sampling_<id>:<seq_id>
The vllm-router logs can also be inspected to confirm that requests with the same session ID land on the same worker (sampling_33680dc2:1 consistently maps to :8001, sampling_33680dc2:0 to :8300):
# /tmp/skyrl-logs/router-xxx.log
INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:1' mapped to worker: http://10.206.0.28:8001
INFO consistent_hash: Consistent hash routing: key='header:x-session-id:sampling_33680dc2:1' -> worker='http://10.206.0.28:8001' (index=2)
INFO consistent_hash: Found session key in header 'x-session-id': sampling_33680dc2:0
INFO consistent_hash: Extracted hash key: header:x-session-id:sampling_33680dc2:0
INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:0' mapped to worker: http://10.206.0.28:8300
INFO consistent_hash: Consistent hash routing: key='header:x-session-id:sampling_33680dc2:0' -> worker='http://10.206.0.28:8300' (index=3)
INFO consistent_hash: Found session key in header 'x-session-id': sampling_33680dc2:1
INFO consistent_hash: Extracted hash key: header:x-session-id:sampling_33680dc2:1
INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:1' mapped to worker: http://10.206.0.28:8001
See the example's sample_session_routing_demo.py for more details.
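For longer runs, checking the router log by eye gets tedious. The sketch below scans log lines for the "mapped to worker" entries and reports any session key that was sent to more than one worker. It assumes the log format matches the excerpt above exactly; the function names are hypothetical, and the regular expression would need adjusting if the router's log wording differs.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   INFO consistent_hash: Hash key 'header:x-session-id:<key>' mapped to worker: <url>
MAPPED = re.compile(
    r"Hash key 'header:x-session-id:(?P<key>[^']+)' mapped to worker: (?P<worker>\S+)"
)


def workers_by_session(log_lines):
    """Collect the set of workers each session key was mapped to."""
    seen = defaultdict(set)
    for line in log_lines:
        match = MAPPED.search(line)
        if match:
            seen[match.group("key")].add(match.group("worker"))
    return dict(seen)


def unpinned_sessions(log_lines):
    """Return only the session keys that reached more than one worker."""
    return {
        key: workers
        for key, workers in workers_by_session(log_lines).items()
        if len(workers) > 1
    }


sample_log = [
    "INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:1' mapped to worker: http://10.206.0.28:8001",
    "INFO consistent_hash: Found session key in header 'x-session-id': sampling_33680dc2:0",
    "INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:0' mapped to worker: http://10.206.0.28:8300",
    "INFO consistent_hash: Hash key 'header:x-session-id:sampling_33680dc2:1' mapped to worker: http://10.206.0.28:8001",
]
print(workers_by_session(sample_log))
print(unpinned_sessions(sample_log))
```

An empty result from unpinned_sessions means every session key observed in the log stayed on a single worker.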