Inference Architecture

This page describes the architecture of SkyRL's HTTP-based inference implementation: how the inference engine separates its data and control planes, and how weight sync works during training.

System Architecture

The system consists of three components:

RemoteInferenceClient (inference_servers/remote_inference_client.py) — a pickle-safe @dataclass HTTP client. It is the single entry point trainers use to talk to inference.
VLLMRouter (inference_servers/vllm_router.py) — a session-aware load balancer used for data-plane traffic. It is a process wrapper around vllm_router.Router.
vLLM API servers — off-the-shelf vLLM servers, one per replica, started and managed via ServerGroup and VLLMServerActor Ray actors.

                  ┌─────────────────────────┐
                  │        Trainer          │
                  └────────────┬────────────┘
                               │
                  ┌────────────▼────────────┐
                  │  RemoteInferenceClient  │
                  └─────┬─────────────┬─────┘
        data plane      │             │      control plane
        (routed)        │             │      (fan-out)
                  ┌─────▼─────┐  ┌────▼─────────────────┐
                  │ VLLMRouter│  │  vLLM API servers    │
                  └─────┬─────┘  │  (one per replica)   │
                        │        └──────────────────────┘
                        └──────────►  (same servers)

Data Plane vs. Control Plane

RemoteInferenceClient holds two URL types:

proxy_url — a single endpoint that points at VLLMRouter (or any external data-plane router). Used for routed, load-balanced traffic: /v1/completions, /v1/chat/completions, /v1/chat/completions/render, the SkyRL generate endpoint, tokenize, and detokenize.
server_urls — the list of backend vLLM server URLs. Used for fan-out: /pause, /resume, /sleep, /wake_up, and all weight-sync endpoints.

The split exists because generation scales out via routing (the router picks a backend per request and can be session-sticky), while control operations like pause and weight update must hit every replica. This also lets SkyRL plug in an external router that only understands the data plane.
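As a rough illustration of the split, the sketch below resolves a path to either the single proxy URL or every backend URL. `TwoPlaneClient`, its `targets` method, and the host names are hypothetical and exist only for this example; they are not part of `RemoteInferenceClient`. Only endpoints whose paths are spelled out above are included.

```python
from dataclasses import dataclass, field

# Paths taken from the two lists above. The SkyRL generate, tokenize, and
# detokenize endpoints are left out because their exact paths are not given here.
ROUTED_PATHS = {"/v1/completions", "/v1/chat/completions", "/v1/chat/completions/render"}
FAN_OUT_PATHS = {"/pause", "/resume", "/sleep", "/wake_up"}


@dataclass
class TwoPlaneClient:  # hypothetical stand-in, not RemoteInferenceClient
    proxy_url: str
    server_urls: list[str] = field(default_factory=list)

    def targets(self, path: str) -> list[str]:
        """Return every URL that a request for `path` should be sent to."""
        if path in ROUTED_PATHS:
            # Data plane: one request, the router picks the backend.
            return [self.proxy_url + path]
        if path in FAN_OUT_PATHS:
            # Control plane: the same request goes to every replica.
            return [url + path for url in self.server_urls]
        raise ValueError(f"unknown endpoint: {path}")


client = TwoPlaneClient(
    proxy_url="http://router:8000",  # placeholder addresses
    server_urls=["http://vllm-0:8001", "http://vllm-1:8001"],
)
print(client.targets("/v1/completions"))
print(client.targets("/pause"))
```

An external router can replace the proxy in this picture without any change to the fan-out list, which is the property the paragraph above relies on.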

Generation

client.generate(input_batch, model=...) sends each prompt as a separate POST to proxy_url. Requests are issued concurrently (capped by SKYRL_GENERATE_CONCURRENCY_PER_ENGINE × num_engines) and the router distributes them across backends; an optional X-Session-ID header drives session-aware routing. chat_completion, completion, and sample are thin wrappers over the same routed proxy.
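The concurrency cap can be pictured with a semaphore. This is a minimal sketch of the pattern, not SkyRL's code: `generate_all` and `fake_send` are hypothetical names, `fake_send` stands in for the POST to proxy_url, and the per-engine limit of 2 is an arbitrary value chosen for the example.

```python
import asyncio


async def generate_all(prompts, send_one, max_concurrency):
    """Send one request per prompt with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        async with semaphore:
            return await send_one(prompt)

    # gather preserves input order, so results[i] belongs to prompts[i].
    return await asyncio.gather(*(bounded(p) for p in prompts))


# Stand-in for the HTTP call; it records the peak number of concurrent requests.
in_flight = 0
peak = 0


async def fake_send(prompt):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return prompt.upper()


per_engine, num_engines = 2, 2  # cap = per-engine limit x number of engines
results = asyncio.run(
    generate_all([f"p{i}" for i in range(10)], fake_send, per_engine * num_engines)
)
print(results, "peak concurrency:", peak)
```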

Routing policies

Session-aware routing

VLLMRouter is started with vLLM's consistent_hash routing policy (see inference_servers/utils.py). When a request carries an X-Session-ID HTTP header, the router hashes on that header so all requests with the same session ID are pinned to the same backend replica. Without the header, the router still applies consistent hashing, but there is no trajectory-level pinning: the turns of a single rollout aren't guaranteed to land on the same backend.

Pinning a logical request to one backend keeps that backend's prefix cache warm across the turns of a multi-turn rollout (and across retries of one request), which is a large throughput win for agentic / multi-turn generation. Distinct sessions still spread across replicas.
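To make the pinning behavior concrete, here is a toy consistent-hash ring keyed on the session ID. It illustrates the general technique only; the router's actual hashing scheme is not described on this page, and `HashRing`, the virtual-node count, and the backend URLs are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, backends, vnodes=64):
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, session_id: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(session_id)) % len(self._ring)
        return self._ring[idx][1]


backends = ["http://vllm-0:8001", "http://vllm-1:8001", "http://vllm-2:8001"]
ring = HashRing(backends)

# Every turn of the same trajectory resolves to the same backend.
turns = [ring.pick("trajectory-42") for _ in range(5)]
print(turns[0], len(set(turns)) == 1)
```

Because the mapping depends only on the session ID and the set of backends, removing a backend that does not own a session leaves that session's assignment unchanged, so its prefix cache stays useful.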

RemoteInferenceClient sets this header automatically: every routed call (generate, chat_completion, completion, sample, and the render paths) accepts a session_id, and when one is present the client attaches X-Session-ID: <session_id> to the outgoing request (inference_servers/remote_inference_client.py). SkyRL's built-in generators pass the trajectory ID as the session ID so all turns of a trajectory co-locate.

Session-aware routing with load balancing

SkyRL also supports a second routing policy, sticky_least_loaded, which balances load across vLLM replicas more evenly. It works as follows:

a. For a new trajectory (its first turn), the policy picks the server with the fewest active sessions.
b. For later turns of the same trajectory, the policy routes to that same server (sticky routing).

This is the recommended routing policy for multi-turn / agentic RL. For detailed benchmarking, please refer to the PR here.
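The two rules of sticky_least_loaded can be sketched in a few lines. This is an assumption-laden illustration, not the policy's real implementation: the class name, the `end_session` hook, and the tie-breaking rule (first backend in insertion order) are all invented for the example, and how the real policy detects that a session has ended is not covered on this page.

```python
class StickyLeastLoaded:
    """Toy version of the two rules: least-loaded on first turn, sticky afterwards."""

    def __init__(self, backends):
        self._active = {backend: 0 for backend in backends}  # active sessions per backend
        self._assigned = {}  # session_id -> backend

    def route(self, session_id):
        backend = self._assigned.get(session_id)
        if backend is None:
            # First turn: choose the backend with the fewest active sessions.
            backend = min(self._active, key=self._active.get)
            self._assigned[session_id] = backend
            self._active[backend] += 1
        # Later turns: reuse the stored assignment.
        return backend

    def end_session(self, session_id):
        """Hypothetical hook that releases a finished session's slot."""
        backend = self._assigned.pop(session_id, None)
        if backend is not None:
            self._active[backend] -= 1


policy = StickyLeastLoaded(["vllm-0", "vllm-1"])
first = policy.route("traj-a")   # both idle, first backend wins the tie
second = policy.route("traj-b")  # goes to the other, less loaded backend
again = policy.route("traj-a")   # sticky: same backend as the first turn
print(first, second, again)
```

Compared with pure hashing, the first-turn choice reacts to current load, which matters when trajectories differ widely in length.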

Passing it from a custom agent

If you write a custom agent / generator that talks to the data-plane proxy directly (instead of going through RemoteInferenceClient), set the header yourself on each request to get sticky routing. Use a stable ID per logical rollout - e.g. the trajectory or conversation ID - and reuse it across every turn and retry of that rollout:

# OpenAI-compatible client talking to the VLLMRouter proxy
resp = await client.chat.completions.create(
    model=model,
    messages=messages,
    extra_headers={"X-Session-ID": trajectory_id},  # stable across all turns of this rollout
)

If you don't set the header, you lose trajectory-level pinning: turns of the same rollout aren't guaranteed to share a backend, so the cross-turn prefix-cache benefit is lost. Correctness is unaffected, but throughput on multi-turn workloads drops. See the Harbor integration for a concrete example of wiring this through an agent's LLM client.

Pause / Resume

We use the native /pause and /resume APIs in vLLM:

POST /pause?mode={abort|wait|keep}&clear_cache=...
POST /resume

There are three PauseMode values:

Mode	In-flight requests	Resume behavior
`abort`	Aborted immediately; client gets partial completion with `finish_reason="abort"`	Caller must retry with accumulated context
`wait`	Scheduler waits for in-flight to finish; new requests queue	New requests resume after `/resume`
`keep`	Scheduler freezes in-flight requests in place; KV cache is preserved; new requests queue	Frozen requests pick up exactly where they left off after `/resume`
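The retry requirement of `abort` mode can be sketched as a loop that feeds already-generated tokens back in as context. The names here are hypothetical: `generate_with_retry` is not a SkyRL function, and `send` stands in for whatever call returns the new token IDs together with a finish reason.

```python
def generate_with_retry(prompt_ids, send, max_new_tokens, max_attempts=8):
    """Retry aborted requests, resending the prompt plus tokens generated so far."""
    generated = []
    for _ in range(max_attempts):
        remaining = max_new_tokens - len(generated)
        if remaining <= 0:
            return generated, "length"
        # `send` is a hypothetical callable returning (new_token_ids, finish_reason).
        new_ids, finish_reason = send(prompt_ids + generated, remaining)
        generated.extend(new_ids)
        if finish_reason != "abort":
            return generated, finish_reason
    raise RuntimeError("request was aborted on every attempt")


# Fake backend: the first call is aborted after two tokens, the second finishes.
calls = []
replies = iter([([1, 2], "abort"), ([3], "stop")])


def fake_send(context_ids, max_tokens):
    calls.append((list(context_ids), max_tokens))
    return next(replies)


tokens, reason = generate_with_retry([10, 11], fake_send, max_new_tokens=16)
print(tokens, reason)
```

With `keep` mode this loop is unnecessary, since frozen requests continue on the server after `/resume`.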

During weight sync in non-colocated mode, SkyRL calls /pause?mode=keep, runs the weight update, then calls /resume. Rollouts that were in flight are preserved in the scheduler and continue from the same token position with the new policy.

Weight Sync APIs

SkyRL uses the native weight syncing APIs in vLLM, with the following four-stage protocol:

POST /init_weight_transfer_engine — Establishes the communication channel between the trainer and inference workers.
POST /start_weight_update — Starts a weight update.
POST /update_weights — Updates all or a subset of the weights. SkyRL uses chunked weight transfer for efficiency.
POST /finish_weight_update — Finishes the current weight update.
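The per-update part of the protocol (everything after the one-time engine initialization) can be pictured as a plan of calls. The sketch below groups tensors into chunks under a byte budget; the greedy grouping rule, the function names, and the sizes are assumptions for illustration, and SkyRL's actual chunking logic may differ.

```python
def chunk_by_bytes(tensor_sizes, max_chunk_bytes):
    """Group (name, nbytes) pairs, in order, into chunks of at most max_chunk_bytes.

    A tensor larger than the budget ends up in a chunk of its own.
    """
    chunks, current, current_bytes = [], [], 0
    for name, nbytes in tensor_sizes:
        if current and current_bytes + nbytes > max_chunk_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        chunks.append(current)
    return chunks


def plan_update(tensor_sizes, max_chunk_bytes):
    """Return the ordered endpoint calls for one weight update."""
    calls = [("/start_weight_update", None)]
    calls += [
        ("/update_weights", chunk)
        for chunk in chunk_by_bytes(tensor_sizes, max_chunk_bytes)
    ]
    calls.append(("/finish_weight_update", None))
    return calls


# Made-up tensor names and sizes, in bytes.
sizes = [("embed", 600), ("layer0.attn", 300), ("layer0.mlp", 500), ("lm_head", 600)]
plan = plan_update(sizes, max_chunk_bytes=1000)
for endpoint, chunk in plan:
    print(endpoint, chunk)
```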

For colocated training, SkyRL currently uses chunked transfers over CUDA IPC handles, implemented through a custom Worker extension. This is a transitional implementation pending vllm-project/vllm#39212; we plan to migrate to the native APIs in the next vLLM release.

SkyRL implements two transfer strategies in skyrl/backends/skyrl_train/weight_sync/:

BroadcastTransferStrategy (broadcast_strategy.py) — used for non-colocated training. Tensor data is broadcast over NCCL from trainer rank 0 to all inference workers, concurrently with /update_weights HTTP calls that ship the metadata. This is used in combination with /pause?mode=keep and /resume so that in-flight rollouts are paused correctly during the sync.
CudaIpcTransferStrategy (cuda_ipc_strategy.py) — used for colocated training where the trainer and inference engines share GPUs. Weights are exchanged via CUDA IPC handles. Combined with /sleep and /wake_up for memory management — the inference engine sleeps to free VRAM during training, then wakes for rollouts.

Endpoint summary:

Endpoint	Plane	Purpose
`/init_weight_transfer_engine`	fan-out	One-time communicator setup
`/start_weight_update` *	fan-out	Begin a chunked update
`/update_weights` *	fan-out	Send a tensor chunk
`/finish_weight_update` *	fan-out	Commit the update
`/pause`, `/resume`	fan-out	Generation control
`/sleep`, `/wake_up`	fan-out	Colocated memory management
`/v1/completions`, `/v1/chat/completions`, generate	routed	Generation

* Currently, we use custom /collective_rpc calls plus worker methods that mimic the native APIs, because SkyRL makes some vLLM fixes for Qwen 3.6 model loading. We will migrate to the native /start_weight_update and /finish_weight_update soon.

End-to-end Weight Sync Flow

Non-colocated mode (NCCL broadcast):

trainer step done
       │
       ▼
client.pause(mode="keep")        ──►  in-flight rollouts frozen, KV cache kept
       │
       ▼
NCCL broadcast (rank 0 → engines)
   +  POST /update_weights (concurrent)   ──►  N chunks
       │
       ▼
client.resume()                  ──►  frozen requests thaw, continue
       │
       ▼
rollouts proceed on new policy
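
The non-colocated flow above can be sketched as a small coroutine. This is an illustration under assumptions, not SkyRL's implementation: `sync_weights`, `FakeClient`, and `fake_broadcast` are hypothetical, the real NCCL broadcast is replaced by a placeholder, and wrapping the update in try/finally so that `/resume` always runs is a design choice of this sketch.

```python
import asyncio


async def sync_weights(client, broadcast_chunk, chunks):
    """Pause with mode="keep", push every chunk, and always resume."""
    await client.pause(mode="keep")
    try:
        for chunk in chunks:
            # Tensor data and the metadata request for it travel concurrently.
            await asyncio.gather(broadcast_chunk(chunk), client.update_weights(chunk))
    finally:
        # Resume even if a transfer fails, so frozen requests are not stranded.
        await client.resume()


class FakeClient:
    """Records calls; stands in for a real control-plane client."""

    def __init__(self):
        self.log = []

    async def pause(self, mode):
        self.log.append(f"pause:{mode}")

    async def update_weights(self, chunk):
        self.log.append(f"update:{chunk}")

    async def resume(self):
        self.log.append("resume")


async def fake_broadcast(chunk):
    await asyncio.sleep(0)  # placeholder for the NCCL broadcast from rank 0


fake = FakeClient()
asyncio.run(sync_weights(fake, fake_broadcast, ["chunk0", "chunk1"]))
print(fake.log)
```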

Colocated mode (CUDA IPC):

trainer step done
       │
       ▼
client.wake_up(tags=["weights"]) ──►  inference loads weight buffers
       │
       ▼
POST /start_weight_update
       │
       ▼
for each chunk:
   pack tensors → CUDA IPC handle → POST /update_weights
       │
       ▼
POST /finish_weight_update
       │
       ▼
client.wake_up(tags=["kv_cache"]) ──►  KV cache restored, ready to serve
