Agent Integration
This doc is a work-in-progress. Last updated: March 18, 2026.
SkyRL is designed with modularity as its most important principle. As shown in the Architecture Overview, the Generator is where rollouts are performed and where your per-task agent behaviors (e.g., environment interaction, tool calls) are defined. In general, there are two ways to perform RL with SkyRL:
- Using the `SkyRLGymGenerator`: a basic agent loop where you define your task-specific logic in a gymnasium-style API (`init()`, `step()`, `close()`). It supports all features, including fully async RL (i.e., in-flight weight updates), step-wise training, token-in-token-out, TIS, R3, etc. This is recommended when your environment is lightweight (e.g., deep-research tasks where the tools are simple API calls). See: SkyRLGymGenerator.
- Implementing a custom generator by following the `GeneratorInterface` abstraction, whose only required method is `async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput`. This is the path to take if you have an existing agent harness that is too complex to migrate into the `SkyRLGymGenerator` format. A canonical example is the Harbor generator; see: Harbor.
Prerequisite: Toggle on HTTP Endpoint
SkyRL exposes the inference engine as an OpenAI-compatible HTTP endpoint, so your agent harness can send requests to it. Configure the following:
- `generator.inference_engine.enable_http_endpoint`: Set to `true` to launch an OpenAI-compatible HTTP endpoint. When using HTTP endpoints, propagate the temperature appropriately to `trainer.algorithm.temperature` if you are not using `generator.sampling_params.temperature`.
- `generator.inference_engine.http_endpoint_host`: Host for the inference HTTP endpoint.
- `generator.inference_engine.http_endpoint_port`: Port for the inference HTTP endpoint.
- `generator.inference_engine.served_model_name`: The model name to use for HTTP endpoint validation. If set, this name must be used in the `model` field of `/chat/completions` and `/completions` requests instead of the model path.
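Assuming a Hydra-style YAML config (the field names come from the list above; the host, port, and model name values are purely illustrative), the toggles might look like:

```yaml
generator:
  inference_engine:
    enable_http_endpoint: true
    http_endpoint_host: "127.0.0.1"   # illustrative value
    http_endpoint_port: 8000          # illustrative value
    served_model_name: "my-model"     # requests must then use model="my-model"
```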
Ways to Integrate Your Custom Agent
The end goal is to implement the contract of `generate(self, input_batch: GeneratorInput) -> GeneratorOutput`. Let's first look at the `GeneratorOutput`:
```python
class GeneratorOutput(TypedDict):
    prompt_token_ids: List[List[int]]
    response_ids: List[List[int]]
    rewards: Union[List[float], List[List[float]]]
    loss_masks: List[List[int]]
    stop_reasons: Optional[List[str]]
    rollout_metrics: Optional[Dict[str, Any]]
    rollout_logprobs: Optional[List[List[float]]]
    trajectory_ids: Optional[List[TrajectoryID]]
    rollout_expert_indices: Optional[List[List[List[List[int]]]]]  # [batch_size, seq_len, layer_num, topk]
    # Applicable only for step-wise training
    is_last_step: Optional[List[bool]]
```

With a custom agent, you are expected to book-keep:
- `rollout_logprobs` if you want to perform TIS
- `rollout_expert_indices` if you want to perform R3
- `prompt_token_ids` and `response_ids` if you perform step-wise training
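To make the contract concrete, here is a minimal sketch of a custom generator. The `GeneratorInput` shape, the `_rollout` helper, and the hard-coded token IDs are illustrative assumptions (a real harness would call the OpenAI-compatible endpoint); only the `generate` signature and the `GeneratorOutput` fields come from the interface above.

```python
import asyncio
from typing import Any, Dict, List, Optional, TypedDict, Union


# Illustrative input shape: one chat (list of messages) per trajectory.
class GeneratorInput(TypedDict):
    prompts: List[List[Dict[str, str]]]


# Subset of the GeneratorOutput contract shown above.
class GeneratorOutput(TypedDict):
    prompt_token_ids: List[List[int]]
    response_ids: List[List[int]]
    rewards: Union[List[float], List[List[float]]]
    loss_masks: List[List[int]]
    stop_reasons: Optional[List[str]]
    rollout_metrics: Optional[Dict[str, Any]]
    rollout_logprobs: Optional[List[List[float]]]


class MyGenerator:
    """Sketch of a custom generator: run one agent rollout per prompt."""

    async def _rollout(self, chat: List[Dict[str, str]]) -> Dict[str, Any]:
        # A real harness would send requests to the HTTP endpoint here.
        # We fabricate token IDs to keep the example self-contained.
        prompt_ids = list(range(len(chat[0]["content"].split())))
        response_ids = [101, 102, 103]
        return {
            "prompt_token_ids": prompt_ids,
            "response_ids": response_ids,
            "reward": 1.0,
            "stop_reason": "stop",
        }

    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        # Roll out all trajectories concurrently.
        results = await asyncio.gather(
            *(self._rollout(chat) for chat in input_batch["prompts"])
        )
        return GeneratorOutput(
            prompt_token_ids=[r["prompt_token_ids"] for r in results],
            response_ids=[r["response_ids"] for r in results],
            rewards=[r["reward"] for r in results],
            # Train on every response token in this sketch.
            loss_masks=[[1] * len(r["response_ids"]) for r in results],
            stop_reasons=[r["stop_reason"] for r in results],
            rollout_metrics=None,
            rollout_logprobs=None,
        )
```

The per-trajectory book-keeping (e.g., `rollout_logprobs` for TIS) would be collected inside `_rollout` and threaded through the same way.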
There are roughly three ways to integrate an agent, each with different trade-offs.
1. Re-Tokenization
For each trajectory, record a `chat_history: List[Dict[str, str]]`, re-tokenize it with the chat template, and construct the `loss_mask` based on roles. You can use the helper method `get_response_ids_and_loss_mask_from_messages()` to construct `prompt_token_ids`, `response_ids`, and `loss_masks`.
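The role-based masking can be sketched as follows. This is not the actual `get_response_ids_and_loss_mask_from_messages()` helper; `toy_tokenize` is a stand-in for applying the model's chat template, and the whole function is only meant to show which tokens receive loss.

```python
from typing import Dict, List, Tuple


def toy_tokenize(text: str) -> List[int]:
    # Stand-in for tokenizer.apply_chat_template; real code would use
    # the model's tokenizer and chat template.
    return [hash(tok) % 1000 for tok in text.split()]


def retokenize_with_loss_mask(
    chat_history: List[Dict[str, str]],
) -> Tuple[List[int], List[int], List[int]]:
    """Re-tokenize a chat history; only assistant tokens get loss."""
    prompt_ids: List[int] = toy_tokenize(chat_history[0]["content"])
    response_ids: List[int] = []
    loss_mask: List[int] = []
    for msg in chat_history[1:]:
        ids = toy_tokenize(msg["content"])
        response_ids.extend(ids)
        # Train only on assistant turns; tool/user observations are masked out.
        loss_mask.extend([1 if msg["role"] == "assistant" else 0] * len(ids))
    return prompt_ids, response_ids, loss_mask
```

Note that the token IDs produced here are whatever the tokenizer emits for the recorded strings, which is exactly where re-tokenization drift can creep in.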
Pros:
- Simplest approach — works almost out of the box for most agent harnesses. Despite re-tokenization drift, some successful open-source recipes were trained this way.
Cons:
- Re-tokenization drift — what the model actually generated may not match what you end up tokenizing (and hence training on). This means:
- You cannot do rollout correction like TIS reliably, so you cannot do fully async training with proper staleness correction.
- The chat history must be strictly appending (no context management like summarization).
2. Make the agent harness Token-In-Token-Out
Make your agent harness operate entirely in token space.
This likely involves rewriting your agent to use `/completions` (not `/chat/completions`), meaning you cannot use vLLM's native tool-call parsing; you will need to parse tool calls yourself. You maintain a list of tokens that is strictly appending: turn 2's input consists of the exact tokens from turn 1's LLM output plus observation tokens (tokenized in a way that obeys the chat template). You can refer to the `SkyRLGymGenerator` to see how this is done.
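The strictly-appending loop can be sketched as below. `fake_completions` and `tokenize_observation` are stand-ins (a real harness would send token IDs to `/completions` and tokenize observations with the chat template); the point is that one token list and one loss mask grow in lockstep across turns.

```python
from typing import List, Tuple


def fake_completions(token_ids: List[int]) -> List[int]:
    # Stand-in for a /completions call operating on token IDs.
    # Pretends the model emits exactly one new token per call.
    return [900 + len(token_ids)]


def tokenize_observation(obs: str) -> List[int]:
    # Must tokenize the observation so the result obeys the chat template.
    return [hash(t) % 800 for t in obs.split()]


def run_token_in_token_out(
    prompt_ids: List[int], observations: List[str]
) -> Tuple[List[int], List[int]]:
    """Strictly-appending token loop: each turn extends the same token list."""
    tokens = list(prompt_ids)
    loss_mask = [0] * len(prompt_ids)       # no loss on the prompt
    for obs in observations:
        out = fake_completions(tokens)      # turn t: LLM output tokens
        tokens += out
        loss_mask += [1] * len(out)         # train on model-generated tokens
        obs_ids = tokenize_observation(obs)  # turn t: observation tokens
        tokens += obs_ids
        loss_mask += [0] * len(obs_ids)     # mask out observations
    out = fake_completions(tokens)          # final turn
    tokens += out
    loss_mask += [1] * len(out)
    return tokens, loss_mask
```

Because turn t+1's input is literally the token list produced so far, the trajectory trains on exactly the tokens the model generated, with no re-tokenization step.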
Pros:
- Guaranteed on-policyness with no tokenization drift.
- A single forward pass per trajectory (unless combined with approach 3 for non-strictly-appending chat histories, e.g. context management like summarization or thinking token stripping).
Cons:
- More implementation work, though likely worth it, since doing RL is already a significant investment of time and effort.
3. Step-Wise Training
For each trajectory, treat each turn's input and output pair as a separate training sequence.
Your agent harness can still use `/chat/completions` with tool-call parsing, since you can use vLLM's `return_token_ids` to get the raw input and output token IDs. However, your agent harness is expected to do this book-keeping per turn.
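The per-turn book-keeping might look like the sketch below: each turn's captured `(input_ids, output_ids)` pair becomes its own training sequence. The dict shape and the reward convention (reward on the last step only) are illustrative assumptions, not SkyRL's exact schema.

```python
from typing import Dict, List


def build_stepwise_sequences(
    turns: List[Dict[str, List[int]]], final_reward: float
) -> List[Dict[str, object]]:
    """Turn per-turn (input_ids, output_ids) pairs, e.g. captured via
    vLLM's return_token_ids, into separate training sequences."""
    sequences = []
    for i, turn in enumerate(turns):
        is_last = i == len(turns) - 1
        sequences.append(
            {
                "prompt_token_ids": turn["input_ids"],   # growing prefix
                "response_ids": turn["output_ids"],
                "loss_mask": [1] * len(turn["output_ids"]),
                # Reward conventions vary; here only the last step carries it.
                "reward": final_reward if is_last else 0.0,
                "is_last_step": is_last,
            }
        )
    return sequences
```

Note that each step's `prompt_token_ids` contains the full (and growing) prefix for that turn, which is what drives the O(T^2) cost discussed below.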
Pros:
- Simpler than rewriting the agent into token-in-token-out.
- On-policy (no tokenization drift) despite using `/chat/completions` (string-space) and context management (i.e., a non-strictly-appending chat history).
Cons:
- Training time can grow: O(T^2) vs O(T), since each trajectory of T turns becomes T sequences to forward (each with a growing prefix), as opposed to 1 sequence.
- SkyRL supports prefix-aware merging of per-step sequences when the prefix matches, via the config flag `generator.merge_stepwise_output`. This can reduce the O(T^2) cost when the chat history is strictly appending across turns and there is no token mismatch. See https://github.com/NovaSky-AI/SkyRL/pull/1532
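The cost difference can be made concrete with a back-of-the-envelope count. The per-turn token lengths below are arbitrary; the sketch only illustrates why forwarding T growing prefixes is quadratic while one merged sequence is linear.

```python
from typing import Tuple


def forwarded_tokens(T: int, out_len: int, obs_len: int) -> Tuple[int, int]:
    """Compare tokens forwarded per trajectory: step-wise vs one merged
    sequence. Assumes each turn adds `out_len` model tokens and `obs_len`
    observation tokens (illustrative simplification)."""
    per_turn = out_len + obs_len
    # Step-wise: turn i forwards its full growing prefix of i turns.
    stepwise = sum(i * per_turn for i in range(1, T + 1))
    # Merged (strictly appending): a single sequence of T turns.
    merged = T * per_turn
    return stepwise, merged
```

For example, with T = 4 turns of 20 tokens each, step-wise forwards 200 tokens per trajectory versus 80 for the merged sequence, and the gap widens quadratically with T.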
For the full details on how to structure the `GeneratorOutput` for step-wise training, including the required fields, invariants, and a concrete example, see: Step-Wise Training.