
Off-Policy Correction in SkyRL

Off-policy correction is supported with the FSDP and Megatron backends.

SkyRL provides built-in utilities for correcting off-policy drift from trainer/inference mismatch and AsyncRL. This guide covers:

  1. Sources of off-policy drift — why training and inference policies diverge
  2. Algorithmic corrections — importance sampling and sequence masking techniques
  3. Configuration in SkyRL — how to enable these corrections in your training runs

TLDR

We recommend adding the following configs, in order, to your training runs to help address off-policy drift:

# we recommend trying basic TIS correction first
trainer.algorithm.off_policy_correction.tis_ratio_type="token"
trainer.algorithm.off_policy_correction.token_tis_ratio_clip_high=2.0

# for long context + MoE models, try geometric sequence masking - tune geo_mask_high/geo_mask_low as needed
trainer.algorithm.off_policy_correction.sequence_mask_metric="geometric"
trainer.algorithm.off_policy_correction.geo_mask_high=1.01
trainer.algorithm.off_policy_correction.geo_mask_low=0.99

# alternatively, for long context + MoE you can try token masking (icepop) and tune token_mask_is_threshold_low/high
trainer.algorithm.off_policy_correction.token_mask_is_threshold_low=0.5
trainer.algorithm.off_policy_correction.token_mask_is_threshold_high=2.0

# for longer context + MoE, you can also try outlier based sequence masking, which stacks on top of geometric sequence masking
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_low=1e-4
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_high=100

Setup

For common RL objectives (e.g., PPO/GRPO variants), we typically seek to optimize a token-wise objective of the form:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[ \min \left( \frac{p_\theta(x)}{q(x)} \cdot A(x), \, \text{clip}\left( \frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

where:

  • x \sim q — samples drawn from the sampling policy q
  • p_\theta(x) — probability under the current policy being optimized
  • q(x) — probability under the sampling policy used during rollout
  • A(x) — advantage estimate, typically computed as group-relative rewards (with the std normalization being optional per Dr. GRPO):
A(x) = \frac{r(x) - \text{mean}(r)}{\text{std}(r)}
  • \frac{p_\theta(x)}{q(x)} — the PPO importance sampling ratio, correcting for distributional shift between the sampling policy and the current policy when taking multiple mini-batch steps for a single training batch
  • \epsilon_{\text{low}}, \epsilon_{\text{high}} — clipping bounds (can be asymmetric)
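The clipped objective above can be sketched for a single token in plain Python. This is an illustrative sketch of the math only, not SkyRL's implementation, and the function name is hypothetical:

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Clipped surrogate loss for a single token (illustrative sketch).

    logp_new / logp_old are log-probabilities under p_theta and q.
    Returns the negated min of the unclipped and clipped surrogate terms.
    """
    ratio = math.exp(logp_new - logp_old)                 # p_theta(x) / q(x)
    clipped = max(min(ratio, 1 + eps_high), 1 - eps_low)  # clip(ratio, ...)
    return -min(ratio * advantage, clipped * advantage)

# On-policy token (ratio == 1): both terms coincide, loss = -advantage.
loss = grpo_token_loss(logp_new=-1.0, logp_old=-1.0, advantage=0.5)
```

In practice the loss is averaged over tokens and sequences, but the per-token structure is exactly this min of the two surrogate terms.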

In most RL frameworks, there are two options for representing q:

  • q = \mu_{\theta_{\text{old}}}, where \mu is the actual sampling policy via the inference engine
  • q = \pi_{\theta_{\text{old}}}, where \pi is the trainer policy (same weights, but potentially different parallelism/kernels)

By default in SkyRL (and in most RL frameworks), q = \pi_{\theta_{\text{old}}} is used as an approximation of the rollout policy \mu_{\theta_{\text{old}}}. This requires recomputing the logprobs of responses under the training policy by taking a forward pass with the training weights prior to updating the weights for a given training step. However, the goal is still to most accurately estimate the importance sampling ratio using \mu_{\theta_{\text{old}}}:

\frac{p_\theta(x)}{q(x)} = \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)}

Off-policy drift in RL

We can quantify off-policy drift from this ideal ratio \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)} by considering the following expansion:

\frac{p(x)}{q(x)} = \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)} = \frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)} \cdot \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

The first term:

\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}

corresponds to the off-policy drift from training vs inference mismatch (due to system differences).

While the second term:

\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

corresponds to the off-policy drift from policy staleness (due to parameter differences).
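The decomposition is easiest to see in log space, where the full log-ratio splits exactly into a mismatch term plus a staleness term. The logprob values below are made up for illustration:

```python
import math

# Hypothetical logprobs for one token under the three policies involved:
logp_train_new = -1.10   # log pi_theta(x)        (current trainer policy)
logp_train_old = -1.20   # log pi_theta_old(x)    (trainer policy at rollout time)
logp_rollout   = -1.25   # log mu_theta_old(x)    (inference engine at rollout time)

mismatch  = math.exp(logp_train_old - logp_rollout)    # pi_old / mu_old
staleness = math.exp(logp_train_new - logp_train_old)  # pi / pi_old
full      = math.exp(logp_train_new - logp_rollout)    # pi / mu_old

# The full ratio factors exactly into mismatch * staleness.
assert abs(full - mismatch * staleness) < 1e-12
```

This is also why frameworks work with logprob differences rather than raw probabilities: the two drift sources add in log space and can be monitored separately.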

Next, we discuss how each of these commonly occurs in RL training.

Training vs Inference Engine Mismatch

Training vs inference mismatch

\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}

occurs due to discrepancies between the logprobs computed by the training backend (FSDP, Megatron) and those computed by the inference engine (e.g., vLLM). These discrepancies include:

  • Kernel Mismatch: Optimized kernels for inference engines are often not batch invariant, causing \mu_{\theta_{\text{old}}}(x) to differ from \pi_{\theta_{\text{old}}}(x). This can be fixed by enabling batch-invariant kernels, at the cost of slower inference.
  • Inconsistent Expert Routing: The experts routed to by the trainer and the inference engine may not line up, causing a mismatch in computed logprobs. This can be fixed by introducing routing replay, which constrains the expert routing in the training engine to match the expert routing from the inference engine (Zheng et al. 2025, Ma et al. 2025).
  • Different Parallelisms: Kernel mismatch and numeric drift can be exacerbated by different parallelism configurations in training backends like FSDP and Megatron compared to inference engines like vLLM. For example, Yao et al. 2025 show that enabling Ulysses-style sequence parallelism on the trainer greatly increases the trainer/inference mismatch.
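A quick way to gauge this mismatch is to compare the logprobs the trainer recomputes against the logprobs the inference engine returned for the same tokens. A minimal sketch (the function name and summary statistics are our own, not SkyRL APIs):

```python
import math

def mismatch_metrics(trainer_logprobs, rollout_logprobs):
    """Summarize trainer/inference logprob mismatch for one sequence.

    Returns the mean absolute per-token logprob difference and the
    sequence-level importance ratio pi_old(x) / mu_old(x).
    """
    diffs = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    # Product of token ratios, computed stably in log space.
    seq_ratio = math.exp(sum(diffs))
    return mean_abs_diff, seq_ratio

# Hypothetical logprobs for a 3-token response from both engines.
mad, ratio = mismatch_metrics([-0.9, -1.1, -2.0], [-1.0, -1.0, -2.1])
```

If the mean absolute difference stays near zero and the sequence ratio stays near 1, trainer/inference mismatch is not the dominant source of drift.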

Policy staleness

Policy staleness

\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

is caused by the following factors:

  • Async RL: When doing fully asynchronous RL, each training batch can consist of trajectories which were partially (or even fully) computed using stale policies. To mitigate this, the max staleness of trajectories can be tuned to prevent trajectories that are too old from being used during training.
  • Mini Batching: Breaking a training step into multiple mini batches, with one gradient step per mini batch, is common for improving training efficiency in online RL. Mini batching results in off-policy updates, which can be clamped within an acceptable range by the common dual-clip formulation of the PPO loss. Tuning the number of mini batches per training batch can affect the convergence of RL runs and whether corrections like routing replay and masking are needed.

Algorithmic Off-Policy Correction

In the previous section, we described some reasons why off-policy drift can occur, and some ways to mitigate it (e.g., batch invariant kernels, routing replay). However, these solutions come with tradeoffs (slower inference for batch invariant kernels, additional bias for routing replay), and are not sufficient to address all sources of drift, like fully async RL.

Recent works (Liu et al. 2025, Yao et al. 2025) have proposed additional techniques for off-policy correction. In this section, we describe these techniques and how to enable them in SkyRL.

Truncated Importance Sampling

Yao et al. 2025 propose adding a truncated importance sampling term (equivalent to the training-inference mismatch term above, but clamped) to the loss formulation:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[\textcolor{red}{\min\left(\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}, C\right)} \cdot \min \left( \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \cdot A(x), \, \text{clip}\left(\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

The original TIS blog post suggests applying this term using a token-wise \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}, but follow-up works like Liu et al. 2025 suggest applying a sequence-level term instead:

\textcolor{red}{\min\left(\prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}, C\right)}

off_policy_correction:
  tis_ratio_type: null # null, "token", "sequence"
  token_tis_ratio_clip_high: 2.0
  sequence_tis_ratio_clip_high: 5.0
  ...

To enable TIS in SkyRL, set tis_ratio_type to either token or sequence to use a token-wise or sequence-wise correction term. If tis_ratio_type is token, token_tis_ratio_clip_high will be used for the clamping term C; if sequence, sequence_tis_ratio_clip_high will be used.
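The token-wise correction amounts to a clamped multiplicative weight on each token's loss. A minimal sketch of that arithmetic (not the SkyRL implementation; in practice the weight is treated as a constant with no gradient flowing through it):

```python
import math

def tis_weight(logp_train_old, logp_rollout, clip_high=2.0):
    """Token-wise truncated importance sampling weight:
    min(pi_old(x_t) / mu_old(x_t), C), applied as a multiplier on the
    per-token policy loss."""
    ratio = math.exp(logp_train_old - logp_rollout)
    return min(ratio, clip_high)

# Mildly mismatched token: the ratio is below the clip and passes through.
w = tis_weight(-1.0, -1.1, clip_high=2.0)       # ~= 1.105
# Heavily mismatched token: the weight is truncated at C.
w_clip = tis_weight(-0.5, -2.0, clip_high=2.0)  # = 2.0
```

The clamp bounds how much any single mismatched token can amplify the loss, which is why a moderate value like token_tis_ratio_clip_high=2.0 is the suggested starting point.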

Sequence Masking

Liu et al. 2025 propose masking out sequences with sequence-level importance sampling ratios outside a given range (also used by DeepSeek V3.2 and Cognition's SWE-Grep) to maintain training stability and tolerance for off-policy updates.

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[\textcolor{red}{M} \cdot \min \left( \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \cdot A(x), \, \text{clip}\left(\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

\textcolor{red}{M = \begin{cases} 1 & \text{if } C_\text{low} < \rho < C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

where

\textcolor{red}{\rho = \prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} \,\text{ or }\, \left(\prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}\right)^{\frac{1}{T}}}

Here, either the simple product or the geometric mean of the token-wise importance sampling ratios can be used to determine whether a sequence should be masked.

off_policy_correction:
  ...
  sequence_mask_metric: null # null, "product", "geometric"
  geo_mask_high: 1.01
  geo_mask_low: 0.99
  product_mask_high: 2.0
  product_mask_low: 0.5
  outlier_token_is_threshold_low: null
  outlier_token_is_threshold_high: null
  ...

To enable sequence masking in SkyRL, set sequence_mask_metric to either product or geometric. For geometric, set geo_mask_high and geo_mask_low; for product, set product_mask_high and product_mask_low.
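Both metrics reduce to a threshold test on the sum of per-token log-ratios. A sketch of the computation (illustrative only; the function name is hypothetical, but the default thresholds mirror the config fields above):

```python
import math

def sequence_mask(log_ratios, metric="geometric",
                  geo_low=0.99, geo_high=1.01,
                  prod_low=0.5, prod_high=2.0):
    """Sequence-level keep (1) / drop (0) mask.

    log_ratios[t] = log pi_old(x_t) - log mu_old(x_t) for each token.
    """
    total = sum(log_ratios)
    if metric == "geometric":
        rho = math.exp(total / len(log_ratios))  # geometric mean of ratios
        return 1 if geo_low < rho < geo_high else 0
    rho = math.exp(total)                        # plain product of ratios
    return 1 if prod_low < rho < prod_high else 0

# Per-token log-ratios near zero keep the sequence under the geometric metric.
keep = sequence_mask([0.001, -0.002, 0.004], metric="geometric")
```

Note why the default bounds differ so much: the geometric mean is length-normalized and hovers near 1 even for long sequences, while the raw product compounds with length, so it needs much wider bounds.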

SkyRL also provides a way to mask sequences where any token has an importance sampling ratio outside a specified range:

\textcolor{red}{M = \begin{cases} 1 & \text{if } \forall t \in [1, T],\ C_\text{low} < \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} < C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

To enable masking sequences with outlier tokens, set outlier_token_is_threshold_low and outlier_token_is_threshold_high.
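Unlike the metric-based masks above, this check is a per-token scan: a single outlier token rejects the whole sequence. A minimal sketch (function name hypothetical; defaults mirror the suggested thresholds):

```python
import math

def outlier_sequence_mask(log_ratios, low=1e-4, high=100.0):
    """Drop a whole sequence (return 0) if ANY token's importance
    sampling ratio falls outside (low, high); keep it (return 1) otherwise."""
    for lr in log_ratios:
        ratio = math.exp(lr)
        if not (low < ratio < high):
            return 0
    return 1

# All token ratios near 1: the sequence is kept.
keep = outlier_sequence_mask([0.1, -0.2, 0.05])
# One extreme token (ratio e^-10 < 1e-4): the whole sequence is dropped.
drop = outlier_sequence_mask([0.1, -10.0, 0.05])
```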

Token Masking

Unlike sequence masking (which rejects entire sequences when a sequence-level metric exceeds a threshold), token masking zeros out only the individual tokens whose importance sampling ratio falls outside an acceptable range. This is a finer-grained correction that preserves the rest of the sequence for learning. This technique was introduced by Zhou et al. as Icepop, and has since been used in the training of GLM-5.

\textcolor{red}{m_t = \begin{cases} 1 & \text{if } C_\text{low} \leq \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} \leq C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

The per-token mask mtm_t is multiplied element-wise into the loss mask, so masked tokens contribute zero gradient while all other tokens in the sequence remain unaffected.

off_policy_correction:
  ...
  token_mask_is_threshold_low: 0.5   # suggested starting value
  token_mask_is_threshold_high: 2.0  # suggested starting value

To enable token masking, set both token_mask_is_threshold_low and token_mask_is_threshold_high. Both must be set for the mask to activate; setting only one has no effect.

Token masking can be combined with other corrections (TIS, sequence masking, outlier masking). When multiple masks are active, they are applied multiplicatively to the loss mask in order: outlier sequence mask, then token mask, then sequence mask.

When enabled, the metric token_mask_ratio is logged, representing the fraction of originally-valid tokens that were zeroed by the token mask.
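The per-token mask and its logged fraction can be sketched together (illustrative only; the function name is hypothetical, and the fraction mirrors how token_mask_ratio is described above: masked tokens over originally-valid tokens):

```python
import math

def apply_token_mask(loss_mask, log_ratios, low=0.5, high=2.0):
    """Zero out tokens whose importance sampling ratio leaves [low, high].

    loss_mask[t] is 1 for tokens that participate in the loss, 0 otherwise.
    Returns the updated mask and the fraction of originally-valid tokens
    that were zeroed (the token_mask_ratio metric).
    """
    new_mask, masked = [], 0
    for m, lr in zip(loss_mask, log_ratios):
        ratio = math.exp(lr)
        keep = 1 if low <= ratio <= high else 0
        if m == 1 and keep == 0:
            masked += 1
        new_mask.append(m * keep)  # multiplied element-wise into the loss mask
    valid = sum(loss_mask)
    return new_mask, (masked / valid if valid else 0.0)

# 4-token sequence; the last token was already excluded from the loss.
mask, frac = apply_token_mask([1, 1, 1, 0], [0.0, 1.0, -0.1, 0.0])
```

Here the second token (ratio e^1 > 2.0) is zeroed, so one of the three originally-valid tokens is masked and the logged fraction is 1/3.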

Metrics and Monitoring

If off-policy correction is enabled, you can view relevant metrics, like the mean/std of the importance sampling ratio, the mean/std of logprob diffs, and the fraction of masked sequences in the logger of your choice under policy/loss_metrics.

Some examples are shown below:

Importance Sampling Ratio

Log Probability Differences

Outlier Mask

References
