
Off-Policy Correction in SkyRL

Off-policy correction is supported with the FSDP and Megatron backends.

SkyRL provides built-in utilities for correcting off-policy drift from trainer/inference mismatch and AsyncRL. This guide covers:

  1. Sources of off-policy drift — why training and inference policies diverge
  2. Algorithmic corrections — importance sampling and sequence masking techniques
  3. Configuration in SkyRL — how to enable these corrections in your training runs

TLDR

We recommend adding the following configs, in order, to your training runs to help address off-policy drift:

# we recommend trying basic TIS correction first
trainer.algorithm.off_policy_correction.tis_ratio_type="token"
trainer.algorithm.off_policy_correction.token_tis_ratio_clip_high=2.0

# for long context + MoE models, try geometric sequence masking - tune geo_mask_high/geo_mask_low as needed
trainer.algorithm.off_policy_correction.sequence_mask_metric="geometric"
trainer.algorithm.off_policy_correction.geo_mask_high=1.01
trainer.algorithm.off_policy_correction.geo_mask_low=0.99

# alternatively, for long context + MoE you can try token masking (icepop) and tune token_mask_is_threshold_low/high
trainer.algorithm.off_policy_correction.token_mask_is_threshold_low=0.5
trainer.algorithm.off_policy_correction.token_mask_is_threshold_high=2.0

# for longer context + MoE, you can also try outlier based sequence masking, which stacks on top of geometric sequence masking
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_low=1e-4
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_high=100

Setup

For common RL objectives (e.g., PPO/GRPO variants), we typically seek to optimize a token-wise objective of the form:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[ \min \left( \frac{p_\theta(x)}{q(x)} \cdot A(x), \, \text{clip}\left( \frac{p_\theta(x)}{q(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

where:

  • x \sim q — samples drawn from the sampling policy q
  • p_\theta(x) — probability under the current policy being optimized
  • q(x) — probability under the sampling policy used during rollout
  • A(x) — advantage estimate, typically computed as group-relative rewards (with the std normalization being optional per Dr. GRPO):
A(x) = \frac{r(x) - \text{mean}(r)}{\text{std}(r)}
  • \frac{p_\theta(x)}{q(x)} — the PPO importance sampling ratio, correcting for distributional shift between the sampling policy and the current policy when taking multiple mini-batch steps for a single training batch
  • \epsilon_{\text{low}}, \epsilon_{\text{high}} — clipping bounds (can be asymmetric)
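The clipped objective above can be sketched for a single token in plain Python. This is an illustrative sketch of the math only, not SkyRL's implementation, and the function name is hypothetical:

```python
import math

def grpo_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Clipped surrogate loss for a single token (illustrative sketch).

    logp_new / logp_old are log-probabilities under p_theta and q.
    Returns the negated min of the unclipped and clipped surrogate terms.
    """
    ratio = math.exp(logp_new - logp_old)                 # p_theta(x) / q(x)
    clipped = max(min(ratio, 1 + eps_high), 1 - eps_low)  # clip(ratio, ...)
    return -min(ratio * advantage, clipped * advantage)

# On-policy token (ratio == 1): both terms coincide, loss = -advantage.
loss = grpo_token_loss(logp_new=-1.0, logp_old=-1.0, advantage=0.5)
```

In practice the loss is averaged over tokens and sequences, but the per-token structure is exactly this min of the two surrogate terms.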

In most RL frameworks, there are two options for representing q:

  • q = \mu_{\theta_{\text{old}}}, where \mu is the actual sampling policy via the inference engine
  • q = \pi_{\theta_{\text{old}}}, where \pi is the trainer policy (same weights, but potentially different parallelism/kernels)

By default in SkyRL (and in most RL frameworks), q = \pi_{\theta_{\text{old}}} is used as an approximation of the rollout policy \mu_{\theta_{\text{old}}}. This requires recomputing the logprobs of responses under the training policy by taking a forward pass with the training weights prior to updating the weights for a given training step. However, the goal is still to most accurately estimate the importance sampling ratio using \mu_{\theta_{\text{old}}}:

\frac{p_\theta(x)}{q(x)} = \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)}

Off-policy drift in RL

We can quantify off-policy drift from this ideal ratio \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)} by considering the following expansion:

\frac{p(x)}{q(x)} = \frac{\pi_{\theta}(x)}{\mu_{\theta_{\text{old}}}(x)} = \frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)} \cdot \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

The first term:

\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}

corresponds to the off-policy drift from training vs inference mismatch (due to system differences).

While the second term:

\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

corresponds to the off-policy drift from policy staleness (due to parameter differences).
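The decomposition is easiest to see in log space, where the full log-ratio splits exactly into a mismatch term plus a staleness term. The logprob values below are made up for illustration:

```python
import math

# Hypothetical logprobs for one token under the three policies involved:
logp_train_new = -1.10   # log pi_theta(x)        (current trainer policy)
logp_train_old = -1.20   # log pi_theta_old(x)    (trainer policy at rollout time)
logp_rollout   = -1.25   # log mu_theta_old(x)    (inference engine at rollout time)

mismatch  = math.exp(logp_train_old - logp_rollout)    # pi_old / mu_old
staleness = math.exp(logp_train_new - logp_train_old)  # pi / pi_old
full      = math.exp(logp_train_new - logp_rollout)    # pi / mu_old

# The full ratio factors exactly into mismatch * staleness.
assert abs(full - mismatch * staleness) < 1e-12
```

This is also why frameworks work with logprob differences rather than raw probabilities: the two drift sources add in log space and can be monitored separately.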

Next, we discuss how each of these commonly occurs in RL training.

Training vs Inference Engine Mismatch

Training vs inference mismatch

\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}

occurs due to discrepancies between the logprobs computed by the training backend (FSDP, Megatron) and those computed by the inference engine (e.g., vLLM). These discrepancies include:

  • Kernel Mismatch: Optimized kernels for inference engines are often not batch invariant, causing \mu_{\theta_{\text{old}}}(x) to differ from \pi_{\theta_{\text{old}}}(x). This can be fixed by enabling batch-invariant kernels, at the cost of slower inference.
  • Inconsistent Expert Routing: The experts routed to by the trainer and the inference engine may not line up, causing a mismatch in computed logprobs. This can be fixed by introducing routing replay, which constrains the expert routing in the training engine to match the expert routing from the inference engine (Zheng et al. 2025, Ma et al. 2025).
  • Different Parallelisms: Kernel mismatch and numeric drift can be exacerbated by different parallelism configurations in training backends like FSDP and Megatron compared to inference engines like vLLM. For example, Yao et al. 2025 show that enabling Ulysses-style sequence parallelism on the trainer greatly increases the trainer/inference mismatch.
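A quick way to gauge this mismatch is to compare the logprobs the trainer recomputes against the logprobs the inference engine returned for the same tokens. A minimal sketch (the function name and summary statistics are our own, not SkyRL APIs):

```python
import math

def mismatch_metrics(trainer_logprobs, rollout_logprobs):
    """Summarize trainer/inference logprob mismatch for one sequence.

    Returns the mean absolute per-token logprob difference and the
    sequence-level importance ratio pi_old(x) / mu_old(x).
    """
    diffs = [t - r for t, r in zip(trainer_logprobs, rollout_logprobs)]
    mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)
    # Product of token ratios, computed stably in log space.
    seq_ratio = math.exp(sum(diffs))
    return mean_abs_diff, seq_ratio

# Hypothetical logprobs for a 3-token response from both engines.
mad, ratio = mismatch_metrics([-0.9, -1.1, -2.0], [-1.0, -1.0, -2.1])
```

If the mean absolute difference stays near zero and the sequence ratio stays near 1, trainer/inference mismatch is not the dominant source of drift.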

Policy staleness

Policy staleness

\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}

is caused by the following factors:

  • Async RL: When doing fully asynchronous RL, each training batch can consist of trajectories which were partially (or even fully) computed using stale policies. To mitigate this, the max staleness of trajectories can be tuned to prevent trajectories that are too old from being used during training.
  • Mini Batching: Breaking a training step into multiple mini batches, with one gradient step per mini batch, is common for improving training efficiency in online RL. Mini batching results in off-policy updates, which can be clamped within an acceptable range by the common dual-clip formulation of the PPO loss. Tuning the number of mini batches per training batch can affect the convergence of RL runs and whether corrections like routing replay and masking are needed.

Algorithmic Off-Policy Correction

In the previous section, we described some reasons why off-policy drift can occur, and some ways to mitigate it (e.g., batch invariant kernels, routing replay). However, these solutions come with tradeoffs (slower inference for batch invariant kernels, additional bias for routing replay), and are not sufficient to address all sources of drift, like fully async RL.

Recent works (Liu et al. 2025, Yao et al. 2025) have proposed additional techniques for off-policy correction. In this section, we describe these techniques and how to enable them in SkyRL.

Truncated Importance Sampling

Yao et al. 2025 propose adding a truncated importance sampling term (equivalent to the training-inference mismatch term above, but clamped) to the loss formulation:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[\textcolor{red}{\min\left(\frac{\pi_{\theta_{\text{old}}}(x)}{\mu_{\theta_{\text{old}}}(x)}, C\right)} \cdot \min \left( \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \cdot A(x), \, \text{clip}\left(\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

The original TIS blog post suggests applying this term using a token-wise \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}, but follow-up works like Liu et al. 2025 suggest applying a sequence-level term instead:

\textcolor{red}{\min\left(\prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}, C\right)}

off_policy_correction:
  tis_ratio_type: null # null, "token", "sequence"
  token_tis_ratio_clip_high: 2.0
  sequence_tis_ratio_clip_high: 5.0
  ...

To enable TIS in SkyRL, set tis_ratio_type to either token or sequence to use a token-wise or sequence-wise correction term. If tis_ratio_type is token, token_tis_ratio_clip_high will be used for the clamping term C; if sequence, sequence_tis_ratio_clip_high will be used.
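The token-wise correction amounts to a clamped multiplicative weight on each token's loss. A minimal sketch of that arithmetic (not the SkyRL implementation; in practice the weight is treated as a constant with no gradient flowing through it):

```python
import math

def tis_weight(logp_train_old, logp_rollout, clip_high=2.0):
    """Token-wise truncated importance sampling weight:
    min(pi_old(x_t) / mu_old(x_t), C), applied as a multiplier on the
    per-token policy loss."""
    ratio = math.exp(logp_train_old - logp_rollout)
    return min(ratio, clip_high)

# Mildly mismatched token: the ratio is below the clip and passes through.
w = tis_weight(-1.0, -1.1, clip_high=2.0)       # ~= 1.105
# Heavily mismatched token: the weight is truncated at C.
w_clip = tis_weight(-0.5, -2.0, clip_high=2.0)  # = 2.0
```

The clamp bounds how much any single mismatched token can amplify the loss, which is why a moderate value like token_tis_ratio_clip_high=2.0 is the suggested starting point.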

Sequence Masking

Liu et al. 2025 propose masking out sequences with sequence-level importance sampling ratios outside a given range (also used by DeepSeek V3.2 and Cognition's SWE-Grep) to maintain training stability and tolerance for off-policy updates.

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_{x \sim q} \left[\textcolor{red}{M} \cdot \min \left( \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \cdot A(x), \, \text{clip}\left(\frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)}, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}} \right) \cdot A(x) \right) \right]

\textcolor{red}{M = \begin{cases} 1 & \text{if } C_\text{low} < \rho < C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

where

\textcolor{red}{\rho = \prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} \,\text{ or }\, \left(\prod_{t=1}^{T} \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)}\right)^{\frac{1}{T}}}

Here, either the simple product or the geometric mean of the token-wise importance sampling ratios can be used to determine whether a sequence should be masked.

off_policy_correction:
  ...
  sequence_mask_metric: null # null, "product", "geometric"
  geo_mask_high: 1.01
  geo_mask_low: 0.99
  product_mask_high: 2.0
  product_mask_low: 0.5
  outlier_token_is_threshold_low: null
  outlier_token_is_threshold_high: null
  ...

To enable sequence masking in SkyRL, set sequence_mask_metric to either product or geometric. For geometric, set geo_mask_high and geo_mask_low; for product, set product_mask_high and product_mask_low.
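Both metrics reduce to a threshold test on the sum of per-token log-ratios. A sketch of the computation (illustrative only; the function name is hypothetical, but the default thresholds mirror the config fields above):

```python
import math

def sequence_mask(log_ratios, metric="geometric",
                  geo_low=0.99, geo_high=1.01,
                  prod_low=0.5, prod_high=2.0):
    """Sequence-level keep (1) / drop (0) mask.

    log_ratios[t] = log pi_old(x_t) - log mu_old(x_t) for each token.
    """
    total = sum(log_ratios)
    if metric == "geometric":
        rho = math.exp(total / len(log_ratios))  # geometric mean of ratios
        return 1 if geo_low < rho < geo_high else 0
    rho = math.exp(total)                        # plain product of ratios
    return 1 if prod_low < rho < prod_high else 0

# Per-token log-ratios near zero keep the sequence under the geometric metric.
keep = sequence_mask([0.001, -0.002, 0.004], metric="geometric")
```

Note why the default bounds differ so much: the geometric mean is length-normalized and hovers near 1 even for long sequences, while the raw product compounds with length, so it needs much wider bounds.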

SkyRL also provides a way to mask sequences where any token has an importance sampling ratio outside a specified range:

\textcolor{red}{M = \begin{cases} 1 & \text{if } \forall t \in [1, T],\ C_\text{low} < \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} < C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

To enable masking sequences with outlier tokens, set outlier_token_is_threshold_low and outlier_token_is_threshold_high.
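Unlike the metric-based masks above, this check is a per-token scan: a single outlier token rejects the whole sequence. A minimal sketch (function name hypothetical; defaults mirror the suggested thresholds):

```python
import math

def outlier_sequence_mask(log_ratios, low=1e-4, high=100.0):
    """Drop a whole sequence (return 0) if ANY token's importance
    sampling ratio falls outside (low, high); keep it (return 1) otherwise."""
    for lr in log_ratios:
        ratio = math.exp(lr)
        if not (low < ratio < high):
            return 0
    return 1

# All token ratios near 1: the sequence is kept.
keep = outlier_sequence_mask([0.1, -0.2, 0.05])
# One extreme token (ratio e^-10 < 1e-4): the whole sequence is dropped.
drop = outlier_sequence_mask([0.1, -10.0, 0.05])
```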

Token Masking

Unlike sequence masking (which rejects entire sequences when a sequence-level metric exceeds a threshold), token masking zeros out only the individual tokens whose importance sampling ratio falls outside an acceptable range. This is a finer-grained correction that preserves the rest of the sequence for learning. This technique was introduced by Zhou et al. as Icepop, and has since been used in the training of GLM-5.

\textcolor{red}{m_t = \begin{cases} 1 & \text{if } C_\text{low} \leq \frac{\pi_{\theta_{\text{old}}}(x_t)}{\mu_{\theta_{\text{old}}}(x_t)} \leq C_\text{high} \\ 0 & \text{otherwise} \end{cases}}

The per-token mask mtm_t is multiplied element-wise into the loss mask, so masked tokens contribute zero gradient while all other tokens in the sequence remain unaffected.

off_policy_correction:
  ...
  token_mask_is_threshold_low: 0.5   # suggested starting value
  token_mask_is_threshold_high: 2.0  # suggested starting value

To enable token masking, set both token_mask_is_threshold_low and token_mask_is_threshold_high. Both must be set for the mask to activate; setting only one has no effect.

Token masking can be combined with other corrections (TIS, sequence masking, outlier masking). When multiple masks are active, they are applied multiplicatively to the loss mask in order: outlier sequence mask, then token mask, then sequence mask.

When enabled, the metric token_mask_ratio is logged, representing the fraction of originally-valid tokens that were zeroed by the token mask.
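The per-token mask and its logged fraction can be sketched together (illustrative only; the function name is hypothetical, and the fraction mirrors how token_mask_ratio is described above: masked tokens over originally-valid tokens):

```python
import math

def apply_token_mask(loss_mask, log_ratios, low=0.5, high=2.0):
    """Zero out tokens whose importance sampling ratio leaves [low, high].

    loss_mask[t] is 1 for tokens that participate in the loss, 0 otherwise.
    Returns the updated mask and the fraction of originally-valid tokens
    that were zeroed (the token_mask_ratio metric).
    """
    new_mask, masked = [], 0
    for m, lr in zip(loss_mask, log_ratios):
        ratio = math.exp(lr)
        keep = 1 if low <= ratio <= high else 0
        if m == 1 and keep == 0:
            masked += 1
        new_mask.append(m * keep)  # multiplied element-wise into the loss mask
    valid = sum(loss_mask)
    return new_mask, (masked / valid if valid else 0.0)

# 4-token sequence; the last token was already excluded from the loss.
mask, frac = apply_token_mask([1, 1, 1, 0], [0.0, 1.0, -0.1, 0.0])
```

Here the second token (ratio e^1 > 2.0) is zeroed, so one of the three originally-valid tokens is masked and the logged fraction is 1/3.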

Metrics and Monitoring

If off-policy correction is enabled, you can view relevant metrics, like the mean/std of the importance sampling ratio, the mean/std of logprob diffs, and the fraction of masked sequences in the logger of your choice under policy/loss_metrics.

Some examples are shown below:

Importance Sampling Ratio

Log Probability Differences

Outlier Mask

References
