Off Policy Correction in SkyRL
Off-policy correction is supported with the FSDP and Megatron backends.
SkyRL provides built-in utilities for correcting off-policy drift from trainer/inference mismatch and AsyncRL. This guide covers:
- Sources of off-policy drift — why training and inference policies diverge
- Algorithmic corrections — importance sampling and sequence masking techniques
- Configuration in SkyRL — how to enable these corrections in your training runs
TLDR
We recommend trying the following configs, in order, in your training runs to help address off-policy drift:
```bash
# we recommend trying basic TIS correction first
trainer.algorithm.off_policy_correction.tis_ratio_type="token"
trainer.algorithm.off_policy_correction.token_tis_ratio_clip_high=2.0

# for long context + MoE models, try geometric sequence masking - tune geo_mask_high/geo_mask_low as needed
trainer.algorithm.off_policy_correction.sequence_mask_metric="geometric"
trainer.algorithm.off_policy_correction.geo_mask_high=1.01
trainer.algorithm.off_policy_correction.geo_mask_low=0.99

# alternatively, for long context + MoE you can try token masking (icepop) and tune token_mask_is_threshold_low/high
trainer.algorithm.off_policy_correction.token_mask_is_threshold_low=0.5
trainer.algorithm.off_policy_correction.token_mask_is_threshold_high=2.0

# for longer context + MoE, you can also try outlier based sequence masking, which stacks on top of geometric sequence masking
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_low=1e-4
trainer.algorithm.off_policy_correction.outlier_token_is_threshold_high=100
```

Setup
For common RL objectives (i.e. PPO/GRPO variants), we typically seek to optimize a token-wise objective of the form:

$$\mathcal{J}(\theta) = \mathbb{E}_{y \sim \pi_{\text{sample}}}\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \min\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}\big)\,\hat{A}_t\Big)\right]$$

where:
- $y \sim \pi_{\text{sample}}$ — samples drawn from the sampling policy
- $\pi_\theta(y_t \mid x, y_{<t})$ — probability under the current policy being optimized
- $\pi_{\text{sample}}(y_t \mid x, y_{<t})$ — probability under the sampling policy used during rollout
- $\hat{A}_t$ — advantage estimate, typically computed as group-relative rewards (with the std norm being optional per Dr. GRPO): $\hat{A}_i = \dfrac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$
- $r_t(\theta) = \dfrac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{sample}}(y_t \mid x, y_{<t})}$ — the PPO importance sampling ratio, correcting for distributional shift between the sampling policy and the current policy when taking multiple mini-batch steps for a single training batch
- $\epsilon_{\text{low}}, \epsilon_{\text{high}}$ — clipping bounds (can be asymmetric)
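As a concrete reference, the clipped token-wise objective above can be sketched in NumPy. This is an illustrative sketch, not SkyRL's actual loss code; the function name `ppo_token_objective` is hypothetical:

```python
import numpy as np

def ppo_token_objective(logp_new, logp_sample, advantages,
                        eps_low=0.2, eps_high=0.2):
    """Clipped PPO objective averaged over the tokens of one response.

    logp_new / logp_sample: per-token logprobs under the current and
    sampling policies; advantages: per-token advantage estimates.
    """
    ratio = np.exp(logp_new - logp_sample)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    # take the pessimistic (minimum) branch per token, then average over tokens
    return np.minimum(unclipped, clipped).mean()
```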
In most RL frameworks, there are two options for representing $\pi_{\text{sample}}$:
- $\pi_{\text{inference}}$, where $\pi_{\text{inference}}$ is the actual sampling policy via the inference engine
- $\pi_{\text{old}}$, where $\pi_{\text{old}}$ is the trainer policy (same weights, but potentially different parallelism/kernels)

By default in SkyRL (and in most RL frameworks), $\pi_{\text{old}}$ is used as an approximation of the rollout policy $\pi_{\text{inference}}$. This requires recomputing the logprobs of responses under the training policy by taking a forward pass with the training weights prior to updating the weights for a given training step. However, the goal is still to most accurately estimate the importance sampling ratio using the true sampling policy:

$$r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{inference}}(y_t \mid x, y_{<t})}$$
Off-policy drift in RL
We can quantify off-policy drift from this ideal by considering the following expansion:

$$\frac{\pi_\theta}{\pi_{\text{inference}}} = \frac{\pi_{\text{old}}}{\pi_{\text{inference}}} \cdot \frac{\pi_\theta}{\pi_{\text{old}}}$$

The first term:

$$\frac{\pi_{\text{old}}}{\pi_{\text{inference}}}$$

corresponds to the off-policy drift from training vs inference mismatch (due to system differences).

While the second term:

$$\frac{\pi_\theta}{\pi_{\text{old}}}$$

corresponds to the off-policy drift from policy staleness (due to parameter differences).
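Since frameworks store logprobs rather than probabilities, the factorization is easiest to see in log space, where the full ratio splits into a sum of logprob differences. A tiny sketch with made-up illustrative values (not from a real run):

```python
import math

# per-token logprobs of the same sampled token under the three policies
logp_inference = -1.20   # actual sampling policy (inference engine)
logp_old = -1.15         # trainer policy before the update (same weights)
logp_new = -1.05         # current policy during mini-batch updates

# full importance ratio and its two factors
total = math.exp(logp_new - logp_inference)
mismatch = math.exp(logp_old - logp_inference)   # training/inference mismatch
staleness = math.exp(logp_new - logp_old)        # policy staleness

# the full ratio is exactly the product of the two drift terms
assert abs(total - mismatch * staleness) < 1e-12
```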
Next, we discuss how each of these commonly occurs in RL training.
Training vs Inference Engine Mismatch
Training vs inference mismatch occurs due to discrepancies between the logprobs computed by the training backend (FSDP, Megatron) and the inference engine (e.g., vLLM). These discrepancies include:
- Kernel Mismatch: Optimized kernels for inference engines are often not batch invariant, causing $\pi_{\text{inference}}$ to differ from $\pi_{\text{old}}$ even with identical weights. This can be fixed by enabling batch-invariant kernels at the cost of slower inference.
- Inconsistent Expert Routing: The experts routed to by the trainer and the inference engine may not line up, causing mismatched logprobs. This can be fixed by introducing routing replay, which pins the expert routing in the training engine to the expert routing observed in the inference engine (Zheng et al. 2025, Ma et al. 2025).
- Different Parallelisms: Kernel mismatch and numeric drift can be exacerbated when training backends like FSDP and Megatron are configured with different parallelisms than inference engines like vLLM. For example, Yao et al. 2025 show that enabling Ulysses-style sequence parallelism on the trainer greatly increases trainer/inference mismatch.
Policy staleness
Policy staleness is caused by the following factors:
- Async RL: When doing fully asynchronous RL, each training batch can consist of trajectories which were partially (or even fully) computed using stale policies. To mitigate this, the max staleness of trajectories can be tuned to prevent trajectories that are too old from being used during training.
- Mini Batching: Breaking down a training step into multiple mini batches for multiple gradient steps per training batch is common for increasing training efficiency for online RL. Mini batching results in off-policy updates, which can be clamped within an acceptable range in the common dual clip formulation of the PPO loss. Tuning the number of mini batches per training batch can impact convergence of RL runs, and impact whether corrections like routing replay and masking are needed.
Algorithmic Off Policy Correction
In the previous section, we described some reasons why off-policy drift can occur, and some ways to mitigate it (e.g., batch invariant kernels, routing replay). However, these solutions come with tradeoffs (slower inference for batch invariant kernels, additional bias for routing replay), and are not sufficient to address all sources of drift, like fully async RL.
Recent works (Liu et al. 2025, Yao et al. 2025) have proposed additional techniques for off-policy correction. In this section, we describe these techniques and how to enable them in SkyRL.
Truncated Importance Sampling
Yao et al. 2025 propose adding a truncated importance sampling term (the training-inference mismatch term above, but clamped) to the loss formulation:

$$\min\left(\frac{\pi_{\text{old}}(y_t \mid x, y_{<t})}{\pi_{\text{inference}}(y_t \mid x, y_{<t})},\ C\right)$$

The original TIS blog post suggests applying this term using a token-wise ratio, but follow-up works like Liu et al. 2025 suggest applying a sequence-level term instead:

$$\min\left(\prod_{t=1}^{|y|}\frac{\pi_{\text{old}}(y_t \mid x, y_{<t})}{\pi_{\text{inference}}(y_t \mid x, y_{<t})},\ C\right)$$
```yaml
off_policy_correction:
  tis_ratio_type: null # null, "token", "sequence"
  token_tis_ratio_clip_high: 2.0
  sequence_tis_ratio_clip_high: 5.0
  ...
```

To enable TIS in SkyRL, set `tis_ratio_type` to either `token` or `sequence` to use a token-wise or sequence-wise correction term. If `tis_ratio_type` is `token`, `token_tis_ratio_clip_high` will be used for the clamping term $C$; if `sequence`, `sequence_tis_ratio_clip_high` will be used.
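A minimal sketch of the two truncation variants (the helper name `tis_weights` is hypothetical, not SkyRL's API; it assumes you have per-token logprobs from the trainer policy and the inference engine):

```python
import numpy as np

def tis_weights(logp_old, logp_inference, ratio_type="token",
                token_clip_high=2.0, sequence_clip_high=5.0):
    """Truncated importance sampling weights for one response."""
    token_ratio = np.exp(logp_old - logp_inference)
    if ratio_type == "token":
        # clamp each token's ratio independently
        return np.minimum(token_ratio, token_clip_high)
    if ratio_type == "sequence":
        # one clamped weight for the whole sequence (product of token ratios)
        seq_ratio = np.exp(np.sum(logp_old - logp_inference))
        return np.full_like(token_ratio, min(seq_ratio, sequence_clip_high))
    raise ValueError(ratio_type)
```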
Sequence Masking
Liu et al. 2025 propose masking out sequences whose sequence-level importance sampling ratio falls outside a given range (also used by Deepseek v3.2 and Cognition's SWE-Grep) to maintain training stability and tolerance for off-policy updates:

$$\text{mask}(y) = \mathbb{1}\left[\,\text{low} \le s(y) \le \text{high}\,\right]$$

where

$$s(y) = \left(\prod_{t=1}^{|y|} \frac{\pi_{\text{old}}(y_t \mid x, y_{<t})}{\pi_{\text{inference}}(y_t \mid x, y_{<t})}\right)^{1/|y|} \quad \text{(geometric)} \qquad \text{or} \qquad s(y) = \prod_{t=1}^{|y|} \frac{\pi_{\text{old}}(y_t \mid x, y_{<t})}{\pi_{\text{inference}}(y_t \mid x, y_{<t})} \quad \text{(product)}$$

Here, either the geometric mean or a simple product of token-wise importance sampling ratios can be used to determine whether a sequence should be masked.
```yaml
off_policy_correction:
  ...
  sequence_mask_metric: null # null, "product", "geometric"
  geo_mask_high: 1.01
  geo_mask_low: 0.99
  product_mask_high: 2.0
  product_mask_low: 0.5
  outlier_token_is_threshold_low: null
  outlier_token_is_threshold_high: null
  ...
```

To enable sequence masking in SkyRL, set `sequence_mask_metric` to either `product` or `geometric`. For `geometric`, set `geo_mask_high` and `geo_mask_low`; for `product`, set `product_mask_high` and `product_mask_low`.
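A sketch of the sequence-level check (hypothetical helper `sequence_mask`, not SkyRL's API; the default bounds mirror the config values above):

```python
import numpy as np

def sequence_mask(logp_old, logp_inference, metric="geometric",
                  geo_low=0.99, geo_high=1.01,
                  prod_low=0.5, prod_high=2.0):
    """Return True if the sequence should be kept, False if masked out."""
    log_ratio_sum = np.sum(logp_old - logp_inference)
    if metric == "geometric":
        # geometric mean of token-wise importance sampling ratios
        score = np.exp(log_ratio_sum / len(logp_old))
        return geo_low <= score <= geo_high
    if metric == "product":
        # plain product of token-wise ratios
        score = np.exp(log_ratio_sum)
        return prod_low <= score <= prod_high
    raise ValueError(metric)
```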
SkyRL also provides a way to mask sequences where any token has an importance sampling ratio outside a specified range. To enable masking sequences with outlier tokens, set `outlier_token_is_threshold_low` and `outlier_token_is_threshold_high`.
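A sketch of this outlier check (hypothetical helper name; a sequence is kept only if every token's ratio lies within bounds):

```python
import numpy as np

def outlier_sequence_mask(logp_old, logp_inference, low=1e-4, high=100.0):
    """Keep a sequence only if every token's IS ratio lies in [low, high]."""
    token_ratio = np.exp(logp_old - logp_inference)
    return bool(np.all((token_ratio >= low) & (token_ratio <= high)))
```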
Token Masking
Unlike sequence masking (which rejects entire sequences when a sequence-level metric exceeds a threshold), token masking zeros out only the individual tokens whose importance-sampling ratio falls outside an acceptable range. This is a finer-grained correction that preserves the rest of the sequence for learning. This technique was introduced by Zhou et al. as Icepop, and has since been used in the training of GLM-5.
The per-token mask is multiplied element-wise into the loss mask, so masked tokens contribute zero gradient while all other tokens in the sequence remain unaffected.
```yaml
off_policy_correction:
  ...
  token_mask_is_threshold_low: 0.5 # suggested starting value
  token_mask_is_threshold_high: 2.0 # suggested starting value
```

To enable token masking, set both `token_mask_is_threshold_low` and `token_mask_is_threshold_high`. Both must be set for the mask to activate; setting only one has no effect.
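A sketch of the per-token mask (hypothetical helper `token_mask`, not SkyRL's API; the default thresholds mirror the suggested starting values):

```python
import numpy as np

def token_mask(logp_old, logp_inference, loss_mask, low=0.5, high=2.0):
    """Zero out individual tokens whose IS ratio falls outside [low, high].

    Returns an updated loss mask; all other tokens are left unchanged.
    """
    token_ratio = np.exp(logp_old - logp_inference)
    keep = (token_ratio >= low) & (token_ratio <= high)
    return loss_mask * keep.astype(loss_mask.dtype)
```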
Token masking can be combined with other corrections (TIS, sequence masking, outlier masking). When multiple masks are active, they are applied multiplicatively to the loss mask in order: outlier sequence mask, then token mask, then sequence mask.
When enabled, the metric token_mask_ratio is logged, representing the fraction of originally-valid tokens that were zeroed by the token mask.
Metrics and Monitoring
If off-policy correction is enabled, you can view relevant metrics, like the mean/std of the importance sampling ratio, the mean/std of logprob diffs, and the fraction of masked sequences in the logger of your choice under `policy/loss_metrics`.