Configuration Overview
Data Configuration
data:
train_data: ["${oc.env:HOME}/data/gsm8k/train.parquet"]
val_data: ["${oc.env:HOME}/data/gsm8k/validation.parquet"]

- data.train_data: A list of files for the training dataset.
- data.val_data: A list of files for the evaluation dataset.
A dataset file can be a path to a parquet or json file, or the name of a Hugging Face dataset.
Currently, all datasets are loaded into memory, so the dataset size is limited by available CPU memory on a worker node.
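Since a dataset entry can be a parquet/json path or a Hugging Face dataset name, the following is a minimal illustrative sketch of loading a parquet file with the Hugging Face datasets library. The path is hypothetical, and skyrl-train's actual data loader may differ in detail.

```python
# Illustrative only: load a parquet dataset entry with Hugging Face `datasets`.
from datasets import load_dataset

train_files = ["/path/to/data/gsm8k/train.parquet"]  # hypothetical resolved value of data.train_data
ds = load_dataset("parquet", data_files={"train": train_files})["train"]
print(len(ds), ds[0].keys())  # datasets are held in memory, so size is bounded by CPU RAM
```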
Model Placement Configuration
placement:
colocate_all: true
colocate_policy_ref: true
colocate_critic_reward: false
policy_num_nodes: 1
policy_num_gpus_per_node: 4
critic_num_nodes: 1
critic_num_gpus_per_node: 4
ref_num_nodes: 1
ref_num_gpus_per_node: 4
reward_num_nodes: 1
reward_num_gpus_per_node: 4

For an in-depth guide on model placement and colocation, please refer to the model placement and colocation guide.
General Training Configuration
epochs: 1 # Number of passes over the full dataset
update_epochs_per_batch: 1
train_batch_size: 1024
policy_mini_batch_size: 256
critic_mini_batch_size: 256
micro_train_batch_size_per_gpu: 1
micro_forward_batch_size_per_gpu: 1
update_ref_every_epoch: false
use_sample_packing: true
max_prompt_length: 512
gradient_checkpointing: true
seed: 42

- epochs: Number of epochs, i.e. passes over the full dataset (similar to SFT).
- update_epochs_per_batch: Number of gradient update passes over each training batch. This is equivalent to the concept of "PPO epochs", where you iterate over the same experience multiple times.
- train_batch_size: Batch size of prompts used for each dataloader step.
- policy_mini_batch_size: Mini batch size used during the RL training step. Each mini batch corresponds to one optimizer step. For example, if train_batch_size is 4 and policy_mini_batch_size is 2, then there will be 2 optimizer steps (i.e., model updates) for a given training batch. Note that this is the global mini batch size; the actual mini batch size per worker is policy_mini_batch_size / number of DP ranks.
- critic_mini_batch_size: Similar to policy_mini_batch_size but for the critic model (if applicable). Note that, in general, the critic model can tolerate off-policy updates better than the policy. Thus, you may want to set critic_mini_batch_size lower than policy_mini_batch_size (i.e., more critic updates).
- micro_train_batch_size_per_gpu: Micro batch size during the training step. This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and gradients are computed and accumulated over these micro batches.
- micro_forward_batch_size_per_gpu: Micro batch size during the forward pass (i.e., for log probability or value computation). This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and the model forward pass is performed over these micro batches.
- update_ref_every_epoch: Whether to update the reference model every epoch.
- use_sample_packing: Whether to use sample packing during the model forward pass (common to all models).
- max_prompt_length: Maximum prompt length during training. Longer prompts will be truncated.
- gradient_checkpointing: Whether to use gradient checkpointing.
- seed: Random seed for training.

If you're facing issues with tuning the right values for micro_train_batch_size_per_gpu, policy_mini_batch_size, and micro_forward_batch_size_per_gpu, see utils/utils.py::validate_batch_sizes for details on constraints. A sketch of the batch size hierarchy is shown below.
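The following sketch illustrates the arithmetic described above, under the assumption of 4 data-parallel policy workers; the authoritative constraints live in utils/utils.py::validate_batch_sizes and may differ in detail.

```python
# Illustrative batch size arithmetic only (not the framework's validation code).
train_batch_size = 1024
policy_mini_batch_size = 256
micro_train_batch_size_per_gpu = 1
num_dp_ranks = 4  # assumed number of data-parallel policy workers

# One optimizer step per (global) mini batch.
optimizer_steps_per_training_batch = train_batch_size // policy_mini_batch_size          # 4

# Per-worker (DP rank) share of each global mini batch.
mini_batch_size_per_worker = policy_mini_batch_size // num_dp_ranks                      # 64

# Each worker splits its mini batch into micro batches and accumulates gradients.
grad_accumulation_steps = mini_batch_size_per_worker // micro_train_batch_size_per_gpu   # 64

assert train_batch_size % policy_mini_batch_size == 0
assert policy_mini_batch_size % num_dp_ranks == 0
print(optimizer_steps_per_training_batch, mini_batch_size_per_worker, grad_accumulation_steps)
```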
Evaluation Configuration
eval_batch_size: 1024
eval_before_train: true
eval_interval: 5 # Set to -1 to disable evaluation.

- eval_batch_size: Batch size for evaluation.
- eval_before_train: Whether to evaluate the model before training.
- eval_interval: The frequency of evaluating the model with the validation dataset (in terms of number of steps). If set to -1, evaluation will not be performed.

If multiple validation datasets are provided (e.g., data.val_data="['$DATA_DIR/validation1.parquet', '$DATA_DIR/validation2.parquet']"), then evaluation will be performed on all of them. The metrics for each dataset, as well as the aggregated metrics, will be logged in WandB. If dump_eval_results is set to true, the per-dataset and aggregated results will be dumped.
Checkpoint Configuration
resume_mode: latest # null/"none", "latest", "from_path"
resume_path: null
ckpt_path: "${oc.env:HOME}/ckpts/" # Local directory path or cloud storage path (S3, GCP) for resumable training checkpoints (model state, optimizer state, etc.)
max_ckpts_to_keep: -1 # -1 to keep all checkpoints, N to keep the last N checkpoints
ckpt_interval: 10 # Save full training checkpoint every `ckpt_interval` steps.
hf_save_interval: -1 # Save HF format model(s) every `hf_save_interval` steps.
export_path: "${oc.env:HOME}/exports/" # Path for exported artifacts (HF models, debug dumps, etc.)
project_name: "skyrl"
run_name: "test_run"
logger: "wandb"For an in-depth guide on checkpointing and resumption, please refer to the checkpointing guide.
Logging and Debugging Configuration
logger: "wandb"
project_name: "skyrl"
run_name: "test_run"
dump_data_batch: false
dump_eval_results: true

- logger: Logger to use. Currently, we support wandb, mlflow, and console. console will simply log metrics to the console.
- project_name: Name of the project in WandB and MLflow.
- run_name: Name of the run in WandB and MLflow.
- dump_data_batch: Whether to dump the data batch to a file. This is useful for debugging. When true, the data batch will be dumped to a file in the export_path directory. The training batch at global step N is saved to self.cfg.trainer.export_path / "dumped_data" / global_step_N_training_input.
- dump_eval_results: Whether to dump the evaluation results to a file. When true, the full evaluation results will be dumped to a file in the export_path directory. The evaluation results at global step N are saved to self.cfg.trainer.export_path / "dumped_eval" / global_step_N_eval_results.
Training Backends
We support three training backends: FSDP1, FSDP2, and Megatron. The backend can be chosen with the trainer.strategy field.
FSDP Configuration
We use the same configuration group for FSDP1 and FSDP2:
fsdp_config:
cpu_offload: false # offload params + optimizer state to cpu during fwd pass
reshard_after_forward: true # fsdp2 only, [True, False, int between 1 and fsdp_size]
fsdp_size: -1

- cpu_offload: Whether to train with CPU offloading (i.e., offload state during the forward pass). This corresponds to the cpu_offload parameter in FSDP1 and offload_policy in FSDP2.
- reshard_after_forward: Whether to re-shard the FSDP model after the forward pass. This is an FSDP2-specific configuration; please refer to the FSDP2 docs for more details. If set to false, the full model parameters are retained on each worker (similar to DeepSpeed's ZeRO stage 2).
- fsdp_size: The group size within which worker state is sharded with FSDP. This parameter is used for hybrid sharding in multi-node settings. For example, if the number of workers in the actor group is 8, with 4 on each node, and fsdp_size is 4, then the training state will be fully sharded across the 4 ranks in each node, but replicated (DP) across nodes. A sketch of this arithmetic is shown below.

Note that cpu_offload is different from worker state offloading with model colocation. In FSDP, cpu_offload offloads parameter and optimizer state to CPU memory and only copies model parameters to GPU during the model forward pass. In skyrl-train, we offload worker state in certain colocation settings; however, this happens only after the training step / log probability computation, so the optimizer step and model forward pass happen as usual with sharded parameters on GPU. For more details, refer to the guide on model placement and colocation.
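The following sketch works through the hybrid sharding arithmetic from the fsdp_size example above. It is purely illustrative; the actual process group construction is handled by the FSDP backend.

```python
# Illustrative only: how fsdp_size could translate into shard/replicate groups.
world_size = 8          # total policy workers (2 nodes x 4 GPUs, as in the example above)
fsdp_size = 4           # trainer.policy.fsdp_config.fsdp_size

shard_group_size = world_size if fsdp_size == -1 else fsdp_size   # -1 => fully shard across all workers
assert world_size % shard_group_size == 0
num_replica_groups = world_size // shard_group_size               # DP replicas of the sharded state

print(f"{num_replica_groups} replica group(s), each sharding state over {shard_group_size} ranks")
# With world_size=8 and fsdp_size=4: state is sharded within each node and replicated across the 2 nodes.
```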
Megatron Configuration
megatron_config:
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
context_parallel_size: 1
expert_model_parallel_size: 1
expert_tensor_parallel_size: null
ddp_config: # pass-through config to Megatron's `DistributedDataParallelConfig` object
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/distributed/distributed_data_parallel_config.py#L8
...
optimizer_config_kwargs: # pass-through kwargs to Megatron's `OptimizerConfig` object
# any overlapping arguments with those we attempt to resolve in trainer.policy.optimizer_config will be overridden by the values here
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/optimizer/optimizer_config.py#L12
...
model_config_kwargs: # pass-through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)
...
transformer_config_kwargs: # pass-through kwargs to the Megatron's `TransformerConfig` object
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/transformer/transformer_config.py#L33
...
lora_config:
# see: https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html for details - currently "lora" and "canonical_lora" are supported
lora_type: "lora"
# flag to manually empty torch's cuda cache between the forward/backward pass and the optimizer step
# this will free reserved but unallocated memory, and can help avoid OoMs in the optimizer
empty_cuda_cache: true

- megatron_config.tensor_model_parallel_size: Tensor model parallel size for reducing memory across model parameters and activations. Sequence parallelism (unrelated to Ulysses sequence parallelism) is also enabled by default if the tensor parallel size is greater than 1.
- megatron_config.pipeline_model_parallel_size: Pipeline model parallel size for sharding model layers across multiple GPUs.
- megatron_config.context_parallel_size: Context parallel size for reducing activation memory across the sequence length dimension.
- megatron_config.expert_model_parallel_size: The expert parallel size for sharding expert modules across multiple GPUs.
- megatron_config.expert_tensor_parallel_size: The tensor parallel size for each expert module. If set to null, the value will be resolved to tensor_model_parallel_size by Megatron. It is recommended to set this to 1 when enabling expert_model_parallel_size > 1 for the best performance.
Some rules for configuring these parameters (a sketch of these checks follows below):

- model_size = pp_size * tp_size * cp_size
- dp_size = world_size / model_size
- world_size % (pp_size * ep_size * etp_size) == 0
- This means that ep_size * etp_size can scale independently of tp_size * cp_size, and can go across data parallel ranks.

Note that optimizer_config_kwargs.use_precision_aware_optimizer=true can cause checkpointing to fail. See: https://github.com/nvidia/megatron-lm/issues/1820. We recommend leaving this setting as false.
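A small sanity check of the rules above, written as plain Python. This is illustrative only and not the validation skyrl-train or Megatron performs internally.

```python
# Illustrative check of the Megatron parallelism rules described above.
def check_megatron_parallelism(world_size, tp_size, pp_size, cp_size, ep_size=1, etp_size=None):
    if etp_size is None:          # expert_tensor_parallel_size: null resolves to tp_size
        etp_size = tp_size
    model_size = pp_size * tp_size * cp_size
    assert world_size % model_size == 0, "world_size must be divisible by pp*tp*cp"
    dp_size = world_size // model_size
    assert world_size % (pp_size * ep_size * etp_size) == 0, "world_size must be divisible by pp*ep*etp"
    return dp_size

# Example: 16 GPUs, tp=2, pp=2, cp=1, ep=4, etp=1 -> dp_size = 4
print(check_megatron_parallelism(16, tp_size=2, pp_size=2, cp_size=1, ep_size=4, etp_size=1))
```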
Optimizer Configuration
For both the critic and policy models, we provide a common optimizer configuration:
optimizer_config:
lr: 1.0e-6
adam_betas: [0.9, 0.999]
weight_decay: 1e-2
max_grad_norm: 1.0
offload_after_step: true
num_warmup_steps: 0
scheduler: "constant_with_warmup"optimizer_config.lr: Learning rate for the optimizeroptimizer_config.adam_betas: Betas for AdamW optimizer.optimizer_config.weight_decay: L2 regularization strength for AdamW.optimizer_config.max_grad_norm: Gradient clipping parameter. The total L2 norm of the model gradients will be scaled to this value during training.optimizer_config.offload_after_step: Whether to offload optimizer state to CPU after step if colocated. When generation and training workers are colocated, we recommend using the default setting oftrue. In some cases with non-colocation, it can be desirable to leave optimizer state on GPU memory to avoid offloading costs as well as additional CPU memory usage.optimizer_config.num_warmup_steps: Number of mini-batch steps to warmup the optimizer for.optimizer_config.scheduler: Which learning rate scheduler to use. Intended to align withtransformers.SchedulerTypefrom Huggingface.
Policy Configuration
This section configures the policy model used for training, including optimizer, FSDP, sequence parallelism, and LoRA options.
policy:
model:
path: "Qwen/Qwen2.5-1.5B-Instruct" # Hugging Face model path for the policy model
lora:
rank: 0 # LoRA rank (0 = disabled)
alpha: 16 # LoRA scaling parameter
dropout: 0 # LoRA dropout rate
lora_sync_path: "/tmp/skyrl_lora_sync" # Path for LoRA adapter sync
target_modules: "all-linear" # Apply to all linear layers OR
# specify specific modules as a list
exclude_modules: null # Modules to exclude from LoRA
# For FSDP, this corresponds to `init_lora_weights` in PEFT. See: https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig
# For Megatron, this is used for `lora_A_init_method`, and "xavier", "normal", "kaiming", and "zero" are supported.
init_method: "kaiming" # Initialization method for LoRA layers
optimizer_config:
lr: 1.0e-6 # Learning rate
adam_betas: [0.9, 0.999] # Betas for Adam optimizer
weight_decay: 1e-2 # L2 regularization strength
max_grad_norm: 1.0 # Gradient clipping
offload_after_step: true # Offload optimizer state to CPU after step (if colocated)
fsdp_config:
cpu_offload: false # Offload model params to CPU during forward
reshard_after_forward: true # Re-shard FSDP model after forward pass
fsdp_size: -1 # Auto FSDP group sizing
sequence_parallel_size: 1 # sequence parallel size
use_torch_compile: false # Enable torch compile for the entropy calculation
record_memory: false # Dump memory snapshot for debugging
model_config_kwargs: {} # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use policy.megatron_config.transformer_config_kwargs instead

- policy.optimizer_config: Optimizer configuration for the policy model.
- policy.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
- policy.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.
- policy.use_torch_compile: Whether to enable torch compile for the entropy calculation.
- policy.record_memory: Whether to record memory usage. If true, this will use PyTorch's memory snapshotting utility to record memory usage and dump memory snapshots after each policy model training step.
LoRA Configuration
LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning by training only a small number of additional low-rank matrices instead of the full model weights:
- policy.model.lora.rank: LoRA rank for the low-rank decomposition. Set to 0 to disable LoRA. Higher values increase model capacity but also memory usage. Common values include 8, 16, 32, or 64.
- policy.model.lora.alpha: Scaling factor for LoRA updates.
- policy.model.lora.dropout: Dropout probability applied to LoRA layers. Helps prevent overfitting during training.
- policy.model.lora.lora_sync_path: Directory path where LoRA adapter weights are saved and synchronized between training and inference processes. Must be accessible to all workers in distributed setups.
- policy.model.lora.init_method: Initialization method for LoRA layers. For FSDP, this corresponds to init_lora_weights in PEFT; 'kaiming' is mapped to 'true' by default for PEFT. For Megatron, this is used for lora_A_init_method, and "xavier", "normal", "kaiming", and "zero" are supported.
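For reference, here is a minimal PEFT-based sketch of what the policy.model.lora settings roughly correspond to for the FSDP backends. skyrl-train builds this internally, so this is illustrative rather than the framework's actual code path; the rank value of 16 is just an example.

```python
# Illustrative PEFT equivalent of the LoRA settings above (not skyrl-train's code path).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora_config = LoraConfig(
    r=16,                         # policy.model.lora.rank (0 disables LoRA in skyrl-train)
    lora_alpha=16,                # policy.model.lora.alpha
    lora_dropout=0.0,             # policy.model.lora.dropout
    target_modules="all-linear",  # policy.model.lora.target_modules
    init_lora_weights=True,       # policy.model.lora.init_method="kaiming" maps to True in PEFT
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```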
Critic Configuration
We support similar configuration options as the policy model, including LoRA.
critic:
model:
path: null
lora:
rank: 0 # LoRA rank (0 = disabled)
alpha: 16 # LoRA scaling parameter
dropout: 0 # LoRA dropout rate
target_modules: "all-linear"
exclude_modules: null # Modules to exclude from LoRA
init_method: "kaiming" # Initialization method for LoRA layers
optimizer_config:
lr: 5.0e-6
adam_betas: [0.9, 0.999]
weight_decay: 1e-2
max_grad_norm: 1.0 # gradient clipping
offload_after_step: true # offload optimizer state to cpu after each step. Applicable only when `colocate_all=true`
fsdp_config:
cpu_offload: false
reshard_after_forward: true
fsdp_size: -1
sequence_parallel_size: 1
model_config_kwargs: {} # pass through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)

Reference Model Configuration
ref:
model:
path: ${trainer.policy.model.path}
fsdp_config:
cpu_offload: false
reshard_after_forward: true
fsdp_size: -1
sequence_parallel_size: 1
model_config_kwargs: {} # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use ref.megatron_config.transformer_config_kwargs instead

- ref.model.path: Path to the reference model. Defaults to the policy model path, but can be set separately (e.g., for distillation-based approaches, the reference model can be a different model than the policy model).
- ref.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
- ref.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.
The reference model is used only if the base model log probabilities are required either as a part of the training loss or as a part of the reward. Thus, trainer.algorithm.use_kl_in_reward or trainer.algorithm.use_kl_loss should be set to true to use the reference model. If both are false, then the reference model is not instantiated.
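The condition described above can be summarized in a short sketch; the attribute path mirrors the config keys in this section and is illustrative, not the framework's exact code.

```python
# Illustrative only: when the reference model is needed, per the rule above.
def needs_reference_model(cfg) -> bool:
    algo = cfg.trainer.algorithm
    return bool(algo.use_kl_in_reward or algo.use_kl_loss)

# If this returns False, the reference model workers are never instantiated,
# saving the GPU/CPU memory they would otherwise occupy.
```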
Algorithm Configuration
algorithm:
advantage_estimator: "grpo" # "grpo", "gae", or customizable with AdvantageEstimatorRegistry
# KL Penalty Parameters
kl_ctrl: # only used if use_kl_in_reward is true (not applied in the case of use_kl_loss=true) - uses kl_loss_coef as the initial KL coefficient
type: "fixed" # "fixed" or "adaptive"
kl_target: 0.1 # target KL divergence for adaptive KL controller
horizon: 10000 # controls the update rate of the adaptive KL controller
kl_estimator_type: "k3" # "k1", "k2", "k3", "abs" - see http://joschu.net/blog/kl-approx.html for details
# note: use_kl_in_reward and use_kl_loss should be mutually exclusive
use_kl_in_reward: false # apply kl loss to rewards
use_kl_loss: true # used in policy model
kl_loss_coef: 0.001
# this adds training batch level normalization to advantages
advantage_batch_normalize: false
value_head_prefix: "value_head"
policy_loss_type: "regular" # "regular", "dual_clip", "gspo", "clip_cov", "kl_cov" or customizable with PolicyLossRegistry
loss_reduction: "token_mean" # "token_mean", "sequence_mean", "seq_mean_token_sum_norm"
grpo_norm_by_std: true # set to false to disable normalization by std in GRPO (used in Dr. GRPO)
zero_variance_filter: false # set to true to loss mask out prompts with zero variance rewards. only applicable when rewards are response-level.
# GAE parameters
lambd: 1.0
gamma: 1.0
# PPO parameters
eps_clip_low: 0.2
eps_clip_high: 0.2
# dual clip parameters
clip_ratio_c: 3.0
# clip-cov parameters (only used when policy_loss_type: "clip_cov")
clip_cov:
clip_ratio: 0.0002 # fraction of tokens to clip based on covariance
clip_cov_lb: 1.0 # lower bound for covariance clipping
clip_cov_ub: 5.0 # upper bound for covariance clipping
# kl-cov parameters (only used when policy_loss_type: "kl_cov")
kl_cov:
kl_cov_frac: 0.2 # percentage of tokens to apply KL regularization to (20%)
ppo_kl_coef: 1.0 # coefficient for KL regularization term
# cispo parameters (only used when policy_loss_type: "cispo")
cispo:
cispo_eps_clip_low: 0 # offset for lower bound of importance sampling ratio clipping (as opposed to PPO token update clipping)
cispo_eps_clip_high: 5 # offset for upper bound of importance sampling ratio clipping (as opposed to PPO token update clipping)
# value loss parameters
value_clip: 0.2
# dynamic sampling parameters
dynamic_sampling:
type: null # filter (DAPO), replace (POLARIS/WebSailor), or null
max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever
min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only)
# Truncated Importance Sampling as proposed in https://fengyao.notion.site/off-policy-rl
use_tis: false
tis_imp_ratio_cap: -1.0
# SAPO parameters (only used when policy_loss_type: "sapo") (https://arxiv.org/pdf/2511.20347)
sapo:
tau_pos: 1.0
tau_neg: 1.05 # default values used in the paper with Qwen3-30B-A3B-Base

- algorithm.advantage_estimator: Advantage estimator to use. We currently implement grpo, gae, rloo, and reinforce++, and custom advantage estimators can be registered with the AdvantageEstimatorRegistry.
- algorithm.kl_ctrl: Configuration for the KL controller - only used if use_kl_in_reward is true (not applied in the case of use_kl_loss=true). kl_loss_coef is used as the initial KL coefficient for both fixed and adaptive KL controllers.
  - type: Type of KL controller to use. Options include: fixed or adaptive.
  - kl_target: Target KL divergence for the adaptive KL controller.
  - horizon: Controls the update rate of the adaptive KL controller.
- algorithm.kl_estimator_type: KL estimator type to use. Options include: k1, k2, k3, abs. See this blog post for details. We use k3 as the default. A sketch of these estimators is shown after this list.
- algorithm.use_kl_in_reward: Whether to apply a KL divergence penalty to rewards. The new rewards will be computed as rewards - kl * kl_loss_coef.
- algorithm.use_kl_loss: Whether to add a KL divergence loss to the policy model. The policy loss will be computed as policy_loss + kl * kl_loss_coef.
- algorithm.kl_loss_coef: Coefficient for the KL divergence loss.
- algorithm.advantage_batch_normalize: Whether to normalize advantages by the (global) batch mean and standard deviation.
- algorithm.value_head_prefix: The name used to identify the value head in the critic model.
- algorithm.policy_loss_type: Type of policy loss to use. Options include:
  - regular: Vanilla PPO loss with token-level importance sampling.
  - dual_clip: Dual clip PPO loss proposed in this paper.
  - gspo: Group Sequence Policy Optimization with sequence-level importance sampling for improved training stability. Implements the "GSPO-token" variant from the paper.
  - clip_cov: Clip-Cov combines standard PPO clipping with covariance-based correction masking for improved stability. Based on this paper.
  - kl_cov: KL-Cov applies KL regularization to tokens selected based on covariance values. Based on this paper.
  - cispo: Clipped Importance Sampling Weight Policy Optimization (CISPO) proposed in MiniMax-M1.
  - Custom policy losses can be registered with the PolicyLossRegistry.
- algorithm.loss_reduction: Type of loss reduction to use. Options include:
  - token_mean: computes the average loss over all valid tokens in the batch. Used in DAPO.
  - sequence_mean: computes the per-sequence average token loss, then averages over the batch.
  - seq_mean_token_sum_norm: computes the sum of token losses for each sequence, normalizes by the max sequence length (computed as cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length), and then averages over the batch. This is used in Dr. GRPO.
- algorithm.grpo_norm_by_std: Whether to normalize advantages by the standard deviation in GRPO. This is set to false in Dr. GRPO.
- algorithm.zero_variance_filter: Whether to loss mask out prompts with zero variance rewards. This is only applicable when rewards are response-level.
- algorithm.lambd: Lambda parameter for GAE.
- algorithm.gamma: Gamma parameter for GAE.
- algorithm.eps_clip_low: Lower bound for PPO clipping.
- algorithm.eps_clip_high: Upper bound for PPO clipping.
- algorithm.clip_ratio_c: Clip ratio for the dual clip PPO loss.
- algorithm.value_clip: Clip value for the value loss.
- algorithm.dynamic_sampling: Dynamic sampling configuration.
  - algorithm.dynamic_sampling.type: Type of dynamic sampling to use. Currently, we support filter (DAPO), replace (POLARIS / WebSailor), or null for no dynamic sampling.
  - algorithm.dynamic_sampling.max_sample_batches: Maximum number of batches to sample before stopping. Set to -1 to sample forever.
  - algorithm.dynamic_sampling.min_replace_ratio: Minimum proportion of good samples with which to replace bad samples for the replace strategy.
- algorithm.use_tis: Whether to use Truncated Importance Sampling (TIS) as proposed in this blog.
- algorithm.tis_imp_ratio_cap: Cap parameter for the importance ratio in TIS.
- algorithm.clip_cov: Clip-Cov parameters (only used when policy_loss_type is clip_cov):
  - clip_ratio: Fraction of tokens to clip based on covariance values.
  - clip_cov_lb: Lower bound for covariance clipping.
  - clip_cov_ub: Upper bound for covariance clipping.
- algorithm.kl_cov: KL-Cov parameters (only used when policy_loss_type is kl_cov):
  - kl_cov_frac: Percentage of tokens to apply KL regularization to.
  - ppo_kl_coef: Coefficient for the KL regularization term.
- algorithm.cispo: CISPO parameters (only used when policy_loss_type is cispo):
  - cispo_eps_clip_low: Offset for the lower bound of importance sampling ratio clipping. Tokens with an importance sampling ratio less than 1 - cispo_eps_clip_low will have their ratio clipped, but can still be updated in the policy gradient update.
  - cispo_eps_clip_high: Offset for the upper bound of importance sampling ratio clipping. Tokens with an importance sampling ratio greater than 1 + cispo_eps_clip_high will have their ratio clipped, but can still be updated in the policy gradient update.
- algorithm.sapo: SAPO (as proposed in this paper) parameters (only used when policy_loss_type is sapo):
  - tau_pos: Temperature for the gating function for tokens with positive advantages.
  - tau_neg: Temperature for the gating function for tokens with negative (or zero) advantages.
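The sketch below shows the per-token KL estimators named by algorithm.kl_estimator_type, following the approximations in the linked blog post (http://joschu.net/blog/kl-approx.html) for KL(policy || ref) estimated from tokens sampled from the policy. The exact sign and masking conventions in skyrl-train may differ; this is an illustration, not the framework's implementation.

```python
# Illustrative per-token KL estimators (k1, k2, k3, abs).
import torch

def kl_penalty(log_probs: torch.Tensor, ref_log_probs: torch.Tensor, estimator: str = "k3") -> torch.Tensor:
    log_ratio = log_probs - ref_log_probs  # log(pi(x) / pi_ref(x)) per token
    if estimator == "k1":
        return log_ratio
    if estimator == "abs":
        return log_ratio.abs()
    if estimator == "k2":
        return 0.5 * log_ratio.pow(2)
    if estimator == "k3":
        # k3 = (r - 1) - log r with r = pi_ref / pi; lower-variance, always non-negative
        return (-log_ratio).exp() - 1 + log_ratio
    raise ValueError(f"Unknown kl_estimator_type: {estimator}")

# With use_kl_in_reward=true, the penalized reward is roughly: rewards - kl * kl_loss_coef.
```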
Policy Loss Formulation
It can be helpful to understand the final loss formulation to see how the different configuration options are used. The final loss is computed as below in the ppo_policy_loss function.
def ppo_policy_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    config: DictConfig,  # trainer.algorithm config
    loss_mask: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - config.eps_clip_low, 1 + config.eps_clip_high) * advantages
    loss = -torch.min(surr1, surr2)
    clip_ratio = masked_mean((-surr2 > -surr1).float(), loss_mask).mean().detach().item()
    clip_pg_losses1 = loss
    if config.policy_loss_type == "dual_clip":
        pg_losses3 = -advantages * config.clip_ratio_c
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        loss = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    loss = reduce_loss(loss, loss_mask, config.loss_reduction)
    return loss, clip_ratio

Generator Configuration
generator:
model_dtype: "bfloat16" # should match dtype for inference engine
run_engines_locally: true
num_inference_engines: 1
backend: "vllm"
weight_sync_backend: "nccl"
inference_engine_tensor_parallel_size: 4
inference_engine_pipeline_parallel_size: 1
inference_engine_expert_parallel_size: 1
inference_engine_data_parallel_size: 1
n_samples_per_prompt: 5
async_engine: true
batched: true
max_input_length: ${trainer.max_prompt_length} # max generator input length used for multi-turn conversations - for single turn set equal to max_prompt_length
enable_prefix_caching: true
enable_chunked_prefill: true
max_num_batched_tokens: 8192
enforce_eager: false
gpu_memory_utilization: 0.8
max_num_seqs: 1024
remote_inference_engine_urls: ["127.0.0.1:8001"]
max_turns: 1
# Custom chat template configuration if needed
chat_template:
source: "name" # "name" or "file"
name_or_path: null # e.g., "qwen3_with_thinking" or "/path/to/template.j2"
# Chat templating kwargs to pass to `tokenizer.apply_chat_template`
chat_template_kwargs: {}
engine_init_kwargs: {}
override_existing_update_group: "auto" # "auto", "enable", "disable"
# sampling params for generation phase
sampling_params:
max_generate_length: 1024
temperature: 1.0
top_p: 1.0
min_p: 0.0
top_k: -1
logprobs: 0
use_conversation_multi_turn: true
# sampling params for evaluation
eval_sampling_params:
max_generate_length: ${generator.sampling_params.max_generate_length}
temperature: 1.0
top_p: 1.0
min_p: 0.0
top_k: -1
logprobs: 0
# number of samples per prompt for evaluation
eval_n_samples_per_prompt: 1
zero_reward_on_non_stop: false
apply_overlong_filtering: false

Inference Engine Placement Configuration
- generator.run_engines_locally: Whether to use local inference engines. If true, the inference engines will be initialized during the training run in the current Ray cluster. We use one Ray actor per inference replica, and communication happens via the Ray object store. If set to false, then the generator expects a list of remote URLs and communication happens over HTTP.
- generator.num_inference_engines: Number of inference engines to use. If run_engines_locally is false, then this number should match the number of remote URLs.
- generator.remote_inference_engine_urls: List of remote URLs to use. Applicable only when run_engines_locally is false.
- generator.enable_http_endpoint: When true, launch an OpenAI-compatible HTTP endpoint for the inference engine client so that generators can send requests to this server instead of using .generate() Python calls.
- generator.http_endpoint_host: Host for the inference HTTP endpoint.
- generator.http_endpoint_port: Port for the inference HTTP endpoint.
For more details on how different placement options work, please refer to the placement guide.
Weight Transfer Configuration
- generator.weight_sync_backend: Backend to use for weight synchronization. Currently, we support nccl and gloo.
- generator.override_existing_update_group: Whether to override the existing update group for the inference engine. This is applicable only for remote inference engines. During training, skyrl-train forms a custom process group (the "update group") with the rank 0 training worker and all the inference engine ranks. If override_existing_update_group=enable, then during initialization, a previous weight update group will be overridden in the inference engine. For example, if you have a remote server setup and you run training for the same model multiple times, it is helpful to override the previous update group. We recommend leaving this as auto, since it will automatically determine whether the previous update group should be overridden based on run_engines_locally.
Inference Engine Configuration
- generator.backend: Backend to use for the inference engine. We support vllm and sglang. sglang is supported only for remote inference engines at the moment.
- generator.model_dtype: Dtype used for the inference engine. This is also used during weight transfer - the policy model weights are cast to this dtype before being sent to the inference engine.
- generator.async_engine: Whether to use an asynchronous/offline inference engine. Applicable only when backend="vllm".
- generator.inference_engine_tensor_parallel_size: Tensor parallel size for the inference engine.
- generator.inference_engine_pipeline_parallel_size: Pipeline parallel size for the inference engine. Currently, PP is only supported for the vLLM backend with async_engine=true.
- generator.inference_engine_expert_parallel_size: Expert parallel size for the inference engine. Currently, EP is only supported for the vLLM backend, and ep_size must equal dp_size * tp_size.
- generator.inference_engine_data_parallel_size: Data parallel size for the inference engine. Currently, DP is only supported for the vLLM backend.
- generator.gpu_memory_utilization: GPU memory utilization for the inference engine. Applicable only for run_engines_locally=true.
- generator.vllm_v1_disable_multiproc: If true, this will set VLLM_ENABLE_V1_MULTIPROCESSING=0 in the environment, which makes the scheduling deterministic. This is useful for reproducibility.
- generator.enable_prefix_caching: Whether to enable prefix caching for the inference engine. Applicable only when backend="vllm". This can be left at the default true in most cases. Note that in the case of remote inference engines, you would need to match the setting used when you initialized the remote servers.
- generator.enable_chunked_prefill: Whether to enable chunked prefill for the inference engine. Applicable only when backend="vllm". With vLLM, this can be left at the default true in most cases.
- generator.max_num_seqs: Continuous batching parameter for vLLM. Maximum number of sequences to pack into a batch.
- generator.max_num_batched_tokens: Continuous batching parameter for vLLM. Maximum number of tokens to pack into a batch.
Generation Parameters
- generator.n_samples_per_prompt: Number of samples to generate per prompt. Note that the total size of the training batch will be trainer.train_batch_size * generator.n_samples_per_prompt.
- generator.batched: Whether to use batched inference. This is applicable only for single-turn generation.
- generator.max_input_length: Maximum input length for the inference engine. For single-turn generation, this can be the same as trainer.max_prompt_length (i.e., the initial prompt length). For multi-turn generation, this is the maximum input length used for multi-turn conversations at each turn.
- generator.sampling_params: Sampling parameters for the inference engine during the trajectory generation phase (see the sketch after this list).
  - generator.sampling_params.max_generate_length: Maximum length of the generated response.
  - generator.sampling_params.temperature: Temperature for the inference engine.
  - generator.sampling_params.top_p: Top-p sampling parameter for the inference engine.
  - generator.sampling_params.min_p: Min-p sampling parameter for the inference engine, as proposed in this paper.
  - generator.sampling_params.top_k: Top-k sampling parameter for the inference engine.
  - generator.sampling_params.logprobs: Number of logprobs to return from the inference engine. Set to 0 to return only the chosen token's logprob.
- generator.eval_sampling_params: Sampling parameters for evaluation.
- generator.eval_n_samples_per_prompt: Number of samples to generate per prompt for evaluation.
- generator.max_turns: Maximum number of turns for generation with multi-turn RL.
- generator.use_conversation_multi_turn: Whether to use conversation format for multi-turn generation. If set to true, observations are appended to the chat history as a new turn. If set to false, observations are appended as-is to the assistant response in token space and generation is continued (after removing any EOS token in the response). We've observed some cases where the model can be sensitive to chat history format (e.g., in SkyRL-SQL), and thus false can be used for full control over the exact tokens added after environment interaction.
- generator.engine_init_kwargs: Inference engine arguments passed directly to the vLLM or SGLang engine. To specify an engine arg in the CLI override, use the format: +generator.engine_init_kwargs.[arg_name]=value. If duplicate kwargs are passed, or kwargs clash with existing generator arguments (e.g., tensor_parallel_size), an error is raised.
- generator.chat_template: Custom chat template configuration, if needed.
  - generator.chat_template.source: Source of the chat template. Can be either name or file.
  - generator.chat_template.name_or_path: Name or path of the chat template. If the source is name, then it should be one of the supported templates in skyrl_train/generators/utils.py. If the source is file, then this field should be a path to a Jinja2 template file.
- generator.chat_template_kwargs: Chat templating kwargs to pass to tokenizer.apply_chat_template. Applicable only for non-batched generation with generator.batched=false.
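For reference, the sketch below shows roughly how the generator.sampling_params fields map onto vLLM's SamplingParams when backend="vllm". The exact translation inside skyrl-train may differ (e.g., for multi-turn generation), so treat this as an illustration only.

```python
# Illustrative mapping of generator.sampling_params onto vLLM's SamplingParams.
from vllm import SamplingParams

sampling_params = SamplingParams(
    max_tokens=1024,   # generator.sampling_params.max_generate_length
    temperature=1.0,   # generator.sampling_params.temperature
    top_p=1.0,         # generator.sampling_params.top_p
    min_p=0.0,         # generator.sampling_params.min_p
    top_k=-1,          # generator.sampling_params.top_k (-1 disables top-k)
    logprobs=0,        # generator.sampling_params.logprobs (0 = only the chosen token's logprob)
)
```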
Misc Configuration
- generator.zero_reward_on_non_stop: Whether to set the reward to 0 if the stop_reason is not stop. Cases where this is useful: often, we have format rewards for the LLM to follow, but in cases where the LLM didn't finish the response, we typically don't want to reward it. This is a general setting for all environments.
- generator.apply_overlong_filtering: Whether to apply DAPO Overlong Filtering to the loss masks. For each trajectory that exceeds the max length (i.e., it is truncated and does not end with an EOS token), this masks out every token in the loss mask. A sketch of both behaviors is shown below.
- generator.step_wise_trajectories: Whether to return outputs in a step-wise fashion. If true, then the generator will return multi-turn generations with the (prompt, response) pair of each turn being a separate trajectory. Advantages are computed based on the last step of each trajectory and propagated to the previous steps.
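The following sketch illustrates how the two post-processing options above could act on a batch of trajectories. It assumes the engine reports a stop_reason of "stop" for normal completion and "length" for truncation; skyrl-train's actual implementation and stop reason values may differ.

```python
# Illustrative post-processing only (not skyrl-train's implementation).
from typing import List, Tuple

def postprocess(rewards: List[float], loss_masks: List[List[int]], stop_reasons: List[str],
                zero_reward_on_non_stop: bool, apply_overlong_filtering: bool) -> Tuple[List[float], List[List[int]]]:
    for i, stop_reason in enumerate(stop_reasons):
        if zero_reward_on_non_stop and stop_reason != "stop":
            # Trajectory did not finish normally: withhold its reward.
            rewards[i] = 0.0
        if apply_overlong_filtering and stop_reason == "length":  # assumed truncation stop reason
            # DAPO overlong filtering: drop truncated trajectories from the loss entirely.
            loss_masks[i] = [0] * len(loss_masks[i])
    return rewards, loss_masks
```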