Configuration

Configuration Overview

Data Configuration

data:
  train_data: ["${oc.env:HOME}/data/gsm8k/train.parquet"]
  val_data: ["${oc.env:HOME}/data/gsm8k/validation.parquet"]
  • data.train_data: A list of files for the training dataset.
  • data.val_data: A list of files for the evaluation dataset.

A dataset file can be a path to a parquet or json file, or the name of a Hugging Face dataset.

Currently, all datasets are loaded into memory, so the dataset size is limited by available CPU memory on a worker node.

Model Placement Configuration

placement:
  colocate_all: true
  colocate_policy_ref: true
  colocate_critic_reward: false
  policy_num_nodes: 1
  policy_num_gpus_per_node: 4
  critic_num_nodes: 1
  critic_num_gpus_per_node: 4
  ref_num_nodes: 1
  ref_num_gpus_per_node: 4
  reward_num_nodes: 1
  reward_num_gpus_per_node: 4

For an in-depth guide on model placement and colocation, please refer to the model placement and colocation guide.

General Training Configuration

epochs: 1  # Number of passes over the full dataset
update_epochs_per_batch: 1
train_batch_size: 1024
policy_mini_batch_size: 256
critic_mini_batch_size: 256
micro_train_batch_size_per_gpu: 1
micro_forward_batch_size_per_gpu: 1
update_ref_every_epoch: false
use_sample_packing: true
max_prompt_length: 512
gradient_checkpointing: true
seed: 42
  • epochs: Number of epochs/passes over the full dataset (similar to SFT).

  • update_epochs_per_batch: Number of gradient update passes over each training batch. This is equivalent to the concept of "PPO epochs" where you iterate over the same experience multiple times.

  • train_batch_size: Batch size of prompts used for each dataloader step.

  • policy_mini_batch_size: Mini batch size used during the RL training step. Each mini batch corresponds to one optimizer step. For example, if train_batch_size is 4 and policy_mini_batch_size is 2, then there will be 2 optimizer steps (i.e., model updates) for a given training batch. Note that this is the global mini batch size; the actual mini batch size per worker is policy_mini_batch_size / number of DP ranks (see the sketch after this list).

  • critic_mini_batch_size: Similar to policy_mini_batch_size but for the critic model (if applicable). Note that, in general, the critic model can tolerate off-policy updates better than the policy. Thus, you may want to set critic_mini_batch_size lower than policy_mini_batch_size (i.e., more critic updates per training batch).

  • micro_train_batch_size_per_gpu: Micro batch size during the training step. This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and gradients are computed and accumulated over these micro batches.

  • micro_forward_batch_size_per_gpu: Micro batch size during the forward pass (i.e., for log probability or value computation). This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and the model forward pass is performed over these micro batches.

  • update_ref_every_epoch: Whether to update the reference model every epoch.

  • use_sample_packing: Whether to use sample packing during model forward pass (common for all models).

  • max_prompt_length: Maximum prompt length during training. Longer prompts will be truncated.

  • gradient_checkpointing: Whether to use gradient checkpointing.

  • seed: Random seed for training.

    If you're having trouble finding valid values for micro_train_batch_size_per_gpu, policy_mini_batch_size, and micro_forward_batch_size_per_gpu, see utils/utils.py::validate_batch_sizes for details on the constraints.
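
To make these knobs concrete, here is a small standalone sketch (illustrative values only, working in prompt units and ignoring generator.n_samples_per_prompt) of how a training batch splits into optimizer steps and gradient-accumulation steps; the divisibility checks at the end approximate the kind of constraints validate_batch_sizes enforces.

# Illustrative only: the values below are examples, not defaults.
train_batch_size = 1024              # prompts fetched per dataloader step
policy_mini_batch_size = 256         # global mini batch size; one optimizer step each
micro_train_batch_size_per_gpu = 4   # per-GPU micro batch for gradient accumulation
num_dp_ranks = 8                     # data-parallel ranks in the policy worker group

optimizer_steps_per_training_batch = train_batch_size // policy_mini_batch_size        # 4
mini_batch_size_per_rank = policy_mini_batch_size // num_dp_ranks                      # 32
grad_accumulation_steps = mini_batch_size_per_rank // micro_train_batch_size_per_gpu   # 8

# The splits above must be exact:
assert train_batch_size % policy_mini_batch_size == 0
assert policy_mini_batch_size % (num_dp_ranks * micro_train_batch_size_per_gpu) == 0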

Evaluation Configuration

eval_batch_size: 1024
eval_before_train: true
eval_interval: 5 # Set to -1 to disable evaluation.
  • eval_batch_size: Batch size for evaluation.

  • eval_before_train: Whether to evaluate the model before training.

  • eval_interval: The frequency of evaluating the model with the validation dataset (in terms of number of steps). If set to -1, evaluation will not be performed.

    If multiple validation datasets are provided (e.g., data.val_data="['$DATA_DIR/validation1.parquet', '$DATA_DIR/validation2.parquet']"), evaluation is performed on all of them. The metrics for each dataset, as well as the aggregated metrics, are logged to WandB. If dump_eval_results is set to true, the per-dataset and aggregated results are also dumped.

Checkpoint Configuration

resume_mode: latest # null/"none", "latest", "from_path"
resume_path: null
ckpt_path: "${oc.env:HOME}/ckpts/" # Local directory path or cloud storage path (S3, GCP) for resumable training checkpoints (model state, optimizer state, etc.)
max_ckpts_to_keep: -1 # -1 to keep all checkpoints, N to keep the last N checkpoints
ckpt_interval: 10  # Save full training checkpoint every `ckpt_interval` steps.
hf_save_interval: -1  # Save HF format model(s) every `hf_save_interval` steps.
export_path: "${oc.env:HOME}/exports/" # Path for exported artifacts (HF models, debug dumps, etc.)
project_name: "skyrl"
run_name: "test_run"
logger: "wandb"

For an in-depth guide on checkpointing and resumption, please refer to the checkpointing guide.

Logging and Debugging Configuration

logger: "wandb"
project_name: "skyrl"
run_name: "test_run"
dump_data_batch: false
dump_eval_results: true
  • logger: Logger to use. Currently, we support wandb, mlflow, and console. console will simply log metrics to the console.
  • project_name: Name of the project in WandB and MLFlow.
  • run_name: Name of the run in WandB and MLFlow.
  • dump_data_batch: Whether to dump the data batch to a file. This is useful for debugging. When true, the data batch will be dumped to a file in the export_path directory. The training batch at global step N is saved to self.cfg.trainer.export_path / "dumped_data" / global_step_N_training_input
  • dump_eval_results: Whether to dump the evaluation results to a file. When true, the full evaluation results will be dumped to a file in the export_path directory. The evaluation results at global step N are saved to self.cfg.trainer.export_path / "dumped_eval" / global_step_N_eval_results
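
For example, the dump locations described above can be reconstructed as follows (a small sketch; the export path and step number are hypothetical, and the serialization format is not covered here):

from pathlib import Path

export_path = Path.home() / "exports"   # trainer.export_path from the example config
global_step = 42                        # hypothetical global step

training_dump = export_path / "dumped_data" / f"global_step_{global_step}_training_input"
eval_dump = export_path / "dumped_eval" / f"global_step_{global_step}_eval_results"
print(training_dump)
print(eval_dump)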

Training Backends

We support three backends: FSDP1, FSDP2, and Megatron. The backend can be chosen with the trainer.strategy field.

FSDP Configuration

We use the same configuration group for both FSDP1 and FSDP2:

fsdp_config:
    cpu_offload: false # offload params + optimizer state to cpu during fwd pass
    reshard_after_forward: true # fsdp2 only, [True, False, int between 1 and fsdp_size]
    fsdp_size: -1
  • cpu_offload: Whether to train with CPU offloading (i.e., offload state during the forward pass). This corresponds to the cpu_offload parameter in FSDP1 and offload_policy in FSDP2.

  • reshard_after_forward: Whether to re-shard FSDP model after forward pass. This is a FSDP2 specific configuration, please refer to the FSDP2 docs for more details. If set to false, this would retain the full model parameters on each worker (similar to DeepSpeed's ZeRO stage 2).

  • fsdp_size: The group size within which worker state is sharded with FSDP. This parameter is used for hybrid sharding in multi-node settings. For example, if the number of workers in the actor group is 8, with 4 on each node, and fsdp_size is 4, then the training state is fully sharded across the 4 ranks in each node but replicated (DP) across nodes (see the sketch after the note below).

    cpu_offload is different from worker state offloading with model colocation.

    In FSDP, cpu_offload will offload parameter and optimizer state to CPU memory and only copy over model parameters to GPU during model forward pass.

    In skyrl-train, we offload worker state in certain colocation settings - however, this happens only after the training step / log probability computation, so the optimizer step and model forward pass happen as usual with sharded parameters on GPU. For more details, refer to the guide on model placement and colocation.
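
To visualize the hybrid sharding example above, the sketch below (illustrative only, not skyrl-train's internal code) builds the corresponding 2-D device mesh with PyTorch: with 8 workers and fsdp_size: 4, parameters are sharded across 4 ranks within each node and replicated across the 2 resulting groups.

from torch.distributed.device_mesh import init_device_mesh

# Run under torchrun with 8 processes (e.g., 2 nodes x 4 GPUs each).
world_size = 8
fsdp_size = 4
replicate_groups = world_size // fsdp_size  # 2

# A (replicate=2, shard=4) mesh: shard within each group of 4, replicate across groups.
mesh = init_device_mesh(
    "cuda", (replicate_groups, fsdp_size), mesh_dim_names=("replicate", "shard")
)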

Megatron Configuration

megatron_config:
  tensor_model_parallel_size: 1 
  pipeline_model_parallel_size: 1
  context_parallel_size: 1
  expert_model_parallel_size: 1
  expert_tensor_parallel_size: null

  ddp_config: # pass-through config to Megatron's `DistributedDataParallelConfig` object
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/distributed/distributed_data_parallel_config.py#L8
    ...
  optimizer_config_kwargs: # pass-through kwargs to Megatron's `OptimizerConfig` object
    # any overlapping arguments with those we attempt to resolve in trainer.policy.optimizer_config will be overridden by the values here
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/optimizer/optimizer_config.py#L12
    ...
  model_config_kwargs: # pass-through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)
    ...
  transformer_config_kwargs: # pass-through kwargs to the Megatron's `TransformerConfig` object
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/transformer/transformer_config.py#L33
    ...
  lora_config:
    # see: https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html for details - currently "lora" and "canonical_lora" are supported
    lora_type: "lora"
  # flag to manually empty torch's cuda cache between the forward/backward pass and the optimizer step
  # this will free reserved but unallocated memory, and can help avoid OoMs in the optimizer
  empty_cuda_cache: true
  • megatron_config.tensor_model_parallel_size: Tensor model parallel size, which shards model parameters and activations across GPUs to reduce memory usage. Sequence parallelism (unrelated to Ulysses sequence parallelism) is also enabled by default if the tensor parallel size is greater than 1.
  • megatron_config.pipeline_model_parallel_size: Pipeline model parallel size for sharding model layers across multiple GPUs.
  • megatron_config.context_parallel_size: Context parallel size for reducing activation memory across the sequence length dimension.
  • megatron_config.expert_model_parallel_size: The expert parallel size for sharding expert modules across multiple GPUs.
  • megatron_config.expert_tensor_parallel_size: The tensor parallel size for each expert module. If set to null, then the value will be resolved to tensor_model_parallel_size by Megatron. It is recommended to set this to 1 when enabling expert_model_parallel_size > 1 for the best performance.

Some rules for configuring these parameters:

  • model_size = pp_size * tp_size * cp_size
  • dp_size = world_size / model_size
  • world_size % (pp_size * ep_size * etp_size) == 0
    • This means that ep_size * etp_size can scale independently of tp_size * cp_size, and can go across data parallel ranks.
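
A proposed layout can be sanity-checked against these rules with a few lines of standalone Python (a sketch, not part of skyrl-train):

def check_megatron_layout(world_size, tp, pp, cp, ep, etp=None):
    """Check a parallelism layout against the rules listed above."""
    etp = tp if etp is None else etp  # expert_tensor_parallel_size defaults to tp
    model_size = pp * tp * cp
    assert world_size % model_size == 0, "world_size must be divisible by pp * tp * cp"
    assert world_size % (pp * ep * etp) == 0, "world_size must be divisible by pp * ep * etp"
    return world_size // model_size  # dp_size

# Example: 16 GPUs with tp=2, pp=2, cp=1, ep=4, etp=1 -> dp_size = 4
print(check_megatron_layout(world_size=16, tp=2, pp=2, cp=1, ep=4, etp=1))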

optimizer_config_kwargs.use_precision_aware_optimizer=true can cause checkpointing to fail (see https://github.com/nvidia/megatron-lm/issues/1820). We recommend leaving this set to false.

Optimizer Configuration

For both the critic and policy models, we provide a common optimizer configuration:

optimizer_config:
   lr: 1.0e-6
   adam_betas: [0.9, 0.999]
   weight_decay: 1e-2
   max_grad_norm: 1.0
   offload_after_step: true
   num_warmup_steps: 0
   scheduler: "constant_with_warmup"
  • optimizer_config.lr: Learning rate for the optimizer
  • optimizer_config.adam_betas: Betas for AdamW optimizer.
  • optimizer_config.weight_decay: L2 regularization strength for AdamW.
  • optimizer_config.max_grad_norm: Gradient clipping parameter. The total L2 norm of the model gradients is clipped to at most this value during training.
  • optimizer_config.offload_after_step: Whether to offload optimizer state to CPU after step if colocated. When generation and training workers are colocated, we recommend using the default setting of true. In some cases with non-colocation, it can be desirable to leave optimizer state on GPU memory to avoid offloading costs as well as additional CPU memory usage.
  • optimizer_config.num_warmup_steps: Number of mini-batch steps to warmup the optimizer for.
  • optimizer_config.scheduler: Which learning rate scheduler to use. Intended to align with transformers.SchedulerType from Hugging Face (see the sketch after this list).
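
Since the scheduler names are intended to align with transformers.SchedulerType, a roughly equivalent standalone setup (outside skyrl-train, shown only for intuition) looks like:

import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.0e-6, betas=(0.9, 0.999), weight_decay=1e-2
)
scheduler = get_scheduler(
    "constant_with_warmup", optimizer=optimizer, num_warmup_steps=0, num_training_steps=1000
)

loss = model(torch.randn(2, 8)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm
optimizer.step()
scheduler.step()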

Policy Configuration

This section configures the policy model used for training, including optimizer, FSDP, sequence parallelism, and LoRA options.

policy:
  model:
    path: "Qwen/Qwen2.5-1.5B-Instruct"  # Hugging Face model path for the policy model
    lora:
      rank: 0                    # LoRA rank (0 = disabled)
      alpha: 16                  # LoRA scaling parameter
      dropout: 0                 # LoRA dropout rate
      lora_sync_path: "/tmp/skyrl_lora_sync"  # Path for LoRA adapter sync
      target_modules: "all-linear"  # Apply to all linear layers OR
      # specify specific modules as a list
      exclude_modules: null  # Modules to exclude from LoRA
      # For FSDP, this corresponds to `init_lora_weights` in PEFT. See: https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig
      # For Megatron, this is used for `lora_A_init_method`, and "xavier", "normal", "kaiming", and "zero" are supported.
      init_method: "kaiming" # Initialization method for LoRA layers
  optimizer_config:
    lr: 1.0e-6  # Learning rate
    adam_betas: [0.9, 0.999]  # Betas for Adam optimizer
    weight_decay: 1e-2  # L2 regularization strength
    max_grad_norm: 1.0  # Gradient clipping
    offload_after_step: true  # Offload optimizer state to CPU after step (if colocated)

  fsdp_config:
    cpu_offload: false  # Offload model params to CPU during forward
    reshard_after_forward: true  # Re-shard FSDP model after forward pass
    fsdp_size: -1  # Auto FSDP group sizing

  sequence_parallel_size: 1  # sequence parallel size

  use_torch_compile: false  # Enable torch compile for the entropy calculation
  record_memory: false  # Dump memory snapshot for debugging

  model_config_kwargs: {}     # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use policy.megatron_config.transformer_config_kwargs instead
  • policy.optimizer_config: Optimizer configuration for the policy model
  • policy.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
  • policy.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.
  • policy.use_torch_compile: Whether to enable torch compile for entropy calculation
  • policy.record_memory: Whether to record memory usage. If True, this will use PyTorch's memory snapshotting utility to record memory usage and dump memory snapshots after each policy model training step.

LoRA Configuration

LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning by training only a small number of additional low-rank matrices instead of the full model weights:

  • policy.model.lora.rank: LoRA rank for low-rank decomposition. Set to 0 to disable LoRA. Higher values increase model capacity but also memory usage. Common values include 8, 16, 32, or 64.
  • policy.model.lora.alpha: Scaling factor for LoRA updates.
  • policy.model.lora.dropout: Dropout probability applied to LoRA layers. Helps prevent overfitting during training.
  • policy.model.lora.lora_sync_path: Directory path where LoRA adapter weights are saved and synchronized between training and inference processes. Must be accessible to all workers in distributed setups.
  • policy.model.lora.init_method: Initialization method for LoRA layers. For FSDP, this corresponds to init_lora_weights in PEFT. 'kaiming' is mapped to 'true' by default for PEFT. For Megatron, this is used for lora_A_init_method, and "xavier", "normal", "kaiming", and "zero" are supported.
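
For intuition, on the FSDP path these fields map closely onto PEFT's LoraConfig; a rough standalone equivalent (illustrative only, not skyrl-train's exact construction) is:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora_config = LoraConfig(
    r=16,                         # policy.model.lora.rank
    lora_alpha=16,                # policy.model.lora.alpha
    lora_dropout=0.0,             # policy.model.lora.dropout
    target_modules="all-linear",  # policy.model.lora.target_modules
    init_lora_weights=True,       # "kaiming" maps to PEFT's default (True)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()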

Critic Configuration

We support similar configuration options as the policy model, including LoRA.

critic:
  model:
    path: null
    lora:
      rank: 0                    # LoRA rank (0 = disabled)
      alpha: 16                  # LoRA scaling parameter
      dropout: 0                 # LoRA dropout rate
      target_modules: "all-linear"
      exclude_modules: null  # Modules to exclude from LoRA
      init_method: "kaiming" # Initialization method for LoRA layers
  optimizer_config:
    lr: 5.0e-6
    adam_betas: [0.9, 0.999]
    weight_decay: 1e-2
    max_grad_norm: 1.0 # gradient clipping
    offload_after_step: true # offload optimizer state to cpu after each step. Applicable only when `colocate_all=true`
  fsdp_config:
    cpu_offload: false
    reshard_after_forward: true
    fsdp_size: -1
  sequence_parallel_size: 1
  model_config_kwargs: {} # pass through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)

Reference Model Configuration

ref:
  model:
    path: ${trainer.policy.model.path}
  fsdp_config:
    cpu_offload: false
    reshard_after_forward: true
    fsdp_size: -1
  sequence_parallel_size: 1
  model_config_kwargs: {}     # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use ref.megatron_config.transformer_config_kwargs instead
  • ref.model.path: Path to the reference model. Defaults to the policy model path, but can be set separately (e.g., for distillation-based approaches, the reference model can be a different model than the policy model).
  • ref.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
  • ref.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.

The reference model is used only if the base model log probabilities are required either as a part of the training loss or as a part of the reward. Thus, trainer.algorithm.use_kl_in_reward or trainer.algorithm.use_kl_loss should be set to true to use the reference model. If both are false, then the reference model is not instantiated.

Algorithm Configuration

algorithm:
  advantage_estimator: "grpo"  # "grpo", "gae", or customizable with AdvantageEstimatorRegistry

  # KL Penalty Parameters
  kl_ctrl: # only used if use_kl_in_reward is true (not applied in the case of use_kl_loss=true) - uses kl_loss_coef as the initial KL coefficient
    type: "fixed" # "fixed" or "adaptive"
    kl_target: 0.1 # target KL divergence for adaptive KL controller
    horizon: 10000 # controls the update rate of the adaptive KL controller

  kl_estimator_type: "k3" # "k1", "k2", "k3", "abs" - see http://joschu.net/blog/kl-approx.html for details

  # note: use_kl_in_reward and use_kl_loss should be mutually exclusive
  use_kl_in_reward: false # apply kl loss to rewards
  use_kl_loss: true # used in policy model
  kl_loss_coef: 0.001
  # this adds training batch level normalization to advantages
  advantage_batch_normalize: false
  value_head_prefix: "value_head"
  policy_loss_type: "regular" # "regular", "dual_clip", "gspo", "clip_cov", "kl_cov" or customizable with PolicyLossRegistry
  loss_reduction: "token_mean" # "token_mean", "sequence_mean", "seq_mean_token_sum_norm"
  grpo_norm_by_std: true # set to false to disable normalization by std in GRPO (used in Dr. GRPO)
  zero_variance_filter: false # set to true to loss mask out prompts with zero variance rewards. only applicable when rewards are response-level.

  # GAE parameters
  lambd: 1.0
  gamma: 1.0

  # PPO parameters
  eps_clip_low: 0.2
  eps_clip_high: 0.2
  # dual clip parameters
  clip_ratio_c: 3.0

  # clip-cov parameters (only used when policy_loss_type: "clip_cov")
  clip_cov:
    clip_ratio: 0.0002 # fraction of tokens to clip based on covariance
    clip_cov_lb: 1.0 # lower bound for covariance clipping
    clip_cov_ub: 5.0 # upper bound for covariance clipping

  # kl-cov parameters (only used when policy_loss_type: "kl_cov")
  kl_cov:
    kl_cov_frac: 0.2 # percentage of tokens to apply KL regularization to (20%)
    ppo_kl_coef: 1.0 # coefficient for KL regularization term

  # cispo parameters (only used when policy_loss_type: "cispo")
  cispo: 
    cispo_eps_clip_low: 0  # offset for lower bound of importance sampling ratio clipping (as opposed to PPO token update clipping)
    cispo_eps_clip_high: 5 # offset for upper bound of importance sampling ratio clipping (as opposed to PPO token update clipping)

  # value loss parameters
  value_clip: 0.2

  # dynamic sampling parameters
  dynamic_sampling:
    type: null # filter (DAPO), replace (POLARIS/WebSailor), or null
    max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever
    min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only)

  # Truncated Importance Sampling as proposed in https://fengyao.notion.site/off-policy-rl 
  use_tis: false 
  tis_imp_ratio_cap: -1.0

  # SAPO parameters (only used when policy_loss_type: "sapo") (https://arxiv.org/pdf/2511.20347)
  sapo:
    tau_pos: 1.0
    tau_neg: 1.05 # default values used in the paper with Qwen3-30B-A3B-Base
  • algorithm.advantage_estimator: Advantage estimator to use. We currently implement grpo, gae, rloo, and reinforce++; custom advantage estimators can be registered with the AdvantageEstimatorRegistry.

  • algorithm.kl_ctrl: Configuration for the KL controller - only used if use_kl_in_reward is true (not applied when use_kl_loss is true). kl_loss_coef is used as the initial KL coefficient for both the fixed and adaptive KL controllers.

    • type: Type of KL controller to use. Options include: fixed or adaptive.

    • kl_target: Target KL divergence for adaptive KL controller.

    • horizon: Controls the update rate of the adaptive KL controller.

  • algorithm.kl_estimator_type: KL estimator type to use. Options include: k1, k2, k3, abs. See this blog post for details; we use k3 as the default. A sketch of these estimators is shown after this list.

  • algorithm.use_kl_in_reward: Whether to apply KL divergence penalty to rewards. The new rewards will be computed as rewards - kl * kl_loss_coef.

  • algorithm.use_kl_loss: Whether to add a KL divergence loss to the policy model. The policy loss will be computed as policy_loss + kl * kl_loss_coef.

  • algorithm.kl_loss_coef: Coefficient for the KL divergence loss.

  • algorithm.advantage_batch_normalize: Whether to normalize advantages by the (global) batch mean and standard deviation.

  • algorithm.value_head_prefix: The name used to identify the value head in the critic model.

  • algorithm.policy_loss_type: Type of policy loss to use. Options include:

    • regular: Vanilla PPO loss with token-level importance sampling
    • dual_clip: Dual clip PPO loss proposed in this paper
    • gspo: Group Sequence Policy Optimization with sequence-level importance sampling for improved training stability. Implements the "GSPO-token" variant from the paper.
    • clip_cov: Clip-Cov combines standard PPO clipping with covariance-based correction masking for improved stability. Based on this paper.
    • kl_cov: KL-Cov applies KL regularization to tokens selected based on covariance values. Based on this paper.
    • cispo: Clipped Importance Sampling Weight Policy Optimization (CISPO) proposed in MiniMax-M1.
    • Custom policy losses can be registered with the PolicyLossRegistry
  • algorithm.loss_reduction: Type of loss reduction to use. Options include:

    • token_mean: computes average loss over all valid tokens in the batch. Used in DAPO.
    • sequence_mean: computes per-sequence avg token loss, then averages over the batch.
    • seq_mean_token_sum_norm: computes the sum of token losses for each sequence, normalizes by the max sequence length (computed as cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length), and then averages over the batch. This is used in Dr. GRPO.
  • algorithm.grpo_norm_by_std: Whether to normalize advantages by the standard deviation in GRPO. This is set to false in Dr. GRPO.

  • algorithm.zero_variance_filter: Whether to loss mask out prompts with zero variance rewards. This is only applicable when rewards are response-level.

  • algorithm.lambd: Lambda parameter for GAE.

  • algorithm.gamma: Gamma parameter for GAE.

  • algorithm.eps_clip_low: Lower bound for PPO clipping.

  • algorithm.eps_clip_high: Upper bound for PPO clipping.

  • algorithm.clip_ratio_c: Clip ratio for dual clip PPO loss.

  • algorithm.value_clip: Clip value for value loss.

  • algorithm.dynamic_sampling: Dynamic sampling configuration.

    • algorithm.dynamic_sampling.type: Type of dynamic sampling to use. Currently, we support filter (DAPO), replace (POLARIS / WebSailor), or null for no dynamic sampling.
    • algorithm.dynamic_sampling.max_sample_batches: Maximum number of batches to sample before stopping. Set to -1 to sample forever.
    • algorithm.dynamic_sampling.min_replace_ratio: Minimum proportion of good samples with which to replace bad samples for replace strategy.
  • algorithm.use_tis: Whether to use Truncated Importance Sampling (TIS) as proposed in this blog.

  • algorithm.tis_imp_ratio_cap: Cap parameter for the importance ratio in TIS.

  • algorithm.clip_cov: Clip-Cov parameters (only used when policy_loss_type is clip_cov):

    • clip_ratio: Fraction of tokens to clip based on covariance values.
    • clip_cov_lb: Lower bound for covariance clipping.
    • clip_cov_ub: Upper bound for covariance clipping.
  • algorithm.kl_cov: KL-Cov parameters (only used when policy_loss_type is kl_cov):

    • kl_cov_frac: Percentage of tokens to apply KL regularization to.
    • ppo_kl_coef: Coefficient for KL regularization term.
  • algorithm.cispo: CISPO parameters (only used when policy_loss_type is cispo):

    • cispo_eps_clip_low: Offset for lower bound of importance sampling ratio clipping. Tokens with importance sampling ratio less than 1 - cispo_eps_clip_low will have their ratio clipped, but can still be updated in the policy gradient update.
    • cispo_eps_clip_high: Offset for upper bound of importance sampling ratio clipping. Tokens with importance sampling ratio greater than 1 + cispo_eps_clip_high will have their ratio clipped, but can still be updated in the policy gradient update.
  • algorithm.sapo: SAPO (as proposed in this paper) parameters (only used when policy_loss_type is sapo):

    • tau_pos: Temperature for gating function for tokens with positive advantages.
    • tau_neg: Temperature for gating function for tokens with negative (or zero) advantages.
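
To make the kl_estimator_type options concrete, here is a small sketch of the k1, k2, k3, and abs estimators from the referenced blog post, written for per-token log probabilities under the policy and the reference model (illustrative only, not skyrl-train's exact implementation):

import torch

def kl_penalty(log_probs: torch.Tensor, ref_log_probs: torch.Tensor, kind: str) -> torch.Tensor:
    """Per-token estimators of KL(policy || ref), following http://joschu.net/blog/kl-approx.html."""
    log_ratio = log_probs - ref_log_probs            # log(pi(x) / ref(x)) at the sampled tokens
    if kind == "k1":
        return log_ratio                             # unbiased, high variance
    if kind == "k2":
        return 0.5 * log_ratio.pow(2)                # biased, low variance
    if kind == "k3":
        return (-log_ratio).exp() - 1.0 + log_ratio  # unbiased, lower variance (the default here)
    if kind == "abs":
        return log_ratio.abs()
    raise ValueError(f"unknown kl_estimator_type: {kind}")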

Policy Loss Formulation

It can be helpful to understand the final loss formulation to see how the different configuration options are used. The final loss is computed as below in the ppo_policy_loss function.

from typing import Optional, Tuple

import torch
from omegaconf import DictConfig


def ppo_policy_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    config: DictConfig,  # trainer.algorithm config
    loss_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, float]:
    # masked_mean and reduce_loss are skyrl-train helpers (see the sketch of reduce_loss below).
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - config.eps_clip_low, 1 + config.eps_clip_high) * advantages
    loss = -torch.min(surr1, surr2)
    clip_ratio = masked_mean((-surr2 > -surr1).float(), loss_mask).mean().detach().item()
    clip_pg_losses1 = loss
    if config.policy_loss_type == "dual_clip":
        pg_losses3 = -advantages * config.clip_ratio_c
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        loss = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    loss = reduce_loss(loss, loss_mask, config.loss_reduction)
    return loss, clip_ratio
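
The reduce_loss helper is not shown above; a rough sketch of the three loss_reduction modes it implements, following the descriptions in the algorithm section (the max-length argument is an assumption about how the normalizer is passed in), might look like:

from typing import Optional

import torch

def reduce_loss(loss: torch.Tensor, loss_mask: torch.Tensor, mode: str,
                max_seq_len: Optional[int] = None) -> torch.Tensor:
    """Sketch of the loss_reduction modes; loss and loss_mask have shape (batch, seq_len)."""
    if mode == "token_mean":
        # Average over all valid tokens in the batch.
        return (loss * loss_mask).sum() / loss_mask.sum().clamp(min=1)
    if mode == "sequence_mean":
        # Per-sequence mean over valid tokens, then mean over the batch.
        per_seq = (loss * loss_mask).sum(-1) / loss_mask.sum(-1).clamp(min=1)
        return per_seq.mean()
    if mode == "seq_mean_token_sum_norm":
        # Per-sequence token-loss sum normalized by the max sequence length
        # (max_input_length + max_generate_length), then mean over the batch (Dr. GRPO).
        per_seq = (loss * loss_mask).sum(-1) / max_seq_len
        return per_seq.mean()
    raise ValueError(f"unknown loss_reduction: {mode}")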

Generator Configuration

generator:
  model_dtype: "bfloat16" # should match dtype for inference engine
  run_engines_locally: true
  num_inference_engines: 1
  backend: "vllm"
  weight_sync_backend: "nccl"
  inference_engine_tensor_parallel_size: 4
  inference_engine_pipeline_parallel_size: 1
  inference_engine_expert_parallel_size: 1  
  inference_engine_data_parallel_size: 1
  n_samples_per_prompt: 5
  async_engine: true
  batched: true
  max_input_length: ${trainer.max_prompt_length} # max generator input length used for multi-turn conversations - for single turn set equal to max_prompt_length
  enable_prefix_caching: true
  enable_chunked_prefill: true
  max_num_batched_tokens: 8192
  enforce_eager: false
  gpu_memory_utilization: 0.8
  max_num_seqs: 1024
  remote_inference_engine_urls: ["127.0.0.1:8001"]
  max_turns: 1

  # Custom chat template configuration if needed
  chat_template:
    source: "name"  # "name" or "file"
    name_or_path: null  # e.g., "qwen3_with_thinking" or "/path/to/template.j2"

  # Chat templating kwargs to pass to `tokenizer.apply_chat_template`
  chat_template_kwargs: {}

  engine_init_kwargs: {}

  override_existing_update_group: "auto" # "auto", "enable", "disable"
  # sampling params for generation phase
  sampling_params:
    max_generate_length: 1024
    temperature: 1.0
    top_p: 1.0
    min_p: 0.0
    top_k: -1
    logprobs: 0

  use_conversation_multi_turn: true

  # sampling params for evaluation
  eval_sampling_params:
    max_generate_length: ${generator.sampling_params.max_generate_length}
    temperature: 1.0
    top_p: 1.0
    min_p: 0.0
    top_k: -1
    logprobs: 0

  # number of samples per prompt for evaluation
  eval_n_samples_per_prompt: 1

  zero_reward_on_non_stop: false

  apply_overlong_filtering: false

Inference Engine Placement Configuration

  • generator.run_engines_locally: Whether to use local inference engines. If true, the inference engine will be initialized during the training run in the current Ray cluster. We use one Ray actor per inference replica and communication will happen via Ray object store. If set to false, then the generator expects a list of remote urls and communication will happen over HTTP.
  • generator.num_inference_engines: Number of inference engines to use. If run_engines_locally is false, then this number should match the number of remote urls.
  • generator.remote_inference_engine_urls: List of remote urls to use. Applicable only when run_engines_locally is false.
  • generator.enable_http_endpoint: When true, launch an OpenAI-compatible HTTP endpoint for the inference engine client so that generators can send requests to this server instead of using .generate() Python calls.
  • generator.http_endpoint_host: Host for the inference HTTP endpoint.
  • generator.http_endpoint_port: Port for the inference HTTP endpoint.

For more details on how different placement options work, please refer to the placement guide.

Weight Transfer Configuration

  • generator.weight_sync_backend: Backend to use for weight synchronization. Currently, we support nccl and gloo.
  • generator.override_existing_update_group: Whether to override the existing update group for the inference engine. This is applicable only for remote inference engines. During training, skyrl-train forms a custom process group ("update group") with the rank 0 training worker and all the inference engine ranks. If override_existing_update_group=enable, then during initialization a previous weight update group will be overridden in the inference engine. For example, if you have a remote server setup and you run training for the same model multiple times, it is helpful to override the previous update group. We recommend leaving this as auto, since it will automatically determine whether the previous update group should be overridden based on run_engines_locally.

Inference Engine Configuration

  • generator.backend: Backend to use for the inference engine. We support vllm and sglang. sglang is supported only for remote inference engines at the moment.
  • generator.model_dtype: Dtype used for the inference engine. This is also used during weight transfer - the policy model weights are cast to this dtype before being sent to the inference engine.
  • generator.async_engine: Whether to use an asynchronous/offline inference engine. Applicable only when backend="vllm".
  • generator.inference_engine_tensor_parallel_size: Tensor parallel size for the inference engine.
  • generator.inference_engine_pipeline_parallel_size: Pipeline parallel size for the inference engine. Currently, PP is only supported for vLLM backend with async_engine=true.
  • generator.inference_engine_expert_parallel_size: Expert parallel size for the inference engine. Currently, EP is only supported for vLLM backend and ep_size must equal dp_size * tp_size.
  • generator.inference_engine_data_parallel_size: Data parallel size for the inference engine. Currently, DP is only supported for vLLM backend.
  • generator.gpu_memory_utilization: GPU memory utilization for the inference engine. Applicable only for run_engines_locally=true.
  • generator.vllm_v1_disable_multiproc: If true, this will set VLLM_ENABLE_V1_MULTIPROCESSING=0 in the environment, which makes the scheduling deterministic. This is useful for reproducibility.
  • generator.enable_prefix_caching: Whether to enable prefix caching for the inference engine. Applicable only when backend="vllm". This can be left to the default true in most cases. Note that in the case of remote inference engines, you would need to match the setting used when you initialized the remote servers.
  • generator.enable_chunked_prefill: Whether to enable chunked prefill for the inference engine. Applicable only when backend="vllm". With vLLM, this can be left to the default true in most cases.
  • generator.max_num_seqs: Continuous batching parameter for vLLM. Maximum number of sequences to pack into a batch.
  • generator.max_num_batched_tokens: Continuous batching parameter for vLLM. Maximum number of tokens to pack into a batch.

Generation Parameters

  • generator.n_samples_per_prompt: Number of samples to generate per prompt. Note that the total size of the training batch will be trainer.train_batch_size * generator.n_samples_per_prompt.

  • generator.batched: Whether to use batched inference. This is applicable only for single turn generation.

  • generator.max_input_length: Maximum input length for the inference engine. For single-turn generation, this can be the same as trainer.max_prompt_length (i.e., the initial prompt length). For multi-turn generation, this is the maximum input length at each turn of the conversation.

  • generator.sampling_params: Sampling parameters for the inference engine during the trajectory generation phase (see the sketch after this list for how these map onto vLLM).

    • generator.sampling_params.max_generate_length: Maximum length of the generated response.
    • generator.sampling_params.temperature: Temperature for the inference engine.
    • generator.sampling_params.top_p: Top-p sampling parameter for the inference engine.
    • generator.sampling_params.min_p: Min-p sampling parameter for the inference engine, as proposed in this paper.
    • generator.sampling_params.top_k: Top-k sampling parameter for the inference engine.
    • generator.sampling_params.logprobs: Number of logprobs to return from the inference engine. Set to 0 to return only the chosen token's logprob.
  • generator.eval_sampling_params: Sampling parameters for evaluation.

  • generator.eval_n_samples_per_prompt: Number of samples to generate per prompt for evaluation.

  • generator.max_turns: Maximum number of turns for generation with multi-turn RL.

  • generator.use_conversation_multi_turn: Whether to use conversation format for multi-turn generation. If set to true, observations are appended to the chat history as a new turn. If set to false, observations are appended as-is to the assistant response in token space and generation is continued (after removing any EOS token in the response). We've observed cases where the model can be sensitive to the chat history format (e.g., in SkyRL-SQL), so false can be used for full control over the exact tokens added after environment interaction.

  • generator.engine_init_kwargs: Inference engine arguments passed directly to the vLLM or SGLang engine. To specify an engine arg in the CLI override, use the format: +generator.engine_init_kwargs.[arg_name]=value. If duplicate kwargs are passed or kwargs clash with existing generator arguments (e.g., tensor_parallel_size), an error is raised.

  • generator.chat_template: Custom chat template configuration if needed.

    • generator.chat_template.source: Source of the chat template. Can be either name or file.
    • generator.chat_template.name_or_path: Name or path of the chat template. If the source is name, then it should be one of the supported templates in skyrl_train/generators/utils.py. If the source is file, then this field should be a path to a Jinja2 template file.
  • generator.chat_template_kwargs: Chat templating kwargs to pass to tokenizer.apply_chat_template. Applicable only for non-batched generation with generator.batched=false.
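
For intuition, with the vllm backend these fields correspond roughly to vLLM's request-level SamplingParams (shown below purely as an illustration; skyrl-train constructs these internally):

from vllm import SamplingParams

sampling_params = SamplingParams(
    max_tokens=1024,   # generator.sampling_params.max_generate_length
    temperature=1.0,
    top_p=1.0,
    min_p=0.0,
    top_k=-1,          # -1 disables top-k
    logprobs=0,        # return only the chosen token's logprob
)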

Misc Configuration

  • generator.zero_reward_on_non_stop: Whether to set the reward to 0 if the stop_reason is not stop. This is useful when, for example, format rewards are used: if the LLM did not finish its response, we typically do not want to reward it. This is a general setting for all environments.
  • generator.apply_overlong_filtering: Whether to apply DAPO Overlong Filtering to the loss masks. For each trajectory that exceeds the max length (i.e., is truncated and does not end with an EOS token), every token in the loss mask is masked out (see the sketch after this list).
  • generator.step_wise_trajectories: Whether to return outputs in a step-wise fashion. If true, then the generator will return multi-turn generations with the (prompt, response) pair of each turn being a separate trajectory. Advantages are computed based on the last step of each trajectory and propagated to the previous steps.
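
As a sketch of the idea behind zero_reward_on_non_stop and apply_overlong_filtering (not skyrl-train's implementation; the "stop"/"length" stop-reason strings are assumptions based on common inference engine conventions):

from typing import List, Tuple

def postprocess_trajectories(
    rewards: List[float],
    stop_reasons: List[str],
    loss_masks: List[List[int]],
    zero_reward_on_non_stop: bool,
    apply_overlong_filtering: bool,
) -> Tuple[List[float], List[List[int]]]:
    if zero_reward_on_non_stop:
        # Zero the reward for any trajectory that did not finish with a "stop" reason.
        rewards = [r if sr == "stop" else 0.0 for r, sr in zip(rewards, stop_reasons)]
    if apply_overlong_filtering:
        # Mask out every token of truncated trajectories (no EOS, e.g. a "length" stop reason).
        loss_masks = [
            mask if sr == "stop" else [0] * len(mask)
            for mask, sr in zip(loss_masks, stop_reasons)
        ]
    return rewards, loss_masks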
