Configuration

Configuration Overview

Data Configuration

data:
  train_data: ["${oc.env:HOME}/data/gsm8k/train.parquet"]
  val_data: ["${oc.env:HOME}/data/gsm8k/validation.parquet"]
  • data.train_data: A list of files for the training dataset.
  • data.val_data: A list of files for the evaluation dataset.

A dataset file can be a path to a parquet or json file, or the name of a Hugging Face dataset.

Currently, all datasets are loaded into memory, so the dataset size is limited by available CPU memory on a worker node.

Model Placement Configuration

placement:
  colocate_all: true
  colocate_policy_ref: true
  colocate_critic_reward: false
  policy_num_nodes: 1
  policy_num_gpus_per_node: 4
  critic_num_nodes: 1
  critic_num_gpus_per_node: 4
  ref_num_nodes: 1
  ref_num_gpus_per_node: 4
  reward_num_nodes: 1
  reward_num_gpus_per_node: 4

For an in-depth guide on model placement and colocation, please refer to the model placement and colocation guide.

General Training Configuration

epochs: 1  # Number of passes over the full dataset
update_epochs_per_batch: 1
train_batch_size: 1024
policy_mini_batch_size: 256
critic_mini_batch_size: 256
micro_train_batch_size_per_gpu: 1
micro_forward_batch_size_per_gpu: 1
update_ref_every_epoch: false
use_sample_packing: true
max_prompt_length: 512
gradient_checkpointing: true
seed: 42
  • epochs: Number of epochs/passes over the full dataset (similar to SFT).

  • update_epochs_per_batch: Number of gradient update passes over each training batch. This is equivalent to the concept of "PPO epochs" where you iterate over the same experience multiple times.

  • train_batch_size: Batch size of prompts used for each dataloader step.

  • policy_mini_batch_size: Mini batch size used during the RL training step. Each mini batch corresponds to one optimizer step. For example, if train_batch_size is 4 and policy_mini_batch_size is 2, then there will be 2 optimizer steps (i.e., model updates) for a given training batch. Note that this is the global mini batch size; the actual mini batch size per worker is policy_mini_batch_size / number of DP ranks (see the sketch after this list).

  • critic_mini_batch_size: Similar to policy_mini_batch_size but for the critic model (if applicable). Note that, in general, the critic model can tolerate off-policy updates better than the policy. Thus, you may want to set critic_mini_batch_size lower than policy_mini_batch_size (i.e., more critic updates per training batch).

  • micro_train_batch_size_per_gpu: Micro batch size during the training step. This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and gradients are computed and accumulated over these micro batches.

  • micro_forward_batch_size_per_gpu: Micro batch size during the forward pass (i.e., for log probability or value computation). This is common to both the policy and critic models. Each mini batch is split into micro batches of this size, and the model forward pass is performed over these micro batches.

  • update_ref_every_epoch: Whether to update the reference model every epoch.

  • use_sample_packing: Whether to use sample packing during model forward pass (common for all models).

  • max_prompt_length: Maximum prompt length during training. Longer prompts will be truncated.

  • gradient_checkpointing: Whether to use gradient checkpointing.

  • seed: Random seed for training.

    If you're having trouble finding valid values for micro_train_batch_size_per_gpu, policy_mini_batch_size, and micro_forward_batch_size_per_gpu, see utils/utils.py::validate_batch_sizes for details on the constraints.
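
To make these knobs concrete, here is a small standalone sketch (illustrative values only, working in prompt units and ignoring generator.n_samples_per_prompt) of how a training batch splits into optimizer steps and gradient-accumulation steps; the divisibility checks at the end approximate the kind of constraints validate_batch_sizes enforces.

# Illustrative only: the values below are examples, not defaults.
train_batch_size = 1024              # prompts fetched per dataloader step
policy_mini_batch_size = 256         # global mini batch size; one optimizer step each
micro_train_batch_size_per_gpu = 4   # per-GPU micro batch for gradient accumulation
num_dp_ranks = 8                     # data-parallel ranks in the policy worker group

optimizer_steps_per_training_batch = train_batch_size // policy_mini_batch_size        # 4
mini_batch_size_per_rank = policy_mini_batch_size // num_dp_ranks                      # 32
grad_accumulation_steps = mini_batch_size_per_rank // micro_train_batch_size_per_gpu   # 8

# The splits above must be exact:
assert train_batch_size % policy_mini_batch_size == 0
assert policy_mini_batch_size % (num_dp_ranks * micro_train_batch_size_per_gpu) == 0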

Evaluation Configuration

eval_batch_size: 1024
eval_before_train: true
eval_interval: 5 # Set to -1 to disable evaluation.
  • eval_batch_size: Batch size for evaluation.

  • eval_before_train: Whether to evaluate the model before training.

  • eval_interval: The frequency of evaluating the model with the validation dataset (in terms of number of steps). If set to -1, evaluation will not be performed.

    If multiple validation datasets are provided (e.g., data.val_data="['$DATA_DIR/validation1.parquet', '$DATA_DIR/validation2.parquet']"), evaluation is performed on all of them. The metrics for each dataset, as well as the aggregated metrics, are logged to WandB. If dump_eval_results is set to true, the per-dataset and aggregated results are also dumped.

Checkpoint Configuration

resume_mode: latest # null/"none", "latest", "from_path"
resume_path: null
ckpt_path: "${oc.env:HOME}/ckpts/" # Local directory path or cloud storage path (S3, GCP) for resumable training checkpoints (model state, optimizer state, etc.)
max_ckpts_to_keep: -1 # -1 to keep all checkpoints, N to keep the last N checkpoints
ckpt_interval: 10  # Save full training checkpoint every `ckpt_interval` steps.
hf_save_interval: -1  # Save HF format model(s) every `hf_save_interval` steps.
export_path: "${oc.env:HOME}/exports/" # Path for exported artifacts (HF models, debug dumps, etc.)
project_name: "skyrl"
run_name: "test_run"
logger: "wandb"

For an in-depth guide on checkpointing and resumption, please refer to the checkpointing guide.

Logging and Debugging Configuration

logger: "wandb"
project_name: "skyrl"
run_name: "test_run"
dump_data_batch: false
dump_eval_results: true
  • logger: Logger to use. Currently, we support wandb, mlflow, and console. console will simply log metrics to the console.
  • project_name: Name of the project in WandB and MLFlow.
  • run_name: Name of the run in WandB and MLFlow.
  • dump_data_batch: Whether to dump the data batch to a file. This is useful for debugging. When true, the data batch will be dumped to a file in the export_path directory. The training batch at global step N is saved to self.cfg.trainer.export_path / "dumped_data" / global_step_N_training_input
  • dump_eval_results: Whether to dump the evaluation results to a file. When true, the full evaluation results will be dumped to a file in the export_path directory. The evaluation results at global step N are saved to self.cfg.trainer.export_path / "dumped_eval" / global_step_N_eval_results
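
For example, the dump locations described above can be reconstructed as follows (a small sketch; the export path and step number are hypothetical, and the serialization format is not covered here):

from pathlib import Path

export_path = Path.home() / "exports"   # trainer.export_path from the example config
global_step = 42                        # hypothetical global step

training_dump = export_path / "dumped_data" / f"global_step_{global_step}_training_input"
eval_dump = export_path / "dumped_eval" / f"global_step_{global_step}_eval_results"
print(training_dump)
print(eval_dump)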

Training Backends

We support three backends: FSDP1, FSDP2, and Megatron. The backend can be chosen with the trainer.strategy field.

FSDP Configuration

We use the same configuration group for both FSDP1 and FSDP2:

fsdp_config:
    cpu_offload: false # offload params + optimizer state to cpu during fwd pass
    reshard_after_forward: true # fsdp2 only, [True, False, int between 1 and fsdp_size]
    fsdp_size: -1
  • cpu_offload: Whether to train with CPU offloading (i.e., offload state during the forward pass). This corresponds to the cpu_offload parameter in FSDP1 and offload_policy in FSDP2.

  • reshard_after_forward: Whether to re-shard FSDP model after forward pass. This is a FSDP2 specific configuration, please refer to the FSDP2 docs for more details. If set to false, this would retain the full model parameters on each worker (similar to DeepSpeed's ZeRO stage 2).

  • fsdp_size: The group size within which worker state is sharded with FSDP. This parameter is used for hybrid sharding in multi-node settings. For example, if the number of workers in the actor group is 8, with 4 on each node, and fsdp_size is 4, then the training state is fully sharded across the 4 ranks in each node but replicated (DP) across nodes (see the sketch after the note below).

    cpu_offload is different from worker state offloading with model colocation.

    In FSDP, cpu_offload will offload parameter and optimizer state to CPU memory and only copy over model parameters to GPU during model forward pass.

    In skyrl-train, we offload worker state in certain colocation settings - however, this happens only after the training step / log probability computation, so the optimizer step and model forward pass happen as usual with sharded parameters on GPU. For more details, refer to the guide on model placement and colocation.
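
To visualize the hybrid sharding example above, the sketch below (illustrative only, not skyrl-train's internal code) builds the corresponding 2-D device mesh with PyTorch: with 8 workers and fsdp_size: 4, parameters are sharded across 4 ranks within each node and replicated across the 2 resulting groups.

from torch.distributed.device_mesh import init_device_mesh

# Run under torchrun with 8 processes (e.g., 2 nodes x 4 GPUs each).
world_size = 8
fsdp_size = 4
replicate_groups = world_size // fsdp_size  # 2

# A (replicate=2, shard=4) mesh: shard within each group of 4, replicate across groups.
mesh = init_device_mesh(
    "cuda", (replicate_groups, fsdp_size), mesh_dim_names=("replicate", "shard")
)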

Megatron Configuration

megatron_config:
  tensor_model_parallel_size: 1 
  pipeline_model_parallel_size: 1
  context_parallel_size: 1
  expert_model_parallel_size: 1
  expert_tensor_parallel_size: null

  ddp_config: # pass-through config to Megatron's `DistributedDataParallelConfig` object
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/distributed/distributed_data_parallel_config.py#L8
    ...
  optimizer_config_kwargs: # pass-through kwargs to Megatron's `OptimizerConfig` object
    # any overlapping arguments with those we attempt to resolve in trainer.policy.optimizer_config will be overridden by the values here
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/optimizer/optimizer_config.py#L12
    ...
  model_config_kwargs: # pass-through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)
    ...
  transformer_config_kwargs: # pass-through kwargs to the Megatron's `TransformerConfig` object
    # https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/transformer/transformer_config.py#L33
    ...
  lora_config:
    # see: https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html for details - currently "lora" and "canonical_lora" are supported
    lora_type: "lora"
  # flag to manually empty torch's cuda cache between the forward/backward pass and the optimizer step
  # this will free reserved but unallocated memory, and can help avoid OoMs in the optimizer
  empty_cuda_cache: true
  • megatron_config.tensor_model_parallel_size: Tensor model parallel size, which shards model parameters and activations across GPUs to reduce memory usage. Sequence parallelism (unrelated to Ulysses sequence parallelism) is also enabled by default if the tensor parallel size is greater than 1.
  • megatron_config.pipeline_model_parallel_size: Pipeline model parallel size for sharding model layers across multiple GPUs.
  • megatron_config.context_parallel_size: Context parallel size for reducing activation memory across the sequence length dimension.
  • megatron_config.expert_model_parallel_size: The expert parallel size for sharding expert modules across multiple GPUs.
  • megatron_config.expert_tensor_parallel_size: The tensor parallel size for each expert module. If set to null, then the value will be resolved to tensor_model_parallel_size by Megatron. It is recommended to set this to 1 when enabling expert_model_parallel_size > 1 for the best performance.

Some rules for configuring these parameters:

  • model_size = pp_size * tp_size * cp_size
  • dp_size = world_size / model_size
  • world_size % (pp_size * ep_size * etp_size) == 0
    • This means that ep_size * etp_size can scale independently of tp_size * cp_size, and can go across data parallel ranks.
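
A proposed layout can be sanity-checked against these rules with a few lines of standalone Python (a sketch, not part of skyrl-train):

def check_megatron_layout(world_size, tp, pp, cp, ep, etp=None):
    """Check a parallelism layout against the rules listed above."""
    etp = tp if etp is None else etp  # expert_tensor_parallel_size defaults to tp
    model_size = pp * tp * cp
    assert world_size % model_size == 0, "world_size must be divisible by pp * tp * cp"
    assert world_size % (pp * ep * etp) == 0, "world_size must be divisible by pp * ep * etp"
    return world_size // model_size  # dp_size

# Example: 16 GPUs with tp=2, pp=2, cp=1, ep=4, etp=1 -> dp_size = 4
print(check_megatron_layout(world_size=16, tp=2, pp=2, cp=1, ep=4, etp=1))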

optimizer_config_kwargs.use_precision_aware_optimizer=true can cause checkpointing to fail (see https://github.com/nvidia/megatron-lm/issues/1820). We recommend leaving this set to false.

Optimizer Configuration

For both the critic and policy models, we provide a common optimizer configuration:

optimizer_config:
   lr: 1.0e-6
   adam_betas: [0.9, 0.999]
   weight_decay: 1e-2
   max_grad_norm: 1.0
   offload_after_step: true
   num_warmup_steps: 0
   scheduler: "constant_with_warmup"
  • optimizer_config.lr: Learning rate for the optimizer
  • optimizer_config.adam_betas: Betas for AdamW optimizer.
  • optimizer_config.weight_decay: L2 regularization strength for AdamW.
  • optimizer_config.max_grad_norm: Gradient clipping parameter. The total L2 norm of the model gradients is clipped to at most this value during training.
  • optimizer_config.offload_after_step: Whether to offload optimizer state to CPU after step if colocated. When generation and training workers are colocated, we recommend using the default setting of true. In some cases with non-colocation, it can be desirable to leave optimizer state on GPU memory to avoid offloading costs as well as additional CPU memory usage.
  • optimizer_config.num_warmup_steps: Number of mini-batch steps to warmup the optimizer for.
  • optimizer_config.scheduler: Which learning rate scheduler to use. Intended to align with transformers.SchedulerType from Hugging Face (see the sketch after this list).
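
Since the scheduler names are intended to align with transformers.SchedulerType, a roughly equivalent standalone setup (outside skyrl-train, shown only for intuition) looks like:

import torch
from transformers import get_scheduler

model = torch.nn.Linear(8, 8)  # stand-in model
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1.0e-6, betas=(0.9, 0.999), weight_decay=1e-2
)
scheduler = get_scheduler(
    "constant_with_warmup", optimizer=optimizer, num_warmup_steps=0, num_training_steps=1000
)

loss = model(torch.randn(2, 8)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_grad_norm
optimizer.step()
scheduler.step()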

Policy Configuration

This section configures the policy model used for training, including optimizer, FSDP, sequence parallelism, and LoRA options.

policy:
  model:
    path: "Qwen/Qwen2.5-1.5B-Instruct"  # Hugging Face model path for the policy model
    lora:
      rank: 0                    # LoRA rank (0 = disabled)
      alpha: 16                  # LoRA scaling parameter
      dropout: 0                 # LoRA dropout rate
      lora_sync_path: "/tmp/skyrl_lora_sync"  # Path for LoRA adapter sync
      target_modules: "all-linear"  # Apply to all linear layers OR
      # specify specific modules as a list
      exclude_modules: null  # Modules to exclude from LoRA
      # For FSDP, this corresponds to `init_lora_weights` in PEFT. See: https://huggingface.co/docs/peft/main/en/package_reference/lora#peft.LoraConfig
      # For Megatron, this is used for `lora_A_init_method`, and "xavier", "normal", "kaiming", and "zero" are supported.
      init_method: "kaiming" # Initialization method for LoRA layers
  optimizer_config:
    lr: 1.0e-6  # Learning rate
    adam_betas: [0.9, 0.999]  # Betas for Adam optimizer
    weight_decay: 1e-2  # L2 regularization strength
    max_grad_norm: 1.0  # Gradient clipping
    offload_after_step: true  # Offload optimizer state to CPU after step (if colocated)

  fsdp_config:
    cpu_offload: false  # Offload model params to CPU during forward
    reshard_after_forward: true  # Re-shard FSDP model after forward pass
    fsdp_size: -1  # Auto FSDP group sizing

  sequence_parallel_size: 1  # sequence parallel size

  use_torch_compile: false  # Enable torch compile for the entropy calculation
  record_memory: false  # Dump memory snapshot for debugging

  model_config_kwargs: {}     # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use policy.megatron_config.transformer_config_kwargs instead
  • policy.optimizer_config: Optimizer configuration for the policy model
  • policy.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
  • policy.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.
  • policy.use_torch_compile: Whether to enable torch compile for entropy calculation
  • policy.record_memory: Whether to record memory usage. If True, this will use PyTorch's memory snapshotting utility to record memory usage and dump memory snapshots after each policy model training step.

LoRA Configuration

LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning by training only a small number of additional low-rank matrices instead of the full model weights:

  • policy.model.lora.rank: LoRA rank for low-rank decomposition. Set to 0 to disable LoRA. Higher values increase model capacity but also memory usage. Common values include 8, 16, 32, or 64.
  • policy.model.lora.alpha: Scaling factor for LoRA updates.
  • policy.model.lora.dropout: Dropout probability applied to LoRA layers. Helps prevent overfitting during training.
  • policy.model.lora.lora_sync_path: Directory path where LoRA adapter weights are saved and synchronized between training and inference processes. Must be accessible to all workers in distributed setups.
  • policy.model.lora.init_method: Initialization method for LoRA layers. For FSDP, this corresponds to init_lora_weights in PEFT. 'kaiming' is mapped to 'true' by default for PEFT. For Megatron, this is used for lora_A_init_method, and "xavier", "normal", "kaiming", and "zero" are supported.
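
For intuition, on the FSDP path these fields map closely onto PEFT's LoraConfig; a rough standalone equivalent (illustrative only, not skyrl-train's exact construction) is:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora_config = LoraConfig(
    r=16,                         # policy.model.lora.rank
    lora_alpha=16,                # policy.model.lora.alpha
    lora_dropout=0.0,             # policy.model.lora.dropout
    target_modules="all-linear",  # policy.model.lora.target_modules
    init_lora_weights=True,       # "kaiming" maps to PEFT's default (True)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()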

Critic Configuration

We support similar configuration options as the policy model, including LoRA.

critic:
  model:
    path: null
    lora:
      rank: 0                    # LoRA rank (0 = disabled)
      alpha: 16                  # LoRA scaling parameter
      dropout: 0                 # LoRA dropout rate
      target_modules: "all-linear"
      exclude_modules: null  # Modules to exclude from LoRA
      init_method: "kaiming" # Initialization method for LoRA layers
  optimizer_config:
    lr: 5.0e-6
    adam_betas: [0.9, 0.999]
    weight_decay: 1e-2
    max_grad_norm: 1.0 # gradient clipping
    offload_after_step: true # offload optimizer state to cpu after each step. Applicable only when `colocate_all=true`
  fsdp_config:
    cpu_offload: false
    reshard_after_forward: true
    fsdp_size: -1
  sequence_parallel_size: 1
  model_config_kwargs: {} # pass through kwargs to the HuggingFace model config (i.e. for overriding vocab size, etc)

Reference Model Configuration

ref:
  model:
    path: ${trainer.policy.model.path}
  fsdp_config:
    cpu_offload: false
    reshard_after_forward: true
    fsdp_size: -1
  sequence_parallel_size: 1
  model_config_kwargs: {}     # pass through kwargs to the HuggingFace model config for FSDP training backends (i.e. for overriding vocab size, etc) - for megatron, use ref.megatron_config.transformer_config_kwargs instead
  • ref.model.path: Path to the reference model. Defaults to the policy model path, but can be set separately (e.g., for distillation-based approaches, the reference model can be a different model than the policy model).
  • ref.fsdp_config: FSDP configuration, applicable if trainer.strategy='fsdp'.
  • ref.sequence_parallel_size: Sequence parallel size. We implement Ulysses sequence parallelism.

The reference model is used only if the base model log probabilities are required either as a part of the training loss or as a part of the reward. Thus, trainer.algorithm.use_kl_in_reward or trainer.algorithm.use_kl_loss should be set to true to use the reference model. If both are false, then the reference model is not instantiated.

Algorithm Configuration

algorithm:
  advantage_estimator: "grpo"  # "grpo", "gae", or customizable with AdvantageEstimatorRegistry

  # KL Penalty Parameters
  kl_ctrl: # only used if use_kl_in_reward is true (not applied in the case of use_kl_loss=true) - uses kl_loss_coef as the initial KL coefficient
    type: "fixed" # "fixed" or "adaptive"
    kl_target: 0.1 # target KL divergence for adaptive KL controller
    horizon: 10000 # controls the update rate of the adaptive KL controller

  kl_estimator_type: "k3" # "k1", "k2", "k3", "abs" - see http://joschu.net/blog/kl-approx.html for details

  # note: use_kl_in_reward and use_kl_loss should be mutually exclusive
  use_kl_in_reward: false # apply kl loss to rewards
  use_kl_loss: true # used in policy model
  kl_loss_coef: 0.001
  # this adds training batch level normalization to advantages
  advantage_batch_normalize: false
  value_head_prefix: "value_head"
  policy_loss_type: "regular" # "regular", "dual_clip", "gspo", "clip_cov", "kl_cov" or customizable with PolicyLossRegistry
  loss_reduction: "token_mean" # "token_mean", "sequence_mean", "seq_mean_token_sum_norm"
  grpo_norm_by_std: true # set to false to disable normalization by std in GRPO (used in Dr. GRPO)
  zero_variance_filter: false # set to true to loss mask out prompts with zero variance rewards. only applicable when rewards are response-level.

  # GAE parameters
  lambd: 1.0
  gamma: 1.0

  # PPO parameters
  eps_clip_low: 0.2
  eps_clip_high: 0.2
  # dual clip parameters
  clip_ratio_c: 3.0

  # clip-cov parameters (only used when policy_loss_type: "clip_cov")
  clip_cov:
    clip_ratio: 0.0002 # fraction of tokens to clip based on covariance
    clip_cov_lb: 1.0 # lower bound for covariance clipping
    clip_cov_ub: 5.0 # upper bound for covariance clipping

  # kl-cov parameters (only used when policy_loss_type: "kl_cov")
  kl_cov:
    kl_cov_frac: 0.2 # percentage of tokens to apply KL regularization to (20%)
    ppo_kl_coef: 1.0 # coefficient for KL regularization term

  # cispo parameters (only used when policy_loss_type: "cispo")
  cispo: 
    cispo_eps_clip_low: 0  # offset for lower bound of importance sampling ratio clipping (as opposed to PPO token update clipping)
    cispo_eps_clip_high: 5 # offset for upper bound of importance sampling ratio clipping (as opposed to PPO token update clipping)

  # value loss parameters
  value_clip: 0.2

  # dynamic sampling parameters
  dynamic_sampling:
    type: null # filter (DAPO), replace (POLARIS/WebSailor), or null
    max_sample_batches: 30 # sample at most this many batches before stopping, -1 to sample forever
    min_replace_ratio: 0.3 # minimum proportion of good samples with which to replace bad samples (for replace strategy only)

  # Truncated Importance Sampling as proposed in https://fengyao.notion.site/off-policy-rl 
  use_tis: false 
  tis_imp_ratio_cap: -1.0

  # SAPO parameters (only used when policy_loss_type: "sapo") (https://arxiv.org/pdf/2511.20347)
  sapo:
    tau_pos: 1.0
    tau_neg: 1.05 # default values used in the paper with Qwen3-30B-A3B-Base
  • algorithm.advantage_estimator: Advantage estimator to use. We currently implement grpo, gae, rloo, and reinforce++; custom advantage estimators can be registered with the AdvantageEstimatorRegistry.

  • algorithm.kl_ctrl: Configuration for the KL controller - only used if use_kl_in_reward is true (not applied when use_kl_loss is true). kl_loss_coef is used as the initial KL coefficient for both the fixed and adaptive KL controllers.

    • type: Type of KL controller to use. Options include: fixed or adaptive.

    • kl_target: Target KL divergence for adaptive KL controller.

    • horizon: Controls the update rate of the adaptive KL controller.

  • algorithm.kl_estimator_type: KL estimator type to use. Options include: k1, k2, k3, abs. See this blog post for details; we use k3 as the default. A sketch of these estimators is shown after this list.

  • algorithm.use_kl_in_reward: Whether to apply KL divergence penalty to rewards. The new rewards will be computed as rewards - kl * kl_loss_coef.

  • algorithm.use_kl_loss: Whether to add a KL divergence loss to the policy model. The policy loss will be computed as policy_loss + kl * kl_loss_coef.

  • algorithm.kl_loss_coef: Coefficient for the KL divergence loss.

  • algorithm.advantage_batch_normalize: Whether to normalize advantages by the (global) batch mean and standard deviation.

  • algorithm.value_head_prefix: The name used to identify the value head in the critic model.

  • algorithm.policy_loss_type: Type of policy loss to use. Options include:

    • regular: Vanilla PPO loss with token-level importance sampling
    • dual_clip: Dual clip PPO loss proposed in this paper
    • gspo: Group Sequence Policy Optimization with sequence-level importance sampling for improved training stability. Implements the "GSPO-token" variant from the paper.
    • clip_cov: Clip-Cov combines standard PPO clipping with covariance-based correction masking for improved stability. Based on this paper.
    • kl_cov: KL-Cov applies KL regularization to tokens selected based on covariance values. Based on this paper.
    • cispo: Clipped Importance Sampling Weight Policy Optimization (CISPO) proposed in MiniMax-M1.
    • Custom policy losses can be registered with the PolicyLossRegistry
  • algorithm.loss_reduction: Type of loss reduction to use. Options include:

    • token_mean: computes average loss over all valid tokens in the batch. Used in DAPO.
    • sequence_mean: computes per-sequence avg token loss, then averages over the batch.
    • seq_mean_token_sum_norm: computes the sum of token losses for each sequence, normalizes by the max sequence length (computed as cfg.generator.max_input_length + cfg.generator.sampling_params.max_generate_length), and then averages over the batch. This is used in Dr. GRPO.
  • algorithm.grpo_norm_by_std: Whether to normalize advantages by the standard deviation in GRPO. This is set to false in Dr. GRPO.

  • algorithm.zero_variance_filter: Whether to loss mask out prompts with zero variance rewards. This is only applicable when rewards are response-level.

  • algorithm.lambd: Lambda parameter for GAE.

  • algorithm.gamma: Gamma parameter for GAE.

  • algorithm.eps_clip_low: Lower bound for PPO clipping.

  • algorithm.eps_clip_high: Upper bound for PPO clipping.

  • algorithm.clip_ratio_c: Clip ratio for dual clip PPO loss.

  • algorithm.value_clip: Clip value for value loss.

  • algorithm.dynamic_sampling: Dynamic sampling configuration.

    • algorithm.dynamic_sampling.type: Type of dynamic sampling to use. Currently, we support filter (DAPO), replace (POLARIS / WebSailor), or null for no dynamic sampling.
    • algorithm.dynamic_sampling.max_sample_batches: Maximum number of batches to sample before stopping. Set to -1 to sample forever.
    • algorithm.dynamic_sampling.min_replace_ratio: Minimum proportion of good samples with which to replace bad samples for replace strategy.
  • algorithm.use_tis: Whether to use Truncated Importance Sampling (TIS) as proposed in this blog.

  • algorithm.tis_imp_ratio_cap: Cap parameter for the importance ratio in TIS.

  • algorithm.clip_cov: Clip-Cov parameters (only used when policy_loss_type is clip_cov):

    • clip_ratio: Fraction of tokens to clip based on covariance values.
    • clip_cov_lb: Lower bound for covariance clipping.
    • clip_cov_ub: Upper bound for covariance clipping.
  • algorithm.kl_cov: KL-Cov parameters (only used when policy_loss_type is kl_cov):

    • kl_cov_frac: Percentage of tokens to apply KL regularization to.
    • ppo_kl_coef: Coefficient for KL regularization term.
  • algorithm.cispo: CISPO parameters (only used when policy_loss_type is cispo):

    • cispo_eps_clip_low: Offset for lower bound of importance sampling ratio clipping. Tokens with importance sampling ratio less than 1 - cispo_eps_clip_low will have their ratio clipped, but can still be updated in the policy gradient update.
    • cispo_eps_clip_high: Offset for upper bound of importance sampling ratio clipping. Tokens with importance sampling ratio greater than 1 + cispo_eps_clip_high will have their ratio clipped, but can still be updated in the policy gradient update.
  • algorithm.sapo: SAPO (as proposed in this paper) parameters (only used when policy_loss_type is sapo):

    • tau_pos: Temperature for gating function for tokens with positive advantages.
    • tau_neg: Temperature for gating function for tokens with negative (or zero) advantages.
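
To make the kl_estimator_type options concrete, here is a small sketch of the k1, k2, k3, and abs estimators from the referenced blog post, written for per-token log probabilities under the policy and the reference model (illustrative only, not skyrl-train's exact implementation):

import torch

def kl_penalty(log_probs: torch.Tensor, ref_log_probs: torch.Tensor, kind: str) -> torch.Tensor:
    """Per-token estimators of KL(policy || ref), following http://joschu.net/blog/kl-approx.html."""
    log_ratio = log_probs - ref_log_probs            # log(pi(x) / ref(x)) at the sampled tokens
    if kind == "k1":
        return log_ratio                             # unbiased, high variance
    if kind == "k2":
        return 0.5 * log_ratio.pow(2)                # biased, low variance
    if kind == "k3":
        return (-log_ratio).exp() - 1.0 + log_ratio  # unbiased, lower variance (the default here)
    if kind == "abs":
        return log_ratio.abs()
    raise ValueError(f"unknown kl_estimator_type: {kind}")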

Policy Loss Formulation

It can be helpful to understand the final loss formulation to see how the different configuration options are used. The final loss is computed as below in the ppo_policy_loss function.

from typing import Optional, Tuple

import torch
from omegaconf import DictConfig


def ppo_policy_loss(
    log_probs: torch.Tensor,
    old_log_probs: torch.Tensor,
    advantages: torch.Tensor,
    config: DictConfig,  # trainer.algorithm config
    loss_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, float]:
    # masked_mean and reduce_loss are skyrl-train helpers (see the sketch of reduce_loss below).
    ratio = (log_probs - old_log_probs).exp()
    surr1 = ratio * advantages
    surr2 = ratio.clamp(1 - config.eps_clip_low, 1 + config.eps_clip_high) * advantages
    loss = -torch.min(surr1, surr2)
    clip_ratio = masked_mean((-surr2 > -surr1).float(), loss_mask).mean().detach().item()
    clip_pg_losses1 = loss
    if config.policy_loss_type == "dual_clip":
        pg_losses3 = -advantages * config.clip_ratio_c
        clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
        loss = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    loss = reduce_loss(loss, loss_mask, config.loss_reduction)
    return loss, clip_ratio
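
The reduce_loss helper is not shown above; a rough sketch of the three loss_reduction modes it implements, following the descriptions in the algorithm section (the max-length argument is an assumption about how the normalizer is passed in), might look like:

from typing import Optional

import torch

def reduce_loss(loss: torch.Tensor, loss_mask: torch.Tensor, mode: str,
                max_seq_len: Optional[int] = None) -> torch.Tensor:
    """Sketch of the loss_reduction modes; loss and loss_mask have shape (batch, seq_len)."""
    if mode == "token_mean":
        # Average over all valid tokens in the batch.
        return (loss * loss_mask).sum() / loss_mask.sum().clamp(min=1)
    if mode == "sequence_mean":
        # Per-sequence mean over valid tokens, then mean over the batch.
        per_seq = (loss * loss_mask).sum(-1) / loss_mask.sum(-1).clamp(min=1)
        return per_seq.mean()
    if mode == "seq_mean_token_sum_norm":
        # Per-sequence token-loss sum normalized by the max sequence length
        # (max_input_length + max_generate_length), then mean over the batch (Dr. GRPO).
        per_seq = (loss * loss_mask).sum(-1) / max_seq_len
        return per_seq.mean()
    raise ValueError(f"unknown loss_reduction: {mode}")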

Generator Configuration

generator:
  model_dtype: "bfloat16" # should match dtype for inference engine
  run_engines_locally: true
  num_inference_engines: 1
  backend: "vllm"
  weight_sync_backend: "nccl"
  inference_engine_tensor_parallel_size: 4
  inference_engine_pipeline_parallel_size: 1
  inference_engine_expert_parallel_size: 1  
  inference_engine_data_parallel_size: 1
  n_samples_per_prompt: 5
  async_engine: true
  batched: true
  max_input_length: ${trainer.max_prompt_length} # max generator input length used for multi-turn conversations - for single turn set equal to max_prompt_length
  enable_prefix_caching: true
  enable_chunked_prefill: true
  max_num_batched_tokens: 8192
  enforce_eager: false
  gpu_memory_utilization: 0.8
  max_num_seqs: 1024
  remote_inference_engine_urls: ["127.0.0.1:8001"]
  max_turns: 1

  # Custom chat template configuration if needed
  chat_template:
    source: "name"  # "name" or "file"
    name_or_path: null  # e.g., "qwen3_with_thinking" or "/path/to/template.j2"

  # Chat templating kwargs to pass to `tokenizer.apply_chat_template`
  chat_template_kwargs: {}

  engine_init_kwargs: {}

  override_existing_update_group: "auto" # "auto", "enable", "disable"
  # sampling params for generation phase
  sampling_params:
    max_generate_length: 1024
    temperature: 1.0
    top_p: 1.0
    min_p: 0.0
    top_k: -1
    logprobs: 0

  use_conversation_multi_turn: true

  # sampling params for evaluation
  eval_sampling_params:
    max_generate_length: ${generator.sampling_params.max_generate_length}
    temperature: 1.0
    top_p: 1.0
    min_p: 0.0
    top_k: -1
    logprobs: 0

  # number of samples per prompt for evaluation
  eval_n_samples_per_prompt: 1

  zero_reward_on_non_stop: false

  apply_overlong_filtering: false

Inference Engine Placement Configuration

  • generator.run_engines_locally: Whether to use local inference engines. If true, the inference engine will be initialized during the training run in the current Ray cluster. We use one Ray actor per inference replica and communication will happen via Ray object store. If set to false, then the generator expects a list of remote urls and communication will happen over HTTP.
  • generator.num_inference_engines: Number of inference engines to use. If run_engines_locally is false, then this number should match the number of remote urls.
  • generator.remote_inference_engine_urls: List of remote urls to use. Applicable only when run_engines_locally is false.
  • generator.enable_http_endpoint: When true, launch an OpenAI-compatible HTTP endpoint for the inference engine client so that generators can send requests to this server instead of using .generate() Python calls.
  • generator.http_endpoint_host: Host for the inference HTTP endpoint.
  • generator.http_endpoint_port: Port for the inference HTTP endpoint.

For more details on how different placement options work, please refer to the placement guide.

Weight Transfer Configuration

  • generator.weight_sync_backend: Backend to use for weight synchronization. Currently, we support nccl and gloo.
  • generator.override_existing_update_group: Whether to override the existing update group for the inference engine. This is applicable only for remote inference engines. During training, skyrl-train forms a custom process group ("update group") with the rank 0 training worker and all the inference engine ranks. If override_existing_update_group=enable, then during initialization a previous weight update group will be overridden in the inference engine. For example, if you have a remote server setup and you run training for the same model multiple times, it is helpful to override the previous update group. We recommend leaving this as auto, since it will automatically determine whether the previous update group should be overridden based on run_engines_locally.

Inference Engine Configuration

  • generator.backend: Backend to use for the inference engine. We support vllm and sglang. sglang is supported only for remote inference engines at the moment.
  • generator.model_dtype: Dtype used for the inference engine. This is also used during weight transfer - the policy model weights are cast to this dtype before being sent to the inference engine.
  • generator.async_engine: Whether to use an asynchronous/offline inference engine. Applicable only when backend="vllm".
  • generator.inference_engine_tensor_parallel_size: Tensor parallel size for the inference engine.
  • generator.inference_engine_pipeline_parallel_size: Pipeline parallel size for the inference engine. Currently, PP is only supported for vLLM backend with async_engine=true.
  • generator.inference_engine_expert_parallel_size: Expert parallel size for the inference engine. Currently, EP is only supported for vLLM backend and ep_size must equal dp_size * tp_size.
  • generator.inference_engine_data_parallel_size: Data parallel size for the inference engine. Currently, DP is only supported for vLLM backend.
  • generator.gpu_memory_utilization: GPU memory utilization for the inference engine. Applicable only for run_engines_locally=true.
  • generator.vllm_v1_disable_multiproc: If true, this will set VLLM_ENABLE_V1_MULTIPROCESSING=0 in the environment, which makes the scheduling deterministic. This is useful for reproducibility.
  • generator.enable_prefix_caching: Whether to enable prefix caching for the inference engine. Applicable only when backend="vllm". This can be left to the default true in most cases. Note that in the case of remote inference engines, you would need to match the setting used when you initialized the remote servers.
  • generator.enable_chunked_prefill: Whether to enable chunked prefill for the inference engine. Applicable only when backend="vllm". With vLLM, this can be left to the default true in most cases.
  • generator.max_num_seqs: Continuous batching parameter for vLLM. Maximum number of sequences to pack into a batch.
  • generator.max_num_batched_tokens: Continuous batching parameter for vLLM. Maximum number of tokens to pack into a batch.

Generation Parameters

  • generator.n_samples_per_prompt: Number of samples to generate per prompt. Note that the total size of the training batch will be trainer.train_batch_size * generator.n_samples_per_prompt.

  • generator.batched: Whether to use batched inference. This is applicable only for single turn generation.

  • generator.max_input_length: Maximum input length for the inference engine. For single-turn generation, this can be the same as trainer.max_prompt_length (i.e., the initial prompt length). For multi-turn generation, this is the maximum input length at each turn of the conversation.

  • generator.sampling_params: Sampling parameters for the inference engine during the trajectory generation phase (see the sketch after this list for how these map onto vLLM).

    • generator.sampling_params.max_generate_length: Maximum length of the generated response.
    • generator.sampling_params.temperature: Temperature for the inference engine.
    • generator.sampling_params.top_p: Top-p sampling parameter for the inference engine.
    • generator.sampling_params.min_p: Min-p sampling parameter for the inference engine, as proposed in this paper.
    • generator.sampling_params.top_k: Top-k sampling parameter for the inference engine.
    • generator.sampling_params.logprobs: Number of logprobs to return from the inference engine. Set to 0 to return only the chosen token's logprob.
  • generator.eval_sampling_params: Sampling parameters for evaluation.

  • generator.eval_n_samples_per_prompt: Number of samples to generate per prompt for evaluation.

  • generator.max_turns: Maximum number of turns for generation with multi-turn RL.

  • generator.use_conversation_multi_turn: Whether to use conversation format for multi-turn generation. If set to true, observations are appended to the chat history as a new turn. If set to false, observations are appended as-is to the assistant response in token space and generation is continued (after removing any EOS token in the response). We've observed cases where the model can be sensitive to the chat history format (e.g., in SkyRL-SQL), so false can be used for full control over the exact tokens added after environment interaction.

  • generator.engine_init_kwargs: Inference engine arguments passed directly to the vLLM or SGLang engine. To specify an engine arg in the CLI override, use the format: +generator.engine_init_kwargs.[arg_name]=value. If duplicate kwargs are passed or kwargs clash with existing generator arguments (e.g., tensor_parallel_size), an error is raised.

  • generator.chat_template: Custom chat template configuration if needed.

    • generator.chat_template.source: Source of the chat template. Can be either name or file.
    • generator.chat_template.name_or_path: Name or path of the chat template. If the source is name, then it should be one of the supported templates in skyrl_train/generators/utils.py. If the source is file, then this field should be a path to a Jinja2 template file.
  • generator.chat_template_kwargs: Chat templating kwargs to pass to tokenizer.apply_chat_template. Applicable only for non-batched generation with generator.batched=false.
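
For intuition, with the vllm backend these fields correspond roughly to vLLM's request-level SamplingParams (shown below purely as an illustration; skyrl-train constructs these internally):

from vllm import SamplingParams

sampling_params = SamplingParams(
    max_tokens=1024,   # generator.sampling_params.max_generate_length
    temperature=1.0,
    top_p=1.0,
    min_p=0.0,
    top_k=-1,          # -1 disables top-k
    logprobs=0,        # return only the chosen token's logprob
)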

Misc Configuration

  • generator.zero_reward_on_non_stop: Whether to set the reward to 0 if the stop_reason is not stop. This is useful when, for example, format rewards are used: if the LLM did not finish its response, we typically do not want to reward it. This is a general setting for all environments.
  • generator.apply_overlong_filtering: Whether to apply DAPO Overlong Filtering to the loss masks. For each trajectory that exceeds the max length (i.e., is truncated and does not end with an EOS token), every token in the loss mask is masked out (see the sketch after this list).
  • generator.step_wise_trajectories: Whether to return outputs in a step-wise fashion. If true, then the generator will return multi-turn generations with the (prompt, response) pair of each turn being a separate trajectory. Advantages are computed based on the last step of each trajectory and propagated to the previous steps.
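
As a sketch of the idea behind zero_reward_on_non_stop and apply_overlong_filtering (not skyrl-train's implementation; the "stop"/"length" stop-reason strings are assumptions based on common inference engine conventions):

from typing import List, Tuple

def postprocess_trajectories(
    rewards: List[float],
    stop_reasons: List[str],
    loss_masks: List[List[int]],
    zero_reward_on_non_stop: bool,
    apply_overlong_filtering: bool,
) -> Tuple[List[float], List[List[int]]]:
    if zero_reward_on_non_stop:
        # Zero the reward for any trajectory that did not finish with a "stop" reason.
        rewards = [r if sr == "stop" else 0.0 for r, sr in zip(rewards, stop_reasons)]
    if apply_overlong_filtering:
        # Mask out every token of truncated trajectories (no EOS, e.g. a "length" stop reason).
        loss_masks = [
            mask if sr == "stop" else [0] * len(mask)
            for mask, sr in zip(loss_masks, stop_reasons)
        ]
    return rewards, loss_masks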
