PPO on GSM8K

This example uses the fsdp backend.

This example demonstrates how to run PPO training on GSM8K. See Quick Start Guide for a similar setup process for GRPO!

Dataset Preparation

To download and prepare the GSM8K dataset, run the following script.

uv run --isolated examples/train/gsm8k/gsm8k_dataset.py --output_dir $HOME/data/gsm8k

Training Configuration

Next, let's set up the training configuration. You can find a complete example in examples/train/ppo/run_ppo.sh - we highlight some key parameters here:

uv run --isolated --extra fsdp -m skyrl.train.entrypoints.main_base \
   # Data setup
   data.train_data="['$HOME/data/gsm8k/train.parquet']" \
   data.val_data="['$HOME/data/gsm8k/validation.parquet']" \

   # Trainer and training algorithm
   trainer.algorithm.advantage_estimator="gae" \
   trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
   trainer.critic.model.path="Qwen/Qwen2.5-1.5B-Instruct" \

   # Model placement and training strategy (colocate or disaggregate, sharding, etc.)
   trainer.strategy=fsdp2 \
   trainer.placement.colocate_all=true \
   trainer.placement.policy_num_gpus_per_node=4 \
   trainer.placement.critic_num_gpus_per_node=4 \

   # Batch sizes for critic and policy forward and training passes
   trainer.policy_mini_batch_size=256 \
   trainer.critic_mini_batch_size=256 \
   trainer.micro_forward_batch_size_per_gpu=64 \
   trainer.micro_train_batch_size_per_gpu=64 \

   # Evaluation and checkpointing
   trainer.eval_batch_size=1024 \
   trainer.eval_before_train=true \
   trainer.eval_interval=5 \
   trainer.ckpt_interval=10 \

   # Generator setup for spinning up InferenceEngines
   generator.inference_engine.backend=vllm \
   generator.inference_engine.num_engines=4 \
   generator.inference_engine.tensor_parallel_size=1 \
   generator.inference_engine.weight_sync_backend=nccl \

   # Environment class for the dataset
   # Can be specified here to apply to the full dataset, or at the per-prompt level during preprocessing
   environment.env_class=gsm8k \

   ... # Other parameters (see `examples/train/ppo/run_ppo.sh` for more)

Hardware Configuration

Depending on your hardware setup, you may want to adjust a few parameters:

GPU Configuration: Set trainer.placement.policy_num_gpus_per_node and trainer.placement.critic_num_gpus_per_node to match your available GPU count. Note that since we are setting trainer.placement.colocate_all to true, the total number of GPUs used for the policy model and critic model should be the same. Additionally generator.inference_engine.num_engines * generator.inference_engine.tensor_parallel_size should also be equal to the total number of GPUs used for the policy and critic models.

Launching Your Training Run

Let's get our PPO training run started! First, configure your WandB API key for logging:

export WANDB_API_KEY=your_wandb_api_key

Then launch your training run:

bash examples/train/ppo/run_ppo.sh

Congratulations! You've just launched your first PPO training run!

What's Next?

Now that you've got basic colocated PPO training down, you might want to explore some more advanced features:

Async Training: Asynchronous off-by-one training in < 100 lines of code!
Remote Inference Engine: Training with a remote inference engine