E2E Recipes with SkyRL
We provide a collection of end-to-end recipes for single- and multi-turn RL training with SkyRL, along with reproduction runs for the following:
- Simple training on GSM8K
- Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
- SkyRL-SQL
- SearchR1
Simple training on GSM8K
The scripts for training on GSM8K are available at examples/gsm8k/.
| Backend | Model | Eval Accuracy | Hardware | Training Steps | Commit | WandB |
|---|---|---|---|---|---|---|
| FSDP2 | Qwen/Qwen2.5-1.5B-Instruct | 0.796 | 4xH100 | 140 | a95b699 | Link |
| DeepSpeed | Qwen/Qwen2.5-1.5B-Instruct | 0.791 | 4xH100 | 140 | a95b699 | Link |
DAPO Recipes
The code for the DAPO recipe is available at examples/algorithms/dapo/.
For evals we report Pass@32 and Mean@32. The WandB metrics log "avg_score"; since reward is either -1 or 1 for the AIME task, mean@32 can be computed as mean@32 = (avg_score + 1) / 2. The table below reports the peak mean@32 and pass@32 over the course of each run. All runs use DAPO without Dynamic Sampling enabled (i.e., only clip-higher, overlong buffer, overlong filtering, and token-level loss aggregation).
All results can be reproduced with commit 8263149145f2455b75c082f3280d344b8a554f5d, and the WandB report for all runs is available here.
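As a worked sketch of the metric conversion above (not part of the SkyRL codebase; function names are illustrative), mean@32 and pass@32 can be recovered from per-sample rewards in {-1, 1} as follows:

```python
def mean_at_k(rewards):
    """mean@k for rewards in {-1, 1}: fraction of correct samples.

    Equivalent to (avg_score + 1) / 2, since a reward of 1 marks a
    correct sample and -1 an incorrect one.
    """
    avg_score = sum(rewards) / len(rewards)
    return (avg_score + 1) / 2


def pass_at_k(rewards):
    """pass@k for rewards in {-1, 1}: 1.0 if any sample is correct."""
    return 1.0 if any(r == 1 for r in rewards) else 0.0


# Example: 32 samples per problem, 12 of them correct.
rewards = [1] * 12 + [-1] * 20
print(mean_at_k(rewards))  # 0.375 (avg_score is -0.25)
print(pass_at_k(rewards))  # 1.0
```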
| Recipe | Model | Training Backend | AIME24 Pass@32 | AIME24 Mean@32 | Hardware | Training Steps (at peak mean@32) |
|---|---|---|---|---|---|---|
| DAPO | Qwen/Qwen2.5-32B | FSDP2 | 0.766 | 0.381 | 2x8xH100 | 260 |
| DAPO | Qwen/Qwen3-30B-A3B-Base | Megatron (tp=4, ep=8) | 0.733 | 0.4375 | 2x8xH100 | 120 |
| DAPO + LoRA (rank 128, alpha 128) | Qwen/Qwen3-30B-A3B-Base | Megatron (tp=4, ep=8) | 0.8 | 0.433 | 8xH100 | 165 |
| DAPO | Qwen/Qwen2.5-Math-7B | FSDP2 | 0.633 | 0.348 | 8xH100 | 320 |
| DAPO | Qwen/Qwen3-1.7B-Base | FSDP2 | 0.366 | 0.144 | 8xH100 | 285 |
| DAPO + 0-Var Filtering | Qwen/Qwen3-1.7B-Base | FSDP2 | 0.433 | 0.169 | 8xH100 | 185 |
| DAPO | Qwen/Qwen3-4B-Base | FSDP2 | 0.6 | 0.254 | 8xH100 | 110 |
| DAPO | Qwen/Qwen3-4B-Base | Megatron (tp=4, pp=2) | 0.633 | 0.246 | 8xH100 | 210 |
| DAPO + LoRA (rank 32, alpha 64) | Qwen/Qwen3-4B-Base | Megatron (tp=4, pp=1) | 0.566 | 0.209 | 8xH100 | 160 |
SkyRL-SQL Recipes
For more details, please refer to SkyRL-SQL Recipe.
We provide two reference runs, single-turn and multi-turn training, for Qwen/Qwen2.5-Coder-7B-Instruct (run on one 8xH100 node until convergence), with the WandB report here.
The evaluation results are shown below (using the evaluation code here):
| Eval Turns (Train) | Training Method | Spider-Dev | Spider-Test | Spider-Realistic | Spider-DK | Spider-Syn | Avg | WandB |
|---|---|---|---|---|---|---|---|---|
| 1 | Single-Turn | 81.2 | 83.8 | 76.8 | 67.9 | 70.1 | 76.0 | Link |
| 1 | Multi-Turn | 82.4 (+1.2%) | 83.7 (-0.1%) | 80.3 (+3.5%) | 70.5 (+2.6%) | 71.2 (+1.1%) | 77.6 (+1.6%) | Link |
| 5 | Single-Turn | 79.5 | 82.2 | 77.6 | 65.6 | 68.4 | 74.7 | Link |
| 5 | Multi-Turn | 83.9 (+4.4%) | 85.2 (+3%) | 81.1 (+3.5%) | 72.0 (+6.4%) | 73.7 (+5.3%) | 79.2 (+4.5%) | Link |
SearchR1 Recipes
For more details, please refer to SearchR1 Recipe.
The WandB report is available here.
Qwen/Qwen2.5-3B-Instruct
The evaluation results are shown below for Qwen/Qwen2.5-3B-Instruct, with all experiments run on one 8xH100 node until convergence (330 training steps).
| Dataset | Search-R1 (3 turns) | SkyRL + SearchR1 (2 turns) | SkyRL + SearchR1 (3 turns) | SkyRL + SearchR1 (4 turns) |
|---|---|---|---|---|
| NQ† | 0.397 | 0.455 | 0.449 | 0.449 |
| TriviaQA† | 0.565 | 0.613 | 0.616 | 0.611 |
| PopQA† | 0.391 | 0.447 | 0.444 | 0.435 |
| HotpotQA* | 0.331 | 0.334 | 0.417 | 0.407 |
| 2wiki* | 0.310 | 0.313 | 0.396 | 0.403 |
| Musique* | 0.124 | 0.086 | 0.179 | 0.163 |
| Bamboogle* | 0.232 | 0.242 | 0.448 | 0.352 |
| Average | 0.336 | 0.356 | 0.421 | 0.403 |
Qwen/Qwen3-30B-A3B
Evaluation results for Qwen3-30B-A3B on SearchR1 are shown below, with experiments run on four 8xH100 nodes using the Megatron backend. These results can be reproduced with commit 9b878cd.
| Dataset | SkyRL + SearchR1 (4 turns) |
|---|---|
| NQ† | 0.463 |
| TriviaQA† | 0.664 |
| PopQA† | 0.448 |
| HotpotQA* | 0.412 |
| 2wiki* | 0.361 |
| Musique* | 0.178 |
| Bamboogle* | 0.488 |
| Average | 0.457 |