
E2E Recipes with SkyRL

We provide a collection of end-to-end recipes for single- and multi-turn RL training with SkyRL.

Reproduction runs are available for the following recipes:

  1. Simple training on GSM8K
  2. Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)
  3. SkyRL-SQL
  4. SearchR1

Simple training on GSM8K

The scripts for training on GSM8K are available at examples/gsm8k/.

| Backend | Model | Eval Accuracy | Hardware | Training Steps | Commit | WandB |
|---|---|---|---|---|---|---|
| FSDP2 | Qwen/Qwen2.5-1.5B-Instruct | 0.796 | 4xH100 | 140 | a95b699 | Link |
| DeepSpeed | Qwen/Qwen2.5-1.5B-Instruct | 0.791 | 4xH100 | 140 | a95b699 | Link |
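
Training and eval on GSM8K typically score each rollout by rule-based answer correctness. Below is a minimal sketch of a GSM8K-style exact-match reward (illustrative only: the `####` delimiter is the GSM8K dataset's answer format, and `gsm8k_reward` is a hypothetical helper rather than the exact function used by the scripts):

```python
import re

def gsm8k_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the last number in the response matches the gold
    answer, else 0.0. GSM8K gold answers end with '#### <number>'."""
    gold = gold_answer.split("####")[-1].strip().replace(",", "")
    numbers = re.findall(r"-?\d+\.?\d*", response.replace(",", ""))
    if not numbers:
        return 0.0
    try:
        return 1.0 if float(numbers[-1]) == float(gold) else 0.0
    except ValueError:
        return 0.0
```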

DAPO Recipes

The code for the DAPO recipe is available at examples/algorithms/dapo/.

For evals we report pass@32 and mean@32. In the WandB metrics we log "avg_score"; since the reward is either -1 or 1 for the AIME task, mean@32 can be computed as mean@32 = (avg_score + 1) / 2. In the table below we report the peak mean@32 and pass@32 over the course of each run. All runs use DAPO, but without Dynamic Sampling enabled (i.e., only clip-higher, the overlong buffer, overlong filtering, and token-level loss aggregation).
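
As a concrete example of these metrics (an illustrative snippet; `scores` is a hypothetical array of per-sample rewards, not a SkyRL API):

```python
import numpy as np

def aime_metrics(scores: np.ndarray) -> tuple[float, float]:
    """scores: [num_problems, 32] array of per-sample rewards in {-1, 1}."""
    correct = scores == 1
    mean_at_32 = correct.mean()              # equals (avg_score + 1) / 2
    pass_at_32 = correct.any(axis=1).mean()  # solved at least once per problem
    return float(mean_at_32), float(pass_at_32)

# Converting a logged avg_score by hand:
avg_score = -0.124
print((avg_score + 1) / 2)  # mean@32 = 0.438
```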

All results can be reproduced with commit 8263149145f2455b75c082f3280d344b8a554f5d, and the WandB report for all runs is available here.
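
For reference, here is a minimal sketch of the clip-higher objective with token-level loss aggregation, following the DAPO paper's formulation (the eps defaults mirror the paper; this is illustrative PyTorch pseudocode, not the exact SkyRL loss implementation):

```python
import torch

def dapo_policy_loss(logprobs, old_logprobs, advantages, mask,
                     eps_low=0.2, eps_high=0.28):
    """Clip-higher PPO-style objective with token-level loss aggregation.

    All inputs are [batch, seq_len] tensors; `mask` is 1 on response
    tokens and 0 elsewhere. eps_high > eps_low is the clip-higher trick.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token_loss = -torch.minimum(unclipped, clipped)
    # Token-level aggregation: normalize by the total number of response
    # tokens in the batch rather than averaging per sequence first.
    return (per_token_loss * mask).sum() / mask.sum()
```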

| Recipe | Model | Training Backend | AIME24 Pass@32 | AIME24 Mean@32 | Hardware | Training Steps (at peak mean@32) |
|---|---|---|---|---|---|---|
| DAPO | Qwen/Qwen-2.5-32B | FSDP2 | 0.766 | 0.381 | 2x8xH100 | 260 |
| DAPO | Qwen/Qwen3-30B-A3B-Base | Megatron (tp=4, ep=8) | 0.733 | 0.4375 | 2x8xH100 | 120 |
| DAPO + LoRA (rank 128, alpha 128) | Qwen/Qwen3-30B-A3B-Base | Megatron (tp=4, ep=8) | 0.8 | 0.433 | 8xH100 | 165 |
| DAPO | Qwen/Qwen-2.5-7B-Math | FSDP2 | 0.633 | 0.348 | 8xH100 | 320 |
| DAPO | Qwen/Qwen3-1.7B-Base | FSDP2 | 0.366 | 0.144 | 8xH100 | 285 |
| DAPO + 0-Var Filtering | Qwen/Qwen3-1.7B-Base | FSDP2 | 0.433 | 0.169 | 8xH100 | 185 |
| DAPO | Qwen/Qwen3-4B-Base | FSDP2 | 0.6 | 0.254 | 8xH100 | 110 |
| DAPO | Qwen/Qwen3-4B-Base | Megatron (tp=4, pp=2) | 0.633 | 0.246 | 8xH100 | 210 |
| DAPO + LoRA (rank 32, alpha 64) | Qwen/Qwen3-4B-Base | Megatron (tp=4, pp=1) | 0.566 | 0.209 | 8xH100 | 160 |

SkyRL-SQL Recipes

For more details, please refer to SkyRL-SQL Recipe.

We provide two reference runs, single-turn and multi-turn training, for Qwen/Qwen2.5-Coder-7B-Instruct (each run on one 8xH100 node until convergence), with the WandB report here.

The evaluation results are shown below (using the evaluation code here). Parenthesized values are the multi-turn model's improvements, in percentage points, over the single-turn model at the same number of eval turns:

| Eval Turns (Train) | Training Method | Spider-Dev | Spider-Test | Spider-Realistic | Spider-DK | Spider-Syn | Avg | WandB |
|---|---|---|---|---|---|---|---|---|
| 1 | Single-Turn | 81.2 | 83.8 | 76.8 | 67.9 | 70.1 | 76.0 | Link |
| 1 | Multi-Turn | 82.4 (+1.2%) | 83.7 (-0.1%) | 80.3 (+3.5%) | 70.5 (+2.6%) | 71.2 (+1.1%) | 77.6 (+1.6%) | Link |
| 5 | Single-Turn | 79.5 | 82.2 | 77.6 | 65.6 | 68.4 | 74.7 | Link |
| 5 | Multi-Turn | 83.9 (+4.4%) | 85.2 (+3.0%) | 81.1 (+3.5%) | 72.0 (+6.4%) | 73.7 (+5.3%) | 79.2 (+4.5%) | Link |

SearchR1 Recipes

For more details, please refer to SearchR1 Recipe.

The WandB report is available here.
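
Conceptually, each SearchR1 rollout interleaves generation with retrieval for a bounded number of turns (the turn budgets shown in the tables below). Here is a minimal sketch of that loop, where `generate` and `retrieve` are hypothetical stand-ins for the policy's generation call and the search tool (the actual SkyRL generator and environment interfaces differ):

```python
def search_rollout(prompt: str, generate, retrieve, max_turns: int = 3) -> str:
    """Interleave generation with retrieval for up to `max_turns` turns."""
    history = prompt
    for _ in range(max_turns):
        response = generate(history)
        history += response
        if "<answer>" in response:
            break  # the model committed to a final answer
        if "<search>" not in response:
            break  # neither a search call nor an answer: stop early
        query = response.split("<search>")[-1].split("</search>")[0]
        # Feed retrieved passages back as the observation for the next turn.
        history += f"<information>{retrieve(query)}</information>"
    return history
```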

Qwen/Qwen2.5-3B-Instruct

The evaluation results are shown below for Qwen/Qwen2.5-3B-Instruct, with all experiments run on one 8xH100 node until convergence (330 training steps). In this table and the next, † marks general (single-hop) QA datasets and * marks multi-hop QA datasets.

| Dataset | Search-R1 (3 turns) | SkyRL + SearchR1 (2 turns) | SkyRL + SearchR1 (3 turns) | SkyRL + SearchR1 (4 turns) |
|---|---|---|---|---|
| NQ† | 0.397 | 0.455 | 0.449 | 0.449 |
| TriviaQA† | 0.565 | 0.613 | 0.616 | 0.611 |
| PopQA† | 0.391 | 0.447 | 0.444 | 0.435 |
| HotpotQA* | 0.331 | 0.334 | 0.417 | 0.407 |
| 2wiki* | 0.310 | 0.313 | 0.396 | 0.403 |
| Musique* | 0.124 | 0.086 | 0.179 | 0.163 |
| Bamboogle* | 0.232 | 0.242 | 0.448 | 0.352 |
| Average | 0.336 | 0.356 | 0.421 | 0.403 |

Qwen/Qwen3-30B-A3B

Evaluation results for Qwen3-30B-A3B on SearchR1 are shown below, with experiments run on four 8xH100 nodes using the Megatron backend. These results can be reproduced with commit 9b878cd.

| Dataset | SkyRL + SearchR1 (4 turns) |
|---|---|
| NQ† | 0.463 |
| TriviaQA† | 0.664 |
| PopQA† | 0.448 |
| HotpotQA* | 0.412 |
| 2wiki* | 0.361 |
| Musique* | 0.178 |
| Bamboogle* | 0.488 |
| Average | 0.457 |
