# Megatron Backend for 5D Parallelism
SkyRL supports NVIDIA's Megatron-Core library as an RL training backend, inheriting its support for 5D parallelism (tensor + sequence, pipeline, context, expert, and data parallelism) and its optimized performance for large-scale models.
We provide example scripts for efficient large-scale MoE training with models like Qwen3-30B-A3B using Megatron in the `examples/megatron` directory.
For details on configuring the Megatron backend and enabling checkpointing, see the Megatron configuration guide and the Megatron checkpointing guide.
## When to use the Megatron backend
SkyRL supports efficient data-parallel training with the FSDP backend, including Ulysses sequence parallelism for long-context training. The Megatron backend lets you stack additional parallelism strategies (TP, PP, EP) on top of data and sequence/context parallelism. This helps both with fitting larger models into memory and with training throughput for MoE models (via EP). The Megatron backend is therefore useful for efficient training of smaller MoE models like Qwen3-30B-A3B as well as for large-scale training with large models such as Qwen3-235B-A22B and/or large datasets. For resources on understanding the different parallelism strategies, see the [Parallelism Resources](#parallelism-resources) section below.
### Comparison to FSDP
We show performance comparisons between the Megatron and FSDP2 backends on the Search-R1 task (4K max context length) for various model sizes in the table below. Training speed for small-scale dense models with Megatron is similar to the FSDP2 backend, and Megatron enables high-throughput training for larger MoE models where FSDP is no longer feasible as the only parallelism strategy.
| Model | Backend | Compute | Policy Training Time (s) | Forward Pass Time (s) | Avg Num Tokens |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | Megatron | 8xH100 | 48 | 39 | 658 |
| Qwen2.5-3B-Instruct | FSDP2 | 8xH100 | 42 | 28 | 658 |
| Qwen2.5-7B-Instruct | Megatron | 8xH100 | 93 | 46 | 819 |
| Qwen2.5-7B-Instruct | FSDP2 | 8xH100 | 100 | 33 | 834 |
| Qwen3-30B-A3B | Megatron | 4x8xH100 | 189 | 145 | 1158 |
For all experiments, we used a train batch size of 512. For Qwen2.5 3B and 7B, Megatron was configured with data parallelism only (DP=8). For Qwen3-30B-A3B, Megatron was configured with DP=4, TP=2, and EP=8, and the FSDP2 backend was unable to complete a training step due to memory constraints. All statistics shown are averaged over the first 10 steps of training. Micro batch sizes were tuned to the maximum possible for each backend.
A script for running the Qwen3-30B-A3B experiment can be found here. Additionally, we provide a script for running basic GSM8K training on Qwen3-235B-A22B with Megatron here. Note that although training at 100B+ scale using Megatron is currently possible, we are in the process of further optimizing performance.
## Installation
### Setting up the Docker image
To get started, follow the instructions for installing via Docker in the Installation Guide, but use the `novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8-megatron` image instead of the default image.
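For a single node, pulling the image and starting an interactive container might look like the sketch below; the mount path, shared-memory size, and container entry command are placeholders, and the Installation Guide remains the reference for multi-node and Ray cluster setup.

```bash
# A minimal single-node sketch: pull the Megatron image and start an interactive container
# with GPU access. The mount path and --shm-size value are placeholders; see the Installation
# Guide for multi-node / Ray cluster setup.
docker pull novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8-megatron
docker run -it --gpus all --shm-size=16g \
  -v "$PWD":/workspace \
  novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8-megatron /bin/bash
```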
This ensures that the dependencies needed for Megatron are installed and do not need to be built on each node for each run, which can be time-consuming. Previously, we recommended setting `PYTHONPATH` to the location of the TransformerEngine installation, but this is no longer necessary.
In order to use flash attention with the Megatron backend, you must use `flash_attn` version `2.7.4.post1` or lower for compatibility with `TransformerEngine==2.7.0`.
This is handled in the `pyproject.toml` file for the `mcore` extra.
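If you build the environment yourself instead of using the provided image, a quick version check can catch a mismatch before launching a run:

```bash
# Optional sanity check (a sketch): print the installed versions and compare them against the
# constraint above (flash_attn <= 2.7.4.post1, TransformerEngine == 2.7.0).
python -c "import flash_attn, transformer_engine as te; print('flash_attn:', flash_attn.__version__); print('transformer_engine:', te.__version__)"
```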
## Configuration
We provide the following options for configuring the Megatron backend, exposing the underlying Megatron optimizer, DDP, and model config objects
so that advanced users can take full advantage of Megatron-Core's feature flags. For more details, see the [Megatron configuration guide](../configuration/config.mdx#megatron-configuration) section.
```yaml title="config/megatron/policy.yaml"
# @package megatron_config.policy
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
context_parallel_size: 1
expert_model_parallel_size: 1
expert_tensor_parallel_size: null
ddp_config: # pass-through config to Megatron's `DistributedDataParallelConfig` object
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/distributed/distributed_data_parallel_config.py#L8
...
optimizer_config_kwargs: # pass-through kwargs to Megatron's `OptimizerConfig` object
  # arguments that overlap with those we resolve from trainer.policy.optimizer_config will be overridden by the values here
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/optimizer/optimizer_config.py#L12
...
transformer_config_kwargs: # pass-through kwargs to Megatron's `TransformerConfig` object
# https://github.com/NVIDIA/Megatron-LM/blob/core_r0.13.0/megatron/core/transformer/transformer_config.py#L33
...
lora_config:
# see: https://docs.nvidia.com/nemo/megatron-bridge/0.2.0/apidocs/bridge/bridge.peft.lora.html for details - currently "lora" and "canonical_lora" are supported
lora_type: "lora"
# flag to manually empty torch's cuda cache between the forward/backward pass and the optimizer step
# this will free reserved but unallocated memory, and can help avoid OoMs in the optimizer
empty_cuda_cache: true
```
These default values can be overridden by passing in the corresponding arguments to `trainer.policy.megatron_config` in the launch script.
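For example, the parallelism sizes and a pass-through transformer kwarg could be overridden as in the sketch below. The entrypoint, the `trainer.strategy=megatron` flag, and the `mcore` extra follow the example scripts but are assumptions here; check your own launch script for the exact invocation, and note that keys not present in the defaults (such as new `transformer_config_kwargs` entries) typically need Hydra's `+` prefix.

```bash
# A sketch of launch-time overrides for trainer.policy.megatron_config. The entrypoint and the
# trainer.strategy flag are assumptions (they follow the examples/megatron scripts); the
# remaining model, data, and placement flags from your launch script are omitted here.
uv run --isolated --extra mcore -m skyrl_train.entrypoints.main_base \
  trainer.strategy=megatron \
  trainer.policy.megatron_config.tensor_model_parallel_size=2 \
  trainer.policy.megatron_config.pipeline_model_parallel_size=1 \
  trainer.policy.megatron_config.expert_model_parallel_size=8 \
  trainer.policy.megatron_config.empty_cuda_cache=true \
  +trainer.policy.megatron_config.transformer_config_kwargs.recompute_granularity=full
```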
## Parallelism Resources
Understanding and configuring parallelism strategies for large models can be challenging.
Helpful resources for understanding and tuning large-scale parallelism strategies include the [Hugging Face Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=finding_the_best_training_configuration),
[The Parallelism Mesh Zoo](https://blog.ezyang.com/2025/08/the-parallelism-mesh-zoo/), and [Visualizing 6-D Parallelism](https://main-horse.github.io/posts/visualizing-6d).
Below, we show a diagram of how all five parallelism strategies (tensor, pipeline, context, expert, and data parallelism) can be used in SkyRL, as well as how data is dispatched across these parallel groups.