# Checkpointing
SkyRL provides checkpointing features to resume training from a previous state. Training state is saved at regular intervals, and SkyRL offers flexible configuration options for checkpoint management.
Checkpointed state can be stored in the local file system or uploaded to cloud storage (S3, GCP).
## What State is Saved
SkyRL saves several types of state to enable complete training resumption:
### Model States
- Policy Model: Model parameters, optimizer state, and learning rate scheduler state
- Critic Model: Model parameters, optimizer state, and learning rate scheduler state (if critic is enabled)
- Reference Model: Not checkpointed (recreated from policy model)
### Training State
- Global Step: Current training step counter
- Configuration: Complete training configuration used
- Dataloader State: Current position in dataset iteration (enables resuming from exact data position)
## Directory Structure
The checkpointing directory structure depends on the training backend used.
### FSDP Checkpointing
FSDP checkpoints are organized according to the following directory hierarchy:
```text
{ckpt_path}/
├── latest_ckpt_global_step.txt       # Holds the global step of the latest checkpoint
├── global_step_10/                   # Checkpoint at training step 10
│   ├── policy/                       # Policy model checkpoint directory
│   │   ├── fsdp_config.json          # Stores FSDP version and world size
│   │   ├── huggingface/              # HuggingFace config and tokenizer
│   │   │   ├── config.json               # Model config
│   │   │   ├── tokenizer_config.json     # Tokenizer config
│   │   │   ├── generation_config.json    # Generation config
│   │   │   └── ...                       # Other tokenizer config files
│   │   ├── model_state.pt            # Model parameters
│   │   ├── optimizer_state.pt        # Optimizer state
│   │   └── lr_scheduler_state.pt     # Learning rate scheduler state
│   ├── critic/                       # Critic model checkpoint (if enabled)
│   │   ├── fsdp_config.json
│   │   ├── huggingface/
│   │   ├── model_state.pt
│   │   ├── optimizer_state.pt
│   │   └── lr_scheduler_state.pt
│   ├── data.pt                       # Dataloader state
│   └── trainer_state.pt              # High-level trainer state
├── global_step_20/                   # Checkpoint at training step 20
│   └── ...
└── global_step_30/                   # Checkpoint at training step 30
    └── ...
```

### Megatron Checkpointing
Megatron checkpoints are handled by Megatron's `dist_checkpointing` library, which performs checkpointing in parallel across ranks. This comes with support for reloading a checkpoint under a different parallelism scheme than the one it was saved with.
```text
{ckpt_path}/
├── latest_ckpt_global_step.txt       # Holds the global step of the latest checkpoint
├── global_step_10/                   # Checkpoint at training step 10
│   └── policy/                       # Policy model checkpoint directory
│       ├── metadata.json             # Megatron checkpoint metadata
│       ├── huggingface/              # HuggingFace config and tokenizer
│       ├── __0_0.distcp              # Megatron checkpoint files
│       ├── __0_1.distcp
│       └── ...
├── global_step_20/                   # Checkpoint at training step 20
│   └── ...
└── global_step_30/                   # Checkpoint at training step 30
    └── ...
```
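Because the checkpoint layout is parallelism-agnostic, a run saved under one parallelism configuration can be resumed under another. The sketch below illustrates the idea; the `megatron_config` key names are illustrative assumptions rather than confirmed SkyRL keys, so consult the configuration guide for the actual paths.

```yaml
# Hypothetical sketch: resume a checkpoint saved with tensor parallel
# size 2 under tensor parallel size 4. The megatron_config key names
# below are assumptions for illustration only.
trainer:
  resume_mode: "latest"
  policy:
    megatron_config:
      tensor_model_parallel_size: 4   # was 2 when the checkpoint was saved
```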
## Key Configuration Parameters

Checkpointing behavior is controlled by several parameters in the YAML configuration (see the configuration guide for the full training config):

### Checkpoint Saving

**`ckpt_interval`**

- Default: `10`
- Purpose: Save checkpoints every N training steps

**`ckpt_path`**

- Default: `"${oc.env:HOME}/ckpts/"`
- Purpose: Base directory where all checkpoints are stored (see the example below)
- Options:
  - Local file system path (e.g., `/path/to/ckpts/`)
  - Cloud storage path (S3, GCP) (e.g., `s3://path/to/ckpts/`)
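As a concrete illustration, the snippet below sets both parameters; the top-level `trainer` nesting is an assumption here, so consult the configuration guide for the exact key paths.

```yaml
# Minimal sketch; the `trainer` nesting is an assumption.
trainer:
  ckpt_interval: 10                    # save a checkpoint every 10 training steps
  ckpt_path: "${oc.env:HOME}/ckpts/"   # a cloud path such as s3://... also works
```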
### Checkpoint Cleanup

**`max_ckpts_to_keep`**

- Default: `-1` (keep all checkpoints)
- Purpose: Limit the number of stored checkpoints to save disk space (see the example below)
- Options:
  - `-1`: Keep all checkpoints indefinitely
  - `N` (positive integer): Keep only the last N checkpoints and automatically delete older ones
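For example, to keep only the five most recent checkpoints (same assumption about the `trainer` nesting as above):

```yaml
# Minimal sketch; the `trainer` nesting is an assumption.
trainer:
  max_ckpts_to_keep: 5   # retain the 5 newest global_step_N directories; older ones are deleted
```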
### Training Resumption

**`resume_mode`**

- Default: `"latest"`
- Purpose: Controls how training resumption works (see the example below)
- Options:
  - `"none"` or `null`: Start training from scratch, ignoring existing checkpoints
  - `"latest"`: Automatically resume from the most recent checkpoint
  - `"from_path"`: Resume from a specific checkpoint (requires `resume_path`)

**`resume_path`**

- Default: `null`
- Purpose: Specific checkpoint directory to resume from (only used when `resume_mode: "from_path"`)
- Format: Must point to a `global_step_N` directory
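For example, to restart from a specific step rather than the latest checkpoint (the path is a placeholder; `trainer` nesting assumed as above):

```yaml
# Minimal sketch; the `trainer` nesting is an assumption.
trainer:
  resume_mode: "from_path"
  resume_path: "${oc.env:HOME}/ckpts/global_step_20"   # must be a global_step_N directory
```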
## HuggingFace Model Export

In addition to checkpointing, users can optionally save the policy model in HuggingFace safetensors format at regular intervals.

Configuration Parameters:

**`hf_save_interval`**

- Default: `-1` (disabled)
- Purpose: Save the policy model in HuggingFace format every N training steps

**`export_path`**

- Default: `"${oc.env:HOME}/exports/"`
- Purpose: Base directory where HuggingFace models and other artifacts are saved
- Structure: Models are saved to `{export_path}/global_step_{N}/policy/` (see the example below)
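For example, to export the policy in HuggingFace format every 50 steps (`trainer` nesting assumed as above):

```yaml
# Minimal sketch; the `trainer` nesting is an assumption.
trainer:
  hf_save_interval: 50                     # -1 (the default) disables export
  export_path: "${oc.env:HOME}/exports/"   # models land in {export_path}/global_step_{N}/policy/
```

Since the export is a standard safetensors model directory with its config and tokenizer files, it should be loadable directly with the usual `from_pretrained` entry points in `transformers`.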