Checkpointing and Logging

Checkpointing

SkyRL provides checkpointing features for resuming training from a previous state. Training state is saved at regular intervals, and checkpoint management is controlled through flexible configuration options.

Checkpointed state can be stored in the local file system or uploaded to cloud storage (S3, GCP).

What State is Saved

SkyRL saves several types of state to enable complete training resumption:

Model States

  • Policy Model: Model parameters, optimizer state, and learning rate scheduler state
  • Critic Model: Model parameters, optimizer state, and learning rate scheduler state (if critic is enabled)
  • Reference Model: Not checkpointed (recreated from policy model)

Training State

  • Global Step: Current training step counter
  • Configuration: Complete training configuration used
  • Dataloader State: Current position in dataset iteration (enables resuming from exact data position)

Directory Structure

The checkpointing directory structure depends on the training backend used.

FSDP Checkpointing

FSDP checkpoints are organized according to the following directory hierarchy:

{ckpt_path}/
├── latest_ckpt_global_step.txt          # Holds the global step of the latest checkpoint
├── global_step_10/                      # Checkpoint at training step 10
│   ├── policy/                          # Policy model checkpoint directory
│   │   ├── fsdp_config.json             # Stores FSDP version and world size
│   │   ├── huggingface/                 # HuggingFace config and tokenizer
│   │   │   ├── config.json              # Model config
│   │   │   ├── tokenizer_config.json    # Tokenizer config
│   │   │   ├── generation_config.json   # Generation config
│   │   │   └── ...                      # Other tokenizer config files
│   │   ├── model_state.pt               # Model parameters
│   │   ├── optimizer_state.pt           # Optimizer state
│   │   └── lr_scheduler_state.pt        # Learning rate scheduler state
│   ├── critic/                          # Critic model checkpoint (if enabled)
│   │   ├── fsdp_config.json
│   │   ├── huggingface/
│   │   ├── model_state.pt
│   │   ├── optimizer_state.pt
│   │   └── lr_scheduler_state.pt
│   ├── data.pt                          # Dataloader state
│   └── trainer_state.pt                 # High-level trainer state
├── global_step_20/                      # Checkpoint at training step 20
│   └── ...
└── global_step_30/                      # Checkpoint at training step 30
    └── ...

Megatron Checkpointing

Megatron checkpoints are written with Megatron's dist_checkpointing library, which performs checkpointing in parallel across ranks. This also makes it possible to reload a checkpoint under a different parallelism scheme than the one it was saved with.

{ckpt_path}/
├── latest_ckpt_global_step.txt          # Holds the global step of the latest checkpoint
├── global_step_10/                      # Checkpoint at training step 10
│   ├── policy/                          # Policy model checkpoint directory
│   │   ├── metadata.json                # Megatron checkpoint metadata
│   │   ├── huggingface/                 # HuggingFace config and tokenizer
│   │   ├── __0_0.distcp                 # Megatron checkpoint files
│   │   ├── __0_1.distcp                 
│   │   └── ...
├── global_step_20/                      # Checkpoint at training step 20
│   └── ...
└── global_step_30/                      # Checkpoint at training step 30
    └── ...

Key Configuration Parameters

Checkpointing behavior is controlled by several parameters in the YAML configuration (see the configuration guide for the full training config); each group of parameters below is followed by a short example sketch:

Checkpoint Saving

ckpt_interval

  • Default: 10
  • Purpose: Save checkpoints every N training steps

ckpt_path

  • Default: "${oc.env:HOME}/ckpts/"
  • Purpose: Base directory where all checkpoints are stored
  • Options:
    • Local file system path (e.g., /path/to/ckpts/)
    • Cloud storage path (S3, GCP), e.g., s3://path/to/ckpts/
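
A minimal sketch of the saving-related keys, assuming they appear at this level of the YAML training config (see the configuration guide for exact placement; the values and bucket name are illustrative):

ckpt_interval: 50                        # Save a checkpoint every 50 training steps
ckpt_path: "/data/skyrl/ckpts/"          # Local file system path
# ckpt_path: "s3://my-bucket/ckpts/"     # Or cloud storage (hypothetical bucket)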

Checkpoint Cleanup

max_ckpts_to_keep

  • Default: -1 (keep all checkpoints)
  • Purpose: Limit the number of stored checkpoints to save disk space
  • Options:
    • -1: Keep all checkpoints indefinitely
    • N (positive integer): Keep only the last N checkpoints; older ones are deleted automatically
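
For example, keeping only the five most recent checkpoints (a sketch, with the same placement assumption as above):

max_ckpts_to_keep: 5                     # Older global_step_N directories are deleted automatically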

Training Resumption

resume_mode

  • Default: "latest"
  • Purpose: Controls how training resumption works
  • Options:
    • "none" or null: Start training from scratch, ignoring existing checkpoints
    • "latest": Automatically resume from the most recent checkpoint
    • "from_path": Resume from a specific checkpoint (requires resume_path)

resume_path

  • Default: null
  • Purpose: Specific checkpoint directory to resume from (only used when resume_mode: "from_path")
  • Format: Must point to a global_step_N directory
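
A sketch of resuming from a specific checkpoint (the step number is hypothetical):

resume_mode: "from_path"
resume_path: "${oc.env:HOME}/ckpts/global_step_20/"   # Must be a global_step_N directory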

HuggingFace Model Export

In addition to checkpointing, users can optionally save the policy model in HuggingFace safetensors format at regular intervals; a short example sketch follows the parameters below.

Configuration Parameters:

hf_save_interval

  • Default: -1 (disabled)
  • Purpose: Save HuggingFace format policy models every N training steps

export_path

  • Default: "${oc.env:HOME}/exports/"
  • Purpose: Base directory where HuggingFace models and other artifacts are saved
  • Structure: Models are saved to {export_path}/global_step_{N}/policy/
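
A sketch of enabling periodic HuggingFace export (the interval and bucket name are hypothetical):

hf_save_interval: 100                    # Export the policy model every 100 steps
export_path: "s3://my-bucket/exports/"   # Models land in {export_path}/global_step_{N}/policy/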
