Dataset Preparation
This guide covers:
- The dataset format that SkyRL expects for training, and
- How to prepare and format a new dataset
Format Requirements
Each dataset entry is a dictionary with the following required (and some optional) values:
```python
data = {
    "data_source": data_source,    # String: name/identifier of the data source
    "prompt": [                    # List: conversation in OpenAI chat format
        {
            "role": "user",
            "content": question,
        }
    ],
    "env_class": env_class,        # String: environment class identifier
    "reward_spec": {
        "method": "rule",          # String: either "rule" or "reward_model"
        "ground_truth": solution,  # Expected solution
    },
    "extra_info": {                # Dict: optional additional metadata
        # ... add your own fields here
    },
}
```

SkyRL supports loading datasets in this format from a local parquet file, a local JSON file, or by Hugging Face dataset name (which SkyRL will download). The dataset is loaded as a Hugging Face `DatasetDict`.
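If you want to spot-check a prepared file before training, one option is to open it with the Hugging Face `datasets` library. This is only an inspection sketch (the file name is a placeholder), not how SkyRL loads data internally:

```python
# Sketch: inspect a prepared parquet file with the `datasets` library.
# "train.parquet" is a placeholder path for a file you have prepared.
from datasets import load_dataset

ds = load_dataset("parquet", data_files={"train": "train.parquet"})
sample = ds["train"][0]
print(sample["prompt"])       # OpenAI-style chat messages
print(sample["reward_spec"])  # {"method": "rule", "ground_truth": ...}
```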
Key Requirements:

- `data_source`: String identifier for the dataset origin (e.g., "gsm8k", "AIME24")
- `prompt`: List of dictionaries following the standard OpenAI chat format
- `env_class`: Name of the environment that the data sample corresponds to. This tells the Generator which environment to instantiate for this prompt.
  - Note: `env_class` can also be specified in the training configuration to apply to all dataset entries.
- `reward_spec`: Dictionary containing the reward specification for the entry (i.e., how rewards are obtained).
  - `method`: Must be either `"rule"` or `"reward_model"`
  - `ground_truth`: If `method` is `"rule"`, this is the expected solution.
- `extra_info`: Extensible dictionary for additional metadata; you can add custom fields as needed.
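Putting these requirements together, here is a minimal, hypothetical entry (all field values, including the `env_class` name, are illustrative; substitute your own):

```python
# A minimal illustrative entry; all values here are hypothetical.
entry = {
    "data_source": "gsm8k",
    "prompt": [
        {"role": "user", "content": "If 3 pencils cost $1.50, how much do 8 cost?"},
    ],
    "env_class": "gsm8k",  # could instead be set once in the training config
    "reward_spec": {
        "method": "rule",
        "ground_truth": "4.00",
    },
    "extra_info": {"split": "train", "index": 0},
}
```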
Data Preparation Scripts
We provide several example scripts to help you prepare your dataset, including for gsm8k, LiveCodeBench, SearchR1, and the SynSQL text-to-SQL dataset.
To use a new dataset for training, you can use the provided scripts as a template to create your own.
Generally, only a single function (`make_map_fn`) needs to be implemented to convert a new dataset into the required format. Below is an example that converts the SynSQL text-to-SQL dataset:
```python
def make_map_fn(split):
    def process_fn(example, idx):
        """Transform each dataset example into the required format."""
        if split == "train":
            user_content = ("{db_details}:" + example["schema"] +
                            ";\n {external_knowledge}: " + example["external_knowledge"] +
                            ";\n {question}: " + example["question"])
        else:
            user_content = ("{db_details}:" + example["schema"] +
                            "; {question}: " + example["question"])
        data = {
            "data_source": "synsql",
            "prompt": [
                {"role": "system", "content": short_system_prompt},
                {
                    "role": "user",
                    "content": user_content,
                },
            ],
            "env_class": "text2sql",
            "reward_spec": {
                "method": "rule",
                "ground_truth": example["sql"],
            },
            # Custom fields specific to the SynSQL dataset:
            "db_id": example["db_id"],
            "data": example["data"],
        }
        return data

    return process_fn
```

The mapping function is then called on each sample in the dataset, and the final converted dataset is saved to a parquet file:
```python
train_dataset = input_dataset.map(function=make_map_fn("train"), with_indices=True)
train_dataset.to_parquet(os.path.join(args.output, "train.parquet"))
```

Note, however, that SkyRL can also load datasets from a local JSON file or by Hugging Face dataset name.
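As a quick sanity check before training, you can reload the parquet file and confirm the required top-level fields are present. This is a sketch using pandas, not part of SkyRL, and it reuses `args.output` from the preparation script above:

```python
# Sketch: verify the converted dataset has the required top-level fields.
import os
import pandas as pd

df = pd.read_parquet(os.path.join(args.output, "train.parquet"))
required = {"data_source", "prompt", "env_class", "reward_spec"}
missing = required - set(df.columns)
assert not missing, f"Missing required columns: {missing}"
print(df.iloc[0]["prompt"])  # first sample's chat messages
```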
Using the Dataset to Train
With your correctly formatted datasets, you can pass the dataset file paths to the training script:
```bash
# Dataset file paths
uv run -m skyrl_train.entrypoints.main_base \
  data.train_data="['path/to/train.parquet']" \
  data.val_data="['path/to/validation.parquet']" \
```

or specify Hugging Face dataset(s) prepared in the expected format:
```bash
# Hugging Face dataset
uv run -m skyrl_train.entrypoints.main_base \
  data.train_data="['username/my_dataset:train']" \
  data.val_data="['username/my_dataset:validation']" \
```

Reference Scripts
Use the example preparation scripts mentioned above (for gsm8k, LiveCodeBench, SearchR1, and SynSQL) as templates for preparing your own dataset.