Dataset Preparation
This guide covers:
- The dataset format that SkyRL expects for training, and
- How to prepare and format a new dataset
Format Requirements
Each dataset entry is a dictionary with the following required (and some optional) values:
```python
data = {
    "data_source": data_source,    # String: name/identifier of the data source
    "prompt": [                    # List: conversation in OpenAI chat format
        {
            "role": "user",
            "content": question,
        }
    ],
    "env_class": env_class,        # String: environment class identifier
    "reward_spec": {
        "method": "rule",          # String: either "rule" or "reward_model"
        "ground_truth": solution,  # Expected solution
    },
    "extra_info": {                # Dict: optional additional metadata
        # ... add your own fields here
    },
}
```

SkyRL supports loading datasets in this format from a local parquet file, a local JSON file, or by Hugging Face dataset name (which SkyRL will download). The dataset is loaded as a Hugging Face `DatasetDict`.
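If you want to spot-check a prepared file before training, one option is to open it with the Hugging Face `datasets` library. This is only an inspection sketch (the file name is a placeholder), not how SkyRL loads data internally:

```python
# Sketch: inspect a prepared parquet file with the `datasets` library.
# "train.parquet" is a placeholder path for a file you have prepared.
from datasets import load_dataset

ds = load_dataset("parquet", data_files={"train": "train.parquet"})
sample = ds["train"][0]
print(sample["prompt"])       # OpenAI-style chat messages
print(sample["reward_spec"])  # {"method": "rule", "ground_truth": ...}
```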
Key Requirements:

- `data_source`: String identifier for the dataset origin (e.g., "gsm8k", "AIME24")
- `prompt`: List of dictionaries following the standard OpenAI chat format
- `env_class`: Name of the environment that the data sample corresponds to. This tells the Generator which environment to instantiate for this prompt.
  - Note: `env_class` can also be specified in the training configuration to apply to all dataset entries.
- `reward_spec`: Dictionary containing the reward specification for the entry (i.e., how rewards are obtained).
  - `method`: Must be either `"rule"` or `"reward_model"`
  - `ground_truth`: If `method` is `"rule"`, this is the expected solution.
- `extra_info`: Extensible dictionary for additional metadata; you can add custom fields as needed.
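Putting these requirements together, here is a minimal, hypothetical entry (all field values, including the `env_class` name, are illustrative; substitute your own):

```python
# A minimal illustrative entry; all values here are hypothetical.
entry = {
    "data_source": "gsm8k",
    "prompt": [
        {"role": "user", "content": "If 3 pencils cost $1.50, how much do 8 cost?"},
    ],
    "env_class": "gsm8k",  # could instead be set once in the training config
    "reward_spec": {
        "method": "rule",
        "ground_truth": "4.00",
    },
    "extra_info": {"split": "train", "index": 0},
}
```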
Data Preparation Scripts
We provide several example scripts to help you prepare your dataset, including for gsm8k, LiveCodeBench, SearchR1, and the SynSQL text-to-SQL dataset.
To use a new dataset for training, you can use the provided scripts as a template to create your own.
Generally, only a single function (`make_map_fn`) needs to be implemented to convert a new dataset into the required format. Below is an example that converts the SynSQL text-to-SQL dataset:
```python
def make_map_fn(split):
    def process_fn(example, idx):
        """Transform each dataset example into the required format."""
        if split == "train":
            user_content = ("{db_details}:" + example["schema"] +
                            ";\n {external_knowledge}: " + example["external_knowledge"] +
                            ";\n {question}: " + example["question"])
        else:
            user_content = ("{db_details}:" + example["schema"] +
                            "; {question}: " + example["question"])
        data = {
            "data_source": "synsql",
            "prompt": [
                {"role": "system", "content": short_system_prompt},
                {
                    "role": "user",
                    "content": user_content,
                },
            ],
            "env_class": "text2sql",
            "reward_spec": {
                "method": "rule",
                "ground_truth": example["sql"],
            },
            # Custom fields specific to the SynSQL dataset:
            "db_id": example["db_id"],
            "data": example["data"],
        }
        return data

    return process_fn
```

The mapping function is then called on each sample in the dataset, and the final converted dataset is saved to a parquet file:
```python
train_dataset = input_dataset.map(function=make_map_fn("train"), with_indices=True)
train_dataset.to_parquet(os.path.join(args.output, "train.parquet"))
```

Note, however, that SkyRL can also load datasets from a local JSON file or by Hugging Face dataset name.
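As a quick sanity check before training, you can reload the parquet file and confirm the required top-level fields are present. This is a sketch using pandas, not part of SkyRL, and it reuses `args.output` from the preparation script above:

```python
# Sketch: verify the converted dataset has the required top-level fields.
import os
import pandas as pd

df = pd.read_parquet(os.path.join(args.output, "train.parquet"))
required = {"data_source", "prompt", "env_class", "reward_spec"}
missing = required - set(df.columns)
assert not missing, f"Missing required columns: {missing}"
print(df.iloc[0]["prompt"])  # first sample's chat messages
```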
Using the Dataset to Train
With your correctly formatted datasets, you can pass the dataset file paths to the training script:
```bash
# Dataset file paths
uv run -m skyrl_train.entrypoints.main_base \
  data.train_data="['path/to/train.parquet']" \
  data.val_data="['path/to/validation.parquet']" \
```

or specify Hugging Face dataset(s) prepared in the expected format:
```bash
# Hugging Face dataset
uv run -m skyrl_train.entrypoints.main_base \
  data.train_data="['username/my_dataset:train']" \
  data.val_data="['username/my_dataset:validation']" \
```

Reference Scripts
Use the example preparation scripts mentioned above (for gsm8k, LiveCodeBench, SearchR1, and SynSQL) as templates for preparing your own dataset.