SkyRL-SQL

We provide scripts to reproduce the results for SkyRL-SQL-7B using the FSDP and Megatron backends with SkyRL-Gym.

You can find a WandB run for both single-turn and multi-turn Text2SQL training at this link.

Pre-requisites

Make sure you have followed the installation commands in the Installation Guide.

Start Ray

Start Ray in your cluster by following the guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html.

Data Preparation

We provide the dataset we used on HuggingFace: https://huggingface.co/datasets/NovaSky-AI/SkyRL-SQL-653-data-newfmt. You can download it by running the following command:

hf download NovaSky-AI/SkyRL-SQL-653-data-newfmt --local-dir $HOME/data/sql --repo-type dataset

DB environment

Make sure to set up the database files needed for training. We use the database files from OmniSQL.

You can download the datasets from:

The datasets include BIRD, Spider, ScienceBenchmark, EHRSQL, Spider2-SQLite, Spider-DK, Spider-Realistic, Spider-Syn, and SynSQL-2.5M. In our training pipeline, we only need to access databases from SynSQL-2.5M and Spider.

Download data.zip, unzip it, and set the corresponding DB_PATH in the training script below. You can download and unzip the data by running:

hf download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir <download_dir>
unzip <download_dir>/data.zip
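
Before pointing DB_PATH at the extracted data, it can help to confirm that the archive actually produced database files. The following is a minimal sketch, not part of SkyRL: `count_sqlite_files` is a hypothetical helper, and it assumes the extracted databases use the `.sqlite` file extension, which you should verify against your own extracted directory.

```shell
# Hypothetical helper (not part of SkyRL): count database files under a directory.
# Assumption: the extracted databases end in ".sqlite"; adjust the pattern if yours differ.
count_sqlite_files() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  # Recursively list matching files and print only the count.
  find "$dir" -type f -name '*.sqlite' | wc -l | tr -d ' '
}
```

For example, `count_sqlite_files "$DB_PATH"` should print a non-zero number if DB_PATH points at the extracted databases; a result of 0 suggests the path or the extension assumption is wrong.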

Running the scripts

We provide a script, examples/train/text_to_sql/run_skyrl_sql.sh, for reproducing the results for SkyRL-SQL-7B. Make sure to replace the values of the DB_PATH and DATA_PATH variables in the script with your own paths.

export WANDB_API_KEY=<wandb-api-key>
bash examples/train/text_to_sql/run_skyrl_sql.sh
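
Because a wrong DB_PATH or DATA_PATH may only surface after the job has started, a small pre-flight check can save a failed launch. The sketch below is an illustration under stated assumptions: `require_dir` is a hypothetical helper, not something the SkyRL scripts provide, and it only checks that a directory exists.

```shell
# Hypothetical pre-flight helper (not part of SkyRL): fail early if a path is empty or missing.
require_dir() {
  name="$1"   # label used in messages, e.g. DATA_PATH
  path="$2"   # directory to check
  if [ -z "$path" ]; then
    echo "$name is not set" >&2
    return 1
  fi
  if [ ! -d "$path" ]; then
    echo "$name does not exist: $path" >&2
    return 1
  fi
  echo "$name OK: $path"
}
```

One way to use it is to run `require_dir DATA_PATH "$HOME/data/sql"` and `require_dir DB_PATH <your_db_dir>` with the same values you put in the script, and launch training only if both succeed.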