SkyRL-SQL
We provide scripts to reproduce the results for SkyRL-SQL-7B using SkyRL-train and SkyRL-Gym.
You can find a WandB run for both single-turn and multi-turn Text2SQL training at this link.
Pre-requisites
Make sure you have followed the installation commands in the Installation Guide.
Start Ray
Start Ray in your cluster following the guide: https://docs.ray.io/en/latest/ray-core/starting-ray.html.
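For a small cluster, the steps in that guide come down to starting one head node and pointing each worker at it. The snippet below is a minimal sketch rather than a required setup: the function names `start_ray_head` and `start_ray_worker` are hypothetical wrappers introduced here for illustration, the head node address is a placeholder you supply, and port 6379 is assumed because it is Ray's default.

```shell
#!/usr/bin/env bash
# Hypothetical helper wrappers around the Ray CLI (not part of SkyRL).
# Nothing runs until one of the functions is called.

# Start the head node. Run this once, on the machine that coordinates the cluster.
start_ray_head() {
  local port="${1:-6379}"   # assumption: Ray's default port
  ray start --head --port="$port"
}

# Join a worker node to an already running head node.
# $1 is the head node's IP address, $2 is the optional port.
start_ray_worker() {
  local head_ip="$1"
  local port="${2:-6379}"
  ray start --address="${head_ip}:${port}"
}
```

Call `start_ray_head` on the head node and `start_ray_worker <head-node-ip>` on each worker. Running `ray status` afterwards should list the nodes that joined.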
Data Preparation
We provide the dataset we used on HuggingFace: https://huggingface.co/datasets/NovaSky-AI/SkyRL-SQL-653-data-newfmt. You can download the dataset by running the following command:
hf download NovaSky-AI/SkyRL-SQL-653-data-newfmt --local-dir $HOME/data/sql --repo-type dataset
DB environment
Make sure to set up the database files needed for training. We use the database files from OmniSQL.
You can download the datasets from the seeklhy/OmniSQL-datasets repository on HuggingFace.
The datasets include BIRD, Spider, ScienceBenchmark, EHRSQL, Spider2-SQLite, Spider-DK, Spider-Realistic, Spider-Syn, and SynSQL-2.5M. In our training pipeline, we only need to access databases from SynSQL-2.5M and Spider.
Download data.zip, unzip it, and set DB_PATH in the training script below to the location of the unzipped data. You can download and unzip the data by running:
hf download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir <path_to_dir>
unzip <path_to_dir>/data.zip
Running the scripts
We provide a script examples/text_to_sql/run_skyrl_sql.sh for reproducing the results for SkyRL-SQL-7B. Make sure to substitute the DB_PATH and DATA_PATH variables with your own.
export WANDB_API_KEY=<wandb-api-key>
bash examples/text_to_sql/run_skyrl_sql.sh
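Before launching, it can help to confirm that the two paths the script depends on are in place. The following is a minimal sketch and not part of the SkyRL repository: `check_sql_paths` is a hypothetical helper, and it assumes that both DB_PATH and DATA_PATH point to directories, as the download commands above suggest.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of SkyRL).
# Checks that both directories exist and are non-empty.
# $1 is the database directory (DB_PATH), $2 is the dataset directory (DATA_PATH).
check_sql_paths() {
  local db_path="$1"
  local data_path="$2"
  local dir
  for dir in "$db_path" "$data_path"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      return 1
    fi
    # An empty directory usually means a download or unzip step was skipped.
    if [ -z "$(ls -A "$dir")" ]; then
      echo "empty directory: $dir" >&2
      return 1
    fi
  done
  echo "ok"
}
```

With DB_PATH and DATA_PATH exported in your shell, `check_sql_paths "$DB_PATH" "$DATA_PATH" && bash examples/text_to_sql/run_skyrl_sql.sh` only starts training when both checks pass.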