Multi-Turn GRPO Text2SQL with Async Rollouts
In this example, we walk through how to train an effective multi-turn Text2SQL model (beating GPT-4o) with SkyRL using async rollouts.
We provide an implementation of a multi-turn Text2SQL environment at :skyrl_gym_link:skyrl_gym/envs/sql/env.py.
You can find the exact recipe for reproducing our prior SkyRL-SQL-7B release at SkyRL-SQL Recipe.
Task Overview
In this task, the agent is given a natural language question and a database schema, and is tasked with generating a SQL query to answer the question. An abbreviated example prompt is shown below:
You are a data science expert. Below, you are provided with a database schema and a natural language question. Your task is to understand the schema and generate a valid SQL query to answer the question within limited turns. You should breakdown the problem, draft your reasoning process, and generate the solution.
Database Engine:
SQLite
Database Schema:
{db_details}
This schema describes the database's structure, including tables, columns, primary keys, foreign keys, and any relevant relationships or constraints.
External Knowledge:
{external_knowledge}
Question:
{question}
Instructions:
- Make sure you only output the information that is asked in the question. If the question asks for a specific column, make sure to only include that column in the SELECT clause, nothing more.
- The generated query should return all of the information asked in the question without any missing or extra information.
- Before generating the final SQL query, please think through the steps of how to write the query. It should include detailed considerations such as analyzing questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, thinking of how to call SQL tools, and revisiting previous steps.
Format:
- Conduct thinking inside <think>...</think> blocks every time you get new observation or information.
- You can use SQL tool written within a single <sql>your sql</sql> block to explore or verify. SQL tool output will be shown as dataframe inside <observation>...</observation>. Based on this observation, you can think again and refine.
- The returned dataframe will be truncated in 50 rows if observation is too long.
- If you find no further exploration is needed or reaches max turns, you MUST directly provide the final SQL query solution inside <solution>...</solution>.
...
{db_details}:
CREATE TABLE country_aliases (
alias_id INTEGER,
country_id INTEGER,
alias_name TEXT,
alias_type TEXT,
PRIMARY KEY (alias_id),
FOREIGN KEY (country_id) REFERENCES countries (country_id)
);
...
{question}:
Could you please provide me with the English name, current population, average population over the last five years, and population from last year for each country, based on the country's population metrics? I need this information to analyze population trends.The agent is given a maximum of 6 turns to generate a valid SQL query with the correct formatting, with negative rewards being assigned for incorrect formatting, 0 rewards for correct formatting but incorrect SQL, and a positive reward for correct SQL.
Data Preparation
First, follow the data preparation instructions (see \<skyrl-sql-data\> section) to download the dataset, as well as the database files (note that
the database files are quite large - around 50GB in total). We reproduce the commands here:
hf download NovaSky-AI/SkyRL-SQL-653-data-newfmt --local-dir $HOME/data/sql --repo-type dataset
hf download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir <path_to_file.zip>
unzip <path_to_file.zip>Training Configuration
Now that we have our dataset and database files, let's walk through the some of the key training configurations.
# path for dataset (.parquet files) containing the prompts and metadata for each question
DATA_DIR="$HOME/data/sql"
# path for .db files for environment interaction
DB_PATH="$HOME/path/to/db_files"
uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
#### Environment configuration
environment.env_class=text2sql \
environment.skyrl_gym.text2sql.db_path=$DB_PATH \
#### Multi-turn Async Rollouts configuration
# this is used to set the max turns for the environment
generator.max_turns=6 \
# we need to make sure to set async_engine=true for async rollouts
generator.async_engine=true \
# we need to make sure to set batched=false for async rollouts
generator.batched=false \
#### context length related configurations
# trainer.max_prompt_length is the max length of the initial prompt
trainer.max_prompt_length=6000 \
# generator.max_input_length is the max length of the input to the model after any number of turns (including the initial prompt)
generator.max_input_length=29000 \
# generator.sampling_params.max_generate_length is the max length of the generated response for EACH turn
generator.sampling_params.max_generate_length=3000 \
#### multi-turn generation format - see `skyrl_train/generators/skyrl_gym_generator.py` for more details
generator.use_conversation_multi_turn=false \
#### data configuration
data.train_data="['$DATA_DIR/train.parquet']" \
data.val_data="['$DATA_DIR/validation.parquet']" \
#### Placement configuration - note since we set use_kl_loss=false below, we don't need to use a ref model
# policy placement configuration
trainer.policy.model.path="Qwen/Qwen2.5-Coder-7B-Instruct" \
trainer.placement.colocate_all=true \
trainer.placement.policy_num_gpus_per_node=8 \
# inference engine placement configuration
generator.num_inference_engines=2 \
generator.inference_engine_tensor_parallel_size=4 \
#### algorithm configuration
trainer.epochs=30 \
trainer.algorithm.advantage_estimator="grpo" \
trainer.algorithm.use_kl_loss=false \
trainer.algorithm.use_kl_loss=false \
generator.n_samples_per_prompt=5 \
#### generation sampling params (relevant to algorithm correctness)
generator.sampling_params.temperature=0.6 \
generator.sampling_params.top_p=0.95 \
generator.sampling_params.stop='["</sql>", "</solution>"]' \
generator.eval_sampling_params.stop='["</sql>", "</solution>"]' \
#### training configuration
trainer.policy.optimizer_config.lr=1.0e-6 \
trainer.train_batch_size=256 \
trainer.policy_mini_batch_size=256 \
trainer.micro_forward_batch_size_per_gpu=8 \
trainer.micro_train_batch_size_per_gpu=1 \
trainer.eval_batch_size=1024 \
trainer.eval_before_train=true \
trainer.eval_interval=5 \
... # Other parameters (see `examples/text_to_sql/run_skyrl_sql.sh` for the full script)-
All we have to do to enable multi-turn training with async rollouts is to simply set
generator.max_turnsto the maximum number of turns we want the agent to take, and to make suregenerator.async_engine=trueandgenerator.batched=false. -
Chat templating and loss masking for multi-turn conversations are handled by the
SkyRLGymGeneratorclass.- In the above example, we set
use_conversation_multi_turn=falseto enforce that the multi-turn conversation is formatted as a single assistant response. - We also set
stop='["</sql>", "</solution>"]'for bothsampling_paramsandeval_sampling_paramsas a part of the training recipe. - If you are using
generator.use_conversation_multi_turn=true, you might want to append an EOS token ID to the end of the response after these stop strings to adhere to the model's behavior (i.e. ending generation with an EOS token ID rather than say</solution>). This can be done by settinggenerator.append_eos_token_after_stop_str_in_multi_turn=truein the generator config. The full script is available inexamples/text_to_sql/run_skyrl_sql_conversation_format.sh. - If you want to use a conversation-based format, you can set
use_conversation_multi_turn=trueand the model will generate a separate assistant response for each turn. This is supported only withbackend="vllm"as of now. - See
skyrl_train/generators/skyrl_gym_generator.pyfor more details on both options!
- In the above example, we set
Launching Your Training Run
Let's get our training run started! Make sure to set your WandB API key for logging, and that your database and dataset paths are correctly set.
export WANDB_API_KEY=your_wandb_api_key
bash examples/text_to_sql/run_skyrl_sql.shAnd now watch your model start to learn to generate better SQL queries!
What's Next?
Now that you've seen what's possible with multi-turn training with async rollouts, you might want to start building your own multi-turn environments!
- Creating a New Environment: Learn how to build your own multi-turn environments!
- Search Example: Learn how to train a multi-turn search agent with SkyRL!