Creating a New Environment or Task
To demonstrate how to create custom environments in SkyRL-Gym and train with SkyRL, let's build a simple multiplication environment!
We'll walk through the complete process: implementing the environment, registering it, preparing training data, and running your first training session.
What we're building: An environment that asks the model to multiply numbers and checks if the answer is correct. The completed code is available at examples/multiply.
Environment Interface
SkyRL-Gym includes a simple text-in/text-out environment interface for LLM tasks, BaseTextEnv, which looks like this:
Base text environment interface (from skyrl_gym/envs/base_text_env.py):
class BaseTextEnv(Env[ConversationType, str]):
    def step(self, action: str) -> BaseTextEnvStepOutput:
        """
        Runs one environment step.

        Args:
            action: The LLM's response as a string

        Returns:
            BaseTextEnvStepOutput containing:
            - observations: New messages from the environment
            - reward: Float reward for the action
            - done: Whether the episode is finished
            - metadata: Additional info (optional)
        """
        pass

    def init(self, prompt: ConversationType) -> Tuple[ConversationType, Dict[str, Any]]:
        return prompt, {}

    def close(self):
        pass

This class inherits from Env, which is a generic environment interface (i.e., not specific to text-based tasks). An API reference for Env can be found in Environment API.
For our multiplication environment, we only need to implement the step method above because we don't have any initialization or cleanup to do.
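In other words, a minimal custom environment boils down to a shape like the following (a sketch with illustrative names, not part of the SkyRL-Gym codebase):

class MyEnv(BaseTextEnv):
    # A custom text environment only has to override step().
    def step(self, action: str) -> BaseTextEnvStepOutput:
        # Inspect the model's response, then decide on observations, reward, and termination.
        return BaseTextEnvStepOutput(observations=[], reward=0.0, done=True, metadata={})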
Simple Single-Turn Environment
Let's start with a basic version that gives the model only one chance to get the answer right.
The prompt and response format expected by the multiply environment is as follows:
- The model prompt will be a multiplication problem of 2 n-digit numbers, such as "123 * 456" or "999 * 999".
- The model output should be in the format \\boxed{answer}, where answer is the product of the two numbers.
So, the environment step must simply parse the answer out of \\boxed{answer} and check if it matches the ground truth.
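For example, the same regex that the environment uses below can pull the answer out of a sample response (the response text here is purely illustrative):

import re

# Illustrative model response; the environment only cares about the \boxed{...} part.
response = "The product of 123 and 456 is \\boxed{56088}."
match = re.search(r"\\boxed\{([^}]+)\}", response)
print(match.group(1))  # "56088"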
Simple multiplication environment:
import re

class MultiplyEnv(BaseTextEnv):
    def _parse_action(self, action: str) -> str:
        """Extract answer from \\boxed{answer} format"""
        match = re.search(r"\\boxed\{([^}]+)\}", action)
        return match.group(1) if match else None

    def step(self, action: str) -> BaseTextEnvStepOutput:
        answer = self._parse_action(action)
        is_correct = answer is not None and answer.strip() == str(self.ground_truth).strip()

        return BaseTextEnvStepOutput(
            observations=[],
            reward=1.0 if is_correct else 0.0,
            done=True,
            metadata={"parsed_answer": answer}
        )

That's it! The environment checks if the model's answer matches the ground truth and gives a reward of 1.0 for correct answers, 0.0 for incorrect ones.
Multi-Turn Environment
Want to give the model multiple attempts? Let's extend our environment to allow multiple turns.
We will make a few simple extensions to our step() method:

- Keep track of the number of turns (self.turns) and indicate the trajectory is done after a configured maximum number of turns (self.max_turns).
- If the turns expire or the model provides a correct answer, we indicate the trajectory is done and return a reward as follows:
  - Correct answer: 1.0.
  - Incorrect answer, but in the format \\boxed{...}: 0.5.
  - Incorrect answer, and not in the format \\boxed{...}: 0.0.
- If the model is incorrect and has more turns remaining, we also provide feedback as a new observation.
Multi-turn multiplication environment in examples/multiply/env.py:
def step(self, action: str) -> BaseTextEnvStepOutput:
    self.turns += 1
    answer = self._parse_action(action)
    is_correct = answer is not None and answer.strip() == str(self.ground_truth).strip()
    found_boxed = answer is not None

    # Episode ends if max turns reached or correct answer found
    done = self.turns >= self.max_turns or is_correct

    # Reward structure:
    # - Correct answer: 1.0
    # - Wrong answer in correct format: 0.5
    # - No boxed answer: 0.0
    if is_correct:
        reward = 1.0
    elif found_boxed:
        reward = 0.5
    else:
        reward = 0.0

    if done:
        return BaseTextEnvStepOutput(
            observations=[],
            reward=reward,
            done=True,
            metadata={"parsed_answer": answer}
        )

    # Give feedback for another attempt
    if answer is not None:
        feedback = f"Your answer '{answer}' is incorrect. Please try again."
    else:
        feedback = "Please provide your answer in the format \\boxed{your_answer}."

    return BaseTextEnvStepOutput(
        observations=[{"role": "user", "content": feedback}],
        reward=0.0,
        done=False,
        metadata={"parsed_answer": answer}
    )

The multi-turn version gives partial credit for formatting the answer correctly, even if it's wrong. This helps the model learn the expected output format.
The final implementation is available in examples/multiply/env.py.
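The step() methods above reference self.ground_truth, self.max_turns, and self.turns, which are set up in the environment's constructor. As a rough sketch of how that initialization might look (the constructor signature and the layout of the extras dict are assumptions here; see examples/multiply/env.py for the actual code):

from typing import Any, Dict, Optional

class MultiplyEnv(BaseTextEnv):
    # Hypothetical constructor sketch. We assume the dataset's reward_spec (see
    # "Preparing Training Data" below) reaches the environment via an extras dict,
    # and that max_turns comes from the environment config.
    def __init__(self, env_config: Optional[Dict[str, Any]] = None, extras: Optional[Dict[str, Any]] = None):
        super().__init__()
        extras = extras or {}
        self.ground_truth = extras.get("reward_spec", {}).get("ground_truth")
        self.max_turns = (env_config or {}).get("max_turns", 5)  # illustrative default
        self.turns = 0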
(Turn-level) Rewards And Metrics
In the example above, unless done=True, the reward is 0.0. That is, the model only receives a single reward for the entire trajectory.
You can experiment with turn-level rewards by returning a non-zero reward in any turn. Otherwise, if you only want to use outcome rewards, you can simply return reward=0.0 for all intermediate turns.
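For example, the intermediate-turn return in the multi-turn step() above could hand out a small shaping reward for any well-formatted attempt (the 0.1 value is purely illustrative and not part of examples/multiply/env.py):

# Inside step(), on the non-terminal path (done is False):
return BaseTextEnvStepOutput(
    observations=[{"role": "user", "content": feedback}],
    reward=0.1 if found_boxed else 0.0,  # small turn-level shaping reward
    done=False,
    metadata={"parsed_answer": answer},
)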
SkyRL automatically computes the following metrics for logging purposes:
- pass_at_n: The n in pass_at_n is the number of trajectories we generate for each example. pass_at_n is 1 if any trajectory succeeded, and 0 otherwise. For each trajectory, we assume that the last turn's reward signifies the entire trajectory's reward, and any positive value is considered a "pass".
- mean_raw_reward: For each trajectory, we sum over all the turns' rewards. We then take the average over all the trajectories.
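As an illustration (this is just the definitions above spelled out in code, not SkyRL's actual implementation), suppose n=2 trajectories were generated for one example:

# Per-turn rewards for each of the n trajectories generated for one example.
trajectories = [
    [0.0, 0.0, 0.0],  # trajectory 1: never answered correctly
    [0.0, 0.5, 1.0],  # trajectory 2: formatted-but-wrong attempt, then correct
]

# pass_at_n: 1 if any trajectory's last-turn reward is positive, else 0.
pass_at_n = 1 if any(turns[-1] > 0 for turns in trajectories) else 0

# mean_raw_reward: sum each trajectory's turn rewards, then average over trajectories.
mean_raw_reward = sum(sum(turns) for turns in trajectories) / len(trajectories)

print(pass_at_n, mean_raw_reward)  # 1 0.75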
Whether you use turn-level rewards or outcome rewards, the rewards used to train the model will be translated to per-token rewards. For example, if there are 3 turns with 4 response tokens each and the turn-level rewards are [1.0, 2.0, 3.0], the resulting per-token rewards are:
[
 0.0, 0.0, 0.0, 1.0,
 0.0, 0.0, 0.0, 2.0,
 0.0, 0.0, 0.0, 3.0,
]

If there is only an outcome reward of 1.0 (i.e. intermediate turns' rewards are all 0.0), the per-token rewards are:
[
 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 0.0,
 0.0, 0.0, 0.0, 1.0,
]

Registering Your New Environment
Finally, we need to register the new environment so the training stack can find it by name (which we refer to as env_class). We will name this environment multiply.
We will create a new entrypoint for training with the multiply environment by creating a file at examples/multiply/main_multiply.py that looks like this:
Environment registration at examples/multiply/main_multiply.py:
@ray.remote(num_cpus=1)
def skyrl_entrypoint(cfg: DictConfig):
    # Register the multiply environment
    # this needs to be done inside the entrypoint task
    register(
        id="multiply",  # <-- The name of the environment.
        entry_point="examples.multiply.env:MultiplyEnv",  # <-- The path to the environment class.
    )

    # make sure that the training loop is not run on the head node.
    exp = BasePPOExp(cfg)
    exp.run()

@hydra.main(config_path=config_dir, config_name="ppo_base_config", version_base=None)
def main(cfg: DictConfig) -> None:
    # validate the arguments
    validate_cfg(cfg)

    initialize_ray(cfg)
    ray.get(skyrl_entrypoint.remote(cfg))

if __name__ == "__main__":
    main()

Now, the training stack can simply build the new environment with skyrl_gym.make("multiply")!
All example code written in this document is outside of the skyrl-train and skyrl-gym packages. There is no need to fork and edit skyrl-train or skyrl-gym code -- just implement and register your environment, and the training stack can find the environment seamlessly!
Preparing Training Data
Before we can train, we need a dataset of problems to train on.
We can generate a dataset of multiplication problems using examples/multiply/multiply_dataset.py. See the file for more details, but the core idea is to generate random multiplication problems of n-digit numbers, and ensure the dataset example is in the correct format:
Generating a dataset of random multiplication problems:
for idx in range(num_examples):
    question, answer = generate_multiplication_problem(num_digits)

    data = {
        "data_source": "synthetic_multiply",
        "prompt": [
            system_prompt,
            {
                "role": "user",
                "content": question,
            },
        ],
        "env_class": "multiply",
        "reward_spec": {
            "method": "rule",
            "ground_truth": answer,
        },
        "extra_info": {
            "num_digits": num_digits,
            "split": split_name,
        },
    }
    examples.append(data)

Note that the env_class here should match the name of the environment we registered. In this case, it is multiply. You can optionally omit the env_class here and instead set it in the training configuration to apply to all training samples, but setting env_class per-sample allows for multi-environment training, so it is the recommended practice.
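The generate_multiplication_problem helper used above lives in examples/multiply/multiply_dataset.py; a rough sketch of what it might look like (the actual implementation may differ) is:

import random

def generate_multiplication_problem(num_digits: int):
    # Sample two random n-digit numbers and return the question string and its answer.
    lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return f"{a} * {b}", str(a * b)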
See the doc on Dataset Preparation for more details on the required dataset format and how to prepare your own dataset.
Now we can generate the dataset:
uv run --isolated examples/multiply/multiply_dataset.py \
--output_dir $HOME/data/multiply \
--num_digits 4 \
--train_size 10000 \
--test_size 200

This creates train.parquet and validation.parquet files in the $HOME/data/multiply directory.
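To sanity-check the output, you can load the first row and confirm it matches the dataset format above (pandas is used here only as an example parquet reader; any reader works):

import os
import pandas as pd

df = pd.read_parquet(os.path.expanduser("~/data/multiply/train.parquet"))
example = df.iloc[0]
print(example["env_class"])    # "multiply"
print(example["prompt"])       # system prompt + user question
print(example["reward_spec"])  # {"method": "rule", "ground_truth": ...}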
Training Your Model
Time to train! 🚀
We will use the script at examples/multiply/run_multiply.sh to train the model; it sets up the training configuration and calls main_multiply.py.
Common Configuration Parameters
First, make sure your config matches your available GPUs. You may need to adjust the following parameters to match your GPU count (which we set via an environment variable NUM_GPUS):
- trainer.placement.policy_num_gpus_per_node
- generator.num_inference_engines
Then, configure how the environment should be executed. For multi-turn environments, we recommend setting generator.batched=false and generator.async_engine=true to ensure that each environment is executed asynchronously. If your environment is single-turn, you may get better performance by reversing these settings.
Launch Training
export WANDB_API_KEY=your_wandb_api_key # or set trainer.logger="console" to print to stdout
bash examples/multiply/run_multiply.sh

Next Steps: Want to make multiplication easier? Try integrating a calculator tool into your environment! Check out the tools_guide documentation for details.
That's it! You've created a custom environment, prepared training data, and started training. The same pattern works for any text-based task you want to train on.
Now watch your model become a multiplication master!