Multi-Turn VLM RL on Geometry-3K
Uses the fsdp backend. See the Vision-Language RL tutorial for the VLM setup background (required flags and local vLLM source override).
In this example we train a vision-language model to solve geometry problems from the Geometry-3K dataset. The model sees a geometry diagram plus a text question, reasons, and iteratively checks its answer with a calc_score tool before committing to a final boxed answer.
Source code: examples/train/geometry3k/.
Task Overview
The dataset contains ~3,000 geometry problems, each with:
- image(s) — geometry diagrams (PIL images).
- problem — problem text that may reference the image.
- answer — ground-truth answer.
The model interacts over up to 3 turns per episode:
- Read the diagram and question.
- Reason, then call `calc_score` to check an answer: `<tool_call>{"name": "calc_score", "arguments": {"answer": "<your_answer>"}}</tool_call>`
- The environment responds with `calc_score result: 0.0|1.0` plus feedback; the model can retry with a different approach.
- Give a final answer as `\boxed{answer}`.
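The turn loop above can be sketched as a gym-style episode. This is illustrative only: the `env`/`policy` interfaces and the 4-tuple `step()` return shape are assumptions, not the example's actual API.

```python
def rollout(env, policy, max_turns: int = 3) -> float:
    """Run one multi-turn episode and return the final binary reward."""
    obs = env.reset()  # initial prompt: diagram + question
    reward = 0.0
    for _ in range(max_turns):
        action = policy(obs)                         # model output for this turn
        obs, reward, done, _info = env.step(action)  # text-only feedback
        if done:                                     # boxed answer found or episode ended
            break
    return reward
```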
Reward is binary (1.0 correct / 0.0 otherwise), extracted from `\boxed{}` or the last `calc_score` tool call.
Data Preparation
The training script auto-generates the dataset on first run. To prepare it explicitly:
```shell
uv run examples/train/geometry3k/geometry_3k_dataset.py --output_dir ~/data/geometry_3k
```

This downloads `hiyouga/geometry3k`, base64-encodes each PIL diagram as a JPEG data URI, and writes `train.parquet` / `val.parquet`. Each row's prompt is a chat-format message with an `image_url` content part followed by the question text. See the Vision-Language RL tutorial for the record format.
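The encoding step can be sketched as follows. This is a minimal sketch: `image_to_data_uri` and `build_prompt` are hypothetical helper names, not the functions used in `geometry_3k_dataset.py`.

```python
import base64
import io

from PIL import Image  # pillow


def image_to_data_uri(img: Image.Image) -> str:
    """Encode a PIL image as a base64 JPEG data URI."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    b64 = base64.b64encode(buf.getvalue()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"


def build_prompt(img: Image.Image, question: str) -> list[dict]:
    """Chat-format message: an image_url content part followed by the question text."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_uri(img)}},
            {"type": "text", "text": question},
        ],
    }]
```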
Tool-Calling Protocol
The environment at `examples/train/geometry3k/env.py` parses `<tool_call>` blocks from the model output:
```python
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
SUPPORTED_TOOL_NAMES = {"calc_score"}
```

If the tool call scores 1.0, the env returns "you can now provide the final answer." If it scores 0.0 and it is not the final turn, the env nudges the model to reason differently. If no tool call is found, the env falls back to extracting a boxed answer directly from the text and terminates the episode.
Observations returned by `step()` are plain-text feedback strings (no images): the model sees the diagram once in the initial prompt and reasons about it across turns without new image inputs. This is the text-only observation pattern described in the VLM tutorial.
Training Configuration
Key VLM-specific overrides from examples/train/geometry3k/run_geometry3k.sh:
```shell
_SKYRL_USE_NEW_INFERENCE=1 uv run --isolated --extra fsdp \
  python examples/train/geometry3k/geometry3k_entrypoint.py \
    trainer.policy.model.path="Qwen/Qwen3-VL-8B-Instruct" \
    trainer.strategy=fsdp2 \
    # VLM flags — see the tutorial for background
    generator.vision_language_generator=true \
    generator.batched=false \
    trainer.use_sample_packing=false \
    # Multi-turn rollouts
    generator.max_turns=3 \
    generator.inference_engine.async_engine=true \
    # Algorithm
    trainer.algorithm.advantage_estimator="grpo" \
    trainer.algorithm.use_kl_loss=false \
    generator.n_samples_per_prompt=4 \
    # ... full config in the script
```

Defaults: Qwen3-VL-8B-Instruct, GRPO, 4 samples per prompt, 3 turns max, max prompt length 1024 / max generate length 2048 per turn.
Launching
```shell
bash examples/train/geometry3k/run_geometry3k.sh
```

Override with env vars:

```shell
# W&B logging
LOGGER=wandb bash examples/train/geometry3k/run_geometry3k.sh

# Custom data dir
DATA_DIR=/path/to/data bash examples/train/geometry3k/run_geometry3k.sh
```

Results

What's Next
- VisGym example — multi-image multi-turn VLM RL where the env returns a new image per step.
- Vision-Language RL tutorial — shared setup background for both VLM examples.
- Creating a New Environment — build your own multi-turn env.
Citation
Geometry-3K was introduced by Lu et al. (2021) as part of Inter-GPS. The parquet build used in this example is derived from the hiyouga/geometry3k repackaging on Hugging Face. If you use this example in your work, please cite the original paper:
@inproceedings{lu2021inter,
title = {Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning},
author = {Lu, Pan and Gong, Ran and Jiang, Shibiao and Qiu, Liang and Huang, Siyuan and Liang, Xiaodan and Zhu, Song-Chun},
booktitle = {The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021)},
year = {2021}
}