Remote Inference Engine
This example shows how to run GRPO training on GSM8K using a standalone inference engine instance behind an HTTP server.
In this example, we use 4 GPUs for training (the policy and reference models) and 4 GPUs for inference (the inference engine) on a single node.
Prerequisites
First, make sure you are familiar with the standard setup process for running GRPO training. See the Quick Start Guide for more details.
Starting the Remote Inference Engine
We provide an example remote server implementation at skyrl_train/inference_engines/vllm/vllm_server.py.
In addition to the standard OpenAI-compatible generation endpoints (/v1/chat/completions and /v1/completions), the remote server simply needs to expose endpoints for the following operations:
- /sleep and /wake_up to put the inference engine to sleep and wake it up (only needed if you want to colocate the inference engine with the training process)
- /init_weight_update_communicator to create a process group for weight updates
- /update_weight to update the model weights
That's it! We use a custom WorkerWrap class at skyrl_train/inference_engines/vllm/vllm_engine.py with vLLM's worker extension API to enable efficient NCCL weight updates.
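To make the flow concrete, here is a rough sketch of how a trainer could drive these endpoints over plain HTTP. The payload fields shown are illustrative placeholders only; the actual request schemas are defined in skyrl_train/inference_engines/vllm/vllm_server.py.

# Hypothetical endpoint walkthrough - field names are placeholders, not the exact schema
BASE=http://127.0.0.1:8001

# Wake the engine before generation / weight sync (only relevant when colocating)
curl -X POST $BASE/wake_up

# Create the process group used for weight updates from the trainer
curl -X POST $BASE/init_weight_update_communicator \
  -H "Content-Type: application/json" \
  -d '{"master_addr": "127.0.0.1", "master_port": 29500, "rank_offset": 1, "world_size": 5, "group_name": "weight_sync"}'

# Announce an incoming weight tensor; the tensor data itself is broadcast over NCCL
curl -X POST $BASE/update_weight \
  -H "Content-Type: application/json" \
  -d '{"name": "model.embed_tokens.weight", "dtype": "bfloat16", "shape": [151936, 1536]}'

# Put the engine to sleep to free GPU memory for training (colocated setups)
curl -X POST $BASE/sleep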
To launch the server, run the following command (the full script is at examples/remote_inference_engine/run_vllm_server.sh):
uv run --isolated --extra vllm -m skyrl_train.inference_engines.vllm.vllm_server \
# model and tensor parallel size
--model Qwen/Qwen2.5-1.5B-Instruct \
--tensor-parallel-size 4 \
# host and port for the server
--host 127.0.0.1 \
--port 8001 \
# seed for reproducibility
--seed 42 \
# max context length
--max-model-len 4096 \
# vllm performance related settings
--enable-prefix-caching \
--enable-chunked-prefill \
--dtype bfloat16 \
--gpu-memory-utilization 0.9 \
--enable-sleep-mode \
--max-num-batched-tokens 8192 \
--max-num-seqs 1024 \
# other misc settings
--trust-remote-code \
--distributed-executor-backend ray \
# worker extension class for handling weight updates
--worker-extension-cls skyrl_train.inference_engines.vllm.vllm_engine.WorkerWrap

Starting Training
Now that we've started our remote inference engine (serving at 127.0.0.1:8001), we can start a training run!
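You can first sanity-check that the server is reachable with a quick completion request (the prompt below is just an example):

curl http://127.0.0.1:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "1 + 1 =", "max_tokens": 8}'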
To start training, we need to set up our training script. You can find a complete example in examples/remote_inference_engine/run_remote.sh:
uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
# Setup for training with a remote inference engine
generator.run_engines_locally=False \
generator.remote_inference_engine_urls="['127.0.0.1:8001']" \
generator.override_existing_update_group=True \
# sampling parameters for generation
generator.sampling_params.temperature=0.6 \
generator.sampling_params.top_p=0.95 \
# Data setup
data.train_data="['$HOME/data/gsm8k/train.parquet']" \
data.val_data="['$HOME/data/gsm8k/validation.parquet']" \
# Policy model - make sure this is the same model used to launch the inference engine server
trainer.policy.model.path="Qwen/Qwen2.5-1.5B-Instruct" \
trainer.algorithm.advantage_estimator="grpo" \
# Whether or not to colocate all models on the same set of GPUs - we set it to false here,
# but you can colocate even with a standalone inference engine!
trainer.placement.colocate_all=False \
# Model placement arguments for the policy and ref models - make sure training and
# inference together use all of the node's GPUs (4 + 4 here)
trainer.placement.policy_num_gpus_per_node=4 \
trainer.placement.ref_num_gpus_per_node=4 \
# Training batch size and mini/micro batch sizes for logprobs + training passes
trainer.train_batch_size=64 \
trainer.policy_mini_batch_size=64 \
trainer.micro_forward_batch_size_per_gpu=20 \
trainer.micro_train_batch_size_per_gpu=20 \
# Evaluation
trainer.eval_batch_size=1024 \
trainer.eval_before_train=True \
trainer.eval_interval=5 \
... # Other parameters (see `examples/remote_inference_engine/run_remote.sh` for more)

With remote servers, there can be non-trivial HTTP overhead during generation. When running training and inference in the same Ray cluster, it is recommended to use run_engines_locally=True to maximize throughput.
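If you later switch to a colocated setup, the main change at the command level is the flag above; a minimal sketch (all other parameters as in examples/remote_inference_engine/run_remote.sh):

uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  generator.run_engines_locally=True \
  ... # remaining parameters, minus generator.remote_inference_engine_urls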
Launching Your Training Run
You're done setting up! Now let's get our training run started!
export WANDB_API_KEY=your_wandb_api_key
bash examples/remote_inference_engine/run_remote.sh

What's Next?
Now that you've set up training with a remote inference engine, you might want to explore ways of speeding up training:
- Async Training: Asynchronous off-by-one training in < 100 lines of code!