Troubleshooting

Placement Group Timeouts

SkyRL uses Ray placement groups to request resources for different actors. In Ray clusters that autoscale with KubeRay, placement group creation can take a long time, because the cluster might have to add a new node, pull the relevant image, and start the container. To increase the timeout for placement group creation, set the SKYRL_RAY_PG_TIMEOUT_IN_S environment variable, for example in the .env file passed to the uv run command with --env-file. The default timeout is 180 seconds.
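
As a minimal sketch, assuming a .env file in the current directory, the timeout could be raised as follows. The value 600 is only an illustrative choice, not a recommended setting; pick one that covers how long your cluster takes to bring up a new node.

```shell
# Append a longer placement group timeout (in seconds) to the .env file.
# 600 is an illustrative value; the default is 180.
echo "SKYRL_RAY_PG_TIMEOUT_IN_S=600" >> .env

# The file is then picked up when it is passed to uv, e.g.:
#   uv run --isolated --env-file .env ...
```

Because the variable is read from the environment, exporting it in the shell that launches the job should work as well, assuming that environment reaches the process that creates the placement groups.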

Multi-node Training

For multi-node training, it is helpful to first confirm that your cluster is properly configured. We provide a script at scripts/multi_node_nccl_test.py to test multi-node communication.

To run the script, you can use the following command:

uv run --isolated --env-file .env scripts/multi_node_nccl_test.py --num-nodes 2

The .env file is optional, but we recommend using it to configure environment variables.

Note on LD_LIBRARY_PATH

If you are using RDMA, you may need to customize LD_LIBRARY_PATH to include the RDMA libraries (for example, EFA on AWS). We have seen issues with uv where LD_LIBRARY_PATH is not exported even when it is set in the .env file. To work around this, set SKYRL_LD_LIBRARY_PATH_EXPORT=1 in the .env file and set LD_LIBRARY_PATH directly in the current shell.
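
The two steps above might look like the following sketch. The library directory /opt/amazon/efa/lib is an assumed location used for illustration only; substitute the path where the RDMA libraries are installed on your nodes.

```shell
# Assumed RDMA library location (hypothetical); replace with the real path.
RDMA_LIB_DIR="/opt/amazon/efa/lib"

# Prepend it to LD_LIBRARY_PATH in the current shell, keeping any existing value.
export LD_LIBRARY_PATH="${RDMA_LIB_DIR}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Set the SkyRL flag described above in the .env file.
echo "SKYRL_LD_LIBRARY_PATH_EXPORT=1" >> .env
```

The `${LD_LIBRARY_PATH:+:...}` expansion avoids leaving a trailing colon when LD_LIBRARY_PATH was previously unset, since an empty entry would make the dynamic linker search the current directory.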

Illegal Memory Access with vLLM

With vLLM >= 0.10.0, you may in some cases encounter "illegal memory access" errors (see https://github.com/vllm-project/vllm/issues/23814). For now, the recommended workaround is to downgrade to vLLM 0.9.2.

With SkyRL, this can be done with the following overrides:

uv run --isolated --extra vllm --with vllm==0.9.2 --with transformers==4.53.0 --with torch==2.7.0 --with "flash-attn@https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiTRUE-cp312-cp312-linux_x86_64.whl" -- ...
