Troubleshooting
Placement Group Timeouts
In SkyRL, we use Ray placement groups to request resources for different actors. In Ray clusters that autoscale with KubeRay, placement group creation can take a long time since the cluster might have to add a new node, pull the relevant image and start the container, etc.
To increase the timeout for placement group creation, set the SKYRL_RAY_PG_TIMEOUT_IN_S environment variable in the .env file passed to the uv run command with --env-file. The default is 180 seconds.
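To make the role of this timeout concrete, here is a minimal sketch of how such a value might be read and applied while waiting for a placement group. It is not SkyRL's actual implementation: the helper names pg_timeout_seconds and wait_for_placement_group are hypothetical, and only the variable name and the 180-second default come from this guide. The Ray calls it relies on (pg.ready() and ray.get with a timeout) are standard Ray APIs.

```python
import os
from typing import Mapping, Optional

DEFAULT_PG_TIMEOUT_S = 180  # default stated in this guide


def pg_timeout_seconds(env: Optional[Mapping[str, str]] = None) -> int:
    """Read SKYRL_RAY_PG_TIMEOUT_IN_S, falling back to the default (hypothetical helper)."""
    env = os.environ if env is None else env
    return int(env.get("SKYRL_RAY_PG_TIMEOUT_IN_S", DEFAULT_PG_TIMEOUT_S))


def wait_for_placement_group(pg, timeout_s: int) -> None:
    """Block until the placement group is scheduled, or raise on timeout (hypothetical helper)."""
    import ray  # imported lazily so pg_timeout_seconds works without Ray installed
    from ray.exceptions import GetTimeoutError

    try:
        # pg.ready() returns an object ref that resolves once all bundles are placed.
        ray.get(pg.ready(), timeout=timeout_s)
    except GetTimeoutError as exc:
        raise RuntimeError(
            f"Placement group not ready after {timeout_s}s; "
            "consider raising SKYRL_RAY_PG_TIMEOUT_IN_S."
        ) from exc
```

On an autoscaling cluster, a value large enough to cover node provisioning, image pull, and container start is the safer choice.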
Multi-node Training
For multi-node training, it is helpful to first confirm that your cluster is properly configured. We provide a script at scripts/multi_node_nccl_test.py to test multi-node communication.
To run the script, you can use the following command:
uv run --isolated --env-file .env scripts/multi_node_nccl_test.py --num-nodes 2
The .env file is optional, but it is recommended for configuring environment variables.
Note on LD_LIBRARY_PATH
If you are using RDMA, you may need to customize LD_LIBRARY_PATH to include the RDMA libraries (e.g., EFA on AWS). We've seen issues with uv where LD_LIBRARY_PATH is not exported even if it is set in the .env file. We recommend setting SKYRL_LD_LIBRARY_PATH_EXPORT=1 in the .env file and setting LD_LIBRARY_PATH directly in the current shell.
Worker startup timeout
Sometimes, you might see the following logs from the raylet:
(raylet) [2025-08-29 23:30:23,740 E 2711113 2711113] (raylet) worker_pool.cc:586: Some workers of the worker process(2769699) have not registered within the timeout. The process is still alive, probably it's hanging during start. [repeated 4x across cluster]
What this means
In Ray, all tasks and actors run in a "worker process". Each worker process has to register with Ray's centralized Global Control Service. Code inside the actor class's __init__ method runs after registration. In our case, we use the uv + ray integration, so startup includes starting the worker process with the uv run command (the exact same command used to launch the entrypoint).
The error message means that a worker process did not start up within the timeout. This can be due to:
- Incorrect startup command: maybe the path to an env file passed to uv run --env-file <env_file> .. is incorrect, or not available on the worker node, etc.
  - Fix: Correct the startup command so that it is valid when run inside the working directory on the head as well as worker nodes.
- Slow startup: Your run can hang even if you simply exceed Ray's timeout limit for worker startup. This can happen with a cold-start run (the first SkyRL run in your cluster), when uv has to download all the dependencies and populate its cache.
  - Fix: Increase the registration timeout for Ray workers with RAY_worker_register_timeout_seconds (say 600). This should be set before the Ray cluster is launched with ray start. The ideal fix is to bake this large value into the container.
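Because the timeout has to be present in the environment when ray start runs, one way to launch a node is sketched below. This is an illustration rather than a SkyRL utility: the helper names ray_start_env and start_ray_head are hypothetical, and 600 is simply the example value suggested above.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def ray_start_env(
    base_env: Optional[Mapping[str, str]] = None, timeout_s: int = 600
) -> Dict[str, str]:
    """Return a copy of the environment with the worker registration timeout set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RAY_worker_register_timeout_seconds"] = str(timeout_s)
    return env


def start_ray_head(timeout_s: int = 600) -> None:
    """Launch the Ray head node with the larger timeout in its environment."""
    subprocess.run(
        ["ray", "start", "--head"],
        env=ray_start_env(timeout_s=timeout_s),
        check=True,
    )
```

Worker nodes would need the same variable in their environment before they run ray start, which is why setting it in the container image is the more robust option.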
Memory leaks during training with code that uses multiprocessing
One of the most common issues during training is unexplained memory leaks that lead to CPU RAM OOMs. The root cause varies, but you can typically find the problematic piece of code by inspecting memory usage:
ps -eo pid,comm,rss --sort=-rss | awk 'NR==1{print $0} NR>1{printf "%-8s %-20s %.2f GB\n", $1, $2, $3/1024/1024}' | head
One common anti-pattern that causes memory leaks is using os.fork with Ray. If your custom environment uses fork to spawn multiple processes, it can lead to unexplained behaviour because of its interaction with Ray. Ray uses sockets to communicate with other Ray-related processes, and using fork is not a good idea in this case. If you’re seeing a memory leak, check which processes are using the most memory with the above command; if those processes are all doing the same thing, then it is likely an issue with fork + Ray (although not always). See https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html for more details.
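The Ray page linked above suggests the "spawn" start method as an alternative to fork when application code needs child processes. A minimal sketch with the standard multiprocessing module follows; the function names square, spawn_context, and run_in_spawned_workers are hypothetical and stand in for whatever work your environment does.

```python
import multiprocessing as mp
from typing import List, Sequence


def square(x: int) -> int:
    # Must be a module-level function so that spawned children can import it.
    return x * x


def spawn_context():
    """Return a multiprocessing context that starts children with 'spawn', not 'fork'."""
    return mp.get_context("spawn")


def run_in_spawned_workers(values: Sequence[int], num_workers: int = 2) -> List[int]:
    """Map square over values in freshly spawned worker processes."""
    # Spawned children start from a new interpreter, so they do not inherit
    # the parent's open sockets or memory the way forked children do.
    with spawn_context().Pool(processes=num_workers) as pool:
        return pool.map(square, values)
```

When calling run_in_spawned_workers from a script, do so under an if __name__ == "__main__": guard, since spawned children re-import the main module. Using an explicit context also avoids changing the global start method for other libraries in the same process.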