vLLM Engine Metrics

SkyRL can route vLLM's engine-level metrics (queue depth, KV cache usage, throughput, latency, prefix-cache hit rate) through Ray's per-node Prometheus metrics agents. A small fixed subset is also scraped once per training step and merged into the trainer's wandb payload.

Enabling

This is on by default. To disable it:

generator:
  inference_engine:
    enable_ray_prometheus_stats: false

When enabled, vLLM's RayPrometheusStatLogger is installed on every engine. Each engine reports its stats through ray.util.metrics, and Ray's per-node metrics agent exposes them at http://<node-ip>:<MetricsExportPort>/metrics in Prometheus text format.
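
To make the endpoint format concrete, here is a minimal sketch of reading one node's metrics page and parsing it into samples. The helper names `parse_prometheus_text` and `fetch_node_metrics` are hypothetical and are not part of SkyRL. The parser is deliberately simplified: it assumes no trailing timestamps and no commas or escaped quotes inside label values.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples.

    Simplified sketch: assumes lines carry no timestamp and label values
    contain no commas or escaped quotes.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # The sample value is the last whitespace-separated token.
        series, _, value = line.rpartition(" ")
        name, _, label_part = series.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


def fetch_node_metrics(node_ip, metrics_export_port, timeout=5.0):
    """Fetch and parse one node's metrics page (hypothetical helper)."""
    url = f"http://{node_ip}:{metrics_export_port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

For anything beyond a quick check, an existing Prometheus client parser is likely a safer choice than hand-rolled string splitting.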

Inference path support

| Inference path | Supported |
| --- | --- |
| New inference (_SKYRL_USE_NEW_INFERENCE=1, default) | Yes |
| Old inference + generator.async_engine=true | Yes |
| Old inference + generator.async_engine=false | No |

The new inference path (vllm_server_actor.py:329-339) always uses AsyncLLMEngine and wires the stat logger unconditionally.

The legacy path supports it only when async_engine=true (vllm_engine.py:359-370). The synchronous VLLMInferenceEngine pops the flag and emits a warning (vllm_engine.py:240-247), because vLLM's synchronous LLM class does not accept stat_loggers. Set generator.async_engine=true if you need engine metrics on the legacy path.

Metrics logged to wandb

When the flag is on, the trainer constructs a VLLMMetricsScraper (trainer.py:122-124) that scrapes the metrics endpoint of every alive Ray node once per training step and merges the result into the trainer's log payload. This is the same payload used for training metrics, so the keys appear under whichever logger backend is configured (wandb, mlflow, swanlab, tensorboard, or console).

Both Trainer and FullyAsyncTrainer log these:

| Key | Source | Aggregation |
| --- | --- | --- |
| vllm/num_requests_running | gauge | sum across replicas |
| vllm/num_requests_waiting | gauge | sum across replicas |
| vllm/kv_cache_usage_perc | gauge | mean across replicas |
| vllm/generation_throughput_tok_s | counter delta / Δt | summed before differencing |
| vllm/prompt_throughput_tok_s | counter delta / Δt | summed before differencing |
| vllm/prefix_cache_hit_rate | hits Δ / queries Δ | summed before ratio |
| vllm/ttft_seconds_avg | histogram sum Δ / count Δ | summed before ratio |
| vllm/tpot_seconds_avg | histogram sum Δ / count Δ | summed before ratio |

Rate- and ratio-style metrics need two consecutive samples to take a delta, so they appear starting from the second training step. Counter resets (e.g. engine restart) are skipped rather than reported as negative rates.
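
The delta logic described above can be sketched as follows. This is an illustration of the described behavior, not the actual code in vllm_metrics_scraper.py; the function names `counter_rate` and `delta_ratio` are hypothetical.

```python
def counter_rate(prev_values, curr_values, dt_seconds):
    """Per-second rate from per-replica counter readings at two scrapes.

    Replicas are summed first, then differenced. Returns None when no
    usable delta exists (first sample, bad interval, or counter reset).
    """
    if prev_values is None or dt_seconds <= 0:
        return None  # first training step: nothing to difference against
    delta = sum(curr_values) - sum(prev_values)
    if delta < 0:
        return None  # counter went backwards, e.g. engine restart: skip
    return delta / dt_seconds


def delta_ratio(prev_num, curr_num, prev_den, curr_den):
    """Ratio of two deltas, e.g. prefix-cache hits over queries,
    or histogram sum over histogram count.

    Inputs are already summed across replicas.
    """
    num_delta = curr_num - prev_num
    den_delta = curr_den - prev_den
    if num_delta < 0 or den_delta <= 0:
        return None  # reset, or no new observations in this interval
    return num_delta / den_delta
```

Returning None rather than 0 keeps a skipped step distinguishable from a genuinely idle one when the values are plotted.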

The full set of vLLM metrics is still available via the Prometheus endpoints themselves — only this curated subset is forwarded to wandb. The selection lives in vllm_metrics_scraper.py:27-51.

Querying additional metrics

Anything vLLM emits beyond the curated keys in the table can be queried from the same Ray metrics endpoints.

To query these metrics over time, point a Prometheus server at Ray's metrics endpoints (Ray exposes them for exactly this purpose) and use PromQL.
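
One way to hand the endpoints to Prometheus is a file-based service discovery list. The sketch below builds one from node records; it assumes each record is a dict with `Alive`, `NodeManagerAddress`, and `MetricsExportPort` keys (the shape `ray.nodes()` is assumed to return here). The function name `build_file_sd_targets` and the `job` label value are hypothetical.

```python
import json


def build_file_sd_targets(nodes, job_label="ray"):
    """Build a Prometheus file_sd target list from node records.

    `nodes` is assumed to be a list of dicts with "Alive",
    "NodeManagerAddress", and "MetricsExportPort" keys.
    """
    targets = [
        f"{node['NodeManagerAddress']}:{node['MetricsExportPort']}"
        for node in nodes
        if node.get("Alive")  # dead nodes have no endpoint to scrape
    ]
    return [{"targets": targets, "labels": {"job": job_label}}]


# Example with made-up node records; the second node is not alive.
example_nodes = [
    {"Alive": True, "NodeManagerAddress": "10.0.0.1", "MetricsExportPort": 8080},
    {"Alive": False, "NodeManagerAddress": "10.0.0.2", "MetricsExportPort": 8080},
]
file_sd_json = json.dumps(build_file_sd_targets(example_nodes), indent=2)
```

The resulting JSON can be written to a file that a `file_sd_configs` entry in the Prometheus scrape configuration points at.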

Names are sanitized on the way out: vLLM's : becomes _, and Ray's metrics agent prepends ray_. So vLLM's vllm:kv_cache_usage_perc is exported as ray_vllm_kv_cache_usage_perc, and histograms expose _sum / _count / _bucket samples.
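
The renaming rule can be written down as a small helper, which is handy when translating a metric name from the vLLM docs into the series to query. This is a sketch of the rule as described above; `exported_name` and `histogram_series` are hypothetical names, and any further sanitization Ray may apply to other characters is not covered.

```python
def exported_name(vllm_name, prefix="ray_"):
    """Map a vLLM metric name to the name seen on Ray's metrics endpoint.

    Applies only the two rules described above: ':' becomes '_',
    and 'ray_' is prepended.
    """
    return prefix + vllm_name.replace(":", "_")


def histogram_series(vllm_name):
    """List the sample series a histogram metric exposes."""
    base = exported_name(vllm_name)
    return [base + suffix for suffix in ("_sum", "_count", "_bucket")]
```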

Example: KV cache block lifetime

vLLM can emit KV-cache residency histograms: block lifetime (allocation → eviction), idle time before eviction, and reuse gaps. These are off by default; start the engine with --kv-cache-metrics. They also require log stats to be enabled, and are sampled at 1% of blocks, a rate controlled by --kv-cache-metrics-sample. Once on, the series are ray_vllm_kv_block_lifetime_seconds_{sum,count,bucket}.

Because it's a histogram, query it with histogram_quantile over the bucket rate — e.g. p95 block lifetime over a 5m window:

# p95 KV block lifetime
histogram_quantile(
  0.95,
  sum by (le) (
    rate(ray_vllm_kv_block_lifetime_seconds_bucket[5m])
  )
)
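
To build intuition for what the query above returns, here is a rough Python sketch of the bucket interpolation that histogram_quantile performs. It is a simplified re-implementation for illustration, not Prometheus's code: it assumes classic cumulative buckets that include a +Inf bucket, a lowest bucket that starts at 0, and 0 < q <= 1. The bucket values in the usage lines are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and
    should include the (math.inf, total) bucket. Counts may be raw
    counts or per-second rates; only their proportions matter.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the
                # highest finite upper bound instead.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for block lifetimes, in seconds.
example_buckets = [(1.0, 10), (5.0, 60), (10.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.50, example_buckets)
p95 = histogram_quantile(0.95, example_buckets)
```

In this example the p95 lands in the +Inf bucket, so the estimate is capped at the highest finite bound. If that happens with real data, the histogram's buckets are too coarse at the top end to resolve the tail.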
