vLLM Engine Metrics

SkyRL can route vLLM's engine-level metrics (queue depth, KV cache usage, throughput, latency, prefix-cache hit rate) through Ray's per-node Prometheus metrics agents. A small fixed subset is also scraped once per training step and merged into the trainer's wandb payload.

Enabling

Metrics export is on by default. To disable it:

generator:
  inference_engine:
    enable_ray_prometheus_stats: false

When enabled, vLLM's RayPrometheusStatLogger is installed on every engine. Each engine reports its stats through ray.util.metrics, and Ray's per-node metrics agent exposes them at http://<node-ip>:<MetricsExportPort>/metrics in Prometheus text format.
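As a rough illustration of what a scrape of that endpoint involves, the sketch below fetches the text and keeps only the vLLM samples. It is not SkyRL's scraper: the helper names are hypothetical, the parser handles only simple sample lines rather than the full exposition format, and the ray_vllm_ prefix is the exported naming described under "Querying additional metrics".

```python
import re
import urllib.request
from typing import Dict, List, Tuple

# A sample line looks like: name{label="value",...} 12.5 [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def fetch_metrics_text(url: str, timeout: float = 5.0) -> str:
    """Download the raw Prometheus text from one node's metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_vllm_samples(text: str, prefix: str = "ray_vllm_") -> List[Sample]:
    """Return (name, labels, value) for every sample whose name starts with prefix."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith(prefix):
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Typical use would be parse_vllm_samples(fetch_metrics_text(url)) with the endpoint URL of one node. For anything beyond a quick check, a full parser such as the one shipped with the prometheus_client package is likely a safer choice than this regex.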

Metrics logged to wandb

When the flag is on, the trainer constructs a VLLMMetricsScraper (trainer.py:122-124) that scrapes every alive Ray node's metrics endpoint once per training step and merges the result into the trainer's log payload. This is the same payload used for training metrics, so the keys appear under whichever logger backend is configured (wandb, mlflow, swanlab, tensorboard, or console).

Both Trainer and FullyAsyncTrainer log the following keys:

Key	Source	Aggregation
`vllm/num_requests_running`	gauge	sum across replicas
`vllm/num_requests_waiting`	gauge	sum across replicas
`vllm/kv_cache_usage_perc`	gauge	mean across replicas
`vllm/generation_throughput_tok_s`	counter delta / Δt	summed before differencing
`vllm/prompt_throughput_tok_s`	counter delta / Δt	summed before differencing
`vllm/prefix_cache_hit_rate`	hits Δ / queries Δ	summed before ratio
`vllm/ttft_seconds_avg`	histogram sum Δ / count Δ	summed before ratio
`vllm/tpot_seconds_avg`	histogram sum Δ / count Δ	summed before ratio

Rate- and ratio-style metrics need two consecutive samples to take a delta, so they appear starting from the second training step. Counter resets (e.g. engine restart) are skipped rather than reported as negative rates.
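The aggregation rules in the table, together with the two-sample and counter-reset behavior, can be sketched as below. This is an illustration under stated assumptions, not the code in vllm_metrics_scraper.py: the function names are hypothetical, inputs are assumed to be already summed across replicas where the table says so, and returning None when the denominator did not advance is an assumption about how an empty interval is handled.

```python
from typing import List, Optional


def aggregate_gauge(values: List[float], how: str) -> Optional[float]:
    """Combine per-replica gauge readings: 'sum' or 'mean' across replicas."""
    if not values:
        return None
    total = sum(values)
    return total if how == "sum" else total / len(values)


def counter_rate(
    prev_total: Optional[float], curr_total: float, dt_seconds: float
) -> Optional[float]:
    """Rate from a counter already summed across replicas (delta / dt)."""
    # The first sample has nothing to difference against.
    if prev_total is None or dt_seconds <= 0:
        return None
    delta = curr_total - prev_total
    # A negative delta means a counter reset (e.g. engine restart): skip it.
    if delta < 0:
        return None
    return delta / dt_seconds


def delta_ratio(
    prev_num: Optional[float],
    curr_num: float,
    prev_den: Optional[float],
    curr_den: float,
) -> Optional[float]:
    """Ratio of two deltas, e.g. hits / queries or histogram sum / count."""
    if prev_num is None or prev_den is None:
        return None
    d_num = curr_num - prev_num
    d_den = curr_den - prev_den
    # Skip resets, and (by assumption) intervals with no new observations.
    if d_num < 0 or d_den <= 0:
        return None
    return d_num / d_den
```

With these helpers, vllm/ttft_seconds_avg for a step would correspond to delta_ratio applied to the summed histogram _sum and _count values from the previous and current scrape.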

The full set of vLLM metrics is still available via the Prometheus endpoints themselves — only this curated subset is forwarded to wandb. The selection lives in vllm_metrics_scraper.py:27-51.

Querying additional metrics

Any metric the vLLM engine exports can be queried from the same Ray metrics endpoints, not just the keys in the table above.

To actually query these over time, point a Prometheus server at Ray's metrics endpoints (Ray exposes them for exactly this) and use PromQL.

Names are sanitized on the way out: vLLM's `:` becomes `_`, and Ray's metrics agent prepends `ray_`. So vLLM's `vllm:kv_cache_usage_perc` is exported as `ray_vllm_kv_cache_usage_perc`, and histograms expose `_sum` / `_count` / `_bucket` samples.
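The renaming rule is mechanical, so a small helper can translate a vLLM metric name into the series name to look for in Prometheus. This is a minimal sketch of the rule as stated here, covering only the colon replacement and the ray_ prefix; the function names are hypothetical and not part of SkyRL, vLLM, or Ray.

```python
from typing import List


def exported_name(vllm_name: str, prefix: str = "ray_") -> str:
    """Map a vLLM metric name to the name exposed by Ray's metrics agent."""
    # vLLM's ':' separator becomes '_', then Ray's prefix is prepended.
    return prefix + vllm_name.replace(":", "_")


def histogram_series(vllm_name: str) -> List[str]:
    """List the series a histogram metric is exported as."""
    base = exported_name(vllm_name)
    return [f"{base}_{suffix}" for suffix in ("sum", "count", "bucket")]
```

For example, exported_name("vllm:kv_cache_usage_perc") gives ray_vllm_kv_cache_usage_perc, matching the mapping above.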

Example: KV cache block lifetime

vLLM can emit KV-cache residency histograms: block lifetime (from allocation to eviction), idle time before eviction, and reuse gaps. These are off by default. To turn them on, start the engine with --kv-cache-metrics; log stats must also be enabled. Blocks are sampled at a rate of 1%, controlled by --kv-cache-metrics-sample. Once on, the series are ray_vllm_kv_block_lifetime_seconds_{sum,count,bucket}.

Because it's a histogram, query it with histogram_quantile over the bucket rate — e.g. p95 block lifetime over a 5m window:

# p95 KV block lifetime
histogram_quantile(
  0.95,
  sum by (le) (
    rate(ray_vllm_kv_block_lifetime_seconds_bucket[5m])
  )
)
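To run the same query from a script rather than the Prometheus UI, one option is Prometheus's HTTP API instant-query endpoint, /api/v1/query. The sketch below assumes a Prometheus server is already scraping Ray's metrics endpoints; the base URL passed in (for instance http://localhost:9090) and the helper names are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List, Tuple

# Same p95 query as above, written on one line.
P95_QUERY = (
    "histogram_quantile(0.95, sum by (le) ("
    "rate(ray_vllm_kv_block_lifetime_seconds_bucket[5m])))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample's value is [unix_timestamp, "value as string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query_p95(base_url: str, timeout: float = 10.0) -> List[Tuple[Dict[str, str], float]]:
    """Run the p95 block-lifetime query against a Prometheus server."""
    url = build_query_url(base_url, P95_QUERY)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

Because the query aggregates with sum by (le) before taking the quantile, the result should be a single sample with an empty label set, whose value is the estimated p95 in seconds.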
