Monitoring

The inference service exposes Prometheus metrics for monitoring system stats.

vLLM runtime

vLLM exposes Prometheus metrics at the /metrics endpoint. These metrics provide detailed insight into the system's performance, resource utilization, and request-processing statistics. The table below lists all available metrics:

vLLM version: 0.8.2

note

The vLLM V1 engine is now enabled by default; see the vLLM documentation for details.
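To inspect these metrics quickly, the sketch below fetches the endpoint output and indexes the samples by name. This is a minimal illustration, not part of vLLM itself: the server address is an assumed default, and the prometheus_client package is used only for parsing the exposition text.

```python
# Minimal sketch: scrape the vLLM /metrics endpoint and print a few
# scheduler gauges. The host/port are assumptions; adjust to your deployment.
import requests
from prometheus_client.parser import text_string_to_metric_families

VLLM_METRICS_URL = "http://localhost:8000/metrics"  # assumed server address

def scrape(url: str = VLLM_METRICS_URL) -> dict:
    """Fetch the Prometheus exposition text and index samples by name."""
    text = requests.get(url, timeout=5).text
    samples: dict = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            samples.setdefault(sample.name, []).append(sample)
    return samples

if __name__ == "__main__":
    samples = scrape()
    for name in ("vllm:num_requests_running", "vllm:num_requests_waiting"):
        for s in samples.get(name, []):
            print(f"{name} {s.labels} = {s.value}")
```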

| Category | Metric Name | Type | Description |
|---|---|---|---|
| System Stats - Scheduler | vllm:num_requests_running | Gauge | Number of requests currently running on GPU |
| System Stats - Scheduler | vllm:num_requests_waiting | Gauge | Number of requests waiting to be processed |
| System Stats - Scheduler | vllm:lora_requests_info | Gauge | Running stats on LoRA requests |
| System Stats - Scheduler | vllm:num_requests_swapped | Gauge | Number of requests swapped to CPU (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:gpu_cache_usage_perc | Gauge | GPU KV-cache usage. 1 means 100 percent usage |
| System Stats - Cache | vllm:cpu_cache_usage_perc | Gauge | CPU KV-cache usage. 1 means 100 percent usage (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:cpu_prefix_cache_hit_rate | Gauge | CPU prefix cache block hit rate (DEPRECATED: KV cache offloading is not used in V1) |
| System Stats - Cache | vllm:gpu_prefix_cache_hit_rate | Gauge | GPU prefix cache block hit rate (DEPRECATED: use vllm:gpu_prefix_cache_queries and vllm:gpu_prefix_cache_hits in V1) |
| System Stats - Cache | vllm:gpu_prefix_cache_queries | Counter | GPU prefix cache queries, in terms of number of queried blocks (V1 only) |
| System Stats - Cache | vllm:gpu_prefix_cache_hits | Counter | GPU prefix cache hits, in terms of number of cached blocks (V1 only) |
| Iteration Stats | vllm:num_preemptions_total | Counter | Cumulative number of preemptions from the engine |
| Iteration Stats | vllm:prompt_tokens_total | Counter | Number of prefill tokens processed |
| Iteration Stats | vllm:generation_tokens_total | Counter | Number of generation tokens processed |
| Iteration Stats | vllm:iteration_tokens_total | Histogram | Number of tokens per engine_step |
| Iteration Stats | vllm:time_to_first_token_seconds | Histogram | Time to first token in seconds |
| Iteration Stats | vllm:time_per_output_token_seconds | Histogram | Time per output token in seconds |
| Request Stats | vllm:e2e_request_latency_seconds | Histogram | End-to-end request latency in seconds |
| Request Stats | vllm:request_queue_time_seconds | Histogram | Time spent in the WAITING phase per request |
| Request Stats | vllm:request_inference_time_seconds | Histogram | Time spent in the RUNNING phase per request |
| Request Stats | vllm:request_prefill_time_seconds | Histogram | Time spent in the PREFILL phase per request |
| Request Stats | vllm:request_decode_time_seconds | Histogram | Time spent in the DECODE phase per request |
| Request Stats | vllm:request_prompt_tokens | Histogram | Number of prefill tokens processed per request |
| Request Stats | vllm:request_generation_tokens | Histogram | Number of generation tokens processed per request |
| Request Stats | vllm:request_max_num_generation_tokens | Histogram | Maximum number of requested generation tokens |
| Request Stats | vllm:request_params_n | Histogram | The 'n' request parameter |
| Request Stats | vllm:request_params_max_tokens | Histogram | The 'max_tokens' request parameter |
| Request Stats | vllm:request_success_total | Counter | Count of successfully processed requests |
| Speculative Decoding | vllm:spec_decode_draft_acceptance_rate | Gauge | Speculative token acceptance rate (DEPRECATED: unused in V1) |
| Speculative Decoding | vllm:spec_decode_efficiency | Gauge | Speculative decoding system efficiency (DEPRECATED: unused in V1) |
| Speculative Decoding | vllm:spec_decode_num_accepted_tokens_total | Counter | Number of accepted tokens |
| Speculative Decoding | vllm:spec_decode_num_draft_tokens_total | Counter | Number of draft tokens |
| Speculative Decoding | vllm:spec_decode_num_emitted_tokens_total | Counter | Number of emitted tokens (DEPRECATED: unused in V1) |
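Some useful signals are derived rather than exposed directly. For example, in V1 the GPU prefix cache hit rate comes from the two counters above. The sketch below computes the lifetime ratio, reusing the scrape() helper from the earlier example; the metric names are from the table, while the "_total" suffix handling is an assumption about how counters appear in the exposition text.

```python
# Minimal sketch: derive the V1 GPU prefix cache hit rate from
# vllm:gpu_prefix_cache_hits / vllm:gpu_prefix_cache_queries.
# Assumes `samples` was produced by the scrape() helper shown earlier.

def counter_total(samples: dict, name: str) -> float:
    """Sum a counter across label sets, tolerating an optional _total suffix."""
    matched = samples.get(name) or samples.get(name + "_total") or []
    return sum(s.value for s in matched)

def gpu_prefix_cache_hit_rate(samples: dict) -> float | None:
    """Lifetime hit rate: hits / queries, or None before any query."""
    queries = counter_total(samples, "vllm:gpu_prefix_cache_queries")
    hits = counter_total(samples, "vllm:gpu_prefix_cache_hits")
    return hits / queries if queries else None
```

In Prometheus itself, the same ratio over a time window would typically be expressed with rate(), e.g. rate(vllm:gpu_prefix_cache_hits[5m]) / rate(vllm:gpu_prefix_cache_queries[5m]), and latency histograms such as vllm:time_to_first_token_seconds can be summarized with histogram_quantile() over their _bucket series.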