Install vLLM on Linux for Production LLM Serving

vLLM is the open-source inference engine that turned PagedAttention from a research paper into the default way to serve open-weight LLMs at production throughput. It speaks the OpenAI API natively, batches thousands of requests on a single GPU, supports tensor parallelism across multiple GPUs, ships first-class quantization (FP8, AWQ, GPTQ, INT4), and runs every serious open model the day it lands on Hugging Face. This guide walks the full path from a clean Linux box with an NVIDIA GPU to a hardened, monitored, TLS-fronted vLLM server you can actually point traffic at.

Original content from computingforgeeks.com - post 167404

Most install guides stop at pip install vllm and a single curl. That gets you a demo, not a service. The flow here adds the parts a sysadmin actually has to do: a hardened systemd unit, an nginx reverse proxy that handles SSE streaming and per-key rate limiting, Let’s Encrypt TLS with auto-renewal, multi-tenant API keys via a gateway, Prometheus + Grafana with three real alert rules, performance tuning that matters, and benchmarks captured live on real hardware so the numbers mean something. If you only need a single-user local LLM and don’t care about concurrent throughput, Ollama is a simpler path; vLLM is for the case where you actually need a server.

Tested May 2026 with vLLM 0.20.1 on Ubuntu 24.04 LTS. Hardware: NVIDIA RTX 4090 (24 GB VRAM) with NVIDIA driver 580.95.05 and CUDA 13.0. All terminal output and benchmark numbers in this article were captured live; nothing is fabricated.

What you’ll build

By the end of this guide the box answers OpenAI-compatible API calls at https://vllm.example.com/v1/chat/completions, runs as a systemd service that restarts on failure, sits behind nginx with TLS and per-key rate limiting, exposes Prometheus metrics on a private port, and feeds a Grafana dashboard with alerts on queue depth, KV cache pressure, and time-to-first-token. The architecture is intentionally boring:

client ──HTTPS──► nginx (TLS, rate limit, auth) ──HTTP──► vllm.service (systemd) ──CUDA──► GPU
                                                                  │
                                                                  └──/metrics──► prometheus ──► grafana

That single GPU box gives a small team or a single internal tool a credible LLM endpoint. The same flow scales horizontally by adding more replicas behind nginx, or vertically by adding more GPUs and tensor parallelism, both of which the article covers in their own sections.

Hardware and OS prerequisites

vLLM is GPU-first. CPU-only builds exist for development but are not what this article configures. The compute capability of the GPU has to be 7.0 or higher, which covers every NVIDIA card from Volta (V100) onward. The practical floor is consumer Ampere (RTX 30-series) and the practical ceiling is whatever Blackwell card you can buy.

GPU | VRAM | Compute | Comfortable model size (FP16)
NVIDIA T4 | 16 GB | 7.5 | up to 7B with quantization
RTX 3060 12GB | 12 GB | 8.6 | 7B Q4 / 8B AWQ
RTX 4090 | 24 GB | 8.9 | 8B FP16, 13B AWQ, 32B Q4
L4 / A10 | 24 GB | 8.9 / 8.6 | 8B FP16, 13B AWQ
L40S | 48 GB | 8.9 | 13B FP16, 32B AWQ, 70B INT4
A100 80GB | 80 GB | 8.0 | 32B FP16, 70B FP8, 70B AWQ
H100 80GB | 80 GB | 9.0 | 32B FP16, 70B FP8 native
H200 / B200 | 141 / 192 GB | 9.0 / 10.0 | 70B FP16, 405B INT4 single GPU

The rough sizing rule is two bytes per parameter for FP16, one byte for FP8 or INT8, and roughly half a byte for INT4. Add 10 to 30 percent on top for the KV cache, more if you serve long contexts. A 70B FP16 model needs around 140 GB of weights alone; that does not fit on any single consumer GPU, which is why production 70B serving usually means tensor parallelism across two H100s or quantization down to FP8 or INT4.
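The rule of thumb above is easy to turn into a quick calculator. The sketch below is back-of-the-envelope only, using the byte-per-parameter figures and the 10 to 30 percent KV cache overhead from the rule; measure on real hardware before committing to a GPU purchase.

```python
# Rough VRAM estimate from the sizing rule: bytes per parameter by dtype,
# plus a KV-cache overhead factor (0.1-0.3 depending on context length).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billions: float, dtype: str = "fp16",
                     kv_overhead: float = 0.2) -> float:
    """Rough VRAM need in GB: weights plus KV-cache headroom."""
    weights_gb = params_billions * BYTES_PER_PARAM[dtype]  # 1e9 params * N bytes ~ N GB
    return round(weights_gb * (1 + kv_overhead), 1)

print(estimate_vram_gb(7))            # ~16.8 GB: 7B FP16 fits a 24 GB card
print(estimate_vram_gb(70))           # ~168 GB: 70B FP16 needs multi-GPU
print(estimate_vram_gb(70, "fp8", 0.1))  # ~77 GB: tight fit on an 80 GB H100
```

The short-context call for the 70B FP8 case uses the low end of the overhead range, which matches why the table above lists 70B FP8 as an 80 GB-class workload.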

OS support is straightforward. Ubuntu 24.04 LTS is the primary tier and the example commands here use it. Ubuntu 22.04 LTS still works. RHEL 9, Rocky Linux 9, and AlmaLinux 9 all work; the only differences are package manager (dnf in place of apt) and the NVIDIA driver source, called out in the relevant sections. Disk-wise, plan for 200 GB minimum: the Hugging Face cache for one big model is easily 50 GB, and you’ll want room for two or three.

Install the NVIDIA driver and CUDA

vLLM ships its own CUDA runtime via the PyTorch wheel, so you do not need to install the full CUDA Toolkit on the host. You only need the NVIDIA kernel driver. The driver minimum for vLLM 0.20.x is 550 for CUDA 12.4 wheels and 570 for the 12.8 / 12.9 / 13.0 wheels that ship by default. Older drivers will load the engine but fail with cryptic CUBLAS_STATUS_NOT_INITIALIZED errors at the first request.

On Ubuntu, install the recommended server driver with the metapackage. The -server variant uses an older, more conservative branch than the gaming driver and is the right pick for a headless box.

sudo apt update
sudo ubuntu-drivers devices
sudo apt install -y nvidia-driver-570-server
sudo reboot

The ubuntu-drivers devices output names a recommended driver for the GPU it detected. If it suggests something newer than 570, install the suggested version instead. On a fresh Ubuntu 24.04 install, the 570-server branch is the safe stable pick. Newer branches (575, 580) are available in the ppa:graphics-drivers/ppa PPA when you need them for Blackwell.

On Rocky / AlmaLinux 9, enable the NVIDIA CUDA repository and install the open kernel module. The open module is required for Hopper and Blackwell GPUs and works fine on Ada (RTX 4090) and Ampere.

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf module install -y nvidia-driver:open-dkms
sudo reboot

After the reboot, confirm the driver loaded and sees the GPU. The output below was captured on the test box used throughout this article.

nvidia-smi

The driver and CUDA versions in the top banner matter; everything below is informational.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:C2:00.0 Off |                  Off |
| 60%   31C    P8             33W /  450W |       1MiB /  24564MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

Persistence mode On is worth keeping; it shaves a couple of seconds off CUDA context creation on each cold start. Enable it permanently with sudo systemctl enable --now nvidia-persistenced.

Install vLLM

The vLLM project recommends uv over plain pip as of 0.20.x. uv resolves the right PyTorch wheel for your CUDA version automatically with --torch-backend=auto and is significantly faster than pip on a fresh install. Two minutes versus seven, in the install captured here.

Create a dedicated system user, install uv, build the virtualenv, and install vLLM inside it. The whole stack lives under /opt/vllm so it never collides with the system Python.

sudo useradd -r -m -d /opt/vllm -s /bin/false vllm
sudo apt install -y python3-venv curl ca-certificates
curl -LsSf https://astral.sh/uv/install.sh | sudo -u vllm sh
sudo -u vllm /opt/vllm/.local/bin/uv venv /opt/vllm/.venv --python 3.12 --seed
sudo -u vllm /opt/vllm/.local/bin/uv pip install --python /opt/vllm/.venv/bin/python vllm --torch-backend=auto

The install pulls PyTorch with the matching CUDA build, FlashAttention, xFormers, the vLLM kernels, and the OpenAI-compatible server. The tail of the output looks like this:

 + torch==2.11.0+cu130
 + torchaudio==2.11.0+cu130
 + torchvision==0.26.0+cu130
 + tokenizers==0.22.2
 + transformers==5.8.0
 + triton==3.6.0
 + vllm==0.20.1
 + xgrammar==0.2.0
real    1m37.994s

Confirm the install with vllm --version. The path to the binary is /opt/vllm/.venv/bin/vllm; the rest of the article uses that absolute path so nothing depends on shell activation.

/opt/vllm/.venv/bin/vllm --version

The version string should match the installed wheel:

0.20.1

If you prefer Docker, the project publishes vllm/vllm-openai on Docker Hub with versioned tags like v0.20.1-cu130-ubuntu2404. Pin a versioned tag, never latest: the floating tag lags releases by a week or two and silently breaks reproducibility. Docker is the right pick if you want kernel isolation or run multiple LLM stacks on one box; the bare-metal install in this article is the right pick when you want a single dedicated server with the smallest possible attack surface.

One pitfall the install can hit on a minimal Ubuntu base image: Triton, vLLM’s kernel JIT compiler, needs the Python development headers at runtime. The error looks like fatal error: Python.h: No such file or directory inside a temp directory under /tmp/tmpXXXX/cuda_utils.c. The fix is one package:

sudo apt install -y python3-dev build-essential

Standard Ubuntu Server installs already have these packages, but slim base images (Docker images, vast.ai / RunPod containers, cloud-init minimal images) often do not. Install them once at the same time as python3-venv and the issue never appears.

First boot with vllm serve

Before turning the server into a systemd service, run it manually once. This is the fastest way to confirm the model loads, the API binds, and the GPU is being used. Pick a model, set Hugging Face credentials if it’s gated, and launch the OpenAI-compatible server.

This guide uses Qwen/Qwen2.5-7B-Instruct. It’s open-weight (no Hugging Face token needed), well-behaved on a 24 GB card at FP16, and recognized by every vLLM tool-calling parser. Substitute meta-llama/Llama-3.1-8B-Instruct if you have a Hugging Face token and have accepted the Meta license; the rest of the commands work identically. The same model selection logic from our Ollama models cheat sheet applies here, just with vLLM-flavored quantization names instead of Ollama tag names.

Pre-download the weights with the Hugging Face CLI. Doing this separately from the first vllm serve run keeps the boot logs readable and surfaces network errors before they get tangled with CUDA initialization. Anyone coming from the Ollama CLI will recognize the pattern: pull the model, then run.

sudo -u vllm /opt/vllm/.venv/bin/pip install --quiet huggingface_hub
sudo -u vllm /opt/vllm/.venv/bin/huggingface-cli download Qwen/Qwen2.5-7B-Instruct

For gated models, add HF_TOKEN=hf_xxx in front of the huggingface-cli call or run huggingface-cli login as the vllm user once. The token only needs to be present at download time; vLLM does not need it again after the model is on disk.

Now boot the server. Bind to 127.0.0.1 for the manual test so nothing is exposed to the network yet; the systemd unit in the next section keeps it on localhost permanently and lets nginx do the public-facing TLS termination.

sudo -u vllm /opt/vllm/.venv/bin/vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --api-key sk-cfg-demo-key \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --served-model-name qwen2.5-7b

The first boot takes 30 to 90 seconds on a fast disk. vLLM loads the weights, allocates KV cache blocks, captures CUDA graphs for a ladder of batch sizes up to --max-num-seqs, and finally starts the API server. The line you’re looking for is the last one:

INFO [core.py:109] Initializing a V1 LLM engine (v0.20.1) with config: model='Qwen/Qwen2.5-7B-Instruct'
INFO [cuda.py:368] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
INFO [flash_attn.py:646] Using FlashAttention version 2
INFO [gpu_model_runner.py:4879] Model loading took 14.25 GiB memory and 3.654477 seconds
INFO [gpu_worker.py:440] Available KV cache memory: 5.41 GiB
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000

Two numbers in that output are worth remembering. The 14.25 GiB is how much VRAM the weights occupy at bfloat16 (the default for Qwen 2.5). The 5.41 GiB is what’s left for the KV cache after vLLM reserves space for activations and overhead, governed by --gpu-memory-utilization 0.90. That KV cache budget directly determines how many concurrent sequences and how much context the server can handle. Bumping --gpu-memory-utilization up to 0.95 reclaims another GiB or two; pushing past 0.95 risks OOM on long-running services as memory fragments.

vllm serve startup logs showing FlashAttention 2 backend, model loading 14.25 GiB in 3.65s, KV cache 5.41 GiB available

The boot completes in well under a minute on a fast NVMe-backed cache. First-ever boots are slower because the model has to download from Hugging Face and CUDA graphs are captured from scratch; later restarts skip both steps and finish in twenty to thirty seconds.
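The 5.41 GiB KV cache budget translates directly into a token budget. The arithmetic below is a sketch using the Qwen2.5-7B architecture values (28 layers, 4 KV heads via grouped-query attention, head dim 128, taken from the model's config.json; treat them as assumptions and substitute your own model's values):

```python
GIB = 1024 ** 3

def kv_tokens(kv_cache_gib: float, layers: int, kv_heads: int,
              head_dim: int, dtype_bytes: int = 2) -> int:
    """Tokens of KV cache that fit in the budget vLLM reported at boot.
    Per token: 2 tensors (K and V) * layers * kv_heads * head_dim * dtype bytes."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return int(kv_cache_gib * GIB / bytes_per_token)

# Qwen2.5-7B: 28 layers, 4 KV heads (GQA), head_dim 128, bf16 KV cache
budget = kv_tokens(5.41, layers=28, kv_heads=4, head_dim=128)
print(budget)             # ~101k tokens shared across all running sequences
print(budget // 16384)    # 6 sequences at the full 16K context
```

Six full-context sequences sounds low, but PagedAttention allocates KV blocks on demand, so typical chat traffic with much shorter effective contexts packs far more concurrent requests into the same budget.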

From a second shell on the same box, list the served models. The API key was set with --api-key and travels in the standard OpenAI Authorization header.

curl -s http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer sk-cfg-demo-key" | python3 -m json.tool

The response confirms the alias set with --served-model-name:

{
    "object": "list",
    "data": [
        {
            "id": "qwen2.5-7b",
            "object": "model",
            "created": 1778106064,
            "owned_by": "vllm",
            "root": "Qwen/Qwen2.5-7B-Instruct",
            "max_model_len": 16384,
            "permission": [
                {
                    "allow_sampling": true,
                    "allow_logprobs": true,
                    "allow_view": true,
                    "allow_fine_tuning": false
                }
            ]
        }
    ]
}

Send a real chat completion. Streaming is the more interesting test because Server-Sent Events behave differently through proxies; getting it right at this step saves debugging later.

curl -s -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Authorization: Bearer sk-cfg-demo-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "stream": true,
    "messages": [{"role": "user", "content": "In one sentence, what is PagedAttention?"}]
  }' | head -20

The response arrives as data: chunks, each carrying a token or two:

data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":"P"},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":"aged"},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":"Attention"},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":" is"},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":" an"},"finish_reason":null}]}
data: {"id":"chatcmpl-84311d23","object":"chat.completion.chunk","model":"qwen2.5-7b","choices":[{"index":0,"delta":{"content":" efficient"},"finish_reason":null}]}
...
data: [DONE]

If that worked, vLLM is functional. Stop the server with Ctrl+C; the next section makes it persistent.

curl streaming /v1/chat/completions request to vLLM showing data: chunks arriving token by token for PagedAttention answer

While that single request runs, a second terminal can show what the GPU is actually doing. The nvidia-smi output below was captured during the streaming test. The 78W power draw and 21.4 GiB of VRAM in use confirm vLLM has loaded the weights and KV cache; the VLLM::EngineCore process line is the worker that handles all forward passes.

nvidia-smi output during vLLM serve on RTX 4090, driver 580.95.05, CUDA 13.0, VLLM EngineCore process using 21448 MiB

If the Memory-Usage column reads close to 24564MiB / 24564MiB, lower --gpu-memory-utilization to 0.85 to keep a safety margin. Hitting the GPU memory ceiling under traffic causes vLLM to preempt requests, which shows up as latency spikes in the metrics later.
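If you consume the stream from code instead of curl, the only parsing needed is splitting the data: lines and reading each delta. A minimal stdlib-only sketch; the sample lines are abbreviated versions of the chunks shown above, and any OpenAI SDK does this for you in practice:

```python
import json

def sse_deltas(lines):
    """Yield content fragments from OpenAI-style SSE 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel, not JSON
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Paged"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Attention"}}]}',
    'data: [DONE]',
]
print("".join(sse_deltas(sample)))  # PagedAttention
```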

Run vLLM as a hardened systemd service

Running vllm serve in a tmux session is fine for testing and useless for anything else. A real deployment needs a unit file that restarts on failure, drops privileges, isolates the filesystem, and reads secrets from a file with restrictive permissions instead of from the command line where they leak to ps.

Put environment variables and the API key in /etc/vllm/env. Owner is the vllm user, mode is 0600, so even other unprivileged users on the box cannot read it.

sudo install -d -m 0755 -o root -g root /etc/vllm
sudo install -d -m 0755 -o vllm -g vllm /var/log/vllm
sudo tee /etc/vllm/env >/dev/null <<'EOF'
HF_HOME=/opt/vllm/hf-cache
HF_HUB_OFFLINE=1
VLLM_API_KEY=sk-cfg-please-rotate-me
EOF
sudo chown vllm:vllm /etc/vllm/env
sudo chmod 600 /etc/vllm/env

Setting HF_HUB_OFFLINE=1 after the model has been pre-downloaded prevents vLLM from contacting Hugging Face on every startup. That makes boots faster and survives upstream HF outages, which happen more often than you’d expect.

Write the unit file. The ProtectSystem=strict and friends are not paranoia, they’re how you stop a compromised model loader from rewriting /etc/passwd or torching log rotation.

sudo vi /etc/systemd/system/vllm.service

Paste the following unit:

[Unit]
Description=vLLM OpenAI-compatible inference server
After=network-online.target nvidia-persistenced.service
Wants=network-online.target

[Service]
Type=simple
User=vllm
Group=vllm
EnvironmentFile=/etc/vllm/env
ExecStart=/opt/vllm/.venv/bin/vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 127.0.0.1 --port 8000 \
  --served-model-name qwen2.5-7b \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --enable-prefix-caching

Restart=on-failure
RestartSec=10
TimeoutStartSec=600
TimeoutStopSec=60
LimitNOFILE=1048576

NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/opt/vllm /var/log/vllm
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true

StandardOutput=journal
StandardError=journal
SyslogIdentifier=vllm

[Install]
WantedBy=multi-user.target

A few of those lines deserve a comment. TimeoutStartSec=600 gives vLLM ten minutes to load weights and capture CUDA graphs before systemd considers the start a failure; the default 90 seconds will trip on any model larger than 13B. LimitNOFILE=1048576 raises the per-process file descriptor cap, which matters when vLLM serves a few hundred concurrent streams. ReadWritePaths is the only writable area under the strict filesystem lockdown.

Reload systemd, enable the unit, and start it.

sudo systemctl daemon-reload
sudo systemctl enable --now vllm
sudo systemctl status vllm --no-pager

The service should show active (running) after the load completes:

● vllm.service - vLLM OpenAI-compatible inference server
     Loaded: loaded (/etc/systemd/system/vllm.service; enabled; preset: enabled)
     Active: active (running) since Wed 2026-05-06 22:24:11 UTC; 1min 12s ago
   Main PID: 4471 (vllm)
      Tasks: 28 (limit: 309193)
     Memory: 16.4G
        CPU: 1min 8.124s
     CGroup: /system.slice/vllm.service
             └─4471 /opt/vllm/.venv/bin/python /opt/vllm/.venv/bin/vllm serve...

Live logs are in the journal under the vllm identifier set in the unit file:

sudo journalctl -u vllm -f

Journald handles rotation; no separate logrotate rule is needed. To keep more history, set SystemMaxUse=2G in /etc/systemd/journald.conf and reload journald.

nginx reverse proxy with TLS and SSE streaming

vLLM listens on plain HTTP on the loopback. Public access goes through nginx, which terminates TLS, applies per-key rate limiting, sets sane timeouts for long generations, and disables proxy buffering so streaming responses reach the client token by token.

Install nginx and certbot:

sudo apt install -y nginx certbot python3-certbot-nginx
sudo systemctl enable --now nginx

On Rocky Linux 9 the equivalents are dnf install nginx certbot python3-certbot-nginx from the EPEL repository.

Pull a few values into shell variables so the rest of the section is clean:

DOMAIN=vllm.example.com
EMAIL=admin@example.com

Point an A record for that hostname at the public IP of the box. Any DNS provider works; certbot only needs port 80 reachable from the public internet to complete the HTTP-01 challenge. If you’re behind NAT or want a wildcard cert, switch to a DNS-01 challenge with the appropriate certbot DNS plugin (python3-certbot-dns-cloudflare, -dns-route53, -dns-digitalocean, etc.).

Open the firewall:

sudo ufw allow 80,443/tcp
sudo ufw status

Write the nginx vhost. The bits that matter for vLLM specifically are the streaming-related directives near the bottom of the location block.

sudo vi /etc/nginx/sites-available/vllm

Paste the following:

limit_req_zone $http_authorization zone=per_key:10m rate=120r/m;

upstream vllm_backend {
    server 127.0.0.1:8000;
    keepalive 64;
}

server {
    listen 80;
    listen [::]:80;
    server_name vllm.example.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name vllm.example.com;

    # certbot fills these in:
    # ssl_certificate /etc/letsencrypt/live/vllm.example.com/fullchain.pem;
    # ssl_certificate_key /etc/letsencrypt/live/vllm.example.com/privkey.pem;

    client_max_body_size 16M;
    keepalive_timeout 75s;

    location / {
        limit_req zone=per_key burst=40 nodelay;
        limit_req_status 429;

        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Connection "";

        # Critical for SSE streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 600s;
        proxy_send_timeout 600s;
        chunked_transfer_encoding on;
    }
}

The limit_req_zone at the top keys on the Authorization header, so each API key gets its own 120 requests/minute bucket with a 40-request burst. Anonymous traffic shares the empty-key bucket and gets rate-limited too. proxy_buffering off is the line every guide forgets; without it, nginx buffers the entire SSE response and your “streaming” client receives the whole answer in one chunk after the model finishes generating.

Enable the site, test the syntax, and reload nginx:

sudo ln -s /etc/nginx/sites-available/vllm /etc/nginx/sites-enabled/vllm
sudo rm -f /etc/nginx/sites-enabled/default
sudo nginx -t
sudo systemctl reload nginx

Issue the certificate. certbot --nginx reads the existing vhost, validates over HTTP-01, drops the issued cert into /etc/letsencrypt/live/$DOMAIN/, rewrites the vhost to add the SSL block (the two commented lines in the config above become real), and reloads nginx.

sudo certbot --nginx -d "$DOMAIN" --non-interactive --agree-tos --redirect -m "$EMAIL"

Test the auto-renewal flow without actually renewing. The systemd timer that ships with the certbot package runs this every twelve hours.

sudo certbot renew --dry-run

Hit the public endpoint with a streaming request and confirm chunks arrive in real time:

curl -s -N "https://$DOMAIN/v1/chat/completions" \
  -H "Authorization: Bearer sk-cfg-please-rotate-me" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "stream": true,
    "messages": [{"role": "user", "content": "List three production benefits of PagedAttention."}]
  }'

You should see data: lines printing one at a time as the model generates. If they all appear at the end in one burst, proxy_buffering off isn’t taking effect; check that the site you edited is the one actually enabled and that no proxy_buffering override sits in /etc/nginx/conf.d/.

Multi-tenant API keys with LiteLLM

vLLM’s --api-key flag accepts a single shared secret. That’s fine for a homelab, but the moment a second team or a second app needs access you want per-key visibility, per-key budgets, and key rotation without restarting vLLM. The simplest production answer is to run LiteLLM as a thin proxy in front of vLLM. LiteLLM speaks the OpenAI API on the front, calls vLLM on the back, and adds the multi-tenant features vLLM intentionally leaves out.

Install LiteLLM into its own venv. Keep it separate from vLLM so the two upgrade independently.

sudo useradd -r -m -d /opt/litellm -s /bin/false litellm
curl -LsSf https://astral.sh/uv/install.sh | sudo -u litellm sh
sudo -u litellm /opt/litellm/.local/bin/uv venv /opt/litellm/.venv --python 3.12 --seed
sudo -u litellm /opt/litellm/.local/bin/uv pip install --python /opt/litellm/.venv/bin/python "litellm[proxy]"

Write the LiteLLM config. The single model_list entry maps a virtual model name to the upstream vLLM endpoint, and master_key protects the admin API.

sudo install -d -o litellm -g litellm -m 0750 /etc/litellm
sudo vi /etc/litellm/config.yaml

Drop in a minimal config:

model_list:
  - model_name: qwen2.5-7b
    litellm_params:
      model: openai/qwen2.5-7b
      api_base: http://127.0.0.1:8000/v1
      api_key: sk-cfg-please-rotate-me

general_settings:
  master_key: sk-litellm-master-rotate-me
  database_url: sqlite:////opt/litellm/litellm.db
  alerting: ["slack"]

litellm_settings:
  drop_params: true
  set_verbose: false
  json_logs: true
  request_timeout: 600

Run LiteLLM as another systemd unit listening on port 4000. The full unit follows the same hardened pattern as vllm.service; only the ExecStart changes:

ExecStart=/opt/litellm/.venv/bin/litellm --config /etc/litellm/config.yaml --port 4000 --host 127.0.0.1

Update nginx to point upstream vllm_backend at 127.0.0.1:4000 instead of :8000, reload nginx, then create per-team virtual keys via the LiteLLM admin API:

curl -s http://127.0.0.1:4000/key/generate \
  -H "Authorization: Bearer sk-litellm-master-rotate-me" \
  -H "Content-Type: application/json" \
  -d '{
    "models": ["qwen2.5-7b"],
    "max_budget": 10,
    "budget_duration": "30d",
    "metadata": {"team": "platform"}
  }' | python3 -m json.tool

The response includes a sk-... key your application uses; LiteLLM tracks request count, token count, and dollar-equivalent cost per key. The Slack alerting hook fires when a key crosses 80 percent of its budget, which is a much better failure mode than discovering a runaway loop after a $5,000 surprise.
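Applications use the virtual key exactly like a direct vLLM key. A stdlib-only sketch of the client side follows; the hostname and key are placeholders for your own, and any OpenAI-compatible SDK works the same way by setting base_url and api_key:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Standard OpenAI chat-completions body; LiteLLM forwards it to vLLM."""
    return {
        "model": model,
        "stream": stream,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(base_url: str, api_key: str, model: str, prompt: str) -> str:
    """Send one non-streaming completion through the LiteLLM proxy."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",  # the sk-... virtual key
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("https://vllm.example.com", "sk-<virtual-key>", "qwen2.5-7b", "ping")
```

Because LiteLLM meters per key, every call made this way lands in the per-team usage and budget accounting automatically.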

Prometheus and Grafana with real alerts

vLLM exposes Prometheus metrics on /metrics. The endpoint is rich, with everything from request queue depth to per-iteration token throughput to KV cache utilization. The names you’ll alert on most often are these:

Metric | What it tells you
vllm:num_requests_running | Active concurrent sequences. Saturates at --max-num-seqs.
vllm:num_requests_waiting | Queue depth. Sustained >0 means you’re CPU- or GPU-bound.
vllm:kv_cache_usage_perc | KV cache fill ratio. Above 95 percent and concurrent requests start getting preempted.
vllm:time_to_first_token_seconds | Histogram. p95 is the user-visible “feels slow” number.
vllm:time_per_output_token_seconds | Inter-token latency. Multiply by tokens to estimate generation time.
vllm:prompt_tokens_total | Counter. Useful for cost accounting.
vllm:generation_tokens_total | Counter. Output tokens billed to clients.
vllm:prefix_cache_hits_total | Counter. Higher means more KV cache reuse, lower TTFT.

The metrics endpoint sits on the same port as the API, so don’t expose it publicly. Either bind Prometheus to localhost on the same box or scrape over a private network. Sample raw output looks like this:

curl -s http://127.0.0.1:8000/metrics -H "Authorization: Bearer $VLLM_API_KEY" | grep '^vllm:' | head

The first counters and gauges:

vllm:num_requests_running{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:num_requests_waiting{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:prefix_cache_queries_total{engine="0",model_name="qwen2.5-7b"} 39.0
vllm:prefix_cache_hits_total{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:engine_sleep_state{engine="0",model_name="qwen2.5-7b",sleep_state="awake"} 1.0

The full /metrics page prints around two hundred lines including histogram buckets for TTFT, ITL, and end-to-end latency. The screenshot below shows what Prometheus actually scrapes after a small handful of test requests.

curl /metrics output from vLLM showing num_requests_running, kv_cache_usage_perc, prefix_cache_hits_total, time_to_first_token_seconds_bucket
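For quick checks without a full Prometheus stack, the exposition format is simple enough to parse by hand. A stdlib-only sketch that pulls gauge and counter values out of the raw text (the sample is adapted from the output above, with an invented kv_cache value for illustration):

```python
import re

# name{label="..."} value  -- one sample per line in the Prometheus text format
METRIC_RE = re.compile(r'^(?P<name>[a-zA-Z_:][\w:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

def parse_gauges(text: str) -> dict:
    """Map metric name -> float value for simple labelled gauges/counters."""
    out = {}
    for line in text.splitlines():
        m = METRIC_RE.match(line.strip())
        if m:
            out[m.group("name")] = float(m.group("value"))
    return out

sample = """\
vllm:num_requests_running{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:num_requests_waiting{engine="0",model_name="qwen2.5-7b"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="qwen2.5-7b"} 0.42
"""
g = parse_gauges(sample)
print(g["vllm:kv_cache_usage_perc"])  # 0.42
```

This ignores histogram buckets and collapses duplicate label sets, which is fine for a shell-side health check; anything more belongs in Prometheus.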

Run Prometheus and Grafana side by side via Docker Compose. Both stay on the loopback; Grafana goes on the public internet through the same nginx if you want it reachable from outside.

sudo install -d -m 0755 /opt/observability
sudo vi /opt/observability/docker-compose.yml

Compose stack:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    network_mode: host
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerts.yml:/etc/prometheus/alerts.yml:ro
      - prom-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
      - --web.listen-address=127.0.0.1:9090
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    network_mode: host
    environment:
      - GF_SERVER_HTTP_ADDR=127.0.0.1
      - GF_SERVER_HTTP_PORT=3000
      - GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana_admin:/run/secrets/grafana_admin:ro
    restart: unless-stopped

volumes:
  prom-data:
  grafana-data:

The Prometheus scrape config goes alongside it:

sudo vi /opt/observability/prometheus.yml

The vLLM scrape job needs the API key in an Authorization header because the metrics endpoint is gated by the same key as the rest of the API:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - alerts.yml

scrape_configs:
  - job_name: vllm
    metrics_path: /metrics
    authorization:
      type: Bearer
      credentials: sk-cfg-please-rotate-me
    static_configs:
      - targets: ['127.0.0.1:8000']
        labels:
          service: vllm
          gpu: rtx-4090

  - job_name: node
    static_configs:
      - targets: ['127.0.0.1:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: []

Three alert rules cover most production failure modes. Queue backlog points at insufficient capacity, KV cache pressure means you’re about to drop requests, and TTFT degradation is the user-visible canary.

sudo vi /opt/observability/alerts.yml

Three rules, each with a sensible window so transient spikes don’t page anyone:

groups:
  - name: vllm
    interval: 30s
    rules:
      - alert: VllmQueueBacklog
        expr: vllm:num_requests_waiting > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM queue backlog over 50 for 5 minutes on {{ $labels.gpu }}"
          description: "Capacity is the bottleneck. Add a replica or raise --max-num-seqs."

      - alert: VllmKvCachePressure
        expr: vllm:kv_cache_usage_perc > 0.95
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "KV cache > 95% on {{ $labels.gpu }}"
          description: "Concurrent sequences are about to be preempted. Reduce --max-model-len or scale out."

      - alert: VllmTtftHigh
        expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, gpu)) > 2
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p95 TTFT > 2s on {{ $labels.gpu }}"
          description: "User-visible latency degraded. Check GPU saturation, queue depth, and recent traffic spike."
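The for: windows are what keep these rules from paging on blips. The semantics can be illustrated with a toy evaluator in Python; this is an illustration of the behavior only, not Prometheus code. With a 30-second evaluation interval, for: 5m means roughly ten consecutive true evaluations:

```python
# Toy illustration of the Prometheus "for:" clause: an alert fires only when
# its expression has been continuously true for the whole window.
def fires(samples, threshold, for_steps):
    """samples: metric values at successive evaluation steps."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= for_steps:
            return True
    return False

# A two-step spike in queue depth does not page; a sustained backlog does.
spike     = [10, 80, 90, 10, 10, 10, 10, 10, 10, 10]
sustained = [60] * 10
assert fires(spike, threshold=50, for_steps=10) is False
assert fires(sustained, threshold=50, for_steps=10) is True
```

Any dip below the threshold resets the streak, which is exactly why the queue-backlog rule needs five solid minutes above 50 before it fires.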

Bring up the stack:

cd /opt/observability
echo "$(openssl rand -hex 24)" | sudo tee grafana_admin >/dev/null
sudo chmod 600 grafana_admin
sudo docker compose up -d
sudo docker compose ps

In Grafana, add Prometheus as a data source pointing at http://127.0.0.1:9090, then import the official vLLM dashboard from the project repository at examples/observability/prometheus_grafana. Reach the Grafana UI at http://127.0.0.1:3000 through SSH port forwarding, or front it with the same nginx pattern from earlier on a separate hostname like grafana.example.com. The dashboard panels line up with the metrics in the table above and need no customization on day one.

Performance tuning that matters

vLLM exposes dozens of CLI flags; you only need a handful in production. Here are the ones that move the throughput-versus-latency dial.

| Flag | Default | What it controls |
|------|---------|------------------|
| --gpu-memory-utilization | 0.90 | Fraction of VRAM vLLM grabs at boot. Lower if OOM at startup, raise to 0.95 for more KV cache headroom. |
| --max-model-len | model max | Cap context window. Lower = more KV blocks free for concurrency. 16K is plenty for chat; 128K is wasteful unless RAG demands it. |
| --max-num-seqs | 256 | Max concurrent sequences. Raise for throughput, lower if p99 TTFT is suffering. |
| --max-num-batched-tokens | auto | Token budget per scheduler step. Raise (8192+) for throughput, lower (2048) for snappier TTFT. |
| --enable-prefix-caching | on | Reuse KV cache across requests sharing a prefix. Default-on in V1; disable only for benchmarking baselines. |
| --enable-chunked-prefill | on | Interleave prefill and decode for steadier TTFT under load. |
| --kv-cache-dtype | auto | Set to fp8 to halve KV cache memory, doubling concurrency at a tiny accuracy hit. |
| --quantization | none | Use AWQ or FP8 for 70B-class models on a single GPU. fp8 needs Hopper or newer. |
| --dtype | auto | Force bfloat16 over float16 on Ampere+ for numerically stabler long generations. |
| --tensor-parallel-size | 1 | Shard a model across N GPUs on the same host. Use for 70B FP16 across 2 H100s. |
| --enforce-eager | off | Disable CUDA graph capture. Debug only; costs 10-30 percent throughput. |

The mental model: throughput and latency trade against each other through the batch scheduler. --max-num-batched-tokens is the throughput knob; raise it and the GPU spends more of each step doing useful work, but the first new request in the queue waits longer. --max-num-seqs is the concurrency knob; raise it and more clients can be served simultaneously, but each gets a slimmer slice of the GPU. KV cache size, controlled by --gpu-memory-utilization and --max-model-len, is the ceiling on both. Tune in that order: KV cache first (so concurrency has room), then concurrency, then batched-tokens for TTFT.
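To make the KV cache ceiling concrete, here is a back-of-envelope sketch for the Qwen 2.5 7B setup used in this guide. The architecture numbers (28 layers, 4 KV heads, head dimension 128) come from the model’s config.json; the weight size and overhead figures are rough assumptions, so swap in your own model’s values:

```python
# Back-of-envelope KV cache budget for Qwen2.5-7B-Instruct on a 24 GB card.
# Architecture numbers are assumptions taken from the model's config.json.

def kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    # Both K and V are cached, hence the factor of 2. bf16 = 2 bytes.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_gb=24, gpu_mem_util=0.90, weights_gb=15.2,
                      overhead_gb=1.0, **kw):
    # What --gpu-memory-utilization allocates, minus weights and a rough
    # activation/overhead reserve, is what remains for KV blocks.
    budget_gb = vram_gb * gpu_mem_util - weights_gb - overhead_gb
    return int(budget_gb * 1e9 / kv_bytes_per_token(**kw))

per_tok = kv_bytes_per_token()   # 57,344 bytes, about 56 KiB per token
tokens = max_cached_tokens()     # roughly 94k tokens of KV on this config
print(per_tok, tokens, tokens // 16384)  # ~5 concurrent 16K-context sequences
```

That last number is why --max-model-len matters so much: the same budget holds about five full 16K contexts but close to a hundred 1K chat turns, and --kv-cache-dtype fp8 doubles either figure.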

Benchmarks on a single RTX 4090

The numbers below were captured live on the same RTX 4090 used throughout the article, running Qwen 2.5 7B Instruct at bfloat16 with the default V1 settings. The benchmark tool is vllm bench serve, which ships with the package and replays a ShareGPT-style prompt distribution.

/opt/vllm/.venv/bin/vllm bench serve \
  --backend openai-chat \
  --base-url http://127.0.0.1:8000 \
  --endpoint /v1/chat/completions \
  --model qwen2.5-7b \
  --tokenizer Qwen/Qwen2.5-7B-Instruct \
  --dataset-name sharegpt \
  --dataset-path /tmp/sharegpt.json \
  --num-prompts 128 \
  --max-concurrency 32 \
  --request-rate 20 \
  --api-key sk-cfg-please-rotate-me
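Sweeping several concurrency levels is just a loop over that command. A hypothetical wrapper script, with paths, model name, and key mirroring the command above:

```python
# Hypothetical sweep driver: reruns vllm bench serve at several
# --max-concurrency values. Paths and the API key mirror the command above;
# adjust for your box before running.
import subprocess

VLLM = "/opt/vllm/.venv/bin/vllm"

def bench_cmd(concurrency, num_prompts=128):
    return [
        VLLM, "bench", "serve",
        "--backend", "openai-chat",
        "--base-url", "http://127.0.0.1:8000",
        "--endpoint", "/v1/chat/completions",
        "--model", "qwen2.5-7b",
        "--tokenizer", "Qwen/Qwen2.5-7B-Instruct",
        "--dataset-name", "sharegpt",
        "--dataset-path", "/tmp/sharegpt.json",
        "--num-prompts", str(num_prompts),
        "--max-concurrency", str(concurrency),
        "--api-key", "sk-cfg-please-rotate-me",
    ]

if __name__ == "__main__":
    for c in (1, 8, 32, 128):
        subprocess.run(bench_cmd(c), check=True)
```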

Raw output from one of those runs (the 128-concurrency point of the sweep) looks like the screenshot below. The “Serving Benchmark Result” block is the summary; everything above it is the per-request telemetry the tool also prints.

[Screenshot: vllm bench serve output showing 5087 tok/s total throughput, 138 ms median TTFT, 22 ms median ITL on RTX 4090]

Sweeping concurrency from 1 to 128 on the same hardware shows where vLLM earns its reputation. A single client gets near-instant time-to-first-token; pushing the batch up to 128 keeps p99 TTFT under half a second while throughput climbs roughly linearly until the GPU saturates.

| Concurrency | Total tok/s | Per-request tok/s | p50 TTFT | p99 TTFT | p50 ITL | p99 ITL |
|-------------|-------------|-------------------|----------|----------|---------|---------|
| 1 | 178 | 61 | 30 ms | 147 ms | 16 ms | 18 ms |
| 8 | 529 | 26 | 52 ms | 5,320 ms* | 16 ms | 27 ms |
| 32 | 2,638 | 34 | 64 ms | 166 ms | 19 ms | 80 ms |
| 128 | 5,088 | 17 | 138 ms | 463 ms | 22 ms | 128 ms |

The shape of those numbers, more than the absolute values, is what you take away. Total throughput climbs from 178 tok/s with one user to over 5,000 tok/s under heavy concurrency, almost a 30-fold improvement on the same GPU. Per-request throughput drops the other way (61 down to 17 tok/s) because every active sequence shares the same KV cache and attention compute. The 4090 saturates somewhere between 32 and 128 concurrent sequences for this model and context length; pushing past 128 starts dropping into the request queue rather than gaining throughput.

The asterisk on the c8 p99 TTFT is honest: that 5.3-second spike is one outlier request that arrived right as a long generation was finishing. The median TTFT (52 ms) and p99 ITL (27 ms) tell the steadier story. Reproduce it yourself and the spike moves around but the medians stay close. Median latency is the metric that matters for streaming UX; tail latency matters more in batch and async pipelines.

If your workload looks like one user at a time and you can’t batch, you’re paying for a GPU that vLLM cannot fully use; consider llama.cpp on a smaller card or run multiple model replicas to absorb concurrency.

Multi-GPU with tensor parallelism

A 70B model at FP16 needs roughly 140 GB of weights, more than any single consumer or even most datacenter GPUs hold. vLLM splits the model across GPUs on the same host with tensor parallelism. The flag is --tensor-parallel-size N and the only other hard requirement is that the model’s attention head count divides evenly by N, which common architectures do for N of 2, 4, and usually 8.

/opt/vllm/.venv/bin/vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --quantization fp8

Two H100 80 GB cards run a 70B model at FP8 comfortably. Two L40S 48 GB cards need INT4 or AWQ to fit; vLLM accepts --quantization awq alongside an AWQ-quantized model from Hugging Face. nvidia-smi confirms both GPUs are loaded and active during inference, with traffic between them on the NVLink bridge or PCIe fabric.
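Those fit-or-not claims reduce to simple weight-memory arithmetic. A sketch, noting that the 70e9 parameter count is nominal and that the KV cache and activations still need headroom on top of the weights:

```python
# Weight-memory arithmetic behind the GPU-count claims. Real checkpoints
# vary slightly from the nominal 70e9 parameter count.
GB = 1e9

def weights_gb(params=70e9, bytes_per_param=2):  # 2 bytes = fp16/bf16
    return params * bytes_per_param / GB

def per_gpu_gb(tp=2, **kw):
    # Tensor parallelism splits the weights roughly evenly across GPUs.
    return weights_gb(**kw) / tp

fp16 = per_gpu_gb(tp=2)                       # 70 GB/GPU: no single card fits
fp8  = per_gpu_gb(tp=2, bytes_per_param=1)    # 35 GB/GPU: fine on H100 80 GB
int4 = per_gpu_gb(tp=2, bytes_per_param=0.5)  # 17.5 GB/GPU: fits an L40S
print(fp16, fp8, int4)                        # 48 GB card with KV headroom
```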

For multi-host deployments (more than one machine), pipeline parallelism (--pipeline-parallel-size) shards across nodes and tensor parallelism shards within each node. The convention is tensor-parallel = GPUs per node and pipeline-parallel = number of nodes. Multi-host needs Ray, fast networking (InfiniBand or 100GbE), and is well past the scope of a single guide; the vLLM project’s parallelism guide is the canonical reference. For a different angle on building RAG on top of an LLM endpoint, our self-hosted RAG with pgvector walkthrough swaps Ollama for vLLM with no application-level changes thanks to the OpenAI-compatible API.

Troubleshooting

The errors that come up most often, what causes them, and what fixes them.

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| fatal error: Python.h: No such file or directory at first request | Triton JIT cannot find Python headers | apt install python3-dev build-essential |
| CUDA out of memory at startup | KV cache budget exceeds free VRAM | Lower --gpu-memory-utilization to 0.85, drop --max-model-len, or set --kv-cache-dtype fp8 |
| CUDA error: no kernel image is available | Wheel doesn’t include kernels for your GPU compute capability | Build from source with TORCH_CUDA_ARCH_LIST set, or pull the matching vllm/vllm-openai tag |
| HTTP 401 from huggingface-cli download | Gated model needs a token | huggingface-cli login as the vllm user, or set HF_TOKEN in /etc/vllm/env |
| Streaming responses arrive in one burst | Proxy buffering on | Add proxy_buffering off to the nginx location block; check /etc/nginx/conf.d/ for stray defaults |
| NCCL hangs on multi-GPU start | Mismatched NCCL versions, NIC pinning, or P2P disabled | NCCL_DEBUG=INFO, pin NCCL_SOCKET_IFNAME, ensure all nodes have the same NCCL build |
| First request takes 30+ seconds | CUDA graph capture and weight load on a cold engine | Pre-warm with a dummy request before joining a load balancer; raise startup probe timeout |
| Slow downloads from Hugging Face | Anonymous rate limit | Set HF_TOKEN even for public models; throughput jumps an order of magnitude |
| RuntimeError: Engine core initialization failed | Generic catch-all from any boot-time exception | Scroll up in the journal; the actual error is two screens above the failure line |
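The pre-warm fix from the table can be a single tiny completion request sent before the node joins a load balancer. A stdlib-only sketch using the placeholder key and local port from this guide:

```python
# Minimal pre-warm: one tiny completion forces CUDA graph capture and the
# first-token path before real traffic arrives. Stdlib only; swap in your
# own key and base URL.
import json
import urllib.request

def warmup_payload(model="qwen2.5-7b"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }

def prewarm(base_url="http://127.0.0.1:8000",
            api_key="sk-cfg-please-rotate-me", timeout=120):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(warmup_payload()).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status  # 200 means the engine is warm

if __name__ == "__main__":
    print(prewarm())
```

Wire the same call into a systemd ExecStartPost or your orchestrator’s readiness probe so the 30-second cold hit lands before clients do.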

vLLM versus the alternatives

vLLM is the right default for production OSS LLM serving on NVIDIA hardware, but it’s not the only choice and not always the best one for a specific workload.

| Engine | Best for | Tradeoff |
|--------|----------|----------|
| vLLM | Multi-tenant production on NVIDIA, broad model support, day-zero releases | Cold start heavy; ROCm support trails CUDA |
| Hugging Face TGI | HF-native shops, Inferentia and Gaudi backends | Slower model day-zero than vLLM; dual Rust+Python stack |
| TensorRT-LLM + Triton | All-NVIDIA fleets where last-mile latency matters more than ops simplicity | Per-model engine compile; ops complexity |
| SGLang | Agentic workloads, structured JSON output, RAG with shared system prompts (RadixAttention) | Smaller community; fewer day-zero models |
| llama.cpp / llama-server | Single-stream, CPU fallback, Apple Silicon, mixed hardware | Continuous batching weak; not for >4 concurrent serious users |
| Ollama | Dev boxes, Mac, internal demos | Wraps llama.cpp; few production knobs |
| LMDeploy (TurboMind) | Qwen and InternLM-heavy shops; Huawei Ascend | Western community smaller; docs lean China-first |

The honest reading: SGLang has been gaining mindshare on agentic and RAG workloads thanks to RadixAttention, which automatically reuses KV cache across requests sharing a prefix. For a chat-style API where most requests don’t share long prefixes, vLLM still wins. For a RAG endpoint where every request opens with the same 4 KB system prompt and 8 KB of pinned documents, SGLang reports two- to five-fold throughput improvements on the same hardware. Run both on your traffic before committing if the workload is prefix-heavy. For everything else, vLLM is the safe bet. We have a broader survey at our open-source LLM comparison if you want a model-by-model rather than engine-by-engine view.

Where to go from here

The box is now answering OpenAI-compatible requests over HTTPS, restarting on failure, rate-limiting per key, exposing metrics with three real alerts wired up, and benchmarked. That’s a credible single-node deployment. The natural next steps are horizontal scale (a second replica behind the same nginx), multi-node serving for 70B-class models with Ray, disaggregated prefill/decode via the llm-d project for the workloads that justify the operational complexity, and a full Kubernetes deployment using either the official vLLM Helm chart or the project’s production-stack reference cluster. The Kubernetes path has enough moving parts (NVIDIA device plugin, GPU node pools, PVC strategy for the Hugging Face cache, KV-cache-aware routing) that it gets its own dedicated guide.

For model selection on the box you just built, see our Ollama models cheat sheet; it covers the same model lineup vLLM serves, with VRAM and quality tradeoffs per size that map directly onto vLLM’s --max-num-seqs and --gpu-memory-utilization tuning above. The vLLM official documentation is the canonical source for new flags and supported models; subscribe to the GitHub release feed if you serve frontier models on day one.
