How Much VRAM to Run an LLM (7B to 70B)

A 70B model needs about 42 GB of VRAM at Q4_K_M. A 32B fits in 24 GB. A 13B needs about 11 GB, and a 7B needs about 6 GB. Those four numbers answer most of the question of how much VRAM you need to run an LLM. What they hide is the second half: a model that fits by weight alone can still stall the moment its context overflows the card, and two GPUs holding the same model at the same 24 GB run it at very different speeds. This guide gives the approximate VRAM per model size and quantization, the VRAM for the specific models people actually run, the measured tokens per second on each GPU tier, and the card that fits each tier.

Original content from computingforgeeks.com - post 168756

Measured June 2026 on Ollama with Q4_K_M quantization.

The short answer: VRAM by model size

This table is the number most people come for. It is the total VRAM (weights plus a modest 4096-token context plus runtime overhead) for a representative current model at each size. The Q4_K_M column is the one that matters for local use, because Q4_K_M is the default quantization for most models in the Ollama library, a common choice for llama.cpp GGUF files, and the best quality per gigabyte. The card that fits each tier, with prices, is in the tested GPU buyer guide.

Model size	Example (2026)	FP16	Q8	Q4_K_M	Fits on
3B	Llama 3.2 3B	~7 GB	~4 GB	~4 GB	8 GB card
7B to 8B	Llama 3.1 8B	~16 GB	~9 GB	~6 GB	12 GB card
13B to 14B	Qwen2.5 14B	~30 GB	~16 GB	~11 GB	16 GB card
32B	Qwen3 32B	~65 GB	~35 GB	~22 GB	24 GB card (cap context)
70B	Llama 3.3 70B	~141 GB	~75 GB	~42 GB	48 GB card
120B (MoE)	gpt-oss-120b	native 4-bit	n/a	~64 GB (MXFP4)	80 GB card
235B (MoE)	Qwen3 235B	~470 GB	~250 GB	~142 GB	2x 80 GB

Read it as a floor, not a ceiling. The figures assume a short 4096-token context. Push the context to 32K or 128K and the requirement climbs, sometimes past the weights themselves. The 32B row says it fits on a 24 GB card, and it does, but only with the context capped. At the default context a 32B overflows 24 GB and falls off a performance cliff, covered below. The reason both behaviors exist is the KV cache.

How much VRAM for popular models

Sizes are convenient; named models are what people actually pull. These are approximate total VRAM at Q4_K_M with a working context, for the models most asked about in 2026. Add roughly 20 to 40 percent for a 32K context, depending on how aggressively the model uses grouped-query attention.

Model	Params	VRAM at Q4_K_M	Smallest practical card
Llama 3.2 3B	3B	~4 GB	8 GB
Llama 3.1 8B	8B	~6 GB	12 GB
gpt-oss-20b	21B MoE (3.6B active)	~14 GB (MXFP4)	16 GB
Mistral Small 24B	24B	~16 GB	16 to 24 GB
Gemma 3 27B	27B	~18 GB	24 GB
Qwen3 32B / DeepSeek-R1 Distill 32B	32B	~22 GB	24 GB (cap context) or 32 GB
Llama 3.3 70B	70B	~42 GB	48 GB
gpt-oss-120b	117B MoE (5.1B active)	~64 GB	80 GB
Qwen3 235B	235B MoE (22B active)	~142 GB	2x 80 GB

The formula: weights plus KV cache plus overhead

Three things consume VRAM during inference. The weights are the big, fixed cost. The KV cache grows with context length. Overhead is a smaller fixed tax for activations and the CUDA runtime.

Weights are parameters multiplied by bytes per parameter. The bytes per parameter depend entirely on quantization:

Format	Bytes/param	Quality
FP16 / BF16	2.0	Full precision baseline
Q8_0	~1.06	Near lossless, about 1% drop
Q5_K_M	~0.71	High quality
Q4_K_M	~0.60	The default sweet spot
Q4_0	~0.56	Older 4-bit, lower quality

The mental math is simple. FP16 gigabytes is roughly parameters in billions times two. Q4_K_M is roughly parameters times 0.6. A 70B at Q4_K_M lands at about 42 GB of weights, which matches the published GGUF file size of 42.5 GB almost exactly.
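The mental math above can be written as a small helper. This is a minimal sketch: the bytes-per-parameter values are copied from the table above, the name `weight_gb` is illustrative and not part of any library, and parameter counts are the nominal ones (70B taken as 70 billion), which only approximates each model's true count.

```python
# Approximate bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.06,
    "Q5_K_M": 0.71,
    "Q4_K_M": 0.60,
    "Q4_0": 0.56,
}

def weight_gb(params_billion: float, fmt: str) -> float:
    """Estimated weight size in GB: parameters (in billions) times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[fmt]

# Nominal sizes only; real parameter counts differ slightly per model.
for size in (8, 14, 32, 70):
    print(f"{size}B  FP16 ~{weight_gb(size, 'FP16'):.0f} GB  "
          f"Q4_K_M ~{weight_gb(size, 'Q4_K_M'):.1f} GB")
```

These are weights only. The KV cache and overhead described next come on top.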

The KV cache stores the key and value tensors for every token already in the context, so the model does not recompute them. The naive size is:

kv_cache_bytes = 2 * n_layers * hidden_size * seq_len * bytes_per_element

That naive formula is an upper bound. It assumes every attention head keeps its own key and value, which would put a 70B at 32K context near 84 GB of cache alone. Real models use grouped-query attention, which shares each cached key and value across several query heads and shrinks the cache by a factor of four to eight on current architectures. Llama 3.3 70B has 80 layers but only 8 key-value heads against 64 query heads, an eightfold reduction, so its real cache is closer to 0.3 MB per token, not the 2.6 MB the naive formula predicts. Grouped-query attention is the single reason a 32K context is runnable on consumer hardware at all.
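The gap between the naive and the grouped-query figure is easy to check by hand. The sketch below assumes an FP16 cache and a head dimension of 128 (a hidden size of 8192 split across 64 query heads); the layer and head counts are the ones quoted above, and the function name is illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of keys plus values cached per token, summed over all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element

# Assumed Llama 3.3 70B shape: 80 layers, 64 query heads, 8 KV heads, head_dim 128.
naive = kv_bytes_per_token(80, 64, 128)  # as if every query head kept its own K/V
gqa = kv_bytes_per_token(80, 8, 128)     # 8 KV heads shared across the query heads

seq_len = 32_000
print(f"naive: {naive / 1e6:.2f} MB/token, {naive * seq_len / 1e9:.0f} GB at 32K")
print(f"GQA:   {gqa / 1e6:.2f} MB/token, {gqa * seq_len / 1e9:.1f} GB at 32K")
```

Under these assumptions the naive cache lands near 84 GB at 32K tokens and the grouped-query cache near 10 GB, the same eightfold ratio as 64 query heads to 8 key-value heads.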

Overhead adds roughly 15 to 20 percent on top of weights and cache for activations and framework buffers. The practical rule that falls out of all three: pick a quantization that leaves one to two gigabytes of headroom below your card’s VRAM, never one that exactly equals it.
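Put together, the three terms give a rough fit check. The sketch below is a back-of-envelope estimate, not a measurement: the 15 percent overhead and the one gigabyte of headroom come from the rule above, while the 0.5 GB cache figure for an 8B model at 4096 tokens and both function names are assumptions for illustration.

```python
def total_vram_gb(weights_gb, kv_cache_gb, overhead_frac=0.15):
    """Weights plus KV cache, scaled up by a fractional overhead."""
    return (weights_gb + kv_cache_gb) * (1 + overhead_frac)

def leaves_headroom(card_gb, weights_gb, kv_cache_gb, headroom_gb=1.0):
    """True when the estimate still leaves headroom_gb free on the card."""
    return total_vram_gb(weights_gb, kv_cache_gb) + headroom_gb <= card_gb

# Illustrative inputs: an 8B model at Q4_K_M (8 * 0.6 = 4.8 GB of weights)
# with an assumed 0.5 GB cache for a 4096-token context.
estimate = total_vram_gb(4.8, 0.5)
print(f"estimated total: {estimate:.1f} GB")
for card in (6, 8, 12):
    verdict = "ok" if leaves_headroom(card, 4.8, 0.5) else "too tight"
    print(f"{card} GB card: {verdict}")
```

The overhead fraction is the least certain input. Treat the result as a first filter and confirm against what the runtime actually reports once the model is loaded.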

How fast will it run? Measured tokens per second per GPU

Fitting a model is half the question. The half almost every VRAM guide skips is speed, and speed does not track VRAM capacity. The numbers below were measured on single rented GPUs running Ollama at Q4_K_M, generation rate in tokens per second, one request at a time. The 32B figures are on Qwen2.5 32B; Qwen3 32B is the current dense 32B and sizes almost identically.

Model	RTX 3090 (24 GB)	RTX 4090 (24 GB)	RTX 5090 (32 GB)	L40S (48 GB)
Llama 3.1 8B	119 tok/s	148 tok/s	250 tok/s	113 tok/s
Qwen2.5 32B	37 tok/s	43 tok/s	71 tok/s	33 tok/s
Llama 3.3 70B	does not fit	does not fit	does not fit	16 tok/s

The number that matters is in the 32B row. The RTX 5090 with 32 GB runs the 32B model at 71 tokens per second, more than double the L40S with its larger 48 GB. More VRAM did not make it faster. Token generation is bound by memory bandwidth, not capacity, and the 5090’s GDDR7 moves data far faster than the L40S’s GDDR6. VRAM capacity decides what fits; memory bandwidth decides how fast it runs once it fits. The same logic explains why the L40S also trails a 3090 on the 8B model that fits either card: it trades peak bandwidth for capacity. The full card-by-card breakdown is in the tested price-versus-speed guide.
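The bandwidth argument can be turned into a rough ceiling: if every generated token has to read all the weights once, tokens per second cannot exceed bandwidth divided by weight size. The sketch below uses published spec-sheet bandwidth figures (not values measured for this guide) and an assumed 20 GB of weights for a 32B at Q4_K_M. It ignores the KV cache and kernel efficiency, so real numbers sit below it.

```python
# Published peak memory bandwidth in GB/s (spec-sheet figures, not measured here).
BANDWIDTH_GBPS = {
    "RTX 3090": 936,
    "RTX 4090": 1008,
    "RTX 5090": 1792,
    "L40S": 864,
}

def ceiling_tok_s(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on decode speed when each token reads every weight once."""
    return bandwidth_gbps / weights_gb

weights_gb = 20.0  # assumed: roughly a 32B model at Q4_K_M
for card, bw in BANDWIDTH_GBPS.items():
    print(f"{card}: at most ~{ceiling_tok_s(bw, weights_gb):.0f} tok/s")
```

The ordering this produces (5090, then 4090, then 3090, then L40S) matches the measured 32B row, and each measured figure sits below its ceiling, which is what a bandwidth-bound workload should look like.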

The offload cliff: why “it fits” is not enough

When a model plus its KV cache exceeds VRAM, the runtime does not fail. It keeps the overflow layers in system RAM and computes them on the CPU. That sounds graceful and is not. GPU memory moves at 500 to 1800 gigabytes per second; DDR5 system memory moves at 50 to 80 gigabytes per second. Every offloaded layer runs at system-memory speed, with its activations crossing the PCIe bus on each token, and the result is a cliff, not a slope.

A concrete case from testing: the 32B model on a 24 GB RTX 3090. The weights are about 20 GB and fit. But recent Ollama versions size the context window to the VRAM they detect, and a 24 GB card is handed a 32K window by default. That 32K KV cache on top of the weights pushes the total past 24 GB, the overflow spills to the CPU, and throughput collapses:

ollama run qwen2.5:32b "Explain quantization in two sentences."

With the context left at its auto-sized default, generation dropped to roughly 3.6 tokens per second, a tenth of the card’s real rate. The fix is to cap the context so the cache stays inside VRAM. Ollama reads the context length from the server process, so set it there and restart:

OLLAMA_CONTEXT_LENGTH=4096 ollama serve

If Ollama runs as a systemd service, add Environment="OLLAMA_CONTEXT_LENGTH=4096" to the unit and restart it, or cap a single model from inside the session with /set parameter num_ctx 4096. Setting the variable in front of ollama run does nothing, because ollama run is only the client and the server holds the context. With the cache capped to fit, the same model on the same card held at 37 tokens per second. Same hardware, same weights, ten times the speed, decided entirely by whether the cache fit. VRAM is a hard boundary, not a soft limit.
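A quick estimate shows where the cap has to sit. The sketch below assumes Qwen2.5 32B caches keys and values for 64 layers with 8 key-value heads of dimension 128 in FP16, takes the 20 GB weight figure from above, and reserves an assumed 1.5 GB for runtime buffers. The function name and the reserve are illustrative, not anything Ollama exposes.

```python
def max_context_tokens(card_gb, weights_gb, kv_bytes_per_token, reserve_gb=1.5):
    """Largest context whose KV cache fits after weights and a fixed reserve."""
    budget_bytes = (card_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(budget_bytes // kv_bytes_per_token))

# Assumed Qwen2.5 32B shape: 64 layers, 8 KV heads, head_dim 128, FP16 cache.
kv_per_token = 2 * 64 * 8 * 128 * 2  # bytes of keys plus values per token

limit = max_context_tokens(card_gb=24, weights_gb=20, kv_bytes_per_token=kv_per_token)
print(f"largest context that fits: about {limit} tokens")
for ctx in (4096, 32768):
    verdict = "fits" if ctx <= limit else "spills to CPU"
    print(f"context {ctx}: cache ~{ctx * kv_per_token / 1e9:.1f} GB, {verdict}")
```

Under these assumptions a 4096-token cache costs about 1 GB and fits, while a 32K cache costs close to 9 GB and cannot, which is consistent with the collapse and recovery measured above.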

Offloading also sets a floor on system RAM. The layers that spill out of VRAM live in main memory, so size system RAM to hold the offloaded portion plus headroom. A rough rule is system RAM of at least twice the model file size if you expect to offload heavily. GGUF memory-maps the file, so only the offloaded layers stay resident, and fast NVMe speeds the initial load but does nothing for generation once the model is mapped.

How context length eats VRAM

At 4K context the KV cache is a rounding error against the weights. At 128K it can exceed them. The cache is allocated up front for the maximum context you configure, and it grows linearly with that length. A 7B model that fits comfortably in 8 GB by weight needs several more gigabytes at 32K context, and at 128K the cache alone would overflow a 24 GB card were grouped-query attention not shrinking it. The two levers that keep long context affordable are grouped-query attention, which the model architecture decides, and KV cache quantization to 8-bit or 4-bit, which trades a little quality for a much smaller cache. The practical takeaway: if a model that should fit keeps running out of memory, the context length is almost always the cause, not the weights.
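To see both levers at work, the sketch below sizes the cache for an 8B model at three context lengths and three cache precisions. It assumes the Llama 3.1 8B shape (32 layers, 8 key-value heads, head dimension 128) and treats 8-bit and 4-bit cache formats as exactly 1 and 0.5 bytes per element, which ignores the small per-block scale data real quantized caches carry.

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Cache size in GB for a context of seq_len tokens (keys plus values)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element / 1e9

# Assumed Llama 3.1 8B shape; grouped-query attention is already reflected
# in the 8 KV heads (against 32 query heads).
SHAPE = {"n_layers": 32, "n_kv_heads": 8, "head_dim": 128}
# Approximate bytes per cached element for each cache precision.
CACHE_BYTES = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}

for ctx in (4096, 32768, 131072):
    row = ", ".join(
        f"{name} {kv_cache_gb(ctx, bytes_per_element=b, **SHAPE):.1f} GB"
        for name, b in CACHE_BYTES.items()
    )
    print(f"{ctx:>6} tokens: {row}")
```

The growth is linear in context length, so each doubling of the window doubles the cache, and each halving of cache precision halves it.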

Which GPU for each VRAM tier

Match the card to the largest model you intend to run at Q4_K_M, then leave a tier of headroom for context. Prices through 2026 sit well above list because of the memory shortage, so treat the figures as direction, not a quote.

VRAM	Card	Runs at Q4_K_M
16 GB	RTX 4060 Ti 16GB	13B to 14B, Mistral Small 24B, gpt-oss-20b
24 GB	RTX 3090 / RTX 4090	32B, the sweet spot. 70B does not fit
32 GB	RTX 5090	32B with long context, fastest single card
48 GB	RTX 6000 Ada	70B at Q4_K_M, the entry to 70B
96 GB	RTX PRO 6000 Blackwell	70B with room, into the 100B class

The two lines worth memorizing: 24 GB is the 32B sweet spot, and 48 GB is the entry ticket to a 70B on a single card. The RTX 5090’s 32 GB is the fastest single card here, but it is still one tier short of a 70B at Q4_K_M, which lands near 42 GB before context.

Can two 24 GB GPUs run a 70B model?

Yes. Two 24 GB cards give 48 GB of combined VRAM, which is enough for a 70B at Q4_K_M, and it is often cheaper than one 48 GB card. Ollama splits a model across multiple GPUs automatically, and llama.cpp does it with --tensor-split. NVLink is not required; a pair of cards on ordinary PCIe slots works, adding a small inter-GPU latency. The catch is that VRAM scales but speed does not. Splitting a 70B across two RTX 3090s lets it fit and run, but at roughly the speed of a single card holding the model, not double, because generation is still bandwidth-bound per layer and the two GPUs work in sequence on each token. Two cheaper cards are the value path to a 70B; they are not a path to a faster 70B.
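Why capacity adds up but speed does not can be sketched in a few lines. The code below assumes a plain layer-wise split where the GPUs take turns on each token, uses the published 936 GB/s bandwidth of an RTX 3090, and ignores inter-GPU transfer time. Both function names are hypothetical; the proportions computed by `split_layers` are the kind of ratio llama.cpp's `--tensor-split` expects.

```python
def split_layers(n_layers, vram_gb_per_gpu):
    """Assign layers to GPUs in proportion to each card's VRAM."""
    total = sum(vram_gb_per_gpu)
    counts = [int(n_layers * v / total) for v in vram_gb_per_gpu]
    # Hand any layers lost to rounding down to the first cards.
    for i in range(n_layers - sum(counts)):
        counts[i % len(counts)] += 1
    return counts

def sequential_ceiling_tok_s(weights_gb_per_gpu, bandwidth_gbps_per_gpu):
    """Decode ceiling when GPUs process their layers one after another."""
    seconds = sum(w / bw for w, bw in zip(weights_gb_per_gpu, bandwidth_gbps_per_gpu))
    return 1.0 / seconds

print(split_layers(80, [24, 24]))  # 80 layers over two equal cards

# 42 GB of weights split evenly over two RTX 3090s (936 GB/s each, spec figure).
two_cards = sequential_ceiling_tok_s([21, 21], [936, 936])
# A hypothetical single card with the same bandwidth holding all 42 GB.
one_card = sequential_ceiling_tok_s([42], [936])
print(f"two cards: ~{two_cards:.1f} tok/s, one card: ~{one_card:.1f} tok/s")
```

The two ceilings come out identical: the second card adds room for the weights, but each token still waits for every layer to be read once at single-card bandwidth.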

How much extra VRAM for multiple users?

Every number in this guide assumes one request at a time. Serving multiple users changes the math, because the weights are shared but the KV cache is per request. Ten concurrent sessions do not load the model ten times, but they do allocate ten KV caches, and at long context that cache is the larger cost. Budget VRAM for peak concurrency, not for a single chat.
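The shared-weights, per-request-cache split is easy to budget. The sketch below assumes the worst case where every session holds a full-length cache, and uses an 8B model at Q4_K_M with an assumed 131,072 bytes of cache per token (32 layers, 8 key-value heads, head dimension 128, FP16). The function name is illustrative.

```python
def serving_vram_gb(weights_gb, kv_bytes_per_token, context_tokens, sessions):
    """Weights loaded once, plus one full-length KV cache per concurrent session."""
    cache_gb = sessions * context_tokens * kv_bytes_per_token / 1e9
    return weights_gb + cache_gb

WEIGHTS_GB = 4.8         # 8B at Q4_K_M
KV_PER_TOKEN = 131_072   # assumed bytes per token for an 8B with 8 KV heads

for sessions in (1, 4, 10):
    total = serving_vram_gb(WEIGHTS_GB, KV_PER_TOKEN, 8192, sessions)
    print(f"{sessions:>2} sessions at 8K context: ~{total:.1f} GB")
```

At ten sessions the caches cost more than twice what the weights do, which is the point at which the cache, not the model, decides the card.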

The serving framework matters here more than the model. Ollama and llama.cpp pre-allocate the KV cache for the full context and are built for one user at a time. vLLM uses PagedAttention, which allocates the cache in small pages on demand and wastes under 4 percent of it, against the large fraction a pre-allocating runtime leaves idle. That efficiency is why vLLM serves far more concurrent requests in the same VRAM and is the right tool once more than one person is hitting the endpoint. For a single local user, Ollama is simpler and the difference does not show. The production path, vLLM behind a real API, is covered in the vLLM production guide.

Unified memory: Apple and Strix Halo

Apple Silicon and AMD’s Strix Halo share one memory pool between CPU and GPU, and the GPU uses it directly as VRAM. That changes the capacity math. A 128 GB machine can hold a 70B model, or a large mixture-of-experts model, that no consumer discrete GPU can fit, and the mini PCs built to hold one are compared separately. Two caveats keep it honest. Only about 70 to 75 percent of unified memory is usable for the model, because the OS reserves the rest, so a 64 GB Mac behaves like a 48 GB GPU. And bandwidth, not capacity, still sets the speed. An RTX 5090 with 32 GB and 1792 GB/s of bandwidth is faster than a 128 GB Strix Halo at around 256 GB/s on any model that fits in 32 GB. Capacity and speed are different axes. Unified memory buys the first and pays for it in the second. The same tradeoff drives the local-AI section of the best laptops for programming and DevOps guide.

Mixture-of-experts models change the math

The 120B and 235B rows carry a trap. A model like gpt-oss-120b activates only about 5 billion of its 117 billion parameters per token, and Qwen3 235B activates 22 billion of 235 billion. Those active counts make the model compute faster, but they do nothing for VRAM. The full weight set still has to fit in memory. gpt-oss-120b ships natively in 4-bit and needs about 64 GB, which is why it lands on a single 80 GB data-center card rather than a consumer GPU. Sizing a mixture-of-experts model by its active parameters instead of its total is the most common and most expensive mistake in this whole topic.
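The distinction is worth making explicit in any sizing script. In the sketch below, memory is computed from total parameters and the per-token read from active parameters. The 0.55 bytes per parameter for gpt-oss-120b is an assumption backed out of the roughly 64 GB figure above rather than a published number, and the function name is illustrative.

```python
def moe_footprint(total_params_b, active_params_b, bytes_per_param):
    """Memory is set by total parameters; per-token reads by active ones."""
    return {
        "vram_gb": total_params_b * bytes_per_param,
        "read_per_token_gb": active_params_b * bytes_per_param,
    }

MODELS = {
    # name: (total B, active B, assumed bytes per parameter)
    "gpt-oss-120b": (117, 5.1, 0.55),
    "Qwen3 235B": (235, 22, 0.60),
}

for name, (total, active, bpp) in MODELS.items():
    fp = moe_footprint(total, active, bpp)
    print(f"{name}: ~{fp['vram_gb']:.0f} GB to hold, "
          f"~{fp['read_per_token_gb']:.1f} GB read per token")
```

The second number explains the speed of these models; only the first one belongs in a VRAM budget.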

Can you run a 70B model on a 24 GB GPU?

No, not on a single 24 GB card in a usable way. A 70B at Q4_K_M needs about 42 GB before context, and a 24 GB card is short by almost half. The runtime will start it by offloading half the layers to the CPU, and throughput drops to a few tokens per second, slower than reading speed. The real options for a 70B are a single 48 GB card such as the RTX 6000 Ada, two 24 GB cards splitting the model, or a unified-memory machine with 64 GB or more. A single 24 GB card is the right tool for a 32B, not a 70B.

Q4_K_M or Q8: which quantization?

Q4_K_M for almost everyone. It uses roughly 0.6 bytes per parameter against Q8’s 1.06, so it nearly halves the VRAM, and the measured quality loss against the full-precision model is around one percent on most tasks. Q8 is worth the extra memory only when a task is sensitive to small numerical drift, such as code generation against a strict test suite, and the card has the room to spare. Below Q4_K_M, quality falls off faster than the memory savings justify, so Q4_0 and the 3-bit and 2-bit formats are last resorts to squeeze a model onto a card that is genuinely too small.
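Whether Q8 is even on the table depends on the card. The sketch below lists which formats leave room on a given card, using the bytes-per-parameter values from earlier in this guide and an assumed flat 3 GB for context and overhead. That allowance is a placeholder; size it to your own context length.

```python
# Candidate formats in descending quality, with approximate bytes per parameter.
FORMATS = [("Q8_0", 1.06), ("Q5_K_M", 0.71), ("Q4_K_M", 0.60)]

def formats_that_fit(params_billion, card_gb, context_and_overhead_gb=3.0):
    """Formats whose weights plus a flat allowance fit on the card."""
    return [
        name
        for name, bytes_per_param in FORMATS
        if params_billion * bytes_per_param + context_and_overhead_gb <= card_gb
    ]

for size, card in ((8, 12), (32, 24), (70, 24)):
    fits = formats_that_fit(size, card) or ["nothing in this list"]
    print(f"{size}B on {card} GB: {', '.join(fits)}")
```

Where more than one format fits, the choice comes back to the task: Q4_K_M by default, Q8 only when the work is sensitive to small numerical drift.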

How much VRAM for fine-tuning?

More than inference, and it depends on the method. Full fine-tuning loads the weights, the gradients, and the optimizer states, which pushes a 7B well past 60 GB and often beyond 100 GB, and puts a 70B firmly in multi-GPU territory. Parameter-efficient methods change that completely. QLoRA quantizes the base model to 4-bit and trains only small adapter layers on top, which brings a 7B fine-tune under 16 GB and a 70B fine-tune to roughly 48 GB, the same single 48 GB card that runs a 70B at inference. For anyone fine-tuning on local hardware, QLoRA is the method that makes the VRAM math work.
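The gap between the two methods falls out of the per-parameter accounting. The sketch below assumes the common mixed-precision Adam layout for full fine-tuning (16 bytes per parameter before activations) and 0.5 bytes per parameter for a 4-bit QLoRA base model. Both are simplifying assumptions, and the QLoRA figure covers the frozen base weights only; adapters, their optimizer state, and activations come on top.

```python
# Assumed per-parameter costs for full fine-tuning with Adam in mixed precision.
FULL_FT_BYTES = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}

def full_finetune_gb(params_billion):
    """Weights, gradients and optimizer state; activations are extra."""
    return params_billion * sum(FULL_FT_BYTES.values())

def qlora_base_gb(params_billion, bytes_per_param=0.5):
    """Frozen 4-bit base weights only; adapters and activations are extra."""
    return params_billion * bytes_per_param

for size in (7, 70):
    print(f"{size}B: full fine-tune ~{full_finetune_gb(size):.0f} GB, "
          f"QLoRA base weights ~{qlora_base_gb(size):.1f} GB")
```

A 7B lands at about 112 GB for full fine-tuning under these assumptions, while the QLoRA base for a 70B is about 35 GB, leaving the rest of a 48 GB card for adapters and activations.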

VRAM to model size at a glance

The whole guide compresses to one reference. Pick the row for the largest model you want to run, buy a card one tier above it for context headroom, and keep the context capped to what the work needs.

You want to run	Minimum VRAM (Q4_K_M)	Practical card
3B to 8B	8 to 12 GB	RTX 3060 12GB, RTX 4060 Ti 16GB
13B to 27B	16 to 24 GB	RTX 4060 Ti 16GB, RTX 3090
32B	24 GB	RTX 3090, RTX 4090, RTX 5090
70B	48 GB	RTX 6000 Ada, 2x 24 GB, 64 GB unified
120B+ (MoE)	64 GB and up	RTX PRO 6000, data-center 80 GB

The weights tell you what fits, the bandwidth tells you how fast, and the context tells you whether either number holds under real load. Get all three right and a local model runs the way the benchmark says it should. Miss the third and a card that should fly will crawl. The card that fits each tier, with the measured speed to back it, is laid out in the best GPU for LLMs guide.