Unified Memory vs VRAM for AI, Explained

When you go shopping for hardware to run AI models locally, you run into two completely different ways of attaching memory to a processor, and the difference decides which models you can run and how fast they go. A discrete graphics card carries its own dedicated memory soldered next to the GPU chip, called VRAM. A unified-memory machine like an Apple Silicon Mac or an AMD Ryzen AI Max+ box has the CPU, GPU, and NPU all share one physical pool of RAM. The two designs optimize for opposite things, and most buying confusion comes from not knowing which one a given workload actually wants.

This guide explains what unified memory and VRAM each are, the mechanism behind why they behave so differently for large language models, and how to read the two specs that decide a purchase: memory capacity and memory bandwidth. We will tie it to real measured token rates so the abstract numbers land somewhere concrete. The figures here were checked against current hardware and published reviews in June 2026.

What VRAM and unified memory actually are

VRAM is video memory: a bank of fast GDDR6, GDDR6X, GDDR7, or HBM chips wired directly to a discrete GPU on the same board. It exists for one reason, to feed the GPU’s thousands of cores data fast enough to keep them busy. A graphics card cannot borrow your system RAM at full speed, so whatever VRAM the card ships with is a hard ceiling. An RTX 3090 has 24GB and will always have 24GB; an RTX 5090 has 32GB. There is no upgrade path. When a model does not fit, the card spills the overflow to slow system RAM over the PCIe bus, and performance falls off a cliff.

Unified memory takes a different approach. Instead of giving the GPU its own private memory, the CPU, GPU, and the neural processing unit all read and write the same physical RAM through one memory controller. Apple Silicon built its whole architecture around this idea, and AMD’s Ryzen AI Max+ 395 (the chip codenamed Strix Halo) brought the same model to a Windows and Linux APU. Because there is only one pool, you can hand a large slice of it to the GPU as effective VRAM. A 128GB Strix Halo box can allocate up to 96GB of that pool to the graphics side, which is more usable model memory than any consumer graphics card offers, in a unit the size of a paperback.

The key insight is that these are not two grades of the same thing. They are answers to two different questions. VRAM answers “how do I feed a huge array of cores as fast as physically possible.” Unified memory answers “how do I let one big pool serve everything cheaply, without copying data back and forth between separate CPU and GPU memories.” Those goals pull the hardware in opposite directions, which is exactly why one design wins on capacity and the other wins on speed.

The two axes that decide everything: capacity and bandwidth

Every memory system has two numbers that matter for AI, and they trade against each other. Capacity is how many gigabytes you have. Bandwidth is how many gigabytes per second the processor can read from that memory. A model has to fit in capacity before it can run at all, and once it fits, bandwidth largely decides how fast it generates text.

Here is where the two designs split. A discrete GPU keeps its memory small but blisteringly fast. The 24GB on an RTX 4090 runs at about 1 TB/s; the 32GB on an RTX 5090 runs near 1.8 TB/s. That is a small tank behind a very wide pipe: not much capacity, enormous speed. Unified memory does the reverse. A Strix Halo box gives you up to 128GB total, but that pool runs at roughly 256 GB/s over its LPDDR5X-8000 bus, around four to seven times slower than a desktop GPU’s VRAM. Apple’s M3 Ultra pushes its unified bandwidth far higher, up to 819 GB/s on the top configuration, which is why Macs sit between the AMD boxes and discrete GPUs on speed.

| Memory design | Typical capacity | Typical bandwidth | Upgrade path |
| --- | --- | --- | --- |
| Discrete GPU VRAM (RTX 3090) | 24GB fixed | ~936 GB/s | None, soldered |
| Discrete GPU VRAM (RTX 5090) | 32GB fixed | ~1.8 TB/s | None, soldered |
| AMD unified (Ryzen AI Max+ 395) | Up to 128GB (~96GB to GPU) | ~256 GB/s | None, soldered |
| Apple unified (M3 Ultra) | Up to 256GB (current top tier) | Up to 819 GB/s | None, configured at order |

The reason this matters for buying: you cannot maximize both axes at once at a sane price. The physics and the packaging force the choice. The card vendors optimize for bandwidth and accept a small capacity. The unified-memory vendors optimize for capacity and accept lower bandwidth. Knowing which axis your workload leans on tells you which machine to buy, and for local LLMs the answer depends on one detail of how the model is built.

Why capacity decides if a model loads and bandwidth decides how fast

To run a language model, its weights have to live in the memory the processor reads from. The size of those weights depends on the parameter count and the quantization level, which is the precision the numbers are stored at. The level most people run locally is Q4_K_M, roughly 4.8 bits per weight, which cuts the footprint to under a third of the FP16 original while keeping output quality close. As a rough rule at Q4, an 8B model needs around 5GB, a 32B needs around 20GB, and a 70B needs around 42GB before you add anything else.
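The sizing rule can be sketched as a one-line calculation: parameters times bits per weight, divided by eight. This is a minimal sketch that treats 4.8 bits per weight as a flat average (the figure quoted above); real quantized files mix precisions across layers, so actual sizes differ a little, and runtime overhead comes on top. The function name `weights_gb` is illustrative, not part of any tool.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    # params * bits / 8 gives bytes; the "billion" cancels the 1e9 in "GB".
    return params_billion * bits_per_weight / 8


Q4_K_M_BITS = 4.8   # approximate average, as quoted in the text
FP16_BITS = 16.0

for size in (8, 32, 70):
    q4 = weights_gb(size, Q4_K_M_BITS)
    fp16 = weights_gb(size, FP16_BITS)
    print(f"{size}B: ~{q4:.1f} GB at Q4_K_M, ~{fp16:.0f} GB at FP16")
```

Treat the result as a floor for the weights alone; the context cache and the runtime’s own buffers sit on top of it.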

That “anything else” is the KV cache, the model’s running memory of the current conversation. It grows with the context length, so a long prompt on a large model can add tens of gigabytes on top of the weights. The practical takeaway is to size for the weights, then leave headroom, and leave a lot of headroom if you plan to run long contexts. We cover the full sizing math in the guide on how much VRAM it takes to run an LLM from 7B up to 70B.
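To see how the cache scales with context, here is a rough sketch of the usual estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, KV-head count, and head dimension below are assumptions chosen to resemble a Llama-3-style 70B with grouped-query attention, and the cache is assumed to be stored at 16 bits per value; check the actual model config before relying on the numbers.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in decimal gigabytes for one sequence."""
    # The leading 2 covers one key tensor plus one value tensor per layer.
    total_bytes = (2 * layers * kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Assumed shape, resembling a Llama-3-style 70B (not taken from the article).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (4096, 32768, 131072):
    gb = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"{ctx:>7} tokens: ~{gb:.1f} GB of KV cache")
```

Under these assumptions the cache is small at a 4096-token context and grows linearly from there, reaching tens of gigabytes at very long contexts.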

This is the mechanism that makes a 128GB unified box hold a 70B model that a 32GB graphics card cannot. The 70B simply does not fit in 32GB of VRAM, so the card either refuses to load it or spills it to system RAM and crawls. The unified box has 96GB of usable model memory, so the same 70B loads with room to spare for a generous context window. Capacity decided the outcome, and the cheaper-per-gigabyte memory won.
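The load-or-not decision reduces to a simple comparison, sketched below. The capacities are the ones quoted in this guide; the 1.3GB cache figure and the 2GB `reserve_gb` allowance for runtime buffers are illustrative assumptions, not measured values.

```python
def fits(weights_gb: float, kv_cache_gb: float, capacity_gb: float,
         reserve_gb: float = 2.0) -> bool:
    """True if weights, cache, and a runtime reserve fit in the memory pool."""
    return weights_gb + kv_cache_gb + reserve_gb <= capacity_gb


# A 70B at Q4 (~42 GB of weights) with a short-context cache (assumed ~1.3 GB).
WEIGHTS_GB, CACHE_GB = 42.0, 1.3

pools = {
    "RTX 3090 (24GB VRAM)": 24,
    "RTX 5090 (32GB VRAM)": 32,
    "L40S (48GB VRAM)": 48,
    "Strix Halo (96GB to GPU)": 96,
}

for name, capacity in pools.items():
    verdict = "fits" if fits(WEIGHTS_GB, CACHE_GB, capacity) else "does not fit"
    print(f"{name}: {verdict}")
```

Raising the cache figure to model a long context shows how quickly a 48GB pool runs out while the 96GB pool still has room.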

Now flip to speed, and the verdict inverts. Generating each token in a dense model means reading every one of its parameters from memory once, which makes token generation almost purely a memory-bandwidth problem. A dense 70B has to pull roughly 42GB of weights through the bus for every single token. On a graphics card at 1.8 TB/s that is fast; on a unified box at 256 GB/s it is slow. So the unified machine can hold the 70B and the GPU cannot, yet the GPU would generate far faster if only the model fit. The model loads on the box with the most capacity but runs fastest on the box with the most bandwidth, and those are usually different machines.
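That bandwidth argument gives a quick upper bound: tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below applies it to a dense 70B at Q4 using the bandwidth figures quoted in this guide. It is a ceiling under the simplifying assumption that weight reads are the only cost; real throughput lands below it once compute, cache reads, and software overhead are counted.

```python
def max_tokens_per_second(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Bandwidth-bound ceiling on generation speed (ignores compute and overhead)."""
    return bandwidth_gb_s / gb_read_per_token


DENSE_70B_Q4_GB = 42.0  # weights read once per generated token

machines = {
    "Strix Halo (256 GB/s)": 256,
    "M3 Ultra (819 GB/s)": 819,
    "RTX 5090 (1800 GB/s, if the model fit)": 1800,
}

for name, bandwidth in machines.items():
    ceiling = max_tokens_per_second(bandwidth, DENSE_70B_Q4_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

The Strix Halo ceiling comes out at about 6 tokens per second, which is consistent with the roughly 5 measured on real hardware.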

Why mixture-of-experts models change the math

There is one architecture that breaks the “bandwidth decides speed” rule in favor of unified memory, and it is the single most important thing to understand before buying a high-capacity box. A dense model activates all of its parameters for every token. A mixture-of-experts (MoE) model splits its parameters into many “experts” and routes each token to only a few of them, so only a fraction of the total weights are read per token even though all of them must still be resident in memory.

The consequence is exactly what a unified-memory machine wants. An MoE model needs huge capacity to hold all its experts, which unified memory has in abundance, but it only touches a small active slice per token, so the modest bandwidth is enough to keep it fast. Take gpt-oss-120b: it carries 117 billion parameters total but activates only about 5 billion per token. On a Beelink GTR9 Pro (a Strix Halo box), ServeTheHome measured it at 31.41 tokens per second while a dense 70B on the same machine managed only about 5. The 120B model is larger, yet it runs six times faster, purely because it reads so much less memory per token.
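The same arithmetic shows why the sparse model wins. This sketch compares the memory that must be resident against the memory actually read per token, assuming for simplicity that both models are stored at about 4.8 bits per weight (an assumption; the real quantization formats may differ). As before, the tokens-per-second values are bandwidth ceilings, not predictions.

```python
BITS_PER_WEIGHT = 4.8      # assumed flat average for both models
BANDWIDTH_GB_S = 256.0     # Strix Halo figure quoted in the text


def gb_for_params(params_billion: float, bits: float = BITS_PER_WEIGHT) -> float:
    """Decimal gigabytes occupied by the given number of parameters."""
    return params_billion * bits / 8


models = {
    # name: (total parameters, parameters active per token), in billions
    "dense 70B": (70, 70),
    "MoE 117B, ~5B active": (117, 5),
}

for name, (total, active) in models.items():
    resident = gb_for_params(total)       # must fit in memory
    per_token = gb_for_params(active)     # read for each generated token
    ceiling = BANDWIDTH_GB_S / per_token
    print(f"{name}: ~{resident:.0f} GB resident, ~{per_token:.0f} GB read per token, "
          f"ceiling ~{ceiling:.0f} tok/s")
```

The MoE model needs more memory to load but reads roughly a fourteenth as much per token, so its ceiling is far higher. The measured gap is smaller than the ceilings suggest, which is expected, since routing, attention, and compute costs do not shrink in the same proportion.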

That is the real argument for a unified-memory box in 2026: it makes big sparse models fast and big dense models merely possible. If the models you care about are large MoE designs, a Strix Halo or a high-memory Mac gives you capacity a graphics card cannot touch and genuinely usable speed on the sparse ones. If your models are dense and they fit in 24GB or 32GB, a discrete GPU will run them several times faster for less money. The architecture of the model you run, dense versus sparse, is the deciding factor more than the brand of the box.

What the measured token rates actually look like

Numbers make this concrete. The first set below is our own, measured in June 2026 on rented hardware (an RTX 3090, RTX 4090, RTX 5090, and an L40S) running Ollama at Q4_K_M with a fixed 4096-token context, taken from our best GPU for local LLM testing. The figure is the warm generation rate in tokens per second.

| Discrete GPU (VRAM) | 8B Q4 | 32B Q4 | 70B Q4 |
| --- | --- | --- | --- |
| RTX 3090 (24GB) | 119 tok/s | 37 tok/s | does not fit |
| RTX 4090 (24GB) | 148 tok/s | 43 tok/s | does not fit |
| RTX 5090 (32GB) | 250 tok/s | 71 tok/s | does not fit |
| L40S (48GB) | 113 tok/s | 33 tok/s | 16 tok/s |

Two things stand out. The fastest card cannot fit a 70B at all, because 32GB of VRAM is not enough for a model that needs about 42GB plus its cache. And more VRAM did not mean more speed: the L40S has 48GB but its slower GDDR6 left it behind the 32GB RTX 5090 on the models both could run. Bandwidth, not capacity, set the speed once a model fit.

The second set is the unified-memory side, and these are cited from published reviews on real hardware, not our own bench. On a 128GB Strix Halo box, ServeTheHome measured the dense Llama 3.3 70B at about 5 tokens per second, the MoE gpt-oss-120b at about 31 tokens per second, and reported that a Q4 70B sits comfortably in roughly 42 to 48GB with room for the KV cache. We size and benchmark these machines in detail in the best mini PC for local AI roundup.

Put the two sets of numbers side by side and the trade-off is plain. A consumer graphics card runs a 70B at zero tokens per second, because it cannot load it. The unified box runs the same dense 70B at about 5, which is fine for batch jobs and background drafting but slow for live chat, where roughly 10 tokens per second is the floor for comfortable reading. Yet that same unified box runs a 120-billion-parameter MoE model at 31, faster than the dense 70B and on a model no consumer card can hold. Capacity unlocked the model; the model’s architecture decided whether the bandwidth was enough.

The 2026 pricing wrinkle worth knowing

One market detail reshapes this decision right now. The 2026 DRAM and LPDDR shortage pushed memory prices up across the board, and it hit the high-capacity tiers hardest because they use the most chips. The clearest sign is Apple: in early 2026 it removed the 512GB unified-memory option on the M3 Ultra Mac Studio and raised the price of the 256GB configuration by roughly $400, a move widely read as a response to rising memory costs. This matters because the whole appeal of unified memory is cheap capacity, and the shortage narrows that advantage at the very top of the range.

For most buyers the practical effect is modest, since a 128GB Strix Halo box such as the GMKtec EVO-X2 still offers far more model memory per dollar than stacking graphics cards. But if your plan depended on the largest Apple memory tiers, confirm the configuration is still sold and check the current price before committing, because the lineup has shifted under the shortage.

How to pick between them for your workload

The decision comes down to what you run, not which spec sheet looks more impressive. Work through it in this order.

If everything you run fits in 24GB or 32GB and the models are dense, buy a discrete graphics card. Nothing in the unified-memory world will match a card’s bandwidth on a model that fits its VRAM, and the card costs less than a high-memory box. A used RTX 3090 or an RTX 5090 covers most local work up to 32B comfortably. The 3090-versus-5090 comparison walks through that specific choice.

If you want to run models larger than a graphics card can hold, especially big mixture-of-experts models, buy a unified-memory machine. A 128GB Strix Halo box holds and runs MoE models in the 100B-plus range at usable speed, and it does so at low power in a small chassis. Accept that dense models in the 70B range will run, but slowly. Buy a high-bandwidth Mac instead if you live in macOS and want the fastest dense-model speed in the unified-memory class, accepting the higher price and the reduced top memory tier.

If you are torn between the two, the cleanest test is the model architecture. Dense models reward bandwidth, so they favor a GPU on anything that fits and a Mac on anything that does not. Sparse MoE models reward capacity with modest active bandwidth, which is exactly what unified memory is good at regardless of brand. Match the machine to the models you actually plan to run, and the rest of the spec comparison falls into place.
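The decision order above can be written down as a small function. This is only a sketch of the guide’s rule of thumb: the 32GB default, the 4GB `headroom_gb` allowance, and the function name are illustrative assumptions, and price, power, and operating system preferences are deliberately left out.

```python
def suggest_machine(weights_gb: float, is_moe: bool,
                    vram_gb: float = 32.0, headroom_gb: float = 4.0) -> str:
    """Rule-of-thumb hardware class for a model, following the order in the text."""
    # 1. Anything that fits in VRAM with headroom runs fastest on a discrete GPU.
    if weights_gb + headroom_gb <= vram_gb:
        return "discrete GPU"
    # 2. Too big for the card and sparse: capacity matters more than bandwidth.
    if is_moe:
        return "unified memory (any brand)"
    # 3. Too big for the card and dense: pick the highest-bandwidth unified pool.
    return "high-bandwidth unified memory (Mac)"


print(suggest_machine(20, is_moe=False))   # a 32B at Q4
print(suggest_machine(42, is_moe=False))   # a dense 70B at Q4
print(suggest_machine(70, is_moe=True))    # a 100B-plus MoE at Q4
```

Changing `vram_gb` to 24 models the case of a used RTX 3090 rather than an RTX 5090.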

Common misconceptions about unified memory and VRAM

A few beliefs trip up buyers repeatedly, and clearing them up saves real money.

“Unified memory is just shared system RAM, so it must be slow.” Partly true and partly not. It is shared, but the bandwidth varies enormously by design. Strix Halo’s 256 GB/s is real and far above a typical laptop’s dual-channel DDR5; Apple’s top configurations clear 800 GB/s. The point is not that unified memory is slow; it is that unified memory is slower than dedicated VRAM while offering far more capacity.

“More memory always means faster inference.” The L40S in our table disproves this directly: 48GB of slower GDDR6 lost to 32GB of faster GDDR7 on every model both could run. Capacity decides whether a model loads; bandwidth decides how fast it runs once loaded. They are independent.

“A 128GB box runs a 70B as fast as a graphics card would.” No. The box holds the 70B that a consumer card cannot, which is genuinely useful, but it generates at roughly 5 tokens per second from a 256 GB/s pool, where a card that fit the model would feed it from 1 TB/s or more. The 48GB L40S in our table, for example, ran a 70B at 16 tokens per second. The unified box wins on possibility, not speed, for dense models.

The mental model that ties it together: think of capacity as the door and bandwidth as the hallway behind it. A model has to fit through the door before anything else happens, and unified memory has the widest doors at a reasonable price. But once inside, how fast you move depends on the hallway, and dedicated VRAM has the widest hallways. Dense models race down the hallway and need it wide; sparse MoE models barely use it and just need the big door. Buy for whichever the models you run actually depend on.
