The open-source LLM landscape has shifted dramatically. Models like Qwen 3.5, DeepSeek V3.2, GLM-5, and Llama 4 now match or beat proprietary alternatives on key benchmarks, and you can run them on your own hardware. Two years ago, open-weight models were curiosities. Today, they power production workloads at companies that don’t want to send their data to someone else’s API.
This reference covers every major open-source and open-weight large language model available as of March 2026, with verified benchmark scores, license terms, hardware requirements, and hands-on performance data from real self-hosting tests. The tables below compare architecture details, benchmark results, licensing restrictions, and what each model actually needs to run on your own machine using Ollama.
Current as of March 2026. Benchmark data sourced from official model papers and the Hugging Face Open LLM Leaderboard. Self-hosting tests run on Ubuntu 24.04 LTS, 4 vCPUs, 16 GB RAM, CPU-only inference via Ollama.
Master Comparison Table
This table covers every major open-source/open-weight LLM family. “Active params” is the number of parameters a Mixture-of-Experts (MoE) model activates per token; dense models use all parameters on every token.
| Model | Developer | Total Params | Active Params | Architecture | Context Window | Multimodal | License | Release |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.5 397B-A17B | Alibaba | 397B | 17B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 122B-A10B | Alibaba | 122B | 10B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 27B | Alibaba | 27B | 27B | Dense | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3 235B | Alibaba | 235B | 22B | MoE (128e, 8 active) | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 32B | Alibaba | 32B | 32B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 8B | Alibaba | 8B | 8B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| GLM-5 | Zhipu AI | 744B | 40B | MoE | 205K tokens | Text + Image | MIT | Feb 2026 |
| DeepSeek V3.2 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Dec 2025 |
| DeepSeek R1 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| DeepSeek V3 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 3.3 | Meta | 70B | 70B | Dense | 128K tokens | No | Llama 3.3 Community | Dec 2024 |
| Mistral Small 4 | Mistral AI | 119B | 6B | MoE (128e, 4 active) | 256K tokens | Text + Image | Apache 2.0 | Mar 2026 |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256K tokens | Text + Image | Apache 2.0 | Dec 2025 |
| Gemma 3 27B | Google | 27B | 27B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 12B | Google | 12B | 12B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 4B | Google | 4B | 4B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Phi-4 Reasoning Vision | Microsoft | 15B | 15B | Dense | 16K tokens | Text + Image | MIT | Mar 2026 |
| Phi-4 | Microsoft | 14B | 14B | Dense | 16K tokens | No | MIT | Jan 2025 |
| Phi-4 Mini | Microsoft | 3.8B | 3.8B | Dense | 128K tokens | No | MIT | Jan 2025 |
| Mixtral 8x7B | Mistral AI | 46.7B | 12.9B | MoE (8 experts) | 32K tokens | No | Apache 2.0 | 2023 |
| Command R+ | Cohere | 104B | 104B | Dense | 128K tokens | No | CC-BY-NC | 2024 |
| Command A | Cohere | 111B | 111B | Dense | 256K tokens | No | CC-BY-NC | Mar 2025 |
| Falcon 3 10B | TII Abu Dhabi | 10B | 10B | Dense | 32K tokens | No | TII Falcon 2.0 | Dec 2024 |
| DBRX | Databricks | 132B | 36B | MoE | 32K tokens | No | Databricks Open Model | Mar 2024 |
| Grok-1 | xAI | 314B | N/A | MoE | N/A | No | Apache 2.0 | Mar 2024 |
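A quick way to reason about these numbers: weight memory scales with total parameters, not active ones, so an MoE model is cheap to run per token but still expensive to host, because every expert must be resident. A back-of-envelope sketch (the 4-bit quantization default is an assumption matching Ollama's common builds; real deployments add overhead for embeddings, activations, and the KV cache):

```python
def est_weight_gb(total_params_b: float, bits_per_weight: float = 4.0) -> float:
    """Rough footprint of model weights alone, in GB.

    total_params_b:  parameter count in billions (use TOTAL, not active).
    bits_per_weight: quantization level; 4-bit is a common Ollama default.
    Excludes KV cache and runtime overhead, so treat this as a floor.
    """
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# MoE models must hold ALL experts in memory even though only the active
# subset runs per token: DeepSeek V3.2 (671B total, 37B active) still
# needs ~336 GB of weights at 4-bit, not ~18 GB.
print(f"DeepSeek V3.2 @ 4-bit: ~{est_weight_gb(671):.0f} GB")
print(f"Gemma 3 4B  @ 4-bit: ~{est_weight_gb(4):.1f} GB")
```

This is why the active-parameter column predicts speed and cost per token, while the total-parameter column predicts what hardware you need to load the model at all.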
Benchmark Scores
Benchmarks tell part of the story. MMLU tests general knowledge, GPQA Diamond tests graduate-level reasoning, AIME covers competition math, MATH-500 tests problem solving, and SWE-bench Verified measures real-world coding ability. All scores below come from official model papers or verified third-party evaluations. Cells marked N/A mean the score hasn’t been published or independently verified.
| Model | MMLU | MMLU-Pro | GPQA Diamond | AIME ’24 | MATH-500 | SWE-bench Verified |
|---|---|---|---|---|---|---|
| GLM-5 | N/A | N/A | N/A | N/A | N/A | 77.8% |
| DeepSeek V3.2-Speciale | N/A | N/A | N/A | N/A | N/A | N/A |
| Qwen 3 235B | N/A | 83.6% | 77.2% | 85.7% | N/A | N/A |
| DeepSeek R1 | N/A | 84.0% | 71.5% | 79.8% | 97.3% | N/A |
| Llama 4 Maverick | 85.5% | N/A | 69.8% | N/A | N/A | N/A |
| Llama 4 Scout | 79.6% | N/A | N/A | N/A | N/A | N/A |
| Gemma 3 27B | 78.6% | N/A | N/A | N/A | 50.0% | N/A |
| Mistral Small 4 | N/A | N/A | N/A | N/A | N/A | N/A |
A few things jump out. Qwen 3 235B leads on GPQA Diamond (77.2%) and AIME ’24 (85.7%), making it the strongest open-weight model for reasoning and math. DeepSeek R1 dominates MATH-500 at 97.3%, which is near-perfect. GLM-5 posts 77.8% on SWE-bench Verified, the strongest coding benchmark result among open models. DeepSeek V3.2-Speciale achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals, though official benchmark numbers haven’t been published yet. Llama 4 Maverick posts the highest raw MMLU at 85.5%, but MMLU alone doesn’t capture reasoning depth.
License Comparison
Licensing is where “open source” gets complicated. Some models are truly permissive (Apache 2.0, MIT), while others come with usage caps, geographic restrictions, or prohibitions on training derivative models. Read the fine print before building a product on any of these.
| License | Models | Commercial Use | Key Restrictions |
|---|---|---|---|
| Apache 2.0 | Qwen 3/3.5 (all), Mistral Large 3, Mistral Small 4, Mixtral 8x7B, Mistral 7B, Grok-1 | Yes, unrestricted | None |
| MIT | DeepSeek V3/V3.2/R1, Phi-4 (all variants), GLM-5 | Yes, unrestricted | None |
| Llama 4 Community | Llama 4 Scout, Llama 4 Maverick | Yes, free under 700M MAU | EU multimodal restrictions; Meta license required above 700M monthly active users |
| Llama 3.3 Community | Llama 3.3 70B | Yes, free under 700M MAU | Same MAU threshold as Llama 4 |
| Gemma | Gemma 3 (all sizes) | Yes (requires agreement) | Must accept Google’s terms; commercial use permitted after agreement |
| CC-BY-NC | Command R+, Command A | No | Non-commercial only; separate agreement required for commercial deployment |
| TII Falcon 2.0 | Falcon 3 (all sizes) | Yes, under $1M revenue | 10% royalty above $1M revenue |
| Databricks Open Model | DBRX | Yes | Cannot use to train other LLMs |
If license flexibility is your top priority, Qwen 3/3.5 under Apache 2.0, DeepSeek under MIT, or GLM-5 under MIT are the safest choices. You can do whatever you want with them, including fine-tuning and commercial deployment with zero royalties. Both Mistral Large 3 and Mistral Small 4 now ship under Apache 2.0, a significant shift from Mistral’s earlier restrictive licensing. The Llama licenses look permissive at first glance, but the 700M MAU cap and EU restrictions matter for larger operations.
Self-Hosting Resource Requirements
Benchmarks don’t tell you how a model feels when you’re actually running it. We tested six popular small/medium models on a modest Ubuntu 24.04 VM (4 vCPUs, 16 GB RAM, CPU-only inference) using Ollama. Each model answered the same prompt to keep things consistent.
| Model | Ollama Tag | Disk Size | RAM Usage | Response Time (CPU) | Notes |
|---|---|---|---|---|---|
| Llama 3.2 3B | llama3.2:3b | 2.0 GB | 11.4 GB | 88s | Clear, well-structured responses |
| Gemma 3 4B | gemma3:4b | 3.3 GB | 4.2 GB | 94s | Clean, structured, concise output |
| Phi-4 Mini 3.8B | phi4-mini | 2.5 GB | 8.9 GB | 97s | Good reasoning, occasional formatting artifacts |
| Mistral 7B | mistral:7b | 4.4 GB | 7.4 GB | 125s | Concise, accurate |
| Qwen 3 8B | qwen3:8b | 5.2 GB | 5.8 GB | 433s | Thinking mode adds latency; strong final answers |
| DeepSeek R1 8B | deepseek-r1:8b | 5.2 GB | 5.8 GB | 433s | Chain-of-thought reasoning; slow on CPU |
Gemma 3 4B stands out for RAM efficiency at just 4.2 GB, making it the best fit for memory-constrained environments. Llama 3.2 3B is the fastest responder at 88 seconds but uses a surprising 11.4 GB of RAM. The reasoning models (Qwen 3 8B and DeepSeek R1 8B) took over 7 minutes each because their chain-of-thought process generates significantly more tokens before producing a final answer. On a GPU, those times drop dramatically.
Pull and run any of these models with a single command:
```shell
ollama run gemma3:4b
```
Check our Ollama commands cheat sheet for the full list of management commands, and browse the Ollama model library for all available tags and quantization options.
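To reproduce timing numbers like those in the table without the interactive CLI, you can hit Ollama's local REST API directly. A minimal sketch, assuming a running `ollama serve` on the default port; `eval_count` is the generated-token count that the `/api/generate` endpoint reports:

```python
import json
import time
import urllib.request

def summarize(tokens: int, wall_s: float) -> dict:
    """Reduce one run to the numbers worth comparing across models."""
    rate = round(tokens / wall_s, 1) if wall_s > 0 else 0.0
    return {"wall_s": round(wall_s, 1), "tokens": tokens, "tok_per_s": rate}

def time_model(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """One non-streaming generation request, timed end to end."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return summarize(data.get("eval_count", 0), time.monotonic() - t0)

# Usage (requires `ollama serve` and the models pulled); same prompt for
# every model keeps the comparison fair:
#   for tag in ("gemma3:4b", "llama3.2:3b"):
#       print(tag, time_model(tag, "Explain MoE routing in two sentences."))
```

Wall-clock time per identical prompt, as used here, is the simplest fair basis for cross-model comparisons like the table above.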
Qwen 3 and Qwen 3.5 (Alibaba)
Qwen is arguably the most versatile open-source model family available. The Qwen 3 series, released under Apache 2.0 with no usage restrictions, spans from a 0.6B edge model to a 235B MoE flagship that competes with the best proprietary models. The 235B variant posts 85.7% on AIME ’24 and 77.2% on GPQA Diamond, both top scores among open models.
Qwen 3.5, released in waves from February to March 2026, represents a generational leap. The entire family is now natively multimodal (text + vision trained jointly from the start, not bolted on after the fact). Key upgrades over Qwen 3 include 256K context windows (up from 128K), support for 201 languages (up from 119), and significantly improved agentic coding capabilities. The flagship 397B-A17B MoE model activates only 17B parameters per token while delivering performance that rivals closed-source alternatives. The 27B dense variant is particularly compelling for teams that want high quality without MoE serving complexity.
One feature that sets the Qwen family apart is its toggleable thinking mode. You can enable chain-of-thought reasoning when you need it (math, logic, coding) and disable it for faster responses on straightforward queries. This flexibility means a single model serves both use cases. All Qwen 3 and 3.5 models remain Apache 2.0 licensed.
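In practice, Qwen documents per-turn soft switches for this: appending `/think` or `/no_think` to a prompt toggles chain-of-thought for that request. A sketch against Ollama's local API (the soft-switch tags are Qwen 3's documented convention; whether Qwen 3.5 keeps the exact same tags is an assumption here):

```python
import json
import urllib.request

def with_mode(prompt: str, thinking: bool) -> str:
    """Append Qwen's soft switch: /think enables chain-of-thought for
    this turn, /no_think skips it for a faster direct answer."""
    return f"{prompt} {'/think' if thinking else '/no_think'}"

def ask_qwen(prompt: str, thinking: bool, host: str = "http://localhost:11434") -> str:
    """Non-streaming request to a locally served Qwen model."""
    body = json.dumps({
        "model": "qwen3:8b",
        "prompt": with_mode(prompt, thinking),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires `ollama serve` with qwen3:8b pulled):
#   ask_qwen("What is the capital of France?", thinking=False)   # fast path
#   ask_qwen("Prove that sqrt(2) is irrational.", thinking=True)  # deep path
```

Routing simple lookups to the non-thinking path is what makes the single-model approach pay off: you only spend the extra chain-of-thought tokens when the task warrants them.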
GLM-5 (Zhipu AI)
GLM-5, released in February 2026 by Chinese AI lab Zhipu AI, is one of the most significant open-source releases of the year. At 744B total parameters with 40B active, it’s a large MoE model under the MIT license with no usage restrictions.
What makes GLM-5 notable beyond its size is its training infrastructure: the entire model was trained on 100,000 Huawei Ascend 910B chips with no US-manufactured hardware involved. This matters for organizations operating under export control constraints. The model posts 50.4% on Humanity’s Last Exam and 77.8% on SWE-bench Verified, the latter being the strongest coding benchmark result among open models.
GLM-5 supports a 205K token context window and multimodal input. Zhipu AI has announced GLM-5.1, an open-source successor, though no release date has been confirmed. The closed-source GLM-5-Turbo launched in March 2026 for API access.
DeepSeek R1, V3, and V3.2
DeepSeek made waves in January 2025 with two models under the MIT license. V3 is a general-purpose 671B MoE model (37B active), while R1 uses the same architecture but specializes in step-by-step reasoning. R1’s 97.3% on MATH-500 is the highest score of any open model on that benchmark, and its 84.0% on MMLU-Pro puts it ahead of most competitors on professional-level knowledge tasks.
DeepSeek V3.2, released in December 2025, is the first model to integrate thinking directly into tool-use workflows. The companion V3.2-Speciale variant, designed for high-compute reasoning, achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals. Both remain MIT licensed.
The distilled versions (1.5B through 70B) make DeepSeek accessible on consumer hardware. The 8B distill, tested in our Ollama benchmarks above, delivers genuine chain-of-thought reasoning at a size that runs on a laptop. For a detailed walkthrough, see our guide on running DeepSeek R1 locally with Ollama.
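One practical note when scripting against the distills: the Ollama builds of R1 emit their reasoning inside `<think>...</think>` tags before the final answer, so downstream code usually wants to strip that block. A small helper (the tag format is as observed in the Ollama builds; treat it as an assumption for other packagings):

```python
import re

# R1-style chain-of-thought is wrapped in <think>...</think> before the answer.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    """Drop the chain-of-thought block(s), keep only the final answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>The user asks 2+2. Trivial arithmetic.</think>2 + 2 = 4."
print(strip_think(raw))  # → 2 + 2 = 4.
```

Keeping the raw text around for debugging while showing users only the stripped answer is a common pattern with reasoning models.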
DeepSeek V4 has been anticipated since early 2026 (expected to be a ~1T parameter model with 1M context and native multimodal support), but multiple rumored release windows have passed without an announcement. Current expectation is Q2 or Q3 2026.
Llama 4 (Meta)
Meta’s Llama 4 family, released in April 2025, introduced MoE architecture to the Llama line for the first time. Both Scout and Maverick use only 17B active parameters per token despite having 109B and 400B total parameters respectively. The practical benefit is that you get large-model quality with smaller-model inference costs.
Scout’s headline feature is its 10 million token context window, the longest of any open model by a wide margin. Maverick, with 128 experts and a 1M context window, targets production deployments where quality matters more than context length. Both are natively multimodal (text and image).
Behemoth, the flagship at roughly 2 trillion total parameters (288B active), was originally expected to follow Scout and Maverick. However, reports from May 2025 indicate Meta paused its release after internal evaluations showed performance improvements were incremental. As of March 2026, Meta has not officially cancelled Behemoth but has provided no release timeline.
The Llama 4 Community License is free for organizations under 700 million monthly active users. There’s an important catch for European users: the Acceptable Use Policy explicitly excludes multimodal model rights for individuals or companies based in the EU. Since all Llama 4 models are natively multimodal, this effectively restricts the entire Llama 4 family in the EU, likely a preemptive response to the EU AI Act’s transparency and training data disclosure requirements. Llama 3.3 (70B dense, text-only) remains unaffected by this restriction and is still popular because it fits on a single high-end GPU without MoE-aware serving infrastructure.
Gemma 3 (Google)
Google positioned Gemma 3 for on-device and edge deployment, and the resource numbers from our testing confirm it. The 4B variant uses just 4.2 GB of RAM and responds in 94 seconds on CPU, the best efficiency ratio of any model we tested. The 27B flagship hits 78.6% on MMLU and 1338 Elo on Chatbot Arena running on a single H100 GPU, impressive for a dense model at that size.
All Gemma 3 models from 4B up support multimodal input (text and images), which is useful for building applications that need vision capabilities without the infrastructure costs of a 400B model. Google also released FunctionGemma 270M in December 2025, a tiny model optimized specifically for function calling on mobile and IoT devices.
The Gemma license requires accepting Google’s terms but permits commercial use after that. The 1B model, with its 32K context window, targets mobile and IoT scenarios where every megabyte counts.
Mistral Models
Mistral AI spans the full range from tiny to massive. The original Mistral 7B and Mixtral 8x7B, both under Apache 2.0, remain among the most deployed open models for self-hosting. Mistral 7B is a reliable general-purpose model that runs comfortably on modest hardware. Mixtral 8x7B, with 46.7B total parameters but only 12.9B active, demonstrated early on that MoE could deliver outsized quality at reasonable compute costs.
Mistral Large 3 (December 2025) is a different beast: 675B total parameters, 41B active, with multimodal support covering text and images across 80+ languages. It ships under Apache 2.0.
Mistral Small 4, released in March 2026, is the most interesting recent addition. At 119B total parameters with 128 experts and only 4 active per token (6B active parameters), it unifies instruction following, reasoning (via configurable depth), and multimodal capabilities into a single model. It supports a 256K context window and ships under Apache 2.0, a significant licensing upgrade from Mistral’s previous restrictive licenses. It combines the capabilities of three previously separate models: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding).
Phi-4 (Microsoft)
Microsoft’s Phi-4 family proves that smaller models can punch above their weight on specific tasks. The 14B base model and its reasoning-focused variants excel at math and logic problems, consistently outperforming larger models on those benchmarks. Phi-4 Mini at 3.8B with a 128K context window is one of the best options for resource-constrained deployments that still need long-context capabilities.
Phi-4 Reasoning Vision (15B), released March 4, 2026, adds image understanding to the reasoning pipeline. Built on the Phi-4-Reasoning backbone with a SigLIP-2 vision encoder, it supports dynamic resolution with up to 3,600 visual tokens for GUI grounding and document analysis. One clever design choice: the model knows when to engage deep reasoning and when thinking is unnecessary, adapting its compute budget per query. All Phi-4 variants ship under the MIT license.
In our Ollama testing, Phi-4 Mini used 8.9 GB of RAM, which is higher than expected for a 3.8B model, likely due to its 128K context window allocation. Response quality was solid, though we noticed occasional formatting artifacts in structured output.
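If that overhead matters, Ollama lets you cap the context allocation per request via the `num_ctx` option, trading long-context headroom for a smaller KV cache. A sketch of the request payload (field names follow Ollama's `/api/generate` options; the 8K figure is an arbitrary example, not a recommendation):

```python
import json

def build_request(model: str, prompt: str, num_ctx: int) -> bytes:
    """JSON body for Ollama's /api/generate with a capped context window.
    A smaller num_ctx shrinks the KV-cache allocation and resident RAM."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).encode()

# e.g. cap phi4-mini to 8K instead of its full 128K window:
body = build_request("phi4-mini", "Summarize the following paragraph.", 8192)
# POST `body` to http://localhost:11434/api/generate with urllib or requests.
```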
Command A and R+, Falcon 3, DBRX, and Grok
Command A (Cohere, 111B dense, March 2025) supersedes Command R+ for most use cases. It offers a 256K context window, runs on just 2 GPUs (A100/H100), and includes reasoning and vision variants. Like R+, it’s optimized for retrieval-augmented generation and tool use. The CC-BY-NC license limits it to non-commercial use without a separate agreement. Command R+ (104B, 128K context) remains available but is the older model.
Cohere also released Tiny Aya (3.35B, February 2026) under CC-BY-NC, supporting 70+ languages and designed for laptop and edge deployment, as well as Cohere Transcribe (2B, March 2026), an Apache 2.0-licensed speech recognition model that tops the Hugging Face Open ASR leaderboard across 14 languages.
Falcon 3 from TII Abu Dhabi offers models from 1B to 10B, trained on 14 trillion tokens. The Falcon 2.0 license is free for organizations under $1 million in revenue, with a 10% royalty above that threshold.
DBRX (Databricks, 132B total, 36B active) was an early high-quality MoE release in March 2024. Its license prohibits using it to train other LLMs but otherwise permits commercial deployment. It’s showing its age against newer models but remains relevant in Databricks-native workflows.
Grok-1 from xAI (314B MoE, Apache 2.0) was a surprise open release in March 2024. Grok-2.5 followed in August 2025 with open weights but a more restrictive license that prohibits using the weights to train other models. Elon Musk confirmed in February 2026 that xAI plans to open-source Grok 3, but the originally targeted February release date has passed with no weights published. Neither Grok model has the community tooling support (Ollama templates, vLLM integrations) that the more established families enjoy.
Which Model Should You Choose?
The “best” model depends entirely on what you’re building. Here’s a quick decision guide based on real use cases.
Best for reasoning and math: DeepSeek R1 (97.3% MATH-500) or Qwen 3 235B (85.7% AIME ’24). DeepSeek V3.2-Speciale won gold at IMO 2025, IOI 2025, and the ICPC World Finals. If you need the model to show its work and solve multi-step problems, the DeepSeek and Qwen families are the clear leaders.
Best for general-purpose chat: Qwen 3.5 397B-A17B or Llama 4 Maverick. Maverick has the highest MMLU (85.5%) among open models. Qwen 3.5 27B is a strong dense alternative that’s simpler to serve.
Best for coding: GLM-5 (77.8% SWE-bench Verified) or DeepSeek V3.2-Speciale. For a smaller option, Mistral Small 4 combines Devstral’s agentic coding capabilities in a 6B active parameter package.
Best for multilingual: Qwen 3.5 (201 languages) or Mistral Large 3 (80+ languages). For lightweight multilingual needs, Cohere’s Tiny Aya covers 70+ languages at just 3.35B parameters (CC-BY-NC license).
Best for edge and mobile: Gemma 3 4B (4.2 GB RAM) or Gemma 3 1B for extreme constraints. Google’s FunctionGemma 270M is purpose-built for function calling on IoT devices.
Best for RAG and tool use: Command A (256K context, grounding-optimized) or DeepSeek V3.2 (native thinking + tool-use integration). Check Cohere’s CC-BY-NC license if you’re building commercially.
Best for long context: Llama 4 Scout (10M tokens) is unmatched. For 256K, Qwen 3.5 or Mistral Small 4. For a practical 128K, DeepSeek V3.2 or Gemma 3 27B.
Best permissive license: Qwen 3/3.5 (Apache 2.0), DeepSeek (MIT), GLM-5 (MIT), or Mistral Small 4 (Apache 2.0). No usage caps, no royalties, no geographic restrictions.
Running These Models Locally with Ollama
Every model in the self-hosting table above can be pulled and run with a single Ollama command. Install Ollama on your system first:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Then pull and interact with any model:
```shell
ollama run qwen3:8b
```
On a GPU-equipped machine, Ollama automatically uses CUDA or ROCm when available, cutting response times from minutes to seconds. On CPU-only systems, stick with models under 8B parameters for usable response times.
For a full setup walkthrough, see our guide on installing Ollama on Rocky Linux and Ubuntu. If you want a ChatGPT-style web interface for your local models, Open WebUI provides exactly that.