
Open Source LLM Comparison Table (2026)

The open-source LLM landscape has shifted dramatically. Models like Qwen 3.5, DeepSeek V3.2, GLM-5, and Llama 4 now match or beat proprietary alternatives on key benchmarks, and you can run them on your own hardware. Two years ago, open-weight models were curiosities. Today, they power production workloads at companies that don’t want to send their data to someone else’s API.

Original content from computingforgeeks.com - post 164556

This reference covers every major open-source and open-weight large language model available as of March 2026, with verified benchmark scores, license terms, hardware requirements, and hands-on performance data from real self-hosting tests. The tables below compare architecture details, benchmark results, licensing restrictions, and what each model actually needs to run on your own machine using Ollama.

Current as of March 2026. Benchmark data sourced from official model papers and the Hugging Face Open LLM Leaderboard. Self-hosting tests run on Ubuntu 24.04 LTS, 4 vCPUs, 16 GB RAM, CPU-only inference via Ollama.

Master Comparison Table

This table covers every major open-source/open-weight LLM family. “Active params” refers to how many parameters are used per inference pass in Mixture-of-Experts (MoE) architectures. Dense models use all parameters on every token.

| Model | Developer | Total Params | Active Params | Architecture | Context Window | Multimodal | License | Release |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.5 397B-A17B | Alibaba | 397B | 17B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 122B-A10B | Alibaba | 122B | 10B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 27B | Alibaba | 27B | 27B | Dense | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3 235B | Alibaba | 235B | 22B | MoE (128e, 8 active) | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 32B | Alibaba | 32B | 32B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 8B | Alibaba | 8B | 8B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| GLM-5 | Zhipu AI | 744B | 40B | MoE | 205K tokens | Text + Image | MIT | Feb 2026 |
| DeepSeek V3.2 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Dec 2025 |
| DeepSeek R1 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| DeepSeek V3 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 3.3 | Meta | 70B | 70B | Dense | 128K tokens | No | Llama 3.3 Community | Dec 2024 |
| Mistral Small 4 | Mistral AI | 119B | 6B | MoE (128e, 4 active) | 256K tokens | Text + Image | Apache 2.0 | Mar 2026 |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256K tokens | Text + Image | Apache 2.0 | Dec 2025 |
| Gemma 3 27B | Google | 27B | 27B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 12B | Google | 12B | 12B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 4B | Google | 4B | 4B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Phi-4 Reasoning Vision | Microsoft | 15B | 15B | Dense | 16K tokens | Text + Image | MIT | Mar 2026 |
| Phi-4 | Microsoft | 14B | 14B | Dense | 16K tokens | No | MIT | Jan 2025 |
| Phi-4 Mini | Microsoft | 3.8B | 3.8B | Dense | 128K tokens | No | MIT | Jan 2025 |
| Mixtral 8x7B | Mistral AI | 46.7B | 12.9B | MoE (8 experts) | 32K tokens | No | Apache 2.0 | 2023 |
| Command R+ | Cohere | 104B | 104B | Dense | 128K tokens | No | CC-BY-NC | 2024 |
| Command A | Cohere | 111B | 111B | Dense | 256K tokens | No | CC-BY-NC | Mar 2025 |
| Falcon 3 10B | TII Abu Dhabi | 10B | 10B | Dense | 32K tokens | No | TII Falcon 2.0 | Dec 2024 |
| DBRX | Databricks | 132B | 36B | MoE | 32K tokens | No | Databricks Open Model | Mar 2024 |
| Grok-1 | xAI | 314B | N/A | MoE | N/A | No | Apache 2.0 | Mar 2024 |
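To make the active-vs-total distinction concrete, here is a back-of-envelope sketch (my own rule of thumb, not a figure from the table): disk and RAM footprint scale with total parameters, while per-token compute scales with active parameters. It assumes roughly 0.5 bytes per parameter at 4-bit quantization.

```shell
# Back-of-envelope: why MoE models are cheap to run per token but
# expensive to store. Assumes ~0.5 bytes/param at 4-bit quantization,
# a rough rule of thumb, not an official figure.
total_b=397    # Qwen 3.5 397B-A17B: total parameters (billions)
active_b=17    # parameters activated per token

disk_gb=$(( total_b / 2 ))                  # approx. weights on disk
active_pct=$(( 100 * active_b / total_b ))  # share of weights used per token

echo "Approx. 4-bit weights: ${disk_gb} GB"
echo "Compute per token touches ~${active_pct}% of the parameters"
```

In other words, you still need hardware that can hold all 397B parameters, but each token only pays the compute cost of a 17B model.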

Benchmark Scores

Benchmarks tell part of the story. MMLU tests general knowledge, GPQA Diamond tests graduate-level reasoning, AIME covers competition math, MATH-500 tests problem solving, and SWE-bench Verified measures real-world coding ability. All scores below come from official model papers or verified third-party evaluations. Cells marked N/A mean the score hasn’t been published or independently verified.

| Model | MMLU | MMLU-Pro | GPQA Diamond | AIME ’24 | MATH-500 | SWE-bench Verified |
|---|---|---|---|---|---|---|
| GLM-5 | N/A | N/A | N/A | N/A | N/A | 77.8% |
| DeepSeek V3.2-Speciale | N/A | N/A | N/A | N/A | N/A | N/A |
| Qwen 3 235B | N/A | 83.6% | 77.2% | 85.7% | N/A | N/A |
| DeepSeek R1 | N/A | 84.0% | 71.5% | 79.8% | 97.3% | N/A |
| Llama 4 Maverick | 85.5% | N/A | 69.8% | N/A | N/A | N/A |
| Llama 4 Scout | 79.6% | N/A | N/A | N/A | N/A | N/A |
| Gemma 3 27B | 78.6% | N/A | N/A | N/A | 50.0% | N/A |
| Mistral Small 4 | N/A | N/A | N/A | N/A | N/A | N/A |

A few things jump out. Qwen 3 235B leads on GPQA Diamond (77.2%) and AIME ’24 (85.7%), making it the strongest open-weight model for reasoning and math. DeepSeek R1 dominates MATH-500 at 97.3%, which is near-perfect. GLM-5 posts 77.8% on SWE-bench Verified, the strongest coding benchmark result among open models. DeepSeek V3.2-Speciale achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals, though official benchmark numbers haven’t been published yet. Llama 4 Maverick posts the highest raw MMLU at 85.5%, but MMLU alone doesn’t capture reasoning depth.

License Comparison

Licensing is where “open source” gets complicated. Some models are truly permissive (Apache 2.0, MIT), while others come with usage caps, geographic restrictions, or prohibitions on training derivative models. Read the fine print before building a product on any of these.

| License | Models | Commercial Use | Key Restrictions |
|---|---|---|---|
| Apache 2.0 | Qwen 3/3.5 (all), Mistral Large 3, Mistral Small 4, Mixtral 8x7B, Mistral 7B, Grok-1 | Yes, unrestricted | None |
| MIT | DeepSeek V3/V3.2/R1, Phi-4 (all variants), GLM-5 | Yes, unrestricted | None |
| Llama 4 Community | Llama 4 Scout, Llama 4 Maverick | Yes, free under 700M MAU | EU multimodal restrictions; Meta license required above 700M monthly active users |
| Llama 3.3 Community | Llama 3.3 70B | Yes, free under 700M MAU | Same MAU threshold as Llama 4 |
| Gemma | Gemma 3 (all sizes) | Yes (requires agreement) | Must accept Google’s terms; commercial use permitted after agreement |
| CC-BY-NC | Command R+, Command A | No | Non-commercial only; separate agreement required for commercial deployment |
| TII Falcon 2.0 | Falcon 3 (all sizes) | Yes, under $1M revenue | 10% royalty above $1M revenue |
| Databricks Open Model | DBRX | Yes | Cannot use to train other LLMs |

If license flexibility is your top priority, Qwen 3/3.5 under Apache 2.0, DeepSeek under MIT, or GLM-5 under MIT are the safest choices. You can do whatever you want with them, including fine-tuning and commercial deployment with zero royalties. Both Mistral Large 3 and Mistral Small 4 now ship under Apache 2.0, a significant shift from Mistral’s earlier restrictive licensing. The Llama licenses look permissive at first glance, but the 700M MAU cap and EU restrictions matter for larger operations.

Self-Hosting Resource Requirements

Benchmarks don’t tell you how a model feels when you’re actually running it. We tested six popular small/medium models on a modest Ubuntu 24.04 VM (4 vCPUs, 16 GB RAM, CPU-only inference) using Ollama. Each model answered the same prompt to keep things consistent.

| Model | Ollama Tag | Disk Size | RAM Usage | Response Time (CPU) | Notes |
|---|---|---|---|---|---|
| Llama 3.2 3B | llama3.2:3b | 2.0 GB | 11.4 GB | 88s | Clear, well-structured responses |
| Gemma 3 4B | gemma3:4b | 3.3 GB | 4.2 GB | 94s | Clean, structured, concise output |
| Phi-4 Mini 3.8B | phi4-mini | 2.5 GB | 8.9 GB | 97s | Good reasoning, occasional formatting artifacts |
| Mistral 7B | mistral:7b | 4.4 GB | 7.4 GB | 125s | Concise, accurate |
| Qwen 3 8B | qwen3:8b | 5.2 GB | 5.8 GB | 433s | Thinking mode adds latency; strong final answers |
| DeepSeek R1 8B | deepseek-r1:8b | 5.2 GB | 5.8 GB | 433s | Chain-of-thought reasoning; slow on CPU |

Gemma 3 4B stands out for RAM efficiency at just 4.2 GB, making it the best fit for memory-constrained environments. Llama 3.2 3B is the fastest responder at 88 seconds but uses a surprising 11.4 GB of RAM. The reasoning models (Qwen 3 8B and DeepSeek R1 8B) took over 7 minutes each because their chain-of-thought process generates significantly more tokens before producing a final answer. On a GPU, those times drop dramatically.

Pull and run any of these models with a single command:

ollama run gemma3:4b

Check our Ollama commands cheat sheet for the full list of management commands, and browse the Ollama model library for all available tags and quantization options.
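Beyond the interactive CLI, Ollama also serves a local REST API (by default on port 11434), which is what you’d script against in automation. A minimal sketch follows; since it needs a running Ollama server, the actual `curl` call is left commented out.

```shell
# Build a request body for Ollama's /api/generate endpoint (default local
# port 11434). "stream": false asks for the full response as one JSON
# object instead of a token stream.
payload='{"model": "gemma3:4b", "prompt": "Explain MoE in one sentence.", "stream": false}'
echo "$payload"

# With a local Ollama server running, uncomment:
# curl -s http://localhost:11434/api/generate -d "$payload"
```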

Qwen 3 and Qwen 3.5 (Alibaba)

Qwen is arguably the most versatile open-source model family available. The Qwen 3 series, released under Apache 2.0 with no usage restrictions, spans from a 0.6B edge model to a 235B MoE flagship that competes with the best proprietary models. The 235B variant posts 85.7% on AIME ’24 and 77.2% on GPQA Diamond, both top scores among open models.

Qwen 3.5, released in waves from February to March 2026, represents a generational leap. The entire family is now natively multimodal (text + vision trained jointly from the start, not bolted on after the fact). Key upgrades over Qwen 3 include 256K context windows (up from 128K), support for 201 languages (up from 119), and significantly improved agentic coding capabilities. The flagship 397B-A17B MoE model activates only 17B parameters per token while delivering performance that rivals closed-source alternatives. The 27B dense variant is particularly compelling for teams that want high quality without MoE serving complexity.

One feature that sets the Qwen family apart is its toggleable thinking mode. You can enable chain-of-thought reasoning when you need it (math, logic, coding) and disable it for faster responses on straightforward queries. This flexibility means a single model serves both use cases. All Qwen 3 and 3.5 models remain Apache 2.0 licensed.
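With Ollama, the toggle can be exercised from the prompt itself. Qwen’s model cards describe `/think` and `/no_think` soft switches; treat the exact syntax as an assumption to verify against the model card for your specific tag.

```shell
# Sketch of Qwen 3's thinking-mode soft switches (syntax per Qwen's model
# cards; verify for your tag). /no_think skips chain-of-thought for fast
# answers; /think forces deliberate reasoning on.
fast_prompt="/no_think What is the capital of France?"
deep_prompt="/think Prove there are infinitely many primes."
echo "$fast_prompt"
echo "$deep_prompt"

# With the model pulled locally, uncomment:
# ollama run qwen3:8b "$fast_prompt"
```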

GLM-5 (Zhipu AI)

GLM-5, released in February 2026 by Chinese AI lab Zhipu AI, is one of the most significant open-source releases of the year. At 744B total parameters with 40B active, it’s a large MoE model under the MIT license with no usage restrictions.

What makes GLM-5 notable beyond its size is its training infrastructure: the entire model was trained on 100,000 Huawei Ascend 910B chips with no US-manufactured hardware involved. This matters for organizations operating under export control constraints. The model posts 50.4% on Humanity’s Last Exam and 77.8% on SWE-bench Verified, the latter being the strongest coding benchmark result among open models.

GLM-5 supports a 205K token context window and multimodal input. Zhipu AI has announced GLM-5.1, an open-source successor, though no release date has been confirmed. The closed-source GLM-5-Turbo launched in March 2026 for API access.

DeepSeek R1, V3, and V3.2

DeepSeek made waves in January 2025 with two models under the MIT license. V3 is a general-purpose 671B MoE model (37B active), while R1 uses the same architecture but specializes in step-by-step reasoning. R1’s 97.3% on MATH-500 is the highest score of any open model on that benchmark, and its 84.0% on MMLU-Pro puts it ahead of most competitors on professional-level knowledge tasks.

DeepSeek V3.2, released in December 2025, is the first model to integrate thinking directly into tool-use workflows. The companion V3.2-Speciale variant, designed for high-compute reasoning, achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals. Both remain MIT licensed.

The distilled versions (1.5B through 70B) make DeepSeek accessible on consumer hardware. The 8B distill, tested in our Ollama benchmarks above, delivers genuine chain-of-thought reasoning at a size that runs on a laptop. For a detailed walkthrough, see our guide on running DeepSeek R1 locally with Ollama.

DeepSeek V4 has been anticipated since early 2026 (expected to be a ~1T parameter model with 1M context and native multimodal support), but multiple rumored release windows have passed without an announcement. Current expectation is Q2 or Q3 2026.

Llama 4 (Meta)

Meta’s Llama 4 family, released in April 2025, introduced MoE architecture to the Llama line for the first time. Both Scout and Maverick use only 17B active parameters per token despite having 109B and 400B total parameters respectively. The practical benefit is that you get large-model quality with smaller-model inference costs.

Scout’s headline feature is its 10 million token context window, the longest of any open model by a wide margin. Maverick, with 128 experts and a 1M context window, targets production deployments where quality matters more than context length. Both are natively multimodal (text and image).

Behemoth, the flagship at roughly 2 trillion total parameters (288B active), was originally expected to follow Scout and Maverick. However, reports from May 2025 indicate Meta paused its release after internal evaluations showed performance improvements were incremental. As of March 2026, Meta has not officially cancelled Behemoth but has provided no release timeline.

The Llama 4 Community License is free for organizations under 700 million monthly active users. There’s an important catch for European users: the Acceptable Use Policy explicitly excludes multimodal model rights for individuals or companies based in the EU. Since all Llama 4 models are natively multimodal, this effectively restricts the entire Llama 4 family in the EU, likely a preemptive response to the EU AI Act’s transparency and training data disclosure requirements. Llama 3.3 (70B dense, text-only) remains unaffected by this restriction and is still popular because it fits on a single high-end GPU without MoE-aware serving infrastructure.

Gemma 3 (Google)

Google positioned Gemma 3 for on-device and edge deployment, and the resource numbers from our testing confirm it. The 4B variant uses just 4.2 GB of RAM and responds in 94 seconds on CPU, the best efficiency ratio of any model we tested. The 27B flagship hits 78.6% on MMLU and scores 1338 Elo on Chatbot Arena while running on a single H100 GPU, impressive for a dense model of that size.

All Gemma 3 models from 4B up support multimodal input (text and images), which is useful for building applications that need vision capabilities without the infrastructure costs of a 400B model. Google also released FunctionGemma 270M in December 2025, a tiny model optimized specifically for function calling on mobile and IoT devices.

The Gemma license requires accepting Google’s terms but permits commercial use after that. The 1B model, with its 32K context window, targets mobile and IoT scenarios where every megabyte counts.

Mistral Models

Mistral AI spans the full range from tiny to massive. The original Mistral 7B and Mixtral 8x7B, both under Apache 2.0, remain among the most deployed open models for self-hosting. Mistral 7B is a reliable general-purpose model that runs comfortably on modest hardware. Mixtral 8x7B, with 46.7B total parameters but only 12.9B active, demonstrated early on that MoE could deliver outsized quality at reasonable compute costs.

Mistral Large 3 (December 2025) is a different beast: 675B total parameters, 41B active, with multimodal support covering text and images across 80+ languages. It ships under Apache 2.0.

Mistral Small 4, released in March 2026, is the most interesting recent addition. At 119B total parameters with 128 experts and only 4 active per token (6B active parameters), it unifies instruction following, reasoning (via configurable depth), and multimodal capabilities into a single model. It supports a 256K context window and ships under Apache 2.0, a significant licensing upgrade from Mistral’s previous restrictive licenses. It combines the capabilities of three previously separate models: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding).

Phi-4 (Microsoft)

Microsoft’s Phi-4 family proves that smaller models can punch above their weight on specific tasks. The 14B base model and its reasoning-focused variants excel at math and logic problems, consistently outperforming larger models on those benchmarks. Phi-4 Mini at 3.8B with a 128K context window is one of the best options for resource-constrained deployments that still need long-context capabilities.

Phi-4 Reasoning Vision (15B), released March 4, 2026, adds image understanding to the reasoning pipeline. Built on the Phi-4-Reasoning backbone with a SigLIP-2 vision encoder, it supports dynamic resolution with up to 3,600 visual tokens for GUI grounding and document analysis. One clever design choice: the model knows when to engage deep reasoning and when thinking is unnecessary, adapting its compute budget per query. All Phi-4 variants ship under the MIT license.

In our Ollama testing, Phi-4 Mini used 8.9 GB of RAM, which is higher than expected for a 3.8B model, likely due to its 128K context window allocation. Response quality was solid, though we noticed occasional formatting artifacts in structured output.

Command A and R+, Falcon 3, DBRX, and Grok

Command A (Cohere, 111B dense, March 2025) supersedes Command R+ for most use cases. It offers a 256K context window, runs on just 2 GPUs (A100/H100), and includes reasoning and vision variants. Like R+, it’s optimized for retrieval-augmented generation and tool use. The CC-BY-NC license limits it to non-commercial use without a separate agreement. Command R+ (104B, 128K context) remains available but is the older model.

Cohere also released Tiny Aya (3.35B, February 2026) under CC-BY-NC, supporting 70+ languages and designed for laptop and edge deployment, as well as Cohere Transcribe (2B, March 2026), an Apache 2.0-licensed speech recognition model that tops the Hugging Face Open ASR leaderboard across 14 languages.

Falcon 3 from TII Abu Dhabi offers models from 1B to 10B, trained on 14 trillion tokens. The Falcon 2.0 license is free for organizations under $1 million in revenue, with a 10% royalty above that threshold.

DBRX (Databricks, 132B total, 36B active) was an early high-quality MoE release in March 2024. Its license prohibits using it to train other LLMs but otherwise permits commercial deployment. It’s showing its age against newer models but remains relevant in Databricks-native workflows.

Grok-1 from xAI (314B MoE, Apache 2.0) was a surprise open release in March 2024. Grok-2.5 followed in August 2025 with open weights but a more restrictive license that prohibits using the weights to train other models. Elon Musk confirmed in February 2026 that xAI plans to open-source Grok 3, but the originally targeted February release date has passed with no weights published. Neither Grok model has the community tooling support (Ollama templates, vLLM integrations) that the more established families enjoy.

Which Model Should You Choose?

The “best” model depends entirely on what you’re building. Here’s a quick decision guide based on real use cases.

Best for reasoning and math: DeepSeek R1 (97.3% MATH-500) or Qwen 3 235B (85.7% AIME ’24). DeepSeek V3.2-Speciale won gold at IMO 2025, IOI 2025, and the ICPC World Finals. If you need the model to show its work and solve multi-step problems, the DeepSeek and Qwen families are the clear leaders.

Best for general-purpose chat: Qwen 3.5 397B-A17B or Llama 4 Maverick. Maverick has the highest MMLU (85.5%) among open models. Qwen 3.5 27B is a strong dense alternative that’s simpler to serve.

Best for coding: GLM-5 (77.8% SWE-bench Verified) or DeepSeek V3.2-Speciale. For a smaller option, Mistral Small 4 combines Devstral’s agentic coding capabilities in a 6B active parameter package.

Best for multilingual: Qwen 3.5 (201 languages) or Mistral Large 3 (80+ languages). For lightweight multilingual needs, Cohere’s Tiny Aya covers 70+ languages at just 3.35B parameters (CC-BY-NC license).

Best for edge and mobile: Gemma 3 4B (4.2 GB RAM) or Gemma 3 1B for extreme constraints. Google’s FunctionGemma 270M is purpose-built for function calling on IoT devices.

Best for RAG and tool use: Command A (256K context, grounding-optimized) or DeepSeek V3.2 (native thinking + tool-use integration). Check Cohere’s CC-BY-NC license if you’re building commercially.

Best for long context: Llama 4 Scout (10M tokens) is unmatched. For 256K, Qwen 3.5 or Mistral Small 4. For a practical 128K, DeepSeek V3.2 or Gemma 3 27B.

Best permissive license: Qwen 3/3.5 (Apache 2.0), DeepSeek (MIT), GLM-5 (MIT), or Mistral Small 4 (Apache 2.0). No usage caps, no royalties, no geographic restrictions.

Running These Models Locally with Ollama

Every model in the self-hosting table above can be pulled and run with a single Ollama command. Install Ollama on your system first:

curl -fsSL https://ollama.com/install.sh | sh

Then pull and interact with any model:

ollama run qwen3:8b

For a GPU-equipped machine, Ollama automatically uses CUDA or ROCm if available, cutting response times from minutes to seconds. On CPU-only systems, stick with models under 8B parameters for usable response times.
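Before pulling a large model, it’s worth checking which backend Ollama is likely to use. A quick heuristic (my own, not an official Ollama check) is to look for the GPU vendor tooling on the PATH:

```shell
# Heuristic GPU check: the presence of nvidia-smi or rocm-smi is a rough
# proxy for whether Ollama will pick CUDA or ROCm on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  backend="CUDA"
elif command -v rocm-smi >/dev/null 2>&1; then
  backend="ROCm"
else
  backend="CPU-only"
fi
echo "Expected Ollama backend: ${backend}"
```

If this prints CPU-only, budget for the multi-minute response times shown in the table above and stay under 8B parameters.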

For a full setup walkthrough, see our guide on installing Ollama on Rocky Linux and Ubuntu. If you want a ChatGPT-style web interface for your local models, Open WebUI provides exactly that.
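Open WebUI is typically launched as a Docker container. The sketch below is adapted from the project’s documented one-liner; verify the flags against the current Open WebUI README before relying on them. The `--add-host` flag lets the container reach an Ollama server running on the host.

```shell
# Open WebUI quick start via Docker (flags adapted from the project's
# documented one-liner; confirm against the current README).
image="ghcr.io/open-webui/open-webui:main"
echo "Launching ${image}; UI will be on http://localhost:3000"

# With Docker installed, run:
# docker run -d -p 3000:8080 \
#   --add-host=host.docker.internal:host-gateway \
#   -v open-webui:/app/backend/data \
#   --name open-webui "$image"
```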

