The open-source LLM landscape has shifted dramatically. Models like Qwen 3.5, DeepSeek V3.2, GLM-5, and Llama 4 now match or beat proprietary alternatives on key benchmarks, and you can run them on your own hardware. Two years ago, open-weight models were curiosities. Today, they power production workloads at companies that don’t want to send their data to someone else’s API.
This reference covers every major open-source and open-weight large language model available as of March 2026, with verified benchmark scores, license terms, hardware requirements, and hands-on performance data from real self-hosting tests. The tables below compare architecture details, benchmark results, licensing restrictions, and what each model actually needs to run on your own machine using Ollama.
Current as of March 2026. Benchmark data sourced from official model papers and the Hugging Face Open LLM Leaderboard. Self-hosting tests run on Ubuntu 24.04 LTS, 4 vCPUs, 16 GB RAM, CPU-only inference via Ollama.
Master Comparison Table
This table covers every major open-source/open-weight LLM family. “Active params” is the number of parameters a Mixture-of-Experts (MoE) model activates per token; dense models use all parameters on every token.
| Model | Developer | Total Params | Active Params | Architecture | Context Window | Multimodal | License | Release |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.5 397B-A17B | Alibaba | 397B | 17B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 122B-A10B | Alibaba | 122B | 10B | MoE | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3.5 27B | Alibaba | 27B | 27B | Dense | 256K tokens | Text + Image | Apache 2.0 | Feb 2026 |
| Qwen 3 235B | Alibaba | 235B | 22B | MoE (128e, 8 active) | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 32B | Alibaba | 32B | 32B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| Qwen 3 8B | Alibaba | 8B | 8B | Dense | 128K tokens | No | Apache 2.0 | Apr 2025 |
| GLM-5 | Zhipu AI | 744B | 40B | MoE | 205K tokens | Text + Image | MIT | Feb 2026 |
| DeepSeek V3.2 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Dec 2025 |
| DeepSeek R1 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| DeepSeek V3 | DeepSeek | 671B | 37B | MoE | 128K tokens | No | MIT | Jan 2025 |
| Llama 4 Scout | Meta | 109B | 17B | MoE (16 experts) | 10M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 4 Maverick | Meta | 400B | 17B | MoE (128 experts) | 1M tokens | Text + Image | Llama 4 Community | Apr 2025 |
| Llama 3.3 | Meta | 70B | 70B | Dense | 128K tokens | No | Llama 3.3 Community | Dec 2024 |
| Mistral Small 4 | Mistral AI | 119B | 6B | MoE (128e, 4 active) | 256K tokens | Text + Image | Apache 2.0 | Mar 2026 |
| Mistral Large 3 | Mistral AI | 675B | 41B | MoE | 256K tokens | Text + Image | Apache 2.0 | Dec 2025 |
| Gemma 3 27B | Google | 27B | 27B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 12B | Google | 12B | 12B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Gemma 3 4B | Google | 4B | 4B | Dense | 128K tokens | Text + Image | Gemma (permissive) | Mar 2025 |
| Phi-4 Reasoning Vision | Microsoft | 15B | 15B | Dense | 16K tokens | Text + Image | MIT | Mar 2026 |
| Phi-4 | Microsoft | 14B | 14B | Dense | 16K tokens | No | MIT | Jan 2025 |
| Phi-4 Mini | Microsoft | 3.8B | 3.8B | Dense | 128K tokens | No | MIT | Jan 2025 |
| Mixtral 8x7B | Mistral AI | 46.7B | 12.9B | MoE (8 experts) | 32K tokens | No | Apache 2.0 | 2023 |
| Command R+ | Cohere | 104B | 104B | Dense | 128K tokens | No | CC-BY-NC | 2024 |
| Command A | Cohere | 111B | 111B | Dense | 256K tokens | No | CC-BY-NC | Mar 2025 |
| Falcon 3 10B | TII Abu Dhabi | 10B | 10B | Dense | 32K tokens | No | TII Falcon 2.0 | Dec 2024 |
| DBRX | Databricks | 132B | 36B | MoE | 32K tokens | No | Databricks Open Model | Mar 2024 |
| Grok-1 | xAI | 314B | N/A | MoE | N/A | No | Apache 2.0 | Mar 2024 |
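A quick way to reason about these numbers: weight memory scales with total parameters, not active ones, so an MoE model is cheap to run per token but still expensive to host, because every expert must be resident. A back-of-envelope sketch (the 4-bit quantization default is an assumption matching Ollama's common builds; real deployments add overhead for embeddings, activations, and the KV cache):

```python
def est_weight_gb(total_params_b: float, bits_per_weight: float = 4.0) -> float:
    """Rough footprint of model weights alone, in GB.

    total_params_b:  parameter count in billions (use TOTAL, not active).
    bits_per_weight: quantization level; 4-bit is a common Ollama default.
    Excludes KV cache and runtime overhead, so treat this as a floor.
    """
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

# MoE models must hold ALL experts in memory even though only the active
# subset runs per token: DeepSeek V3.2 (671B total, 37B active) still
# needs ~336 GB of weights at 4-bit, not ~18 GB.
print(f"DeepSeek V3.2 @ 4-bit: ~{est_weight_gb(671):.0f} GB")
print(f"Gemma 3 4B  @ 4-bit: ~{est_weight_gb(4):.1f} GB")
```

This is why the active-parameter column predicts speed and cost per token, while the total-parameter column predicts what hardware you need to load the model at all.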
Benchmark Scores
Benchmarks tell part of the story. MMLU tests general knowledge, GPQA Diamond tests graduate-level reasoning, AIME covers competition math, MATH-500 tests problem solving, and SWE-bench Verified measures real-world coding ability. All scores below come from official model papers or verified third-party evaluations. Cells marked N/A mean the score hasn’t been published or independently verified.
| Model | MMLU | MMLU-Pro | GPQA Diamond | AIME ’24 | MATH-500 | SWE-bench Verified |
|---|---|---|---|---|---|---|
| GLM-5 | N/A | N/A | N/A | N/A | N/A | 77.8% |
| DeepSeek V3.2-Speciale | N/A | N/A | N/A | N/A | N/A | N/A |
| Qwen 3 235B | N/A | 83.6% | 77.2% | 85.7% | N/A | N/A |
| DeepSeek R1 | N/A | 84.0% | 71.5% | 79.8% | 97.3% | N/A |
| Llama 4 Maverick | 85.5% | N/A | 69.8% | N/A | N/A | N/A |
| Llama 4 Scout | 79.6% | N/A | N/A | N/A | N/A | N/A |
| Gemma 3 27B | 78.6% | N/A | N/A | N/A | 50.0% | N/A |
| Mistral Small 4 | N/A | N/A | N/A | N/A | N/A | N/A |
A few things jump out. Qwen 3 235B leads on GPQA Diamond (77.2%) and AIME ’24 (85.7%), making it the strongest open-weight model for reasoning and math. DeepSeek R1 dominates MATH-500 at 97.3%, which is near-perfect. GLM-5 posts 77.8% on SWE-bench Verified, the strongest coding benchmark result among open models. DeepSeek V3.2-Speciale achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals, though official benchmark numbers haven’t been published yet. Llama 4 Maverick posts the highest raw MMLU at 85.5%, but MMLU alone doesn’t capture reasoning depth.
License Comparison
Licensing is where “open source” gets complicated. Some models are truly permissive (Apache 2.0, MIT), while others come with usage caps, geographic restrictions, or prohibitions on training derivative models. Read the fine print before building a product on any of these.
| License | Models | Commercial Use | Key Restrictions |
|---|---|---|---|
| Apache 2.0 | Qwen 3/3.5 (all), Mistral Large 3, Mistral Small 4, Mixtral 8x7B, Mistral 7B, Grok-1 | Yes, unrestricted | None |
| MIT | DeepSeek V3/V3.2/R1, Phi-4 (all variants), GLM-5 | Yes, unrestricted | None |
| Llama 4 Community | Llama 4 Scout, Llama 4 Maverick | Yes, free under 700M MAU | EU multimodal restrictions; Meta license required above 700M monthly active users |
| Llama 3.3 Community | Llama 3.3 70B | Yes, free under 700M MAU | Same MAU threshold as Llama 4 |
| Gemma | Gemma 3 (all sizes) | Yes (requires agreement) | Must accept Google’s terms; commercial use permitted after agreement |
| CC-BY-NC | Command R+, Command A | No | Non-commercial only; separate agreement required for commercial deployment |
| TII Falcon 2.0 | Falcon 3 (all sizes) | Yes, under $1M revenue | 10% royalty above $1M revenue |
| Databricks Open Model | DBRX | Yes | Cannot use to train other LLMs |
If license flexibility is your top priority, Qwen 3/3.5 under Apache 2.0, DeepSeek under MIT, or GLM-5 under MIT are the safest choices. You can do whatever you want with them, including fine-tuning and commercial deployment with zero royalties. Both Mistral Large 3 and Mistral Small 4 now ship under Apache 2.0, a significant shift from Mistral’s earlier restrictive licensing. The Llama licenses look permissive at first glance, but the 700M MAU cap and EU restrictions matter for larger operations.
Self-Hosting Resource Requirements
Benchmarks don’t tell you how a model feels when you’re actually running it. We tested six popular small/medium models on a modest Ubuntu 24.04 VM (4 vCPUs, 16 GB RAM, CPU-only inference) using Ollama. Each model answered the same prompt to keep things consistent.
| Model | Ollama Tag | Disk Size | RAM Usage | Response Time (CPU) | Notes |
|---|---|---|---|---|---|
| Llama 3.2 3B | llama3.2:3b | 2.0 GB | 11.4 GB | 88s | Clear, well-structured responses |
| Gemma 3 4B | gemma3:4b | 3.3 GB | 4.2 GB | 94s | Clean, structured, concise output |
| Phi-4 Mini 3.8B | phi4-mini | 2.5 GB | 8.9 GB | 97s | Good reasoning, occasional formatting artifacts |
| Mistral 7B | mistral:7b | 4.4 GB | 7.4 GB | 125s | Concise, accurate |
| Qwen 3 8B | qwen3:8b | 5.2 GB | 5.8 GB | 433s | Thinking mode adds latency; strong final answers |
| DeepSeek R1 8B | deepseek-r1:8b | 5.2 GB | 5.8 GB | 433s | Chain-of-thought reasoning; slow on CPU |
Gemma 3 4B stands out for RAM efficiency at just 4.2 GB, making it the best fit for memory-constrained environments. Llama 3.2 3B is the fastest responder at 88 seconds but uses a surprising 11.4 GB of RAM. The reasoning models (Qwen 3 8B and DeepSeek R1 8B) took over 7 minutes each because their chain-of-thought process generates significantly more tokens before producing a final answer. On a GPU, those times drop dramatically.
Pull and run any of these models with a single command:
```shell
ollama run gemma3:4b
```
Check our Ollama commands cheat sheet for the full list of management commands, and browse the Ollama model library for all available tags and quantization options.
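To reproduce timing numbers like those in the table without the interactive CLI, you can hit Ollama's local REST API directly. A minimal sketch, assuming a running `ollama serve` on the default port; `eval_count` is the generated-token count that the `/api/generate` endpoint reports:

```python
import json
import time
import urllib.request

def summarize(tokens: int, wall_s: float) -> dict:
    """Reduce one run to the numbers worth comparing across models."""
    rate = round(tokens / wall_s, 1) if wall_s > 0 else 0.0
    return {"wall_s": round(wall_s, 1), "tokens": tokens, "tok_per_s": rate}

def time_model(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """One non-streaming generation request, timed end to end."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    t0 = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return summarize(data.get("eval_count", 0), time.monotonic() - t0)

# Usage (requires `ollama serve` and the models pulled); same prompt for
# every model keeps the comparison fair:
#   for tag in ("gemma3:4b", "llama3.2:3b"):
#       print(tag, time_model(tag, "Explain MoE routing in two sentences."))
```

Wall-clock time per identical prompt, as used here, is the simplest fair basis for cross-model comparisons like the table above.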
Qwen 3 and Qwen 3.5 (Alibaba)
Qwen is arguably the most versatile open-source model family available. The Qwen 3 series, released under Apache 2.0 with no usage restrictions, spans from a 0.6B edge model to a 235B MoE flagship that competes with the best proprietary models. The 235B variant posts 85.7% on AIME ’24 and 77.2% on GPQA Diamond, both top scores among open models.
Qwen 3.5, released in waves from February to March 2026, represents a generational leap. The entire family is now natively multimodal (text + vision trained jointly from the start, not bolted on after the fact). Key upgrades over Qwen 3 include 256K context windows (up from 128K), support for 201 languages (up from 119), and significantly improved agentic coding capabilities. The flagship 397B-A17B MoE model activates only 17B parameters per token while delivering performance that rivals closed-source alternatives. The 27B dense variant is particularly compelling for teams that want high quality without MoE serving complexity.
One feature that sets the Qwen family apart is its toggleable thinking mode. You can enable chain-of-thought reasoning when you need it (math, logic, coding) and disable it for faster responses on straightforward queries. This flexibility means a single model serves both use cases. All Qwen 3 and 3.5 models remain Apache 2.0 licensed.
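In practice, Qwen documents per-turn soft switches for this: appending `/think` or `/no_think` to a prompt toggles chain-of-thought for that request. A sketch against Ollama's local API (the soft-switch tags are Qwen 3's documented convention; whether Qwen 3.5 keeps the exact same tags is an assumption here):

```python
import json
import urllib.request

def with_mode(prompt: str, thinking: bool) -> str:
    """Append Qwen's soft switch: /think enables chain-of-thought for
    this turn, /no_think skips it for a faster direct answer."""
    return f"{prompt} {'/think' if thinking else '/no_think'}"

def ask_qwen(prompt: str, thinking: bool, host: str = "http://localhost:11434") -> str:
    """Non-streaming request to a locally served Qwen model."""
    body = json.dumps({
        "model": "qwen3:8b",
        "prompt": with_mode(prompt, thinking),
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Usage (requires `ollama serve` with qwen3:8b pulled):
#   ask_qwen("What is the capital of France?", thinking=False)   # fast path
#   ask_qwen("Prove that sqrt(2) is irrational.", thinking=True)  # deep path
```

Routing simple lookups to the non-thinking path is what makes the single-model approach pay off: you only spend the extra chain-of-thought tokens when the task warrants them.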
GLM-5 (Zhipu AI)
GLM-5, released in February 2026 by Chinese AI lab Zhipu AI, is one of the most significant open-source releases of the year. At 744B total parameters with 40B active, it’s a large MoE model under the MIT license with no usage restrictions.
What makes GLM-5 notable beyond its size is its training infrastructure: the entire model was trained on 100,000 Huawei Ascend 910B chips with no US-manufactured hardware involved. This matters for organizations operating under export control constraints. The model posts 50.4% on Humanity’s Last Exam and 77.8% on SWE-bench Verified, the latter being the strongest coding benchmark result among open models.
GLM-5 supports a 205K token context window and multimodal input. Zhipu AI has announced GLM-5.1, an open-source successor, though no release date has been confirmed. The closed-source GLM-5-Turbo launched in March 2026 for API access.
DeepSeek R1, V3, and V3.2
DeepSeek made waves in January 2025 with two models under the MIT license. V3 is a general-purpose 671B MoE model (37B active), while R1 uses the same architecture but specializes in step-by-step reasoning. R1’s 97.3% on MATH-500 is the highest score of any open model on that benchmark, and its 84.0% on MMLU-Pro puts it ahead of most competitors on professional-level knowledge tasks.
DeepSeek V3.2, released in December 2025, is the first model to integrate thinking directly into tool-use workflows. The companion V3.2-Speciale variant, designed for high-compute reasoning, achieved gold-medal performance at IMO 2025, IOI 2025, and ICPC World Finals. Both remain MIT licensed.
The distilled versions (1.5B through 70B) make DeepSeek accessible on consumer hardware. The 8B distill, tested in our Ollama benchmarks above, delivers genuine chain-of-thought reasoning at a size that runs on a laptop. For a detailed walkthrough, see our guide on running DeepSeek R1 locally with Ollama.
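One practical note when scripting against the distills: the Ollama builds of R1 emit their reasoning inside `<think>...</think>` tags before the final answer, so downstream code usually wants to strip that block. A small helper (the tag format is as observed in the Ollama builds; treat it as an assumption for other packagings):

```python
import re

# R1-style chain-of-thought is wrapped in <think>...</think> before the answer.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    """Drop the chain-of-thought block(s), keep only the final answer."""
    return THINK_BLOCK.sub("", text).strip()

raw = "<think>The user asks 2+2. Trivial arithmetic.</think>2 + 2 = 4."
print(strip_think(raw))  # → 2 + 2 = 4.
```

Keeping the raw text around for debugging while showing users only the stripped answer is a common pattern with reasoning models.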
DeepSeek V4 has been anticipated since early 2026 (expected to be a ~1T parameter model with 1M context and native multimodal support), but multiple rumored release windows have passed without an announcement. Current expectation is Q2 or Q3 2026.
Llama 4 (Meta)
Meta’s Llama 4 family, released in April 2025, introduced MoE architecture to the Llama line for the first time. Both Scout and Maverick use only 17B active parameters per token despite having 109B and 400B total parameters respectively. The practical benefit is that you get large-model quality with smaller-model inference costs.
Scout’s headline feature is its 10 million token context window, the longest of any open model by a wide margin. Maverick, with 128 experts and a 1M context window, targets production deployments where quality matters more than context length. Both are natively multimodal (text and image).
Behemoth, the flagship at roughly 2 trillion total parameters (288B active), was originally expected to follow Scout and Maverick. However, reports from May 2025 indicate Meta paused its release after internal evaluations showed performance improvements were incremental. As of March 2026, Meta has not officially cancelled Behemoth but has provided no release timeline.
The Llama 4 Community License is free for organizations under 700 million monthly active users. There’s an important catch for European users: the Acceptable Use Policy explicitly excludes multimodal model rights for individuals or companies based in the EU. Since all Llama 4 models are natively multimodal, this effectively restricts the entire Llama 4 family in the EU, likely a preemptive response to the EU AI Act’s transparency and training data disclosure requirements. Llama 3.3 (70B dense, text-only) remains unaffected by this restriction and is still popular because it fits on a single high-end GPU without MoE-aware serving infrastructure.
Gemma 3 (Google)
Google positioned Gemma 3 for on-device and edge deployment, and the resource numbers from our testing confirm it. The 4B variant uses just 4.2 GB of RAM and responds in 94 seconds on CPU, the best efficiency ratio of any model we tested. The 27B flagship hits 78.6% on MMLU and 1338 Elo on Chatbot Arena running on a single H100 GPU, impressive for a dense model at that size.
All Gemma 3 models from 4B up support multimodal input (text and images), which is useful for building applications that need vision capabilities without the infrastructure costs of a 400B model. Google also released FunctionGemma 270M in December 2025, a tiny model optimized specifically for function calling on mobile and IoT devices.
The Gemma license requires accepting Google’s terms but permits commercial use after that. The 1B model, with its 32K context window, targets mobile and IoT scenarios where every megabyte counts.
Mistral Models
Mistral AI spans the full range from tiny to massive. The original Mistral 7B and Mixtral 8x7B, both under Apache 2.0, remain among the most deployed open models for self-hosting. Mistral 7B is a reliable general-purpose model that runs comfortably on modest hardware. Mixtral 8x7B, with 46.7B total parameters but only 12.9B active, demonstrated early on that MoE could deliver outsized quality at reasonable compute costs.
Mistral Large 3 (December 2025) is a different beast: 675B total parameters, 41B active, with multimodal support covering text and images across 80+ languages. It ships under Apache 2.0.
Mistral Small 4, released in March 2026, is the most interesting recent addition. At 119B total parameters with 128 experts and only 4 active per token (6B active parameters), it unifies instruction following, reasoning (via configurable depth), and multimodal capabilities into a single model. It supports a 256K context window and ships under Apache 2.0, a significant licensing upgrade from Mistral’s previous restrictive licenses. It combines the capabilities of three previously separate models: Magistral (reasoning), Pixtral (multimodal), and Devstral (agentic coding).
Phi-4 (Microsoft)
Microsoft’s Phi-4 family proves that smaller models can punch above their weight on specific tasks. The 14B base model and its reasoning-focused variants excel at math and logic problems, consistently outperforming larger models on those benchmarks. Phi-4 Mini at 3.8B with a 128K context window is one of the best options for resource-constrained deployments that still need long-context capabilities.
Phi-4 Reasoning Vision (15B), released March 4, 2026, adds image understanding to the reasoning pipeline. Built on the Phi-4-Reasoning backbone with a SigLIP-2 vision encoder, it supports dynamic resolution with up to 3,600 visual tokens for GUI grounding and document analysis. One clever design choice: the model knows when to engage deep reasoning and when thinking is unnecessary, adapting its compute budget per query. All Phi-4 variants ship under the MIT license.
In our Ollama testing, Phi-4 Mini used 8.9 GB of RAM, which is higher than expected for a 3.8B model, likely due to its 128K context window allocation. Response quality was solid, though we noticed occasional formatting artifacts in structured output.
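If that overhead matters, Ollama lets you cap the context allocation per request via the `num_ctx` option, trading long-context headroom for a smaller KV cache. A sketch of the request payload (field names follow Ollama's `/api/generate` options; the 8K figure is an arbitrary example, not a recommendation):

```python
import json

def build_request(model: str, prompt: str, num_ctx: int) -> bytes:
    """JSON body for Ollama's /api/generate with a capped context window.
    A smaller num_ctx shrinks the KV-cache allocation and resident RAM."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }).encode()

# e.g. cap phi4-mini to 8K instead of its full 128K window:
body = build_request("phi4-mini", "Summarize the following paragraph.", 8192)
# POST `body` to http://localhost:11434/api/generate with urllib or requests.
```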
Command A and R+, Falcon 3, DBRX, and Grok
Command A (Cohere, 111B dense, March 2025) supersedes Command R+ for most use cases. It offers a 256K context window, runs on just 2 GPUs (A100/H100), and includes reasoning and vision variants. Like R+, it’s optimized for retrieval-augmented generation and tool use. The CC-BY-NC license limits it to non-commercial use without a separate agreement. Command R+ (104B, 128K context) remains available but is the older model.
Cohere also released Tiny Aya (3.35B, February 2026) under CC-BY-NC, supporting 70+ languages and designed for laptop and edge deployment, as well as Cohere Transcribe (2B, March 2026), an Apache 2.0-licensed speech recognition model that tops the Hugging Face Open ASR leaderboard across 14 languages.
Falcon 3 from TII Abu Dhabi offers models from 1B to 10B, trained on 14 trillion tokens. The Falcon 2.0 license is free for organizations under $1 million in revenue, with a 10% royalty above that threshold.
DBRX (Databricks, 132B total, 36B active) was an early high-quality MoE release in March 2024. Its license prohibits using it to train other LLMs but otherwise permits commercial deployment. It’s showing its age against newer models but remains relevant in Databricks-native workflows.
Grok-1 from xAI (314B MoE, Apache 2.0) was a surprise open release in March 2024. Grok-2.5 followed in August 2025 with open weights but a more restrictive license that prohibits using the weights to train other models. Elon Musk confirmed in February 2026 that xAI plans to open-source Grok 3, but the originally targeted February release date has passed with no weights published. Neither Grok model has the community tooling support (Ollama templates, vLLM integrations) that the more established families enjoy.
Which Model Should You Choose?
The “best” model depends entirely on what you’re building. Here’s a quick decision guide based on real use cases.
Best for reasoning and math: DeepSeek R1 (97.3% MATH-500) or Qwen 3 235B (85.7% AIME ’24). DeepSeek V3.2-Speciale won gold at IMO 2025, IOI 2025, and the ICPC World Finals. If you need the model to show its work and solve multi-step problems, the DeepSeek and Qwen families are the clear leaders.
Best for general-purpose chat: Qwen 3.5 397B-A17B or Llama 4 Maverick. Maverick has the highest MMLU (85.5%) among open models. Qwen 3.5 27B is a strong dense alternative that’s simpler to serve.
Best for coding: GLM-5 (77.8% SWE-bench Verified) or DeepSeek V3.2-Speciale. For a smaller option, Mistral Small 4 combines Devstral’s agentic coding capabilities in a 6B active parameter package.
Best for multilingual: Qwen 3.5 (201 languages) or Mistral Large 3 (80+ languages). For lightweight multilingual needs, Cohere’s Tiny Aya covers 70+ languages at just 3.35B parameters (CC-BY-NC license).
Best for edge and mobile: Gemma 3 4B (4.2 GB RAM) or Gemma 3 1B for extreme constraints. Google’s FunctionGemma 270M is purpose-built for function calling on IoT devices.
Best for RAG and tool use: Command A (256K context, grounding-optimized) or DeepSeek V3.2 (native thinking + tool-use integration). Check Cohere’s CC-BY-NC license if you’re building commercially.
Best for long context: Llama 4 Scout (10M tokens) is unmatched. For 256K, Qwen 3.5 or Mistral Small 4. For a practical 128K, DeepSeek V3.2 or Gemma 3 27B.
Best permissive license: Qwen 3/3.5 (Apache 2.0), DeepSeek (MIT), GLM-5 (MIT), or Mistral Small 4 (Apache 2.0). No usage caps, no royalties, no geographic restrictions.
Running These Models Locally with Ollama
Every model in the self-hosting table above can be pulled and run with a single Ollama command. Install Ollama on your system first:
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Then pull and interact with any model:
```shell
ollama run qwen3:8b
```
On a GPU-equipped machine, Ollama automatically uses CUDA or ROCm when available, cutting response times from minutes to seconds. On CPU-only systems, stick with models under 8B parameters for usable response times.
For a full setup walkthrough, see our guide on installing Ollama on Rocky Linux and Ubuntu. If you want a ChatGPT-style web interface for your local models, Open WebUI provides exactly that.