
Run DeepSeek R1 Locally with Ollama on Linux

DeepSeek R1 does something most open-source models skip: it shows its reasoning. When you ask it a question, it thinks through the problem step by step in visible <think> blocks before giving you an answer. This chain-of-thought approach matches OpenAI’s o1 on reasoning benchmarks, and the entire model is released under the MIT license.

Original content from computingforgeeks.com - post 164427

This guide covers pulling and running DeepSeek R1 locally with Ollama on Rocky Linux 10 and Ubuntu 24.04. We test the 1.5B and 8B variants on a 4-core CPU-only VM with 16 GB RAM, compare response quality and speed, and show how to use the REST API for integration. The 8B variant is the sweet spot for most servers, delivering correct reasoning at about 6 tokens per second on modest hardware.

Verified working: March 2026 on Rocky Linux 10.1 (kernel 6.12), Ollama 0.18.2, DeepSeek R1 1.5B and 8B (distilled Qwen/Llama variants)

What Makes DeepSeek R1 Different

Most LLMs give you an answer directly. DeepSeek R1 generates a reasoning trace first, enclosed in <think> tags, where it works through the problem before producing a final response. On the AIME 2024 math benchmark, R1 scores 79.8% and hits 97.3% on MATH-500, putting it in the same tier as OpenAI’s o1. The R1-0528 update pushed AIME 2025 scores to 87.5%, approaching GPT-4 level on advanced math. These are not theoretical claims; you can reproduce them on your own hardware.

The full model has 671 billion parameters and requires serious GPU infrastructure. For practical local use, DeepSeek released distilled variants based on Qwen 2.5 and Llama architectures, ranging from 1.5B to 70B parameters. The distilled models retain the chain-of-thought reasoning behavior while running on commodity hardware. All variants are MIT-licensed, which means full commercial use with no restrictions.

Prerequisites

  • Rocky Linux 10, AlmaLinux 10, Ubuntu 24.04, or Debian 13
  • Minimum 4 GB RAM for the 1.5B variant, 8 GB for 8B, 16 GB recommended for comfortable 8B inference
  • 10 GB free disk space (5.2 GB for the 8B model plus Ollama binary)
  • Root or sudo access
  • Ollama installed and running (see our Ollama installation guide for Rocky Linux and Ubuntu)
  • GPU is optional. All testing in this guide runs on CPU only
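You can confirm the host meets these requirements with standard Linux tools before pulling anything:

```shell
# Check CPU cores, total RAM, and free disk space before pulling a model
nproc                         # CPU core count (4 in our test VM)
grep MemTotal /proc/meminfo   # total RAM in kB (~16 GB recommended for 8B)
df -h /                       # free space on the root filesystem (need ~10 GB)
```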

DeepSeek R1 Model Variants

Ollama hosts seven DeepSeek R1 variants. The smaller models are distilled from the full 671B model using Qwen 2.5 and Llama base architectures, fine-tuned on 800K reasoning samples. Pick the largest variant your hardware can support.

Model               Disk Size   RAM Needed   Context Window   Base Architecture
deepseek-r1:1.5b    1.1 GB      ~3 GB        128K             Qwen 2.5 1.5B
deepseek-r1:7b      4.7 GB      ~6 GB        128K             Qwen 2.5 7B
deepseek-r1:8b      5.2 GB      ~7 GB        128K             Llama 3.1 8B
deepseek-r1:14b     9.0 GB      ~12 GB       128K             Qwen 2.5 14B
deepseek-r1:32b     20 GB       ~24 GB       128K             Qwen 2.5 32B
deepseek-r1:70b     43 GB       ~48 GB       128K             Llama 3.3 70B
deepseek-r1:671b    404 GB      ~500 GB      160K             DeepSeek MoE (full)

The 8B variant (Llama 3.1 based) is the default tag in Ollama and offers the best balance of quality and resource usage for most servers. The 1.5B is useful for quick prototyping or resource-constrained environments but makes more reasoning mistakes, as we demonstrate below.
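The "pick the largest variant your hardware can support" rule can be sketched as a small helper. The RAM thresholds below come from the table above; the function itself is our illustration, not anything Ollama provides:

```python
# Suggest the largest DeepSeek R1 variant that fits in a given amount of RAM,
# using the approximate "RAM Needed" figures from the table above.
VARIANTS = [  # (tag, approximate RAM needed in GB)
    ("deepseek-r1:1.5b", 3),
    ("deepseek-r1:7b", 6),
    ("deepseek-r1:8b", 7),
    ("deepseek-r1:14b", 12),
    ("deepseek-r1:32b", 24),
    ("deepseek-r1:70b", 48),
]

def suggest_variant(ram_gb):
    """Return the largest variant whose approximate RAM requirement fits."""
    fitting = [tag for tag, need in VARIANTS if need <= ram_gb]
    return fitting[-1] if fitting else None

print(suggest_variant(8))  # deepseek-r1:8b
```

Leave headroom for the OS and other services; a model that barely fits will swap under load.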

Pull and Run DeepSeek R1

With Ollama running, pull the 8B variant (the default):

ollama pull deepseek-r1:8b

The download is about 5.2 GB. On a 100 Mbps connection, expect roughly 7 minutes.
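The estimate follows from unit arithmetic: model size is measured in bytes, link speed in bits per second.

```python
# Rough download-time estimate for the 8B model on a 100 Mbps link
size_gb = 5.2      # deepseek-r1:8b download size in gigabytes
link_mbps = 100    # connection speed in megabits per second

seconds = size_gb * 8 * 1000 / link_mbps   # GB -> megabits, divide by Mbps
print(f"{seconds / 60:.1f} minutes")       # 6.9 minutes
```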

pulling manifest
pulling 6995872bfe4c... 100%
pulling 369ca498f347... 100%
pulling ed8474dc73db... 100%
pulling f64cd5418e4b... 100%
verifying sha256 digest
writing manifest
success

Verify the model is available:

ollama list

You should see the model with its size:

NAME              ID              SIZE      MODIFIED
deepseek-r1:8b    6995872bfe4c    5.2 GB    20 seconds ago

Start an interactive chat session:

ollama run deepseek-r1:8b

The first run takes 10-30 seconds to load the model into memory. After that, you get a prompt where you can type questions. Type /bye to exit the session.

Chain-of-Thought Reasoning in Action

The defining feature of DeepSeek R1 is visible reasoning. When you ask a question, the model first generates a thinking trace, then produces its final answer. In the terminal, Ollama displays “Thinking…” followed by the reasoning steps, then the answer.

Here is a classic trick question that tests whether the model truly reasons or just pattern-matches. Ask the 8B model:

A farmer has 17 sheep. All but 9 die. How many sheep are left alive?

The model’s internal reasoning (from the streaming API thinking field) works through the problem methodically:

<think>
The problem states: "A farmer has 17 sheep. All but 9 die. How many sheep
are left alive?" I need to figure out how many sheep are still alive.

First, "all but 9" means that all except for 9 die. So, if all but 9 die,
that means that 9 are left alive, right?

It says: "All but 9 die." So, all except 9 die. That means that the 9 that
are not dying are alive, so 9 are left alive.

17 minus 9 is 8, so 8 died, and 9 are alive. The question is how many are
left alive, which should be 9.
</think>

The final answer arrives after the thinking completes:

The phrase "all but 9 die" means that all except 9 sheep die. Therefore,
9 sheep are left alive.

Initially, there are 17 sheep. After 8 sheep die (since 17 - 9 = 8),
9 sheep remain alive.

The 8B model correctly identifies that “all but 9” means 9 survive. This is where the chain-of-thought reasoning pays off. The model catches the language trick by explicitly reasoning about what “all but” means before committing to an answer.

Comparing the 1.5B and 8B Variants

To see how model size affects reasoning quality, we ran the same sheep problem on both the 1.5B and 8B variants on a 4-core CPU VM with 16 GB RAM. The results are revealing.

1.5B Result: Wrong Answer

The 1.5B variant misinterprets “all but 9 die” as “9 sheep die” rather than “9 sheep survive.” Its thinking reveals the confusion:

<think>
First, I need to understand the problem clearly. The farmer has a total of
17 sheep initially.

Out of these 17 sheep, all but 9 die. That means that not only does one
sheep remain but also nine more sheep don't make it into the land anymore.

To find out how many sheep are left alive, I can subtract the number of
dead sheep from the initial total. So, 17 minus 9 equals 8.

Therefore, there are 8 sheep left alive.
</think>

Answer: 8 sheep are left alive.

The 1.5B model gets this wrong. It interprets “all but 9” backwards, concluding 8 survive instead of 9. This is a fundamental reasoning failure that the larger model avoids.

8B Result: Correct Answer

The 8B variant correctly parses “all but 9 die” as “9 survive” and produces the right answer of 9. Its thinking chain is longer and more thorough, with the model even second-guessing itself before confirming the correct interpretation.

Performance Comparison

Metric                           deepseek-r1:1.5b       deepseek-r1:8b
Correct answer (sheep problem)   No (answered 8)        Yes (answered 9)
Tokens/sec (CPU, 4 cores)        ~30                    ~6
RAM used (model loaded)          1.1 GB                 5.9 GB
Disk size                        1.1 GB                 5.2 GB
Thinking depth                   Shallow, often wrong   Thorough, self-correcting

The 1.5B variant is roughly 5x faster but produces unreliable reasoning. For anything beyond simple factual questions, the 8B model is the minimum recommended size. If your server has 32 GB or more RAM, the 14B variant (9.0 GB on disk, ~12 GB RAM) delivers noticeably better quality.

Code Debugging with DeepSeek R1

Reasoning models excel at code analysis because they can trace through logic step by step. Give the 8B model a buggy Python function:

Find the bug in this Python function:

def calculate_average(numbers):
    total = 0
    for num in numbers:
        total += num
    average = total / len(numbers)
    return average

result = calculate_average([])
print(result)

The model’s thinking identifies the zero-division bug immediately:

<think>
We are given a function calculate_average that takes a list of numbers
and returns their average. The bug occurs when the list is empty. In that
case, len(numbers) is 0, and we divide by zero, which will raise a
ZeroDivisionError.
</think>

The final response includes the fix:

The bug occurs when calculating the average of an empty list. The division
by zero (total / len(numbers)) will raise a ZeroDivisionError because the
length of an empty list is zero.

Fixed version:

def calculate_average(numbers):
    if not numbers:
        return None
    total = 0
    for num in numbers:
        total += num
    average = total / len(numbers)
    return average

The model not only identifies the bug but explains why it happens and provides a corrected version with an empty-list guard. At 6 tokens per second on CPU, the response takes about 100 seconds, which is acceptable for debugging assistance on a local machine.
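The model's corrected function can be verified directly, confirming that the empty-list guard removes the crash without changing the normal case:

```python
# Corrected version of calculate_average with the empty-list guard
def calculate_average(numbers):
    if not numbers:
        return None
    total = 0
    for num in numbers:
        total += num
    return total / len(numbers)

print(calculate_average([]))         # None (no ZeroDivisionError)
print(calculate_average([2, 4, 6]))  # 4.0
```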

Use DeepSeek R1 via the REST API

Ollama exposes a REST API on port 11434. The examples below use the native /api/chat endpoint; an OpenAI-compatible endpoint is also available at /v1/chat/completions. Either lets you integrate DeepSeek R1 into scripts, applications, or automation pipelines without using the interactive CLI.

Send a request using curl:

curl -s http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:8b",
  "messages": [{
    "role": "user",
    "content": "What are the main differences between TCP and UDP?"
  }],
  "stream": false
}'

The response comes back as JSON with the model’s answer in the message.content field. With stream: false, Ollama buffers the entire response before returning it. The thinking happens internally (consuming tokens) but the <think> content is excluded from the final response when using this endpoint.
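Extracting the answer from that JSON is a one-liner in any language; a minimal Python sketch, using an abbreviated sample payload rather than a verbatim Ollama response:

```python
import json

# Abbreviated sample of a non-streaming /api/chat response body
raw = ('{"model": "deepseek-r1:8b", "message": {"role": "assistant", '
       '"content": "TCP is connection-oriented; UDP is connectionless."}, '
       '"done": true}')

response = json.loads(raw)
print(response["message"]["content"])   # the model's final answer
```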

To capture the thinking trace via the API, use stream: true and read the thinking field from each streamed JSON line:

curl -sN http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:8b",
  "messages": [{
    "role": "user",
    "content": "What is 15 + 28?"
  }],
  "stream": true
}'

Each line of the stream is a JSON object. During the thinking phase, the output includes a thinking field with the reasoning tokens:

{"model":"deepseek-r1:8b","message":{"role":"assistant","content":"","thinking":"First"},"done":false}
{"model":"deepseek-r1:8b","message":{"role":"assistant","content":"","thinking":", the"},"done":false}
{"model":"deepseek-r1:8b","message":{"role":"assistant","content":"","thinking":" question"},"done":false}

After the thinking completes, the content field starts populating with the final answer. The last line includes "done": true along with performance metrics like eval_count (total tokens generated) and eval_duration (inference time in nanoseconds). You can calculate tokens per second from these values.
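A sketch of parsing that stream: it reassembles the thinking trace and final answer, then derives tokens per second from the closing metrics. The sample lines mirror the output above; the final line's content and metric values are illustrative, not captured output:

```python
import json

# Sample NDJSON stream: thinking chunks, then a final line with metrics
stream = [
    '{"message":{"role":"assistant","content":"","thinking":"First"},"done":false}',
    '{"message":{"role":"assistant","content":"","thinking":", the"},"done":false}',
    '{"message":{"role":"assistant","content":"43"},"done":true,'
    '"eval_count":600,"eval_duration":100000000000}',
]

thinking, answer = [], []
tokens_per_sec = None
for line in stream:
    chunk = json.loads(line)
    msg = chunk.get("message", {})
    thinking.append(msg.get("thinking", ""))
    answer.append(msg.get("content", ""))
    if chunk.get("done"):
        # eval_duration is reported in nanoseconds
        tokens_per_sec = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)

print("".join(thinking))   # First, the
print("".join(answer))     # 43
print(tokens_per_sec)      # 6.0
```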

For programmatic use in Python, the ollama pip package wraps this API:

pip install ollama

Then in your Python script:

import ollama

response = ollama.chat(
    model='deepseek-r1:8b',
    messages=[{'role': 'user', 'content': 'Explain DNS in one sentence.'}]
)
print(response['message']['content'])

RAM and Disk Usage Reference

These measurements come from our 16 GB test VM running Rocky Linux 10.1 with 4 CPU cores (no GPU). The “RAM used” column shows what ollama ps reports when the model is loaded and ready for inference.

Model               Disk Size   RAM (loaded)   Tokens/sec (CPU)   Reasoning Quality
deepseek-r1:1.5b    1.1 GB      1.1 GB         ~30                Basic, frequent errors on tricky questions
deepseek-r1:8b      5.2 GB      5.9 GB         ~6                 Good, handles trick questions and code debugging
deepseek-r1:14b     9.0 GB      ~11 GB         ~3 (est.)          Strong, better nuance and fewer mistakes
deepseek-r1:32b     20 GB       ~24 GB         ~1.5 (est.)        Excellent, comparable to larger proprietary models

With GPU acceleration (NVIDIA CUDA), expect 5-10x higher tokens per second depending on your card. A 12 GB VRAM GPU like the RTX 4070 can run the 8B model entirely in VRAM, pushing speeds to 40-60 tokens per second.

Troubleshooting

Error: “no space left on device” when pulling larger models

We hit this when attempting to pull the 14B variant on a 20 GB disk that already had the 1.5B and 8B models downloaded. Ollama stores models in /usr/share/ollama/.ollama/models/ by default. The partial download file can fill your disk completely.

Clean up partial downloads:

sudo find /usr/share/ollama/.ollama/models/blobs/ -name '*partial*' -delete

Verify disk space is recovered:

df -h /

If you need larger models, either add disk space or move the model storage directory by setting OLLAMA_MODELS in the systemd service environment. Remove models you no longer need with ollama rm deepseek-r1:1.5b.
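Relocating model storage takes a systemd drop-in; a sketch assuming the new directory is /data/ollama-models (adjust the path, and make sure the ollama user owns it):

```shell
# Create the new storage directory and give it to the ollama user
sudo mkdir -p /data/ollama-models
sudo chown -R ollama:ollama /data/ollama-models

# Add a drop-in that points Ollama at the new location
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/models.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/data/ollama-models"
EOF

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Already-downloaded models are not moved automatically; copy the contents of /usr/share/ollama/.ollama/models/ to the new directory or pull them again.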

Ollama service not found after install script

On Rocky Linux 10.1, the Ollama install script (curl -fsSL https://ollama.com/install.sh | sh) installed the binary to /usr/local/bin/ollama but did not create the systemd service file. If systemctl status ollama returns “Unit not found,” create the service manually:

sudo useradd -r -s /bin/false -U -m -d /usr/share/ollama ollama

Create the systemd unit file:

sudo vi /etc/systemd/system/ollama.service

Add the following configuration:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"

[Install]
WantedBy=default.target

Reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now ollama

Confirm it is running:

systemctl status ollama

The output should show active (running):

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-03-25 03:25:22 EAT; 2s ago
   Main PID: 5003 (ollama)
      Tasks: 10 (limit: 100206)
     Memory: 11.9M (peak: 24.8M)

Going Further

  • Try larger variants if your hardware allows. The 14B and 32B models produce noticeably better reasoning on complex problems. Each step up roughly doubles the RAM requirement
  • Add a web interface with Open WebUI for a ChatGPT-like experience that persists conversation history and supports multiple models
  • Compare with other reasoning models like Qwen3 (which also supports chain-of-thought) and phi4-reasoning from Microsoft. Pull them with ollama pull qwen3 and test the same prompts side by side
  • GPU acceleration transforms the experience. An NVIDIA GPU with 8 GB or more VRAM can run the 8B model at 40-60 tokens per second instead of 6
  • Build a RAG pipeline by combining DeepSeek R1 with a vector database like pgvector to let the model reason over your own documents without sending data to external APIs
