Install Ollama on Rocky Linux 10 / Ubuntu 24.04

Running large language models on your own hardware means no API costs, no data leaving your network, and no rate limits. Ollama makes this straightforward with a single install command, built-in model management, and an OpenAI-compatible REST API that works as a drop-in replacement for cloud endpoints.

Original content from computingforgeeks.com - post 164398

This guide covers installing Ollama on Rocky Linux 10 and Ubuntu 24.04, pulling and running models, configuring the systemd service for remote access, firewall rules (firewalld and ufw), the REST API with practical examples, and running Ollama in Docker. Both OS families are tested side by side so you can follow along on whichever you run.

Tested March 2026 on Rocky Linux 10.1 (SELinux enforcing) and Ubuntu 24.04.2 LTS, with Ollama v0.18.2 and Gemma 3 4B.

Prerequisites

  • Rocky Linux 10 or Ubuntu 24.04 server (fresh or existing)
  • Minimum 8 GB RAM for small models (1B-3B). 16 GB recommended for 7B models
  • 20 GB free disk space for model storage
  • (Optional) NVIDIA GPU with compute capability 5.0+ for hardware acceleration
  • Root or sudo access
  • Internet access to download Ollama and pull models

Ollama runs in CPU-only mode by default and auto-detects NVIDIA or AMD GPUs when drivers are present. You do not need a GPU to follow this guide.

Install Ollama

Ollama ships as a single binary. The official install script downloads it to /usr/local/bin, creates a dedicated ollama system user and group, and sets up a systemd service. There is no RPM or DEB repository involved.

Rocky Linux 10 requires zstd for extracting the compressed archive. Install it first:

sudo dnf install -y zstd

Then run the install script on either OS:

curl -fsSL https://ollama.com/install.sh | sh

The installer finishes with a confirmation and the API address:

>>> Installing ollama to /usr/local
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

The GPU warning is expected on CPU-only systems. Ollama works fine without a GPU, just slower on larger models.

Verify the installed version:

ollama --version

You should see the version number confirmed:

ollama version is 0.18.2

Confirm the service is running:

systemctl status ollama

The output shows active (running) with the process details:

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-03-25 02:40:40 EAT; 2min ago
   Main PID: 5120 (ollama)
      Tasks: 9 (limit: 100206)
     Memory: 11.8M (peak: 26.2M)
        CPU: 52ms
     CGroup: /system.slice/ollama.service
             └─5120 /usr/local/bin/ollama serve

Pull and Run Your First Model

Ollama hosts a library of 200+ models you can pull with a single command. For a first test, Gemma 3 4B from Google is a solid choice because it fits in 8 GB of RAM and responds quickly on CPU.

ollama pull gemma3:4b

The download takes a few minutes depending on your connection (the model is about 3.3 GB):

pulling manifest
pulling aeda25e63ebd... 100%  3.3 GB
pulling e0a42594d802... 100%  358 B
pulling dd084c7d92a3... 100%  8.4 KB
pulling 3116c5225075... 100%   77 B
pulling b6ae5839783f... 100%  489 B
verifying sha256 digest
writing manifest
success

Start an interactive chat session:

ollama run gemma3:4b

This drops you into a prompt where you can type questions. On a 4-core CPU with 16 GB RAM, expect roughly 8-12 tokens per second with the 4B model. Type /bye to exit the session.

You can also pipe a prompt directly without entering interactive mode:

echo "What is the capital of France? Answer in one sentence." | ollama run gemma3:4b

The model responds immediately:

The capital of France is Paris.

List all downloaded models:

ollama list

This shows the model name, ID, size on disk, and when it was last modified:

NAME         ID              SIZE      MODIFIED
gemma3:4b    a2af6cc3eb7f    3.3 GB    3 minutes ago

Download and Switch Between Models

One of Ollama’s strengths is how easily you can pull multiple models and switch between them. Each model is good at different things, so having several available lets you pick the right tool for the job.

Popular Models to Pull

Pull a general-purpose model (Meta’s Llama 3.1, the most downloaded model on Ollama with 111M+ pulls):

ollama pull llama3.1:8b

Pull a reasoning model (DeepSeek R1 excels at math, logic, and complex problem-solving):

ollama pull deepseek-r1:8b

Pull a code generation model (Qwen 2.5 Coder handles code completion, debugging, and explanation):

ollama pull qwen2.5-coder:7b

Pull an embedding model for RAG and semantic search (tiny footprint, essential for document Q&A pipelines):

ollama pull nomic-embed-text

After pulling several models, ollama list shows everything available locally:

ollama list
NAME                   ID              SIZE      MODIFIED
deepseek-r1:8b         28f8fd6cdc67    4.9 GB    2 minutes ago
gemma3:4b              a2af6cc3eb7f    3.3 GB    15 minutes ago
llama3.1:8b            46e0c10c039e    4.7 GB    5 minutes ago
nomic-embed-text       0a109f422b47    274 MB    1 minute ago
qwen2.5-coder:7b       2b0496514337    4.7 GB    3 minutes ago

Switch Between Models

Switching models is as simple as specifying a different name with ollama run. There is no configuration change needed. Ollama unloads the previous model from memory (after the keep-alive timeout) and loads the new one:

ollama run llama3.1:8b "Summarize what systemd does in two sentences."

Now try the same question with the reasoning model:

ollama run deepseek-r1:8b "Summarize what systemd does in two sentences."

Ask the code model to generate a script:

ollama run qwen2.5-coder:7b "Write a bash script that checks disk usage and alerts if any partition exceeds 90%."

Check which models are currently loaded in memory:

ollama ps

This shows the active model, how much memory it uses, and whether it runs on CPU or GPU. By default, Ollama keeps a model loaded for 5 minutes after the last request, then unloads it to free memory.

Model Sizes and Hardware Requirements

Models come in different parameter sizes. Larger models produce better output but need more RAM and a faster GPU to run at usable speeds. The table below covers the most popular models with their hardware requirements, so you can pick the right one for your server:

Model              Parameters  Disk     RAM (CPU)  VRAM (GPU)  Best For
gemma3:1b          1B          815 MB   4 GB       2 GB        Quick answers, edge devices, Raspberry Pi
gemma3:4b          4.3B        3.3 GB   8 GB       4 GB        Fast general chat, vision tasks, low-resource servers
llama3.1:8b        8B          4.7 GB   16 GB      5 GB        General purpose, tool use, most popular Ollama model
deepseek-r1:8b     8B          4.9 GB   16 GB      5 GB        Reasoning, math, logic, problem-solving
qwen2.5-coder:7b   7B          4.7 GB   16 GB      5 GB        Code generation, debugging, code review
qwen2.5:7b         7B          4.7 GB   16 GB      5 GB        Multilingual tasks, 128K token context window
mistral:7b         7B          4.1 GB   16 GB      5 GB        Fast inference, good all-rounder
llama3.1:70b       70B         40 GB    64 GB      40 GB       Near GPT-4 quality, needs serious hardware
deepseek-r1:70b    70B         42 GB    64 GB      40 GB       Advanced reasoning, research-grade math
qwen2.5:72b        72B         43 GB    64 GB      48 GB       Best open multilingual model, 128K context
llama3.1:405b      405B        231 GB   256+ GB    Multi-GPU   Frontier model, requires multi-GPU or cloud
nomic-embed-text   137M        274 MB   4 GB       1 GB        Text embeddings for RAG and semantic search
llava:7b           7B          4.5 GB   16 GB      5 GB        Vision: describe images, read screenshots

A practical rule of thumb: on CPU-only servers, stick to 7B-8B models or smaller. They run at 8-15 tokens/sec on modern hardware, which is usable for interactive chat. The 13B models are noticeably slower on CPU (3-6 tokens/sec), and anything above 30B on CPU alone becomes impractical for real-time use.

With a GPU, the picture changes. An NVIDIA RTX 4090 (24 GB VRAM) runs 7B models at 80+ tokens/sec and can handle 13B models comfortably. For 70B models, you need 48+ GB of VRAM (A6000, dual GPUs, or cloud instances).
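The download sizes in the table track the parameter count closely: at the default Q4_K_M quantization, each weight takes roughly 4.5 bits. A back-of-the-envelope sketch (the 0.57 GB-per-billion-parameters factor is an approximation inferred from the table above, not an official Ollama figure):

```shell
# Rough download-size estimate for a Q4_K_M-quantized model:
# ~4.5 bits per weight, i.e. about 0.57 GB per billion parameters.
est_gb() { awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 0.57 }'; }

est_gb 8    # llama3.1:8b  -> prints 4.6  (actual download: 4.7 GB)
est_gb 70   # llama3.1:70b -> prints 39.9 (actual download: 40 GB)
```

Useful for sanity-checking whether a model will fit on disk (and roughly in RAM) before you start a multi-gigabyte pull.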

Pull a Specific Model Version

Models use a name:tag format. The tag specifies the size variant. For example, Llama 3.1 is available in three sizes:

ollama pull llama3.1:8b
ollama pull llama3.1:70b
ollama pull llama3.1:405b

If you omit the tag, Ollama pulls the default (usually the smallest variant). Always specify the tag explicitly so you know exactly what you are downloading.
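The name:tag format is a plain string convention, which makes it easy to script against. For example, with shell parameter expansion:

```shell
# Split a model reference into name and tag; an absent tag means "latest".
model="llama3.1:70b"
echo "name=${model%%:*} tag=${model#*:}"
# name=llama3.1 tag=70b
```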

Manage Models

Beyond pulling and running, Ollama provides commands for inspecting, copying, removing, and creating custom model variants.

View detailed model information including architecture, quantization, and parameters:

ollama show gemma3:4b

The output shows the parameter count, context window, and quantization level:

  Model
    architecture        gemma3
    parameters          4.3B
    context length      131072
    embedding length    2560
    quantization        Q4_K_M

  Capabilities
    completion
    vision

  Parameters
    stop           ""
    temperature    1
    top_k          64
    top_p          0.95

  License
    Gemma Terms of Use
    Last modified: February 21, 2024

Remove a model to free disk space:

ollama rm deepseek-r1:8b

Copy a model to create a named variant (useful before customizing):

ollama cp gemma3:4b my-assistant

Create Custom Models with a Modelfile

A Modelfile lets you bake in a system prompt, adjust temperature, or apply other default parameters to a base model. This is useful when you want a model that always behaves a certain way without specifying the system prompt on every request. Create the file:

vi Modelfile

Add the following content:

FROM gemma3:4b
SYSTEM "You are a senior Linux systems administrator. Give concise, practical answers with exact commands."
PARAMETER temperature 0.3

Build the custom model and give it a name:

ollama create sysadmin-assistant -f Modelfile

Run your custom model the same way as any other:

ollama run sysadmin-assistant "How do I check which process is using port 443?"

The custom model appears in ollama list alongside the base models and can be used through the API just like any other model.

Configure the Ollama Service

The install script creates a systemd unit at /etc/systemd/system/ollama.service. The default service file looks like this:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=default.target

To customize Ollama’s behavior, create a systemd override instead of editing the service file directly. Overrides survive upgrades:

sudo systemctl edit ollama

This opens an editor where you can add environment variables. Here is an example that changes the bind address, model storage location, and concurrent request limit:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=10m"

Key environment variables and what they control:

Variable             Default                           Purpose
OLLAMA_HOST          127.0.0.1:11434                   Bind address and port for the API
OLLAMA_MODELS        /usr/share/ollama/.ollama/models  Where models are stored on disk
OLLAMA_NUM_PARALLEL  1                                 Concurrent requests per model
OLLAMA_MAX_QUEUE     512                               Maximum queued requests before rejecting
OLLAMA_KEEP_ALIVE    5m                                How long a model stays loaded after the last request

After saving the override, reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Check the logs to confirm the new settings took effect:

journalctl -u ollama --no-pager -n 20

Enable Remote Access

By default, Ollama only listens on 127.0.0.1, rejecting connections from other machines. To serve models to remote clients or a web UI running on a different host, change the bind address and open the firewall.

Set OLLAMA_HOST to bind on all interfaces (if not already done in the previous section):

sudo systemctl edit ollama

Add the bind override:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"

Restart the service:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Firewall: Rocky Linux 10 (firewalld)

Open port 11434 and reload the firewall:

sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --reload

Verify the port is listed:

sudo firewall-cmd --list-ports

The output should include 11434/tcp.

Firewall: Ubuntu 24.04 (ufw)

Allow the port through ufw:

sudo ufw allow 11434/tcp
sudo ufw reload

SELinux on Rocky Linux

Ollama runs cleanly under SELinux enforcing mode with its default configuration. During testing on Rocky Linux 10.1, no AVC denials were logged. Verify this yourself:

sudo ausearch -m avc -ts recent | grep ollama

If this returns <no matches>, SELinux is not blocking anything. If you later add an Nginx reverse proxy in front of Ollama, you will need to allow Nginx to make network connections:

sudo setsebool -P httpd_can_network_connect 1

Test Remote Access

From another machine on the network, query the API (replace 192.168.1.168 with your server’s IP):

curl -s http://192.168.1.168:11434/api/tags | python3 -m json.tool

A successful response lists the models available on the server. If you get a connection refused error, check that OLLAMA_HOST is set to 0.0.0.0 and the firewall port is open.

Use the Ollama REST API

Ollama exposes a REST API on port 11434 that any application can call. This is how web UIs, scripts, and automation tools interact with your models.

List Available Models

curl -s http://localhost:11434/api/tags | python3 -m json.tool

The response includes model name, size, quantization, and family:

{
    "models": [
        {
            "name": "gemma3:4b",
            "model": "gemma3:4b",
            "size": 3338801804,
            "details": {
                "format": "gguf",
                "family": "gemma3",
                "parameter_size": "4.3B",
                "quantization_level": "Q4_K_M"
            }
        }
    ]
}
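For scripting, you often want just the model names out of this response. A quick filter using python3, shown here against a trimmed copy of the sample JSON; on a live server, replace the echo with `curl -s http://localhost:11434/api/tags`:

```shell
# Print one model name per line from an /api/tags response.
echo '{"models":[{"name":"gemma3:4b"},{"name":"llama3.1:8b"}]}' |
  python3 -c 'import json, sys
for m in json.load(sys.stdin)["models"]:
    print(m["name"])'
```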

Generate a Completion

Send a single prompt and get a response. The "stream": false flag returns the full response at once instead of token by token:

curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma3:4b","prompt":"What is Linux? Answer in two sentences.","stream":false}' \
  | python3 -m json.tool

The response includes the generated text along with timing metrics:

{
    "model": "gemma3:4b",
    "response": "Linux is a free and open-source operating system kernel, meaning it's the core of many operating systems like Ubuntu and Fedora. It's known for its stability, flexibility, and strong community support, making it popular for servers, desktops, and embedded systems.",
    "done": true,
    "total_duration": 6631553316,
    "prompt_eval_count": 18,
    "eval_count": 55
}
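The duration fields are in nanoseconds, so generation speed falls out directly. Using the sample numbers above (total_duration includes prompt processing and model load, so this slightly understates pure generation speed; on a live response, the eval_duration field is the better denominator):

```shell
# tokens/sec = eval_count / (duration in seconds); durations are nanoseconds.
awk 'BEGIN { printf "%.1f tokens/sec\n", 55 / (6631553316 / 1e9) }'
# 8.3 tokens/sec
```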

Chat Conversation (Multi-turn)

The chat endpoint accepts a list of messages for multi-turn conversations:

curl -s http://localhost:11434/api/chat \
  -d '{"model":"gemma3:4b","messages":[{"role":"user","content":"Explain DNS in one sentence."}],"stream":false}' \
  | python3 -m json.tool

OpenAI-Compatible Endpoint

Ollama exposes /v1/chat/completions following the OpenAI API format. Any library or tool built for OpenAI works with Ollama by pointing it at a different base URL:

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma3:4b","messages":[{"role":"user","content":"Explain DNS in one sentence."}]}' \
  | python3 -m json.tool

The response follows the standard OpenAI format:

{
  "id": "chatcmpl-751",
  "model": "gemma3:4b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "DNS (Domain Name System) is like a phonebook for the internet, translating human-readable domain names (like google.com) into the numerical IP addresses computers use to locate each other."
      },
      "finish_reason": "stop"
    }
  ]
}

This means you can point the OpenAI Python SDK, LangChain, or any OpenAI-compatible client at http://your-server:11434/v1 and use local models without changing your application code.
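With the official OpenAI Python SDK (v1+), this often needs no code changes at all: the client reads its base URL and API key from the environment. Ollama ignores the key's value, but the SDK requires it to be non-empty, so any placeholder works:

```shell
# Point any OPENAI_* -aware client at the local Ollama server.
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama   # placeholder; Ollama does not check it
```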

NVIDIA GPU Acceleration (Optional)

GPU acceleration is optional but makes a dramatic difference for larger models. A 7B model that generates 8-12 tokens/sec on CPU can exceed 60 tokens/sec on a mid-range GPU. Ollama auto-detects NVIDIA GPUs once drivers are installed.

Rocky Linux 10

Install the NVIDIA driver from the official repository:

sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-10.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel10/x86_64/cuda-rhel10.repo
sudo dnf install -y nvidia-driver nvidia-driver-cuda

Reboot after driver installation:

sudo reboot

Ubuntu 24.04

Use the ubuntu-drivers tool which handles dependency resolution automatically:

sudo ubuntu-drivers install
sudo reboot

Verify GPU Detection

After rebooting, confirm the driver is loaded:

nvidia-smi

This displays your GPU model, driver version, and CUDA version. Restart Ollama and verify GPU detection:

sudo systemctl restart ollama
ollama ps

When a model is running on GPU, the ollama ps output shows 100% GPU in the processor column instead of 100% CPU.

Run Ollama in Docker (Alternative)

If you prefer containerized deployments, Ollama publishes official Docker images. This keeps things isolated and pairs easily with a web UI like Open WebUI. Make sure Docker and Compose are installed on your system.

Run Ollama in CPU mode:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

For NVIDIA GPU support, add the --gpus all flag (requires nvidia-container-toolkit):

docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama

Pull a model inside the running container:

docker exec ollama ollama pull gemma3:4b

For a complete stack, use Docker Compose to run Ollama alongside Open WebUI (a self-hosted ChatGPT-like interface). Create the compose file:

vi docker-compose.yml

Add the following:

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:

Start both services:

docker compose up -d

Open WebUI will be available at http://your-server-ip:3000. The first user to register becomes the admin.

Rocky Linux 10 vs Ubuntu 24.04 Comparison

The core Ollama experience is identical on both platforms. The differences are in firewall management, security modules, and GPU driver installation:

Item                   Rocky Linux 10                                Ubuntu 24.04
Ollama install         Needs zstd first, then same curl script       Same curl script
Binary location        /usr/local/bin/ollama                         /usr/local/bin/ollama
Model storage          /usr/share/ollama/.ollama/models              /usr/share/ollama/.ollama/models
Service management     systemctl (identical)                         systemctl (identical)
Firewall               firewall-cmd --add-port=11434/tcp             ufw allow 11434/tcp
Security module        SELinux enforcing (no issues observed)        AppArmor (no configuration needed)
GPU driver (NVIDIA)    dnf install nvidia-driver nvidia-driver-cuda  ubuntu-drivers install
GPU driver (AMD ROCm)  dnf install rocm-hip-runtime                  apt install rocm-hip-runtime

Troubleshooting

Error: “could not connect to ollama app, is it running?”

The Ollama service is not running. Start and enable it:

sudo systemctl start ollama
sudo systemctl enable ollama

Check the logs for the root cause:

journalctl -u ollama --no-pager -n 30

Error: “bind: address already in use” on port 11434

Another process is holding port 11434. Find it:

ss -tlnp | grep 11434

If it is a stale Ollama process, kill it and restart the service. You can also change the port by setting OLLAMA_HOST=0.0.0.0:11435 in the systemd override.

Rocky Linux: “This version requires zstd for extraction”

Rocky Linux 10 minimal installs do not include zstd. The Ollama installer needs it for the compressed archive. Install it and re-run:

sudo dnf install -y zstd
curl -fsSL https://ollama.com/install.sh | sh

Going Further

With Ollama running, here are the natural next steps depending on your use case:

  • Web interface: Pair Ollama with Open WebUI for a full ChatGPT-like experience your team can share
  • Production serving: For multi-user workloads, consider vLLM which handles significantly higher concurrent request volumes
  • RAG pipelines: Combine Ollama with PostgreSQL and pgvector to build a self-hosted document Q&A system without cloud API dependencies
  • Embeddings: Pull nomic-embed-text and use the /api/embed endpoint to generate vector embeddings for semantic search
  • Model exploration: Try DeepSeek R1 for reasoning, Qwen 2.5 Coder for code generation, or Llama 3.1 as a general-purpose workhorse
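To make the embeddings point concrete: once /api/embed gives you vectors, semantic search reduces to cosine similarity between them. A self-contained sketch in awk over two toy 4-dimensional vectors (real nomic-embed-text vectors have 768 dimensions and would be parsed out of the API's JSON rather than hardcoded):

```shell
# Cosine similarity between two whitespace-separated vectors, one per line.
printf '0.1 0.3 0.5 0.2\n0.2 0.1 0.4 0.3\n' | awk '
NR == 1 { for (i = 1; i <= NF; i++) a[i] = $i }
NR == 2 { for (i = 1; i <= NF; i++) b[i] = $i }
END {
  for (i in a) { dot += a[i] * b[i]; na += a[i]^2; nb += b[i]^2 }
  printf "%.3f\n", dot / (sqrt(na) * sqrt(nb))   # 1.000 = identical direction
}'
# 0.906
```

A document Q&A pipeline embeds each document chunk once, embeds the query at request time, and returns the chunks with the highest similarity scores.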
