Running large language models on your own hardware means no API costs, no data leaving your network, and no rate limits. Ollama makes this straightforward with a single install command, built-in model management, and an OpenAI-compatible REST API that works as a drop-in replacement for cloud endpoints.
This guide covers installing Ollama on Rocky Linux 10 and Ubuntu 24.04, pulling and running models, configuring the systemd service for remote access, firewall rules (firewalld and ufw), the REST API with practical examples, and running Ollama in Docker. Both OS families are tested side by side so you can follow along on whichever you run.
Tested March 2026 on Rocky Linux 10.1 (SELinux enforcing) and Ubuntu 24.04.2 LTS with Ollama v0.18.2, Gemma 3 4B
Prerequisites
- Rocky Linux 10 or Ubuntu 24.04 server (fresh or existing)
- Minimum 8 GB RAM for small models (1B-3B). 16 GB recommended for 7B models
- 20 GB free disk space for model storage
- (Optional) NVIDIA GPU with compute capability 5.0+ for hardware acceleration
- Root or sudo access
- Internet access to download Ollama and pull models
Ollama runs in CPU-only mode by default and auto-detects NVIDIA or AMD GPUs when drivers are present. You do not need a GPU to follow this guide.
Install Ollama
Ollama ships as a single binary. The official install script downloads it to /usr/local/bin, creates a dedicated ollama system user and group, and sets up a systemd service. There is no RPM or DEB repository involved.
Rocky Linux 10 requires zstd for extracting the compressed archive. Install it first:
sudo dnf install -y zstd
Then run the install script on either OS:
curl -fsSL https://ollama.com/install.sh | sh
The installer finishes with a confirmation and the API address:
>>> Installing ollama to /usr/local
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
The GPU warning is expected on CPU-only systems. Ollama works fine without a GPU, just slower on larger models.
Verify the installed version:
ollama --version
You should see the version number confirmed:
ollama version is 0.18.2
Confirm the service is running:
systemctl status ollama
The output shows active (running) with the process details:
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-03-25 02:40:40 EAT; 2min ago
   Main PID: 5120 (ollama)
      Tasks: 9 (limit: 100206)
     Memory: 11.8M (peak: 26.2M)
        CPU: 52ms
     CGroup: /system.slice/ollama.service
             └─5120 /usr/local/bin/ollama serve
Pull and Run Your First Model
Ollama hosts a library of 200+ models you can pull with a single command. For a first test, Gemma 3 4B from Google is a solid choice because it fits in 8 GB of RAM and responds quickly on CPU.
ollama pull gemma3:4b
The download takes a few minutes depending on your connection (the model is about 3.3 GB):
pulling manifest
pulling aeda25e63ebd... 100% 3.3 GB
pulling e0a42594d802... 100% 358 B
pulling dd084c7d92a3... 100% 8.4 KB
pulling 3116c5225075... 100% 77 B
pulling b6ae5839783f... 100% 489 B
verifying sha256 digest
writing manifest
success
Start an interactive chat session:
ollama run gemma3:4b
This drops you into a prompt where you can type questions. On a 4-core CPU with 16 GB RAM, expect roughly 8-12 tokens per second with the 4B model. Type /bye to exit the session.
You can also pipe a prompt directly without entering interactive mode:
echo "What is the capital of France? Answer in one sentence." | ollama run gemma3:4b
The model responds immediately:
The capital of France is Paris.
List all downloaded models:
ollama list
This shows the model name, ID, size on disk, and when it was last modified:
NAME         ID              SIZE      MODIFIED
gemma3:4b    a2af6cc3eb7f    3.3 GB    3 minutes ago
Download and Switch Between Models
One of Ollama’s strengths is how easily you can pull multiple models and switch between them. Each model is good at different things, so having several available lets you pick the right tool for the job.
Popular Models to Pull
Pull a general-purpose model (Meta’s Llama 3.1, the most downloaded model on Ollama with 111M+ pulls):
ollama pull llama3.1:8b
Pull a reasoning model (DeepSeek R1 excels at math, logic, and complex problem-solving):
ollama pull deepseek-r1:8b
Pull a code generation model (Qwen 2.5 Coder handles code completion, debugging, and explanation):
ollama pull qwen2.5-coder:7b
Pull an embedding model for RAG and semantic search (tiny footprint, essential for document Q&A pipelines):
ollama pull nomic-embed-text
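Embedding models are used through the /api/embed endpoint rather than ollama run. As a sketch of how they slot into a semantic-search pipeline, the snippet below (standard library only; the helper names and the localhost URL are placeholders for your setup) embeds a few strings and compares them with cosine similarity:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust to your server

def embed(texts, model="nomic-embed-text"):
    """Request embeddings from Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embed", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

if __name__ == "__main__":
    vecs = embed(["What is DNS?",
                  "Explain the Domain Name System.",
                  "How do I bake bread?"])
    # Related questions score higher than unrelated ones
    print(cosine(vecs[0], vecs[1]))
    print(cosine(vecs[0], vecs[2]))
```

In a real RAG pipeline you would store the vectors in a database (pgvector, for example) and rank documents by this similarity score at query time.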
After pulling several models, ollama list shows everything available locally:
ollama list
NAME                ID              SIZE      MODIFIED
deepseek-r1:8b      28f8fd6cdc67    4.9 GB    2 minutes ago
gemma3:4b           a2af6cc3eb7f    3.3 GB    15 minutes ago
llama3.1:8b         46e0c10c039e    4.7 GB    5 minutes ago
nomic-embed-text    0a109f422b47    274 MB    1 minute ago
qwen2.5-coder:7b    2b0496514337    4.7 GB    3 minutes ago
Switch Between Models
Switching models is as simple as specifying a different name with ollama run. There is no configuration change needed. Ollama unloads the previous model from memory (after the keep-alive timeout) and loads the new one:
ollama run llama3.1:8b "Summarize what systemd does in two sentences."
Now try the same question with the reasoning model:
ollama run deepseek-r1:8b "Summarize what systemd does in two sentences."
Ask the code model to generate a script:
ollama run qwen2.5-coder:7b "Write a bash script that checks disk usage and alerts if any partition exceeds 90%."
Check which models are currently loaded in memory:
ollama ps
This shows the active model, how much memory it uses, and whether it runs on CPU or GPU. By default, Ollama keeps a model loaded for 5 minutes after the last request, then unloads it to free memory.
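The keep-alive window can also be overridden per request: the generate and chat endpoints accept a keep_alive field in the request body. A minimal payload builder as a sketch (the function name is illustrative, not part of Ollama):

```python
import json

def generate_payload(model, prompt, keep_alive="10m", stream=False):
    """Request body for /api/generate. The keep_alive field overrides
    the server default for this one request: 0 unloads the model as
    soon as the response finishes, -1 keeps it loaded indefinitely."""
    return {"model": model, "prompt": prompt,
            "keep_alive": keep_alive, "stream": stream}

body = json.dumps(generate_payload("gemma3:4b", "Hello", keep_alive=0))
```

The same field works on /api/chat, which is handy when a batch job should release memory immediately instead of holding the model for the default five minutes.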
Model Sizes and Hardware Requirements
Models come in different parameter sizes. Larger models produce better output but need more RAM and a faster GPU to run at usable speeds. The table below covers the most popular models with their hardware requirements, so you can pick the right one for your server:
| Model | Parameters | Disk | RAM (CPU) | VRAM (GPU) | Best For |
|---|---|---|---|---|---|
| gemma3:1b | 1B | 815 MB | 4 GB | 2 GB | Quick answers, edge devices, Raspberry Pi |
| gemma3:4b | 4.3B | 3.3 GB | 8 GB | 4 GB | Fast general chat, vision tasks, low-resource servers |
| llama3.1:8b | 8B | 4.7 GB | 16 GB | 5 GB | General purpose, tool use, most popular Ollama model |
| deepseek-r1:8b | 8B | 4.9 GB | 16 GB | 5 GB | Reasoning, math, logic, problem-solving |
| qwen2.5-coder:7b | 7B | 4.7 GB | 16 GB | 5 GB | Code generation, debugging, code review |
| qwen2.5:7b | 7B | 4.7 GB | 16 GB | 5 GB | Multilingual tasks, 128K token context window |
| mistral:7b | 7B | 4.1 GB | 16 GB | 5 GB | Fast inference, good all-rounder |
| llama3.1:70b | 70B | 40 GB | 64 GB | 40 GB | Near GPT-4 quality, needs serious hardware |
| deepseek-r1:70b | 70B | 42 GB | 64 GB | 40 GB | Advanced reasoning, research-grade math |
| qwen2.5:72b | 72B | 43 GB | 64 GB | 48 GB | Best open multilingual model, 128K context |
| llama3.1:405b | 405B | 231 GB | 256+ GB | Multi-GPU | Frontier model, requires multi-GPU or cloud |
| nomic-embed-text | 137M | 274 MB | 4 GB | 1 GB | Text embeddings for RAG and semantic search |
| llava:7b | 7B | 4.5 GB | 16 GB | 5 GB | Vision: describe images, read screenshots |
A practical rule of thumb: on CPU-only servers, stick to 7B-8B models or smaller. They run at 8-15 tokens/sec on modern hardware, which is usable for interactive chat. The 13B models are noticeably slower on CPU (3-6 tokens/sec), and anything above 30B on CPU alone becomes impractical for real-time use.
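If a model is not in the table, you can approximate its footprint yourself. A rough sketch, assuming the common Q4_K_M quantization averages about 4.85 bits per weight (an approximation; the exact ratio varies by architecture and quantization):

```python
def est_disk_gb(params_billion, bits_per_weight=4.85):
    """Approximate on-disk size of a quantized model: parameter count
    times bits per weight. Q4_K_M averages roughly 4.85 bits/weight."""
    return params_billion * bits_per_weight / 8

# An 8B model at Q4_K_M comes out just under 5 GB, in line with the
# ~4.7 GB figures in the table above. Practical RAM recommendations
# run 2-3x the file size to leave room for KV cache, activations,
# and the OS.
print(f"{est_disk_gb(8):.1f} GB")
```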
With a GPU, the picture changes. An NVIDIA RTX 4090 (24 GB VRAM) runs 7B models at 80+ tokens/sec and can handle 13B models comfortably. For 70B models, you need 48+ GB of VRAM (A6000, dual GPUs, or cloud instances).
Pull a Specific Model Version
Models use a name:tag format. The tag specifies the size variant. For example, Llama 3.1 is available in three sizes:
ollama pull llama3.1:8b
ollama pull llama3.1:70b
ollama pull llama3.1:405b
If you omit the tag, Ollama pulls the default (usually the smallest variant). Always specify the tag explicitly so you know exactly what you are downloading.
Manage Models
Beyond pulling and running, Ollama provides commands for inspecting, copying, removing, and creating custom model variants.
View detailed model information including architecture, quantization, and parameters:
ollama show gemma3:4b
The output shows the parameter count, context window, and quantization level:
  Model
    architecture        gemma3
    parameters          4.3B
    context length      131072
    embedding length    2560
    quantization        Q4_K_M

  Capabilities
    completion
    vision

  Parameters
    stop           "<end_of_turn>"
    temperature    1
    top_k          64
    top_p          0.95

  License
    Gemma Terms of Use
    Last modified: February 21, 2024
Remove a model to free disk space:
ollama rm deepseek-r1:8b
Copy a model to create a named variant (useful before customizing):
ollama cp gemma3:4b my-assistant
Create Custom Models with a Modelfile
A Modelfile lets you bake in a system prompt, adjust temperature, or apply other default parameters to a base model. This is useful when you want a model that always behaves a certain way without specifying the system prompt on every request. Create the file:
vi Modelfile
Add the following content:
FROM gemma3:4b
SYSTEM "You are a senior Linux systems administrator. Give concise, practical answers with exact commands."
PARAMETER temperature 0.3
Build the custom model and give it a name:
ollama create sysadmin-assistant -f Modelfile
Run your custom model the same way as any other:
ollama run sysadmin-assistant "How do I check which process is using port 443?"
The custom model appears in ollama list alongside the base models and can be used through the API just like any other model.
Configure the Ollama Service
The install script creates a systemd unit at /etc/systemd/system/ollama.service. The default service file looks like this:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
[Install]
WantedBy=default.target
To customize Ollama’s behavior, create a systemd override instead of editing the service file directly. Overrides survive upgrades:
sudo systemctl edit ollama
This opens an editor where you can add environment variables. Here is an example that changes the bind address, model storage location, and concurrent request limit:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_KEEP_ALIVE=10m"
Key environment variables and what they control:
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Bind address and port for the API |
| OLLAMA_MODELS | /usr/share/ollama/.ollama/models | Where models are stored on disk |
| OLLAMA_NUM_PARALLEL | 1 | Concurrent requests per model |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests before rejecting |
| OLLAMA_KEEP_ALIVE | 5m | How long a model stays loaded after last request |
After saving the override, reload systemd and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Check the logs to confirm the new settings took effect:
journalctl -u ollama --no-pager -n 20
Enable Remote Access
By default, Ollama only listens on 127.0.0.1, rejecting connections from other machines. To serve models to remote clients or a web UI running on a different host, change the bind address and open the firewall.
Set OLLAMA_HOST to bind on all interfaces (if not already done in the previous section):
sudo systemctl edit ollama
Add the bind override:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Firewall: Rocky Linux 10 (firewalld)
Open port 11434 and reload the firewall:
sudo firewall-cmd --permanent --add-port=11434/tcp
sudo firewall-cmd --reload
Verify the port is listed:
sudo firewall-cmd --list-ports
The output should include 11434/tcp.
Firewall: Ubuntu 24.04 (ufw)
Allow the port through ufw:
sudo ufw allow 11434/tcp
sudo ufw reload
SELinux on Rocky Linux
Ollama runs cleanly under SELinux enforcing mode with its default configuration. During testing on Rocky Linux 10.1, no AVC denials were logged. Verify this yourself:
sudo ausearch -m avc -ts recent | grep ollama
If this returns <no matches>, SELinux is not blocking anything. If you later add an Nginx reverse proxy in front of Ollama, you will need to allow Nginx to make network connections:
sudo setsebool -P httpd_can_network_connect 1
Test Remote Access
From another machine on the network, query the API (replace 192.168.1.168 with your server’s IP):
curl -s http://192.168.1.168:11434/api/tags | python3 -m json.tool
A successful response lists the models available on the server. If you get a connection refused error, check that OLLAMA_HOST is set to 0.0.0.0 and the firewall port is open.
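The same check can be scripted for monitoring or as a readiness probe. A standard-library sketch (the helper names and the IP are placeholders for your environment):

```python
import json
import sys
import urllib.error
import urllib.request

def model_names(tags):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags.get("models", [])]

def check(host="192.168.1.168", port=11434, timeout=5):
    """Return True if the Ollama API answers and list its models."""
    url = f"http://{host}:{port}/api/tags"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            names = model_names(json.load(resp))
            print(f"OK: {len(names)} model(s): {', '.join(names)}")
            return True
    except (urllib.error.URLError, OSError) as exc:
        print(f"FAILED: {exc}", file=sys.stderr)
        return False

if __name__ == "__main__":
    check(sys.argv[1] if len(sys.argv) > 1 else "192.168.1.168")
```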
Use the Ollama REST API
Ollama exposes a REST API on port 11434 that any application can call. This is how web UIs, scripts, and automation tools interact with your models.
List Available Models
curl -s http://localhost:11434/api/tags | python3 -m json.tool
The response includes model name, size, quantization, and family:
{
  "models": [
    {
      "name": "gemma3:4b",
      "model": "gemma3:4b",
      "size": 3338801804,
      "details": {
        "format": "gguf",
        "family": "gemma3",
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
Generate a Completion
Send a single prompt and get a response. The "stream": false flag returns the full response at once instead of token by token:
curl -s http://localhost:11434/api/generate \
-d '{"model":"gemma3:4b","prompt":"What is Linux? Answer in two sentences.","stream":false}' \
| python3 -m json.tool
The response includes the generated text along with timing metrics:
{
  "model": "gemma3:4b",
  "response": "Linux is a free and open-source operating system kernel, meaning it's the core of many operating systems like Ubuntu and Fedora. It's known for its stability, flexibility, and strong community support, making it popular for servers, desktops, and embedded systems.",
  "done": true,
  "total_duration": 6631553316,
  "prompt_eval_count": 18,
  "eval_count": 55
}
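The duration fields are in nanoseconds, so generation speed falls straight out of the response. A small helper as a sketch (recent Ollama versions also return an eval_duration field covering generation time only, which gives a more accurate figure than total_duration when present):

```python
def tokens_per_second(resp):
    """Generation speed from an /api/generate response. Durations are
    nanoseconds; prefer eval_duration (generation time only), fall
    back to total_duration if it is absent."""
    ns = resp.get("eval_duration") or resp["total_duration"]
    return resp["eval_count"] / (ns / 1e9)

sample = {"total_duration": 6631553316, "eval_count": 55}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # ~8.3 for this CPU run
```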
Chat Conversation (Multi-turn)
The chat endpoint accepts a list of messages for multi-turn conversations:
curl -s http://localhost:11434/api/chat \
-d '{"model":"gemma3:4b","messages":[{"role":"user","content":"Explain DNS in one sentence."}],"stream":false}' \
| python3 -m json.tool
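The server is stateless between calls: the model sees only the messages you send, so multi-turn memory means resending the growing history each time. A minimal standard-library sketch (the helper names and the localhost URL are placeholders):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # adjust to your server

def chat(history, model="gemma3:4b"):
    """Send the full message history to /api/chat and return the
    assistant's reply message."""
    body = json.dumps({"model": model, "messages": history,
                       "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]

def add_turn(history, role, content):
    """Return a new history with one turn appended; the model only
    'remembers' what you resend on the next call."""
    return history + [{"role": role, "content": content}]

if __name__ == "__main__":
    history = add_turn([], "user", "Explain DNS in one sentence.")
    reply = chat(history)
    history = add_turn(history, reply["role"], reply["content"])
    history = add_turn(history, "user", "Now explain it to a five-year-old.")
    print(chat(history)["content"])
```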
OpenAI-Compatible Endpoint
Ollama exposes /v1/chat/completions following the OpenAI API format. Any library or tool built for OpenAI works with Ollama by pointing it at a different base URL:
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma3:4b","messages":[{"role":"user","content":"Explain DNS in one sentence."}]}' \
| python3 -m json.tool
The response follows the standard OpenAI format:
{
  "id": "chatcmpl-751",
  "model": "gemma3:4b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "DNS (Domain Name System) is like a phonebook for the internet, translating human-readable domain names (like google.com) into the numerical IP addresses computers use to locate each other."
      },
      "finish_reason": "stop"
    }
  ]
}
This means you can point the OpenAI Python SDK, LangChain, or any OpenAI-compatible client at http://your-server:11434/v1 and use local models without changing your application code.
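As a sketch of what that looks like without any SDK, the snippet below speaks the OpenAI wire format with only the Python standard library; swapping in the official openai package is just a matter of pointing base_url at http://localhost:11434/v1 (the api_key can be any placeholder string, since Ollama ignores it by default):

```python
import json
import urllib.request

def openai_chat_body(model, messages, temperature=0.7):
    """Build an OpenAI-format chat.completions request body."""
    return json.dumps({"model": model, "messages": messages,
                       "temperature": temperature}).encode()

def ask(prompt, model="gemma3:4b", base="http://localhost:11434/v1"):
    """One-shot question against the OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=openai_chat_body(model, [{"role": "user", "content": prompt}]),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Explain DNS in one sentence."))
```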
NVIDIA GPU Acceleration (Optional)
GPU acceleration is optional but makes a dramatic difference for larger models. A 7B model that generates 8-12 tokens/sec on CPU can exceed 60 tokens/sec on a mid-range GPU. Ollama auto-detects NVIDIA GPUs once drivers are installed.
Rocky Linux 10
Install the NVIDIA driver from the official repository:
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-10.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel10/x86_64/cuda-rhel10.repo
sudo dnf install -y nvidia-driver nvidia-driver-cuda
Reboot after driver installation:
sudo reboot
Ubuntu 24.04
Use the ubuntu-drivers tool which handles dependency resolution automatically:
sudo ubuntu-drivers install
sudo reboot
Verify GPU Detection
After rebooting, confirm the driver is loaded:
nvidia-smi
This displays your GPU model, driver version, and CUDA version. Restart Ollama and verify GPU detection:
sudo systemctl restart ollama
ollama ps
When a model is running on GPU, the ollama ps output shows 100% GPU in the processor column instead of 100% CPU.
Run Ollama in Docker (Alternative)
If you prefer containerized deployments, Ollama publishes official Docker images. This keeps things isolated and pairs easily with a web UI like Open WebUI. Make sure Docker and Compose are installed on your system.
Run Ollama in CPU mode:
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama
For NVIDIA GPU support, add the --gpus all flag (requires nvidia-container-toolkit):
docker run -d \
  --name ollama \
  --gpus all \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama
Pull a model inside the running container:
docker exec ollama ollama pull gemma3:4b
For a complete stack, use Docker Compose to run Ollama alongside Open WebUI (a self-hosted ChatGPT-like interface). Create the compose file:
vi docker-compose.yml
Add the following:
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
Start both services:
docker compose up -d
Open WebUI will be available at http://your-server-ip:3000. The first user to register becomes the admin.
Rocky Linux 10 vs Ubuntu 24.04 Comparison
The core Ollama experience is identical on both platforms. The differences are in firewall management, security modules, and GPU driver installation:
| Item | Rocky Linux 10 | Ubuntu 24.04 |
|---|---|---|
| Ollama install | Needs zstd first, then same curl script | Same curl script |
| Binary location | /usr/local/bin/ollama | /usr/local/bin/ollama |
| Model storage | /usr/share/ollama/.ollama/models | /usr/share/ollama/.ollama/models |
| Service management | systemctl (identical) | systemctl (identical) |
| Firewall | firewall-cmd --add-port=11434/tcp | ufw allow 11434/tcp |
| Security module | SELinux enforcing (no issues observed) | AppArmor (no configuration needed) |
| GPU driver (NVIDIA) | dnf install nvidia-driver nvidia-driver-cuda | ubuntu-drivers install |
| GPU driver (AMD ROCm) | dnf install rocm-hip-runtime | apt install rocm-hip-runtime |
Troubleshooting
Error: “could not connect to ollama app, is it running?”
The Ollama service is not running. Start and enable it:
sudo systemctl start ollama
sudo systemctl enable ollama
Check the logs for the root cause:
journalctl -u ollama --no-pager -n 30
Error: “bind: address already in use” on port 11434
Another process is holding port 11434. Find it:
ss -tlnp | grep 11434
If it is a stale Ollama process, kill it and restart the service. You can also change the port by setting OLLAMA_HOST=0.0.0.0:11435 in the systemd override.
Rocky Linux: “This version requires zstd for extraction”
Rocky Linux 10 minimal installs do not include zstd. The Ollama installer needs it for the compressed archive. Install it and re-run:
sudo dnf install -y zstd
curl -fsSL https://ollama.com/install.sh | sh
Going Further
With Ollama running, here are the natural next steps depending on your use case:
- Web interface: Pair Ollama with Open WebUI for a full ChatGPT-like experience your team can share
- Production serving: For multi-user workloads, consider vLLM which handles significantly higher concurrent request volumes
- RAG pipelines: Combine Ollama with PostgreSQL and pgvector to build a self-hosted document Q&A system without cloud API dependencies
- Embeddings: Pull nomic-embed-text and use the /api/embed endpoint to generate vector embeddings for semantic search
- Model exploration: Try DeepSeek R1 for reasoning, Qwen 2.5 Coder for code generation, or Llama 3.1 as a general-purpose workhorse