Every time I need to check a model’s parameter count or hit the embedding endpoint, I end up scrolling through docs. This cheat sheet is the reference I keep coming back to. It covers every Ollama CLI command and REST API endpoint with tested examples you can copy and run.
Ollama 0.18.x ships with a full CLI for model management, an interactive chat interface, and a REST API that includes OpenAI-compatible endpoints. Whether you are scripting model deployments, building applications against the API, or just need to remember the right curl command, this page has it. If you need installation instructions first, see Install Ollama on Rocky Linux 10 / Ubuntu 24.04.
Tested March 2026 on Rocky Linux 10.1 with Ollama 0.18.2, gemma3:4b (Q4_K_M)
Model Management
These commands handle downloading, inspecting, copying, and removing models from the local registry.
Pull a model
Downloads a model from the Ollama library. Tag defaults to latest if omitted.
ollama pull gemma3:4b
List local models
Shows all models stored on disk with their size and last modified time.
ollama list
The output includes the model ID, size, and when it was last updated:
NAME         ID              SIZE      MODIFIED
gemma3:4b    a2af6cc3eb7f    3.3 GB    42 seconds ago
List running models
Shows models currently loaded in memory, including processor allocation and context size.
ollama ps
On a CPU-only system, all processing runs on the CPU:
NAME         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
gemma3:4b    a2af6cc3eb7f    4.3 GB    100% CPU     4096       4 minutes from now
Show model details
Displays architecture, parameter count, quantization, context length, and license info.
ollama show gemma3:4b
Output confirms the model architecture and capabilities:
  Model
    architecture        gemma3
    parameters          4.3B
    context length      131072
    embedding length    2560
    quantization        Q4_K_M

  Capabilities
    completion
    vision

  Parameters
    stop           "<end_of_turn>"
    temperature    1
    top_k          64
    top_p          0.95
You can also extract specific sections with flags:
ollama show gemma3:4b --modelfile
ollama show gemma3:4b --parameters
ollama show gemma3:4b --template
Copy a model
Creates a new reference to an existing model. Useful for creating custom variants without re-downloading weights.
ollama cp gemma3:4b mymodel:latest
Both names now point to the same model ID:
copied 'gemma3:4b' to 'mymodel:latest'
Remove a model
Deletes a model from disk. The blob files are only removed when no other model references them.
ollama rm mymodel:latest
Confirmation:
deleted 'mymodel:latest'
Stop a running model
Unloads a model from memory without stopping the Ollama server.
ollama stop gemma3:4b
Running Models
The ollama run command handles both interactive chat sessions and one-shot prompts from the command line.
Interactive chat
Opens an interactive session. Type /bye to exit.
ollama run gemma3:4b
One-shot prompt
Pass the prompt as a second argument. Ollama prints the response and exits.
ollama run gemma3:4b "What port does PostgreSQL use?"
Piped input
Pipe text from another command or file. Great for scripting.
echo "What is the capital of France?" | ollama run gemma3:4b
Response:
The capital of France is Paris.
JSON output
Force the model to respond with valid JSON using --format json. Include “JSON” in your prompt so the model understands what structure you want.
echo "List 3 Linux distributions. Return JSON array with name and year fields." | ollama run gemma3:4b --format json
The model returns structured JSON:
{
  "distributions": [
    {"name": "Ubuntu", "year": 2004},
    {"name": "Debian", "year": 1993},
    {"name": "Fedora", "year": 2003}
  ]
}
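Because --format json guarantees well-formed JSON but not any particular schema, it pays to validate the structure before a script relies on it. A minimal Python sketch of that check (the distributions key comes from the sample above; the actual shape depends entirely on your prompt):

```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse model output produced with --format json.

    The flag guarantees syntactically valid JSON, but not any particular
    schema, so verify the keys you expect before using them.
    """
    data = json.loads(raw)
    if "distributions" not in data:
        raise ValueError("model returned JSON without the expected key")
    return data

# Sample shaped like the output above
raw = '{"distributions": [{"name": "Ubuntu", "year": 2004}]}'
parsed = parse_model_json(raw)
print(parsed["distributions"][0]["name"])  # Ubuntu
```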
Verbose output with timing stats
Add --verbose to see token generation speed and load times. Useful for benchmarking.
ollama run gemma3:4b --verbose "What is 2+2? Reply with just the number."
After the response, timing stats are printed:
4
total duration:       1.681255883s
load duration:        495.619141ms
prompt eval count:    22 token(s)
prompt eval duration: 956.787989ms
prompt eval rate:     22.99 tokens/s
eval count:           3 token(s)
eval duration:        224.407368ms
eval rate:            13.37 tokens/s
Keep model loaded
By default, models unload after 5 minutes of inactivity. Override this with --keepalive.
ollama run gemma3:4b --keepalive 30m "Hello"
Set to -1 to keep a model loaded indefinitely, or 0 to unload immediately after the request.
Server Management
Ollama runs as a systemd service on Linux. These commands cover starting, stopping, and configuring the server.
Start the server manually
If you need to run Ollama outside of systemd (for debugging, for example):
ollama serve
Systemd service commands
The standard install creates an ollama.service unit. Manage it with systemctl:
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama
Verify the service is active:
● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-03-25 03:25:07 EAT; 2s ago
   Main PID: 5181 (ollama)
      Tasks: 9 (limit: 100206)
     Memory: 11.7M (peak: 24.7M)
        CPU: 49ms
     CGroup: /system.slice/ollama.service
             └─5181 /usr/local/bin/ollama serve
View logs
Check the journal for errors, request logs, and model load events:
sudo journalctl -u ollama -f
Sample log entries showing API requests:
Mar 25 03:32:37 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:37 | 200 | 3.808679349s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:32:54 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:54 | 200 | 11.318770126s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:33:01 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:33:01 | 200 | 539.182µs | 127.0.0.1 | POST "/api/copy"
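Those [GIN] lines follow a fixed pipe-separated layout (status, latency, client, request), which makes them easy to mine for latency stats. A Python sketch that parses one line, with unit handling for GIN's usual ns/µs/ms/s latency suffixes:

```python
import re

# One [GIN] access-log line as printed by journalctl (sample from above)
LINE = ('Mar 25 03:32:37 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25'
        ' - 03:32:37 | 200 | 3.808679349s | 127.0.0.1 | POST "/api/generate"')

UNITS = {"ns": 1e-9, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

def latency_seconds(text: str) -> float:
    """Convert a GIN latency like '539.182µs' or '3.8s' to seconds."""
    m = re.fullmatch(r"([\d.]+)(ns|µs|ms|s)", text.strip())
    if not m:
        raise ValueError(f"unrecognized latency: {text!r}")
    return float(m.group(1)) * UNITS[m.group(2)]

def parse_gin_line(line: str):
    """Split a [GIN] line into (status, latency_in_seconds, client, request)."""
    _, status, latency, client, request = line.split(" | ")
    return int(status), latency_seconds(latency), client, request

status, latency, client, request = parse_gin_line(LINE)
print(status, round(latency, 3), request)  # 200 3.809 POST "/api/generate"
```

Feed it `journalctl -u ollama --no-pager` output to spot slow requests.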
Set environment variables via systemd override
To configure Ollama to listen on all interfaces or change the model storage path, create a systemd override file:
sudo systemctl edit ollama
Add the environment variables you need between the comments:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_FLASH_ATTENTION=1"
Reload and restart for changes to take effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
REST API: Generate and Chat
The Ollama API listens on port 11434 by default. All endpoints accept JSON. Set "stream": false to get the complete response in a single JSON object instead of a stream of tokens.
POST /api/generate
Single-turn text generation. Takes a prompt string and returns a completion.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue? Answer in one sentence.",
  "stream": false
}'
Response JSON (trimmed for readability):
{
  "model": "gemma3:4b",
  "created_at": "2026-03-25T00:29:37.830168406Z",
  "response": "The sky appears blue due to a phenomenon called Rayleigh scattering, where shorter wavelengths of sunlight (like blue) are scattered more by the Earth's atmosphere than longer wavelengths.",
  "done": true,
  "done_reason": "stop",
  "total_duration": 4509574648,
  "load_duration": 498389364,
  "prompt_eval_count": 20,
  "eval_count": 35
}
All *_duration fields are in nanoseconds, and eval_count is the number of generated tokens. To calculate tokens per second, divide eval_count by eval_duration after converting it to seconds.
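That conversion is easy to get wrong by a factor of 10^9, so here it is as a tiny Python helper (the eval_duration value below is illustrative, since the trimmed response above omits that field):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from /api/generate response fields.

    All *_duration fields in Ollama API responses are nanoseconds.
    """
    return eval_count / (eval_duration_ns / 1e9)

# 35 tokens in 2.5 seconds of generation time
print(round(tokens_per_second(35, 2_500_000_000), 2))  # 14.0
```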
POST /api/chat
Multi-turn conversation endpoint. Pass a messages array with role/content pairs, just like the OpenAI Chat API format.
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {"role": "system", "content": "You are a Linux sysadmin. Be brief."},
    {"role": "user", "content": "What is DNS?"}
  ],
  "stream": false
}'
The response wraps the assistant message in a message object:
{
  "model": "gemma3:4b",
  "created_at": "2026-03-25T00:29:50.789139412Z",
  "message": {
    "role": "assistant",
    "content": "DNS (Domain Name System) is a hierarchical and distributed naming system that translates human-readable domain names (like google.com) into the numerical IP addresses computers use to communicate with each other."
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 5747578407,
  "load_duration": 268908121,
  "prompt_eval_count": 16,
  "eval_count": 40
}
POST /api/generate with JSON format
Force structured JSON output by setting "format": "json" in the request body. The model will only return valid JSON.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
  "format": "json",
  "stream": false
}'
The response field contains valid JSON that you can parse directly:
{
  "distributions": [
    {"name": "Ubuntu", "year": 2004},
    {"name": "Debian", "year": 1993},
    {"name": "Fedora", "year": 2003}
  ]
}
GET /api/tags
Lists all models available locally. Same data as ollama list but in JSON.
curl -s http://localhost:11434/api/tags | python3 -m json.tool
Each model entry includes its digest, size, parameter count, and quantization level:
{
  "models": [
    {
      "name": "gemma3:4b",
      "model": "gemma3:4b",
      "modified_at": "2026-03-25T03:27:22.913735934+03:00",
      "size": 3338801804,
      "digest": "a2af6cc3eb7fa8be...",
      "details": {
        "format": "gguf",
        "family": "gemma3",
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}
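Since size is reported in bytes, a short Python sketch can total disk usage across all local models (the sample payload here mirrors the response above):

```python
import json

def total_model_bytes(tags_json: str) -> int:
    """Sum the on-disk size of every model reported by /api/tags."""
    return sum(m["size"] for m in json.loads(tags_json)["models"])

# Sample shaped like the /api/tags response above
sample = '{"models": [{"name": "gemma3:4b", "size": 3338801804}]}'
print(f"{total_model_bytes(sample) / 1e9:.2f} GB")  # 3.34 GB
```

Pipe the real endpoint into it with `curl -s http://localhost:11434/api/tags`.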
POST /api/show
Returns detailed model metadata: parameters, template, license, and the full model_info block with architecture details.
curl -s http://localhost:11434/api/show -d '{"model": "gemma3:4b"}' | python3 -m json.tool
POST /api/embed
Generates vector embeddings for text input. Requires a model that supports embeddings (such as nomic-embed-text or all-minilm). General chat models like gemma3 or llama3 do not support this endpoint.
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama runs large language models locally"
}'
The response contains an embeddings array of float vectors. These can be stored in a vector database like PostgreSQL with pgvector for similarity search and RAG applications.
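For similarity search you typically compare those vectors with cosine similarity. A dependency-free Python sketch (the three-element vectors are toy values; real nomic-embed-text vectors have 768 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vectors standing in for two /api/embed results
v1 = [0.1, 0.8, 0.3]
v2 = [0.1, 0.7, 0.4]
print(round(cosine_similarity(v1, v2), 4))  # 0.9873
```

In practice you would store the vectors in pgvector and let the database compute this with its `<=>` distance operator.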
REST API: OpenAI-Compatible Endpoints
Ollama exposes OpenAI-compatible endpoints at /v1/. This lets you point any OpenAI SDK or tool at your local Ollama instance by changing the base URL.
GET /v1/models
Lists available models in the OpenAI format.
curl -s http://localhost:11434/v1/models | python3 -m json.tool
Response follows the OpenAI schema:
{
  "object": "list",
  "data": [
    {
      "id": "gemma3:4b",
      "object": "model",
      "created": 1774398442,
      "owned_by": "library"
    }
  ]
}
POST /v1/chat/completions
Drop-in replacement for the OpenAI Chat Completions API. Works with the official openai Python SDK by setting base_url="http://localhost:11434/v1".
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "What port does SSH use?"}
    ]
  }'
The response matches the OpenAI format with choices and usage fields:
{
  "id": "chatcmpl-124",
  "object": "chat.completion",
  "created": 1774398701,
  "model": "gemma3:4b",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "SSH typically uses port 22 for secure remote access."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 13,
    "total_tokens": 31
  }
}
REST API: Model Management
These endpoints let you manage models programmatically, which is useful for automation scripts and deployment pipelines.
POST /api/pull
Pull a model via the API. Equivalent to ollama pull from the CLI.
curl -s http://localhost:11434/api/pull -d '{
  "model": "gemma3:4b",
  "stream": false
}'
Returns {"status": "success"} when complete.
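With the default "stream": true, the endpoint instead emits one JSON object per line as the download progresses. A Python sketch for rendering that stream into progress lines (the layer digest and byte counts are made up, and the total/completed field names reflect a typical pull stream; verify against your own output):

```python
import json

def progress(line: str) -> str:
    """Render one /api/pull NDJSON event as a human-readable line."""
    event = json.loads(line)
    if "total" in event and "completed" in event:
        pct = 100 * event["completed"] / event["total"]
        return f'{event["status"]}: {pct:.0f}%'
    return event["status"]

# Illustrative events shaped like a pull stream
stream = [
    '{"status": "pulling manifest"}',
    '{"status": "pulling a2af6cc3eb7f", "total": 3338801804, "completed": 1669400902}',
    '{"status": "success"}',
]
for line in stream:
    print(progress(line))
```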
POST /api/copy
Copy a model to a new name. Returns HTTP 200 with an empty body on success.
curl -s -X POST http://localhost:11434/api/copy -d '{
  "source": "gemma3:4b",
  "destination": "gemma3-backup"
}'
DELETE /api/delete
Delete a model. Returns HTTP 200 on success.
curl -s -X DELETE http://localhost:11434/api/delete -d '{
  "model": "gemma3-backup"
}'
Modelfile Reference
A Modelfile defines a custom model: its base weights, system prompt, parameters, and template. Create one to build reusable model configurations.
Example Modelfile
Save this as Modelfile in your working directory:
FROM gemma3:4b
SYSTEM "You are a Linux systems administrator. Answer questions about Linux concisely."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
Build the custom model from the Modelfile:
ollama create sysadmin-bot -f Modelfile
The new model appears in your local registry:
NAME                   ID              SIZE      MODIFIED
sysadmin-bot:latest    ab1cd106e0ee    3.3 GB    Less than a second ago
gemma3:4b              a2af6cc3eb7f    3.3 GB    4 minutes ago
Now run it like any other model:
echo "How do I check disk space?" | ollama run sysadmin-bot
The system prompt kicks in and you get a focused response:
Use the `df -h` command. It displays disk space in a human-readable format (KB, MB, GB).
Modelfile directives
| Directive | Purpose | Example |
|---|---|---|
| FROM | Base model (required) | FROM llama3.3 |
| SYSTEM | System prompt | SYSTEM "You are a helpful assistant" |
| PARAMETER | Set model parameters | PARAMETER temperature 0.7 |
| TEMPLATE | Go template for prompt format | TEMPLATE "{{ .Prompt }}" |
| ADAPTER | Path to LoRA/QLoRA adapter | ADAPTER ./lora.gguf |
| MESSAGE | Seed conversation history | MESSAGE user "Hi" |
| LICENSE | License text for the model | LICENSE "MIT" |
Common PARAMETER values
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Controls randomness. Lower values make output more deterministic |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 40 | Limits token selection to top K candidates |
| num_ctx | 2048 | Context window size in tokens |
| repeat_penalty | 1.1 | Penalizes repeated tokens |
| seed | 0 | Random seed for reproducible output (0 = random) |
| stop | model-specific | Stop sequences that end generation |
| num_predict | -1 | Maximum tokens to generate (-1 = unlimited) |
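You do not need a Modelfile just to change these values: /api/generate and /api/chat accept the same parameters per request in an options object. A Python sketch that builds such a payload (the prompt text and values are illustrative):

```python
import json

# Per-request overrides of the PARAMETER defaults above, passed through
# the "options" object of /api/generate or /api/chat
payload = {
    "model": "gemma3:4b",
    "prompt": "Explain inodes in one sentence.",
    "stream": False,
    "options": {
        "temperature": 0.2,  # more deterministic than the 0.8 default
        "num_ctx": 8192,     # larger context window than the 2048 default
        "seed": 42,          # fixed seed for reproducible output
    },
}
print(json.dumps(payload, indent=2))
```

Pipe the printed JSON straight into the API with `python3 payload.py | curl -s http://localhost:11434/api/generate -d @-`.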
Environment Variables Reference
These environment variables configure the Ollama server. Set them in your systemd override file or export them before running ollama serve.
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address and port |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory |
| OLLAMA_KEEP_ALIVE | 5m | How long models stay loaded after last request |
| OLLAMA_NUM_PARALLEL | 1 | Maximum concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | 0 (auto) | Maximum models loaded simultaneously |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests before rejecting |
| OLLAMA_ORIGINS | localhost variants | Allowed CORS origins (comma-separated) |
| OLLAMA_DEBUG | false | Enable debug logging (1=debug, 2=trace) |
| OLLAMA_FLASH_ATTENTION | false | Enable flash attention for faster inference |
| OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization type (f16, q8_0, q4_0) |
| OLLAMA_CONTEXT_LENGTH | 0 (auto) | Default context length for all models |
| OLLAMA_GPU_OVERHEAD | 0 | Reserved VRAM per GPU in bytes |
| OLLAMA_LOAD_TIMEOUT | 5m | Maximum time to wait for model loading |
| OLLAMA_NOPRUNE | false | Skip pruning unused model blobs on startup |
| OLLAMA_SCHED_SPREAD | false | Spread model layers across all GPUs |
| OLLAMA_MULTIUSER_CACHE | false | Optimize prompt caching for multiple users |
| OLLAMA_NOHISTORY | false | Disable readline history in interactive mode |
| OLLAMA_TMPDIR | system default | Temporary directory for downloads and extraction |
Useful One-Liners
Practical command combinations for scripting and daily use.
Check model disk usage
du -sh /usr/share/ollama/.ollama/models/
On a system with one 4B model:
3.2G /usr/share/ollama/.ollama/models/
Pipe a file into a prompt
Summarize a log file, config, or any text file:
cat /var/log/dnf.log | ollama run gemma3:4b "Summarize what packages were installed"
Batch inference from a file
Process multiple prompts from a text file, one per line:
while IFS= read -r prompt; do
  echo "=== $prompt ==="
  echo "$prompt" | ollama run gemma3:4b
  echo ""
done < prompts.txt
Quick benchmark: tokens per second
Run a standardized prompt with --verbose to measure generation speed:
ollama run gemma3:4b --verbose "Write a 100-word essay about Linux." 2>&1 | grep "eval rate"
Check if Ollama API is reachable
curl -s http://localhost:11434/
The server's root endpoint responds with "Ollama is running" when it is up, so this one command doubles as a health check. In scripts, test the exit status instead: curl -sf http://localhost:11434/ >/dev/null fails cleanly when the server is down.
Pull multiple models in sequence
for model in gemma3:4b llama3.2:3b qwen2.5:7b; do
echo "Pulling $model..."
ollama pull "$model"
done
Export model list as JSON
Useful for backup scripts or inventory tracking:
curl -s http://localhost:11434/api/tags | python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data['models']:
    print(f\"{m['name']:30s} {m['details']['parameter_size']:>8s} {m['details']['quantization_level']}\")"
Sample output:
gemma3:4b                          4.3B Q4_K_M
Use Ollama with the OpenAI Python SDK
Point the official openai package at your local Ollama instance. No API key required.
pip install openai
Then in Python:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Quick Command Reference Table
All CLI commands at a glance, sorted by category.
| Command | What it does |
|---|---|
| ollama pull MODEL | Download a model from the registry |
| ollama list | Show all local models |
| ollama ps | Show models loaded in memory |
| ollama show MODEL | Display model details (params, template, license) |
| ollama run MODEL | Start interactive chat or run one-shot prompt |
| ollama run MODEL --verbose | Run with timing stats |
| ollama run MODEL --format json | Force JSON output |
| ollama stop MODEL | Unload model from memory |
| ollama cp SRC DST | Copy a model to a new name |
| ollama rm MODEL | Delete a model from disk |
| ollama create NAME -f FILE | Build custom model from Modelfile |
| ollama serve | Start the server manually |
| ollama --version | Print Ollama version |