Every time I need to check a model’s parameter count or hit the embedding endpoint, I end up scrolling through docs. This cheat sheet is the reference I keep coming back to. It covers every Ollama CLI command and REST API endpoint with tested examples you can copy and run.
Ollama 0.23.x ships with a full CLI for model management, an interactive chat REPL, a REST API with OpenAI-compatible endpoints, structured outputs, native tool calling on capable models, official Python and JavaScript SDKs, and a launch command that hands a model off to coding agents like Claude Code, OpenCode, Codex, Cline, Copilot CLI, Droid, and Hermes. Whether you are scripting deployments, building against the API, or just need to remember the right curl, this page has every command tested and ready to copy. If you need installation instructions first, see Install Ollama on Rocky Linux 10 / Ubuntu 24.04. To pick which model to run after install, see the Ollama models cheat sheet.
Refreshed May 2026 on Ubuntu 22.04 with Ollama 0.23.1 against an NVIDIA RTX 4090 (CUDA 12.6, driver 560.35.03). Every command, API response, and SDK snippet was executed live. Original test rig: Rocky Linux 10.1 with Ollama 0.18.2 on gemma3:4b.
Model Management
These commands handle downloading, inspecting, copying, and removing models from the local registry.
Pull a model
Downloads a model from the Ollama library. Tag defaults to latest if omitted.
ollama pull gemma3:4b
List local models
Shows all models stored on disk with their size and last modified time.
ollama list
The output includes the model ID, size, and when it was last updated:
NAME ID SIZE MODIFIED
gemma3:4b a2af6cc3eb7f 3.3 GB 42 seconds ago
List running models
Shows models currently loaded in memory, including processor allocation and context size.
ollama ps
On a CPU-only system, all processing runs on the CPU:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma3:4b a2af6cc3eb7f 4.3 GB 100% CPU 4096 4 minutes from now
Show model details
Displays architecture, parameter count, quantization, context length, and license info.
ollama show gemma3:4b
Output confirms the model architecture and capabilities:
Model
architecture gemma3
parameters 4.3B
context length 131072
embedding length 2560
quantization Q4_K_M
Capabilities
completion
vision
Parameters
stop "<end_of_turn>"
temperature 1
top_k 64
top_p 0.95
You can also extract specific sections with flags:
ollama show gemma3:4b --modelfile
ollama show gemma3:4b --parameters
ollama show gemma3:4b --template
Copy a model
Creates a new reference to an existing model. Useful for creating custom variants without re-downloading weights.
ollama cp gemma3:4b mymodel:latest
Both names now point to the same model ID:
copied 'gemma3:4b' to 'mymodel:latest'
Remove a model
Deletes a model from disk. The blob files are only removed when no other model references them.
ollama rm mymodel:latest
Confirmation:
deleted 'mymodel:latest'
Stop a running model
Unloads a model from memory without stopping the Ollama server.
ollama stop gemma3:4b
Running Models
The ollama run command handles both interactive chat sessions and one-shot prompts from the command line.
Interactive chat
Opens an interactive session. Type /bye to exit.
ollama run gemma3:4b
One-shot prompt
Pass the prompt as a second argument. Ollama prints the response and exits.
ollama run gemma3:4b "What port does PostgreSQL use?"
Piped input
Pipe text from another command or file. Great for scripting.
echo "What is the capital of France?" | ollama run gemma3:4b
Response:
The capital of France is Paris.
JSON output
Force the model to respond with valid JSON using --format json. Include “JSON” in your prompt so the model understands what structure you want.
echo "List 3 Linux distributions. Return JSON array with name and year fields." | ollama run gemma3:4b --format json
The model returns structured JSON:
{
"distributions": [
{"name": "Ubuntu", "year": 2004},
{"name": "Debian", "year": 1993},
{"name": "Fedora", "year": 2003}
]
}
Verbose output with timing stats
Add --verbose to see token generation speed and load times. Useful for benchmarking.
ollama run gemma3:4b --verbose "What is 2+2? Reply with just the number."
After the response, timing stats are printed:
4
total duration: 1.681255883s
load duration: 495.619141ms
prompt eval count: 22 token(s)
prompt eval duration: 956.787989ms
prompt eval rate: 22.99 tokens/s
eval count: 3 token(s)
eval duration: 224.407368ms
eval rate: 13.37 tokens/s
Keep model loaded
By default, models unload after 5 minutes of inactivity. Override this with --keepalive.
ollama run gemma3:4b --keepalive 30m "Hello"
Set to -1 to keep a model loaded indefinitely, or 0 to unload immediately after the request.
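The same setting is exposed as a keep_alive field on the generate and chat endpoints. A minimal sketch with Python's requests (requests is just my choice of HTTP client, not something the API mandates):

import requests

# keep_alive takes the same values as --keepalive: a duration string ("30m"),
# -1 to keep the model loaded indefinitely, or 0 to unload immediately
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Hello",
    "stream": False,
    "keep_alive": "30m",
})
print(r.json()["response"])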
Thinking mode (DeepSeek R1, Qwen3, gpt-oss)
Reasoning models emit a <think>...</think> chain-of-thought block before the final answer. The --think flag explicitly turns this on (some models accept high, medium, low levels), and --hidethinking suppresses the trace from the visible output while still allowing the model to reason internally.
# Force thinking on
echo "What is 13 * 17?" | ollama run deepseek-r1:7b --think true
# Show the answer only, hide the chain of thought
echo "What is 13 * 17?" | ollama run deepseek-r1:7b --hidethinking
Real timing on RTX 4090: --think true on deepseek-r1:7b answered “13 × 17 = 221” in 5.12 s total with 187 generated tokens at 182.69 tok/s. The same prompt with --hidethinking ran 267 tokens at 181.57 tok/s; the answer was identical, the visible output cleaner.
Truncate embedding dimensions
Embedding models like nomic-embed-text output 768-dim vectors by default. Pass --dimensions to truncate (Matryoshka-style) for cheaper storage in your vector DB.
echo "the quick brown fox" | ollama run nomic-embed-text --dimensions 256
Experimental agent loop and web search
Three experimental flags activate the in-CLI agent loop with optional web search and an auto-approve mode for tool calls. Treat them as preview only; behavior changes between releases.
ollama run llama3.1:8b --experimental "What is the latest Linux kernel version?"
ollama run llama3.1:8b --experimental --experimental-websearch "..."
ollama run llama3.1:8b --experimental --experimental-yolo "..." # skip tool approvals
Server Management
Ollama runs as a systemd service on Linux. These commands cover starting, stopping, and configuring the server.
Start the server manually
If you need to run Ollama outside of systemd (for debugging, for example):
ollama serve
Systemd service commands
The standard install creates an ollama.service unit. Manage it with systemctl:
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama
Verify the service is active:
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
Active: active (running) since Wed 2026-03-25 03:25:07 EAT; 2s ago
Main PID: 5181 (ollama)
Tasks: 9 (limit: 100206)
Memory: 11.7M (peak: 24.7M)
CPU: 49ms
CGroup: /system.slice/ollama.service
└─5181 /usr/local/bin/ollama serve
View logs
Check the journal for errors, request logs, and model load events:
sudo journalctl -u ollama -f
Sample log entries showing API requests:
Mar 25 03:32:37 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:37 | 200 | 3.808679349s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:32:54 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:54 | 200 | 11.318770126s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:33:01 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:33:01 | 200 | 539.182µs | 127.0.0.1 | POST "/api/copy"
Set environment variables via systemd override
To configure Ollama to listen on all interfaces or change the model storage path, create a systemd override file:
sudo systemctl edit ollama
Add the environment variables you need between the comments:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_FLASH_ATTENTION=1"
Reload and restart for changes to take effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Sign in, sign out, and launch integrations
Three commands tie the local CLI to ollama.com and to coding agents. signin authenticates against your Ollama account (required to push a model or use cloud features), signout revokes the local credential, and launch opens the desktop menu on macOS or Windows or hands a model directly to a registered integration.
ollama signin # interactive browser flow
ollama signout
ollama launch # open the desktop menu (macOS / Windows)
ollama launch claude # hand off to Claude Code
ollama launch opencode # OpenCode
ollama launch codex # Codex
ollama launch copilot # Copilot CLI
Integrations recognized by Ollama 0.23.1: claude, claude-desktop, cline, codex, copilot, droid, hermes, kimi, opencode. They each launch with a sane default model unless you pass a specific one.
REST API: Generate and Chat
The Ollama API listens on port 11434 by default. All endpoints accept JSON. Set "stream": false to get the complete response in a single JSON object instead of a stream of tokens.
POST /api/generate
Single-turn text generation. Takes a prompt string and returns a completion.
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Why is the sky blue? Answer in one sentence.",
"stream": false
}'
Response JSON (trimmed for readability):
{
"model": "gemma3:4b",
"created_at": "2026-03-25T00:29:37.830168406Z",
"response": "The sky appears blue due to a phenomenon called Rayleigh scattering, where shorter wavelengths of sunlight (like blue) are scattered more by the Earth's atmosphere than longer wavelengths.",
"done": true,
"done_reason": "stop",
"total_duration": 4509574648,
"load_duration": 498389364,
"prompt_eval_count": 20,
"eval_count": 35
}
All duration fields (total_duration, load_duration, eval_duration) are reported in nanoseconds, while prompt_eval_count and eval_count are token counts. To calculate tokens per second, divide eval_count by eval_duration after converting it to seconds.
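A quick sketch of that arithmetic in Python, run against a live response (the field names come straight from the response above):

import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue? Answer in one sentence.",
    "stream": False,
}).json()

# duration fields are nanoseconds; convert to seconds before dividing
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.2f} tokens/s")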
POST /api/chat
Multi-turn conversation endpoint. Pass a messages array with role/content pairs, just like the OpenAI Chat API format.
curl -s http://localhost:11434/api/chat -d '{
"model": "gemma3:4b",
"messages": [
{"role": "system", "content": "You are a Linux sysadmin. Be brief."},
{"role": "user", "content": "What is DNS?"}
],
"stream": false
}'
The response wraps the assistant message in a message object:
{
"model": "gemma3:4b",
"created_at": "2026-03-25T00:29:50.789139412Z",
"message": {
"role": "assistant",
"content": "DNS (Domain Name System) is a hierarchical and distributed naming system that translates human-readable domain names (like google.com) into the numerical IP addresses computers use to communicate with each other."
},
"done": true,
"done_reason": "stop",
"total_duration": 5747578407,
"load_duration": 268908121,
"prompt_eval_count": 16,
"eval_count": 40
}
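The endpoint is stateless: to hold a real conversation you append each assistant reply to the messages array and resend the whole history on every turn. A minimal sketch:

import requests

URL = "http://localhost:11434/api/chat"
messages = [{"role": "system", "content": "You are a Linux sysadmin. Be brief."}]

for user_turn in ["What is DNS?", "What port does it use?"]:
    messages.append({"role": "user", "content": user_turn})
    reply = requests.post(URL, json={
        "model": "gemma3:4b",
        "messages": messages,
        "stream": False,
    }).json()["message"]
    messages.append(reply)  # carry the assistant turn into the next request
    print(reply["content"])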
POST /api/generate with JSON format
Force structured JSON output by setting "format": "json" in the request body. The model will only return valid JSON.
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
"format": "json",
"stream": false
}'
The response field contains valid JSON that you can parse directly:
{
"distributions": [
{"name": "Ubuntu", "year": 2004},
{"name": "Debian", "year": 1993},
{"name": "Fedora", "year": 2003}
]
}
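Note the structured data arrives as a string inside the response field, so you parse twice: once for the API envelope, once for the model's payload. A short sketch:

import json
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
    "format": "json",
    "stream": False,
})
payload = json.loads(r.json()["response"])  # envelope first, then model output
print(payload)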
GET /api/tags
Lists all models available locally. Same data as ollama list but in JSON.
curl -s http://localhost:11434/api/tags | python3 -m json.tool
Each model entry includes its digest, size, parameter count, and quantization level:
{
"models": [
{
"name": "gemma3:4b",
"model": "gemma3:4b",
"modified_at": "2026-03-25T03:27:22.913735934+03:00",
"size": 3338801804,
"digest": "a2af6cc3eb7fa8be...",
"details": {
"format": "gguf",
"family": "gemma3",
"parameter_size": "4.3B",
"quantization_level": "Q4_K_M"
}
}
]
}
POST /api/show
Returns detailed model metadata: parameters, template, license, and the full model_info block with architecture details.
curl -s http://localhost:11434/api/show -d '{"model": "gemma3:4b"}' | python3 -m json.tool
POST /api/embed
Generates vector embeddings for text input. Requires a model that supports embeddings (such as nomic-embed-text or all-minilm). General chat models like gemma3 or llama3 do not support this endpoint.
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama runs large language models locally"
}'
The response contains an embeddings array of float vectors. These can be stored in a vector database like PostgreSQL with pgvector for similarity search and RAG applications.
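To show how the vectors get used before they ever reach a database, here is a plain-Python cosine-similarity sketch over the raw response (the input sentences are made up for the example):

import math
import requests

vecs = requests.post("http://localhost:11434/api/embed", json={
    "model": "nomic-embed-text",
    "input": [
        "Ollama runs large language models locally",
        "Local LLM inference with Ollama",
        "How to bake sourdough bread",
    ],
}).json()["embeddings"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vecs[0], vecs[1]))  # related sentences score high
print(cosine(vecs[0], vecs[2]))  # the baking sentence scores noticeably lower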
GET /api/version
Returns the running Ollama version. Useful for health checks and version-gated feature detection.
$ curl -s http://localhost:11434/api/version
{"version":"0.23.1"}
GET /api/ps
Returns the list of currently loaded models. Same data as ollama ps, in JSON.
$ curl -s http://localhost:11434/api/ps | jq
{
"models": [
{
"name": "llama3.1:8b",
"model": "llama3.1:8b",
"size": 11240960000,
"size_vram": 11240960000,
"expires_at": "2026-05-06T12:30:00Z"
}
]
}
POST /api/chat with tool calling
Tool-capable models (Llama 3.1+, Mistral Small, Qwen 2.5+, Qwen 3) accept a tools array describing functions they can call. The response either answers directly or returns a tool_calls array with the chosen function and parsed arguments. You execute the function on your side and post the result back as a tool role message to continue the conversation.
curl -s http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [{"role":"user","content":"What is the weather in Nairobi?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}]
}' | jq '.message.tool_calls'
Real response from Llama 3.1 8B on the test rig:
[
{
"id": "call_ww20fksl",
"function": {
"index": 0,
"name": "get_weather",
"arguments": {"city": "Nairobi"}
}
}
]
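The full round trip sketched in Python: run your own function for each tool call, append the result as a tool message, and call the endpoint again. The get_weather body here is a hypothetical stand-in you would replace with a real lookup.

import json
import requests

URL = "http://localhost:11434/api/chat"
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    # stand-in: a real implementation would query a weather service
    return {"city": city, "temp_c": 24, "condition": "sunny"}

messages = [{"role": "user", "content": "What is the weather in Nairobi?"}]
msg = requests.post(URL, json={
    "model": "llama3.1:8b", "stream": False,
    "messages": messages, "tools": TOOLS,
}).json()["message"]
messages.append(msg)

for call in msg.get("tool_calls", []):
    result = get_weather(**call["function"]["arguments"])
    messages.append({"role": "tool", "content": json.dumps(result)})

# second request: the model now sees the tool result and answers in prose
final = requests.post(URL, json={
    "model": "llama3.1:8b", "stream": False,
    "messages": messages, "tools": TOOLS,
}).json()
print(final["message"]["content"])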
POST /api/chat with structured outputs
Pass a JSON Schema in the format field and the response is constrained to match it. This is stricter than the older "format":"json" shorthand because the schema enforces field names, types, and required fields.
curl -s http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [{"role":"user","content":"Pick a city and report the weather. Output JSON only."}],
"format": {
"type": "object",
"properties": {
"city": {"type": "string"},
"temp_c": {"type": "number"},
"condition": {"type": "string"}
},
"required": ["city","temp_c","condition"]
}
}' | jq -r .message.content
Real output:
{
"city": "San Francisco",
"temp_c": 17,
"condition": "Mostly Cloudy"
}
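The Python SDK accepts the same schema through its format argument, and pairing it with pydantic lets you generate the schema and validate the reply in one step. A sketch, assuming pydantic v2 is installed:

import ollama
from pydantic import BaseModel

class Weather(BaseModel):
    city: str
    temp_c: float
    condition: str

res = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Pick a city and report the weather. Output JSON only."}],
    format=Weather.model_json_schema(),  # same schema as the curl example
)
weather = Weather.model_validate_json(res["message"]["content"])
print(weather.city, weather.temp_c, weather.condition)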
REST API: OpenAI-Compatible Endpoints
Ollama exposes OpenAI-compatible endpoints at /v1/. This lets you point any OpenAI SDK or tool at your local Ollama instance by changing the base URL.
GET /v1/models
Lists available models in the OpenAI format.
curl -s http://localhost:11434/v1/models | python3 -m json.tool
Response follows the OpenAI schema:
{
"object": "list",
"data": [
{
"id": "gemma3:4b",
"object": "model",
"created": 1774398442,
"owned_by": "library"
}
]
}
POST /v1/chat/completions
Drop-in replacement for the OpenAI Chat Completions API. Works with the official openai Python SDK by setting base_url="http://localhost:11434/v1".
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3:4b",
"messages": [
{"role": "user", "content": "What port does SSH use?"}
]
}'
The response matches the OpenAI format with choices and usage fields:
{
"id": "chatcmpl-124",
"object": "chat.completion",
"created": 1774398701,
"model": "gemma3:4b",
"system_fingerprint": "fp_ollama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "SSH typically uses port 22 for secure remote access."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 18,
"completion_tokens": 13,
"total_tokens": 31
}
}
REST API: Model Management
These endpoints let you manage models programmatically, which is useful for automation scripts and deployment pipelines.
POST /api/pull
Pull a model via the API. Equivalent to ollama pull from the CLI.
curl -s http://localhost:11434/api/pull -d '{
"model": "gemma3:4b",
"stream": false
}'
Returns {"status": "success"} when complete.
POST /api/copy
Copy a model to a new name. Returns HTTP 200 with an empty body on success.
curl -s -X POST http://localhost:11434/api/copy -d '{
"source": "gemma3:4b",
"destination": "gemma3-backup"
}'
DELETE /api/delete
Delete a model. Returns HTTP 200 on success.
curl -s -X DELETE http://localhost:11434/api/delete -d '{
"model": "gemma3-backup"
}'
POST /api/create
Build a custom model directly from an inline body, no Modelfile on disk required. The server returns a stream of status events. Use this when your control plane wants to provision custom models without shipping files around.
curl -X POST http://localhost:11434/api/create -d '{
"model": "json-bot",
"from": "qwen2.5:7b",
"system": "You answer in JSON only. No prose.",
"parameters": {"temperature": 0.0, "num_ctx": 8192}
}'
POST /api/push
Upload a local model to ollama.com. Requires ollama signin first. Streams progress events as JSON. The model name must use your namespace, for example youruser/coder-strict.
curl -X POST http://localhost:11434/api/push -d '{
"model": "youruser/coder-strict",
"stream": true
}'
Modelfile Reference
A Modelfile defines a custom model: its base weights, system prompt, parameters, and template. Create one to build reusable model configurations.
Example Modelfile
Save this as Modelfile in your working directory:
FROM gemma3:4b
SYSTEM "You are a Linux systems administrator. Answer questions about Linux concisely."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
Build the custom model from the Modelfile:
ollama create sysadmin-bot -f Modelfile
# Quantize while creating, in one step (FP16 or higher source recommended)
ollama create sysadmin-bot-q5 -f Modelfile -q q5_K_M
The -q / --quantize flag added in Ollama 0.5+ runs the quantization at build time and accepts the standard k-quant levels: q4_K_M (default), q5_K_M, q6_K, q8_0, plus the legacy q4_0, q5_0, q4_1, q5_1. Use it when importing an FP16 GGUF and you want a smaller deployment artifact without running llama.cpp separately.
The new model appears in your local registry:
NAME ID SIZE MODIFIED
sysadmin-bot:latest ab1cd106e0ee 3.3 GB Less than a second ago
gemma3:4b a2af6cc3eb7f 3.3 GB 4 minutes ago
Now run it like any other model:
echo "How do I check disk space?" | ollama run sysadmin-bot
The system prompt kicks in and you get a focused response:
Use the `df -h` command. It displays disk space in a human-readable format (KB, MB, GB).
Modelfile directives
| Directive | Purpose | Example |
|---|---|---|
| FROM | Base model (required) | FROM llama3.3 |
| SYSTEM | System prompt | SYSTEM "You are a helpful assistant" |
| PARAMETER | Set model parameters | PARAMETER temperature 0.7 |
| TEMPLATE | Go template for prompt format | TEMPLATE "{{ .Prompt }}" |
| ADAPTER | Path to LoRA/QLoRA adapter | ADAPTER ./lora.gguf |
| MESSAGE | Seed conversation history | MESSAGE user "Hi" |
| LICENSE | License text for the model | LICENSE "MIT" |
Common PARAMETER values
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Controls randomness. Lower values make output more deterministic |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 40 | Limits token selection to top K candidates |
| num_ctx | 2048 | Context window size in tokens |
| repeat_penalty | 1.1 | Penalizes repeated tokens |
| seed | 0 | Random seed for reproducible output (0 = random) |
| stop | model-specific | Stop sequences that end generation |
| num_predict | -1 | Maximum tokens to generate (-1 = unlimited) |
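The same names work per request, without a Modelfile, via the options object on /api/generate and /api/chat; request-level values override whatever the Modelfile set. A sketch:

import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Name one Linux distribution.",
    "stream": False,
    "options": {
        "temperature": 0.2,  # same names as the PARAMETER table above
        "num_ctx": 8192,
        "seed": 42,          # fixed seed for reproducible output
    },
})
print(r.json()["response"])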
Environment Variables Reference
These environment variables configure the Ollama server. Set them in your systemd override file or export them before running ollama serve.
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address and port |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory |
| OLLAMA_KEEP_ALIVE | 5m | How long models stay loaded after last request |
| OLLAMA_NUM_PARALLEL | auto | Maximum concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | auto (per GPU) | Maximum models loaded simultaneously |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests before rejecting (HTTP 503 after) |
| OLLAMA_ORIGINS | localhost variants | Allowed CORS origins (comma-separated) |
| OLLAMA_DEBUG | false | Enable debug logging (1=debug, 2=trace) |
| OLLAMA_FLASH_ATTENTION | false | Enable flash attention 2 (faster on Ampere+ GPUs) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization (f16, q8_0, q4_0). Halves or quarters KV memory |
| OLLAMA_CONTEXT_LENGTH | auto (4k/32k/256k by VRAM) | Default num_ctx when the request does not set one |
| OLLAMA_GPU_OVERHEAD | 0 | Reserved VRAM per GPU in bytes |
| OLLAMA_LOAD_TIMEOUT | 5m | How long to allow a model load to stall before giving up |
| OLLAMA_NOPRUNE | false | Skip pruning unused model blobs on startup |
| OLLAMA_SCHED_SPREAD | false | Always shard a model across all GPUs (vs. preferring single-GPU placement) |
| OLLAMA_LLM_LIBRARY | auto | Force a specific backend (cuda_v12, cpu_avx2, rocm) bypassing autodetection. Added in 0.20+ |
| OLLAMA_NO_CLOUD | false | Disable cloud features (remote inference and web search). Added in 0.21+ |
| OLLAMA_MULTIUSER_CACHE | false | Optimize prompt caching for multiple users (legacy) |
| OLLAMA_NOHISTORY | false | Disable readline history in interactive mode |
| OLLAMA_EDITOR | (none) | Editor invoked by Ctrl+G in the REPL |
| OLLAMA_TMPDIR | system default | Temporary directory for downloads and extraction (legacy, may be removed) |
Python SDK
The official Python client (ollama/ollama-python) wraps every endpoint above with typed responses. Install with pip install ollama. The five most common patterns are below; all snippets ran live against the test rig.
import ollama
# 1. Generate (single-turn)
res = ollama.generate(model="qwen2.5:0.5b", prompt="Say hi.")
print(res["response"])
# Hello! How can I assist you today?
# 2. Chat (multi-turn)
res = ollama.chat(
model="qwen2.5:0.5b",
messages=[{"role": "user", "content": "Reply with one word: yes"}],
)
print(res["message"]["content"], "eval_count:", res["eval_count"])
# Yes eval_count: 2
# 3. Streaming
for chunk in ollama.generate(model="qwen2.5:0.5b",
prompt="List 3 colors:", stream=True):
print(chunk["response"], end="", flush=True)
# 4. Embeddings
res = ollama.embed(model="nomic-embed-text", input=["hello", "world"])
print(len(res["embeddings"]), "vectors of dim",
len(res["embeddings"][0]))
# 2 vectors of dim 768
# 5. Tool calling (Llama 3.1+, Mistral Small, Qwen 2.5+)
res = ollama.chat(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Weather in Nairobi?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}],
)
for call in res["message"].get("tool_calls", []):
print(call["function"]["name"], call["function"]["arguments"])
# get_weather {'city': 'Nairobi'}
For high-throughput services use ollama.AsyncClient (asyncio). For a remote server pass host="http://gpu-host:11434" to Client(...). The Node equivalent lives at ollama/ollama-js; for any other language, point an OpenAI SDK at http://localhost:11434/v1.
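A minimal AsyncClient sketch for fan-out workloads, gathering several prompts against one server (the prompts are arbitrary examples):

import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()  # defaults to http://localhost:11434
    prompts = ["Say hi.", "Name a color.", "What is 2+2?"]
    replies = await asyncio.gather(
        *(client.generate(model="qwen2.5:0.5b", prompt=p) for p in prompts)
    )
    for prompt, reply in zip(prompts, replies):
        print(prompt, "->", reply["response"])

asyncio.run(main())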
Useful One-Liners
Practical command combinations for scripting and daily use.
Check model disk usage
du -sh /usr/share/ollama/.ollama/models/
On a system with one 4B model:
3.2G /usr/share/ollama/.ollama/models/
Pipe a file into a prompt
Summarize a log file, config, or any text file:
cat /var/log/dnf.log | ollama run gemma3:4b "Summarize what packages were installed"
Batch inference from a file
Process multiple prompts from a text file, one per line:
while IFS= read -r prompt; do
echo "=== $prompt ==="
echo "$prompt" | ollama run gemma3:4b
echo ""
done < prompts.txt
Quick benchmark: tokens per second
Run a standardized prompt with --verbose to measure generation speed:
ollama run gemma3:4b --verbose "Write a 100-word essay about Linux." 2>&1 | grep "eval rate"
Check if Ollama API is reachable
curl -sf http://localhost:11434/ -o /dev/null && echo "Ollama is running"
Prints "Ollama is running" only when the server answers. The server's root endpoint already returns that exact text, so discarding curl's body avoids printing it twice, and -f makes curl fail on HTTP errors so the exit code is reliable in scripts.
Pull multiple models in sequence
for model in gemma3:4b llama3.2:3b qwen2.5:7b; do
echo "Pulling $model..."
ollama pull "$model"
done
Export model list as JSON
Useful for backup scripts or inventory tracking:
curl -s http://localhost:11434/api/tags | python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data['models']:
print(f\"{m['name']:30s} {m['details']['parameter_size']:>8s} {m['details']['quantization_level']}\")"
Sample output:
gemma3:4b 4.3B Q4_K_M
Use Ollama with the OpenAI Python SDK
Point the official openai package at your local Ollama instance. No API key required.
pip install openai
Then in Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
model="gemma3:4b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Quick Command Reference Table
All CLI commands at a glance, sorted by category.
| Command | What it does |
|---|---|
| ollama pull MODEL | Download a model from the registry |
| ollama list (alias ls) | Show all local models |
| ollama ps | Show models loaded in memory |
| ollama show MODEL | Display model details (params, template, license) |
| ollama run MODEL | Start interactive chat or run one-shot prompt |
| ollama run MODEL --verbose | Run with timing stats |
| ollama run MODEL --format json | Force JSON output |
| ollama run MODEL --think true | Force thinking mode on (R1, Qwen 3, gpt-oss) |
| ollama run MODEL --hidethinking | Suppress chain-of-thought from visible output |
| ollama run MODEL --keepalive 1h | Keep the model loaded after the request |
| ollama run EMBED --dimensions N | Truncate output embeddings (Matryoshka) |
| ollama stop MODEL | Unload model from memory |
| ollama cp SRC DST | Copy a model to a new name |
| ollama rm MODEL [MODEL...] | Delete one or more models from disk |
| ollama create NAME -f FILE | Build a custom model from a Modelfile |
| ollama create NAME -f FILE -q q5_K_M | Build and quantize in one step |
| ollama push NAMESPACE/MODEL | Push a model to the registry (requires signin) |
| ollama signin / ollama signout | Authenticate with ollama.com |
| ollama launch [integration] | Open the desktop menu or hand off to a coding agent |
| ollama serve (alias start) | Start the server manually |
| ollama --version | Print Ollama version |