Every time I need to check a model’s parameter count or hit the embedding endpoint, I end up scrolling through docs. This cheat sheet is the reference I keep coming back to. It covers every Ollama CLI command and REST API endpoint with tested examples you can copy and run.
Ollama 0.23.x ships with a full CLI for model management, an interactive chat REPL, a REST API with OpenAI-compatible endpoints, structured outputs, native tool calling on capable models, official Python and JavaScript SDKs, and a launch command that hands a model off to coding agents like Claude Code, OpenCode, Codex, Cline, Copilot CLI, Droid, and Hermes. Whether you are scripting deployments, building against the API, or just need to remember the right curl, this page has every command tested and ready to copy. If you need installation instructions first, see Install Ollama on Rocky Linux 10 / Ubuntu 24.04. To pick which model to run after install, see the Ollama models cheat sheet.
Refreshed May 2026 on Ubuntu 22.04 with Ollama 0.23.1 against an NVIDIA RTX 4090 (CUDA 12.6, driver 560.35.03). Every command, API response, and SDK snippet was executed live. Original test rig: Rocky Linux 10.1 with Ollama 0.18.2 on gemma3:4b.
Model Management
These commands handle downloading, inspecting, copying, and removing models from the local registry.
Pull a model
Downloads a model from the Ollama library. Tag defaults to latest if omitted.
ollama pull gemma3:4b
List local models
Shows all models stored on disk with their size and last modified time.
ollama list
The output includes the model ID, size, and when it was last updated:
NAME ID SIZE MODIFIED
gemma3:4b a2af6cc3eb7f 3.3 GB 42 seconds ago
List running models
Shows models currently loaded in memory, including processor allocation and context size.
ollama ps
On a CPU-only system, all processing runs on the CPU:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma3:4b a2af6cc3eb7f 4.3 GB 100% CPU 4096 4 minutes from now
Show model details
Displays architecture, parameter count, quantization, context length, and license info.
ollama show gemma3:4b
Output confirms the model architecture and capabilities:
Model
architecture gemma3
parameters 4.3B
context length 131072
embedding length 2560
quantization Q4_K_M
Capabilities
completion
vision
Parameters
stop "<end_of_turn>"
temperature 1
top_k 64
top_p 0.95
You can also extract specific sections with flags:
ollama show gemma3:4b --modelfile
ollama show gemma3:4b --parameters
ollama show gemma3:4b --template
Copy a model
Creates a new reference to an existing model. Useful for creating custom variants without re-downloading weights.
ollama cp gemma3:4b mymodel:latest
Both names now point to the same model ID:
copied 'gemma3:4b' to 'mymodel:latest'
Remove a model
Deletes a model from disk. The blob files are only removed when no other model references them.
ollama rm mymodel:latest
Confirmation:
deleted 'mymodel:latest'
Stop a running model
Unloads a model from memory without stopping the Ollama server.
ollama stop gemma3:4b
Running Models
The ollama run command handles both interactive chat sessions and one-shot prompts from the command line.
Interactive chat
Opens an interactive session. Type /bye to exit.
ollama run gemma3:4b
One-shot prompt
Pass the prompt as a second argument. Ollama prints the response and exits.
ollama run gemma3:4b "What port does PostgreSQL use?"
Piped input
Pipe text from another command or file. Great for scripting.
echo "What is the capital of France?" | ollama run gemma3:4b
Response:
The capital of France is Paris.
JSON output
Force the model to respond with valid JSON using --format json. Include “JSON” in your prompt so the model understands what structure you want.
echo "List 3 Linux distributions. Return JSON array with name and year fields." | ollama run gemma3:4b --format json
The model returns structured JSON:
{
"distributions": [
{"name": "Ubuntu", "year": 2004},
{"name": "Debian", "year": 1993},
{"name": "Fedora", "year": 2003}
]
}
Verbose output with timing stats
Add --verbose to see token generation speed and load times. Useful for benchmarking.
ollama run gemma3:4b --verbose "What is 2+2? Reply with just the number."
After the response, timing stats are printed:
4
total duration: 1.681255883s
load duration: 495.619141ms
prompt eval count: 22 token(s)
prompt eval duration: 956.787989ms
prompt eval rate: 22.99 tokens/s
eval count: 3 token(s)
eval duration: 224.407368ms
eval rate: 13.37 tokens/s
Keep model loaded
By default, models unload after 5 minutes of inactivity. Override this with --keepalive.
ollama run gemma3:4b --keepalive 30m "Hello"
Set to -1 to keep a model loaded indefinitely, or 0 to unload immediately after the request.
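The same setting is exposed as a keep_alive field on the generate and chat endpoints. A minimal sketch with Python's requests (requests is just my choice of HTTP client, not something the API mandates):

import requests

# keep_alive takes the same values as --keepalive: a duration string ("30m"),
# -1 to keep the model loaded indefinitely, or 0 to unload immediately
r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Hello",
    "stream": False,
    "keep_alive": "30m",
})
print(r.json()["response"])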
Thinking mode (DeepSeek R1, Qwen3, gpt-oss)
Reasoning models emit a <think>...</think> chain-of-thought block before the final answer. The --think flag explicitly turns this on (some models accept high, medium, low levels), and --hidethinking suppresses the trace from the visible output while still allowing the model to reason internally.
# Force thinking on
echo "What is 13 * 17?" | ollama run deepseek-r1:7b --think true
# Show the answer only, hide the chain of thought
echo "What is 13 * 17?" | ollama run deepseek-r1:7b --hidethinking
Real timing on RTX 4090: --think true on deepseek-r1:7b answered “13 × 17 = 221” in 5.12 s total with 187 generated tokens at 182.69 tok/s. The same prompt with --hidethinking ran 267 tokens at 181.57 tok/s; the answer was identical, the visible output cleaner.
Truncate embedding dimensions
Embedding models like nomic-embed-text output 768-dim vectors by default. Pass --dimensions to truncate (Matryoshka-style) for cheaper storage in your vector DB.
echo "the quick brown fox" | ollama run nomic-embed-text --dimensions 256
Experimental agent loop and web search
Three experimental flags activate the in-CLI agent loop with optional web search and an auto-approve mode for tool calls. Treat them as preview only; behavior changes between releases.
ollama run llama3.1:8b --experimental "What is the latest Linux kernel version?"
ollama run llama3.1:8b --experimental --experimental-websearch "..."
ollama run llama3.1:8b --experimental --experimental-yolo "..." # skip tool approvals
Server Management
Ollama runs as a systemd service on Linux. These commands cover starting, stopping, and configuring the server.
Start the server manually
If you need to run Ollama outside of systemd (for debugging, for example):
ollama serve
Systemd service commands
The standard install creates an ollama.service unit. Manage it with systemctl:
sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama
Verify the service is active:
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
Active: active (running) since Wed 2026-03-25 03:25:07 EAT; 2s ago
Main PID: 5181 (ollama)
Tasks: 9 (limit: 100206)
Memory: 11.7M (peak: 24.7M)
CPU: 49ms
CGroup: /system.slice/ollama.service
└─5181 /usr/local/bin/ollama serve
View logs
Check the journal for errors, request logs, and model load events:
sudo journalctl -u ollama -f
Sample log entries showing API requests:
Mar 25 03:32:37 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:37 | 200 | 3.808679349s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:32:54 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:54 | 200 | 11.318770126s | 127.0.0.1 | POST "/api/generate"
Mar 25 03:33:01 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:33:01 | 200 | 539.182µs | 127.0.0.1 | POST "/api/copy"
Set environment variables via systemd override
To configure Ollama to listen on all interfaces or change the model storage path, create a systemd override file:
sudo systemctl edit ollama
Add the environment variables you need between the comments:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_FLASH_ATTENTION=1"
Reload and restart for changes to take effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Sign in, sign out, and launch integrations
Three commands tie the local CLI to ollama.com and to coding agents. signin authenticates against your Ollama account (required to push a model or use cloud features), signout revokes the local credential, and launch opens the desktop menu on macOS or Windows or hands a model directly to a registered integration.
ollama signin # interactive browser flow
ollama signout
ollama launch # open the desktop menu (macOS / Windows)
ollama launch claude # hand off to Claude Code
ollama launch opencode # OpenCode
ollama launch codex # Codex
ollama launch copilot # Copilot CLI
Integrations recognized by Ollama 0.23.1: claude, claude-desktop, cline, codex, copilot, droid, hermes, kimi, opencode. They each launch with a sane default model unless you pass a specific one.
REST API: Generate and Chat
The Ollama API listens on port 11434 by default. All endpoints accept JSON. Set "stream": false to get the complete response in a single JSON object instead of a stream of tokens.
POST /api/generate
Single-turn text generation. Takes a prompt string and returns a completion.
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "Why is the sky blue? Answer in one sentence.",
"stream": false
}'
Response JSON (trimmed for readability):
{
"model": "gemma3:4b",
"created_at": "2026-03-25T00:29:37.830168406Z",
"response": "The sky appears blue due to a phenomenon called Rayleigh scattering, where shorter wavelengths of sunlight (like blue) are scattered more by the Earth's atmosphere than longer wavelengths.",
"done": true,
"done_reason": "stop",
"total_duration": 4509574648,
"load_duration": 498389364,
"prompt_eval_count": 20,
"eval_count": 35
}
All duration fields (total_duration, load_duration, eval_duration) are reported in nanoseconds, while prompt_eval_count and eval_count are token counts. To calculate tokens per second, divide eval_count by eval_duration after converting it to seconds.
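A quick sketch of that arithmetic in Python, run against a live response (the field names come straight from the response above):

import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue? Answer in one sentence.",
    "stream": False,
}).json()

# duration fields are nanoseconds; convert to seconds before dividing
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.2f} tokens/s")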
POST /api/chat
Multi-turn conversation endpoint. Pass a messages array with role/content pairs, just like the OpenAI Chat API format.
curl -s http://localhost:11434/api/chat -d '{
"model": "gemma3:4b",
"messages": [
{"role": "system", "content": "You are a Linux sysadmin. Be brief."},
{"role": "user", "content": "What is DNS?"}
],
"stream": false
}'
The response wraps the assistant message in a message object:
{
"model": "gemma3:4b",
"created_at": "2026-03-25T00:29:50.789139412Z",
"message": {
"role": "assistant",
"content": "DNS (Domain Name System) is a hierarchical and distributed naming system that translates human-readable domain names (like google.com) into the numerical IP addresses computers use to communicate with each other."
},
"done": true,
"done_reason": "stop",
"total_duration": 5747578407,
"load_duration": 268908121,
"prompt_eval_count": 16,
"eval_count": 40
}
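The endpoint is stateless: to hold a real conversation you append each assistant reply to the messages array and resend the whole history on every turn. A minimal sketch:

import requests

URL = "http://localhost:11434/api/chat"
messages = [{"role": "system", "content": "You are a Linux sysadmin. Be brief."}]

for user_turn in ["What is DNS?", "What port does it use?"]:
    messages.append({"role": "user", "content": user_turn})
    reply = requests.post(URL, json={
        "model": "gemma3:4b",
        "messages": messages,
        "stream": False,
    }).json()["message"]
    messages.append(reply)  # carry the assistant turn into the next request
    print(reply["content"])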
POST /api/generate with JSON format
Force structured JSON output by setting "format": "json" in the request body. The model will only return valid JSON.
curl -s http://localhost:11434/api/generate -d '{
"model": "gemma3:4b",
"prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
"format": "json",
"stream": false
}'
The response field contains valid JSON that you can parse directly:
{
"distributions": [
{"name": "Ubuntu", "year": 2004},
{"name": "Debian", "year": 1993},
{"name": "Fedora", "year": 2003}
]
}
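Note the structured data arrives as a string inside the response field, so you parse twice: once for the API envelope, once for the model's payload. A short sketch:

import json
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
    "format": "json",
    "stream": False,
})
payload = json.loads(r.json()["response"])  # envelope first, then model output
print(payload)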
GET /api/tags
Lists all models available locally. Same data as ollama list but in JSON.
curl -s http://localhost:11434/api/tags | python3 -m json.tool
Each model entry includes its digest, size, parameter count, and quantization level:
{
"models": [
{
"name": "gemma3:4b",
"model": "gemma3:4b",
"modified_at": "2026-03-25T03:27:22.913735934+03:00",
"size": 3338801804,
"digest": "a2af6cc3eb7fa8be...",
"details": {
"format": "gguf",
"family": "gemma3",
"parameter_size": "4.3B",
"quantization_level": "Q4_K_M"
}
}
]
}
POST /api/show
Returns detailed model metadata: parameters, template, license, and the full model_info block with architecture details.
curl -s http://localhost:11434/api/show -d '{"model": "gemma3:4b"}' | python3 -m json.tool
POST /api/embed
Generates vector embeddings for text input. Requires a model that supports embeddings (such as nomic-embed-text or all-minilm). General chat models like gemma3 or llama3 do not support this endpoint.
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama runs large language models locally"
}'
The response contains an embeddings array of float vectors. These can be stored in a vector database like PostgreSQL with pgvector for similarity search and RAG applications.
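To show how the vectors get used before they ever reach a database, here is a plain-Python cosine-similarity sketch over the raw response (the input sentences are made up for the example):

import math
import requests

vecs = requests.post("http://localhost:11434/api/embed", json={
    "model": "nomic-embed-text",
    "input": [
        "Ollama runs large language models locally",
        "Local LLM inference with Ollama",
        "How to bake sourdough bread",
    ],
}).json()["embeddings"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

print(cosine(vecs[0], vecs[1]))  # related sentences score high
print(cosine(vecs[0], vecs[2]))  # the baking sentence scores noticeably lower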
GET /api/version
Returns the running Ollama version. Useful for health checks and version-gated feature detection.
$ curl -s http://localhost:11434/api/version
{"version":"0.23.1"}
GET /api/ps
Returns the list of currently loaded models. Same data as ollama ps, in JSON.
$ curl -s http://localhost:11434/api/ps | jq
{
"models": [
{
"name": "llama3.1:8b",
"model": "llama3.1:8b",
"size": 11240960000,
"size_vram": 11240960000,
"expires_at": "2026-05-06T12:30:00Z"
}
]
}
POST /api/chat with tool calling
Tool-capable models (Llama 3.1+, Mistral Small, Qwen 2.5+, Qwen 3) accept a tools array describing functions they can call. The response either answers directly or returns a tool_calls array with the chosen function and parsed arguments. You execute the function on your side and post the result back as a tool role message to continue the conversation.
curl -s http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [{"role":"user","content":"What is the weather in Nairobi?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}]
}' | jq '.message.tool_calls'
Real response from Llama 3.1 8B on the test rig:
[
{
"id": "call_ww20fksl",
"function": {
"index": 0,
"name": "get_weather",
"arguments": {"city": "Nairobi"}
}
}
]
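The full round trip sketched in Python: run your own function for each tool call, append the result as a tool message, and call the endpoint again. The get_weather body here is a hypothetical stand-in you would replace with a real lookup.

import json
import requests

URL = "http://localhost:11434/api/chat"
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    # stand-in: a real implementation would query a weather service
    return {"city": city, "temp_c": 24, "condition": "sunny"}

messages = [{"role": "user", "content": "What is the weather in Nairobi?"}]
msg = requests.post(URL, json={
    "model": "llama3.1:8b", "stream": False,
    "messages": messages, "tools": TOOLS,
}).json()["message"]
messages.append(msg)

for call in msg.get("tool_calls", []):
    result = get_weather(**call["function"]["arguments"])
    messages.append({"role": "tool", "content": json.dumps(result)})

# second request: the model now sees the tool result and answers in prose
final = requests.post(URL, json={
    "model": "llama3.1:8b", "stream": False,
    "messages": messages, "tools": TOOLS,
}).json()
print(final["message"]["content"])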
POST /api/chat with structured outputs
Pass a JSON Schema in the format field and the response is constrained to match it. This is stricter than the older "format":"json" shorthand because the schema enforces field names, types, and required fields.
curl -s http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"stream": false,
"messages": [{"role":"user","content":"Pick a city and report the weather. Output JSON only."}],
"format": {
"type": "object",
"properties": {
"city": {"type": "string"},
"temp_c": {"type": "number"},
"condition": {"type": "string"}
},
"required": ["city","temp_c","condition"]
}
}' | jq -r .message.content
Real output:
{
"city": "San Francisco",
"temp_c": 17,
"condition": "Mostly Cloudy"
}
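The Python SDK accepts the same schema through its format argument, and pairing it with pydantic lets you generate the schema and validate the reply in one step. A sketch, assuming pydantic v2 is installed:

import ollama
from pydantic import BaseModel

class Weather(BaseModel):
    city: str
    temp_c: float
    condition: str

res = ollama.chat(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Pick a city and report the weather. Output JSON only."}],
    format=Weather.model_json_schema(),  # same schema as the curl example
)
weather = Weather.model_validate_json(res["message"]["content"])
print(weather.city, weather.temp_c, weather.condition)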
REST API: OpenAI-Compatible Endpoints
Ollama exposes OpenAI-compatible endpoints at /v1/. This lets you point any OpenAI SDK or tool at your local Ollama instance by changing the base URL.
GET /v1/models
Lists available models in the OpenAI format.
curl -s http://localhost:11434/v1/models | python3 -m json.tool
Response follows the OpenAI schema:
{
"object": "list",
"data": [
{
"id": "gemma3:4b",
"object": "model",
"created": 1774398442,
"owned_by": "library"
}
]
}
POST /v1/chat/completions
Drop-in replacement for the OpenAI Chat Completions API. Works with the official openai Python SDK by setting base_url="http://localhost:11434/v1".
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma3:4b",
"messages": [
{"role": "user", "content": "What port does SSH use?"}
]
}'
The response matches the OpenAI format with choices and usage fields:
{
"id": "chatcmpl-124",
"object": "chat.completion",
"created": 1774398701,
"model": "gemma3:4b",
"system_fingerprint": "fp_ollama",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "SSH typically uses port 22 for secure remote access."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 18,
"completion_tokens": 13,
"total_tokens": 31
}
}
REST API: Model Management
These endpoints let you manage models programmatically, which is useful for automation scripts and deployment pipelines.
POST /api/pull
Pull a model via the API. Equivalent to ollama pull from the CLI.
curl -s http://localhost:11434/api/pull -d '{
"model": "gemma3:4b",
"stream": false
}'
Returns {"status": "success"} when complete.
POST /api/copy
Copy a model to a new name. Returns HTTP 200 with an empty body on success.
curl -s -X POST http://localhost:11434/api/copy -d '{
"source": "gemma3:4b",
"destination": "gemma3-backup"
}'
DELETE /api/delete
Delete a model. Returns HTTP 200 on success.
curl -s -X DELETE http://localhost:11434/api/delete -d '{
"model": "gemma3-backup"
}'
POST /api/create
Build a custom model directly from an inline body, no Modelfile on disk required. The server returns a stream of status events. Use this when your control plane wants to provision custom models without shipping files around.
curl -X POST http://localhost:11434/api/create -d '{
"model": "json-bot",
"from": "qwen2.5:7b",
"system": "You answer in JSON only. No prose.",
"parameters": {"temperature": 0.0, "num_ctx": 8192}
}'
POST /api/push
Upload a local model to ollama.com. Requires ollama signin first. Streams progress events as JSON. The model name must use your namespace, for example youruser/coder-strict.
curl -X POST http://localhost:11434/api/push -d '{
"model": "youruser/coder-strict",
"stream": true
}'
Modelfile Reference
A Modelfile defines a custom model: its base weights, system prompt, parameters, and template. Create one to build reusable model configurations.
Example Modelfile
Save this as Modelfile in your working directory:
FROM gemma3:4b
SYSTEM "You are a Linux systems administrator. Answer questions about Linux concisely."
PARAMETER temperature 0.7
PARAMETER top_p 0.9
Build the custom model from the Modelfile:
ollama create sysadmin-bot -f Modelfile
# Quantize while creating, in one step (FP16 or higher source recommended)
ollama create sysadmin-bot-q5 -f Modelfile -q q5_K_M
The -q / --quantize flag added in Ollama 0.5+ runs the quantization at build time and accepts the standard k-quant levels: q4_K_M (default), q5_K_M, q6_K, q8_0, plus the legacy q4_0, q5_0, q4_1, q5_1. Use it when importing an FP16 GGUF and you want a smaller deployment artifact without running llama.cpp separately.
The new model appears in your local registry:
NAME ID SIZE MODIFIED
sysadmin-bot:latest ab1cd106e0ee 3.3 GB Less than a second ago
gemma3:4b a2af6cc3eb7f 3.3 GB 4 minutes ago
Now run it like any other model:
echo "How do I check disk space?" | ollama run sysadmin-bot
The system prompt kicks in and you get a focused response:
Use the `df -h` command. It displays disk space in a human-readable format (KB, MB, GB).
Modelfile directives
| Directive | Purpose | Example |
|---|---|---|
| FROM | Base model (required) | FROM llama3.3 |
| SYSTEM | System prompt | SYSTEM "You are a helpful assistant" |
| PARAMETER | Set model parameters | PARAMETER temperature 0.7 |
| TEMPLATE | Go template for prompt format | TEMPLATE "{{ .Prompt }}" |
| ADAPTER | Path to LoRA/QLoRA adapter | ADAPTER ./lora.gguf |
| MESSAGE | Seed conversation history | MESSAGE user "Hi" |
| LICENSE | License text for the model | LICENSE "MIT" |
Common PARAMETER values
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Controls randomness. Lower values make output more deterministic |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 40 | Limits token selection to top K candidates |
| num_ctx | 2048 | Context window size in tokens |
| repeat_penalty | 1.1 | Penalizes repeated tokens |
| seed | 0 | Random seed for reproducible output (0 = random) |
| stop | model-specific | Stop sequences that end generation |
| num_predict | -1 | Maximum tokens to generate (-1 = unlimited) |
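The same names work per request, without a Modelfile, via the options object on /api/generate and /api/chat; request-level values override whatever the Modelfile set. A sketch:

import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "gemma3:4b",
    "prompt": "Name one Linux distribution.",
    "stream": False,
    "options": {
        "temperature": 0.2,  # same names as the PARAMETER table above
        "num_ctx": 8192,
        "seed": 42,          # fixed seed for reproducible output
    },
})
print(r.json()["response"])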
Environment Variables Reference
These environment variables configure the Ollama server. Set them in your systemd override file or export them before running ollama serve.
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Listen address and port |
| OLLAMA_MODELS | ~/.ollama/models | Model storage directory |
| OLLAMA_KEEP_ALIVE | 5m | How long models stay loaded after last request |
| OLLAMA_NUM_PARALLEL | auto | Maximum concurrent requests per model |
| OLLAMA_MAX_LOADED_MODELS | auto (per GPU) | Maximum models loaded simultaneously |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests before rejecting (HTTP 503 after) |
| OLLAMA_ORIGINS | localhost variants | Allowed CORS origins (comma-separated) |
| OLLAMA_DEBUG | false | Enable debug logging (1=debug, 2=trace) |
| OLLAMA_FLASH_ATTENTION | false | Enable flash attention 2 (faster on Ampere+ GPUs) |
| OLLAMA_KV_CACHE_TYPE | f16 | KV cache quantization (f16, q8_0, q4_0). Halves or quarters KV memory |
| OLLAMA_CONTEXT_LENGTH | auto (4k/32k/256k by VRAM) | Default num_ctx when the request does not set one |
| OLLAMA_GPU_OVERHEAD | 0 | Reserved VRAM per GPU in bytes |
| OLLAMA_LOAD_TIMEOUT | 5m | How long to allow a model load to stall before giving up |
| OLLAMA_NOPRUNE | false | Skip pruning unused model blobs on startup |
| OLLAMA_SCHED_SPREAD | false | Always shard a model across all GPUs (vs. preferring single-GPU placement) |
| OLLAMA_LLM_LIBRARY | auto | Force a specific backend (cuda_v12, cpu_avx2, rocm) bypassing autodetection. Added in 0.20+ |
| OLLAMA_NO_CLOUD | false | Disable cloud features (remote inference and web search). Added in 0.21+ |
| OLLAMA_MULTIUSER_CACHE | false | Optimize prompt caching for multiple users (legacy) |
| OLLAMA_NOHISTORY | false | Disable readline history in interactive mode |
| OLLAMA_EDITOR | (none) | Editor invoked by Ctrl+G in the REPL |
| OLLAMA_TMPDIR | system default | Temporary directory for downloads and extraction (legacy, may be removed) |
Python SDK
The official Python client (ollama/ollama-python) wraps every endpoint above with typed responses. Install with pip install ollama. The five most common patterns are below; all snippets ran live against the test rig.
import ollama
# 1. Generate (single-turn)
res = ollama.generate(model="qwen2.5:0.5b", prompt="Say hi.")
print(res["response"])
# Hello! How can I assist you today?
# 2. Chat (multi-turn)
res = ollama.chat(
model="qwen2.5:0.5b",
messages=[{"role": "user", "content": "Reply with one word: yes"}],
)
print(res["message"]["content"], "eval_count:", res["eval_count"])
# Yes eval_count: 2
# 3. Streaming
for chunk in ollama.generate(model="qwen2.5:0.5b",
prompt="List 3 colors:", stream=True):
print(chunk["response"], end="", flush=True)
# 4. Embeddings
res = ollama.embed(model="nomic-embed-text", input=["hello", "world"])
print(len(res["embeddings"]), "vectors of dim",
len(res["embeddings"][0]))
# 2 vectors of dim 768
# 5. Tool calling (Llama 3.1+, Mistral Small, Qwen 2.5+)
res = ollama.chat(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Weather in Nairobi?"}],
tools=[{
"type": "function",
"function": {
"name": "get_weather",
"description": "Current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}],
)
for call in res["message"].get("tool_calls", []):
print(call["function"]["name"], call["function"]["arguments"])
# get_weather {'city': 'Nairobi'}
For high-throughput services use ollama.AsyncClient (asyncio). For a remote server pass host="http://gpu-host:11434" to Client(...). The Node equivalent lives at ollama/ollama-js; for any other language, point an OpenAI SDK at http://localhost:11434/v1.
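A minimal AsyncClient sketch for fan-out workloads, gathering several prompts against one server (the prompts are arbitrary examples):

import asyncio
from ollama import AsyncClient

async def main():
    client = AsyncClient()  # defaults to http://localhost:11434
    prompts = ["Say hi.", "Name a color.", "What is 2+2?"]
    replies = await asyncio.gather(
        *(client.generate(model="qwen2.5:0.5b", prompt=p) for p in prompts)
    )
    for prompt, reply in zip(prompts, replies):
        print(prompt, "->", reply["response"])

asyncio.run(main())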
Useful One-Liners
Practical command combinations for scripting and daily use.
Check model disk usage
du -sh /usr/share/ollama/.ollama/models/
On a system with one 4B model:
3.2G /usr/share/ollama/.ollama/models/
Pipe a file into a prompt
Summarize a log file, config, or any text file:
cat /var/log/dnf.log | ollama run gemma3:4b "Summarize what packages were installed"
Batch inference from a file
Process multiple prompts from a text file, one per line:
while IFS= read -r prompt; do
echo "=== $prompt ==="
echo "$prompt" | ollama run gemma3:4b
echo ""
done < prompts.txt
Quick benchmark: tokens per second
Run a standardized prompt with --verbose to measure generation speed:
ollama run gemma3:4b --verbose "Write a 100-word essay about Linux." 2>&1 | grep "eval rate"
Check if Ollama API is reachable
curl -sf http://localhost:11434/ -o /dev/null && echo "Ollama is running"
Prints "Ollama is running" only when the server answers. The server's root endpoint already returns that exact text, so discarding curl's body avoids printing it twice, and -f makes curl fail on HTTP errors so the exit code is reliable in scripts.
Pull multiple models in sequence
for model in gemma3:4b llama3.2:3b qwen2.5:7b; do
echo "Pulling $model..."
ollama pull "$model"
done
Export model list as JSON
Useful for backup scripts or inventory tracking:
curl -s http://localhost:11434/api/tags | python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data['models']:
print(f\"{m['name']:30s} {m['details']['parameter_size']:>8s} {m['details']['quantization_level']}\")"
Sample output:
gemma3:4b 4.3B Q4_K_M
Use Ollama with the OpenAI Python SDK
Point the official openai package at your local Ollama instance. No API key required.
pip install openai
Then in Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
model="gemma3:4b",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Quick Command Reference Table
All CLI commands at a glance, sorted by category.
| Command | What it does |
|---|---|
| ollama pull MODEL | Download a model from the registry |
| ollama list (alias ls) | Show all local models |
| ollama ps | Show models loaded in memory |
| ollama show MODEL | Display model details (params, template, license) |
| ollama run MODEL | Start interactive chat or run one-shot prompt |
| ollama run MODEL --verbose | Run with timing stats |
| ollama run MODEL --format json | Force JSON output |
| ollama run MODEL --think true | Force thinking mode on (R1, Qwen 3, gpt-oss) |
| ollama run MODEL --hidethinking | Suppress chain-of-thought from visible output |
| ollama run MODEL --keepalive 1h | Keep the model loaded after the request |
| ollama run EMBED --dimensions N | Truncate output embeddings (Matryoshka) |
| ollama stop MODEL | Unload model from memory |
| ollama cp SRC DST | Copy a model to a new name |
| ollama rm MODEL [MODEL...] | Delete one or more models from disk |
| ollama create NAME -f FILE | Build a custom model from a Modelfile |
| ollama create NAME -f FILE -q q5_K_M | Build and quantize in one step |
| ollama push NAMESPACE/MODEL | Push a model to the registry (requires signin) |
| ollama signin / ollama signout | Authenticate with ollama.com |
| ollama launch [integration] | Open the desktop menu or hand off to a coding agent |
| ollama serve (alias start) | Start the server manually |
| ollama --version | Print Ollama version |