Ollama Commands Cheat Sheet: CLI and API Reference

Every time I need to check a model’s parameter count or hit the embedding endpoint, I end up scrolling through docs. This cheat sheet is the reference I keep coming back to. It covers every Ollama CLI command and REST API endpoint with tested examples you can copy and run.

Original content from computingforgeeks.com - post 164411

Ollama 0.18.x ships with a full CLI for model management, an interactive chat interface, and a REST API that includes OpenAI-compatible endpoints. Whether you are scripting model deployments, building applications against the API, or just need to remember the right curl command, this page has it. If you need installation instructions first, see Install Ollama on Rocky Linux 10 / Ubuntu 24.04.

Tested March 2026 on Rocky Linux 10.1 with Ollama 0.18.2, gemma3:4b (Q4_K_M)

Model Management

These commands handle downloading, inspecting, copying, and removing models from the local registry.

Pull a model

Downloads a model from the Ollama library. Tag defaults to latest if omitted.

ollama pull gemma3:4b

List local models

Shows all models stored on disk with their size and last modified time.

ollama list

The output includes the model ID, size, and when it was last updated:

NAME         ID              SIZE      MODIFIED       
gemma3:4b    a2af6cc3eb7f    3.3 GB    42 seconds ago

List running models

Shows models currently loaded in memory, including processor allocation and context size.

ollama ps

On a CPU-only system, all processing runs on the CPU:

NAME         ID              SIZE      PROCESSOR    CONTEXT    UNTIL              
gemma3:4b    a2af6cc3eb7f    4.3 GB    100% CPU     4096       4 minutes from now

Show model details

Displays architecture, parameter count, quantization, context length, and license info.

ollama show gemma3:4b

Output confirms the model architecture and capabilities:

  Model
    architecture        gemma3    
    parameters          4.3B      
    context length      131072    
    embedding length    2560      
    quantization        Q4_K_M    

  Capabilities
    completion    
    vision        

  Parameters
    stop           "<end_of_turn>"    
    temperature    1                  
    top_k          64                 
    top_p          0.95

You can also extract specific sections with flags:

ollama show gemma3:4b --modelfile
ollama show gemma3:4b --parameters
ollama show gemma3:4b --template

Copy a model

Creates a new reference to an existing model. Useful for creating custom variants without re-downloading weights.

ollama cp gemma3:4b mymodel:latest

Both names now point to the same model ID:

copied 'gemma3:4b' to 'mymodel:latest'

Remove a model

Deletes a model from disk. The blob files are only removed when no other model references them.

ollama rm mymodel:latest

Confirmation:

deleted 'mymodel:latest'

Stop a running model

Unloads a model from memory without stopping the Ollama server.

ollama stop gemma3:4b

Running Models

The ollama run command handles both interactive chat sessions and one-shot prompts from the command line.

Interactive chat

Opens an interactive session. Type /bye to exit.

ollama run gemma3:4b

One-shot prompt

Pass the prompt as a second argument. Ollama prints the response and exits.

ollama run gemma3:4b "What port does PostgreSQL use?"

Piped input

Pipe text from another command or file. Great for scripting.

echo "What is the capital of France?" | ollama run gemma3:4b

Response:

The capital of France is Paris.

JSON output

Force the model to respond with valid JSON using --format json. Include “JSON” in your prompt so the model understands what structure you want.

echo "List 3 Linux distributions. Return JSON array with name and year fields." | ollama run gemma3:4b --format json

The model returns structured JSON:

{
  "distributions": [
    {"name": "Ubuntu", "year": 2004},
    {"name": "Debian", "year": 1993},
    {"name": "Fedora", "year": 2003}
  ]
}

Verbose output with timing stats

Add --verbose to see token generation speed and load times. Useful for benchmarking.

ollama run gemma3:4b --verbose "What is 2+2? Reply with just the number."

After the response, timing stats are printed:

4

total duration:       1.681255883s
load duration:        495.619141ms
prompt eval count:    22 token(s)
prompt eval duration: 956.787989ms
prompt eval rate:     22.99 tokens/s
eval count:           3 token(s)
eval duration:        224.407368ms
eval rate:            13.37 tokens/s

Keep model loaded

By default, models unload after 5 minutes of inactivity. Override this with --keepalive.

ollama run gemma3:4b --keepalive 30m "Hello"

Set to -1 to keep a model loaded indefinitely, or 0 to unload immediately after the request.

Server Management

Ollama runs as a systemd service on Linux. These commands cover starting, stopping, and configuring the server.

Start the server manually

If you need to run Ollama outside of systemd (for debugging, for example):

ollama serve

Systemd service commands

The standard install creates an ollama.service unit. Manage it with systemctl:

sudo systemctl start ollama
sudo systemctl stop ollama
sudo systemctl restart ollama
sudo systemctl status ollama

Verify the service is active:

● ollama.service - Ollama Service
     Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
     Active: active (running) since Wed 2026-03-25 03:25:07 EAT; 2s ago
   Main PID: 5181 (ollama)
      Tasks: 9 (limit: 100206)
     Memory: 11.7M (peak: 24.7M)
        CPU: 49ms
     CGroup: /system.slice/ollama.service
             └─5181 /usr/local/bin/ollama serve

View logs

Check the journal for errors, request logs, and model load events:

sudo journalctl -u ollama -f

Sample log entries showing API requests:

Mar 25 03:32:37 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:37 | 200 |  3.808679349s |       127.0.0.1 | POST     "/api/generate"
Mar 25 03:32:54 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:32:54 | 200 | 11.318770126s |       127.0.0.1 | POST     "/api/generate"
Mar 25 03:33:01 ollama-cheatsheet ollama[7258]: [GIN] 2026/03/25 - 03:33:01 | 200 |     539.182µs |       127.0.0.1 | POST     "/api/copy"

Set environment variables via systemd override

To configure Ollama to listen on all interfaces or change the model storage path, create a systemd override file:

sudo systemctl edit ollama

Add the environment variables you need between the comments:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/ollama/models"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_FLASH_ATTENTION=1"

Reload and restart for changes to take effect:

sudo systemctl daemon-reload
sudo systemctl restart ollama

REST API: Generate and Chat

The Ollama API listens on port 11434 by default. All endpoints accept JSON. Set "stream": false to get the complete response in a single JSON object instead of a stream of tokens.
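With streaming left at its default of true, the API returns one JSON object per line (NDJSON): each carries a fragment of the response, and the final object has "done": true plus the timing stats. A minimal sketch of reassembling those chunks in Python; the sample lines below are illustrative, not captured output:

```python
import json

# Each line of a streaming /api/generate response is a standalone JSON
# object; "response" holds the next text fragment, and the final object
# has "done": true along with the timing fields.
def assemble_stream(lines):
    text = []
    final = {}
    for line in lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # the last chunk carries the stats
    return "".join(text), final

# Illustrative chunks shaped like Ollama's NDJSON stream
sample = [
    '{"model": "gemma3:4b", "response": "The sky ", "done": false}',
    '{"model": "gemma3:4b", "response": "is blue.", "done": true, "eval_count": 4}',
]
full_text, final = assemble_stream(sample)
print(full_text)  # The sky is blue.
```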

POST /api/generate

Single-turn text generation. Takes a prompt string and returns a completion.

curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue? Answer in one sentence.",
  "stream": false
}'

Response JSON (trimmed for readability):

{
  "model": "gemma3:4b",
  "created_at": "2026-03-25T00:29:37.830168406Z",
  "response": "The sky appears blue due to a phenomenon called Rayleigh scattering, where shorter wavelengths of sunlight (like blue) are scattered more by the Earth's atmosphere than longer wavelengths.",
  "done": true,
  "done_reason": "stop",
  "total_duration": 4509574648,
  "load_duration": 498389364,
  "prompt_eval_count": 20,
  "eval_count": 35
}

All *_duration fields are in nanoseconds, and the *_count fields are token counts. To calculate tokens per second, divide eval_count by eval_duration after converting the duration to seconds.
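That conversion is easy to get wrong, so here is the arithmetic as a tiny helper. The numbers are illustrative; eval_duration appears in the full, untrimmed API response:

```python
# Duration fields in Ollama API responses are in nanoseconds.
NS_PER_S = 1e9

def tokens_per_second(eval_count, eval_duration_ns):
    """Generation speed: tokens divided by seconds of eval time."""
    return eval_count / (eval_duration_ns / NS_PER_S)

# Example: 35 tokens generated in 3.5 billion nanoseconds (3.5 s)
rate = tokens_per_second(35, 3_500_000_000)
print(f"{rate:.2f} tokens/s")  # 10.00 tokens/s
```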

POST /api/chat

Multi-turn conversation endpoint. Pass a messages array with role/content pairs, just like the OpenAI Chat API format.

curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma3:4b",
  "messages": [
    {"role": "system", "content": "You are a Linux sysadmin. Be brief."},
    {"role": "user", "content": "What is DNS?"}
  ],
  "stream": false
}'

The response wraps the assistant message in a message object:

{
  "model": "gemma3:4b",
  "created_at": "2026-03-25T00:29:50.789139412Z",
  "message": {
    "role": "assistant",
    "content": "DNS (Domain Name System) is a hierarchical and distributed naming system that translates human-readable domain names (like google.com) into the numerical IP addresses computers use to communicate with each other."
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 5747578407,
  "load_duration": 268908121,
  "prompt_eval_count": 16,
  "eval_count": 40
}
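The endpoint is stateless: to continue a conversation, append the returned assistant message and your next question to the messages array and send the whole history again. A minimal sketch of that bookkeeping (payload construction only; no request is sent):

```python
# /api/chat keeps no server-side state, so the full message history
# must be resent on every turn.
def next_request(history, assistant_reply, user_followup, model="gemma3:4b"):
    messages = history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_followup},
    ]
    return {"model": model, "messages": messages, "stream": False}

history = [
    {"role": "system", "content": "You are a Linux sysadmin. Be brief."},
    {"role": "user", "content": "What is DNS?"},
]
req = next_request(history, "DNS translates names to IP addresses.",
                   "What port does it use?")
print(len(req["messages"]))  # 4
```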

POST /api/generate with JSON format

Force structured JSON output by setting "format": "json" in the request body. The model will only return valid JSON.

curl -s http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "List 3 Linux distributions. Return JSON array with name and year fields.",
  "format": "json",
  "stream": false
}'

The response field contains valid JSON that you can parse directly:

{
  "distributions": [
    {"name": "Ubuntu", "year": 2004},
    {"name": "Debian", "year": 1993},
    {"name": "Fedora", "year": 2003}
  ]
}
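Note that the structured output arrives as a JSON string inside the response field of the outer API response, so it takes a second parse. A sketch, using an illustrative (abbreviated) response body:

```python
import json

# The API response is JSON, and with "format": "json" its "response"
# field is itself a JSON-encoded string that needs a second json.loads.
api_body = '''{
  "model": "gemma3:4b",
  "response": "{\\"distributions\\": [{\\"name\\": \\"Debian\\", \\"year\\": 1993}]}",
  "done": true
}'''

outer = json.loads(api_body)              # parse the API envelope
data = json.loads(outer["response"])      # parse the model's JSON payload
print(data["distributions"][0]["name"])   # Debian
```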

GET /api/tags

Lists all models available locally. Same data as ollama list but in JSON.

curl -s http://localhost:11434/api/tags | python3 -m json.tool

Each model entry includes its digest, size, parameter count, and quantization level:

{
  "models": [
    {
      "name": "gemma3:4b",
      "model": "gemma3:4b",
      "modified_at": "2026-03-25T03:27:22.913735934+03:00",
      "size": 3338801804,
      "digest": "a2af6cc3eb7fa8be...",
      "details": {
        "format": "gguf",
        "family": "gemma3",
        "parameter_size": "4.3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}

POST /api/show

Returns detailed model metadata: parameters, template, license, and the full model_info block with architecture details.

curl -s http://localhost:11434/api/show -d '{"model": "gemma3:4b"}' | python3 -m json.tool

POST /api/embed

Generates vector embeddings for text input. Requires a model that supports embeddings (such as nomic-embed-text or all-minilm). General chat models like gemma3 or llama3 do not support this endpoint.

ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama runs large language models locally"
}'

The response contains an embeddings array of float vectors. These can be stored in a vector database like PostgreSQL with pgvector for similarity search and RAG applications.
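The usual comparison for similarity search is cosine similarity between embedding vectors. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings (nomic-embed-text vectors actually have 768 dimensions):

```python
import math

# Cosine similarity: dot product divided by the product of magnitudes.
# Returns 1.0 for identical directions, 0.0 for orthogonal vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings returned by /api/embed
doc = [0.1, 0.9, 0.2]
query = [0.1, 0.8, 0.3]
print(round(cosine_similarity(doc, query), 4))
```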

REST API: OpenAI-Compatible Endpoints

Ollama exposes OpenAI-compatible endpoints at /v1/. This lets you point any OpenAI SDK or tool at your local Ollama instance by changing the base URL.

GET /v1/models

Lists available models in the OpenAI format.

curl -s http://localhost:11434/v1/models | python3 -m json.tool

Response follows the OpenAI schema:

{
  "object": "list",
  "data": [
    {
      "id": "gemma3:4b",
      "object": "model",
      "created": 1774398442,
      "owned_by": "library"
    }
  ]
}

POST /v1/chat/completions

Drop-in replacement for the OpenAI Chat Completions API. Works with the official openai Python SDK by setting base_url="http://localhost:11434/v1".

curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma3:4b",
    "messages": [
      {"role": "user", "content": "What port does SSH use?"}
    ]
  }'

The response matches the OpenAI format with choices and usage fields:

{
  "id": "chatcmpl-124",
  "object": "chat.completion",
  "created": 1774398701,
  "model": "gemma3:4b",
  "system_fingerprint": "fp_ollama",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "SSH typically uses port 22 for secure remote access."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 13,
    "total_tokens": 31
  }
}

REST API: Model Management

These endpoints let you manage models programmatically, which is useful for automation scripts and deployment pipelines.

POST /api/pull

Pull a model via the API. Equivalent to ollama pull from the CLI.

curl -s http://localhost:11434/api/pull -d '{
  "model": "gemma3:4b",
  "stream": false
}'

Returns {"status": "success"} when complete.
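With streaming enabled (the default), /api/pull instead emits progress objects carrying status, total, and completed fields while each layer downloads. A sketch of turning those into a percentage; the sample line is illustrative:

```python
import json

# Streaming /api/pull lines carry "total" and "completed" byte counts
# while a layer downloads; status-only lines have neither.
def progress_percent(line):
    obj = json.loads(line)
    total = obj.get("total")
    if not total:
        return None  # e.g. {"status": "success"} or digest-verification lines
    return 100.0 * obj.get("completed", 0) / total

sample = '{"status": "pulling a2af6cc3eb7f", "total": 3338801804, "completed": 1669400902}'
print(f"{progress_percent(sample):.1f}%")  # 50.0%
```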

POST /api/copy

Copy a model to a new name. Returns HTTP 200 with an empty body on success.

curl -s -X POST http://localhost:11434/api/copy -d '{
  "source": "gemma3:4b",
  "destination": "gemma3-backup"
}'

DELETE /api/delete

Delete a model. Returns HTTP 200 on success.

curl -s -X DELETE http://localhost:11434/api/delete -d '{
  "model": "gemma3-backup"
}'

Modelfile Reference

A Modelfile defines a custom model: its base weights, system prompt, parameters, and template. Create one to build reusable model configurations.

Example Modelfile

Save this as Modelfile in your working directory:

FROM gemma3:4b
SYSTEM "You are a Linux systems administrator. Answer questions about Linux concisely."
PARAMETER temperature 0.7
PARAMETER top_p 0.9

Build the custom model from the Modelfile:

ollama create sysadmin-bot -f Modelfile

The new model appears in your local registry:

NAME                   ID              SIZE      MODIFIED               
sysadmin-bot:latest    ab1cd106e0ee    3.3 GB    Less than a second ago    
gemma3:4b              a2af6cc3eb7f    3.3 GB    4 minutes ago

Now run it like any other model:

echo "How do I check disk space?" | ollama run sysadmin-bot

The system prompt kicks in and you get a focused response:

Use the `df -h` command. It displays disk space in a human-readable format (KB, MB, GB).

Modelfile directives

Directive    Purpose                          Example
FROM         Base model (required)            FROM llama3.3
SYSTEM       System prompt                    SYSTEM "You are a helpful assistant"
PARAMETER    Set model parameters             PARAMETER temperature 0.7
TEMPLATE     Go template for prompt format    TEMPLATE "{{ .Prompt }}"
ADAPTER      Path to LoRA/QLoRA adapter       ADAPTER ./lora.gguf
MESSAGE      Seed conversation history        MESSAGE user "Hi"
LICENSE      License text for the model       LICENSE "MIT"

Common PARAMETER values

Parameter         Default          Description
temperature       0.8              Controls randomness; lower values make output more deterministic
top_p             0.9              Nucleus sampling threshold
top_k             40               Limits token selection to the top K candidates
num_ctx           2048             Context window size in tokens
repeat_penalty    1.1              Penalizes repeated tokens
seed              0                Random seed for reproducible output (0 = random)
stop              model-specific   Stop sequences that end generation
num_predict       -1               Maximum tokens to generate (-1 = unlimited)
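Since a Modelfile is plain text, deployment scripts can generate one from a config dict and feed it to ollama create. A minimal sketch; the model name and parameter values are illustrative:

```python
# Build a Modelfile string from a base model, system prompt, and a
# dict of PARAMETER values, ready to write to disk for `ollama create`.
def build_modelfile(base, system, params):
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"

text = build_modelfile(
    "gemma3:4b",
    "You are a Linux systems administrator. Be concise.",
    {"temperature": 0.7, "top_p": 0.9},
)
print(text)
```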

Environment Variables Reference

These environment variables configure the Ollama server. Set them in your systemd override file or export them before running ollama serve.

Variable                   Default             Description
OLLAMA_HOST                127.0.0.1:11434     Listen address and port
OLLAMA_MODELS              ~/.ollama/models    Model storage directory
OLLAMA_KEEP_ALIVE          5m                  How long models stay loaded after the last request
OLLAMA_NUM_PARALLEL        1                   Maximum concurrent requests per model
OLLAMA_MAX_LOADED_MODELS   0 (auto)            Maximum models loaded simultaneously
OLLAMA_MAX_QUEUE           512                 Maximum queued requests before rejecting
OLLAMA_ORIGINS             localhost variants  Allowed CORS origins (comma-separated)
OLLAMA_DEBUG               false               Enable debug logging (1=debug, 2=trace)
OLLAMA_FLASH_ATTENTION     false               Enable flash attention for faster inference
OLLAMA_KV_CACHE_TYPE       f16                 KV cache quantization type (f16, q8_0, q4_0)
OLLAMA_CONTEXT_LENGTH      0 (auto)            Default context length for all models
OLLAMA_GPU_OVERHEAD        0                   Reserved VRAM per GPU in bytes
OLLAMA_LOAD_TIMEOUT        5m                  Maximum time to wait for model loading
OLLAMA_NOPRUNE             false               Skip pruning unused model blobs on startup
OLLAMA_SCHED_SPREAD        false               Spread model layers across all GPUs
OLLAMA_MULTIUSER_CACHE     false               Optimize prompt caching for multiple users
OLLAMA_NOHISTORY           false               Disable readline history in interactive mode
OLLAMA_TMPDIR              system default      Temporary directory for downloads and extraction

Useful One-Liners

Practical command combinations for scripting and daily use.

Check model disk usage

du -sh /usr/share/ollama/.ollama/models/

On a system with one 4B model:

3.2G	/usr/share/ollama/.ollama/models/

Pipe a file into a prompt

Summarize a log file, config, or any text file:

cat /var/log/dnf.log | ollama run gemma3:4b "Summarize what packages were installed"

Batch inference from a file

Process multiple prompts from a text file, one per line:

while IFS= read -r prompt; do
  echo "=== $prompt ==="
  echo "$prompt" | ollama run gemma3:4b
  echo ""
done < prompts.txt

Quick benchmark: tokens per second

Run a standardized prompt with --verbose to measure generation speed:

ollama run gemma3:4b --verbose "Write a 100-word essay about Linux." 2>&1 | grep "eval rate"
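To capture the number itself rather than the whole line, a small regex over the stats output works; the anchored pattern skips the "prompt eval rate" line. A sketch using a sample stats block:

```python
import re

# Extract the generation speed from ollama's --verbose stats output.
# The ^ anchor (with MULTILINE) matches "eval rate:" but not the
# "prompt eval rate:" line above it.
def parse_eval_rate(stats_text):
    m = re.search(r"^eval rate:\s+([\d.]+) tokens/s", stats_text, re.MULTILINE)
    return float(m.group(1)) if m else None

sample = """prompt eval rate:     22.99 tokens/s
eval count:           3 token(s)
eval rate:            13.37 tokens/s"""
print(parse_eval_rate(sample))  # 13.37
```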

Check if Ollama API is reachable

curl -sf http://localhost:11434/ -o /dev/null && echo "Ollama is running"

Prints "Ollama is running" when the server responds. curl exits non-zero if the connection fails (or, with -f, on an HTTP error), which suppresses the echo.

Pull multiple models in sequence

for model in gemma3:4b llama3.2:3b qwen2.5:7b; do
  echo "Pulling $model..."
  ollama pull "$model"
done

Export model list as JSON

Useful for backup scripts or inventory tracking:

curl -s http://localhost:11434/api/tags | python3 -c "
import sys, json
data = json.load(sys.stdin)
for m in data['models']:
    print(f\"{m['name']:30s} {m['details']['parameter_size']:>8s} {m['details']['quantization_level']}\")"

Sample output:

gemma3:4b                          4.3B Q4_K_M

Use Ollama with the OpenAI Python SDK

Point the official openai package at your local Ollama instance. No API key required.

pip install openai

Then in Python:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
response = client.chat.completions.create(
    model="gemma3:4b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Quick Command Reference Table

All CLI commands at a glance, sorted by category.

Command                          What it does
ollama pull MODEL                Download a model from the registry
ollama list                      Show all local models
ollama ps                        Show models loaded in memory
ollama show MODEL                Display model details (params, template, license)
ollama run MODEL                 Start interactive chat or run a one-shot prompt
ollama run MODEL --verbose       Run with timing stats
ollama run MODEL --format json   Force JSON output
ollama stop MODEL                Unload a model from memory
ollama cp SRC DST                Copy a model to a new name
ollama rm MODEL                  Delete a model from disk
ollama create NAME -f FILE       Build a custom model from a Modelfile
ollama serve                     Start the server manually
ollama --version                 Print the Ollama version
