Ollama Fundamentals
Ollama CLI Mastery
The Ollama CLI is your primary interface for managing and running models. Let's explore its full capabilities.
Complete Command Reference
┌─────────────────────────────────────────────────────────────────┐
│                       Ollama CLI Commands                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Model Management            Server Control                     │
│  ────────────────            ──────────────                     │
│  pull   - Download model     serve - Start server               │
│  push   - Upload model       ps    - List running models        │
│  list   - Show models        stop  - Stop a running model       │
│  rm     - Delete model                                          │
│  cp     - Copy model         Information                        │
│  create - Create custom      ───────────                        │
│                              show    - Model details            │
│  Execution                   help    - Show help                │
│  ─────────                   version - Show version             │
│  run - Interactive                                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Model Management Commands
Listing Models
# List all downloaded models
ollama list
# NAME                     ID              SIZE      MODIFIED
# llama3.2:latest          a80c4f17acd5    4.7 GB    2 hours ago
# mistral:latest           2ae6f6dd7a3d    4.1 GB    1 day ago
# deepseek-coder:latest    8934d96d3f08    4.7 GB    3 days ago
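`ollama list` reports per-model sizes; total disk usage can be checked against the model store directory (commonly `~/.ollama/models`; Linux service installs store under the `ollama` user's home instead, and `OLLAMA_MODELS` overrides either):

```shell
# Total disk footprint of the local model store
du -sh ~/.ollama/models
```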
Model Information
# Show detailed model info
ollama show llama3.2
# Output includes:
# - Model architecture
# - Parameters
# - Quantization level
# - License
# - Template format
# Show specific sections
ollama show llama3.2 --modelfile # Show Modelfile
ollama show llama3.2 --license # Show license
ollama show llama3.2 --template # Show prompt template
ollama show llama3.2 --parameters # Show parameters
Pulling Specific Versions
# Pull latest version
ollama pull llama3.2
# Pull specific size (available tags vary by model)
ollama pull llama3.2:1b      # 1 billion parameters
ollama pull llama3.2:3b      # 3 billion parameters
ollama pull llama3.1:8b      # 8B and 70B sizes ship under llama3.1
ollama pull llama3.1:70b
# Pull specific quantization (check the model's library page for exact tags)
ollama pull llama3.2:3b-instruct-q4_0    # 4-bit quantization
ollama pull llama3.2:3b-instruct-q8_0    # 8-bit quantization
Copying and Renaming
# Copy a model (useful before modifications)
ollama cp llama3.2 my-llama
# Now you have both:
# - llama3.2 (original)
# - my-llama (copy for customization)
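There is no dedicated rename command as of this writing, so the usual pattern is copy-then-delete:

```shell
# "Rename" a model: copy to the new name, then remove the old tag
ollama cp llama3.2 my-llama
ollama rm llama3.2
```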
Running Models with Parameters
Temperature Control
Note that `ollama run` does not accept sampling flags on the command line; parameters are set inside the interactive session with `/set parameter` (or baked into a Modelfile, or passed via the API):
# Lower temperature = more deterministic
ollama run llama3.2
>>> /set parameter temperature 0.1
>>> Write a haiku about coding
# Higher temperature = more creative
>>> /set parameter temperature 1.5
>>> Write a haiku about coding
Context Length
# Increase context window (uses more memory)
>>> /set parameter num_ctx 8192
# Default is typically 2048 or 4096
GPU Layers
# Control how many layers run on GPU vs CPU
>>> /set parameter num_gpu 35
# 0 = CPU only (useful if GPU memory is low)
>>> /set parameter num_gpu 0
Runtime Parameters Table
These are the option names accepted by `/set parameter` (and by the API's `options` object):

| Parameter | Description | Default | Range |
|---|---|---|---|
| `temperature` | Randomness | 0.8 | 0.0-2.0 |
| `top_p` | Nucleus sampling | 0.9 | 0.0-1.0 |
| `top_k` | Top-k sampling | 40 | 1-100 |
| `num_ctx` | Context length | 2048 | 512-32768 |
| `num_gpu` | GPU layers | auto | 0-100 |
| `repeat_penalty` | Repetition penalty | 1.1 | 0.0-2.0 |
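The same parameters can also be set per-request through the REST API's `options` object, using the underlying option names (`temperature`, `num_ctx`, and so on). A sketch, assuming a server listening on the default port 11434:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a haiku about coding",
  "stream": false,
  "options": { "temperature": 0.1, "num_ctx": 8192 }
}'
```

`"stream": false` returns one complete JSON object instead of a stream of token chunks.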
Process Management
Viewing Running Models
# See what models are currently loaded
ollama ps
# NAME        ID              SIZE      PROCESSOR      UNTIL
# llama3.2    a80c4f17acd5    5.1 GB    100% GPU       4 minutes
# mistral     2ae6f6dd7a3d    4.5 GB    50% GPU/CPU    Idle
Stopping Models
# Stop a specific model (free memory)
ollama stop llama3.2
# Models also unload automatically after timeout (default: 5 min)
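`ollama stop` takes one model at a time; to unload everything at once, the `ps` output can be scripted (a sketch; `-r` is GNU xargs' no-run-if-empty flag):

```shell
# Skip the header row, then stop each running model by name
ollama ps | awk 'NR>1 {print $1}' | xargs -r -n1 ollama stop
```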
Keep Models Loaded
# Keep a model loaded indefinitely (any negative duration)
ollama run llama3.2 --keepalive -1m
# Set specific duration
ollama run llama3.2 --keepalive 30m
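The `--keepalive` flag only covers that one invocation; a server-wide default can be set with the `OLLAMA_KEEP_ALIVE` environment variable before starting the server:

```shell
# Keep models loaded for 1 hour after last use (a negative value means forever)
export OLLAMA_KEEP_ALIVE=1h
ollama serve
```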
Server Configuration
Environment Variables
# Change API host/port
export OLLAMA_HOST=0.0.0.0:11434
# Set model storage location
export OLLAMA_MODELS=/mnt/external/models
# Control concurrency
export OLLAMA_NUM_PARALLEL=2       # Parallel requests per loaded model
export OLLAMA_MAX_LOADED_MODELS=3  # Models kept in memory at once
# Start server with config
ollama serve
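Note that on Linux installs where Ollama runs as a systemd service, variables exported in your shell never reach the server process; a drop-in override is the usual fix:

```shell
# Open an editor for a drop-in override of the ollama service unit
sudo systemctl edit ollama
# Add these lines in the editor, then save:
#   [Service]
#   Environment="OLLAMA_MODELS=/mnt/external/models"
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama
```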
Useful Server Flags
# Start server in debug mode
OLLAMA_DEBUG=1 ollama serve
# Allow cross-origin requests (for web apps)
OLLAMA_ORIGINS="*" ollama serve
Advanced Usage Patterns
Batch Processing
# Process multiple files
for file in *.txt; do
echo "Processing $file..."
cat "$file" | ollama run llama3.2 "Summarize:" > "${file%.txt}_summary.txt"
done
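The output filename in the loop comes from shell parameter expansion: `${file%.txt}` strips the shortest trailing `.txt` match. A standalone check of the naming it produces:

```shell
file="notes.txt"
# %.txt removes the suffix, leaving the bare name to build on
echo "${file%.txt}_summary.txt"
```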
JSON Output
# Get structured JSON response
ollama run llama3.2 "List 3 programming languages as JSON array" --format json
# Output: ["Python", "JavaScript", "Go"]
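Because `--format json` constrains the model to emit syntactically valid JSON, the result can be piped straight into JSON-aware tooling. A standalone sketch using `python3 -m json.tool` as the validator (assumes `python3` on PATH; the echoed array stands in for model output):

```shell
# Validate and pretty-print; malformed JSON makes json.tool exit non-zero
echo '["Python", "JavaScript", "Go"]' | python3 -m json.tool
```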
Chaining Models
# Use one model's output as input to another
ollama run llama3.2 "Write a story about AI" | \
ollama run mistral "Critique this story:"
Quick Debugging
# Check server logs
journalctl -u ollama -f # Linux (systemd)
tail -f ~/.ollama/logs/server.log # macOS
# Test API directly
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello"
}'
# Check GPU utilization
watch -n 1 nvidia-smi # NVIDIA
The CLI gives you full control over your local LLM workflow. In the next lesson, we'll learn to create custom models with Modelfiles.