Build Local AI with Ollama and Qwen 3: RAG, Agents, and Beyond

March 5, 2026

TL;DR

  • Run powerful LLMs entirely on your machine — no API keys, no cloud bills, no data leaving your network
  • Build a production-grade RAG pipeline that ingests PDFs, web pages, and structured data into a local vector store
  • Create autonomous AI agents with tool-calling, memory, and multi-step reasoning
  • Orchestrate multiple local models for specialized tasks (coding, reasoning, embedding, summarization)
  • Optimize performance with quantization strategies, context window tuning, and GPU memory management
  • Harden your setup with security best practices for local AI deployments
  • Includes complete, runnable code for every component

What You'll Learn

  • How Ollama abstracts away the complexity of running LLMs locally and exposes an OpenAI-compatible API
  • The architecture behind Qwen 3's dense and Mixture-of-Experts models — and when to use each
  • Building a full RAG pipeline: document ingestion, chunking strategies, embedding, vector storage, retrieval, and generation
  • Creating AI agents that can call tools, browse files, query databases, and chain actions together
  • Advanced patterns: conversation memory, streaming responses, multi-model routing, and hybrid search
  • Performance tuning: GPU offloading, quantization tradeoffs, batch processing, and context window sizing
  • Security considerations for local AI deployments in professional environments

Prerequisites

  • Python 3.10+ installed on your system
  • 8 GB RAM minimum (16 GB+ recommended for 8B parameter models)
  • Command line familiarity — you'll be running terminal commands throughout
  • Basic Python knowledge — functions, classes, pip, virtual environments
  • A dedicated GPU is optional but dramatically improves performance (NVIDIA CUDA or Apple Silicon)

If you have an Apple Silicon Mac (M1/M2/M3/M4), you're in a great position — Ollama leverages Metal acceleration natively.


Why Run AI Locally?

The cloud-first approach to AI has dominated the industry, but it comes with significant tradeoffs that local deployment eliminates entirely.

The Case for Local AI

| Concern | Cloud AI | Local AI |
|---|---|---|
| Data Privacy | Data transmitted to third-party servers | Data never leaves your machine |
| Cost | Pay-per-token, scales with usage | One-time hardware cost, unlimited usage |
| Latency | Network round-trip + queue time | Direct hardware access, no network overhead |
| Availability | Dependent on provider uptime | Works offline, no internet required |
| Customization | Limited to provider's API parameters | Full control over model, prompts, and pipeline |
| Compliance | Complex data residency requirements | Data stays in your jurisdiction by default |
| Rate Limits | Throttled during peak demand | Limited only by your hardware |

When Local AI Makes Sense

Local AI is particularly valuable when:

  • You're processing sensitive documents (legal, medical, financial, proprietary code)
  • You need predictable costs — no surprise bills from token-heavy workloads
  • Your environment has restricted or no internet access
  • You want to experiment freely without worrying about API costs during development
  • You need to customize model behavior beyond what cloud APIs allow
  • Regulatory compliance (GDPR, HIPAA, SOC 2) requires data to remain on-premises

When Cloud AI Is Still Better

Be honest about the tradeoffs. Cloud models still win for:

  • State-of-the-art performance on the hardest reasoning tasks (frontier models like Claude, GPT-4o)
  • Massive context windows (200K+ tokens) without the VRAM to match
  • Zero infrastructure management — no GPU drivers, no model updates, no hardware failures
  • Rapid prototyping where setup time matters more than ongoing costs

The good news: local and cloud AI aren't mutually exclusive. Many production systems use local models for routine tasks and escalate to cloud models for complex reasoning.

Ollama Cloud: The Hybrid Option

Ollama also offers a cloud-hosted API at ollama.com, giving you access to large models (like qwen3:235b-a22b or deepseek-v3.1:671b) that would be impractical to run locally. The API is OpenAI-compatible and uses the same interface as local Ollama, so switching between local and cloud requires only changing the host URL:

# Set up cloud access (shell)
export OLLAMA_API_KEY="your-api-key-here"

# app.py
import os
from langchain_ollama import ChatOllama

# Local model — runs on your hardware
llm_local = ChatOllama(
    model="qwen3:8b",
    base_url="http://localhost:11434",
)

# Cloud model — runs on Ollama's servers
# (ChatOllama forwards client_kwargs to the underlying ollama client,
# which is how custom headers like the API key are attached)
llm_cloud = ChatOllama(
    model="qwen3:235b-a22b",
    base_url="https://api.ollama.com",
    client_kwargs={
        "headers": {"Authorization": f"Bearer {os.getenv('OLLAMA_API_KEY')}"}
    },
)

This lets you develop and test locally with smaller models, then use cloud-hosted large models for production or complex queries — all through the same API.


Understanding the Technology Stack

Ollama: Your Local Model Server

Ollama is the runtime layer that makes running LLMs locally as simple as running a Docker container. Under the hood, it handles:

  1. Model downloading and management — versioned model files with automatic updates
  2. Quantization — converts full-precision models to 4-bit or 8-bit formats that fit in consumer hardware
  3. GPU acceleration — automatic detection and utilization of NVIDIA CUDA, Apple Metal, or AMD ROCm
  4. API serving — exposes an OpenAI-compatible REST API on localhost:11434
  5. Concurrent model loading — run multiple models simultaneously (memory permitting)

graph LR
    A[Your Application] -->|HTTP API| B[Ollama Server]
    B --> C[Model Manager]
    C --> D[Quantized Model Files]
    B --> E[GPU Scheduler]
    E --> F[CUDA / Metal / ROCm]
    B --> G[Context Manager]
    G --> H[KV Cache]
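Because the server speaks plain HTTP, any client can talk to it. A minimal stdlib-only sketch against Ollama's native `/api/chat` endpoint (the model name and default port assume the setup described in this article; the send is guarded so the payload builder can be used on its own):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }


def chat(prompt: str, model: str = "qwen3:8b",
         host: str = "http://localhost:11434") -> str:
    """Send one chat request to a running Ollama server and return the reply."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["message"]["content"]


if __name__ == "__main__":
    # Requires `ollama serve` running locally
    print(chat("Why is the sky blue?"))
```

The same request shape works from curl, JavaScript, or any other language, which is why the later LangChain code is a convenience rather than a requirement.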

Qwen 3: The Model Family

Qwen 3, developed by Alibaba Cloud, is an open-source LLM family that competes with models many times its size. It comes in two architectures:

Dense Models — all parameters activate for every token:

| Model | Parameters | VRAM (4-bit) | Best For |
|---|---|---|---|
| qwen3:0.6b | 0.6B | ~1 GB | Edge devices, simple classification |
| qwen3:1.7b | 1.7B | ~2 GB | Summarization, basic Q&A |
| qwen3:4b | 4B | ~3 GB | General chat, light coding |
| qwen3:8b | 8B | ~6 GB | Recommended balance — strong reasoning + coding |
| qwen3:14b | 14B | ~10 GB | Complex analysis, detailed writing |
| qwen3:32b | 32B | ~20 GB | Near-frontier performance, research |

Mixture-of-Experts (MoE) Models — only a subset of "expert" sub-networks activate per token:

| Model | Total Params | Active Params | Memory (4-bit) | Best For |
|---|---|---|---|---|
| qwen3:30b-a3b | 30B | 3B | ~19 GB | Fast reasoning with large knowledge capacity |
| qwen3:235b-a22b | 235B | 22B | ~142 GB | Frontier-class performance (multi-GPU required) |

Why MoE matters: The qwen3:30b-a3b model has 30 billion total parameters but only activates 3 billion per token. Important caveat: all 30B parameters still need to be loaded into memory — the MoE advantage is inference speed, not memory savings. You get the knowledge capacity of a 30B model with the token generation speed closer to a 3B model. For RAG and agent workloads where speed matters, this is an excellent choice if you have the memory.
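The memory figures in the tables can be sanity-checked with simple arithmetic: at 4-bit quantization each parameter occupies roughly half a byte, plus runtime overhead for the KV cache and buffers. A rough estimator (the 20% overhead factor is an illustrative assumption, not an Ollama figure):

```python
def approx_vram_gb(total_params_billions: float,
                   bits_per_param: int = 4,
                   overhead: float = 1.2) -> float:
    """Rough memory estimate: params * bits / 8, times a fudge factor
    for KV cache, activations, and runtime buffers."""
    bytes_needed = total_params_billions * 1e9 * bits_per_param / 8
    return round(bytes_needed * overhead / 1e9, 1)


# qwen3:30b-a3b loads all 30B parameters even though only 3B are active,
# so the estimate uses total params, not active params
print(approx_vram_gb(30))   # 18.0 — in the ballpark of the table's ~19 GB
print(approx_vram_gb(8))    # 4.8 — close to the table's ~6 GB for qwen3:8b
```

Note that the MoE estimate uses the *total* parameter count: active parameters determine speed, not footprint.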

How the Pieces Fit Together

graph TD
    subgraph "Your Machine"
        A[Python Application] --> B[LangChain Framework]
        B --> C[Ollama Python Client]
        C --> D[Ollama Server localhost:11434]
        D --> E[Qwen 3 Model]
        D --> F[Embedding Model]

        B --> G[ChromaDB Vector Store]
        G --> H[Local SQLite + Parquet Files]

        B --> I[Tool Registry]
        I --> J[File System Tools]
        I --> K[Web Scraping Tools]
        I --> L[Database Tools]
        I --> M[Custom Functions]
    end

Installation and Setup

Step 1: Install Ollama

macOS (Homebrew):

brew install ollama

macOS / Linux (direct install):

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from ollama.com/download.

Verify the installation:

ollama --version
# Expected output: ollama version 0.x.x

Step 2: Pull Your Models

You'll need at least two models — one for generation and one for embeddings:

# Primary generation model (recommended starting point)
ollama pull qwen3:8b

# Embedding model for RAG (lightweight, fast)
ollama pull nomic-embed-text

# Optional: MoE model for complex reasoning tasks
ollama pull qwen3:30b-a3b

# Optional: Small model for fast classification/routing
ollama pull qwen3:0.6b

Verify your models are available:

ollama list

Test interactively:

ollama run qwen3:8b
>>> What are the key differences between RAG and fine-tuning?
>>> /bye

Step 3: Start the Ollama Server

Ollama needs to run as a background server for your Python applications to connect:

ollama serve

This starts the API on http://localhost:11434. You can verify it's running:

curl http://localhost:11434/api/tags

Tip: On macOS, the Ollama desktop app starts the server automatically. On Linux, you may want to set it up as a systemd service for automatic startup.
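On Linux, the official install script usually registers a systemd service for you. If yours doesn't have one, a minimal unit sketch (the binary path and `ollama` user are assumptions — adjust to your install):

```ini
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Restart=always
Environment="OLLAMA_HOST=127.0.0.1:11434"

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now ollama`.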

Step 4: Set Up Your Python Environment

# Create project directory
mkdir local-ai-project && cd local-ai-project

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

# Install dependencies
pip install \
  langchain \
  langchain-community \
  langchain-core \
  langchain-ollama \
  langchain-chroma \
  langchain-text-splitters \
  chromadb \
  sentence-transformers \
  pypdf \
  python-dotenv \
  unstructured[pdf] \
  tiktoken \
  rich

What each package does:

| Package | Purpose |
|---|---|
| langchain | Core framework for building LLM applications |
| langchain-ollama | Ollama-specific integration (chat models, embeddings) |
| langchain-chroma | LangChain integration for ChromaDB vector store |
| langchain-text-splitters | Document chunking with multiple strategies |
| chromadb | Local vector database — no server required |
| sentence-transformers | Alternative embedding models via HuggingFace |
| pypdf | PDF text extraction |
| unstructured[pdf] | Advanced PDF parsing (tables, images, layouts) |
| tiktoken | Token counting for context window management |
| rich | Beautiful terminal output for debugging |

Create a .env file for configuration:

# .env
OLLAMA_BASE_URL=http://localhost:11434
LLM_MODEL=qwen3:8b
EMBEDDING_MODEL=nomic-embed-text
CHROMA_PATH=./chroma_db
DATA_PATH=./data

Building a Production-Grade RAG System

RAG (Retrieval-Augmented Generation) is the most practical pattern for making LLMs useful with your own data. Instead of fine-tuning a model (expensive, slow, requires expertise), RAG feeds relevant context to the model at query time.

RAG Architecture Deep Dive

graph TD
    subgraph "Ingestion Pipeline (runs once per document)"
        A[Source Documents] --> B[Document Loader]
        B --> C[Text Splitter]
        C --> D[Chunk Optimizer]
        D --> E[Embedding Model]
        E --> F[Vector Store]
    end

    subgraph "Query Pipeline (runs per question)"
        G[User Question] --> H[Query Embedding]
        H --> I[Similarity Search]
        F --> I
        I --> J[Context Assembly]
        J --> K[Prompt Template]
        K --> L[LLM Generation]
        L --> M[Response]
    end

Step 1: Project Structure

local-ai-project/
├── .env
├── data/                    # Source documents go here
│   ├── papers/
│   ├── docs/
│   └── web/
├── chroma_db/               # Vector store (auto-created)
├── rag_pipeline.py          # Main RAG implementation
├── document_loaders.py      # Multi-format document loading
├── chunking.py              # Advanced chunking strategies
├── embeddings.py            # Embedding configuration
├── agent.py                 # AI agent implementation
└── utils.py                 # Shared utilities

Step 2: Multi-Format Document Loading

Most introductory RAG examples stop at PDFs. In practice, you'll need to ingest multiple formats. Here's a loader that handles PDFs, web pages, text files, and Markdown:

# document_loaders.py
import os
from typing import List
from langchain_core.documents import Document
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    UnstructuredMarkdownLoader,
    WebBaseLoader,
    DirectoryLoader,
)


def load_pdf(file_path: str) -> List[Document]:
    """Load a PDF file, returning one Document per page."""
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    # Add source metadata for traceability
    for doc in docs:
        doc.metadata["source_type"] = "pdf"
        doc.metadata["file_name"] = os.path.basename(file_path)
    print(f"Loaded {len(docs)} pages from {file_path}")
    return docs


def load_text(file_path: str) -> List[Document]:
    """Load a plain text file."""
    loader = TextLoader(file_path, encoding="utf-8")
    docs = loader.load()
    for doc in docs:
        doc.metadata["source_type"] = "text"
        doc.metadata["file_name"] = os.path.basename(file_path)
    return docs


def load_markdown(file_path: str) -> List[Document]:
    """Load a Markdown file with structure preservation."""
    loader = UnstructuredMarkdownLoader(file_path)
    docs = loader.load()
    for doc in docs:
        doc.metadata["source_type"] = "markdown"
        doc.metadata["file_name"] = os.path.basename(file_path)
    return docs


def load_web_page(url: str) -> List[Document]:
    """Load content from a web URL."""
    loader = WebBaseLoader(url)
    docs = loader.load()
    for doc in docs:
        doc.metadata["source_type"] = "web"
        doc.metadata["url"] = url
    print(f"Loaded content from {url}")
    return docs


def load_directory(dir_path: str, glob_pattern: str = "**/*.pdf") -> List[Document]:
    """Recursively load all matching files from a directory."""
    loader = DirectoryLoader(
        dir_path,
        glob=glob_pattern,
        loader_cls=PyPDFLoader,
        show_progress=True,
        use_multithreading=True,
    )
    docs = loader.load()
    print(f"Loaded {len(docs)} documents from {dir_path}")
    return docs


def load_documents(data_path: str) -> List[Document]:
    """
    Auto-detect and load all supported files from a directory.
    Handles: .pdf, .txt, .md files.
    """
    all_docs = []
    supported_extensions = {
        ".pdf": load_pdf,
        ".txt": load_text,
        ".md": load_markdown,
    }

    for root, _, files in os.walk(data_path):
        for file_name in files:
            ext = os.path.splitext(file_name)[1].lower()
            if ext in supported_extensions:
                file_path = os.path.join(root, file_name)
                try:
                    docs = supported_extensions[ext](file_path)
                    all_docs.extend(docs)
                except Exception as e:
                    print(f"Warning: Failed to load {file_path}: {e}")

    print(f"Total documents loaded: {len(all_docs)}")
    return all_docs

Step 3: Advanced Chunking Strategies

Chunking is where most RAG implementations succeed or fail. The wrong chunk size produces either too little context (the model can't answer) or too much noise (irrelevant information drowns out the answer).

# chunking.py
from typing import List
from langchain_core.documents import Document
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)


def chunk_by_size(
    documents: List[Document],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[Document]:
    """
    Split documents using recursive character splitting.

    This is the most general-purpose strategy. It tries to split on
    paragraph breaks first, then sentences, then words, preserving
    natural text boundaries.

    Args:
        chunk_size: Target characters per chunk. Larger = more context
                    per retrieval but fewer total chunks.
        chunk_overlap: Characters shared between adjacent chunks.
                       Prevents information loss at chunk boundaries.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    if chunks:  # Guard against division by zero on empty input
        print(f"  Avg chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} chars")
    return chunks


def chunk_markdown_by_headers(
    documents: List[Document],
    chunk_size: int = 1000,
    chunk_overlap: int = 200,
) -> List[Document]:
    """
    Split Markdown documents by headers first, then by size.

    This preserves the logical structure of documentation. Each chunk
    inherits the header hierarchy as metadata, so you know which
    section it belongs to.
    """
    headers_to_split_on = [
        ("#", "header_1"),
        ("##", "header_2"),
        ("###", "header_3"),
    ]

    md_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
        strip_headers=False,
    )

    # First pass: split by headers
    header_chunks = []
    for doc in documents:
        splits = md_splitter.split_text(doc.page_content)
        for split in splits:
            split.metadata.update(doc.metadata)
            header_chunks.append(split)

    # Second pass: split oversized header sections by character count
    size_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    final_chunks = size_splitter.split_documents(header_chunks)

    print(f"Split {len(documents)} docs → {len(header_chunks)} header sections → {len(final_chunks)} chunks")
    return final_chunks


# Chunking strategy guide
CHUNKING_GUIDE = """
Choosing the right chunk size:

| Use Case               | chunk_size | chunk_overlap | Why                                    |
|------------------------|-----------|---------------|----------------------------------------|
| Q&A over documentation | 800-1200  | 150-250       | Balanced context per retrieval         |
| Legal/contract review  | 1500-2000 | 300-400       | Clauses need surrounding context       |
| Code documentation     | 500-800   | 100-150       | Functions are naturally short          |
| Chat/conversational    | 300-500   | 50-100        | Short, focused answers                 |
| Academic papers        | 1000-1500 | 200-300       | Dense content needs full paragraphs    |
"""

Step 4: Embedding Configuration

Embeddings convert text into numerical vectors that capture semantic meaning. Two approaches work well locally:

# embeddings.py
from langchain_ollama import OllamaEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings


def get_ollama_embeddings(model_name: str = "nomic-embed-text") -> OllamaEmbeddings:
    """
    Use Ollama-served embedding models.

    Pros: Consistent with your LLM stack, GPU-accelerated, easy setup.
    Cons: Requires Ollama server running, slightly slower than dedicated libs.

    Available models:
        - nomic-embed-text: 137M params, 768 dims, good general-purpose
        - mxbai-embed-large: 335M params, 1024 dims, higher quality
        - all-minilm: 23M params, 384 dims, fastest option
    """
    embeddings = OllamaEmbeddings(model=model_name)
    print(f"Initialized Ollama embeddings: {model_name}")
    return embeddings


def get_huggingface_embeddings(
    model_name: str = "all-MiniLM-L6-v2",
) -> HuggingFaceEmbeddings:
    """
    Use HuggingFace sentence-transformers directly.

    Pros: No Ollama dependency, huge model selection, well-benchmarked.
    Cons: Separate download, CPU-only by default.

    Recommended models:
        - all-MiniLM-L6-v2: Fast, lightweight, 384 dims
        - all-mpnet-base-v2: Higher quality, 768 dims
        - BAAI/bge-large-en-v1.5: Best quality, 1024 dims, slower
    """
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name,
        model_kwargs={"device": "cpu"},  # Change to "mps" for Apple Silicon
        encode_kwargs={"normalize_embeddings": True},
    )
    print(f"Initialized HuggingFace embeddings: {model_name}")
    return embeddings


# Embedding model comparison
EMBEDDING_COMPARISON = """
| Model                    | Dimensions | Speed  | Quality | VRAM    |
|--------------------------|-----------|--------|---------|---------|
| all-MiniLM-L6-v2         | 384       | Fast   | Good    | ~100MB  |
| nomic-embed-text         | 768       | Medium | Better  | ~300MB  |
| all-mpnet-base-v2        | 768       | Medium | Better  | ~400MB  |
| mxbai-embed-large        | 1024      | Slow   | Best    | ~700MB  |
| BAAI/bge-large-en-v1.5   | 1024      | Slow   | Best    | ~1.3GB  |
"""

Step 5: Vector Store with ChromaDB

ChromaDB runs entirely locally — no external database server needed. Data persists as SQLite + Parquet files on disk.

# rag_pipeline.py
import os
from typing import List, Optional
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

from document_loaders import load_documents
from chunking import chunk_by_size
from embeddings import get_ollama_embeddings

load_dotenv()

CHROMA_PATH = os.getenv("CHROMA_PATH", "./chroma_db")
DATA_PATH = os.getenv("DATA_PATH", "./data")
LLM_MODEL = os.getenv("LLM_MODEL", "qwen3:8b")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")


class LocalRAG:
    """A complete local RAG system with ingestion and querying."""

    def __init__(
        self,
        llm_model: str = LLM_MODEL,
        embedding_model: str = EMBEDDING_MODEL,
        chroma_path: str = CHROMA_PATH,
        context_window: int = 8192,
        temperature: float = 0.1,
    ):
        self.llm_model = llm_model
        self.chroma_path = chroma_path
        self.context_window = context_window

        # Initialize embedding function
        self.embedding_function = get_ollama_embeddings(embedding_model)

        # Initialize LLM
        self.llm = ChatOllama(
            model=llm_model,
            temperature=temperature,
            num_ctx=context_window,
        )

        # Load or create vector store
        self.vector_store = Chroma(
            persist_directory=chroma_path,
            embedding_function=self.embedding_function,
        )

        print(f"LocalRAG initialized:")
        print(f"  LLM: {llm_model} (ctx: {context_window})")
        print(f"  Embeddings: {embedding_model}")
        print(f"  Vector store: {chroma_path}")
        print(f"  Existing documents: {self.vector_store._collection.count()}")

    def ingest(
        self,
        data_path: str = DATA_PATH,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
    ) -> int:
        """
        Ingest documents from a directory into the vector store.
        Returns the number of chunks indexed.
        """
        # Load documents
        documents = load_documents(data_path)
        if not documents:
            print("No documents found to ingest.")
            return 0

        # Chunk documents
        chunks = chunk_by_size(documents, chunk_size, chunk_overlap)

        # Add to vector store with deduplication
        # Use content hash as ID to prevent duplicate indexing
        import hashlib

        ids = []
        unique_chunks = []
        seen = set()

        for chunk in chunks:
            content_hash = hashlib.md5(
                chunk.page_content.encode()
            ).hexdigest()
            if content_hash not in seen:
                seen.add(content_hash)
                ids.append(content_hash)
                unique_chunks.append(chunk)

        if unique_chunks:
            # Add to the existing store rather than rebuilding it with
            # Chroma.from_documents; reusing content-hash IDs means
            # re-ingesting the same documents upserts instead of duplicating
            self.vector_store.add_documents(
                documents=unique_chunks,
                ids=ids,
            )
            print(f"Indexed {len(unique_chunks)} unique chunks "
                  f"(skipped {len(chunks) - len(unique_chunks)} duplicates)")

        return len(unique_chunks)

    def query(
        self,
        question: str,
        k: int = 4,
        search_type: str = "similarity",
    ) -> str:
        """
        Query the RAG system with a question.

        Args:
            question: The user's question
            k: Number of chunks to retrieve (more = more context but slower)
            search_type: "similarity" for cosine similarity,
                         "mmr" for Maximum Marginal Relevance (more diverse results)
        """
        retriever = self.vector_store.as_retriever(
            search_type=search_type,
            search_kwargs={"k": k},
        )

        # RAG prompt template
        template = """You are a helpful assistant answering questions based on
the provided context. Use ONLY the context below to answer. If the context
doesn't contain enough information to answer fully, say so explicitly.

Context:
{context}

Question: {question}

Answer:"""

        prompt = ChatPromptTemplate.from_template(template)

        def format_docs(docs: List[Document]) -> str:
            """Format retrieved documents with source attribution."""
            formatted = []
            for i, doc in enumerate(docs, 1):
                source = doc.metadata.get("file_name", doc.metadata.get("source", "unknown"))
                page = doc.metadata.get("page", "")
                header = f"[Source {i}: {source}"
                if page:
                    header += f", page {page}"
                header += "]"
                formatted.append(f"{header}\n{doc.page_content}")
            return "\n\n---\n\n".join(formatted)

        # Build the chain
        chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

        return chain.invoke(question)

    def query_with_sources(
        self, question: str, k: int = 4
    ) -> dict:
        """
        Query and return both the answer and the source documents
        used to generate it — useful for verification and debugging.
        """
        retriever = self.vector_store.as_retriever(
            search_kwargs={"k": k}
        )

        # Retrieve relevant documents
        retrieved_docs = retriever.invoke(question)

        # Format context
        context = "\n\n---\n\n".join(
            doc.page_content for doc in retrieved_docs
        )

        template = """You are a helpful assistant. Answer the question based
ONLY on the following context. Cite which source numbers you used.

Context:
{context}

Question: {question}

Answer (cite sources as [1], [2], etc.):"""

        prompt = ChatPromptTemplate.from_template(template)
        chain = prompt | self.llm | StrOutputParser()

        answer = chain.invoke({
            "context": context,
            "question": question,
        })

        return {
            "answer": answer,
            "sources": [
                {
                    "content": doc.page_content[:200] + "...",
                    "metadata": doc.metadata,
                }
                for doc in retrieved_docs
            ],
        }


# --- Usage ---
if __name__ == "__main__":
    # Initialize the RAG system
    rag = LocalRAG(
        llm_model="qwen3:8b",
        context_window=8192,
    )

    # Ingest documents (run once, or when new documents are added)
    rag.ingest("./data")

    # Query
    response = rag.query("What is the main topic of the document?")
    print(f"\nAnswer: {response}")

    # Query with source attribution
    result = rag.query_with_sources("Summarize the key findings.")
    print(f"\nAnswer: {result['answer']}")
    print(f"\nSources used:")
    for i, src in enumerate(result["sources"], 1):
        print(f"  [{i}] {src['metadata']}")

Step 6: Streaming Responses

For a better user experience, stream responses token-by-token instead of waiting for the full answer:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def stream_rag_response(rag_instance, question: str):
    """Stream the RAG response for real-time display."""
    retriever = rag_instance.vector_store.as_retriever(
        search_kwargs={"k": 4}
    )

    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)

    template = """Answer based on this context:
{context}

Question: {question}
Answer:"""

    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | rag_instance.llm | StrOutputParser()

    print(f"Q: {question}")
    print("A: ", end="", flush=True)

    for chunk in chain.stream({
        "context": context,
        "question": question,
    }):
        print(chunk, end="", flush=True)

    print()  # Final newline

Building AI Agents with Tool Calling

Agents go beyond simple question-answering. They can reason about what tools to use, execute multi-step plans, and interact with external systems — all running locally.

Agent Architecture

graph TD
    A[User Input] --> B[Agent LLM]
    B --> C{Decision}
    C -->|Need info| D[Call Tool]
    D --> E[Tool Result]
    E --> B
    C -->|Ready to answer| F[Final Response]
    B --> G[Scratchpad / Memory]
    G --> B

The agent follows the ReAct (Reasoning + Acting) pattern:

  1. Thought: The model reasons about what to do next
  2. Action: It selects a tool and provides arguments
  3. Observation: The tool returns a result
  4. Repeat until the agent has enough information
  5. Final Answer: The model synthesizes a response
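The five steps above can be sketched as a plain Python loop. Here `model` is a stand-in for any function that returns either a tool call or a final answer; the dict protocol is an illustration of the pattern, not LangChain's actual wire format:

```python
def react_loop(model, tools: dict, question: str, max_steps: int = 5):
    """Minimal ReAct loop: think, act, observe, repeat."""
    scratchpad = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model("\n".join(scratchpad))   # Thought: decide what to do
        if step["type"] == "final":
            return step["answer"]             # Final Answer: synthesize
        tool = tools[step["tool"]]            # Action: pick a tool
        observation = tool(**step["args"])    # Observation: run it
        scratchpad.append(f"Action: {step['tool']}({step['args']})")
        scratchpad.append(f"Observation: {observation}")
    return "Stopped: step limit reached"


# Toy run with a scripted "model" and a single tool
def fake_model(context: str) -> dict:
    if "Observation" not in context:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "answer": "2 + 3 = 5"}


result = react_loop(fake_model, {"add": lambda a, b: a + b}, "What is 2 + 3?")
print(result)  # → 2 + 3 = 5
```

The LangChain agent built below implements this same loop; the framework's job is generating the tool-call decisions from the model's output and managing the scratchpad for you.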

Step 1: Define Custom Tools

Tools are Python functions decorated with @tool. The docstring is critical — the LLM reads it to decide when and how to use the tool.

# agent.py
import os
import json
import datetime
import subprocess
from typing import Optional
from dotenv import load_dotenv
from langchain_core.tools import tool

load_dotenv()


@tool
def get_current_datetime(format: str = "%Y-%m-%d %H:%M:%S") -> str:
    """
    Returns the current date and time formatted as a string.
    Use this when the user asks about the current date, time, or both.

    Args:
        format: Python strftime format string.
            Examples: '%Y-%m-%d' for date only, '%H:%M:%S' for time only,
            '%A, %B %d, %Y' for 'Monday, January 01, 2026'.
    """
    try:
        return datetime.datetime.now().strftime(format)
    except ValueError as e:
        return f"Invalid format string: {e}"


@tool
def read_file(file_path: str) -> str:
    """
    Read the contents of a file and return it as text.
    Use this when the user asks you to read, examine, or analyze a file.

    Args:
        file_path: Path to the file to read (relative or absolute).
    """
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        if len(content) > 5000:
            return content[:5000] + f"\n\n[Truncated — file is {len(content)} chars total]"
        return content
    except FileNotFoundError:
        return f"File not found: {file_path}"
    except Exception as e:
        return f"Error reading file: {e}"


@tool
def list_directory(path: str = ".") -> str:
    """
    List files and directories at the given path.
    Use this when the user asks what files are in a directory.

    Args:
        path: Directory path to list. Defaults to current directory.
    """
    try:
        entries = os.listdir(path)
        dirs = sorted(e for e in entries if os.path.isdir(os.path.join(path, e)))
        files = sorted(e for e in entries if os.path.isfile(os.path.join(path, e)))

        result = f"Directory: {os.path.abspath(path)}\n\n"
        if dirs:
            result += "Directories:\n" + "\n".join(f"  📁 {d}/" for d in dirs) + "\n\n"
        if files:
            result += "Files:\n" + "\n".join(f"  📄 {f}" for f in files)
        return result
    except FileNotFoundError:
        return f"Directory not found: {path}"
    except NotADirectoryError:
        return f"Not a directory: {path}"


@tool
def calculate(expression: str) -> str:
    """
    Evaluate a mathematical expression and return the result.
    Use this for any calculation the user needs.

    Args:
        expression: A Python math expression (e.g., '2**10', 'sqrt(144)',
                    '3.14 * 5**2'). Supports: +, -, *, /, **, sqrt, abs,
                    round, min, max.
    """
    import math

    allowed_names = {
        "sqrt": math.sqrt, "abs": abs, "round": round,
        "min": min, "max": max, "pi": math.pi, "e": math.e,
        "log": math.log, "log10": math.log10, "sin": math.sin,
        "cos": math.cos, "tan": math.tan, "ceil": math.ceil,
        "floor": math.floor,
    }
    try:
        # eval() runs with builtins disabled and only the whitelisted names
        # above — still, treat this as a convenience, not a hard sandbox.
        result = eval(expression, {"__builtins__": {}}, allowed_names)
        return str(result)
    except Exception as e:
        return f"Calculation error: {e}"


@tool
def search_files(
    directory: str, pattern: str, max_results: int = 10
) -> str:
    """
    Search for files matching a pattern in a directory tree.
    Use this when the user wants to find files by name or extension.

    Args:
        directory: Root directory to search from.
        pattern: Glob pattern to match (e.g., '*.py', '*.md', 'README*').
        max_results: Maximum number of results to return.
    """
    import glob

    search_pattern = os.path.join(directory, "**", pattern)
    matches = glob.glob(search_pattern, recursive=True)[:max_results]

    if not matches:
        return f"No files matching '{pattern}' found in {directory}"

    return f"Found {len(matches)} file(s):\n" + "\n".join(
        f"  {m}" for m in matches
    )


# Register all tools
tools = [
    get_current_datetime,
    read_file,
    list_directory,
    calculate,
    search_files,
]

Step 2: Build and Run the Agent

# agent.py (continued)
from langchain_ollama import ChatOllama
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


def create_local_agent(
    model_name: str = "qwen3:8b",
    temperature: float = 0,
    verbose: bool = True,
):
    """
    Create a tool-calling agent powered by a local LLM.

    Args:
        model_name: Ollama model to use.
        temperature: Lower = more deterministic tool selection.
        verbose: Show the agent's reasoning steps.
    """
    # Initialize the LLM
    llm = ChatOllama(
        model=model_name,
        temperature=temperature,
    )

    # Agent prompt with system instructions
    prompt = ChatPromptTemplate.from_messages([
        (
            "system",
            "You are a helpful AI assistant with access to tools. "
            "Use tools when they would help answer the user's question. "
            "If you don't need a tool, answer directly. "
            "Always explain your reasoning briefly before using a tool."
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])

    # Create the agent
    agent = create_tool_calling_agent(llm, tools, prompt)

    # Wrap in executor
    executor = AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=verbose,
        max_iterations=10,         # Safety limit on reasoning loops
        handle_parsing_errors=True, # Graceful recovery from bad outputs
    )

    return executor


def run_agent_conversation():
    """Run an interactive conversation with the agent."""
    agent = create_local_agent(model_name="qwen3:8b")

    print("Local AI Agent Ready. Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ("quit", "exit", "bye"):
            print("Goodbye!")
            break
        if not user_input:
            continue

        try:
            response = agent.invoke({"input": user_input})
            print(f"\nAgent: {response['output']}\n")
        except Exception as e:
            print(f"\nError: {e}\n")


if __name__ == "__main__":
    run_agent_conversation()

Step 3: Agent with Conversation Memory

Stateless agents forget everything between turns. Add memory for multi-turn conversations:

from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory


def create_agent_with_memory(model_name: str = "qwen3:8b"):
    """Create an agent that remembers conversation history."""

    llm = ChatOllama(model=model_name, temperature=0)

    prompt = ChatPromptTemplate.from_messages([
        (
            "system",
            "You are a helpful AI assistant with access to tools and "
            "memory of the conversation. Reference previous messages "
            "when relevant."
        ),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])

    agent = create_tool_calling_agent(llm, tools, prompt)
    executor = AgentExecutor(
        agent=agent, tools=tools, verbose=True
    )

    # Session-based memory store
    store = {}

    def get_session_history(session_id: str):
        if session_id not in store:
            store[session_id] = InMemoryChatMessageHistory()
        return store[session_id]

    agent_with_memory = RunnableWithMessageHistory(
        executor,
        get_session_history,
        input_messages_key="input",
        history_messages_key="chat_history",
    )

    return agent_with_memory


# Usage
if __name__ == "__main__":
    agent = create_agent_with_memory()

    config = {"configurable": {"session_id": "user-1"}}

    # First turn
    r1 = agent.invoke(
        {"input": "What's the current date?"},
        config=config,
    )
    print(r1["output"])

    # Second turn — the agent remembers the first turn
    r2 = agent.invoke(
        {"input": "What day of the week is that?"},
        config=config,
    )
    print(r2["output"])
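One caveat: the in-memory store above grows without bound, and a long session will eventually overflow num_ctx. A minimal, framework-agnostic sketch of a trimming policy (trim_history and the (role, text) tuple shape are illustrative, not LangChain APIs):

```python
def trim_history(messages, max_chars=8000, keep_system=True):
    """Keep the most recent messages that fit a rough character budget.

    messages: list of (role, text) tuples, oldest first.
    Crude proxy: ~4 characters per token, so 8000 chars is roughly
    2000 tokens of history.
    """
    system = [m for m in messages if m[0] == "system"] if keep_system else []
    rest = [m for m in messages if m[0] != "system"]

    kept, used = [], sum(len(text) for _, text in system)
    for role, text in reversed(rest):          # walk newest-first
        if used + len(text) > max_chars:
            break                              # budget exhausted
        kept.append((role, text))
        used += len(text)
    return system + list(reversed(kept))       # restore chronological order
```

In practice you would apply a policy like this inside get_session_history, or reach for LangChain's own message-trimming utilities; the point is that unbounded memory and a fixed context window don't mix.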

Advanced Patterns

Multi-Model Routing

Use different models for different tasks to optimize speed and quality. A small model decides which larger model to route to:

from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


class ModelRouter:
    """
    Route queries to the best model based on task complexity.

    Uses a small, fast model to classify queries, then routes
    to specialized models for generation.
    """

    def __init__(self):
        # Fast classifier model
        self.classifier = ChatOllama(
            model="qwen3:0.6b", temperature=0, num_ctx=512
        )
        # Models for different task types
        self.models = {
            "simple": ChatOllama(
                model="qwen3:4b", temperature=0.3, num_ctx=4096
            ),
            "reasoning": ChatOllama(
                model="qwen3:8b", temperature=0.1, num_ctx=8192
            ),
            "complex": ChatOllama(
                model="qwen3:30b-a3b", temperature=0.1, num_ctx=16384
            ),
        }

    def classify(self, query: str) -> str:
        """Classify query complexity."""
        prompt = ChatPromptTemplate.from_template(
            "Classify this query as 'simple', 'reasoning', or 'complex'. "
            "Reply with only one word.\n\n"
            "simple: greetings, factual lookups, basic questions\n"
            "reasoning: analysis, comparisons, multi-step logic\n"
            "complex: creative writing, code generation, research\n\n"
            "Query: {query}\nClassification:"
        )
        chain = prompt | self.classifier | StrOutputParser()
        result = chain.invoke({"query": query}).strip().lower()

        # Default to reasoning if classification is unclear
        if result not in self.models:
            result = "reasoning"
        return result

    def route(self, query: str) -> str:
        """Route query to the appropriate model and get a response."""
        category = self.classify(query)
        model = self.models[category]
        print(f"[Router] Query classified as '{category}'")

        prompt = ChatPromptTemplate.from_template("{query}")
        chain = prompt | model | StrOutputParser()
        return chain.invoke({"query": query})

Combining RAG with Agents

The most powerful pattern: an agent that can search your document store as one of its tools:

from langchain.agents import tool
from rag_pipeline import LocalRAG


# Initialize RAG as a global resource
rag = LocalRAG(llm_model="qwen3:8b")


@tool
def search_knowledge_base(query: str) -> str:
    """
    Search the local knowledge base for information relevant to the query.
    Use this when the user asks about topics covered in the indexed documents.

    Args:
        query: The search query — be specific for better results.
    """
    result = rag.query_with_sources(query, k=3)
    answer = result["answer"]
    sources = ", ".join(
        s["metadata"].get("file_name", "unknown")
        for s in result["sources"]
    )
    return f"{answer}\n\n[Sources: {sources}]"

This lets the agent decide when to search documents vs. use other tools vs. answer from its training data.


Performance Tuning

Context Window Configuration

The num_ctx parameter is the single most impactful performance setting. It determines how many tokens the model can see at once.

# How to choose num_ctx:
#
# num_ctx = prompt_tokens + retrieved_context_tokens + output_tokens
#
# For RAG with k=4 chunks of ~250 tokens each:
#   prompt (~100) + context (4 * 250 = 1000) + output (~500) = ~1600 tokens
#   Add safety margin: 1600 * 1.5 = 2400 → set num_ctx=4096
#
# For agents with tool calling:
#   System prompt (~200) + history (~1000) + scratchpad (~2000) + output (~500)
#   = ~3700 → set num_ctx=8192

llm_rag = ChatOllama(model="qwen3:8b", num_ctx=4096)    # RAG queries
llm_agent = ChatOllama(model="qwen3:8b", num_ctx=8192)  # Agent tasks
llm_long = ChatOllama(model="qwen3:8b", num_ctx=32768)  # Long documents
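If you'd rather compute the budget than eyeball it, the arithmetic above fits in a small helper (choose_num_ctx is a name invented here):

```python
def choose_num_ctx(prompt_tokens, context_tokens, output_tokens, margin=1.5):
    """Pick the smallest power-of-two context window that covers the
    estimated token budget plus a safety margin."""
    need = int((prompt_tokens + context_tokens + output_tokens) * margin)
    num_ctx = 2048                  # a reasonable floor for modern models
    while num_ctx < need:
        num_ctx *= 2
    return num_ctx
```

choose_num_ctx(100, 1000, 500) reproduces the RAG example: a 2,400-token budget rounds up to 4096.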

VRAM vs. RAM Tradeoffs

Scenario | VRAM Used | RAM Used | Speed
---------|-----------|----------|------
Model fits entirely in VRAM | Full model | Minimal | Fastest
Model partially in VRAM | Partial | Overflow | Medium
Model entirely in RAM (CPU only) | None | Full model | Slowest (5-10x)

Key rules:

  • Every 1B parameters ≈ 0.5-0.7 GB memory at 4-bit quantization (plus KV cache overhead)
  • num_ctx also consumes VRAM — larger context = more KV cache memory
  • Monitor with nvidia-smi (NVIDIA) or Activity Monitor (macOS)
  • If your fans spin up hard during inference, the model is likely falling back to CPU

Quantization Guide

Ollama serves quantized models by default. Here's what the quantization levels mean:

Quantization | Size vs. FP16 | Quality Loss | Speed
-------------|---------------|--------------|------
Q4_0 | ~25% | Noticeable | Fastest
Q4_K_M | ~27% | Minor | Fast
Q5_K_M | ~33% | Minimal | Medium
Q6_K | ~40% | Negligible | Slower
Q8_0 | ~50% | None | Slowest
FP16 | 100% | None | Requires massive VRAM

Most Ollama models default to Q4_K_M, which offers the best balance. For higher quality:

# Pull a specific quantization
ollama pull qwen3:8b-q6_K
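The size column follows from simple arithmetic: weights occupy roughly parameters × bits per weight / 8 bytes. A back-of-the-envelope estimator (approx_size_gb is illustrative; real quantized files carry metadata and mixed-precision overhead, so expect slightly larger numbers):

```python
def approx_size_gb(params_billion, bits_per_weight):
    """Rough size of the model weights alone (no KV cache, no overhead).

    1 billion parameters at 8 bits/weight is ~1 GB, so:
    size_gb ≈ params_billion * bits_per_weight / 8
    """
    return params_billion * bits_per_weight / 8
```

For example, an 8B model at 4 bits comes out to about 4 GB of weights, which matches the "0.5-0.7 GB per billion parameters" rule of thumb above once cache and overhead are added.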

Batch Processing for Large Document Sets

When ingesting many documents, process in batches to manage memory:

def batch_ingest(
    rag: LocalRAG,
    data_path: str,
    batch_size: int = 50,
):
    """Ingest documents in batches to prevent memory exhaustion."""
    from document_loaders import load_documents
    from chunking import chunk_by_size

    all_docs = load_documents(data_path)
    chunks = chunk_by_size(all_docs)

    total = len(chunks)
    for i in range(0, total, batch_size):
        batch = chunks[i : i + batch_size]
        rag.vector_store.add_documents(batch)
        print(f"Indexed batch {i // batch_size + 1} "
              f"({min(i + batch_size, total)}/{total} chunks)")

    print(f"Ingestion complete: {total} chunks indexed")

Controlling Qwen 3's Thinking Mode

Qwen 3 supports a unique hybrid thinking capability. You can toggle between deep reasoning and fast responses at the prompt level:

from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="qwen3:8b", temperature=0)

# Deep reasoning mode — slower but more accurate for complex problems
response_think = llm.invoke(
    "Prove that the square root of 2 is irrational. /think"
)

# Fast mode — skip the internal chain-of-thought for simple queries
response_fast = llm.invoke(
    "What is the capital of France? /no_think"
)

When to use each mode:

Mode | Trigger | Best For
-----|---------|---------
/think | Complex reasoning | Math proofs, code debugging, multi-step analysis
/no_think | Simple lookups | Factual questions, formatting, classification
Default | Model decides | General use — the model chooses based on complexity

For RAG specifically, /no_think often works better because the answer should come from the retrieved context, not from extended reasoning.
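One practical note: depending on your Ollama version, Qwen 3's reasoning may arrive inline in the response text, wrapped in <think>...</think> tags. If you only want to show users the final answer, strip that block first. A minimal helper (strip_think is a name chosen here):

```python
import re


def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> reasoning block from a Qwen 3
    response, leaving only the final answer. Safe on responses that have
    no thinking block at all."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
```

Run it on response_think.content before logging or display; responses produced with /no_think pass through unchanged.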


Security Considerations

Running AI locally doesn't automatically make it secure. Here are the key areas to address:

Input Sanitization

import re


def sanitize_query(query: str, max_length: int = 2000) -> str:
    """
    Sanitize user input before passing to the LLM.
    Prevents prompt injection and resource exhaustion.
    """
    # Truncate to prevent context window abuse
    query = query[:max_length]

    # Remove common prompt injection patterns
    injection_patterns = [
        r"ignore\s+(previous|all|above)\s+instructions",
        r"you\s+are\s+now\s+",
        r"system\s*:\s*",
        r"<\|.*?\|>",
    ]
    for pattern in injection_patterns:
        query = re.sub(pattern, "[filtered]", query, flags=re.IGNORECASE)

    return query.strip()

File Access Controls

When agents have file system tools, restrict what they can access:

import os

ALLOWED_DIRECTORIES = [
    os.path.abspath("./data"),
    os.path.abspath("./output"),
]


def validate_file_path(file_path: str) -> bool:
    """Ensure file access stays within allowed directories."""
    abs_path = os.path.abspath(file_path)
    # Compare whole path components — a plain startswith() check would
    # wrongly accept sibling paths like ./data-secret when ./data is allowed.
    return any(
        os.path.commonpath([abs_path, allowed_dir]) == allowed_dir
        for allowed_dir in ALLOWED_DIRECTORIES
    )

Network Isolation

Ollama binds to localhost by default, which is correct for local-only deployments. If you need network access:

# ONLY do this if you need remote access — and use a firewall
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Better: use a reverse proxy with authentication
# nginx, caddy, or traefik in front of Ollama
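As a sketch of the reverse-proxy option, here is a minimal nginx configuration with basic auth in front of Ollama. The server name, certificate paths, and htpasswd file are placeholders; adapt them to your environment.

```nginx
server {
    listen 443 ssl;
    server_name ollama.internal.example;   # placeholder hostname

    ssl_certificate     /etc/nginx/certs/ollama.crt;   # placeholder paths
    ssl_certificate_key /etc/nginx/certs/ollama.key;

    location / {
        auth_basic           "Ollama";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;   # Ollama stays on localhost
        proxy_read_timeout   300s;   # long generations need a generous timeout
    }
}
```

With this in place, Ollama itself never listens on a public interface; only the authenticated proxy does.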

Troubleshooting Guide

Problem | Cause | Solution
--------|-------|---------
connection refused on port 11434 | Ollama server not running | Run ollama serve or start the Ollama desktop app
Model responses are very slow | Model exceeding VRAM, falling back to CPU | Use a smaller model or reduce num_ctx
out of memory error | Model + context don't fit in available memory | Switch to a smaller model (e.g., qwen3:4b) or close other apps
Empty or nonsensical RAG answers | Chunk size too small or k too low | Increase chunk_size to 1200 and k to 5
Agent loops without answering | Tool docstrings unclear, or model too small | Improve docstrings; use qwen3:8b minimum for agents
model not found error | Model not pulled yet | Run ollama pull <model_name>
Duplicate results in retrieval | Missing deduplication during ingestion | Use content-hash IDs (shown in the ingest method above)
ChromaDB permission denied | Vector store directory locked | Delete chroma_db/ and re-ingest
Embeddings dimension mismatch | Changed embedding model after indexing | Delete chroma_db/ and re-ingest with the new model
Agent uses wrong tool | Ambiguous tool docstrings | Make tool descriptions more specific and mutually exclusive

Key Takeaways

Local AI is production-ready. With Ollama and Qwen 3, you can build RAG pipelines and AI agents that rival cloud-based solutions for most practical workloads — while keeping your data private and your costs predictable.

The core principles:

  1. Start small, scale up: Begin with qwen3:4b to validate your pipeline, then upgrade to 8b or 30b-a3b for production
  2. Chunking is everything: The quality of your RAG system depends more on chunking strategy than model size
  3. Tools need great docs: Agent tool-calling reliability is directly proportional to the quality of your function docstrings
  4. Monitor your resources: Watch VRAM usage and adjust num_ctx before it becomes a bottleneck
  5. Layer your models: Use small models for classification/routing and large models for generation
  6. Secure by default: Restrict file access, sanitize inputs, and keep Ollama on localhost

Next Steps

  • Add more document types: Extend the loader to handle CSV, JSON, HTML, and DOCX files using LangChain's community loaders
  • Build a web interface: Wrap your RAG system with a FastAPI backend and a React or Streamlit frontend
  • Implement evaluation: Use ragas or deepeval to measure retrieval quality and answer accuracy
  • Explore Modelfiles: Create custom Ollama Modelfiles to set default system prompts, temperature, and context windows per use case
  • Set up monitoring: Track query latency, retrieval relevance scores, and model resource usage over time


Frequently Asked Questions

Q: Can I run these models on 8 GB of RAM without a dedicated GPU?

A: 8 GB RAM with no dedicated GPU can run qwen3:0.6b and qwen3:1.7b. For the recommended qwen3:8b, you need at least 6-8 GB VRAM (or 16 GB unified memory on Apple Silicon). For larger models that don't fit on your hardware, consider Ollama Cloud; it uses the same API, so your code works unchanged.
