
Advanced Chunking Strategies

Chunking Methods Compared


Chunking determines how documents are split into retrievable units. The right strategy significantly impacts retrieval quality.

Why Chunking Matters

Poor chunking leads to:

  • Split context: Related information in different chunks
  • Noise: Irrelevant content mixed in with relevant material
  • Lost meaning: Key concepts broken apart

Chunking Methods Overview

Method      How It Works                        Best For
----------  ----------------------------------  -----------------
Fixed-size  Split by character/token count      Simple documents
Recursive   Split by separators hierarchically  Structured text
Sentence    Split on sentence boundaries        Narrative content
Semantic    Split by topic/meaning change       Complex documents

Fixed-Size Chunking

The simplest approach: split by raw character count. The chunk_overlap setting repeats the tail of each chunk at the start of the next, so context that straddles a boundary is not lost entirely:

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator=""  # Character-level splitting
)

chunks = splitter.split_text(document)

Pros: Simple, predictable chunk sizes
Cons: Splits mid-sentence, ignores structure
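
To see the downside concretely, here is a small illustrative run. The tiny chunk_size and the sample text are made up for the demo; real settings would be far larger:

from langchain.text_splitter import CharacterTextSplitter

# Deliberately small chunk_size so the mid-sentence splits are visible
demo = CharacterTextSplitter(chunk_size=40, chunk_overlap=5, separator="")
sample = (
    "Retrieval quality depends on chunking. "
    "A chunk cut mid-sentence loses meaning."
)
for i, chunk in enumerate(demo.split_text(sample)):
    print(f"chunk {i}: {chunk!r}")  # boundaries land mid-word, mid-sentence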

Recursive Character Splitting

The most commonly used method; it tries a hierarchy of separators so that chunks respect document structure:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",    # Paragraphs first
        "\n",      # Then lines
        ". ",      # Then sentences
        ", ",      # Then clauses
        " ",       # Then words
        ""         # Finally characters
    ]
)

chunks = splitter.split_documents(documents)

How it works:

  1. Try to split on \n\n (paragraphs)
  2. If chunks still too large, try \n
  3. Continue down the separator list
  4. Ensures natural break points (see the short demo below)
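
A short illustrative run (with made-up text and a deliberately small chunk_size) shows the effect; each paragraph ends up in its own chunk because the \n\n separator is tried first:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size too small to hold two paragraphs, so each stays whole
demo = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
text = (
    "First paragraph about topic A.\n\n"
    "Second paragraph about topic B.\n\n"
    "Third paragraph about topic C."
)
for chunk in demo.split_text(text):
    print(repr(chunk))  # each paragraph survives intact as one chunk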

Sentence-Based Chunking

Preserves complete sentences:

from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Using sentence-transformers tokenizer
splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256
)

# Or with NLTK
import nltk
nltk.download('punkt')

from langchain.text_splitter import NLTKTextSplitter

splitter = NLTKTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

chunks = splitter.split_text(document)

Best for: Long-form content, articles, documentation

Semantic Chunking

Splits based on meaning changes using embeddings:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split when embedding similarity drops
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Split at 95th percentile dissimilarity
)

chunks = splitter.split_text(document)

How it works:

  1. Embeds consecutive sentences
  2. Calculates similarity between adjacent sentences
  3. Splits where similarity drops significantly

For illustration, here is a hand-rolled version of the same idea. This is a minimal sketch: split_into_sentences, embed_model, and cosine_similarity are not from any particular library, so the sketch defines them using NLTK and sentence-transformers (one reasonable choice, not the only one):

# Custom semantic chunking (illustrative sketch)
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

nltk.download("punkt")
embed_model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def split_into_sentences(text: str) -> list[str]:
    return nltk.sent_tokenize(text)

def cosine_similarity(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    sentences = split_into_sentences(text)
    if not sentences:
        return []
    embeddings = embed_model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        if similarity < threshold:
            # Topic change detected - start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks
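
Usage mirrors the library version; as before, document is assumed to be the raw text string:

chunks = semantic_chunk(document, threshold=0.7)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} chars")

Note the direction of the threshold: lowering it merges more sentences into fewer, larger chunks, while raising it splits more aggressively.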

Method Comparison

Method      Retrieval Quality  Processing Speed  Consistency
----------  -----------------  ----------------  -----------
Fixed-size  Low                Fast              High
Recursive   Medium-High        Fast              High
Sentence    Medium             Medium            High
Semantic    High               Slow              Variable

Benchmarks

Studies show chunking strategy significantly impacts retrieval:

Strategy         Recall@10  Precision@10
---------------  ---------  ------------
Fixed (256)      0.65       0.42
Recursive (512)  0.78       0.58
Sentence         0.75       0.55
Semantic         0.82       0.64

Results vary by dataset. Always test on your specific data.
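
If you want to reproduce this kind of measurement on your own corpus, both metrics are simple to compute directly. A minimal sketch, assuming retrieved is a ranked list of chunk IDs and relevant is the set of ground-truth relevant IDs for a query:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k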

Choosing a Method

START
Simple, uniform documents?
  ├─ YES → Fixed-size (fast, predictable)
  └─ NO ↓
Structured text (headers, paragraphs)?
  ├─ YES → Recursive (respects structure)
  └─ NO ↓
Long-form narrative content?
  ├─ YES → Sentence-based
  └─ NO ↓
Complex topics, variable-length sections?
  ├─ YES → Semantic (best quality, slower)
  └─ NO ↓
Default → Recursive (best balance)

Implementation Tips

import re

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

class AdaptiveChunker:
    """Choose chunking method based on document type."""

    def __init__(self):
        self.recursive = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=50
        )
        self.semantic = SemanticChunker(
            embeddings=OpenAIEmbeddings(),
            breakpoint_threshold_type="percentile"
        )

    def chunk(self, document: str, doc_type: str) -> list[str]:
        if doc_type in ["faq", "qa"]:
            # Q&A pairs should stay together
            return self._chunk_qa(document)
        elif doc_type in ["technical", "research"]:
            # Technical docs benefit from semantic chunking
            return self.semantic.split_text(document)
        else:
            # Default to recursive for most content
            return self.recursive.split_text(document)

    def _chunk_qa(self, document: str) -> list[str]:
        # Keep Q&A pairs together
        qa_pattern = r'Q:.*?A:.*?(?=Q:|$)'
        return re.findall(qa_pattern, document, re.DOTALL)
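
Hypothetical usage, assuming the document type is known at ingestion time (for example, from metadata):

# doc_text: raw document string from your loader (assumed, not defined above)
chunker = AdaptiveChunker()
chunks = chunker.chunk(doc_text, doc_type="technical")  # routes to the semantic splitter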

Key Insight: The best chunking method depends on your content type. Start with recursive for general use, then experiment with semantic chunking for complex documents where quality matters most.

Next, let's explore hierarchical and contextual chunking for advanced retrieval scenarios.
