Lesson 9 of 24

Advanced Chunking Strategies

Chunking Methods Compared

4 min read

Chunking determines how documents are split into retrievable units. The right strategy significantly impacts retrieval quality.

Why Chunking Matters

Poor chunking leads to:

  • Split context: Related information in different chunks
  • Noise: Irrelevant content mixed with relevant
  • Lost meaning: Key concepts broken apart

Chunking Methods Overview

Method     | How It Works                       | Best For
Fixed-size | Split by character/token count     | Simple documents
Recursive  | Split by separators hierarchically | Structured text
Sentence   | Split on sentence boundaries       | Narrative content
Semantic   | Split by topic/meaning change      | Complex documents

Fixed-Size Chunking

Simplest approach—split by character count:

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator=""  # Character-level splitting
)

chunks = splitter.split_text(document)

Pros: Simple, predictable chunk sizes
Cons: Splits mid-sentence, ignores structure
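The mechanics are easy to see in plain Python. This is a simplified sketch of sliding-window splitting with overlap, not LangChain's exact implementation:

```python
def fixed_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Slide a window of `size` characters, stepping by `size - overlap`."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1200))
chunks = fixed_chunks(doc)
# Each chunk begins with the last `overlap` characters of the previous one,
# so context near a boundary appears in both chunks.
```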

Recursive Character Splitting

Most commonly used—respects document structure:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",    # Paragraphs first
        "\n",      # Then lines
        ". ",      # Then sentences
        ", ",      # Then clauses
        " ",       # Then words
        ""         # Finally characters
    ]
)

chunks = splitter.split_documents(documents)

How it works:

  1. Try to split on \n\n (paragraphs)
  2. If chunks still too large, try \n
  3. Continue down the separator list
  4. Ensures natural break points
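The steps above can be sketched as a recursive function. This is a simplified model of the algorithm to show the control flow, not the library's actual code:

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse on pieces that are still too large."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":  # Base case: hard character-level split
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # Merge small pieces back together
        else:
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # Piece still too large: try the next separator down
                chunks.extend(recursive_split(part, rest, chunk_size))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks

separators = ["\n\n", "\n", ". ", " ", ""]
text = "A" * 30 + "\n\n" + "B" * 30 + "\n\n" + "C" * 120
chunks = recursive_split(text, separators, 50)
# The two short paragraphs stay whole; the 120-character run
# falls through every separator and is split at the character level.
```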

Sentence-Based Chunking

Preserves complete sentences:

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

# Using sentence-transformers tokenizer
splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256
)

# Or with NLTK
import nltk
nltk.download('punkt')

from langchain_text_splitters import NLTKTextSplitter

splitter = NLTKTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

chunks = splitter.split_text(document)

Best for: Long-form content, articles, documentation
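If you want to avoid the NLTK dependency, sentence packing can be approximated with a regex splitter. A rough sketch (the regex misses edge cases like abbreviations):

```python
import re

def sentence_chunks(text: str, max_chars: int = 45) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)  # Current chunk is full; start a new one
            current = s
        else:
            current = f"{current} {s}" if current else s
    if current:
        chunks.append(current)
    return chunks

doc = "First sentence here. Second sentence here. Third sentence here."
chunks = sentence_chunks(doc)
# No sentence is ever cut in half, unlike fixed-size splitting.
```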

Semantic Chunking

Splits based on meaning changes using embeddings:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split when embedding similarity drops
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95  # Split at 95th percentile dissimilarity
)

chunks = splitter.split_text(document)

How it works:

  1. Embeds consecutive sentences
  2. Calculates similarity between adjacent sentences
  3. Splits where similarity drops significantly
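The percentile threshold operates on the distances between adjacent sentence embeddings. A pure-Python sketch of that step (assumed mechanics, not the library's exact code):

```python
def percentile_breakpoints(distances: list[float], pct: float = 95.0) -> list[int]:
    """Indices where the distance exceeds the pct-th percentile
    (computed with linear interpolation, like numpy's default)."""
    s = sorted(distances)
    k = (len(s) - 1) * pct / 100
    f = int(k)
    c = min(f + 1, len(s) - 1)
    threshold = s[f] + (s[c] - s[f]) * (k - f)
    return [i for i, d in enumerate(distances) if d > threshold]

# A distance spike at index 3 marks a likely topic change
breakpoints = percentile_breakpoints([0.10, 0.20, 0.10, 0.90, 0.15])
```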

# Custom semantic chunking (sketch; assumes sentence-transformers is installed)
import re

import numpy as np
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")

def split_into_sentences(text: str) -> list[str]:
    # Simple regex splitter; swap in NLTK for production use
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    sentences = split_into_sentences(text)
    embeddings = embed_model.encode(sentences)

    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])

        if similarity < threshold:
            # Topic change detected - start new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(" ".join(current_chunk))
    return chunks

Method Comparison

Method     | Retrieval Quality | Processing Speed | Consistency
Fixed-size | Low               | Fast             | High
Recursive  | Medium-High       | Fast             | High
Sentence   | Medium            | Medium           | High
Semantic   | High              | Slow             | Variable

Benchmarks

Chunking strategy can significantly affect retrieval metrics. The table below shows illustrative results:

Strategy      | Recall@10 | Precision@10
Fixed 256     | 0.65      | 0.42
Recursive 512 | 0.78      | 0.58
Sentence      | 0.75      | 0.55
Semantic      | 0.82      | 0.64

These are illustrative example results; actual performance varies significantly by dataset and implementation. Always test on your specific data.

Choosing a Method

START
├─ Simple, uniform documents?                → Fixed-size (fast, predictable)
├─ Structured text (headers, paragraphs)?    → Recursive (respects structure)
├─ Long-form narrative content?              → Sentence-based
├─ Complex topics, variable-length sections? → Semantic (best quality, slower)
└─ None of the above                         → Recursive (best balance)

Implementation Tips

import re

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

class AdaptiveChunker:
    """Choose chunking method based on document type."""

    def __init__(self):
        self.recursive = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=50
        )
        self.semantic = SemanticChunker(
            embeddings=OpenAIEmbeddings(),
            breakpoint_threshold_type="percentile"
        )

    def chunk(self, document: str, doc_type: str) -> list[str]:
        if doc_type in ["faq", "qa"]:
            # Q&A pairs should stay together
            return self._chunk_qa(document)
        elif doc_type in ["technical", "research"]:
            # Technical docs benefit from semantic chunking
            return self.semantic.split_text(document)
        else:
            # Default to recursive for most content
            return self.recursive.split_text(document)

    def _chunk_qa(self, document: str) -> list[str]:
        # Keep Q&A pairs together
        qa_pattern = r'Q:.*?A:.*?(?=Q:|$)'
        return re.findall(qa_pattern, document, re.DOTALL)
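The Q&A regex can be exercised on a small hypothetical document to confirm that each question stays attached to its answer:

```python
import re

doc = (
    "Q: What is chunking?\n"
    "A: Splitting documents into retrievable units.\n"
    "Q: Why overlap chunks?\n"
    "A: To preserve context across boundaries."
)
# Lazy matching plus the lookahead keeps each Q/A pair in one chunk
pairs = re.findall(r"Q:.*?A:.*?(?=Q:|$)", doc, re.DOTALL)
```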

Key Insight: The best chunking method depends on your content type. Start with recursive for general use, then experiment with semantic chunking for complex documents where quality matters most.

Next, let's explore hierarchical and contextual chunking for advanced retrieval scenarios.
