# Advanced Chunking Strategies

## Chunking Methods Compared
Chunking determines how documents are split into retrievable units. The right strategy significantly impacts retrieval quality.
## Why Chunking Matters
Poor chunking leads to:
- Split context: Related information in different chunks
- Noise: Irrelevant content mixed with relevant
- Lost meaning: Key concepts broken apart
## Chunking Methods Overview
| Method | How It Works | Best For |
|---|---|---|
| Fixed-size | Split by character/token count | Simple documents |
| Recursive | Split by separators hierarchically | Structured text |
| Sentence | Split on sentence boundaries | Narrative content |
| Semantic | Split by topic/meaning change | Complex documents |
## Fixed-Size Chunking

The simplest approach: split by character count.

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="",  # character-level splitting
)

chunks = splitter.split_text(document)
```

- **Pros:** simple, predictable chunk sizes
- **Cons:** splits mid-sentence, ignores document structure
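The mechanics are easy to reproduce by hand. A minimal dependency-free sketch of fixed-size splitting with overlap (function name is illustrative, not a library API):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    each sharing its first `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(text, chunk_size=500, overlap=50)
# Each chunk starts 450 characters after the previous one
```

The overlap means a sentence cut at one chunk boundary still appears whole at the start of the next chunk, which softens (but does not remove) the mid-sentence splitting problem.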
## Recursive Character Splitting

The most commonly used method, because it respects document structure:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",  # paragraphs first
        "\n",    # then lines
        ". ",    # then sentences
        ", ",    # then clauses
        " ",     # then words
        "",      # finally characters
    ],
)

chunks = splitter.split_documents(documents)
```
How it works:
1. Try to split on `\n\n` (paragraphs).
2. If chunks are still too large, try `\n`.
3. Continue down the separator list until every chunk fits.

This ensures splits land on natural break points whenever possible.
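The steps above can be sketched in a few lines of plain Python. This is a simplified version of what `RecursiveCharacterTextSplitter` does; it omits chunk overlap and the merging of small adjacent pieces:

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator that helps; recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Out of separators: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece still too large: move to the next separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return [c for c in chunks if c]

chunks = recursive_split(
    "First paragraph.\n\nSecond, much longer paragraph text.",
    chunk_size=30,
    separators=["\n\n", "\n", ". ", " "],
)
```

The first paragraph fits as-is; the second is too long, so the sketch falls through to word-level splits, which is why the real splitter also merges small pieces back up toward `chunk_size`.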
## Sentence-Based Chunking

Preserves complete sentences:

```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Using the sentence-transformers tokenizer
splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256,
)
```

Or with NLTK's sentence tokenizer:

```python
import nltk
nltk.download("punkt")

from langchain.text_splitter import NLTKTextSplitter

splitter = NLTKTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

chunks = splitter.split_text(document)
```

**Best for:** long-form content, articles, documentation.
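The underlying idea is simple enough to write without a library: split on sentence boundaries, then greedily pack whole sentences into chunks under a size budget. A naive regex stands in here for a real sentence tokenizer:

```python
import re


def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after ., !, ? followed by whitespace (naive; use NLTK in production)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = sentence_chunks(
    "One short sentence. Another one here. A third sentence now.",
    max_chars=40,
)
```

No sentence is ever cut in half; the trade-off is that chunk sizes vary, which is why the comparison table below rates consistency rather than just quality.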
## Semantic Chunking

Splits based on changes in meaning, detected with embeddings:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split where embedding similarity between adjacent sentences drops
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split at the 95th percentile of dissimilarity
)

chunks = splitter.split_text(document)
```
How it works:
- Embeds consecutive sentences
- Calculates similarity between adjacent sentences
- Splits where similarity drops significantly

A hand-rolled version of the same idea (here `embed_model` stands in for any sentence-embedding model, e.g. a `SentenceTransformer`):

```python
# Custom semantic chunking
import re

import numpy as np


def split_into_sentences(text: str) -> list[str]:
    # Naive regex sentence splitter; use NLTK or spaCy in production
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def cosine_similarity(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    sentences = split_into_sentences(text)
    embeddings = embed_model.encode(sentences)  # any sentence-embedding model
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            # Topic change detected: start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
## Method Comparison
| Method | Retrieval Quality | Processing Speed | Consistency |
|---|---|---|---|
| Fixed-size | Low | Fast | High |
| Recursive | Medium-High | Fast | High |
| Sentence | Medium | Medium | High |
| Semantic | High | Slow | Variable |
## Benchmarks

Published evaluations show that chunking strategy significantly impacts retrieval quality; representative figures:
| Strategy | Recall@10 | Precision@10 |
|---|---|---|
| Fixed 256 | 0.65 | 0.42 |
| Recursive 512 | 0.78 | 0.58 |
| Sentence | 0.75 | 0.55 |
| Semantic | 0.82 | 0.64 |
Results vary by dataset. Always test on your specific data.
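Recall@k and precision@k here have their standard definitions, and are easy to compute when comparing strategies on your own data. A minimal sketch (chunk IDs are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


retrieved = ["c3", "c7", "c1", "c9", "c2"]  # ranked retriever output
relevant = {"c1", "c2", "c4"}               # ground-truth relevant chunks
r = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant found
p = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 retrieved are relevant
```

Running both metrics over a fixed query set, once per chunking strategy, reproduces the kind of comparison shown in the table above.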
## Choosing a Method

```text
START
  │
  ▼
Simple, uniform documents?
  ├─ YES → Fixed-size (fast, predictable)
  │
  ▼
Structured text (headers, paragraphs)?
  ├─ YES → Recursive (respects structure)
  │
  ▼
Long-form narrative content?
  ├─ YES → Sentence-based
  │
  ▼
Complex topics, variable-length sections?
  ├─ YES → Semantic (best quality, slower)
  │
  ▼
Default → Recursive (best balance)
```
## Implementation Tips

```python
import re

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings


class AdaptiveChunker:
    """Choose a chunking method based on document type."""

    def __init__(self):
        self.recursive = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=50
        )
        self.semantic = SemanticChunker(
            embeddings=OpenAIEmbeddings(),
            breakpoint_threshold_type="percentile",
        )

    def chunk(self, document: str, doc_type: str) -> list[str]:
        if doc_type in ["faq", "qa"]:
            # Q&A pairs should stay together
            return self._chunk_qa(document)
        elif doc_type in ["technical", "research"]:
            # Technical docs benefit from semantic chunking
            return self.semantic.split_text(document)
        else:
            # Default to recursive for most content
            return self.recursive.split_text(document)

    def _chunk_qa(self, document: str) -> list[str]:
        # Keep each Q&A pair in a single chunk
        qa_pattern = r"Q:.*?A:.*?(?=Q:|$)"
        return re.findall(qa_pattern, document, re.DOTALL)
```
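The `_chunk_qa` regex is worth sanity-checking on its own: with `re.DOTALL`, each match runs from one `Q:` through its answer, stopping just before the next `Q:` (or at the end of the text). The sample document below is illustrative:

```python
import re

document = """Q: What is chunking?
A: Splitting documents into retrievable units.
Q: Why does it matter?
A: It determines what the retriever can return."""

qa_pattern = r"Q:.*?A:.*?(?=Q:|$)"
pairs = re.findall(qa_pattern, document, re.DOTALL)
# One match per Q&A pair, so a question is never separated from its answer
```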
**Key insight:** The best chunking method depends on your content type. Start with recursive splitting for general use, then experiment with semantic chunking for complex documents where quality matters most.

Next, let's explore hierarchical and contextual chunking for advanced retrieval scenarios.