# Advanced Chunking Strategies

## Chunking Methods Compared
Chunking determines how documents are split into retrievable units. The right strategy significantly impacts retrieval quality.
## Why Chunking Matters
Poor chunking leads to:
- Split context: Related information in different chunks
- Noise: Irrelevant content mixed with relevant
- Lost meaning: Key concepts broken apart
## Chunking Methods Overview
| Method | How It Works | Best For |
|---|---|---|
| Fixed-size | Split by character/token count | Simple documents |
| Recursive | Split by separators hierarchically | Structured text |
| Sentence | Split on sentence boundaries | Narrative content |
| Semantic | Split by topic/meaning change | Complex documents |
## Fixed-Size Chunking

The simplest approach: split by character count.

```python
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="",  # character-level splitting
)

chunks = splitter.split_text(document)
```

- **Pros:** simple, predictable chunk sizes
- **Cons:** splits mid-sentence, ignores document structure
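The mechanics are easy to reproduce by hand. A minimal dependency-free sketch of fixed-size splitting with overlap (function name is illustrative, not a library API):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    each sharing its first `overlap` characters with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(text, chunk_size=500, overlap=50)
# Each chunk starts 450 characters after the previous one
```

The overlap means a sentence cut at one chunk boundary still appears whole at the start of the next chunk, which softens (but does not remove) the mid-sentence splitting problem.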
## Recursive Character Splitting

The most commonly used method, because it respects document structure:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=[
        "\n\n",  # paragraphs first
        "\n",    # then lines
        ". ",    # then sentences
        ", ",    # then clauses
        " ",     # then words
        "",      # finally characters
    ],
)

chunks = splitter.split_documents(documents)
```
How it works:
1. Try to split on `\n\n` (paragraphs).
2. If chunks are still too large, try `\n`.
3. Continue down the separator list until every chunk fits.

This ensures splits land on natural break points whenever possible.
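The steps above can be sketched in a few lines of plain Python. This is a simplified version of what `RecursiveCharacterTextSplitter` does; it omits chunk overlap and the merging of small adjacent pieces:

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator that helps; recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Out of separators: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep else list(text)
    chunks = []
    for piece in pieces:
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece still too large: move to the next separator
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return [c for c in chunks if c]

chunks = recursive_split(
    "First paragraph.\n\nSecond, much longer paragraph text.",
    chunk_size=30,
    separators=["\n\n", "\n", ". ", " "],
)
```

The first paragraph fits as-is; the second is too long, so the sketch falls through to word-level splits, which is why the real splitter also merges small pieces back up toward `chunk_size`.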
## Sentence-Based Chunking

Preserves complete sentences:

```python
from langchain.text_splitter import SentenceTransformersTokenTextSplitter

# Using the sentence-transformers tokenizer
splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256,
)
```

Or with NLTK's sentence tokenizer:

```python
import nltk
nltk.download("punkt")

from langchain.text_splitter import NLTKTextSplitter

splitter = NLTKTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)

chunks = splitter.split_text(document)
```

**Best for:** long-form content, articles, documentation.
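The underlying idea is simple enough to write without a library: split on sentence boundaries, then greedily pack whole sentences into chunks under a size budget. A naive regex stands in here for a real sentence tokenizer:

```python
import re


def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    # Split after ., !, ? followed by whitespace (naive; use NLTK in production)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = sentence_chunks(
    "One short sentence. Another one here. A third sentence now.",
    max_chars=40,
)
```

No sentence is ever cut in half; the trade-off is that chunk sizes vary, which is why the comparison table below rates consistency rather than just quality.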
## Semantic Chunking

Splits based on changes in meaning, detected with embeddings:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Split where embedding similarity between adjacent sentences drops
splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,  # split at the 95th percentile of dissimilarity
)

chunks = splitter.split_text(document)
```
How it works:
- Embeds consecutive sentences
- Calculates similarity between adjacent sentences
- Splits where similarity drops significantly

A hand-rolled version of the same idea (here `embed_model` stands in for any sentence-embedding model, e.g. a `SentenceTransformer`):

```python
# Custom semantic chunking
import re

import numpy as np


def split_into_sentences(text: str) -> list[str]:
    # Naive regex sentence splitter; use NLTK or spaCy in production
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def cosine_similarity(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    sentences = split_into_sentences(text)
    embeddings = embed_model.encode(sentences)  # any sentence-embedding model
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            # Topic change detected: start a new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    chunks.append(" ".join(current_chunk))
    return chunks
```
## Method Comparison
| Method | Retrieval Quality | Processing Speed | Consistency |
|---|---|---|---|
| Fixed-size | Low | Fast | High |
| Recursive | Medium-High | Fast | High |
| Sentence | Medium | Medium | High |
| Semantic | High | Slow | Variable |
## Benchmarks

Published evaluations show that chunking strategy significantly impacts retrieval quality; representative figures:
| Strategy | Recall@10 | Precision@10 |
|---|---|---|
| Fixed 256 | 0.65 | 0.42 |
| Recursive 512 | 0.78 | 0.58 |
| Sentence | 0.75 | 0.55 |
| Semantic | 0.82 | 0.64 |
Results vary by dataset. Always test on your specific data.
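Recall@k and precision@k here have their standard definitions, and are easy to compute when comparing strategies on your own data. A minimal sketch (chunk IDs are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


retrieved = ["c3", "c7", "c1", "c9", "c2"]  # ranked retriever output
relevant = {"c1", "c2", "c4"}               # ground-truth relevant chunks
r = recall_at_k(retrieved, relevant, k=5)     # 2 of 3 relevant found
p = precision_at_k(retrieved, relevant, k=5)  # 2 of 5 retrieved are relevant
```

Running both metrics over a fixed query set, once per chunking strategy, reproduces the kind of comparison shown in the table above.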
## Choosing a Method

```text
START
  │
  ▼
Simple, uniform documents?
  ├─ YES → Fixed-size (fast, predictable)
  │
  ▼
Structured text (headers, paragraphs)?
  ├─ YES → Recursive (respects structure)
  │
  ▼
Long-form narrative content?
  ├─ YES → Sentence-based
  │
  ▼
Complex topics, variable-length sections?
  ├─ YES → Semantic (best quality, slower)
  │
  ▼
Default → Recursive (best balance)
```
## Implementation Tips

```python
import re

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings


class AdaptiveChunker:
    """Choose a chunking method based on document type."""

    def __init__(self):
        self.recursive = RecursiveCharacterTextSplitter(
            chunk_size=512, chunk_overlap=50
        )
        self.semantic = SemanticChunker(
            embeddings=OpenAIEmbeddings(),
            breakpoint_threshold_type="percentile",
        )

    def chunk(self, document: str, doc_type: str) -> list[str]:
        if doc_type in ["faq", "qa"]:
            # Q&A pairs should stay together
            return self._chunk_qa(document)
        elif doc_type in ["technical", "research"]:
            # Technical docs benefit from semantic chunking
            return self.semantic.split_text(document)
        else:
            # Default to recursive for most content
            return self.recursive.split_text(document)

    def _chunk_qa(self, document: str) -> list[str]:
        # Keep each Q&A pair in a single chunk
        qa_pattern = r"Q:.*?A:.*?(?=Q:|$)"
        return re.findall(qa_pattern, document, re.DOTALL)
```
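The `_chunk_qa` regex is worth sanity-checking on its own: with `re.DOTALL`, each match runs from one `Q:` through its answer, stopping just before the next `Q:` (or at the end of the text). The sample document below is illustrative:

```python
import re

document = """Q: What is chunking?
A: Splitting documents into retrievable units.
Q: Why does it matter?
A: It determines what the retriever can return."""

qa_pattern = r"Q:.*?A:.*?(?=Q:|$)"
pairs = re.findall(qa_pattern, document, re.DOTALL)
# One match per Q&A pair, so a question is never separated from its answer
```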
**Key insight:** The best chunking method depends on your content type. Start with recursive splitting for general use, then experiment with semantic chunking for complex documents where quality matters most.

Next, let's explore hierarchical and contextual chunking for advanced retrieval scenarios.