Token-Aware Text Chunking for RAG in TypeScript (2026)

June 28, 2026

#text chunking #rag #typescript #langchain #js-tiktoken #embeddings #llm #ai-ml

To chunk text for RAG in TypeScript, split documents by token count rather than by characters: use a RecursiveCharacterTextSplitter with a js-tiktoken length function so every chunk fits your embedding model's budget, add ten to twenty percent overlap, and respect Markdown or code structure before falling back to smaller separators.

TL;DR

Chunking is the step that decides RAG quality, and it is easy to get the unit wrong: splitting on characters when embedding models think in tokens. This guide builds a token-aware splitter in TypeScript with @langchain/textsplitters 1.0.1 and js-tiktoken 1.0.21. You will count tokens with the same encoding your embedding model uses, split a document into right-sized chunks with overlap, write a tiny splitter from scratch to understand the algorithm, split Markdown on its real structure, attach source metadata for citations, and measure the token-size distribution of your chunks. Every code block was compiled with tsc and run with tsx before publishing; the numbers you see are real output. No API key and no model download are required — chunking happens entirely on your machine.

What you'll learn

Why budgeting chunk size in characters silently degrades retrieval
How to count tokens with the same encoding your embedding model uses (cl100k_base)
How to split by tokens with RecursiveCharacterTextSplitter and a length function
How the recursive algorithm works, by building a token-aware splitter from scratch
How to choose chunk size and overlap for your query type
How to split Markdown and code on structure instead of blind character windows
How to attach source metadata so a retrieved chunk can cite where it came from
How to measure your chunk-size distribution and catch over-budget chunks before you embed

Prerequisites

Node.js 20+ (tested on Node 22.22.3). Both @langchain/textsplitters and @langchain/core declare engines.node >= 20.¹²
Comfort with TypeScript and async/await.
An editor and a terminal. You do not need an OpenAI key, a GPU, or a vector database — this guide stops at the point where chunks are ready to embed.

Create a project and pin every dependency so your build is reproducible:

npm init -y
npm pkg set type=module
npm install @langchain/textsplitters@1.0.1 @langchain/core@1.2.1 js-tiktoken@1.0.21
npm install -D typescript@6.0.3 tsx@4.22.4 @types/node@26.0.1

@langchain/core is a peer dependency of @langchain/textsplitters — the splitter returns @langchain/core Document objects from createDocuments, so install it explicitly rather than relying on your package manager to hoist it.¹ Add a tsconfig.json:

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "preserve",
    "moduleResolution": "bundler",
    "strict": true,
    "skipLibCheck": true,
    "noEmit": true,
    "types": ["node"]
  },
  "include": ["src"]
}

Why character counts quietly break RAG retrieval

A RAG system retrieves passages from your documents and feeds them to the model as context, so the passages you store are the ceiling on answer quality.³ Chunking produces those passages, and a common, costly mistake is measuring chunk size in characters.

Embedding models read tokens, not characters. A token is roughly four characters of English text, but the ratio is a loose rule of thumb, not a constant — code, punctuation, and non-English text tokenize very differently.⁴ The sample document used throughout this guide is 2,736 characters but only 511 tokens, a ratio of about 5.4 characters per token rather than 4. If you set a chunk size of "256" and quietly mean 256 characters, you are budgeting roughly 50 tokens — about a fifth of what you intended — and your retrieval will be full of cramped, context-starved fragments.
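
The arithmetic behind that estimate can be checked in a few lines. This is a small sketch that uses only the sample document's figures quoted above; the helper name charsToTokenBudget is illustrative, and the ratio it derives holds for this one document only, not for text in general.

```typescript
// Figures quoted above for the sample document.
const sampleChars = 2736;
const sampleTokens = 511;

// Observed characters-per-token ratio for this one document (not a constant).
const charsPerToken = sampleChars / sampleTokens;

// Hypothetical helper: estimate how many tokens a character budget really buys.
function charsToTokenBudget(charBudget: number, ratio: number): number {
  return Math.round(charBudget / ratio);
}

const effectiveTokens = charsToTokenBudget(256, charsPerToken);
console.log(charsPerToken.toFixed(1)); // "5.4"
console.log(effectiveTokens); // 48
```

A "256" that silently means characters buys about 48 tokens on this document, which is where the "roughly 50 tokens" figure comes from.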

The fix is to make every measurement a token measurement, using the same tokenizer your embedding model uses. The rest of this guide is built on that one decision.

Step 1 — Count tokens the way your embedding model does

js-tiktoken is a pure-JavaScript port of OpenAI's tiktoken. It ships the byte-pair-encoding tables inside the package, so it counts tokens entirely offline — no network call and no model download.⁵ The important detail is choosing the right encoding. Different model families use different tokenizers, and the token boundaries differ between them:

Model	Encoding (`js-tiktoken`)
`text-embedding-3-small`	`cl100k_base`
`text-embedding-3-large`	`cl100k_base`
`text-embedding-ada-002`	`cl100k_base`
`gpt-4o`, `gpt-4o-mini`	`o200k_base`
`gpt-4`	`cl100k_base`

The mapping above was produced by calling getEncodingNameForModel for each model. The practical takeaway: OpenAI's text-embedding-3-small and text-embedding-3-large both use cl100k_base (as does the older text-embedding-ada-002), so that is the encoding you should count with when you budget chunks for them.⁵⁶ Counting with o200k_base (the GPT-4o chat encoding) would give you slightly different boundaries and a misleading budget.

Rather than hard-code an encoding name, let the model name pick it. encodingForModel resolves the correct encoding for you. Create src/tokens.ts:

import { encodingForModel, type TiktokenModel } from "js-tiktoken";

// Pick the encoding that matches YOUR embedding model.
// The text-embedding-3-* models all use cl100k_base.
export function tokenCounter(model: TiktokenModel = "text-embedding-3-small") {
  const enc = encodingForModel(model);
  return (text: string): number => enc.encode(text).length;
}

tokenCounter returns a function that turns any string into its token count. That returned function is exactly the shape a text splitter wants for its length function, which is what makes the next step work.

Step 2 — Split by tokens, not characters

RecursiveCharacterTextSplitter is the workhorse splitter. It walks an ordered list of separators — by default ["\n\n", "\n", " ", ""] — trying to split on the largest natural boundary first (paragraph breaks), then progressively smaller ones (lines, spaces, and finally individual characters) until each piece is under the size limit. It then merges adjacent pieces back up to the limit and adds the configured overlap.⁷

The size limit is measured by the splitter's lengthFunction. Its default is text.length — characters. Pass the token counter from Step 1 and the same chunkSize number now means tokens. Create src/chunk.ts:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { tokenCounter } from "./tokens";

const countTokens = tokenCounter("text-embedding-3-small");

export function makeSplitter(chunkSize = 256, chunkOverlap = 32) {
  return new RecursiveCharacterTextSplitter({
    chunkSize,
    chunkOverlap,
    lengthFunction: countTokens, // <-- chunkSize now counts TOKENS
  });
}

To see why this matters, run the same document through the splitter twice — once with the default character length function and once token-aware. With the 511-token sample document and chunkSize: 256, the results are not close:

Character mode (256 means characters): 19 chunks, with token sizes [8, 50, 33, 7, 46, 47, 17, 9, 49, 47, 19, 7, 54, 46, 9, 7, 47, 44, 7].
Token mode (256 means tokens): 3 chunks, with token sizes [202, 228, 97].

Same number, same document, wildly different outcome. Character mode shatters the document into nineteen fragments averaging about 30 tokens each — a fraction of a useful embedding chunk — because 256 characters is only about 50 tokens, and the short heading lines split off into tiny pieces of their own. Token mode produces three chunks that actually fill the budget you set. The fragments from character mode would each embed into a vague vector that matches everything and answers nothing.

Step 3 — How the splitter works, from scratch

Reaching for a library is fine, but you should understand the algorithm it runs, because the edge cases are where chunking goes wrong. The core idea is greedy packing: accumulate small units of text until adding the next one would exceed the budget, emit a chunk, then start the next chunk with an overlapping tail. Create src/scratch.ts:

export interface SplitOptions {
  chunkSize: number;
  chunkOverlap: number;
  countTokens: (text: string) => number;
}

export function splitByTokens(text: string, opts: SplitOptions): string[] {
  const { chunkSize, chunkOverlap, countTokens } = opts;
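  // Break into atomic units: paragraphs, then sentences. Never split a unit further.
  // Note: split() consumes the blank lines between paragraphs, so the join("")
  // calls below rejoin paragraphs without a separating break.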
  const units = text
    .split(/\n\n+/)
    .flatMap((p) => p.match(/[^.!?]+[.!?]*\s*/g) ?? [p]);

  const chunks: string[] = [];
  let cur: string[] = [];

  for (const unit of units) {
    // Peek BEFORE committing: if this unit would overflow, close the chunk first.
    if (cur.length && countTokens([...cur, unit].join("")) > chunkSize) {
      chunks.push(cur.join("").trim());
      // Carry an overlapping tail into the next chunk.
      const tail: string[] = [];
      while (
        cur.length &&
        countTokens([cur[cur.length - 1], ...tail].join("")) <= chunkOverlap
      ) {
        tail.unshift(cur.pop() as string);
      }
      cur = tail;
    }
    cur.push(unit);
  }
  if (cur.join("").trim()) chunks.push(cur.join("").trim());
  return chunks;
}

The detail that matters is the word peek. A naive version pushes the unit first and checks the size afterward, which lets a chunk overshoot the budget by a whole sentence: in testing, the naive version produced chunks of 270 and 272 tokens against a 256 budget. Checking before committing keeps each chunk at or under the limit, as long as no single unit, together with the overlap tail carried in front of it, exceeds the budget on its own. Run on the sample document with 256/32, this version returns 3 chunks sized [242, 239, 51], none over budget.

There is one case even the peek version cannot fix: a single unit that is itself larger than the budget — a 400-token sentence, or a long line of minified code with no spaces. A greedy packer can only emit it oversized. That is precisely why RecursiveCharacterTextSplitter is recursive: when a unit is too big, it drops to the next separator and splits that unit further, recursing all the way down to individual characters if it has to. For production, prefer the library; the from-scratch version is here so the library stops being a black box.
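
The recursive fallback described above can be sketched in a few lines. This is a simplified illustration of the idea, not the library's actual implementation: it leaves out the merge and overlap passes, discards the separators it splits on, and uses character length as a stand-in length function so it runs without dependencies. The name recursiveSplit is an assumption made for this example.

```typescript
type LengthFn = (text: string) => number;

// Simplified sketch: split on the first separator, then recurse with the
// remaining separators only for pieces that are still over the limit.
function recursiveSplit(
  text: string,
  limit: number,
  length: LengthFn,
  separators: string[] = ["\n\n", "\n", " ", ""]
): string[] {
  if (length(text) <= limit) return text ? [text] : [];
  if (separators.length === 0) return [text]; // nothing left to try
  const [sep, ...rest] = separators;
  // An empty separator means "split into individual characters".
  const pieces = sep === "" ? Array.from(text) : text.split(sep);
  return pieces.flatMap((piece) => recursiveSplit(piece, limit, length, rest));
}

// Stand-in length function (characters); swap in a token counter in practice.
const byChars: LengthFn = (t) => t.length;

const pieces = recursiveSplit("short line\naaaa bbbb cccc dddd", 10, byChars);
console.log(pieces); // [ 'short line', 'aaaa', 'bbbb', 'cccc', 'dddd' ]
```

The first line fits the limit and is kept whole; only the oversized second line is pushed down to the space separator. A real splitter would then merge the small pieces back up toward the limit, which is the greedy packing step shown earlier.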

Step 4 — Choose chunk size and overlap

There is no universal best size, but the trade-offs are well understood, and you can reason about them instead of guessing.⁸ Smaller chunks (roughly 128–256 tokens) make each embedding sharply focused, which helps fact-style lookups where the answer is a sentence or two. Larger chunks (roughly 256–512 tokens) preserve more surrounding context, which helps analytical questions that need a paragraph of reasoning. Most pipelines land somewhere in the 256–512 range as a starting point and tune from there using a retrieval evaluation.

For overlap, a common starting point is ten to twenty percent of the chunk size — enough to carry a sentence across a boundary without storing near-duplicate text.⁸ The 256/32 default used in this guide is a 12.5 percent overlap, comfortably inside that range. Overlap is not free: at twenty percent overlap, about a fifth of the text you store is duplicated, which inflates your vector count and your embedding bill. Too little overlap and a fact that straddles a boundary becomes unretrievable; too much and the same query keeps returning adjacent near-duplicate chunks.
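
To reason about the cost of overlap before running anything, a back-of-the-envelope estimate helps. The sketch below is an idealized model that assumes every chunk is packed exactly to chunkSize, which real splitters rarely achieve because they break on separators, so treat storedTokens as an upper bound. The function name estimateChunks is hypothetical.

```typescript
// Idealized estimate: assumes chunks are packed exactly to chunkSize.
function estimateChunks(totalTokens: number, chunkSize: number, chunkOverlap: number) {
  if (chunkOverlap >= chunkSize) {
    throw new RangeError("chunkOverlap must be smaller than chunkSize");
  }
  const stride = chunkSize - chunkOverlap; // new tokens each additional chunk adds
  const count =
    totalTokens <= chunkSize ? 1 : Math.ceil((totalTokens - chunkOverlap) / stride);
  const overlapRatio = chunkOverlap / chunkSize;
  // Tokens stored across all chunks, counting each overlapped region twice.
  const storedTokens = totalTokens + (count - 1) * chunkOverlap;
  return { count, overlapRatio, storedTokens };
}

const estimate = estimateChunks(511, 256, 32);
console.log(estimate); // { count: 3, overlapRatio: 0.125, storedTokens: 575 }
```

For the 511-token sample at 256/32 the model predicts three chunks, matching the splitter's output, and at most 575 stored tokens.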

Overlap is easy to see in the real output. With 256/32, the end of chunk 0 and the start of chunk 1 share the same heading:

chunk 0 (202 tokens) ... ends: "...preserving enough context to stand alone.

## Tokens are the real unit, not characters"

chunk 1 (228 tokens) starts: "## Tokens are the real unit, not characters

Embedding models read tokens, not characters..."

The ## Tokens are the real unit heading appears at the tail of one chunk and the head of the next. If a user asks about tokens, either chunk can be retrieved and still carries the heading that frames the section.
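
If you want to check overlap programmatically rather than by eye, one option is to look for the longest suffix of one chunk that is also a prefix of the next. The helper below is a hypothetical sketch (sharedOverlap is not part of any library), and the two strings are abbreviated stand-ins for the chunk tail and head shown above.

```typescript
// Returns the longest suffix of `prev` that is also a prefix of `next`.
function sharedOverlap(prev: string, next: string): string {
  const max = Math.min(prev.length, next.length);
  for (let len = max; len > 0; len--) {
    const candidate = next.slice(0, len);
    if (prev.endsWith(candidate)) return candidate;
  }
  return "";
}

const heading = "## Tokens are the real unit, not characters";
const chunk0Tail = "...preserving enough context to stand alone.\n\n" + heading;
const chunk1Head = heading + "\n\nEmbedding models read tokens, not characters...";

const shared = sharedOverlap(chunk0Tail, chunk1Head);
console.log(shared === heading); // true
```

The scan is quadratic in the worst case, which is fine for a spot check on a handful of chunks but not something to run over a whole corpus.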

Step 5 — Respect document structure: Markdown and code

Splitting on raw \n\n/\n/space separators treats a document as a flat stream of text. That strands a heading from its section and slices code blocks in half. RecursiveCharacterTextSplitter.fromLanguage swaps in a separator list tuned to a specific format, so the splitter breaks on structural boundaries first.⁷ For Markdown, the separator list leads with headings and block markers:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { tokenCounter } from "./tokens";

const countTokens = tokenCounter("text-embedding-3-small");

const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 256,
  chunkOverlap: 0,
  lengthFunction: countTokens,
});

The Markdown separators, in order, are headings level two through six (\n## … \n###### ), the end of a fenced code block, horizontal rules (***, ---, ___), then the usual paragraph, line, space, and character fallbacks. (Note that level-one # is not in the list — the separators start at level two, since a document usually has a single title.) Run on the sample document with 256/0, this produces 3 chunks sized [193, 221, 97], each one aligned to a section heading instead of an arbitrary character offset.

The same idea applies to source code: fromLanguage supports many languages, and for code it splits on function and class boundaries before lines, so a function rarely gets cut in half. When you ingest mixed content, route each file to the splitter that matches its type — Markdown for docs, the matching language for code — rather than running everything through one generic splitter.
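
Routing can be as simple as a lookup from file extension to language name. The sketch below is dependency-free and only picks the name; the language strings are assumed to match the values fromLanguage accepts in your installed version, so verify them against the package before relying on this table. Both LANGUAGE_BY_EXTENSION and languageFor are hypothetical names.

```typescript
// Assumed language names; check them against your @langchain/textsplitters version.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ".md": "markdown",
  ".js": "js",
  ".ts": "js", // assumption: reuse the JavaScript separators for TypeScript
  ".py": "python",
  ".go": "go",
  ".rs": "rust",
};

// Hypothetical helper: returns a language name, or undefined to signal
// "fall back to the generic recursive splitter".
function languageFor(filename: string): string | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot < 0) return undefined;
  return LANGUAGE_BY_EXTENSION[filename.slice(dot).toLowerCase()];
}

console.log(languageFor("README.md")); // "markdown"
console.log(languageFor("notes.txt")); // undefined
```

When languageFor returns undefined, use the generic splitter from Step 2; otherwise pass the name to fromLanguage with the same chunkSize, chunkOverlap, and lengthFunction.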

Step 6 — Attach metadata so you can cite sources

A chunk is far more useful when it remembers where it came from. createDocuments takes an array of texts and a parallel array of metadata, and returns Document objects that carry both the chunk text and its metadata — including the line range the chunk spans in the original document. Extend src/chunk.ts:

export async function chunkText(text: string, source: string) {
  const splitter = makeSplitter();
  const docs = await splitter.createDocuments([text], [{ source }]);
  return docs.map((d) => ({
    text: d.pageContent,
    tokens: countTokens(d.pageContent),
    metadata: d.metadata,
  }));
}

For the first chunk, metadata looks like this:

{ "source": "rag-intro.md", "loc": { "lines": { "from": 1, "to": 9 } } }

The source field is whatever you passed in; the loc.lines range is added by the splitter. Carry this metadata all the way into your vector store, and when a chunk is retrieved you can show the user exactly which file and which lines grounded the answer — the difference between "the docs say X" and "the docs say X (rag-intro.md, lines 1–9)." That provenance is also what lets you debug bad answers back to the passage that caused them.
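
Turning that metadata into a citation string is a one-function job. The sketch below assumes the metadata shape shown above; the ChunkMetadata interface and formatCitation helper are illustrative names, not part of the library.

```typescript
// Shape of the metadata shown above; `source` is whatever the caller passed in.
interface ChunkMetadata {
  source: string;
  loc: { lines: { from: number; to: number } };
}

// Hypothetical helper: format a human-readable citation for a retrieved chunk.
function formatCitation(meta: ChunkMetadata): string {
  const { from, to } = meta.loc.lines;
  const range = from === to ? `line ${from}` : `lines ${from}-${to}`;
  return `${meta.source}, ${range}`;
}

const firstChunkMeta: ChunkMetadata = {
  source: "rag-intro.md",
  loc: { lines: { from: 1, to: 9 } },
};
console.log(formatCitation(firstChunkMeta)); // "rag-intro.md, lines 1-9"
```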

Step 7 — Measure your chunk distribution

You should never ship a chunking config you have not measured. A one-line bug in your separators or a stray oversized chunk will quietly blow past your embedding model's input limit or waste tokens on near-empty fragments. Create src/measure.ts:

export function measure(
  chunks: string[],
  countTokens: (t: string) => number,
  budget: number
) {
  const sizes = chunks.map(countTokens);
  return {
    count: sizes.length,
    min: Math.min(...sizes),
    max: Math.max(...sizes),
    avg: Math.round(sizes.reduce((a, b) => a + b, 0) / sizes.length),
    overBudget: sizes.filter((s) => s > budget).length,
  };
}

overBudget is the number that matters most: it should be zero. One subtlety worth knowing: chunk sizes can drift one or two tokens over the budget even when the splitter is working correctly, because tokenization at a join is context-dependent (two pieces that encode to a and b tokens on their own do not always encode to exactly a + b tokens once concatenated). In a separate test with a 128-token budget, chunks landed at sizes like [127, 128, 127, 129, 128, ...]. The practical defense is to leave roughly ten percent of headroom: if your hard ceiling is 512 tokens, set your chunk budget to 460–480, not 512.
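
The headroom rule is easy to encode so it is not left to memory. This is a minimal sketch with hypothetical helper names (budgetWithHeadroom, sizesOverLimit); the ten percent default mirrors the rule of thumb above and is not a value taken from any library.

```typescript
// Hypothetical helper: derive a chunk budget from a model's hard input limit.
function budgetWithHeadroom(hardLimit: number, headroom = 0.1): number {
  if (headroom < 0 || headroom >= 1) {
    throw new RangeError("headroom must be in [0, 1)");
  }
  return Math.floor(hardLimit * (1 - headroom));
}

// Return the measured sizes that exceed a hard limit.
function sizesOverLimit(sizes: number[], hardLimit: number): number[] {
  return sizes.filter((s) => s > hardLimit);
}

console.log(budgetWithHeadroom(512)); // 460
console.log(sizesOverLimit([127, 128, 127, 129, 128], 128)); // [ 129 ]
```

Feed the result of budgetWithHeadroom into chunkSize, and check measured sizes against the hard limit rather than the budget, so a one-token drift does not register as a failure.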

Verification

Wire the pieces together in src/main.ts and run it:

import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { tokenCounter } from "./tokens";
import { measure } from "./measure";
import { doc } from "./sample";

const countTokens = tokenCounter("text-embedding-3-small");

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 256,
  chunkOverlap: 32,
  lengthFunction: countTokens,
});

const docs = await splitter.createDocuments([doc], [{ source: "rag-intro.md" }]);

for (const [i, d] of docs.entries()) {
  console.log(
    `chunk ${i}: ${countTokens(d.pageContent)} tokens, ` +
      `lines ${d.metadata.loc.lines.from}-${d.metadata.loc.lines.to}`
  );
}
console.log(measure(docs.map((d) => d.pageContent), countTokens, 256));

Put any multi-paragraph Markdown document in src/sample.ts and export it as doc, then run:

npx tsx src/main.ts

With the 511-token sample document, the output is:

chunk 0: 202 tokens, lines 1-9
chunk 1: 228 tokens, lines 15-23
chunk 2: 97 tokens, lines 29-31
{ count: 3, min: 97, max: 228, avg: 176, overBudget: 0 }

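Three chunks, each under the 256-token budget, overBudget: 0, and every chunk tagged with a line range. Notice that the reported ranges leave gaps (9 to 15, 23 to 29) even though chunk 0 and chunk 1 share a heading, so treat loc.lines as approximate when a custom lengthFunction is in use and spot-check it against the source file before showing it in citations. Run npx tsc alongside it and the project type-checks cleanly under strict mode. You now have chunks that are ready to embed.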

Troubleshooting

"My chunks are tiny — dozens of fragments instead of a handful." You left the default character length function in place, so chunkSize: 256 means 256 characters (~50 tokens), not 256 tokens. Pass lengthFunction: countTokens as in Step 2.

"A chunk is a few tokens over my limit." Expected: tokenization at a boundary is context-dependent, so sizes drift one or two tokens past the budget. Set your chunk budget about ten percent below the model's hard limit (Step 7) instead of at it.

"A single chunk is hundreds of tokens over budget." One atomic unit — a very long line or an unbroken code blob — is bigger than the budget, and a greedy packer cannot split it. Use RecursiveCharacterTextSplitter (not the from-scratch version), which recurses into smaller separators all the way down to individual characters, and for code use fromLanguage so it breaks on syntax.

"Token counts look off versus my model's bill." You are counting with the wrong encoding. Embedding models in the text-embedding-3-* family use cl100k_base; the GPT-4o chat family uses o200k_base. Let encodingForModel pick it from the model name (Step 1).

"Headings and code blocks are getting split mid-section." You are using the generic separators. Switch to fromLanguage("markdown") for docs and the matching language for code (Step 5) so structural boundaries are tried first.

Next steps

You now have a token-aware chunker: it counts with your embedding model's encoding, splits on structure, overlaps for context, tags every chunk with its source lines, and reports a distribution you can verify. Chunking is the input stage of retrieval, and getting it right is one of the cheapest quality wins in the pipeline.

From here, three directions are worth your time. First, the tokenizer itself: the same js-tiktoken encodings drive context-window budgeting, covered in "counting tokens in TypeScript for the context window." Second, what happens to these chunks next (embedding them and retrieving them) is the subject of "building a robust RAG system." Third, once retrieval is live, tuning recall and relevance is covered in "RAG optimization techniques"; the chunk size and overlap you chose here are the first knobs you will turn.

1. @langchain/textsplitters: npm package (peer dependency on @langchain/core, engines.node >= 20). https://www.npmjs.com/package/@langchain/textsplitters
2. @langchain/core: npm package. https://www.npmjs.com/package/@langchain/core
3. Nerd Level Tech: "Building a Robust RAG System." /building-a-robust-rag-system-a-complete-implementation-guide
4. OpenAI Help Center: "What are tokens and how to count them." https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
5. js-tiktoken: npm package (pure-JS port of OpenAI's tiktoken; encodings bundled). https://www.npmjs.com/package/js-tiktoken
6. OpenAI Cookbook: "How to count tokens with tiktoken" (third-generation embedding models use the cl100k_base encoding). https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
7. LangChain.js: @langchain/textsplitters source for RecursiveCharacterTextSplitter (default separators, merge logic, and getSeparatorsForLanguage). https://github.com/langchain-ai/langchainjs/blob/main/libs/langchain-textsplitters/src/text_splitter.ts
8. Firecrawl: "Best Chunking Strategies for RAG (2026)" (chunk size and overlap ranges). https://www.firecrawl.dev/blog/best-chunking-strategies-rag