Count Tokens in TypeScript: Fit the Context Window (2026)

June 25, 2026

To count OpenAI tokens in TypeScript without calling the API, install gpt-tokenizer and call countTokens(text) for strings or encodeChat(messages, model).length for a chat array. It is pure JavaScript - no WASM, no API key - and it powers a trimToFit() that fits a conversation to the model's context window.

TL;DR

You will build a small TypeScript module that counts tokens locally with gpt-tokenizer 3.4.0, measures chat-message overhead correctly (not the stale "~4 tokens per message" guess), and trims a growing conversation so it fits inside a model's context window while reserving room for the reply. The stack is Node 22 LTS + TypeScript 6 + tsx, every snippet is copy-paste-runnable, and the whole thing runs offline. It takes about 15 minutes.

What you'll learn

  • Why you'd count tokens locally instead of calling the API - and when local counting is not enough
  • How to count tokens in a string with encode, decode, and countTokens
  • How to check a budget cheaply with isWithinTokenLimit (and its number | false return value)
  • How to count a chat-messages array accurately with encodeChat, including the per-message framing overhead
  • How to pick the right encoding (o200k_base vs cl100k_base) for your model
  • How to write a trimToFit() that keeps the system prompt and reserves output tokens
  • How to split a long document into token-bounded chunks for embedding or summarizing
  • How gpt-tokenizer compares to tiktoken and js-tiktoken on the edge

Prerequisites

  • Node 20.19+ or 22+ (this guide was verified on Node 22 LTS)
  • TypeScript 6.0.3 and tsx 4.22.4 to run .ts files directly
  • gpt-tokenizer 3.4.0 - the pure-JavaScript BPE tokenizer for OpenAI models [1]
  • Basic familiarity with the OpenAI chat-messages shape ({ role, content })

A token is the unit a model actually reads: roughly a word-piece. OpenAI's GPT models use a Byte-Pair Encoding (BPE) tokenizer, and gpt-tokenizer is a faithful pure-JS port of OpenAI's tiktoken that ships the BPE rank tables inline, so it works in Node, the browser, and edge runtimes without a WASM binary or a runtime download [2].

Why count tokens locally (and when not to)

Counting locally is instant, free, and private - you don't ship your prompt to a third party just to size it. That's what you want for trimming history, splitting documents, sizing a batch, or showing a live token counter in a UI.

But be honest about the boundary: a local BPE tokenizer counts text only. It does not see the tokens added by tool/function schemas, images, or files, and providers occasionally adjust how requests are framed. OpenAI's own documentation says character-based estimates are inaccurate and that tools and images need a server-side count [3]. So use local counting for fast, approximate sizing of text, and fall back to the provider's official token-counting endpoint when you need an exact, billable number that includes non-text inputs. The closing section, "When local counting isn't enough," covers when to reach for that endpoint.

Step 1 - Set up the project

Create a project and install the runtime and dev dependencies with pinned versions:

mkdir token-budget && cd token-budget
npm init -y
npm install gpt-tokenizer@3.4.0
npm install -D typescript@6.0.3 tsx@4.22.4 @types/node@22.20.0

Mark the package as an ES module and add a strict tsconfig.json. Because we run .ts files with tsx (no build step), we enable allowImportingTsExtensions and noEmit:

npm pkg set type=module

tsconfig.json:

{
  "compilerOptions": {
    "target": "ES2023",
    "module": "nodenext",
    "moduleResolution": "nodenext",
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "verbatimModuleSyntax": true,
    "allowImportingTsExtensions": true,
    "noEmit": true,
    "skipLibCheck": true,
    "types": ["node"]
  },
  "include": ["src"]
}

That's the whole setup. No model download, no API key, no .env.

Step 2 - Count tokens in a string

This is the core of OpenAI token counting: turn text into token IDs and back. Create src/count.ts:

import { encode, decode, countTokens, isWithinTokenLimit } from 'gpt-tokenizer'

const text = 'The quick brown fox jumps over the lazy dog.'

const tokens: number[] = encode(text)
console.log('token ids:', tokens)
console.log('token count:', countTokens(text)) // same as encode(text).length
console.log('round-trips:', decode(tokens) === text)

Run it:

npx tsx src/count.ts

Expected output:

token ids: [
    976,  4853, 19705,
  68347, 65613,  1072,
    290, 29082,  6446,
     13
]
token count: 10
round-trips: true

Three things worth noticing. First, countTokens(text) is the number you want for budgeting - it returns the same value as encode(text).length but counts natively without building the token array. Second, decode(encode(text)) round-trips exactly, which is how you know the tokenizer is lossless. Third, the default import uses the o200k_base encoding, the one used by every modern OpenAI model (gpt-5, gpt-4o, gpt-4.1, and the o-series) [2]. We'll handle older models in Step 5.

Step 3 - Check a limit cheaply with isWithinTokenLimit

Often you don't need the exact count - you only need to know "does this fit under N tokens?" Encoding a huge string just to compare its length is wasteful. isWithinTokenLimit stops early once the limit is exceeded, and its return type is the small gotcha worth internalizing: it returns the token count (a number) when the text fits, or false when it doesn't. Append this to src/count.ts:

// isWithinTokenLimit returns the token COUNT if within the limit, or false if over
const within: number | false = isWithinTokenLimit(text, 20)
console.log('within 20?', within) // 10

const over: number | false = isWithinTokenLimit(text, 5)
console.log('within 5?', over) // false

Output:

within 20? 10
within 5? false

Because false and 0 are both falsy, always check the type explicitly. Write if (within === false) rather than if (!within) - otherwise an empty string (legitimately 0 tokens) would be misread as "over the limit." Under TypeScript's strict mode the number | false union forces you to handle both arms, which is exactly the safety you want around a budget check.
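To make the falsy trap concrete, here is a minimal sketch that compares the explicit check with the truthiness shortcut. It does not call gpt-tokenizer; `LimitResult`, `fits`, and `fitsBuggy` are hypothetical names that only mirror the number | false return shape described above.

```typescript
// Mirrors the return shape of isWithinTokenLimit: a count, or false when over.
type LimitResult = number | false

// Explicit comparison: a 0-token result still counts as "fits".
function fits(result: LimitResult): boolean {
  return result !== false
}

// Buggy on purpose: truthiness treats 0 the same as false.
function fitsBuggy(result: LimitResult): boolean {
  return Boolean(result)
}

console.log(fits(0), fitsBuggy(0)) // true false  <- the empty-string case
console.log(fits(10), fitsBuggy(10)) // true true
console.log(fits(false), fitsBuggy(false)) // false false
```

The two functions only disagree on 0, which is exactly the case that is easy to miss in testing.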

Step 4 - Count chat messages accurately with encodeChat

Here's where most token-counting blog posts go wrong. The OpenAI Chat Completions API doesn't send the model your raw content strings - it wraps each message in special boundary tokens that mark the role and the start/end of every message. If you only count content, you undercount, and undercounting is how you blow past the context window in production.

encodeChat does the real thing. Create src/count-chat.ts:

import { encodeChat, countTokens } from 'gpt-tokenizer'

type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

const messages: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is the capital of France?' },
]

// encodeChat needs the model so it can apply that model's chat framing
const chatTokens = encodeChat(messages, 'gpt-4o')
console.log('chat token count:', chatTokens.length)

// Naive raw-text counting UNDERCOUNTS because it ignores per-message framing:
const naive = messages.reduce((sum, m) => sum + countTokens(m.content), 0)
console.log('naive content-only count:', naive)
console.log('framing overhead:', chatTokens.length - naive)

Run it:

npx tsx src/count-chat.ts

Output:

chat token count: 24
naive content-only count: 13
framing overhead: 11

The two messages are 24 tokens, not 13. The 11-token difference is the ChatML framing - encodeChat emits the <|im_start|> / <|im_sep|> / <|im_end|> boundary tokens that wrap each message's role and content in OpenAI's chat format, which naive character- or content-based counting misses entirely. Pass the model name so encodeChat applies that model's framing; an empty chat still costs a few special tokens because the conversation itself has a start and end.
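As a sanity check on that arithmetic, the 11-token overhead can be decomposed under an assumed framing layout: four framing tokens per message (start marker, a one-token role, separator, end marker) plus three tokens that prime the assistant's reply. That layout is an assumption that happens to match the numbers above, not something this guide verifies against the library, so treat the hypothetical `estimateChatTokens` helper below as a cross-check and keep encodeChat as the source of truth.

```typescript
// ASSUMPTION: <|im_start|> role <|im_sep|> content <|im_end|> costs 4 framing tokens.
const ASSUMED_TOKENS_PER_MESSAGE = 4
// ASSUMPTION: the reply is primed with 3 extra tokens after the last message.
const ASSUMED_REPLY_PRIMING_TOKENS = 3

// Hypothetical helper: content tokens plus the assumed framing overhead.
function estimateChatTokens(contentTokenCounts: number[]): number {
  const content = contentTokenCounts.reduce((sum, n) => sum + n, 0)
  const framing = contentTokenCounts.length * ASSUMED_TOKENS_PER_MESSAGE
  return content + framing + ASSUMED_REPLY_PRIMING_TOKENS
}

// 6 + 7 is an assumed per-message split of the 13 content tokens reported above.
console.log(estimateChatTokens([6, 7])) // 24
```

If a future model changes its framing, an estimate like this drifts silently, which is the reason to measure with encodeChat rather than hardcode constants.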

If you call encodeChat a lot, import a model-bound version once instead of repeating the model string:

import { encodeChat, countTokens } from 'gpt-tokenizer/model/gpt-4o'

// encodeChat(messages) now uses gpt-4o framing without a second argument

That subpath import binds the encoding at import time, so encodeChat(messages) and countTokens(text) both use gpt-4o's o200k_base settings.

Step 5 - Pick the right encoding for your model

A tokenizer is only correct if it uses the encoding your model actually uses. OpenAI models map to two encodings you'll meet in practice:

| Encoding | Models | gpt-tokenizer import |
| --- | --- | --- |
| o200k_base | gpt-5, gpt-4o, gpt-4.1, o-series | default import (or gpt-tokenizer/model/gpt-4o) |
| cl100k_base | gpt-3.5-turbo, gpt-4 | gpt-tokenizer/encoding/cl100k_base |

The default import is o200k_base, so if you're on a current model you're already correct. Using the wrong encoding doesn't crash - it silently returns a different number, which is the dangerous kind of wrong. The gap is biggest for non-English text, where o200k_base is markedly more efficient. Create src/encoding.ts:

import { countTokens as o200k } from 'gpt-tokenizer/encoding/o200k_base'
import { countTokens as cl100k } from 'gpt-tokenizer/encoding/cl100k_base'

const text = '日本語のトークン数を数える'
console.log('o200k:', o200k(text)) // 10
console.log('cl100k:', cl100k(text)) // 14

Output:

o200k: 10
cl100k: 14

Same string, 10 tokens versus 14: cl100k_base needs 40% more tokens than o200k_base for identical text, purely because of the encoding. If you hardcoded cl100k_base out of habit while serving gpt-4o, every budget calculation would be inflated. Match the encoding to the model and the numbers line up. For deeper strategy on what to put in that window, see our companion piece on context window optimization for LLMs.
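One way to keep the model-to-encoding mapping from drifting is a small lookup that fails loudly on unknown models instead of silently defaulting. The sketch below is illustrative only: `encodingFor` and `ENCODING_BY_MODEL` are hypothetical names, and the entries simply restate the table above rather than anything read from gpt-tokenizer.

```typescript
type Encoding = 'o200k_base' | 'cl100k_base'

// Restates the table above; extend it deliberately when you adopt a new model.
const ENCODING_BY_MODEL = new Map<string, Encoding>([
  ['gpt-5', 'o200k_base'],
  ['gpt-4o', 'o200k_base'],
  ['gpt-4.1', 'o200k_base'],
  ['gpt-4', 'cl100k_base'],
  ['gpt-3.5-turbo', 'cl100k_base'],
])

// Throws rather than guessing: a wrong encoding returns a plausible but wrong count.
function encodingFor(model: string): Encoding {
  const encoding = ENCODING_BY_MODEL.get(model)
  if (encoding === undefined) {
    throw new Error(`No encoding registered for model "${model}"`)
  }
  return encoding
}

console.log(encodingFor('gpt-4o')) // o200k_base
console.log(encodingFor('gpt-4')) // cl100k_base
```

Exact-match keys are a deliberate choice here: prefix matching would let "gpt-4" swallow "gpt-4o" and "gpt-4.1", which use a different encoding.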

Step 6 - Trim chat history to fit the context window

Now the payoff. A chat app accumulates history until it overflows the model's context window. The fix is to trim the oldest turns - while always keeping the system prompt, which steers the model - and to reserve tokens for the reply, because the context window is shared by input and output. (gpt-4o, for example, has a 128,000-token context window but caps output at 16,384 tokens, so you must leave room [4].)
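The budget arithmetic is worth seeing once with real numbers. This is a minimal sketch using the gpt-4o figures quoted above; `inputBudget` is a hypothetical helper name, and reserving the full output cap is just one possible policy.

```typescript
// Tokens left for the prompt once room for the reply has been set aside.
function inputBudget(contextWindow: number, reserveForOutput: number): number {
  if (reserveForOutput >= contextWindow) {
    throw new RangeError('reserveForOutput must be smaller than the context window')
  }
  return contextWindow - reserveForOutput
}

const GPT_4O_CONTEXT = 128_000 // context window quoted above
const GPT_4O_MAX_OUTPUT = 16_384 // output cap quoted above

console.log(inputBudget(GPT_4O_CONTEXT, GPT_4O_MAX_OUTPUT)) // 111616
```

Reserving the whole output cap is the conservative choice; if your replies are short you can reserve less and keep more history.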

Create src/trim-to-fit.ts:

import { encodeChat } from 'gpt-tokenizer'

export type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }
export type GptModel = 'gpt-4o' | 'gpt-4o-mini' | 'gpt-4.1' | 'gpt-5'

export interface TrimOptions {
  model: GptModel
  /** Total context window of the model, in tokens. */
  maxContextTokens: number
  /** Tokens to leave free for the model's reply. */
  reserveForOutput: number
}

export function countChatTokens(messages: ChatMessage[], model: GptModel): number {
  return encodeChat(messages, model).length
}

/**
 * Drop the oldest non-system messages until the chat fits inside
 * maxContextTokens - reserveForOutput. The system message is always kept.
 */
export function trimToFit(messages: ChatMessage[], opts: TrimOptions): ChatMessage[] {
  const budget = opts.maxContextTokens - opts.reserveForOutput
  if (budget <= 0) throw new RangeError('reserveForOutput must be smaller than the context window')

  const system = messages.filter((m) => m.role === 'system')
  const turns = messages.filter((m) => m.role !== 'system')

  // Keep removing the oldest turn until we fit (or only the system prompt remains).
  const kept = [...turns]
  while (kept.length > 0 && countChatTokens([...system, ...kept], opts.model) > budget) {
    kept.shift()
  }
  return [...system, ...kept]
}

Now exercise it with a synthetic conversation in src/demo.ts. We use a deliberately tiny 400-token window so the trimming is visible; in production you'd pass the model's real window:

import { trimToFit, countChatTokens, type ChatMessage } from './trim-to-fit.ts'

const history: ChatMessage[] = [
  { role: 'system', content: 'You are a terse assistant.' },
]
for (let i = 1; i <= 40; i++) {
  history.push({ role: 'user', content: `Question number ${i}: tell me a fact about the number ${i}.` })
  history.push({ role: 'assistant', content: `Fact ${i}: ${i} is a number with interesting properties worth a sentence or two of explanation here.` })
}

const model = 'gpt-4o' as const
console.log('full history tokens:', countChatTokens(history, model), 'messages:', history.length)

const trimmed = trimToFit(history, { model, maxContextTokens: 400, reserveForOutput: 120 })
console.log('trimmed tokens:', countChatTokens(trimmed, model), 'messages:', trimmed.length)
console.log('budget was:', 400 - 120)
console.log('system kept?', trimmed[0]?.role === 'system')
console.log('fits budget?', countChatTokens(trimmed, model) <= 400 - 120)

Run it:

npx tsx src/demo.ts

Output:

full history tokens: 1773 messages: 81
trimmed tokens: 277 messages: 13
budget was: 280
system kept? true
fits budget? true

The 81-message, 1,773-token history is trimmed to the 13 most recent messages (277 tokens), which fits the 280-token budget (400 minus the 120 reserved for output), and the system prompt survives at index 0. Swap maxContextTokens: 400 for 128_000 and the same function manages a real gpt-4o conversation.

Two production refinements once the shape is clear. First, a lone shift() can drop a user question while keeping its assistant reply, leaving that reply dangling at the start of the history; if message coherence matters, remove turns in user+assistant pairs. Second, if you'd rather summarize old turns than delete them, replace the dropped messages with a single "summary so far" system message - count that summary with countTokens so it, too, stays inside the budget.
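The pair-wise refinement can be sketched as follows. To keep the example dependency-free it takes the counter as a parameter: `trimPairsToFit`, `ChatCounter`, and the word-based `standInCounter` are all hypothetical, and in a real project you would pass a function built on encodeChat instead.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }
type ChatCounter = (messages: ChatMessage[]) => number

// Drops the oldest message plus any assistant replies right after it,
// so the kept history never starts on an orphaned assistant turn.
function trimPairsToFit(messages: ChatMessage[], budget: number, count: ChatCounter): ChatMessage[] {
  const system = messages.filter((m) => m.role === 'system')
  const kept = messages.filter((m) => m.role !== 'system')
  while (kept.length > 0 && count([...system, ...kept]) > budget) {
    kept.shift()
    while (kept[0]?.role === 'assistant') kept.shift()
  }
  return [...system, ...kept]
}

// STAND-IN counter (assumption): words plus 4 per message. Not a real tokenizer.
const standInCounter: ChatCounter = (msgs) =>
  msgs.reduce((sum, m) => sum + m.content.split(/\s+/).filter(Boolean).length + 4, 0)

const conversation: ChatMessage[] = [
  { role: 'system', content: 'Be terse.' },
  { role: 'user', content: 'Question one?' },
  { role: 'assistant', content: 'Answer one.' },
  { role: 'user', content: 'Question two?' },
  { role: 'assistant', content: 'Answer two.' },
  { role: 'user', content: 'Question three?' },
  { role: 'assistant', content: 'Answer three.' },
]

const pairTrimmed = trimPairsToFit(conversation, 25, standInCounter)
console.log(pairTrimmed.map((m) => m.role)) // [ 'system', 'user', 'assistant' ]
```

With the same stand-in counter and a budget of 25, removing one message at a time would stop with "Answer two." at the front of the history; the pair-wise version trims one message further and keeps the conversation coherent.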

Step 7 - Split a long document by token budget

Trimming handles a conversation that's too long; chunking handles a single document that's too long - the everyday prep step before embedding text for search or feeding it to a summarizer one piece at a time. The pattern is the same primitive (countTokens) used greedily: pack words until the next one would push the chunk over the limit, then start a fresh chunk. Create src/chunk.ts:

import { countTokens } from 'gpt-tokenizer'

/**
 * Greedily pack whitespace-separated words into chunks that each stay
 * at or under maxTokens. Useful for splitting a document before embedding
 * or summarizing it.
 */
export function chunkByTokens(text: string, maxTokens: number): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  let current: string[] = []

  for (const word of words) {
    const candidate = [...current, word].join(' ')
    if (countTokens(candidate) > maxTokens && current.length > 0) {
      chunks.push(current.join(' '))
      current = [word]
    } else {
      current.push(word)
    }
  }
  if (current.length > 0) chunks.push(current.join(' '))
  return chunks
}

const doc = Array.from({ length: 120 }, (_, i) => `sentence ${i + 1} has a few words in it.`).join(' ')
const chunks = chunkByTokens(doc, 64)
console.log('document tokens:', countTokens(doc))
console.log('chunks:', chunks.length)
console.log('max chunk tokens:', Math.max(...chunks.map((c) => countTokens(c))))
console.log('all within 64?', chunks.every((c) => countTokens(c) <= 64))

Run it:

npx tsx src/chunk.ts

Output:

document tokens: 1200
chunks: 19
max chunk tokens: 64
all within 64? true

The 1,200-token document is split into 19 chunks, none exceeding the 64-token budget. Two notes for real documents. First, the current.length > 0 guard keeps the function from emitting an empty chunk when a single "word" is itself longer than maxTokens: that word ends up alone in its own oversized chunk, which breaks the at-or-under-maxTokens guarantee. If that's a real risk for your input (long URLs or encoded blobs, for example), split such words further before packing. Second, for retrieval quality you usually want a small overlap between chunks; seed each new current with the last sentence of the previous chunk so context isn't sheared at the boundary.
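A chunker with overlap might look like the sketch below. It overlaps by a fixed number of trailing words rather than a full sentence, which is a simplification, and it takes the counter as a parameter so it runs without dependencies: `chunkWithOverlap`, `TokenCounter`, and the word-counting `wordCount` stand-in are hypothetical names. In practice you would pass gpt-tokenizer's countTokens.

```typescript
type TokenCounter = (text: string) => number

// Greedy packing as in chunkByTokens, but each new chunk starts with the
// last `overlapWords` words of the previous one.
function chunkWithOverlap(
  text: string,
  maxTokens: number,
  overlapWords: number,
  count: TokenCounter,
): string[] {
  const words = text.split(/\s+/).filter(Boolean)
  const chunks: string[] = []
  let current: string[] = []

  for (const word of words) {
    if (count([...current, word].join(' ')) > maxTokens && current.length > 0) {
      chunks.push(current.join(' '))
      // Guard overlapWords === 0 explicitly: slice(-0) would copy the whole array.
      current = overlapWords > 0 ? current.slice(-overlapWords) : []
      // If the overlap plus the new word is already over budget, drop the overlap.
      if (count([...current, word].join(' ')) > maxTokens) current = []
    }
    current.push(word)
  }
  if (current.length > 0) chunks.push(current.join(' '))
  return chunks
}

// STAND-IN counter (assumption): one token per word. Not a real tokenizer.
const wordCount: TokenCounter = (s) => s.split(/\s+/).filter(Boolean).length

console.log(chunkWithOverlap('a b c d e f g h i j', 4, 1, wordCount))
// [ 'a b c d', 'd e f g', 'g h i j' ]
```

Overlap increases the total token count you embed or summarize, so keep it small relative to maxTokens.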

Verification

You've already seen each script print its expected output, but here's a single end-to-end check. Type-check the whole project, then confirm the trim demo reports fits budget? true:

npx tsc --noEmit && npx tsx src/demo.ts | tail -1

Expected:

fits budget? true

A clean tsc --noEmit (it prints nothing and exits 0) plus fits budget? true means the types are sound and the trimming actually respects the budget you set.

gpt-tokenizer vs tiktoken vs js-tiktoken

There are three tokenizer packages you can call from JavaScript - two pure JS, one WASM-backed - and the right choice depends on where the code runs:

| Library | Version | Mechanism | Edge / offline notes |
| --- | --- | --- | --- |
| gpt-tokenizer | 3.4.0 | Pure JS, BPE ranks bundled inline | No WASM, no network fetch - works on any runtime [1] |
| tiktoken | 1.0.22 | WASM binding to the Rust tokenizer | High throughput on large inputs, but needs WASM support [5] |
| js-tiktoken | 1.0.21 | Pure JS | Main build bundles ranks; the js-tiktoken/lite build expects you to supply rank data, commonly fetched from a CDN [6] |

For Cloudflare Workers, Vercel Edge, or any cold-start-sensitive path, gpt-tokenizer is the low-friction default: it has no WASM dependency and no runtime download, so there's nothing to fetch and nothing to fail. tiktoken (the WASM package) is built around a compiled WASM core aimed at raw throughput on very large inputs, at the cost of needing WASM support. js-tiktoken sits in between - its full build also bundles ranks, but the lite variant's bring-your-own-ranks pattern is the one that bites people on the edge when the CDN fetch is slow or blocked.

Troubleshooting

Error: Model 'gpt-4o-banana' does not support chat. - You passed an unknown model name to encodeChat. The model string must be one gpt-tokenizer recognizes (e.g. gpt-4o, gpt-4.1, gpt-5). Fix the typo or import the model-bound version, gpt-tokenizer/model/gpt-4o, and call encodeChat(messages) without the second argument.

Error: Disallowed special token found: <|endoftext|> - Your input text literally contains a special token like <|endoftext|>, and encode refuses to silently merge it for safety. If you trust the input and want it tokenized as that special token, opt in explicitly:

import { encode } from 'gpt-tokenizer'

const ids = encode('hello <|endoftext|> world', {
  allowedSpecial: new Set(['<|endoftext|>']),
})
console.log(ids.length) // 9

ERR_MODULE_NOT_FOUND: Cannot find package 'gpt-tokenizer' - You ran a script from outside the project directory, so Node couldn't resolve node_modules. Run scripts from the project root (where package.json and node_modules live).

tsc errors about CommonJS under verbatimModuleSyntax - You forgot "type": "module" in package.json, so TypeScript treats .ts files as CommonJS and rejects the ESM import syntax. Run npm pkg set type=module. If tsc then complains about importing ./trim-to-fit.ts, confirm allowImportingTsExtensions and noEmit are set as shown in Step 1.

My local count doesn't match the API's billed tokens. - Expected, within a small margin. Local counting ignores tool/function schemas, images, and files, and providers add framing of their own. For an exact number, use the provider's count endpoint (next section).

When local counting isn't enough

Two cases need the network. First, exact, billable counts that include non-text inputs. OpenAI exposes an input-token-count endpoint that accepts the same payload as the Responses API and returns the precise input_tokens, covering text, messages, images, files, and tools - the things a local BPE tokenizer can't see [3]. Reach for it when accuracy matters more than latency, e.g. right before a costly call or for usage metering.

Second, other model families. gpt-tokenizer and tiktoken are OpenAI-specific. Anthropic's Claude uses a different tokenizer, so counting Claude prompts with a GPT tokenizer is simply wrong; use Anthropic's POST /v1/messages/count_tokens endpoint (or client.messages.countTokens()) instead [7]. Google's Gemini has its own countTokens method as well. The rule of thumb: tokenizers are per-vendor, sometimes per-generation, so match the counter to the model you're actually calling.

For the broader cost picture once you can count, see our guides on saving tokens and optimizing prompts and prompt caching with Claude in TypeScript, which both build on the budgeting muscle you just developed.

Next steps

  • Add a summarization fallback to trimToFit: when you'd drop a turn, fold it into a running summary system message instead of deleting it.
  • Wire the local count into a pre-flight guard that rejects requests over budget before they hit the API.
  • For exact metering, call the provider's count endpoint and reconcile it against your local estimate to learn your typical overhead.
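The pre-flight guard from the list above could start as small as this. It is a sketch under stated assumptions: `preflight` and `GuardResult` are hypothetical names, and the token count is passed in as a plain number so the caller decides how to measure it (for example with countChatTokens from Step 6).

```typescript
type GuardResult =
  | { ok: true; inputTokens: number; remaining: number }
  | { ok: false; inputTokens: number; overBy: number }

// Decide locally whether a request fits before spending an API call on it.
function preflight(inputTokens: number, maxContextTokens: number, reserveForOutput: number): GuardResult {
  const budget = maxContextTokens - reserveForOutput
  if (inputTokens > budget) {
    return { ok: false, inputTokens, overBy: inputTokens - budget }
  }
  return { ok: true, inputTokens, remaining: budget - inputTokens }
}

// Figures from the Step 6 demo: 400-token window, 120 reserved for output.
console.log(preflight(1773, 400, 120)) // { ok: false, inputTokens: 1773, overBy: 1493 }
console.log(preflight(277, 400, 120)) // { ok: true, inputTokens: 277, remaining: 3 }
```

Returning a discriminated union instead of throwing lets the caller choose between rejecting the request and trimming it with trimToFit.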

You now have a fast, offline, type-safe way to count OpenAI tokens in TypeScript and keep any conversation inside its context window - without a WASM binary, a CDN fetch, or an API key.

Footnotes

  1. gpt-tokenizer on npm - 3.4.0 is the latest dist-tag as of 25 June 2026. Pure-JavaScript BPE tokenizer for OpenAI models. https://www.npmjs.com/package/gpt-tokenizer

  2. niieani/gpt-tokenizer (GitHub) - the default export uses the o200k_base encoding, used by gpt-5, gpt-4o, gpt-4.1, and the o-series. https://github.com/niieani/gpt-tokenizer

  3. OpenAI - Counting tokens. The input-token-count endpoint accepts the same input as the Responses API and returns the exact input_tokens, including tools and images that local tokenizers can't measure. https://developers.openai.com/api/docs/guides/token-counting

  4. OpenAI gpt-4o has a 128,000-token context window and, since the 2024-11-20 update, a 16,384-token maximum output; input and output share the window. https://platform.openai.com/docs/models/gpt-4o

  5. tiktoken on npm (1.0.22) - the WASM binding to OpenAI's Rust tokenizer. https://www.npmjs.com/package/tiktoken

  6. js-tiktoken on npm (1.0.21) - pure-JS port; the js-tiktoken/lite build expects caller-supplied rank data. https://www.npmjs.com/package/js-tiktoken

  7. Anthropic - Token counting. Claude uses a different tokenizer from OpenAI; count via POST /v1/messages/count_tokens or client.messages.countTokens(). https://platform.claude.com/docs/en/build-with-claude/token-counting