ai-ml

Claude Prompt Caching in TypeScript: Cut Costs 2026

June 15, 2026

#claude #prompt caching #cache_control #typescript #anthropic #llm #ai-ml #tutorial

Claude Prompt Caching in TypeScript: Cut Costs 2026

Claude prompt caching reuses a static prefix — a long system prompt, tool definitions, or a reference document — across requests, charging about 10% of the input price on each repeat. In TypeScript you mark the prefix with cache_control and confirm the hit through the response usage fields.

TL;DR

You will add prompt caching to a TypeScript Claude app on @anthropic-ai/sdk 0.104.1: mark a static system prompt with cache_control: { type: 'ephemeral' }, confirm the cache hit by reading cache_creation_input_tokens and cache_read_input_tokens off the response, and calculate the cost difference. A 2,000-token system prompt reused 100 times on Sonnet 4.6 drops from $0.60 to about $0.067 — an 88.8% cut on that prefix. You will also use the newer automatic (top-level) caching for growing conversations, the one-hour TTL, and learn the per-model minimum token thresholds that silently break caching when you get them wrong. Budget about 25 minutes.

What you'll learn

How Claude prompt caching works: ephemeral entries, the tools → system → messages prefix, and the 10% read price
How to add an explicit cache breakpoint with cache_control on a system block
How to measure a cache hit from the response usage object
How to calculate the real cost savings with verified pricing
How to use automatic (top-level) caching for multi-turn conversations
How to extend the cache to one hour and place multiple breakpoints
Why a cache silently misses, and how to fix each cause

Prerequisites

Node.js 24 LTS (Active LTS through October 2026, then Maintenance LTS through end-of-life in April 2028)¹
An Anthropic API key in ANTHROPIC_API_KEY
Pinned packages, current on npm as of June 15, 2026:²

npm install --save-exact @anthropic-ai/sdk@0.104.1
npm install --save-exact -D typescript@6.0.3 tsx@4.22.4 @types/node@24.13.2

Use "type": "module" in package.json and a strict tsconfig.json ("module": "nodenext", "strict": true). Run the scripts with node --import tsx src/<file>.ts, which executes TypeScript with no build step; the new Anthropic() client reads ANTHROPIC_API_KEY from the environment, so no dotenv dependency is needed.

How Claude prompt caching works

Prompt caching stores a prefix of your request in memory so the next request that starts with the same prefix reads it back instead of reprocessing it. Caching is "ephemeral": an entry has a five-minute lifetime by default, refreshed for free every time it is reused, and you can opt into a one-hour lifetime.³ The cache covers your request in a fixed order — tools, then system, then messages — and a breakpoint caches everything up to and including the block you mark.³

Pricing is the reason to care. A cache read costs 0.1× the base input price. Writing a new five-minute entry costs 1.25× base (you pay the surcharge once, then amortize it across every hit); a one-hour entry costs 2× base.⁴ On Claude Sonnet 4.6, base input is $3 per million tokens, a cache read is $0.30, and a five-minute write is $3.75.⁴

Step 1: Establish a no-cache baseline

Start with a shared client and a small helper that prints the cache-related usage fields. Create src/client.ts:

import Anthropic from '@anthropic-ai/sdk';

export const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export function summarizeUsage(usage: Anthropic.Usage): string {
  const write = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  const fresh = usage.input_tokens;
  return `write=${write} read=${read} fresh=${fresh} output=${usage.output_tokens}`;
}

Now send a request with the system prompt as a plain string — no cache_control anywhere. Create src/baseline.ts:

import { readFile } from 'node:fs/promises';
import { client, summarizeUsage } from './client.js';

const SYSTEM_PROMPT = await readFile('style-guide.md', 'utf8');

const res = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 512,
  system: SYSTEM_PROMPT, // plain string: no cache_control, nothing is cached
  messages: [{ role: 'user', content: 'Summarize the refund policy in one line.' }],
});

console.log(summarizeUsage(res.usage)); // write=0 read=0 fresh=<all input> output=<n>

Provide any style-guide.md longer than 1,024 tokens (roughly 750 words or more). Run it twice within a minute:

node --import tsx src/baseline.ts

Both runs report write=0 read=0, and fresh equals the full system prompt plus the user message every time. Nothing is cached, so you pay the base input rate on the entire prefix on every call.

Step 2: Add a cache breakpoint

Caching the static prefix is one field. Move the system prompt into a content block and attach cache_control. Create src/cached-system.ts:

import { readFile } from 'node:fs/promises';
import { client, summarizeUsage } from './client.js';

// A large, static instruction block reused on every request. It must reach the
// per-model minimum (1,024 tokens for Sonnet 4.6) to actually cache.
const SYSTEM_PROMPT = await readFile('style-guide.md', 'utf8');

export async function ask(question: string) {
  const res = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 512,
    system: [
      {
        type: 'text',
        text: SYSTEM_PROMPT,
        cache_control: { type: 'ephemeral' }, // breakpoint on the static prefix
      },
    ],
    messages: [{ role: 'user', content: question }],
  });
  console.log(summarizeUsage(res.usage));
  return res;
}

The breakpoint goes on the block that stays identical across requests — the system prompt — not on the changing user message. Place a breakpoint on content that varies (a timestamp, the incoming question) and the prefix hash never matches, so nothing ever hits.³

One detail decides whether this works at all: the cached prefix has to reach a per-model minimum length. Below it, the request is processed with no caching and no error is raised — you just keep paying full price.⁵

Model	Minimum cacheable tokens
Claude Sonnet 4.6	1,024
Claude Opus 4.8	1,024
Claude Opus 4.6 / 4.5	4,096
Claude Haiku 4.5	4,096

That Haiku number is an easy one to get wrong: Claude Haiku 4.5 needs 4,096 tokens to cache, not the 2,048 figure that applied to older Haiku models.⁵ If you cache a 2,000-token prefix on Haiku 4.5, it silently does nothing.

Step 3: Measure the cache hit

The response usage object tells you exactly what happened. Three fields matter:⁶

cache_creation_input_tokens — tokens written to the cache on this request
cache_read_input_tokens — tokens read back from the cache on this request
input_tokens — only the tokens after your last breakpoint, not the whole prompt

Call ask() twice inside the five-minute window:

import { ask } from './cached-system.js';

await ask('What is the refund window?');       // first call
await ask('How do I escalate a billing issue?'); // second call, same cached prefix

The first call writes the entry: write=<prefix size> read=0. The second call reads it: write=0 read=<prefix size>, with fresh holding only the short user message. The whole prompt was sent both times, but the second request charged the cached prefix at the read rate.

The clearest diagnostic is the zero-zero rule: if both cache_creation_input_tokens and cache_read_input_tokens are 0, nothing cached — almost always because the prefix was under the minimum length.⁵ Note also that total_input_tokens = cache_read_input_tokens + cache_creation_input_tokens + input_tokens; the input_tokens field alone is not your total prompt size.⁶

Step 4: Calculate the savings

The pricing is a fixed set of multipliers on the base input price: five-minute writes are 1.25×, one-hour writes are 2×, and reads are 0.1×.⁴ Encode the table and a cost function. Create src/cost.ts:

import type Anthropic from '@anthropic-ai/sdk';

export interface Pricing {
  base: number;     // $ / MTok, uncached input
  write5m: number;  // 5-minute cache write
  write1h: number;  // 1-hour cache write
  read: number;     // cache hit / refresh
  output: number;
}

export const PRICING: Record<string, Pricing> = {
  'claude-sonnet-4-6': { base: 3, write5m: 3.75, write1h: 6, read: 0.3, output: 15 },
  'claude-haiku-4-5':  { base: 1, write5m: 1.25, write1h: 2, read: 0.1, output: 5 },
  'claude-opus-4-8':   { base: 5, write5m: 6.25, write1h: 10, read: 0.5, output: 25 },
};

const perToken = (mtokPrice: number) => mtokPrice / 1_000_000;

/** Dollar cost of one real response, from its usage object. */
export function requestCost(usage: Anthropic.Usage, p: Pricing): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const fresh = usage.input_tokens;
  const out = usage.output_tokens;
  const w5 = usage.cache_creation?.ephemeral_5m_input_tokens
    ?? usage.cache_creation_input_tokens ?? 0;
  const w1 = usage.cache_creation?.ephemeral_1h_input_tokens ?? 0;
  return (
    fresh * perToken(p.base) +
    read * perToken(p.read) +
    w5 * perToken(p.write5m) +
    w1 * perToken(p.write1h) +
    out * perToken(p.output)
  );
}

/** Cost of a static prefix of `tokens` reused across `requests` calls (write once, read after). */
export function prefixCost(tokens: number, requests: number, p: Pricing) {
  const cached = tokens * perToken(p.write5m) + (requests - 1) * tokens * perToken(p.read);
  const uncached = requests * tokens * perToken(p.base);
  return { cached, uncached, savedPct: (1 - cached / uncached) * 100 };
}

Take a 2,000-token system prompt answered 100 times on Sonnet 4.6, with the calls close enough together that the entry stays warm. Without caching you pay for 2,000 tokens at $3/MTok on all 100 calls: 100 × 2000 × $3 / 1e6 = $0.60. With caching you write once at $3.75/MTok and read 99 times at $0.30/MTok: 2000 × $3.75/1e6 + 99 × 2000 × $0.30/1e6 = $0.0669. That is an 88.8% reduction on the prefix.⁴ The requestCost helper applies the same math to a real usage object so you can log the exact dollar cost of any response.

Step 5: Automatic caching for conversations

Explicit breakpoints are precise but tedious in a multi-turn chat, where the "last stable block" moves every turn. Automatic caching solves that: set a single cache_control at the top level of the request and the system applies the breakpoint to the last cacheable block, advancing it as the history grows.³ Create src/automatic.ts:

import { client, summarizeUsage } from './client.js';
import type Anthropic from '@anthropic-ai/sdk';

// Automatic caching: one top-level cache_control marker. The breakpoint moves
// to the last cacheable block on every request as the history grows.
export async function chat(history: Anthropic.MessageParam[]) {
  const res = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 512,
    cache_control: { type: 'ephemeral' }, // top-level => automatic breakpoint
    system: 'You are a terse senior engineer.',
    messages: history,
  });
  console.log(summarizeUsage(res.usage));
  return res;
}

Each turn caches everything up to the last block and reads the prior turns back. Automatic caching uses the same infrastructure as explicit breakpoints — identical pricing, minimum thresholds, and the 20-block lookback — and it consumes one of your four breakpoint slots.³ You can combine the two: an explicit breakpoint on a long system prompt plus automatic caching for the conversation. If the last block already carries an explicit marker with the same TTL, the top-level field is a harmless no-op; with a different TTL the API returns a 400.³

Step 6: Extended TTL and multiple breakpoints

A five-minute window is fine for back-to-back requests but loses the entry between sparser calls. Opt into a one-hour lifetime by adding ttl to the marker. Create src/extended-ttl.ts:

import { readFile } from 'node:fs/promises';
import { client } from './client.js';

const REFERENCE_DOC = await readFile('handbook.md', 'utf8');

export async function queryDoc(question: string) {
  const res = await client.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 512,
    system: [
      {
        type: 'text',
        text: REFERENCE_DOC,
        cache_control: { type: 'ephemeral', ttl: '1h' }, // hold for one hour
      },
    ],
    messages: [{ role: 'user', content: question }],
  });
  const c = res.usage.cache_creation;
  console.log(`5m_write=${c?.ephemeral_5m_input_tokens ?? 0} 1h_write=${c?.ephemeral_1h_input_tokens ?? 0}`);
  return res;
}

The one-hour write costs 2× base instead of 1.25×, so reserve it for prompts reused less often than every five minutes; anything hit more frequently is cheaper on the default TTL, which refreshes for free on each use.⁴ When you mix durations, usage.cache_creation breaks the write down into ephemeral_5m_input_tokens and ephemeral_1h_input_tokens.⁶

You can define up to four breakpoints. The reason to add a second is the 20-block lookback limit: the system checks at most 20 positions back from a breakpoint for a prior cache write, so in a long conversation a breakpoint that drifts 20-plus blocks past the last write stops hitting. A second breakpoint nearer that position keeps a usable entry alive.³ Breakpoints themselves are free — you pay only for tokens actually written and read.³

Verification

Everything above is type-checked against the installed SDK. The cost math is also testable with no API key. Create src/cost.test.ts:

import { test } from 'node:test';
import assert from 'node:assert/strict';
import { PRICING, prefixCost } from './cost.js';

test('2k-token Sonnet prefix reused 100x saves ~89%', () => {
  const p = PRICING['claude-sonnet-4-6']!;
  const { cached, uncached, savedPct } = prefixCost(2000, 100, p);
  console.log(`uncached=$${uncached.toFixed(4)} cached=$${cached.toFixed(4)} saved=${savedPct.toFixed(1)}%`);
  assert.ok(Math.abs(uncached - 0.6) < 1e-9);
  assert.ok(savedPct > 88 && savedPct < 90);
});

Then run both checks:

npx tsc --noEmit          # TypeScript 6.0.3 strict: clean
node --import tsx --test src/cost.test.ts

The test confirms the headline number: the 2,000-token-by-100 example resolves to uncached=$0.6000 cached=$0.0669 saved=88.8%. Because requestCost and prefixCost need no API key, you can fold cost assertions into CI. For the API-calling scripts, the live check is the usage fields: a second call inside the window should print read= equal to your prefix size.

Troubleshooting

Both usage counters are 0. The prefix did not reach the minimum length. Sonnet 4.6 needs 1,024 tokens; Haiku 4.5 needs 4,096. Lengthen the cached content or move shorter prompts to a model with a lower threshold.⁵

The cache writes but never reads. Your breakpoint is on a block that changes every request — a timestamp, per-request context, or the user message itself. Move the marker to the end of the static prefix so the hashed prefix is identical across calls.³

Hits stop after a gap. The default entry lives five minutes from its last use. Calls spaced further apart need the ttl: '1h' marker, or a request cadence under five minutes to keep the entry warm for free.³

A hit suddenly turns into a write. Something upstream in the prefix changed. Changing tool definitions invalidates the entire cache; editing the system prompt invalidates the system and message caches; and toggling tool_choice or adding or removing an image anywhere in the prompt forces a new entry.³ Caching follows the tools → system → messages hierarchy, and a change at any level invalidates that level and everything after it.³

Parallel requests all miss on the first burst. A cache entry only becomes available after the first response begins, so fire one request, wait for it, then send the rest of the batch.⁵

Next steps and further reading

Prompt caching composes with the rest of the Claude API. Pair it with structured outputs in TypeScript to cache a long schema-bearing system prompt across an extraction run, with the tool-use agentic loop where stable tool definitions are prime caching targets, or stack it on the Claude Batch API tutorial — the caching multipliers apply on top of the batch discount.⁴ Start with automatic caching for conversations and reach for explicit breakpoints when different sections change at different rates.³

Node.js release schedule, Node 24 "Krypton" Active LTS. https://nodejs.org/en/about/previous-releases ↩
Package versions verified via npm view <pkg> version, June 15, 2026. https://www.npmjs.com/package/@anthropic-ai/sdk ↩
Anthropic, "Prompt caching" — automatic vs. explicit caching, breakpoints, the 20-block lookback, TTL, and invalidation. https://platform.claude.com/docs/en/build-with-claude/prompt-caching ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶ ↩⁷ ↩⁸ ↩⁹ ↩¹⁰ ↩¹¹ ↩¹² ↩¹³
Anthropic, "Prompt caching" pricing table and multipliers (5m write 1.25×, 1h write 2×, read 0.1×). https://platform.claude.com/docs/en/build-with-claude/prompt-caching#pricing ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Anthropic, "Prompt caching" cache limitations — per-model minimum cacheable prompt lengths and silent no-cache behavior. https://platform.claude.com/docs/en/build-with-claude/prompt-caching#cache-limitations ↩ ↩² ↩³ ↩⁴ ↩⁵
Anthropic, "Prompt caching" tracking cache performance — cache_creation_input_tokens, cache_read_input_tokens, input_tokens. https://platform.claude.com/docs/en/build-with-claude/prompt-caching#tracking-cache-performance ↩ ↩² ↩³