Checkpointing Fundamentals

Why Checkpointing is Critical

Real Scenario (January 2026): A 4-hour research workflow crashed at hour 3. Without checkpointing: restart from zero, $180 wasted. With checkpointing: resume from last state, complete in 20 minutes.

What is Checkpointing?

Checkpointing saves the complete workflow state at each step, enabling:

Resume after crashes - Pick up where you left off
Time-travel debugging - Inspect any previous state
Human-in-the-loop - Pause, get approval, continue
Branching - Fork from any checkpoint

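from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class State(TypedDict):
    query: str

# Minimal one-node graph so the example runs end to end
graph = StateGraph(State)
graph.add_node("research", lambda state: {"query": state["query"]})
graph.add_edge(START, "research")
graph.add_edge("research", END)

# Create checkpointer
checkpointer = MemorySaver()

# Compile with checkpointing
app = graph.compile(checkpointer=checkpointer)

# Run with thread_id for state isolation
config = {"configurable": {"thread_id": "user-123"}}
result = app.invoke({"query": "research AI"}, config)

# Invoke again with the same thread_id: the saved state is loaded,
# then the new input is applied on top of it
result2 = app.invoke({"query": "continue"}, config)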

MemorySaver: Development & Testing

from langgraph.checkpoint.memory import MemorySaver

# Simple in-memory storage
checkpointer = MemorySaver()

# Pros:
# - Zero setup
# - Fast (RAM)
# - Great for testing

# Cons:
# - Lost on restart
# - Single process only
# - Not for production

Use For: Local development, unit tests, demos.

State Snapshots: What Gets Saved

from typing import TypedDict, Annotated
import operator

class ResearchState(TypedDict):
    query: str
    documents: Annotated[list[str], operator.add]
    analysis: str
    iteration: int

# After each step, LangGraph saves the current state values, e.g.:
# {
#     "query": "current query",
#     "documents": ["doc1", "doc2", ...],
#     "analysis": "current analysis",
#     "iteration": 3
# }
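
The Annotated[list[str], operator.add] field is why documents accumulates across steps while the other keys are overwritten. The toy merge below illustrates that behavior using only the standard library. It is a conceptual sketch, not LangGraph's internal implementation, and apply_update is a hypothetical helper.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class ResearchState(TypedDict):
    query: str
    documents: Annotated[list[str], operator.add]
    analysis: str
    iteration: int


def apply_update(state: dict, update: dict) -> dict:
    """Toy merge: use the annotated reducer if a key has one, else overwrite."""
    hints = get_type_hints(ResearchState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated[...] types expose their extra arguments via __metadata__
        reducers = getattr(hints[key], "__metadata__", ())
        if reducers and key in merged:
            merged[key] = reducers[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"query": "research AI", "documents": ["doc1"], "analysis": "", "iteration": 1}

# A node returns a partial update; documents is appended, iteration is replaced
state = apply_update(state, {"documents": ["doc2"], "iteration": 2})
```

After the update, documents holds both entries and iteration holds the new value. This is the shape of state a checkpoint would capture at that step.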

Checkpoints include:

Full state dictionary
Current node position
Pending tasks (for parallel execution)
Metadata (step number, source), plus a creation timestamp and the config that carries the thread_id

Thread Isolation

# Each thread_id gets isolated state
config_user1 = {"configurable": {"thread_id": "user-1"}}
config_user2 = {"configurable": {"thread_id": "user-2"}}

# User 1's workflow
app.invoke({"query": "AI research"}, config_user1)

# User 2's workflow - completely separate state
app.invoke({"query": "ML training"}, config_user2)

# Resume User 1 - gets User 1's state only
app.invoke({"query": "continue"}, config_user1)

Checkpoint Structure

# Get current checkpoint
state = app.get_state(config)

# Checkpoint contains:
print(state.values)      # Current state dict
print(state.next)        # Next node(s) to execute
print(state.config)      # Configuration
print(state.metadata)    # Step number, source, etc.
print(state.created_at)  # Creation timestamp
print(state.parent_config)  # Previous checkpoint

# List all checkpoints for the thread (most recent first)
history = list(app.get_state_history(config))
for checkpoint in history:
    print(f"Step: {checkpoint.metadata.get('step')}")

Interview Questions

Q: What is checkpointing in LangGraph?

"Checkpointing saves the complete workflow state at each step: state values, current position, and pending tasks. It enables crash recovery, time-travel debugging, human-in-the-loop workflows, and branching from any point."

Q: When should you use MemorySaver vs. a production checkpointer?

"Use MemorySaver for development, testing, and demos: it is fast, but its state is lost on restart. Production needs SqliteSaver (single instance) or PostgresSaver (distributed). Never use MemorySaver in production."

Key Takeaways

✅ Checkpointing saves full state at each step
✅ Thread IDs isolate user sessions
✅ MemorySaver only for development
✅ Enables resume, time-travel, human-in-the-loop
