LangSmith Deep Dive

Multi-turn Evaluation


Real conversations span multiple turns. Evaluating single responses in isolation misses context, coherence, and conversation flow. LangSmith supports evaluating entire conversation threads.

Why Multi-turn Evaluation?

| Single-turn | Multi-turn |
| --- | --- |
| Evaluates one response | Evaluates conversation flow |
| Misses context issues | Catches context loss |
| Can't detect contradictions | Finds inconsistencies |
| Ignores conversation goals | Measures goal completion |

Conversation Structure

A multi-turn conversation trace:

Conversation Thread
├── Turn 1: User asks question
│   └── Assistant responds
├── Turn 2: User follows up
│   └── Assistant responds (using context)
├── Turn 3: User asks clarification
│   └── Assistant responds (maintaining coherence)
└── Turn 4: User confirms resolution
    └── Assistant closes appropriately

Tracing Conversations

Use thread IDs to link conversation turns:

from langsmith import traceable
from openai import OpenAI
import uuid

client = OpenAI()

@traceable
def chat(messages: list, thread_id: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Same thread_id links all turns. LangSmith groups runs into a thread
# when they share a metadata key such as "thread_id" (also accepted:
# "session_id" or "conversation_id"), passed via langsmith_extra.
thread_id = str(uuid.uuid4())

# Turn 1
response1 = chat(
    [{"role": "user", "content": "I need to cancel my order"}],
    thread_id=thread_id,
    langsmith_extra={"metadata": {"thread_id": thread_id}},
)

# Turn 2 - includes history
response2 = chat(
    [
        {"role": "user", "content": "I need to cancel my order"},
        {"role": "assistant", "content": response1},
        {"role": "user", "content": "Order #12345"}
    ],
    thread_id=thread_id,
    langsmith_extra={"metadata": {"thread_id": thread_id}},
)

Multi-turn Evaluation Criteria

Evaluate conversations on:

1. Context Retention

Does the assistant remember earlier turns?

def evaluate_context_retention(conversation: list) -> dict:
    # Check if later responses reference earlier context
    prompt = """
    Review this conversation. Did the assistant maintain
    context from earlier turns appropriately?

    Conversation: {conversation}

    Score 1-5 and explain.
    """
    return llm_judge(prompt, conversation=conversation)
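The evaluator above delegates to an `llm_judge` helper that the lesson doesn't define. A minimal sketch, assuming an OpenAI client and a judge model that replies with a JSON verdict (the helper name, the parsing strategy, and the output shape are illustrative, not part of LangSmith):

```python
import json

def parse_judge_output(text: str) -> dict:
    """Pull the first {...} JSON object out of the judge's raw reply."""
    start, end = text.find("{"), text.rfind("}")
    return json.loads(text[start:end + 1])

def llm_judge(prompt: str, **fields) -> dict:
    """Fill the prompt template, ask the judge model, parse its verdict."""
    from openai import OpenAI  # deferred so the parser is usable standalone
    filled = prompt.format(**fields)
    reply = OpenAI().chat.completions.create(
        model="gpt-5.4-mini",  # same model as the chat example; any strong judge works
        messages=[{"role": "user", "content": filled}],
    )
    return parse_judge_output(reply.choices[0].message.content)
```

Asking the judge for structured output keeps scores machine-readable, so evaluation runs can be aggregated rather than eyeballed.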

2. Goal Completion

Did the conversation achieve its purpose?

| Conversation Goal | Success Criteria |
| --- | --- |
| Support ticket | Issue resolved |
| Information query | Question answered completely |
| Task completion | Action confirmed |
| Clarification | Understanding verified |

3. Coherence

No contradictions between turns:

def evaluate_coherence(conversation: list) -> dict:
    prompt = """
    Check if the assistant contradicted itself
    at any point in this conversation.

    Conversation: {conversation}

    Return: coherent (yes/no), contradictions found.
    """
    return llm_judge(prompt, conversation=conversation)

4. Appropriate Length

Conversation shouldn't be unnecessarily long:

  • Did it resolve in reasonable turns?
  • Were there redundant exchanges?
  • Did the user have to repeat themselves?
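The length checks above are simple enough to compute without an LLM. A sketch (the turn budget and repetition heuristic are assumptions for illustration):

```python
def evaluate_length(conversation: list, max_turns: int = 6) -> dict:
    """Flag conversations that run long or make the user repeat themselves."""
    user_msgs = [
        t["content"].strip().lower() for t in conversation if t["role"] == "user"
    ]
    return {
        "turns": len(user_msgs),                       # user turns = conversation turns
        "within_budget": len(user_msgs) <= max_turns,  # resolved in reasonable turns?
        "user_repeated": len(user_msgs) != len(set(user_msgs)),  # verbatim repeats
    }

check = evaluate_length([
    {"role": "user", "content": "I need to cancel my order"},
    {"role": "assistant", "content": "Sure, what's the order number?"},
    {"role": "user", "content": "Order #12345"},
    {"role": "assistant", "content": "Done."},
])
```

Exact-match repetition only catches verbatim repeats; paraphrased repetition still needs an LLM judge.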

Building Multi-turn Datasets

Structure your evaluation data:

conversations:
  - id: "conv-001"
    scenario: "Order cancellation"
    turns:
      - role: user
        content: "I want to cancel my order"
      - role: assistant
        content: "I can help with that. What's your order number?"
      - role: user
        content: "12345"
      - role: assistant
        content: "Order #12345 has been cancelled. Refund in 3-5 days."
    expected_outcome: "Order cancelled successfully"
    max_turns: 4
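A sketch of turning this YAML into a LangSmith dataset. The conversion function is runnable; the upload calls (`Client.create_dataset`, `Client.create_example`) are real LangSmith client methods but are left commented since they need an API key, and the `conversations.yaml` filename is an assumption:

```python
def to_examples(data: dict) -> list:
    """Map the YAML schema above onto (inputs, outputs) example pairs."""
    return [
        (
            {"scenario": c["scenario"], "turns": c["turns"]},
            {"expected_outcome": c["expected_outcome"]},
        )
        for c in data["conversations"]
    ]

# Upload sketch (assumes LANGSMITH_API_KEY is set):
# import yaml
# from langsmith import Client
# client = Client()
# dataset = client.create_dataset(dataset_name="multi-turn-conversations")
# with open("conversations.yaml") as f:
#     for inputs, outputs in to_examples(yaml.safe_load(f)):
#         client.create_example(inputs=inputs, outputs=outputs, dataset_id=dataset.id)

pairs = to_examples({
    "conversations": [{
        "id": "conv-001",
        "scenario": "Order cancellation",
        "turns": [{"role": "user", "content": "I want to cancel my order"}],
        "expected_outcome": "Order cancelled successfully",
    }]
})
```

Putting the full turn list in `inputs` and the expected outcome in `outputs` lets evaluators compare what actually happened against the goal.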

Key Metrics

Track these for conversation quality:

| Metric | Description |
| --- | --- |
| Resolution rate | % of conversations achieving their goal |
| Average turns | Turns needed for resolution |
| Escalation rate | % needing human handoff |
| Satisfaction | End-of-conversation feedback |
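The first three metrics fall out of per-conversation evaluation results with simple aggregation. A sketch, assuming each result is a dict with `resolved`, `turns`, and `escalated` fields (that shape is an assumption, not a LangSmith schema):

```python
def summarize(results: list) -> dict:
    """Aggregate per-conversation eval results into the metrics above."""
    n = len(results)
    return {
        "resolution_rate": sum(r["resolved"] for r in results) / n,
        "average_turns": sum(r["turns"] for r in results) / n,
        "escalation_rate": sum(r["escalated"] for r in results) / n,
    }

metrics = summarize([
    {"resolved": True, "turns": 3, "escalated": False},
    {"resolved": True, "turns": 5, "escalated": False},
    {"resolved": False, "turns": 8, "escalated": True},
])
```

Tracking these over time shows whether prompt or model changes actually move conversation quality, not just single-response scores.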

Tip: Start with your longest conversations—they often reveal the most issues with context retention.

Next, we'll learn how to build custom evaluators and manage evaluation datasets in LangSmith.
