LangSmith Deep Dive

Multi-turn Evaluation

Real conversations span multiple turns. Evaluating individual responses in isolation misses context carryover, coherence, and conversation flow. LangSmith supports evaluating entire conversation threads.

Why Multi-turn Evaluation?

Single-turn                     Multi-turn
Evaluates one response          Evaluates conversation flow
Misses context issues           Catches context loss
Can't detect contradictions     Finds inconsistencies
Ignores conversation goals      Measures goal completion

Conversation Structure

A multi-turn conversation trace:

Conversation Thread
├── Turn 1: User asks question
│   └── Assistant responds
├── Turn 2: User follows up
│   └── Assistant responds (using context)
├── Turn 3: User asks clarification
│   └── Assistant responds (maintaining coherence)
└── Turn 4: User confirms resolution
    └── Assistant closes appropriately

Tracing Conversations

Use a shared thread ID, attached as run metadata, to link conversation turns:

import uuid

from langsmith import traceable
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY and LangSmith tracing env vars are set

@traceable
def chat(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    return response.choices[0].message.content

# Runs that share a thread_id (or session_id / conversation_id) metadata
# value are grouped into one thread in LangSmith
thread_id = str(uuid.uuid4())

# Turn 1
response1 = chat(
    [{"role": "user", "content": "I need to cancel my order"}],
    langsmith_extra={"metadata": {"thread_id": thread_id}}
)

# Turn 2 - includes history
response2 = chat(
    [
        {"role": "user", "content": "I need to cancel my order"},
        {"role": "assistant", "content": response1},
        {"role": "user", "content": "Order #12345"}
    ],
    langsmith_extra={"metadata": {"thread_id": thread_id}}
)
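
To evaluate a finished conversation offline, you can pull every run in the thread back out of LangSmith. A sketch, assuming a project named "support-bot"; the metadata filter string follows LangSmith's run-filter syntax, so adjust it if your SDK version differs:

from langsmith import Client

ls_client = Client()

# Fetch all runs carrying this thread's metadata, oldest first
runs = sorted(
    ls_client.list_runs(
        project_name="support-bot",  # assumed project name
        filter=f'and(eq(metadata_key, "thread_id"), eq(metadata_value, "{thread_id}"))',
    ),
    key=lambda run: run.start_time,
)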

Multi-turn Evaluation Criteria

Evaluate conversations on:

1. Context Retention

Does the assistant remember earlier turns?

def evaluate_context_retention(conversation: list) -> dict:
    # Check if later responses reference earlier context
    prompt = """
    Review this conversation. Did the assistant maintain
    context from earlier turns appropriately?

    Conversation: {conversation}

    Score 1-5 and explain.
    """
    return llm_judge(prompt, conversation=conversation)
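
The llm_judge helper used here and in the checks that follow isn't shown. A minimal sketch, assuming an OpenAI judge model and prompts that ask for a single JSON object; adapt the parsing to whatever schema your prompts request:

import json
from openai import OpenAI

judge_client = OpenAI()

def llm_judge(prompt: str, **kwargs) -> dict:
    # Fill placeholders like {conversation}, then ask the judge model
    # to answer with one JSON object (score, explanation, etc.)
    filled = prompt.format(**{k: json.dumps(v) for k, v in kwargs.items()})
    response = judge_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with a single JSON object."},
            {"role": "user", "content": filled},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)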

2. Goal Completion

Did the conversation achieve its purpose?

Conversation Goal    Success Criteria
Support ticket       Issue resolved
Information query    Question answered completely
Task completion      Action confirmed
Clarification        Understanding verified
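
Goal completion can be scored the same way. A sketch that reuses the llm_judge helper above together with a per-conversation expected_outcome (the same field used in the dataset format later in this lesson):

def evaluate_goal_completion(conversation: list, expected_outcome: str) -> dict:
    # Ask the judge whether the conversation achieved its stated goal
    prompt = """
    Expected outcome: {expected_outcome}

    Conversation: {conversation}

    Did the conversation achieve the expected outcome?
    Return: goal_met (yes/no), explanation.
    """
    return llm_judge(
        prompt,
        conversation=conversation,
        expected_outcome=expected_outcome,
    )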

3. Coherence

No contradictions between turns:

def evaluate_coherence(conversation: list) -> dict:
    prompt = """
    Check if the assistant contradicted itself
    at any point in this conversation.

    Conversation: {conversation}

    Return: coherent (yes/no), contradictions found.
    """
    return llm_judge(prompt, conversation=conversation)

4. Appropriate Length

Conversations shouldn't be unnecessarily long; a simple heuristic check is sketched after this list:

  • Did it resolve in reasonable turns?
  • Were there redundant exchanges?
  • Did the user have to repeat themselves?
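
Length and redundancy don't need an LLM judge. A minimal sketch; the max_turns default and the exact-repeat heuristic are assumptions to tune per scenario:

def evaluate_length(conversation: list, max_turns: int = 6) -> dict:
    # Count user turns and exact repeats (a sign the user had to repeat themselves)
    user_messages = [
        m["content"].strip().lower()
        for m in conversation
        if m["role"] == "user"
    ]
    turns = len(user_messages)
    repeats = turns - len(set(user_messages))
    return {
        "within_budget": turns <= max_turns,
        "turns": turns,
        "repeated_user_messages": repeats,
    }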

Building Multi-turn Datasets

Structure your evaluation data:

conversations:
  - id: "conv-001"
    scenario: "Order cancellation"
    turns:
      - role: user
        content: "I want to cancel my order"
      - role: assistant
        content: "I can help with that. What's your order number?"
      - role: user
        content: "12345"
      - role: assistant
        content: "Order #12345 has been cancelled. Refund in 3-5 days."
    expected_outcome: "Order cancelled successfully"
    max_turns: 4
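
One way to turn these into a LangSmith dataset is to load the YAML and create one example per conversation with the SDK. A sketch, assuming PyYAML, a file named conversations.yaml, and a dataset name of your choosing:

import yaml
from langsmith import Client

ls_client = Client()

with open("conversations.yaml") as f:
    data = yaml.safe_load(f)

dataset = ls_client.create_dataset("multi-turn-conversations")  # assumed name

ls_client.create_examples(
    inputs=[
        {"scenario": conv["scenario"], "turns": conv["turns"]}
        for conv in data["conversations"]
    ],
    outputs=[
        {"expected_outcome": conv["expected_outcome"], "max_turns": conv["max_turns"]}
        for conv in data["conversations"]
    ],
    dataset_id=dataset.id,
)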

Key Metrics

Track these for conversation quality:

Metric             Description
Resolution rate    % of conversations achieving their goal
Average turns      Turns needed for resolution
Escalation rate    % needing human handoff
Satisfaction       End-of-conversation feedback
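
These roll up from per-conversation evaluation results. A sketch, assuming each result is a dict with hypothetical goal_met, turns, and escalated fields:

def summarize(results: list[dict]) -> dict:
    # Aggregate per-conversation results into thread-level quality metrics
    n = len(results)
    return {
        "resolution_rate": sum(r["goal_met"] for r in results) / n,
        "average_turns": sum(r["turns"] for r in results) / n,
        "escalation_rate": sum(r["escalated"] for r in results) / n,
    }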

Tip: Start with your longest conversations—they often reveal the most issues with context retention.

Next, we'll learn how to build custom evaluators and manage evaluation datasets in LangSmith.
