LangSmith Deep Dive
Multi-turn Evaluation
3 min read
Real conversations span multiple turns. Evaluating just single responses misses context, coherence, and conversation flow. LangSmith supports evaluating entire conversation threads.
Why Multi-turn Evaluation?
| Single-turn | Multi-turn |
|---|---|
| Evaluates one response | Evaluates conversation flow |
| Misses context issues | Catches context loss |
| Can't detect contradictions | Finds inconsistencies |
| Ignores conversation goals | Measures goal completion |
Conversation Structure
A multi-turn conversation trace:
Conversation Thread
├── Turn 1: User asks question
│   └── Assistant responds
├── Turn 2: User follows up
│   └── Assistant responds (using context)
├── Turn 3: User asks clarification
│   └── Assistant responds (maintaining coherence)
└── Turn 4: User confirms resolution
    └── Assistant closes appropriately
Tracing Conversations
Use thread IDs to link conversation turns:
    from langsmith import traceable
    from openai import OpenAI
    import uuid

    client = OpenAI()

    @traceable
    def chat(messages: list, thread_id: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
        )
        return response.choices[0].message.content

    # Same thread_id links all turns. LangSmith groups runs into a thread
    # when the id is also attached as run metadata (langsmith_extra below).
    thread_id = str(uuid.uuid4())

    # Turn 1
    response1 = chat(
        [{"role": "user", "content": "I need to cancel my order"}],
        thread_id=thread_id,
        langsmith_extra={"metadata": {"thread_id": thread_id}},
    )

    # Turn 2 - includes history
    response2 = chat(
        [
            {"role": "user", "content": "I need to cancel my order"},
            {"role": "assistant", "content": response1},
            {"role": "user", "content": "Order #12345"},
        ],
        thread_id=thread_id,
        langsmith_extra={"metadata": {"thread_id": thread_id}},
    )
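In practice you wouldn't rebuild the message list by hand for every turn. A small helper that appends each exchange to a shared history keeps the example above manageable; this is only a sketch reusing the chat function and thread_id defined above.

    history: list = []

    def send(user_message: str) -> str:
        """Append the user turn, call the traced chat function, and record the reply."""
        history.append({"role": "user", "content": user_message})
        reply = chat(
            history,
            thread_id=thread_id,
            langsmith_extra={"metadata": {"thread_id": thread_id}},
        )
        history.append({"role": "assistant", "content": reply})
        return reply

    send("I need to cancel my order")
    send("Order #12345")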
Multi-turn Evaluation Criteria
Evaluate conversations on:
1. Context Retention
Does the assistant remember earlier turns?
    def evaluate_context_retention(conversation: list) -> dict:
        # Check if later responses reference earlier context
        prompt = """
        Review this conversation. Did the assistant maintain
        context from earlier turns appropriately?
        Conversation: {conversation}
        Score 1-5 and explain.
        """
        return llm_judge(prompt, conversation=conversation)
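The llm_judge helper used here (and in the evaluators below) is not a LangSmith built-in. A minimal sketch, assuming the OpenAI client from the tracing example and the render_conversation helper from earlier:

    import json

    def llm_judge(prompt_template: str, conversation: list) -> dict:
        """Fill the prompt template, ask a grader model, and parse its JSON verdict."""
        prompt = prompt_template.format(conversation=render_conversation(conversation))
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a strict evaluator. Respond with a single JSON object."},
                {"role": "user", "content": prompt},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)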
2. Goal Completion
Did the conversation achieve its purpose? The table below lists common goals; a judge sketch follows it.
| Conversation Goal | Success Criteria |
|---|---|
| Support ticket | Issue resolved |
| Information query | Question answered completely |
| Task completion | Action confirmed |
| Clarification | Understanding verified |
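For goals like these, an outcome-focused judge can compare the conversation against the expected result stored in your dataset. This is a sketch following the same pattern as the other evaluators; expected_outcome comes from the dataset format shown later.

    def evaluate_goal_completion(conversation: list, expected_outcome: str) -> dict:
        # Embed the expected outcome directly; {conversation} is filled in by llm_judge.
        prompt = (
            "Review this conversation and decide whether it achieved the goal.\n"
            f"Expected outcome: {expected_outcome}\n"
            "Conversation: {conversation}\n"
            "Return JSON with: goal_achieved (true/false), evidence."
        )
        return llm_judge(prompt, conversation=conversation)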
3. Coherence
No contradictions between turns:
    def evaluate_coherence(conversation: list) -> dict:
        prompt = """
        Check if the assistant contradicted itself
        at any point in this conversation.
        Conversation: {conversation}
        Return: coherent (yes/no), contradictions found.
        """
        return llm_judge(prompt, conversation=conversation)
4. Appropriate Length
Conversations shouldn't be unnecessarily long; a heuristic check follows this list:
- Did it resolve in reasonable turns?
- Were there redundant exchanges?
- Did the user have to repeat themselves?
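These checks don't need an LLM judge; simple heuristics over the turn list cover most of them. A sketch, with the turn budget as an illustrative default:

    def evaluate_length(conversation: list, max_turns: int = 8) -> dict:
        """Heuristic length check: total turns and repeated user messages."""
        user_turns = [t["content"].strip().lower() for t in conversation if t["role"] == "user"]
        repeated = len(user_turns) - len(set(user_turns))  # user had to repeat themselves
        return {
            "num_turns": len(conversation),
            "within_budget": len(conversation) <= max_turns,
            "repeated_user_messages": repeated,
        }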
Building Multi-turn Datasets
Structure your evaluation data:
    conversations:
      - id: "conv-001"
        scenario: "Order cancellation"
        turns:
          - role: user
            content: "I want to cancel my order"
          - role: assistant
            content: "I can help with that. What's your order number?"
          - role: user
            content: "12345"
          - role: assistant
            content: "Order #12345 has been cancelled. Refund in 3-5 days."
        expected_outcome: "Order cancelled successfully"
        max_turns: 4
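To run experiments against these conversations, you can load the file and push each one into a LangSmith dataset. A rough sketch, assuming the YAML above is saved as conversations.yaml, PyYAML is installed, and the dataset name is arbitrary:

    import yaml
    from langsmith import Client

    ls_client = Client()  # separate name to avoid clashing with the OpenAI client

    with open("conversations.yaml") as f:
        data = yaml.safe_load(f)

    dataset = ls_client.create_dataset(
        dataset_name="multi-turn-support-conversations",
        description="Scripted multi-turn conversations for evaluation",
    )

    ls_client.create_examples(
        inputs=[{"scenario": c["scenario"], "turns": c["turns"]} for c in data["conversations"]],
        outputs=[{"expected_outcome": c["expected_outcome"], "max_turns": c["max_turns"]} for c in data["conversations"]],
        dataset_id=dataset.id,
    )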
Key Metrics
Track these for conversation quality (an aggregation sketch follows the table):
| Metric | Description |
|---|---|
| Resolution rate | % conversations achieving goal |
| Average turns | Turns needed for resolution |
| Escalation rate | % needing human handoff |
| Satisfaction | End-of-conversation feedback |
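The first two metrics fall straight out of your per-conversation results. A small aggregation sketch, assuming each result dict carries the illustrative keys used here:

    def summarize(results: list) -> dict:
        """Aggregate per-conversation results into thread-level quality metrics."""
        total = len(results)
        if total == 0:
            return {}
        return {
            "resolution_rate": sum(1 for r in results if r.get("goal_achieved")) / total,
            "average_turns": sum(r.get("num_turns", 0) for r in results) / total,
            "escalation_rate": sum(1 for r in results if r.get("escalated")) / total,
        }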
Tip: Start with your longest conversations—they often reveal the most issues with context retention.
Next, we'll learn how to build custom evaluators and manage evaluation datasets in LangSmith.