Production MCP Systems
Monitoring and Observability
Production MCP servers need monitoring across logs, metrics, and traces so that failures are detected quickly and can be debugged after the fact.
The Three Pillars
| Pillar | Purpose | Tools |
|---|---|---|
| Logs | Event details | Structured logging |
| Metrics | Measurements | Prometheus, StatsD |
| Traces | Request flow | OpenTelemetry |
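The lesson shows logs and metrics in detail but not traces, so here is a minimal sketch of what a trace records: named spans with a duration and a parent. It uses only the standard library and is not OpenTelemetry's API; `span`, `FINISHED_SPANS`, and the span names are hypothetical names chosen for illustration.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

# ID of the span currently in progress, or None at the top level.
_current_span = contextvars.ContextVar("current_span", default=None)

# Finished spans are collected here; a real tracer would export them instead.
FINISHED_SPANS = []


@contextmanager
def span(name):
    """Record how long the wrapped block took and which span enclosed it."""
    span_id = uuid.uuid4().hex[:8]
    parent_id = _current_span.get()
    token = _current_span.set(span_id)
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        # Restore the enclosing span before recording this one.
        _current_span.reset(token)
        FINISHED_SPANS.append({
            "name": name,
            "span_id": span_id,
            "parent_id": parent_id,
            "duration_s": time.perf_counter() - start,
        })


# One tool call that performs one nested operation.
with span("call_tool"):
    with span("db_query"):
        time.sleep(0.01)
```

Inner spans finish first, so `FINISHED_SPANS` lists `db_query` before `call_tool`, and the `parent_id` of the former points at the latter. That parent link is what lets a tracing backend reconstruct the request flow.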
Structured Logging
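import json
import logging
from datetime import datetime, timezone

# Attribute names present on every LogRecord; anything else came from `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            # Use the time the record was created, as a timezone-aware UTC value
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
        }
        # logging copies the keys of `extra=` onto the record itself (there is
        # no `record.extra`), so collect every non-standard attribute.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                log_data[key] = value
        return json.dumps(log_data, default=str)

# Configure logging
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("mcp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("Tool called", extra={"tool": "search", "user": "user123"})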
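Passing `extra=` on every call gets repetitive for fields that belong to a whole request. One possible approach, sketched below with the standard library only, is a `logging.Filter` that stamps each record with a request ID held in a `contextvars.ContextVar`. The names `request_id_var`, `RequestIdFilter`, and `MiniJSONFormatter` are assumptions made for this example, and the formatter is a cut-down stand-in so the block runs on its own.

```python
import contextvars
import io
import json
import logging

# Holds the ID of the request being handled in the current task or thread.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record that passes through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # keep the record; this filter only annotates it


class MiniJSONFormatter(logging.Formatter):
    """Cut-down JSON formatter so this sketch is self-contained."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(MiniJSONFormatter())
handler.addFilter(RequestIdFilter())

demo_logger = logging.getLogger("mcp.demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # avoid duplicate output via the root logger

# Set once at the start of a request; every later log line carries it.
request_id_var.set("req-42")
demo_logger.info("Tool called")
demo_logger.info("Tool finished")

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because a `ContextVar` is scoped to the running asyncio task, concurrent requests would each see their own ID without passing it through every function call.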
Metrics with Prometheus
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from starlette.responses import Response

# Define metrics
TOOL_CALLS = Counter(
    "mcp_tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"],
)
TOOL_LATENCY = Histogram(
    "mcp_tool_latency_seconds",
    "Tool call latency",
    ["tool_name"],
)

# Instrument tool calls (`server` is assumed to be the MCP server instance
# created elsewhere)
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    # .time() records the elapsed time when the block exits, even on error
    with TOOL_LATENCY.labels(tool_name=name).time():
        try:
            result = await execute_tool(name, arguments)
        except Exception:
            TOOL_CALLS.labels(tool_name=name, status="error").inc()
            raise
        TOOL_CALLS.labels(tool_name=name, status="success").inc()
        return result

# Metrics endpoint (`app` is assumed to be the Starlette app created elsewhere)
@app.route("/metrics")
async def metrics(request):
    # CONTENT_TYPE_LATEST carries the exposition-format version and charset
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
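To make sense of what the `Histogram` above exports, it helps to see how cumulative buckets work. Prometheus exposes the counts as `mcp_tool_latency_seconds_bucket` series with an `le` (less than or equal) label. The sketch below is a simplified standard-library model, not `prometheus_client`'s implementation; the bucket bounds and latencies are hypothetical, and `estimate_quantile` only approximates the linear interpolation that PromQL's `histogram_quantile` performs.

```python
import bisect

# Upper bounds in seconds; hypothetical values chosen for this sketch.
BUCKET_BOUNDS = [0.1, 0.5, 1.0, 5.0, float("inf")]


def observe_all(latencies):
    """Return cumulative counts: counts[i] = observations <= BUCKET_BOUNDS[i]."""
    counts = [0] * len(BUCKET_BOUNDS)
    for value in latencies:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect.bisect_left(BUCKET_BOUNDS, value)
        for i in range(first, len(BUCKET_BOUNDS)):
            counts[i] += 1
    return counts


def estimate_quantile(q, counts):
    """Estimate a quantile by interpolating inside the bucket holding its rank."""
    total = counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    idx = next(i for i, count in enumerate(counts) if count >= rank)
    if BUCKET_BOUNDS[idx] == float("inf"):
        # Cannot interpolate into +Inf; fall back to the highest finite bound.
        return BUCKET_BOUNDS[idx - 1]
    lower_bound = BUCKET_BOUNDS[idx - 1] if idx > 0 else 0.0
    lower_count = counts[idx - 1] if idx > 0 else 0
    in_bucket = counts[idx] - lower_count
    if in_bucket == 0:
        return lower_bound
    fraction = (rank - lower_count) / in_bucket
    return lower_bound + (BUCKET_BOUNDS[idx] - lower_bound) * fraction


latencies = [0.05, 0.2, 0.3, 0.4, 0.7, 0.9, 2.0, 3.0, 4.0, 6.0]
counts = observe_all(latencies)          # [1, 4, 6, 9, 10]
median = estimate_quantile(0.5, counts)  # 0.75
p95 = estimate_quantile(0.95, counts)    # 5.0
```

The estimated median of 0.75 differs from the true median of the raw values because only bucket counts are kept. Accuracy therefore depends on placing bucket bounds near the latencies you care about, such as an alert threshold.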
Health Checks
from starlette.responses import JSONResponse

@app.route("/health")
async def health(request):
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "external_api": await check_external_api(),
    }
    all_healthy = all(checks.values())
    # 503 tells load balancers and orchestrators to stop routing traffic here
    status_code = 200 if all_healthy else 503
    return JSONResponse(
        {"status": "healthy" if all_healthy else "unhealthy", "checks": checks},
        status_code=status_code,
    )

async def check_database():
    try:
        await db.execute("SELECT 1")
        return True
    except Exception:
        # A bare `except:` would also swallow task cancellation
        return False
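The health endpoint above awaits its checks one after another, so a single hanging dependency can stall the whole response. A possible refinement, sketched here with `asyncio` only, runs the checks concurrently and bounds each with a timeout. `run_checks`, its `timeout_s` parameter, and the three stand-in check functions are hypothetical names for illustration.

```python
import asyncio


async def run_checks(checks, timeout_s=2.0):
    """Run all checks concurrently; a slow or failing check counts as unhealthy.

    `checks` maps a name to a zero-argument coroutine function.
    """

    async def guarded(check):
        try:
            return bool(await asyncio.wait_for(check(), timeout_s))
        except Exception:
            # Covers the timeout raised by wait_for as well as check errors.
            return False

    results = await asyncio.gather(*(guarded(c) for c in checks.values()))
    return dict(zip(checks.keys(), results))


# Hypothetical stand-ins for check_database, check_redis, check_external_api.
async def fast_ok():
    return True


async def hangs():
    await asyncio.sleep(10)
    return True


async def raises():
    raise ConnectionError("connection refused")


results = asyncio.run(
    run_checks(
        {"database": fast_ok, "redis": hangs, "external_api": raises},
        timeout_s=0.05,
    )
)
```

With this shape the endpoint responds in roughly `timeout_s` at worst, regardless of how many dependencies are checked.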
Alerting Rules
Configure alerts in Prometheus/Grafana:
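groups:
  - name: mcp-alerts
    rules:
      - alert: HighErrorRate
        # More than 0.1 errors per second for one tool, sustained for 5 minutes
        expr: rate(mcp_tool_calls_total{status="error"}[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High error rate in MCP server"
      - alert: SlowToolCalls
        # histogram_quantile needs per-bucket rates, not the bare metric name
        expr: histogram_quantile(0.95, sum by (le, tool_name) (rate(mcp_tool_latency_seconds_bucket[5m]))) > 5
        for: 5m
        annotations:
          summary: "Tool calls taking too long"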
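The `HighErrorRate` threshold of 0.1 is an absolute rate in errors per second, not a fraction of calls. The arithmetic can be sketched as follows, using a deliberately simplified model of `rate()` that ignores the counter-reset handling and window extrapolation Prometheus applies; the sample values are made up for illustration.

```python
def per_second_rate(first_sample, last_sample):
    """Simplified rate(): counter increase divided by elapsed seconds.

    Each sample is a (timestamp_seconds, counter_value) pair.
    """
    (t0, v0), (t1, v1) = first_sample, last_sample
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples of mcp_tool_calls_total taken 300 s (5 minutes) apart.
error_rate = per_second_rate((0, 120), (300, 180))    # 60 errors in 5 minutes
total_rate = per_second_rate((0, 4000), (300, 5200))  # 1200 calls in 5 minutes

# Comparison used by HighErrorRate: errors per second against 0.1.
fires_absolute = error_rate > 0.1

# Alternative view: the share of calls that failed in the window.
error_ratio = error_rate / total_rate
```

Here the alert condition holds at 0.2 errors per second even though only 5% of calls failed. On a busy server an absolute threshold may fire at a failure share that is acceptable, so dividing the error rate by the total call rate is a common alternative worth considering.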
Next, let's explore testing strategies.