Production MCP Systems

Monitoring and Observability


Production MCP servers need monitoring that lets you detect failures quickly and trace them back to a cause.

The Three Pillars

| Pillar  | Purpose       | Tools              |
|---------|---------------|--------------------|
| Logs    | Event details | Structured logging |
| Metrics | Measurements  | Prometheus, StatsD |
| Traces  | Request flow  | OpenTelemetry      |

Structured Logging

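import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord carries; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            # Use the record's own creation time, as a timezone-aware UTC timestamp
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "function": record.funcName,
        }
        # logging copies the keys of `extra={...}` onto the record as attributes
        # (there is no `record.extra`), so collect the non-standard ones.
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                log_data[key] = value
        # default=str keeps non-JSON-serializable values from raising
        return json.dumps(log_data, default=str)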

# Configure logging
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger("mcp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("Tool called", extra={"tool": "search", "user": "user123"})
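
When a server handles many requests at once, it helps to stamp every log line with a correlation ID without passing it through each call. The sketch below shows one way to do that with only the standard library. The names `request_id_var`, `RequestContextFilter`, `MinimalJSONFormatter`, `build_logger`, and `handle_request` are hypothetical and exist only for this illustration; the small formatter is a stand-in so the example runs on its own.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the current request's correlation ID.
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestContextFilter(logging.Filter):
    """Copy the current request ID onto every record that passes through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


class MinimalJSONFormatter(logging.Formatter):
    """Small stand-in formatter so this sketch is self-contained."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })


def build_logger(stream):
    """Return a logger that writes JSON lines with request IDs to `stream`."""
    logger = logging.getLogger("mcp.request_context_demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(MinimalJSONFormatter())
    handler.addFilter(RequestContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger, tool_name):
    """Hypothetical handler: set the ID once, then log anywhere below it."""
    token = request_id_var.set(uuid.uuid4().hex)
    try:
        logger.info("Tool called: %s", tool_name)
        logger.info("Tool finished: %s", tool_name)
    finally:
        # Restore the previous value so the ID does not leak to other work
        request_id_var.reset(token)


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer)
    handle_request(log, "search")
    print(buffer.getvalue(), end="")
```

Both log lines from one `handle_request` call carry the same `request_id`, which makes it possible to filter a log aggregator down to a single request. Because asyncio tasks each run in their own copy of the context, the same pattern should carry over to async tool handlers.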

Metrics with Prometheus

from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from starlette.responses import Response

# Assumes `server` (the MCP server), `app` (the Starlette app), and
# `execute_tool` are defined elsewhere in your application.

# Define metrics
TOOL_CALLS = Counter(
    "mcp_tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"]
)

TOOL_LATENCY = Histogram(
    "mcp_tool_latency_seconds",
    "Tool call latency",
    ["tool_name"]
)

# Instrument tool calls
@server.call_tool()
async def call_tool(name: str, arguments: dict):
    # time() records the duration even when the call raises
    with TOOL_LATENCY.labels(tool_name=name).time():
        try:
            result = await execute_tool(name, arguments)
            TOOL_CALLS.labels(tool_name=name, status="success").inc()
            return result
        except Exception:
            TOOL_CALLS.labels(tool_name=name, status="error").inc()
            raise

# Metrics endpoint
@app.route("/metrics")
async def metrics(request):
    # CONTENT_TYPE_LATEST is the exposition-format content type Prometheus expects
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Health Checks

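from starlette.responses import JSONResponse

# check_redis() and check_external_api() are assumed to follow the same
# pattern as check_database() below.
@app.route("/health")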
async def health(request):
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "external_api": await check_external_api(),
    }

    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return JSONResponse(
        {"status": "healthy" if all_healthy else "unhealthy", "checks": checks},
        status_code=status_code
    )

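async def check_database():
    # Assumes `db` is an async database client with an execute() method
    try:
        await db.execute("SELECT 1")
        return True
    except Exception:
        # Any failure marks the dependency unhealthy. A bare `except:` would
        # also swallow task cancellation, so catch Exception instead.
        return False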
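
The health endpoint above awaits each dependency in turn, so one hung dependency can stall the whole response. A possible refinement, sketched below under stated assumptions, runs the checks concurrently and treats a timeout as a failure. `run_check`, `run_health_checks`, and the three stand-in checks are hypothetical names for this illustration, and the 0.1-second timeout is chosen only to keep the demo fast.

```python
import asyncio


async def run_check(name, check, timeout=2.0):
    """Run one health check, treating exceptions and timeouts as unhealthy."""
    try:
        result = await asyncio.wait_for(check(), timeout=timeout)
        return name, bool(result)
    except Exception:
        # Covers timeouts too: a hung dependency counts as down.
        return name, False


async def run_health_checks(checks, timeout=2.0):
    """Run all checks concurrently and return a {name: healthy} mapping."""
    results = await asyncio.gather(
        *(run_check(name, check, timeout) for name, check in checks.items())
    )
    return dict(results)


# Hypothetical stand-ins for real dependency checks.
async def check_ok():
    return True


async def check_slow():
    await asyncio.sleep(10)  # simulates a dependency that hangs
    return True


async def check_broken():
    raise ConnectionError("dependency unreachable")


if __name__ == "__main__":
    status = asyncio.run(run_health_checks(
        {"database": check_ok, "redis": check_slow, "external_api": check_broken},
        timeout=0.1,
    ))
    print(status)
```

With this shape, the total response time is bounded by the slowest check's timeout rather than the sum of all checks, which matters when a load balancer or orchestrator polls the endpoint with its own deadline.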

Alerting Rules

Define the alerts as Prometheus alerting rules (Grafana can alert on the same queries):

groups:
  - name: mcp-alerts
    rules:
      - alert: HighErrorRate
        # Fires for any tool that fails more than 0.1 times per second
        expr: rate(mcp_tool_calls_total{status="error"}[5m]) > 0.1
        for: 5m
        annotations:
          summary: "High error rate in MCP server"

      - alert: SlowToolCalls
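        # histogram_quantile needs per-bucket rates, aggregated by the `le` label
        expr: histogram_quantile(0.95, sum by (le) (rate(mcp_tool_latency_seconds_bucket[5m]))) > 5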
        for: 5m
        annotations:
          summary: "Tool calls taking too long"
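
The latency alert relies on a quantile that Prometheus estimates from histogram buckets rather than from raw samples. To make that estimate less opaque, here is a simplified Python sketch of the linear interpolation applied within the bucket that contains the requested rank. It is an illustration rather than Prometheus's actual implementation: the function name mirrors the PromQL function for readability, the bucket counts are invented, and edge cases such as negative bucket bounds are left out.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must include a final (math.inf, total_count) entry, as
    Prometheus histograms do with their le="+Inf" bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket: report that bucket's bound
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented cumulative counts for illustration only
    example = [(1.0, 60), (2.5, 85), (5.0, 94), (10.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

In this example the 95th observation lands in the 5 to 10 second bucket, so the estimate is interpolated between those two bounds. The practical consequence is that the alert threshold is only as precise as the bucket boundaries around it, so it is worth choosing buckets that bracket the latencies you care about.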

Next, let's explore testing strategies.
