ML Monitoring & Next Steps

ML Governance & Compliance

As ML systems make higher-stakes decisions, regulatory requirements grow. Organizations need governance frameworks to ensure responsible AI deployment.

Why ML Governance Matters

Risk          Impact                    Example
------------  ------------------------  ------------------------------------------
Bias          Discrimination lawsuits   Hiring model favoring certain demographics
Privacy       GDPR fines                Model memorizing PII
Safety        Product liability         Autonomous system failures
Transparency  Regulatory penalties      "Black box" loan decisions

Regulatory Landscape (2025)

EU AI Act

The world's first comprehensive AI regulation. It entered into force in 2024, and its obligations phase in between 2025 and 2027:

Risk Category  Requirements       Examples
-------------  -----------------  -------------------------------
Unacceptable   Banned             Social scoring, manipulative AI
High-Risk      Strict compliance  Healthcare, hiring, credit
Limited        Transparency       Chatbots, emotion detection
Minimal        Self-regulation    Spam filters, games

High-Risk Requirements

# Required for high-risk AI systems
documentation:
  - Technical documentation of design
  - Risk management system
  - Data governance practices
  - Human oversight measures
  - Accuracy/robustness metrics

capabilities:
  - Audit logging
  - Traceability
  - Bias testing
  - Human override
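
In practice, teams often gate promotion on these artifacts in CI. A minimal sketch of such a check; the artifact names here are assumptions, not official EU AI Act terminology:

# compliance_check.py (sketch; artifact names are illustrative)
REQUIRED_ARTIFACTS = {
    "technical_documentation",
    "risk_management_plan",
    "data_governance_policy",
    "human_oversight_procedure",
    "accuracy_robustness_report",
}

def missing_artifacts(submitted: set[str]) -> set[str]:
    """Return required artifacts that have not been provided."""
    return REQUIRED_ARTIFACTS - submitted

gaps = missing_artifacts({"technical_documentation", "risk_management_plan"})
if gaps:
    print(f"Blocked for review: missing {sorted(gaps)}")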

US Regulatory Framework

  • NIST AI RMF: Voluntary risk management framework
  • State Laws: California, Colorado, Illinois AI regulations
  • Sector-Specific: FDA (healthcare), SEC (finance), FTC (consumer protection)

Model Cards

Standard Format

# model_card.yaml
model_details:
  name: "fraud-detector-v3"
  version: "3.2.1"
  type: "Binary Classification"
  framework: "scikit-learn 1.4.0"
  owner: "risk-team@company.com"
  created: "2025-01-15"

intended_use:
  primary: "Detect fraudulent payment transactions"
  out_of_scope:
    - "Account-level fraud detection"
    - "Identity verification"
  users: "Payment processing pipeline (automated)"

training_data:
  source: "Internal transaction database"
  size: "5.2M transactions"
  date_range: "2024-01-01 to 2024-12-31"
  preprocessing: "See data_pipeline.md"

evaluation:
  metrics:
    accuracy: 0.94
    precision: 0.91
    recall: 0.87
    f1: 0.89
    auc_roc: 0.96
  test_set: "500K transactions (holdout)"
  slices:
    - name: "High-value (>$10K)"
      accuracy: 0.92
    - name: "International"
      accuracy: 0.89
    - name: "New customers (<30 days)"
      accuracy: 0.85

ethical_considerations:
  bias_analysis:
    performed: true
    method: "Demographic parity analysis"
    findings: "No significant disparate impact detected"
  fairness_constraints:
    - "Equal opportunity across merchant categories"
  limitations:
    - "Lower accuracy on new customer segments"
    - "May not detect novel fraud patterns"

caveats:
  - "Requires at least 3 prior transactions for accuracy"
  - "Performance degrades on international transactions"
  - "Retrain recommended every 90 days"

Generating Model Cards

# model_card_generator.py
from dataclasses import dataclass
import yaml

@dataclass
class ModelCard:
    name: str
    version: str
    owner: str
    description: str
    metrics: dict
    training_data_info: dict
    ethical_considerations: dict
    limitations: list[str]

    def to_yaml(self) -> str:
        return yaml.dump(self.__dict__, default_flow_style=False)

    def to_markdown(self) -> str:
        md = f"# Model Card: {self.name} v{self.version}\n\n"
        md += f"**Owner**: {self.owner}\n\n"
        md += f"## Description\n{self.description}\n\n"
        md += "## Metrics\n"
        for k, v in self.metrics.items():
            md += f"- **{k}**: {v}\n"
        md += "\n## Limitations\n"
        for lim in self.limitations:
            md += f"- {lim}\n"
        return md

def generate_model_card(
    model,
    test_data,
    test_labels,
    metadata: dict
) -> ModelCard:
    """Generate model card from trained model."""
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    predictions = model.predict(test_data)

    metrics = {
        "accuracy": round(accuracy_score(test_labels, predictions), 4),
        "precision": round(precision_score(test_labels, predictions), 4),
        "recall": round(recall_score(test_labels, predictions), 4),
        "f1": round(f1_score(test_labels, predictions), 4)
    }

    return ModelCard(
        name=metadata["name"],
        version=metadata["version"],
        owner=metadata["owner"],
        description=metadata.get("description", ""),
        metrics=metrics,
        training_data_info=metadata.get("training_data", {}),
        ethical_considerations=metadata.get("ethics", {}),
        limitations=metadata.get("limitations", [])
    )
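
Usage might look like the following sketch; clf, X_test, and y_test are assumed to be a fitted scikit-learn classifier and its holdout split:

# Sketch: generate a card after evaluation and persist both formats.
card = generate_model_card(
    model=clf,
    test_data=X_test,
    test_labels=y_test,
    metadata={
        "name": "fraud-detector-v3",
        "version": "3.2.1",
        "owner": "risk-team@company.com",
        "limitations": ["Lower accuracy on new customer segments"],
    },
)

with open("model_card.yaml", "w") as f:
    f.write(card.to_yaml())
with open("model_card.md", "w") as f:
    f.write(card.to_markdown())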

Audit Logging

Comprehensive Logging

# audit_logger.py
import hashlib
import json
import logging
from datetime import datetime, timezone
from typing import Any

class AuditLogger:
    def __init__(self, model_name: str, version: str):
        self.model_name = model_name
        self.version = version
        self.logger = logging.getLogger("ml_audit")

    def log_prediction(
        self,
        prediction_id: str,
        input_features: dict,
        output: Any,
        confidence: float,
        user_id: str | None = None,
        metadata: dict | None = None
    ):
        """Log prediction for audit trail."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "prediction",
            "model_name": self.model_name,
            "model_version": self.version,
            "prediction_id": prediction_id,
            "input_hash": self._hash_input(input_features),
            "output": output,
            "confidence": confidence,
            "user_id": user_id,
            "metadata": metadata or {}
        }
        self.logger.info(json.dumps(record))

    def log_model_deployment(
        self,
        previous_version: str | None,
        deployment_type: str,
        approver: str
    ):
        """Log model deployment event."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "deployment",
            "model_name": self.model_name,
            "model_version": self.version,
            "previous_version": previous_version,
            "deployment_type": deployment_type,
            "approver": approver
        }
        self.logger.info(json.dumps(record))

    def log_override(
        self,
        prediction_id: str,
        original_output: Any,
        overridden_output: Any,
        reason: str,
        operator: str
    ):
        """Log human override of model prediction."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "human_override",
            "model_name": self.model_name,
            "prediction_id": prediction_id,
            "original_output": original_output,
            "overridden_output": overridden_output,
            "reason": reason,
            "operator": operator
        }
        self.logger.info(json.dumps(record))

    def _hash_input(self, features: dict) -> str:
        """Hash raw inputs so the trail is verifiable without logging PII."""
        return hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()
        ).hexdigest()[:16]
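
How the records are stored matters as much as what is logged; audit trails should be append-only and retained per policy. A minimal sketch that routes records to a local JSON-lines file:

# Sketch: write audit records as JSON lines (a production setup would
# ship them to an append-only store with retention controls).
import logging

logging.basicConfig(filename="audit.jsonl", level=logging.INFO,
                    format="%(message)s")

audit = AuditLogger(model_name="fraud-detector-v3", version="3.2.1")
audit.log_prediction(
    prediction_id="txn-0001",
    input_features={"amount": 129.99, "country": "DE"},
    output="fraud",
    confidence=0.93,
    user_id="svc-payments",
)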

Bias Testing

Fairness Metrics

# bias_testing.py
import numpy as np

def demographic_parity(
    predictions: np.ndarray,
    protected_attribute: np.ndarray
) -> dict:
    """Calculate demographic parity difference."""
    groups = np.unique(protected_attribute)
    positive_rates = {}

    for group in groups:
        mask = protected_attribute == group
        positive_rates[group] = predictions[mask].mean()

    max_rate = max(positive_rates.values())
    min_rate = min(positive_rates.values())

    return {
        "positive_rates": positive_rates,
        "demographic_parity_difference": max_rate - min_rate,
        "demographic_parity_ratio": min_rate / max_rate if max_rate > 0 else 0
    }

def equal_opportunity(
    predictions: np.ndarray,
    labels: np.ndarray,
    protected_attribute: np.ndarray
) -> dict:
    """Calculate equal opportunity difference (TPR parity)."""
    groups = np.unique(protected_attribute)
    tpr_by_group = {}

    for group in groups:
        mask = (protected_attribute == group) & (labels == 1)
        if mask.sum() > 0:
            tpr_by_group[group] = predictions[mask].mean()
        else:
            tpr_by_group[group] = None

    valid_tprs = [v for v in tpr_by_group.values() if v is not None]

    return {
        "tpr_by_group": tpr_by_group,
        "equal_opportunity_difference": max(valid_tprs) - min(valid_tprs) if valid_tprs else None
    }

# Example usage
results = demographic_parity(
    predictions=model.predict(X_test),
    protected_attribute=X_test["gender"].to_numpy()
)

if results["demographic_parity_difference"] > 0.1:
    print("WARNING: Potential bias detected")
    print(f"Positive rates by group: {results['positive_rates']}")

Governance Checklist

Pre-Deployment

  • Model card documented
  • Bias testing completed
  • Human oversight mechanism defined
  • Rollback procedure documented
  • Audit logging implemented
  • Data lineage traceable
  • Approval from compliance/legal
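
Many teams enforce this checklist programmatically as a deployment gate. A minimal sketch; the check names and the context dict are illustrative:

# deployment_gate.py (sketch: block deployment until every check passes)
PRE_DEPLOY_CHECKS = {
    "model_card_documented": lambda ctx: "model_card" in ctx,
    "bias_testing_completed": lambda ctx: ctx.get("bias_report_passed", False),
    "rollback_documented": lambda ctx: "rollback_runbook" in ctx,
    "audit_logging_implemented": lambda ctx: ctx.get("audit_logging", False),
    "compliance_approved": lambda ctx: ctx.get("compliance_signoff", False),
}

def deployment_gate(ctx: dict) -> list[str]:
    """Return names of failed checks; an empty list means cleared to deploy."""
    return [name for name, check in PRE_DEPLOY_CHECKS.items() if not check(ctx)]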

Production Monitoring

  • Prediction distribution monitored
  • Fairness metrics tracked
  • Human override rate monitored
  • Drift detection active
  • Audit logs retained per policy
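
The fairness item can reuse demographic_parity() from above on a rolling window of recent traffic. A sketch; the 0.1 threshold is an illustrative policy choice:

# Sketch: periodic fairness check over recent predictions.
def fairness_check(recent_preds, recent_groups, threshold: float = 0.1) -> bool:
    """Return False (and alert) when the parity gap exceeds the threshold."""
    result = demographic_parity(recent_preds, recent_groups)
    gap = result["demographic_parity_difference"]
    if gap > threshold:
        print(f"ALERT: demographic parity gap {gap:.3f} exceeds {threshold}")
        return False
    return True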

Key insight: Governance isn't just compliance—it's building trustworthy AI systems that organizations and users can rely on.

Next, we'll review your MLOps journey and explore what to learn next.
