# ML Governance & Compliance
As ML systems make higher-stakes decisions, regulatory requirements grow. Organizations need governance frameworks to ensure responsible AI deployment.
## Why ML Governance Matters
| Risk | Impact | Example |
|---|---|---|
| Bias | Discrimination lawsuits | Hiring model favoring certain demographic groups |
| Privacy | GDPR fines | Model memorizing PII |
| Safety | Product liability | Autonomous system failures |
| Transparency | Regulatory penalties | "Black box" loan decisions |
## Regulatory Landscape (2025)

### EU AI Act
The world's first comprehensive AI regulation, with enforcement beginning in 2025-2026:
| Risk Category | Requirements | Examples |
|---|---|---|
| Unacceptable | Banned | Social scoring, manipulative AI |
| High-Risk | Strict compliance | Healthcare, hiring, credit |
| Limited | Transparency | Chatbots, emotion detection |
| Minimal | Self-regulation | Spam filters, games |
#### High-Risk Requirements

```yaml
# Required for high-risk AI systems
documentation:
  - Technical documentation of design
  - Risk management system
  - Data governance practices
  - Human oversight measures
  - Accuracy/robustness metrics
capabilities:
  - Audit logging
  - Traceability
  - Bias testing
  - Human override
```
### US Regulatory Framework
- NIST AI RMF: Voluntary risk management framework
- State Laws: California, Colorado, Illinois AI regulations
- Sector-Specific: FDA (healthcare), SEC (finance), FTC (consumer protection)
## Model Cards

### Standard Format

```yaml
# model_card.yaml
model_details:
  name: "fraud-detector-v3"
  version: "3.2.1"
  type: "Binary Classification"
  framework: "scikit-learn 1.4.0"
  owner: "risk-team@company.com"
  created: "2025-01-15"

intended_use:
  primary: "Detect fraudulent payment transactions"
  out_of_scope:
    - "Account-level fraud detection"
    - "Identity verification"
  users: "Payment processing pipeline (automated)"

training_data:
  source: "Internal transaction database"
  size: "5.2M transactions"
  date_range: "2024-01-01 to 2024-12-31"
  preprocessing: "See data_pipeline.md"

evaluation:
  metrics:
    accuracy: 0.94
    precision: 0.91
    recall: 0.87
    f1: 0.89
    auc_roc: 0.96
  test_set: "500K transactions (holdout)"
  slices:
    - name: "High-value (>$10K)"
      accuracy: 0.92
    - name: "International"
      accuracy: 0.89
    - name: "New customers (<30 days)"
      accuracy: 0.85

ethical_considerations:
  bias_analysis:
    performed: true
    method: "Demographic parity analysis"
    findings: "No significant disparate impact detected"
  fairness_constraints:
    - "Equal opportunity across merchant categories"

limitations:
  - "Lower accuracy on new customer segments"
  - "May not detect novel fraud patterns"

caveats:
  - "Requires at least 3 prior transactions for accuracy"
  - "Performance degrades on international transactions"
  - "Retrain recommended every 90 days"
```
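Checking the card programmatically helps keep it from drifting out of date. The snippet below is a minimal, hypothetical validator; the required keys mirror the example card above and are an assumption, not a formal standard:

```python
# validate_model_card.py -- hypothetical validator for the card format shown above.
import yaml

# Top-level sections and fields we expect; mirrors the example card, not a formal standard.
REQUIRED_SECTIONS = ["model_details", "intended_use", "training_data", "evaluation"]
REQUIRED_DETAILS = ["name", "version", "owner", "created"]


def validate_model_card(path: str) -> list[str]:
    """Return a list of problems found in the model card (empty list = valid)."""
    with open(path) as f:
        card = yaml.safe_load(f)

    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in card:
            problems.append(f"Missing section: {section}")

    for field in REQUIRED_DETAILS:
        if field not in card.get("model_details", {}):
            problems.append(f"Missing model_details field: {field}")

    return problems


if __name__ == "__main__":
    for problem in validate_model_card("model_card.yaml"):
        print(f"INVALID: {problem}")
```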
### Generating Model Cards

```python
# model_card_generator.py
from dataclasses import dataclass
from typing import Optional
import yaml
import json


@dataclass
class ModelCard:
    name: str
    version: str
    owner: str
    description: str
    metrics: dict
    training_data_info: dict
    ethical_considerations: dict
    limitations: list[str]

    def to_yaml(self) -> str:
        return yaml.dump(self.__dict__, default_flow_style=False)

    def to_markdown(self) -> str:
        md = f"# Model Card: {self.name} v{self.version}\n\n"
        md += f"**Owner**: {self.owner}\n\n"
        md += f"## Description\n{self.description}\n\n"
        md += "## Metrics\n"
        for k, v in self.metrics.items():
            md += f"- **{k}**: {v}\n"
        md += "\n## Limitations\n"
        for lim in self.limitations:
            md += f"- {lim}\n"
        return md


def generate_model_card(
    model,
    test_data,
    test_labels,
    metadata: dict
) -> ModelCard:
    """Generate a model card from a trained model."""
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    predictions = model.predict(test_data)
    metrics = {
        "accuracy": round(accuracy_score(test_labels, predictions), 4),
        "precision": round(precision_score(test_labels, predictions), 4),
        "recall": round(recall_score(test_labels, predictions), 4),
        "f1": round(f1_score(test_labels, predictions), 4)
    }

    return ModelCard(
        name=metadata["name"],
        version=metadata["version"],
        owner=metadata["owner"],
        description=metadata.get("description", ""),
        metrics=metrics,
        training_data_info=metadata.get("training_data", {}),
        ethical_considerations=metadata.get("ethics", {}),
        limitations=metadata.get("limitations", [])
    )
```
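A hypothetical usage, assuming a trained scikit-learn model and a held-out test set from your own pipeline, might look like this:

```python
# Hypothetical usage; `model`, `X_test`, and `y_test` come from your training pipeline.
card = generate_model_card(
    model=model,
    test_data=X_test,
    test_labels=y_test,
    metadata={
        "name": "fraud-detector-v3",
        "version": "3.2.1",
        "owner": "risk-team@company.com",
        "description": "Detects fraudulent payment transactions",
        "limitations": ["Lower accuracy on new customer segments"],
    },
)

# Persist both representations alongside the model artifact.
with open("model_card.yaml", "w") as f:
    f.write(card.to_yaml())
with open("MODEL_CARD.md", "w") as f:
    f.write(card.to_markdown())
```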
## Audit Logging

### Comprehensive Logging

```python
# audit_logger.py
import json
import logging
from datetime import datetime
from typing import Any


class AuditLogger:
    def __init__(self, model_name: str, version: str):
        self.model_name = model_name
        self.version = version
        self.logger = logging.getLogger("ml_audit")

    def log_prediction(
        self,
        prediction_id: str,
        input_features: dict,
        output: Any,
        confidence: float,
        user_id: str | None = None,
        metadata: dict | None = None
    ):
        """Log prediction for audit trail."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "prediction",
            "model_name": self.model_name,
            "model_version": self.version,
            "prediction_id": prediction_id,
            "input_hash": self._hash_input(input_features),
            "output": output,
            "confidence": confidence,
            "user_id": user_id,
            "metadata": metadata or {}
        }
        self.logger.info(json.dumps(record))

    def log_model_deployment(
        self,
        previous_version: str | None,
        deployment_type: str,
        approver: str
    ):
        """Log model deployment event."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "deployment",
            "model_name": self.model_name,
            "model_version": self.version,
            "previous_version": previous_version,
            "deployment_type": deployment_type,
            "approver": approver
        }
        self.logger.info(json.dumps(record))

    def log_override(
        self,
        prediction_id: str,
        original_output: Any,
        overridden_output: Any,
        reason: str,
        operator: str
    ):
        """Log human override of model prediction."""
        record = {
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "human_override",
            "model_name": self.model_name,
            "prediction_id": prediction_id,
            "original_output": original_output,
            "overridden_output": overridden_output,
            "reason": reason,
            "operator": operator
        }
        self.logger.info(json.dumps(record))

    def _hash_input(self, features: dict) -> str:
        import hashlib
        return hashlib.sha256(
            json.dumps(features, sort_keys=True).encode()
        ).hexdigest()[:16]
```
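Wiring the logger to an append-only sink is what makes the trail auditable. A minimal sketch, assuming a JSON-lines file is acceptable (your retention policy may require a dedicated log store instead):

```python
# Hypothetical setup: write audit records as JSON lines to an append-only file.
import logging

handler = logging.FileHandler("audit.jsonl")
handler.setFormatter(logging.Formatter("%(message)s"))  # records are already JSON

audit_log = logging.getLogger("ml_audit")
audit_log.setLevel(logging.INFO)
audit_log.addHandler(handler)

auditor = AuditLogger(model_name="fraud-detector", version="3.2.1")
auditor.log_prediction(
    prediction_id="txn-00042",
    input_features={"amount": 129.99, "country": "DE"},
    output="fraud",
    confidence=0.93,
    user_id="svc-payments",
)
```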
## Bias Testing

### Fairness Metrics

```python
# bias_testing.py
import pandas as pd
import numpy as np


def demographic_parity(
    predictions: np.ndarray,
    protected_attribute: np.ndarray
) -> dict:
    """Calculate demographic parity difference."""
    groups = np.unique(protected_attribute)
    positive_rates = {}

    for group in groups:
        mask = protected_attribute == group
        positive_rates[group] = predictions[mask].mean()

    max_rate = max(positive_rates.values())
    min_rate = min(positive_rates.values())

    return {
        "positive_rates": positive_rates,
        "demographic_parity_difference": max_rate - min_rate,
        "demographic_parity_ratio": min_rate / max_rate if max_rate > 0 else 0
    }


def equal_opportunity(
    predictions: np.ndarray,
    labels: np.ndarray,
    protected_attribute: np.ndarray
) -> dict:
    """Calculate equal opportunity difference (TPR parity)."""
    groups = np.unique(protected_attribute)
    tpr_by_group = {}

    for group in groups:
        mask = (protected_attribute == group) & (labels == 1)
        if mask.sum() > 0:
            tpr_by_group[group] = predictions[mask].mean()
        else:
            tpr_by_group[group] = None

    valid_tprs = [v for v in tpr_by_group.values() if v is not None]

    return {
        "tpr_by_group": tpr_by_group,
        "equal_opportunity_difference": max(valid_tprs) - min(valid_tprs) if valid_tprs else None
    }


# Example usage
results = demographic_parity(
    predictions=model.predict(X_test),
    protected_attribute=X_test["gender"]
)

if results["demographic_parity_difference"] > 0.1:
    print("WARNING: Potential bias detected")
    print(f"Positive rates by group: {results['positive_rates']}")
```
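A similar check works for equal opportunity. The sketch below assumes the same `model` and `X_test` plus a `y_test` array of true labels, and reuses the 0.1 threshold from the demographic parity example as an illustrative cutoff:

```python
# Hypothetical usage of equal_opportunity; y_test holds the true labels.
eo = equal_opportunity(
    predictions=model.predict(X_test),
    labels=y_test,
    protected_attribute=X_test["gender"]
)

# Flag the model for review if TPR differs by more than 10 points across groups.
gap = eo["equal_opportunity_difference"]
if gap is not None and gap > 0.1:
    print("WARNING: Potential equal-opportunity gap detected")
    print(f"TPR by group: {eo['tpr_by_group']}")
```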
## Governance Checklist

### Pre-Deployment
- Model card documented
- Bias testing completed
- Human oversight mechanism defined
- Rollback procedure documented
- Audit logging implemented
- Data lineage traceable
- Approval from compliance/legal
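Parts of this checklist can be enforced automatically in CI before a release is approved. The sketch below is a hypothetical gate; the file paths, report structure, and 0.1 threshold are assumptions for illustration:

```python
# pre_deploy_gate.py -- hypothetical pre-deployment gate; paths and thresholds are assumptions.
from pathlib import Path
import json
import sys

REQUIRED_ARTIFACTS = {
    "model card": Path("model_card.yaml"),
    "bias report": Path("reports/bias_report.json"),
    "rollback runbook": Path("docs/rollback.md"),
}


def run_gate() -> int:
    failures = []

    # Every governance artifact must exist before deployment.
    for label, path in REQUIRED_ARTIFACTS.items():
        if not path.exists():
            failures.append(f"Missing {label}: {path}")

    # The bias report (assumed JSON structure) must show an acceptable parity gap.
    bias_path = REQUIRED_ARTIFACTS["bias report"]
    if bias_path.exists():
        report = json.loads(bias_path.read_text())
        if report.get("demographic_parity_difference", 1.0) > 0.1:
            failures.append("Demographic parity difference exceeds 0.1")

    for failure in failures:
        print(f"GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(run_gate())
```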
### Production Monitoring
- Prediction distribution monitored
- Fairness metrics tracked
- Human override rate monitored
- Drift detection active
- Audit logs retained per policy
Key insight: Governance isn't just compliance—it's building trustworthy AI systems that organizations and users can rely on.
Next, we'll review your MLOps journey and explore what to learn next.