Platform Team Operations & Maturity

Platform Maturity Model

Platform maturity models help teams assess their current capabilities and plan improvements. This lesson covers the five levels of platform maturity (Levels 0-4), techniques for assessing where you stand today, and roadmaps for advancing to the next level.

Why Maturity Models Matter

Maturity models provide:

  • Baseline assessment: Understand where you are today
  • Goal setting: Define target state for your platform
  • Prioritization: Focus on highest-impact improvements
  • Communication: Shared language with stakeholders
  • Progress tracking: Measure advancement over time

Five Levels of Platform Maturity

┌────────────────────────────────────────────────────────────┐
│                  Platform Maturity Levels                  │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Level 4: Optimizing           ████████████████████████    │
│  • AI-assisted operations      Continuous improvement      │
│  • Predictive scaling          Innovation focus            │
│  • Self-healing platform                                   │
│                                                            │
│  Level 3: Standardized         ████████████████████        │
│  • Golden paths enforced       Consistent experience       │
│  • Self-service mature         Measured outcomes           │
│  • Platform-as-Product                                     │
│                                                            │
│  Level 2: Managed              ████████████████            │
│  • Central tooling             Reduced variance            │
│  • Basic self-service          Improving DX                │
│  • Documentation exists                                    │
│                                                            │
│  Level 1: Reactive             ████████████                │
│  • Ticket-based requests       High manual effort          │
│  • Inconsistent tooling        Slow onboarding             │
│  • Tribal knowledge                                        │
│                                                            │
│  Level 0: Ad-hoc               ████████                    │
│  • No platform team            Complete chaos              │
│  • Every team does their own   Maximum friction            │
│  • No standardization                                      │
│                                                            │
└────────────────────────────────────────────────────────────┘

Level 0: Ad-hoc

No formal platform exists. Each team manages infrastructure independently.

level_0_characteristics:
  infrastructure:
    - Teams provision their own cloud resources
    - No standard deployment process
    - Inconsistent security practices
    - Manual server configuration

  developer_experience:
    - High cognitive load
    - Steep learning curve for new hires
    - Duplicated effort across teams
    - Slow time-to-first-deploy

  operations:
    - No central visibility
    - Incident response varies by team
    - Cost tracking impossible
    - Compliance gaps

  indicators:
    - "Every team uses different CI tools"
    - "New developers take weeks to deploy"
    - "We don't know our cloud spend by team"
    - "Security finds issues in every audit"

Level 1: Reactive

A basic platform team exists but operates in ticket-driven mode.

level_1_characteristics:
  infrastructure:
    - Central team handles requests
    - Standard tooling recommended (not enforced)
    - Some automation exists
    - Basic templates available

  developer_experience:
    - Requests take days/weeks
    - Bottleneck on platform team
    - Documentation scattered
    - Some tribal knowledge captured

  operations:
    - Central monitoring exists
    - Incident response improving
    - Basic cost tracking
    - Manual compliance checks

  indicators:
    - "Submit a ticket for a new database"
    - "Platform team is always backlogged"
    - "We have docs but they're outdated"
    - "Onboarding takes 1-2 weeks"

  typical_metrics:
    ticket_backlog: "50+ tickets"
    request_lead_time: "3-7 days"
    onboarding_time: "1-2 weeks"
    deployment_frequency: "Weekly"

Level 2: Managed

The platform provides consistent tooling with basic self-service.

level_2_characteristics:
  infrastructure:
    - Self-service for common resources
    - Standard CI/CD pipelines
    - Infrastructure as Code adopted
    - Environment provisioning automated

  developer_experience:
    - Portal with basic catalog
    - Templates for new services
    - Centralized documentation
    - Defined golden paths

  operations:
    - Centralized observability
    - SLOs defined for platform
    - Cost allocation by team
    - Automated compliance scans

  indicators:
    - "I can create a database without tickets"
    - "All services use the same CI pipeline"
    - "New services start from templates"
    - "I can see my team's cloud costs"

  typical_metrics:
    self_service_ratio: "50-70%"
    request_lead_time: "< 1 day"
    onboarding_time: "2-5 days"
    deployment_frequency: "Daily"

Level 3: Standardized

The platform operates as an internal product with mature self-service.

level_3_characteristics:
  infrastructure:
    - Full self-service infrastructure
    - Policy-as-Code enforced
    - Multi-cloud abstraction
    - Automatic resource optimization

  developer_experience:
    - Comprehensive developer portal
    - All services in catalog
    - Rich documentation (TechDocs)
    - Golden paths for all common patterns

  operations:
    - Error budgets and SLOs active
    - Automated remediation
    - FinOps practices mature
    - Continuous compliance

  indicators:
    - "I deploy to production multiple times daily"
    - "Platform automatically scales my services"
    - "I've never submitted an infra ticket"
    - "Compliance is automatic"

  typical_metrics:
    self_service_ratio: "90%+"
    request_lead_time: "Minutes"
    onboarding_time: "< 1 day"
    deployment_frequency: "On-demand"

Level 4: Optimizing

The platform continuously improves through advanced automation and AI assistance.

level_4_characteristics:
  infrastructure:
    - AI-optimized resource allocation
    - Predictive scaling
    - Self-healing systems
    - Automatic cost optimization

  developer_experience:
    - AI-assisted development
    - Personalized recommendations
    - Intelligent error resolution
    - Continuous experience improvement

  operations:
    - AIOps for incident prediction
    - Autonomous remediation
    - Real-time cost optimization
    - Proactive security patching

  indicators:
    - "Platform predicts and prevents incidents"
    - "AI suggests optimal configurations"
    - "Costs automatically optimized weekly"
    - "Zero-touch operations for 90%+ of issues"

  typical_metrics:
    automation_rate: "95%+"
    incident_prevention_rate: "70%+"
    time_to_remediation: "< 5 minutes"
    developer_nps: "70+"

Maturity Assessment Scorecard

Use this scorecard to assess your current level across key dimensions.

# maturity-assessment.yaml
assessment_dimensions:

  self_service:
    level_0:
      description: "No self-service"
      indicators:
        - All infrastructure requests via tickets
        - Manual provisioning for everything
      score: 0

    level_1:
      description: "Limited self-service"
      indicators:
        - Some resources available via wiki guides
        - Scripts shared informally
      score: 1

    level_2:
      description: "Basic self-service"
      indicators:
        - Portal with resource provisioning
        - Templates for common patterns
      score: 2

    level_3:
      description: "Comprehensive self-service"
      indicators:
        - All infrastructure via self-service
        - Policy guardrails automatic
      score: 3

    level_4:
      description: "Intelligent self-service"
      indicators:
        - AI recommendations for configurations
        - Predictive resource provisioning
      score: 4

  developer_experience:
    level_0:
      description: "Fragmented"
      score: 0
    level_1:
      description: "Documented"
      score: 1
    level_2:
      description: "Consistent"
      score: 2
    level_3:
      description: "Optimized"
      score: 3
    level_4:
      description: "Delightful"
      score: 4

  observability:
    level_0:
      description: "No monitoring"
      score: 0
    level_1:
      description: "Basic metrics"
      score: 1
    level_2:
      description: "Centralized dashboards"
      score: 2
    level_3:
      description: "SLO-driven"
      score: 3
    level_4:
      description: "Predictive/AIOps"
      score: 4

  security_compliance:
    level_0:
      description: "Manual audits"
      score: 0
    level_1:
      description: "Periodic scans"
      score: 1
    level_2:
      description: "Automated scans"
      score: 2
    level_3:
      description: "Policy-as-Code"
      score: 3
    level_4:
      description: "Continuous compliance"
      score: 4

  cost_management:
    level_0:
      description: "Unknown costs"
      score: 0
    level_1:
      description: "Monthly reports"
      score: 1
    level_2:
      description: "Team allocation"
      score: 2
    level_3:
      description: "Real-time tracking"
      score: 3
    level_4:
      description: "AI optimization"
      score: 4

scoring:
  total_possible: 20
  level_mapping:
    0-4: "Level 0 (Ad-hoc)"
    5-8: "Level 1 (Reactive)"
    9-12: "Level 2 (Managed)"
    13-16: "Level 3 (Standardized)"
    17-20: "Level 4 (Optimizing)"

Assessment Template

# platform-assessment-template.yaml
assessment:
  date: "2025-01-15"
  assessor: "Platform Team"
  organization: "Example Corp"

dimensions:
  self_service:
    current_score: 2
    evidence:
      - "Backstage portal deployed"
      - "3 templates available"
      - "Database provisioning self-service"
    gaps:
      - "No Kubernetes namespace self-service"
      - "Manual secret management"

  developer_experience:
    current_score: 2
    evidence:
      - "Central documentation exists"
      - "Golden path for web services"
      - "Onboarding time ~3 days"
    gaps:
      - "No TechDocs integration"
      - "Local development setup manual"

  observability:
    current_score: 2
    evidence:
      - "Prometheus/Grafana deployed"
      - "Standard dashboards available"
    gaps:
      - "No SLOs defined"
      - "Alerting inconsistent"

  security_compliance:
    current_score: 1
    evidence:
      - "Quarterly security scans"
      - "Basic RBAC in place"
    gaps:
      - "No policy-as-code"
      - "Compliance is manual"

  cost_management:
    current_score: 1
    evidence:
      - "Monthly AWS cost reports"
    gaps:
      - "No team allocation"
      - "No optimization automation"

summary:
  total_score: 8
  current_level: "Level 1 (Reactive)"
  target_level: "Level 3 (Standardized)"
  priority_gaps:
    - "Implement policy-as-code (Kyverno)"
    - "Deploy Kubecost for cost allocation"
    - "Define platform SLOs"

Maturity Roadmap

Level 1 to Level 2 Roadmap

# roadmap-level-1-to-2.yaml
roadmap:
  current_level: 1
  target_level: 2
  timeline: "6 months"

phase_1:
  name: "Foundation"
  duration: "Month 1-2"
  objectives:
    - Deploy developer portal (Backstage)
    - Create first 3 software templates
    - Centralize documentation

  key_results:
    - Backstage live with SSO
    - 50% of services in catalog
    - All teams can access docs

  investments:
    - Backstage setup: 2 engineers × 4 weeks
    - Template creation: 1 engineer × 2 weeks
    - Doc migration: 1 engineer × 2 weeks
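
In practice, each of the "first 3 software templates" is a Backstage scaffolder template. A minimal sketch, assuming GitHub as the repository host and an org named example-org (both illustrative):

# template.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: nodejs-service
  title: Node.js Service
  description: Golden path for a new Node.js microservice
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Unique service name
  steps:
    # Copy the skeleton project, substituting the service name
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    # Create the repository and push the generated code
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=example-org&repo=${{ parameters.name }}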

phase_2:
  name: "Self-Service"
  duration: "Month 3-4"
  objectives:
    - Enable database self-service (Crossplane)
    - Standardize CI/CD pipelines
    - Implement cost allocation

  key_results:
    - Zero tickets for standard databases
    - 80% of services use standard CI
    - Cost visible by team

  investments:
    - Crossplane setup: 2 engineers × 4 weeks
    - Pipeline templates: 1 engineer × 3 weeks
    - Kubecost deployment: 1 engineer × 1 week
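
"Zero tickets for standard databases" means a developer requests a database by committing a claim against an API the platform team publishes via Crossplane. A sketch, assuming a platform-defined PostgreSQLInstance composite resource (the API group and parameters here are illustrative):

# database-claim.yaml
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20   # the Composition's guardrails cap what teams can request
  compositionSelector:
    matchLabels:
      provider: aws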

phase_3:
  name: "Observability"
  duration: "Month 5-6"
  objectives:
    - Deploy centralized observability
    - Create standard dashboards
    - Define initial SLOs

  key_results:
    - All services have dashboards
    - Platform SLOs published
    - Alert fatigue reduced 50%

  investments:
    - Prometheus/Grafana: 1 engineer × 2 weeks
    - Dashboard templates: 1 engineer × 3 weeks
    - SLO definition: Platform team × 2 weeks
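
Publishing a platform SLO typically starts as a Prometheus recording rule plus a burn-rate alert. A minimal sketch for a 99.5% availability SLO, assuming the platform API exposes a standard http_requests_total counter (metric and job names are assumptions):

# platform-slo.rules.yaml
groups:
  - name: platform-api-slo
    rules:
      # Fraction of requests failing over the last 5 minutes
      - record: platform:api_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{job="platform-api", code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="platform-api"}[5m]))
      # Page when burning the 0.5% error budget ~14x faster than sustainable
      - alert: PlatformAPIErrorBudgetBurn
        expr: platform:api_error_ratio:rate5m > 14.4 * 0.005
        for: 5m
        labels:
          severity: page
        annotations:
          summary: Platform API is burning its error budget at over 14x the sustainable rate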

success_criteria:
  self_service_ratio: "> 50%"
  onboarding_time: "< 1 week"
  ticket_reduction: "50%"
  developer_satisfaction: "> 60% positive"

Level 2 to Level 3 Roadmap

# roadmap-level-2-to-3.yaml
roadmap:
  current_level: 2
  target_level: 3
  timeline: "9 months"

phase_1:
  name: "Policy-as-Code"
  duration: "Month 1-3"
  objectives:
    - Implement Kyverno/OPA for policies
    - Automate compliance checks
    - Enable policy self-service

  key_results:
    - 100% of deployments pass policy checks
    - Compliance reports automated
    - Teams can request policy exceptions
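
To make "100% of deployments pass policy checks" concrete: with Kyverno, each guardrail is a ClusterPolicy applied in enforce mode. A hedged example that would also back the cost-allocation work from earlier by requiring a team label on every Deployment:

# require-team-label.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # reject non-compliant resources outright
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Every Deployment needs a 'team' label for cost allocation."
        pattern:
          metadata:
            labels:
              team: "?*"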

phase_2:
  name: "Advanced Self-Service"
  duration: "Month 4-6"
  objectives:
    - Complete infrastructure self-service
    - Multi-environment automation
    - Golden paths for all patterns

  key_results:
    - Zero infrastructure tickets
    - Prod deployments fully automated
    - 5+ golden path templates
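
One common way to get fully automated, multi-environment deployments (not the only one) is GitOps, for example with Argo CD, where each environment is a declaratively synced Application. An illustrative sketch; the repository URL and namespaces are assumptions:

# app-orders-staging.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: orders-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/orders
    targetRevision: main
    path: deploy/overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: orders-staging
  syncPolicy:
    automated:
      prune: true     # remove resources deleted from Git
      selfHeal: true  # revert manual drift in the cluster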

phase_3:
  name: "Platform-as-Product"
  duration: "Month 7-9"
  objectives:
    - Implement feedback loops
    - Track platform NPS
    - Optimize based on metrics

  key_results:
    - Monthly developer surveys
    - NPS > 50
    - Feature prioritization data-driven

success_criteria:
  self_service_ratio: "> 90%"
  deployment_frequency: "On-demand"
  developer_nps: "> 50"
  compliance_automation: "100%"

Continuous Assessment

Quarterly Review Template

# quarterly-review.yaml
quarterly_platform_review:

  metrics_review:
    adoption:
      - Services in catalog (current vs target)
      - Template usage rate
      - Self-service ratio

    developer_experience:
      - NPS score
      - Support ticket volume
      - Onboarding time

    reliability:
      - Platform uptime
      - SLO achievement
      - Incident count

    efficiency:
      - Cost per deployment
      - Time to provision
      - Automation rate

  qualitative_review:
    feedback_themes:
      - Top 3 developer complaints
      - Top 3 feature requests
      - Success stories

    roadmap_progress:
      - Completed initiatives
      - Delayed initiatives
      - New priorities

  action_planning:
    continue:
      - What's working well
    stop:
      - What to deprioritize
    start:
      - New initiatives
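
A filled-in extract makes the review concrete. The numbers below are hypothetical; the useful habit is recording current vs. target and noting how each ratio is computed:

# q1-review-extract.yaml
metrics_review:
  adoption:
    services_in_catalog:
      current: 62
      target: 80
    template_usage_rate: "71% of new services started from a template"
    self_service_ratio: "58%"   # self-service requests / all infra requests
  developer_experience:
    nps_score: 34
    support_ticket_volume: 41   # down from 63 last quarter
    onboarding_time: "3 days"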

Summary

Platform maturity progression:

Level               Focus        Key Capability
0 - Ad-hoc          Survival     None
1 - Reactive        Stability    Central team
2 - Managed         Efficiency   Self-service basics
3 - Standardized    Scale        Platform-as-Product
4 - Optimizing      Innovation   AI-assisted ops

Key principles:

  • Assess honestly: Don't overestimate current state
  • Progress incrementally: Don't skip levels
  • Focus on outcomes: Metrics over features
  • Iterate continuously: Maturity is a journey

Next Lesson: Future of Platform Engineering explores emerging trends, AI integration, and what platform engineering looks like in 2026 and beyond.
