Leadership System Design: Org Architecture & Process Engineering

Designing Engineering Processes That Scale

Engineering processes are the operating system of your organization. Done well, they create predictability, reduce waste, and free engineers to focus on solving problems. Done poorly, they become bureaucratic overhead that slows everyone down. Engineering manager (EM) interviews test your ability to design processes that work at scale -- not just your ability to follow a textbook Scrum guide.

Sprint Planning and Estimation

Most engineering teams use some form of iterative planning, whether Scrum sprints, Kanban flow, or a hybrid. The specific methodology matters less than whether it creates a sustainable cadence for delivery.

Sprint planning should accomplish three things in a single meeting: align the team on the sprint goal (the "why"), select work items from a prioritized backlog (the "what"), and break items into tasks with clear acceptance criteria (the "how"). A common failure is over-filling the sprint. A good rule is to plan for 70-80% of theoretical capacity, leaving room for interruptions, code reviews, and unplanned work.
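The 70-80% rule above can be turned into simple arithmetic. The following is a minimal sketch, not a standard formula: the function name `plannable_capacity`, the `days_off` parameter, and the default `focus_factor` of 0.75 (the midpoint of the range above) are all assumptions made for illustration.

```python
def plannable_capacity(engineers: int, sprint_days: int,
                       days_off: int = 0, focus_factor: float = 0.75) -> float:
    """Return the engineer-days a team should plan for in one sprint.

    focus_factor is the share of theoretical capacity to commit
    (hypothetical default of 0.75, inside the 70-80% range).
    """
    if not 0 < focus_factor <= 1:
        raise ValueError("focus_factor must be in (0, 1]")
    # Theoretical capacity: everyone working every sprint day, minus known time off.
    theoretical = engineers * sprint_days - days_off
    if theoretical < 0:
        raise ValueError("days_off exceeds total available days")
    # Hold back the remainder for interruptions, code reviews, and unplanned work.
    return theoretical * focus_factor
```

For example, five engineers in a ten-day sprint with four planned days off have 46 theoretical engineer-days, so `plannable_capacity(5, 10, days_off=4)` suggests planning for 34.5.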

Estimation exists to enable planning, not to create accountability contracts. Story points -- a relative sizing technique -- remain a widely used approach. The key principle is that story points measure complexity and uncertainty, not hours. A 5-point story is expected to be noticeably more complex than a 3-point story (the ratio is a rough guide, not a precise multiplier), but the actual time may vary based on who picks it up and what surprises emerge. Teams that estimate in hours often fall into the trap of treating estimates as commitments, which creates perverse incentives to pad numbers or cut corners.

When asked about estimation in interviews, emphasize that you use estimation as a forecasting tool -- tracking velocity over time to predict delivery capacity, not as a mechanism to pressure engineers.
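As an illustration of velocity-based forecasting, the sketch below estimates how many sprints a backlog might take. It assumes velocity is tracked as story points completed per sprint. The function name `forecast_sprints` and the choice to report an expected and a cautious figure are assumptions for this example, not an established method.

```python
import math
from statistics import mean


def forecast_sprints(backlog_points: int, recent_velocities: list[int]) -> tuple[int, int]:
    """Return (expected, cautious) sprint counts needed to finish the backlog.

    expected uses the average recent velocity; cautious uses the slowest
    recent sprint. Both are forecasts, not commitments.
    """
    if backlog_points < 0:
        raise ValueError("backlog_points must be non-negative")
    if not recent_velocities or min(recent_velocities) <= 0:
        raise ValueError("need at least one positive velocity")
    expected = math.ceil(backlog_points / mean(recent_velocities))
    cautious = math.ceil(backlog_points / min(recent_velocities))
    return expected, cautious
```

With a 120-point backlog and recent velocities of 28, 32, and 30, this yields a range of 4 to 5 sprints. Reporting a range rather than a single number keeps the forecast from being read as a promise.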

SDLC Choices

The Software Development Life Cycle (SDLC) defines how work flows from idea to production. Your choice depends on team maturity, product type, and risk tolerance:

| Approach | Best For | Tradeoffs |
| --- | --- | --- |
| Scrum (2-week sprints) | Teams learning agile, products with clear backlogs | Provides structure but can feel rigid; sprint boundaries can delay urgent work |
| Kanban (continuous flow) | Operations-heavy teams, maintenance work, support teams | Maximum flexibility but requires discipline in WIP limits; harder to forecast |
| Shape Up (6-week cycles) | Product-driven teams with appetite-based scoping (developed at Basecamp) | Longer cycles allow deeper work; risk of scope creep without strong shaping |
| Continuous delivery | Mature teams with strong CI/CD and test automation | Maximum throughput but requires infrastructure investment and operational excellence |

In practice, most teams adopt a hybrid. The EM's role is to choose the approach that fits the team's context and evolve it as the team matures.

Incident Management and On-Call

A well-designed incident management process has four phases:

  1. Detection -- Automated alerting through monitoring and observability tools. The goal is to detect incidents before customers report them.
  2. Response -- A defined on-call rotation with clear escalation paths. The on-call engineer triages the alert, assesses severity, and either resolves it or escalates.
  3. Resolution -- Coordinate the response, communicate status to stakeholders, and restore service. For major incidents, designate an Incident Commander who owns coordination while others focus on debugging.
  4. Post-incident review -- Blameless retrospectives that focus on systemic causes, not individual mistakes. The output is action items to prevent recurrence.

On-call rotations must be sustainable. Key principles include: compensating on-call time fairly (time off or additional pay), limiting on-call shifts to one week with at least three weeks off between shifts, tracking alert volume to identify noisy alerts that should be fixed or silenced, and ensuring the on-call engineer is never the only person who can respond to a critical issue.
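The one-week-on, three-weeks-off guideline implies a minimum rotation size: a simple round-robin needs at least four engineers. The sketch below shows one way to generate such a schedule. The names `build_rotation` and `MIN_WEEKS_OFF` are hypothetical, and real rotations would also need to handle vacations, swaps, and a secondary responder.

```python
from datetime import date, timedelta

MIN_WEEKS_OFF = 3  # assumption taken from the guideline above


def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Return (shift_start, engineer) pairs for one-week round-robin shifts."""
    if len(set(engineers)) != len(engineers):
        raise ValueError("engineer names must be unique")
    # With N engineers in round-robin, each gets N - 1 weeks off between shifts.
    if len(engineers) < MIN_WEEKS_OFF + 1:
        raise ValueError(
            f"need at least {MIN_WEEKS_OFF + 1} engineers for {MIN_WEEKS_OFF} weeks off"
        )
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]
```

Rejecting a three-person rotation outright, rather than silently scheduling it, makes the staffing gap visible before it turns into burnout.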

Tech Debt Management

Technical debt accumulates when teams take shortcuts to ship faster. It is not inherently bad -- sometimes taking on debt is the right business decision. The problem is when debt is invisible and unmanaged.

Two proven approaches for managing tech debt:

The tech debt register. Maintain a visible backlog of known technical debt items, each annotated with the impact (what breaks or slows down if this is not addressed), estimated effort to fix, and a priority level. Review the register quarterly and prioritize items that create the most drag on delivery speed.
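A register entry can be as simple as a small record. The sketch below models the three annotations described above (impact, estimated effort, priority) and orders items for a quarterly review. The `DebtItem` fields, the convention that priority 1 is highest, and the tie-break on effort are illustrative assumptions; many teams keep the register in their issue tracker instead.

```python
from dataclasses import dataclass


@dataclass
class DebtItem:
    title: str
    impact: str         # what breaks or slows down if this is not addressed
    effort_days: float  # estimated effort to fix
    priority: int       # hypothetical scale: 1 = highest priority


def review_order(register: list[DebtItem]) -> list[DebtItem]:
    """Sort by priority, then by effort so cheaper fixes surface first among equals."""
    return sorted(register, key=lambda item: (item.priority, item.effort_days))


# Example entries are invented purely for illustration.
register = [
    DebtItem("Flaky integration tests", "CI reruns delay every merge", 5, 1),
    DebtItem("Legacy auth library", "Blocks framework upgrade", 15, 2),
    DebtItem("Hand-rolled deploy script", "Releases need manual steps", 3, 1),
]
for item in review_order(register):
    print(f"P{item.priority} ({item.effort_days}d): {item.title}")
```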

The allocation approach. Reserve a fixed percentage of sprint capacity -- typically 20% -- for tech debt reduction, infrastructure improvements, and developer experience work. This approach ensures continuous progress on debt without requiring a separate planning process. When leadership asks "why are we spending 20% on non-feature work?", the answer is that without this investment, the remaining 80% slows down every quarter.

RFC and Design Doc Processes

An RFC (Request for Comments) or design doc process ensures that significant technical decisions are reviewed before implementation. This catches design flaws early, spreads knowledge across the team, and creates a written record of why decisions were made.

A lightweight RFC process looks like this:

  1. The author writes a one-to-three page document covering the problem, proposed solution, alternatives considered, and risks
  2. The document is shared with relevant reviewers (typically senior engineers, affected team leads, and infrastructure owners)
  3. Reviewers have a defined window (3-5 business days) to provide written feedback
  4. The author addresses feedback and either revises the proposal or documents why they chose not to
  5. A decision is recorded: approved, approved with modifications, or rejected

The threshold for requiring an RFC should be clear. Common triggers include: new services, database schema changes, new external dependencies, changes affecting multiple teams, and any work exceeding a defined size threshold (for example, more than two weeks of effort).
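Writing the triggers down as an explicit checklist removes ambiguity about when an RFC is required. The sketch below encodes the triggers listed above. The `ProposedChange` fields and the two-week `EFFORT_THRESHOLD_WEEKS` value mirror the examples in the text but are otherwise assumptions; each organization would set its own.

```python
from dataclasses import dataclass

EFFORT_THRESHOLD_WEEKS = 2.0  # example threshold from the text; tune per team


@dataclass
class ProposedChange:
    adds_new_service: bool = False
    changes_db_schema: bool = False
    adds_external_dependency: bool = False
    teams_affected: int = 1
    estimated_effort_weeks: float = 0.0


def rfc_reasons(change: ProposedChange) -> list[str]:
    """Return the triggers this change hits; an empty list means no RFC is needed."""
    reasons = []
    if change.adds_new_service:
        reasons.append("new service")
    if change.changes_db_schema:
        reasons.append("database schema change")
    if change.adds_external_dependency:
        reasons.append("new external dependency")
    if change.teams_affected > 1:
        reasons.append("affects multiple teams")
    if change.estimated_effort_weeks > EFFORT_THRESHOLD_WEEKS:
        reasons.append("exceeds effort threshold")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, tells the author which reviewers are likely to be relevant.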

Build vs. Buy

One of the most important process decisions EMs face is whether to build a capability in-house or purchase a third-party solution. The framework for this decision considers:

  • Core vs. context -- Is this capability a core differentiator for your business, or is it context (necessary but not differentiating)? Build core capabilities. Buy context capabilities.
  • Total cost of ownership -- Building includes not just initial development but ongoing maintenance, on-call support, documentation, and opportunity cost. Buying includes license fees, integration effort, vendor lock-in risk, and customization limitations.
  • Team expertise -- Do you have the specialized knowledge to build and maintain this well? A team without database expertise should not build a custom database.
  • Time to value -- If speed matters, buying an existing solution gets you to value faster even if it is not perfect.
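The total cost of ownership comparison can be made concrete with rough arithmetic. The sketch below is deliberately simplified: the function names and all figures are hypothetical. It covers only direct costs, leaving out opportunity cost, vendor lock-in risk, and the core-vs-context question, which still need qualitative judgment.

```python
def build_tco(initial_dev_cost: int, annual_maintenance: int, years: int) -> int:
    """Direct cost of building: initial development plus ongoing maintenance and support."""
    return initial_dev_cost + annual_maintenance * years


def buy_tco(integration_cost: int, annual_license: int, years: int) -> int:
    """Direct cost of buying: one-time integration effort plus recurring license fees."""
    return integration_cost + annual_license * years


# Hypothetical figures for illustration only.
years = 3
build = build_tco(initial_dev_cost=300_000, annual_maintenance=60_000, years=years)
buy = buy_tco(integration_cost=40_000, annual_license=90_000, years=years)
print(f"Build over {years} years: {build}")
print(f"Buy over {years} years:   {buy}")
```

Changing the time horizon can flip the answer, because building front-loads cost while buying spreads it out, so it is worth running the comparison for more than one value of `years`.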

In interviews, demonstrating that you reason through build-vs-buy with a structured framework rather than defaulting to "we should build it" shows operational maturity.

Next, we will explore delivery systems and metrics -- including DORA metrics, velocity tracking, roadmap creation, and OKR setting for engineering teams.
