Risk Management

Technical risk management is the systematic identification, assessment, and mitigation of threats to a system's delivery, operation, and compliance. A risk register captures each risk with probability, impact, mitigation strategy, and an owner. Risks are visualized in a probability × impact matrix to prioritize mitigation effort. Technical debt risk is particularly insidious: unmeasured debt accumulates silently until it causes production incidents or blocks new feature development. Vendor dependency risk (single-vendor for critical infrastructure) and security risk (from the OWASP Top 10 to supply chain attacks) require dedicated registers with regular review cycles.
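As a rough illustration of how a register feeds prioritization, the sketch below models a register row and orders rows by probability × impact. The class name, field names, the 1 to 5 scales, and the sample risks are all assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    """One row of a risk register (field names are illustrative)."""
    risk_id: str
    description: str
    probability: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (negligible) to 5 (severe)
    mitigation: str
    owner: str

    @property
    def score(self) -> int:
        # Risk score is probability multiplied by impact.
        return self.probability * self.impact


def prioritize(register: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest score."""
    return sorted(register, key=lambda r: r.score, reverse=True)


# Hypothetical sample data, invented for this sketch.
register = [
    Risk("R-1", "Single cloud vendor for the primary database", 2, 5,
         "Document an exit plan; test restores elsewhere", "platform-lead"),
    Risk("R-2", "Unmeasured technical debt in the billing module", 4, 3,
         "Track debt ratio per release", "billing-lead"),
    Risk("R-3", "Outdated dependency with a known CVE", 4, 5,
         "Upgrade and add dependency scanning to CI", "security-lead"),
]

for risk in prioritize(register):
    print(risk.risk_id, risk.score)
```

Sorting by a single score is only a first pass: a low-probability, high-impact risk can rank below a frequent nuisance, which is one reason to read the score alongside the matrix rather than instead of it.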

Key Points

Risk register fields: risk ID, description, category (technical debt, security, vendor, compliance, operational), probability (H/M/L or 1–5), impact (H/M/L or 1–5), risk score (P × I), mitigation strategy, owner, due date, residual risk after mitigation
Risk categories for software systems: architectural (wrong technology choice), delivery (scope creep, key person dependency), operational (runbook gaps, single points of failure), security (OWASP, supply chain), compliance (GDPR breach, HIPAA violation), and vendor (EOL, acquisition, pricing change)
Technical debt risk: quantify debt using tools (SonarQube technical debt ratio, code coverage trends, dependency vulnerability age) rather than qualitative assessments — "we have a lot of tech debt" is not actionable
Key person risk: document tribal knowledge, require code review by >= 2 engineers, rotate on-call and release responsibilities; bus factor < 2 for any critical system component is an active risk
Supply chain risk: software composition analysis (Snyk, Dependabot, Black Duck) scans dependencies for known CVEs; SBOM (Software Bill of Materials) — required by US Executive Order 14028 for federal software vendors — documents all components
Risk acceptance vs mitigation: a risk may be accepted rather than mitigated when mitigation costs more than the expected loss or the probability is very low; formal acceptance requires sign-off at the appropriate authority level (for example, a P-High risk must be accepted by a VP or above)
Residual risk: after applying mitigations, document the remaining risk level; if residual risk is still High, escalate — don't silently accept high residual risk
Risk review cadence: P-High risks reviewed monthly; P-Medium quarterly; P-Low semi-annually; risks are closed when fully mitigated or explicitly accepted with an expiry date
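The review, escalation, and closure rules above can be sketched as a few small functions. This is a minimal sketch under stated assumptions: the `RegisterEntry` type, its field names, and the day counts standing in for "monthly", "quarterly", and "semi-annually" are hypothetical, and treating an expired acceptance as a reopened risk is one possible reading of "accepted with an expiry date".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Approximate day counts for monthly / quarterly / semi-annual review.
# The exact numbers are assumptions for this sketch.
REVIEW_INTERVAL_DAYS = {"High": 30, "Medium": 90, "Low": 180}


@dataclass
class RegisterEntry:
    risk_id: str
    priority: str                          # "High", "Medium", or "Low"
    residual_risk: str                     # level remaining after mitigation
    last_reviewed: date
    fully_mitigated: bool = False
    accepted_until: Optional[date] = None  # expiry of a formal acceptance


def next_review(entry: RegisterEntry) -> date:
    """Date by which the entry should be reviewed again."""
    return entry.last_reviewed + timedelta(days=REVIEW_INTERVAL_DAYS[entry.priority])


def needs_escalation(entry: RegisterEntry) -> bool:
    """High residual risk is escalated rather than silently accepted."""
    return entry.residual_risk == "High"


def is_closed(entry: RegisterEntry, today: date) -> bool:
    """Closed if fully mitigated, or accepted and the acceptance has not expired."""
    if entry.fully_mitigated:
        return True
    return entry.accepted_until is not None and today <= entry.accepted_until


# Hypothetical entries, invented for this sketch.
open_risk = RegisterEntry("R-7", "High", "High", date(2024, 1, 15))
accepted_risk = RegisterEntry("R-8", "Low", "Low", date(2024, 1, 15),
                              accepted_until=date(2024, 6, 30))

print(next_review(open_risk))                      # 2024-02-14
print(needs_escalation(open_risk))                 # True
print(is_closed(accepted_risk, date(2024, 3, 1)))  # True
print(is_closed(accepted_risk, date(2024, 7, 1)))  # False
```

Storing the acceptance expiry on the entry means an accepted risk resurfaces on its own once the date passes, instead of relying on someone remembering to revisit it.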

Probability \ Impact	Low Impact	Medium Impact	High Impact
High Probability	MEDIUM — Monitor and mitigate; schedule remediation in next quarter	HIGH — Immediate mitigation plan; assign owner; weekly tracking	CRITICAL — Escalate to VP/CTO; stop feature work if needed; daily review
Medium Probability	LOW — Document and accept; review quarterly	MEDIUM — Mitigation plan within 30 days; assign owner; monthly review	HIGH — Mitigation plan within 2 weeks; executive awareness required
Low Probability	LOW — Accept and log; review semi-annually	LOW — Accept with documented rationale; review quarterly	MEDIUM — Contingency plan required; review monthly; insure or hedge if possible
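The matrix can also be encoded as a lookup table so that ratings are assigned consistently. The sketch below copies the nine ratings from the matrix; the `bucket` helper, which maps a 1 to 5 score onto Low/Medium/High, uses thresholds that are an assumption for illustration and not part of the matrix.

```python
# (probability, impact) -> rating, transcribed from the matrix above.
MATRIX = {
    ("High", "Low"): "MEDIUM",
    ("High", "Medium"): "HIGH",
    ("High", "High"): "CRITICAL",
    ("Medium", "Low"): "LOW",
    ("Medium", "Medium"): "MEDIUM",
    ("Medium", "High"): "HIGH",
    ("Low", "Low"): "LOW",
    ("Low", "Medium"): "LOW",
    ("Low", "High"): "MEDIUM",
}


def bucket(score: int) -> str:
    """Map a 1-5 score to Low/Medium/High (thresholds are assumed)."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    if score <= 2:
        return "Low"
    if score == 3:
        return "Medium"
    return "High"


def rating(probability: str, impact: str) -> str:
    """Look up the matrix rating for a probability/impact pair."""
    return MATRIX[(probability.capitalize(), impact.capitalize())]


print(rating("High", "High"))         # CRITICAL
print(rating("low", "high"))          # MEDIUM
print(rating(bucket(4), bucket(3)))   # HIGH
```

Note that the matrix is not symmetric: high probability with low impact rates MEDIUM, while low probability with medium impact rates LOW, so a plain P × I product would not reproduce it exactly.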

Real-World Example

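After the SolarWinds supply chain attack was disclosed in December 2020, US Executive Order 14028 (May 2021) directed federal agencies to require a Software Bill of Materials (SBOM) from their software vendors, and CISA now coordinates federal SBOM guidance. Large vendors such as Microsoft and Google responded with their own supply chain security tooling and frameworks. Together these moves turned supply chain risk from a theoretical concern into a board-level governance requirement, with procurement consequences for vendors that sell to the US government and cannot comply.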
