Risk-Based Strategy and Metrics
The transition from automation engineer to quality engineer is marked by a shift in questions. Instead of "how do I test this?" the question becomes "what should I test — and how much testing is enough?" This shift requires understanding risk and measuring outcomes, not just activities.
Testing What Matters
Anti-Pattern: Chasing code coverage percentages. "We have 80% coverage" sounds good but says nothing about whether the right things are covered: the login page may sit at 100% coverage while payment error handling sits at 0%.
Pattern: Risk-based test strategy — allocate testing effort proportional to business risk, not code volume.
The Risk Matrix
Plot features on two axes:
| | Low Failure Likelihood | High Failure Likelihood |
|---|---|---|
| High Business Impact | Monitor (stable but critical) | Test heavily (critical and volatile) |
| Low Business Impact | Deprioritize (stable and low-stakes) | Fix the instability (volatile, even if low-stakes) |
Inputs to the risk matrix:
- Business impact mapping — What happens if this feature breaks? Revenue loss? User churn? Regulatory violation?
- Failure history — What broke in the last six months? Features that broke before are more likely to break again
- Code complexity and change frequency — Complex code that changes often is the highest-risk combination
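The matrix and its inputs can be sketched as a small scoring function. This is a minimal illustration, not a standard: the feature names, the 1-5 scores, and the threshold of 3 are all invented assumptions; in practice the scores come from your own impact mapping and failure history.

```python
def quadrant(impact: int, likelihood: int, threshold: int = 3) -> str:
    """Map a feature's impact and failure-likelihood scores (1-5, assumed
    scale) to a quadrant of the risk matrix."""
    high_impact = impact >= threshold
    high_likelihood = likelihood >= threshold
    if high_impact and high_likelihood:
        return "test heavily"       # critical and volatile
    if high_impact:
        return "monitor"            # stable but critical
    if high_likelihood:
        return "fix the instability"  # volatile, even if low-stakes
    return "deprioritize"           # stable and low-stakes

# Hypothetical features scored from business impact mapping (first value)
# and failure history / change frequency (second value).
features = {
    "checkout-payment": (5, 4),  # revenue-critical, broke twice last quarter
    "login": (5, 1),             # critical but stable for years
    "profile-avatar": (1, 4),    # low-stakes but flaky
    "help-center": (1, 1),       # stable and low-stakes
}

for name, (impact, likelihood) in features.items():
    print(f"{name}: {quadrant(impact, likelihood)}")
```

The useful property of even a crude score like this is that it forces the testing-effort conversation onto two explicit axes instead of one implicit one (code volume).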
Metrics That Actually Matter
Anti-Pattern: Vanity metrics that look good in dashboards but do not drive improvement — automation percentage, total test count, bugs found.
Pattern: Outcome metrics that measure whether testing is achieving its purpose.
Vanity vs Outcome Metrics
| Vanity Metric | Why It Misleads | Outcome Metric | Why It Matters |
|---|---|---|---|
| Automation % | 90% automation with wrong tests is worse than 50% with right tests | Escaped defects | Bugs that reach production despite testing — the direct measure of test effectiveness |
| Test count | More tests ≠ better quality; many may be redundant or low-value | MTTR (Mean Time to Recovery) | How fast do you detect and fix production issues? |
| Bugs found | Finding more bugs can mean worse code, not better testing | Signal-to-noise ratio | % of test failures that are real bugs vs flakiness or environment issues |
| Pass rate | 99% pass rate means nothing if the failing 1% are ignored | Change failure rate | % of deployments that cause a production incident |
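The outcome metrics in the table are all simple ratios over data most teams already have. A minimal sketch of computing them, assuming hypothetical monthly counts (the numbers and field choices are illustrative, not benchmarks):

```python
def escaped_defect_rate(escaped: int, total_defects: int) -> float:
    """Share of all defects that reached production despite testing."""
    return escaped / total_defects if total_defects else 0.0

def mttr_hours(recovery_times_hours: list[float]) -> float:
    """Mean time to recovery across production incidents."""
    return sum(recovery_times_hours) / len(recovery_times_hours)

def signal_to_noise(real_bug_failures: int, total_failures: int) -> float:
    """Fraction of test failures caused by real bugs, not flakiness
    or environment issues."""
    return real_bug_failures / total_failures if total_failures else 1.0

def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Fraction of deployments that caused a production incident."""
    return failed_deploys / total_deploys if total_deploys else 0.0

# Hypothetical month: 4 of 40 logged defects escaped to production,
# three incidents took 2-6 hours to resolve, 12 of 60 test failures
# were real bugs, and 2 of 50 deploys caused an incident.
print(f"escaped defect rate: {escaped_defect_rate(4, 40):.0%}")
print(f"MTTR: {mttr_hours([2, 4, 6]):.1f}h")
print(f"signal-to-noise: {signal_to_noise(12, 60):.0%}")
print(f"change failure rate: {change_failure_rate(2, 50):.0%}")
```

Tracked month over month, the trend in these ratios matters more than any single value: a signal-to-noise ratio of 20%, as in the sample data, would mean four out of five failures are noise, which is exactly the "sick suite" condition the takeaways below describe.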
Key Takeaways
- Allocate testing effort based on risk (business impact x failure likelihood), not code coverage targets
- Use failure history, code complexity, and business impact mapping as inputs to your test strategy
- Replace vanity metrics (automation %, test count) with outcome metrics (escaped defects, MTTR, change failure rate)
- Signal-to-noise ratio is the health metric of your test suite — if most failures are flakiness, not bugs, the suite is sick
- The risk matrix is a living document — update it quarterly as the product and risk landscape evolve