Testing in Production Using Feature Flags
The Core Idea
Feature flags decouple deployment from release. You deploy code to production but control who sees it. This turns every deployment into a testable experiment with measurable outcomes and automated rollback.
The traditional approach -- deploy everything to everyone simultaneously -- is a binary bet. Feature flags transform that bet into a series of small, reversible experiments.
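In code, the decoupling is just a conditional around the new path, evaluated per request against the flag service. The names below are illustrative placeholders rather than a specific SDK:

# Illustrative only: the new dashboard is deployed, but the flag decides who sees it.
def render_dashboard(user):
    if flags.is_enabled("new-dashboard", user_id=user.id):   # hypothetical flag client
        return render_new_dashboard(user)    # released only to targeted users
    return render_legacy_dashboard(user)     # everyone else keeps current behavior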
Feature Flag Platforms
| Platform | Type | Targeting | Analytics | Cost |
|---|---|---|---|---|
| LaunchDarkly | SaaS | User, segment, %-based, custom rules | Built-in experimentation | Per-seat SaaS |
| Unleash | Open source / SaaS | Strategy-based (gradual, user ID, IP) | Basic metrics | Free (self-hosted) |
| Flagsmith | Open source / SaaS | Segment, %-based, multi-variate | Built-in analytics | Free tier available |
| Split | SaaS | Attribute-based targeting | Full experimentation suite | Per-seat SaaS |
| OpenFeature | Standard/SDK | Provider-agnostic specification | Depends on provider | Free (specification) |
Choosing a Platform
- LaunchDarkly for enterprise teams needing the most mature targeting and experimentation capabilities
- Unleash for teams wanting open-source, self-hosted control with gradual rollout strategies
- Flagsmith for teams needing a good balance of features and pricing with an open-source option
- OpenFeature as an abstraction layer if you want to avoid vendor lock-in or use multiple providers
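With the OpenFeature option, application code evaluates flags through a provider-agnostic API and the vendor-specific provider is wired in once at startup, which is what makes switching providers cheap. A minimal sketch using the openfeature-sdk Python package; the provider wiring is shown only as a comment because the concrete provider class depends on your vendor:

# Provider-agnostic flag evaluation via OpenFeature (openfeature-sdk package).
from openfeature import api
from openfeature.evaluation_context import EvaluationContext

# Wire in the vendor-specific provider once at startup, e.g.
# api.set_provider(YourVendorProvider(...))  # provider class comes from the vendor's SDK

client = api.get_client()
ctx = EvaluationContext(targeting_key="user-123", attributes={"plan": "beta"})

if client.get_boolean_value("ai-summary-v2", False, ctx):
    ...  # new code path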
Feature Flag Testing Strategy
Quality-Gated Rollout Pattern
# Feature-flagged AI summarization with quality gates and fallback
import ldclient
from ldclient.config import Config
from ldclient import Context

ldclient.set_config(Config("sdk-key-production"))
client = ldclient.get()


def get_ai_summary(document, user_context):
    """Feature-flagged AI summarization with automatic quality gate."""
    context = Context.builder(user_context["user_id"]) \
        .set("plan", user_context["plan"]) \
        .set("region", user_context["region"]) \
        .build()

    # Check if this user should get the new AI summary feature
    if client.variation("ai-summary-v2", context, False):
        try:
            summary = call_new_ai_summary_endpoint(document)

            # Quality gate: verify the summary meets minimum quality bar
            quality_score = evaluate_summary_quality(summary, document)
            if quality_score < 0.7:
                # Track degraded quality as a metric
                client.track("ai-summary-quality-degraded", context,
                             metric_value=quality_score)
                # Fall back to old implementation
                return call_legacy_summary_endpoint(document)

            client.track("ai-summary-v2-success", context,
                         metric_value=quality_score)
            return summary
        except Exception:
            client.track("ai-summary-v2-error", context)
            return call_legacy_summary_endpoint(document)
    else:
        return call_legacy_summary_endpoint(document)
Progressive Rollout Plan
A disciplined rollout progresses through gates, each with specific quality signals:
# progressive-rollout-plan.yaml
feature: ai-summary-v2
rollout_stages:
  - name: internal_dogfood
    percentage: 0%
    targeting: "email ends with @ourcompany.com"
    duration: 3 days
    quality_gates:
      - error_rate < 1%
      - p95_latency < 3s
      - quality_score_avg > 0.8
    rollback_trigger: "any gate fails for 15 minutes"

  - name: beta_users
    percentage: 5%
    targeting: "plan == 'beta'"
    duration: 5 days
    quality_gates:
      - error_rate < 0.5%
      - p95_latency < 2.5s
      - quality_score_avg > 0.82
      - user_satisfaction_score > 4.0
    rollback_trigger: "any gate fails for 30 minutes"

  - name: gradual_rollout
    percentage_stages: [10%, 25%, 50%, 75%, 100%]
    advance_interval: 24 hours
    quality_gates:
      - error_rate < 0.3%
      - p95_latency < 2s
      - quality_score_avg > 0.85
      - no_pager_incidents
    rollback_trigger: "any gate fails for 1 hour OR pager fires"

  - name: general_availability
    percentage: 100%
    cleanup: "remove feature flag, delete old code path"
Testing Feature Flag Behavior
Feature flags themselves need testing. A misconfigured flag can cause partial outages or inconsistent user experiences.
Unit Testing Flag Logic
# test_feature_flags.py
import pytest
from unittest.mock import ANY, patch

# The module name `summary_service` is illustrative: patch targets must point at
# the module where get_ai_summary looks up its helpers. These tests also assume
# fixtures (e.g. in conftest.py) provide `mock_ld_client` and stub the
# summarization endpoints, so only the flag-specific behavior is exercised here.
from summary_service import get_ai_summary


class TestFeatureFlagBehavior:
    def test_flag_on_uses_new_implementation(self, mock_ld_client):
        """When flag is ON, the new implementation should be used."""
        mock_ld_client.variation.return_value = True

        result = get_ai_summary("test document", {"user_id": "u1", "plan": "beta", "region": "us"})

        assert result.source == "ai-summary-v2"
        mock_ld_client.track.assert_called_with(
            "ai-summary-v2-success", ANY, metric_value=ANY
        )

    def test_flag_off_uses_legacy_implementation(self, mock_ld_client):
        """When flag is OFF, the legacy implementation should be used."""
        mock_ld_client.variation.return_value = False

        result = get_ai_summary("test document", {"user_id": "u1", "plan": "free", "region": "us"})

        assert result.source == "legacy-summary"

    def test_flag_on_with_quality_degradation_falls_back(self, mock_ld_client):
        """When flag is ON but quality is poor, should fall back to legacy."""
        mock_ld_client.variation.return_value = True

        with patch("summary_service.evaluate_summary_quality", return_value=0.4):
            result = get_ai_summary("test document", {"user_id": "u1", "plan": "beta", "region": "us"})

        assert result.source == "legacy-summary"
        mock_ld_client.track.assert_called_with(
            "ai-summary-quality-degraded", ANY, metric_value=0.4
        )

    def test_flag_on_with_exception_falls_back(self, mock_ld_client):
        """When flag is ON but the new implementation throws, should fall back."""
        mock_ld_client.variation.return_value = True

        with patch("summary_service.call_new_ai_summary_endpoint", side_effect=TimeoutError):
            result = get_ai_summary("test document", {"user_id": "u1", "plan": "beta", "region": "us"})

        assert result.source == "legacy-summary"
        mock_ld_client.track.assert_called_with("ai-summary-v2-error", ANY)
Integration Testing: Both Paths
Every feature-flagged code path must have integration test coverage:
@pytest.mark.parametrize("flag_state", [True, False])
def test_summary_endpoint_works_in_both_states(flag_state, test_client, mock_flags):
    """Both code paths must produce valid responses."""
    mock_flags.set("ai-summary-v2", flag_state)

    response = test_client.post("/api/summarize", json={"document": "Test content..."})

    assert response.status_code == 200
    assert "summary" in response.json()
    assert len(response.json()["summary"]) > 0
Feature Flag Hygiene
Technical Debt Management
Feature flags that remain in the code indefinitely become technical debt. Implement a cleanup process:
| Flag Age | Action |
|---|---|
| < 7 days | Active rollout -- leave as-is |
| 7-30 days | Should be at 100% or rolled back |
| 30-90 days | Schedule cleanup ticket |
| > 90 days | Flag is technical debt -- prioritize removal |
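A scheduled audit can enforce these thresholds automatically by checking flag creation dates and surfacing anything past the 90-day mark. A sketch, assuming flag metadata is available as a simple mapping of flag key to creation date; the registry and its entries are illustrative, and the same data could come from your flag platform's API:

# Illustrative flag-age audit against an assumed local registry of flag metadata.
from datetime import date

FLAG_REGISTRY = {
    "checkout-ai-summary-v2": date(2025, 11, 3),
    "search-semantic-ranking-v1": date(2025, 6, 14),
}

def stale_flags(registry, max_age_days=90, today=None):
    """Return (flag, age_in_days) for every flag older than max_age_days."""
    today = today or date.today()
    return [(flag, (today - created).days)
            for flag, created in registry.items()
            if (today - created).days > max_age_days]

for flag, age in stale_flags(FLAG_REGISTRY):
    print(f"TECH DEBT: {flag} is {age} days old -- schedule removal")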
Flag Naming Conventions
Consistent naming makes flags discoverable and their purpose clear:
{team}-{feature}-{version}
Examples:
checkout-ai-summary-v2
search-semantic-ranking-v1
onboarding-new-flow-q1-2026
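The convention is easy to enforce mechanically, for example with a small validation step at flag-creation time or in CI. A sketch; the regular expression below is one illustrative encoding of the pattern, not a prescribed standard:

# Illustrative check that flag names follow {team}-{feature}-{version}.
import re

FLAG_NAME_PATTERN = re.compile(r"^[a-z]+(-[a-z0-9]+)+-(v\d+|q[1-4]-\d{4})$")

def validate_flag_name(name):
    if not FLAG_NAME_PATTERN.match(name):
        raise ValueError(f"Flag name '{name}' does not follow {{team}}-{{feature}}-{{version}}")

validate_flag_name("checkout-ai-summary-v2")       # ok
validate_flag_name("onboarding-new-flow-q1-2026")  # ok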
Monitoring Feature Flag Impact
Every feature flag should have associated metrics that answer:
- Is the new code path working? (Error rate per flag state)
- Is it performant? (Latency per flag state)
- Is it better for users? (Business metrics per flag state)
- Is it stable? (No degradation trend over time)
# Prometheus metrics for feature flag monitoring
from prometheus_client import Counter, Histogram

flag_requests = Counter(
    'feature_flag_requests_total',
    'Total requests per feature flag state',
    ['flag_name', 'flag_state', 'outcome']
)

flag_latency = Histogram(
    'feature_flag_latency_seconds',
    'Latency by feature flag state',
    ['flag_name', 'flag_state'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
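These metrics only answer the questions above if every request records the flag state it was actually served, so the call site labels each observation accordingly. An illustrative wrapper; the `handle_summarize` function is an assumption about the surrounding application, while `get_ai_summary` is the flagged function from earlier:

# Illustrative call-site instrumentation: label every observation with the flag state served.
def handle_summarize(document, user_context, flag_state):
    # flag_state is "on" or "off", taken from the same variation call that chose the path
    with flag_latency.labels("ai-summary-v2", flag_state).time():
        try:
            summary = get_ai_summary(document, user_context)
            flag_requests.labels("ai-summary-v2", flag_state, "success").inc()
            return summary
        except Exception:
            flag_requests.labels("ai-summary-v2", flag_state, "error").inc()
            raise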
Feature flags are the foundation of production testing. They transform deployment from a risky event into a controlled experiment with measurable quality outcomes.