Case Study: The OpenObserve Council of Sub-Agents
The Challenge
OpenObserve, an open-source observability platform written in Rust, faced a critical challenge: a codebase with only 380 tests, many of them flaky. Manual test writing could not keep pace with feature development. Their solution, a multi-agent system, became one of the best-documented real-world applications of agentic testing.
Architecture: Eight Specialized Agents
+------------------------+
| COUNCIL ORCHESTRATOR |
| (Test Plan Director) |
+-----------+------------+
|
+-------+-------+-------+---+---+-------+-------+-------+
| | | | | | | |
[1] [2] [3] [4] [5] [6] [7] [8]
Code Test Test Test Flaky Cov. Doc PR
Anal. Gen. Run. Fix. Detect Anal. Gen. Rev.
| Agent | Role | Key Capability |
|---|---|---|
| Code Analyzer | Reads source code, identifies testable functions | Maps function signatures, dependencies, error paths |
| Test Generator | Writes new test cases | Uses code analysis + specification to produce Rust tests |
| Test Runner | Executes tests, captures results | Runs cargo test, parses stdout/stderr, reports pass/fail |
| Test Fixer | Repairs failing tests | Reads error messages, identifies root cause, applies fix |
| Flaky Detector | Identifies non-deterministic tests | Runs each test 5x, flags inconsistent results |
| Coverage Analyzer | Measures code coverage, identifies gaps | Runs cargo tarpaulin, reports uncovered lines |
| Documentation Generator | Creates test documentation | Produces test plan documents from the test suite |
| PR Reviewer | Reviews test PRs before merge | Checks quality, coverage, style conformance |
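The orchestration pattern behind this table can be sketched as a simple dispatch loop: the orchestrator runs each agent in order over a shared state dict, recording each agent's output under its own section. This is an illustrative sketch, not OpenObserve's actual API; the agent interface (a callable taking the shared state) and the stub agents are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    agent: str
    output: dict

class Orchestrator:
    """Runs each specialized agent in order over a shared state dict."""

    def __init__(self, agents: list[tuple[str, callable]]):
        self.agents = agents  # (state key, agent function) pairs

    def run_cycle(self, state: dict) -> list[AgentResult]:
        results = []
        for name, agent_fn in self.agents:
            output = agent_fn(state)  # agent reads what it needs from state
            state[name] = output      # and writes under its own section
            results.append(AgentResult(name, output))
        return results

# Two stub agents standing in for the Code Analyzer and Test Generator
analyzer = lambda state: {"functions": ["execute_search"]}
generator = lambda state: {
    "tests": ["test_" + f for f in state["code_analysis"]["functions"]]
}
orchestrator = Orchestrator([("code_analysis", analyzer),
                             ("test_generation", generator)])
results = orchestrator.run_cycle({})
# results[1].output["tests"] == ["test_execute_search"]
```

The real system adds error handling, retries, and per-agent prompts, but the core shape is this: one loop, one shared state, one section per agent.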
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total tests | 380 | 700+ | +84% |
| Flaky tests | n/a | n/a | -85% |
| Code coverage | 34% | 58% | +24 percentage points |
| Time per test cycle | 45 min (manual) | 8 min (automated) | -82% |
| Test PR review time | 2-3 hours | 30 min (agent pre-review) | -80% |
Key Design Decisions
1. Shared Memory with Bounded Context
Each agent writes to a shared JSON state file, but only reads the sections relevant to its role. This prevents context pollution -- the Test Runner does not need to see the Code Analyzer's full AST output.
{
  "code_analysis": {
    "module": "src/handlers/search.rs",
    "functions": [
      {
        "name": "execute_search",
        "params": ["query: SearchQuery", "org_id: &str"],
        "return_type": "Result<SearchResponse, Error>",
        "error_paths": ["InvalidQuery", "PermissionDenied", "Timeout"],
        "complexity": "high"
      }
    ]
  },
  "test_generation": {
    "generated_tests": ["..."],
    "pending_review": ["..."]
  },
  "flaky_detection": {
    "flagged_tests": ["test_search_timeout -- passed 3/5 runs"]
  }
}
Why bounded context matters: When Agent 2 (Test Generator) reads the state, it only loads code_analysis and test_generation. It does not load flaky_detection or coverage_analysis. This keeps each agent's prompt focused and within token limits.
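One way to implement that bounded read is a per-agent allowlist over the shared state file. A minimal sketch, assuming the section keys from the state example above; the AGENT_VIEWS mapping and function name are illustrative, not OpenObserve's actual code:

```python
import json

# Which sections of the shared state each agent may read (assumed mapping)
AGENT_VIEWS = {
    "test_generator": ["code_analysis", "test_generation"],
    "flaky_detector": ["flaky_detection"],
}

def load_bounded_state(state_json: str, agent: str) -> dict:
    """Return only the sections of the shared state this agent may see."""
    full_state = json.loads(state_json)
    return {k: v for k, v in full_state.items() if k in AGENT_VIEWS[agent]}

state = json.dumps({
    "code_analysis": {"module": "src/handlers/search.rs"},
    "test_generation": {"generated_tests": []},
    "flaky_detection": {"flagged_tests": ["test_search_timeout"]},
})
view = load_bounded_state(state, "test_generator")
# view contains code_analysis and test_generation, but not flaky_detection
```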
2. Human Approval Gates
Tests were not merged automatically. The PR Reviewer agent created a pull request with a structured summary, and a human made the final merge decision. This kept humans in the loop for quality control.
The PR summary format:
## Agent-Generated Test PR
**Module:** src/handlers/search.rs
**Tests Added:** 12
**Tests Modified:** 3
**Coverage Change:** 34% → 41% (+7 pp)
### New Tests
- test_execute_search_valid_query (happy path)
- test_execute_search_invalid_query (error: InvalidQuery)
- test_execute_search_permission_denied (error: PermissionDenied)
- ... (9 more)
### Flaky Tests Fixed
- test_search_timeout: added explicit timeout mock (was relying on real network)
- test_concurrent_search: added mutex for shared test state
### Reviewer Notes
- All tests pass 5/5 runs
- Naming convention matches existing tests
- Fixtures reuse existing test helpers
3. Incremental Execution
Agents did not regenerate the entire test suite each run. They analyzed what changed (new commits), identified what needed new tests, and generated incrementally.
class IncrementalAnalyzer:
    def identify_changes(self, since_commit: str) -> list[Change]:
        """Find what changed since the last agent run."""
        # git_diff is a helper wrapping git diff <since>..HEAD; each Change
        # carries a file path plus the hunks that touched it
        diff = git_diff(since_commit, "HEAD")
        changed_functions = []
        for file_change in diff:
            # Only Rust source files outside test paths need test updates
            if file_change.path.endswith(".rs") and "test" not in file_change.path:
                changed_functions.extend(
                    self.extract_changed_functions(file_change)
                )
        return changed_functions
This kept token costs manageable. Instead of analyzing the entire codebase (millions of tokens), each run analyzed only the delta (thousands of tokens).
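The filtering step inside identify_changes can be exercised in isolation. A minimal sketch that takes the kind of path list produced by git diff --name-only (the paths here are illustrative):

```python
def files_needing_tests(diff_output: str) -> list[str]:
    """Keep only Rust source files outside test paths from a path list."""
    changed = []
    for path in diff_output.splitlines():
        if path.endswith(".rs") and "test" not in path:
            changed.append(path)
    return changed

diff = "src/handlers/search.rs\ntests/search_test.rs\nREADME.md\n"
print(files_needing_tests(diff))  # ['src/handlers/search.rs']
```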
Lessons Learned
1. Flaky Test Reduction Was the Biggest Win
The Flaky Detector agent caught tests that humans had been ignoring for months. Running each test 5x and flagging inconsistencies eliminated accumulated tech debt.
How the Flaky Detector works:
class FlakyDetector:
    def detect(self, test_name: str, runs: int = 5) -> FlakyReport:
        results = []
        for i in range(runs):
            result = run_single_test(test_name)
            results.append(result)
        pass_count = sum(1 for r in results if r.passed)
        fail_count = runs - pass_count
        if 0 < fail_count < runs:  # Mix of pass and fail = flaky
            return FlakyReport(
                test_name=test_name,
                status="flaky",
                pass_rate=pass_count / runs,
                failure_reasons=self.analyze_failures(results)
            )
        return FlakyReport(test_name=test_name, status="stable")
Common flaky causes found:
- Tests depending on real network timeouts
- Tests sharing mutable state through global variables
- Tests depending on HashMap iteration order (Rust HashMaps are not ordered)
- Tests depending on file system timestamps
2. Agent Specialization Matters
Early attempts with a single "do everything" agent produced mediocre results. The single agent tried to analyze code, generate tests, run them, and fix failures all in one loop. It lost context, made inconsistent decisions, and produced lower-quality output than the specialized team.
- Single agent (early approach): average test quality score 62/100
- Eight specialized agents (final approach): average test quality score 84/100
3. The Test Fixer Agent Was the Most Complex
It needed to understand:
- Rust compiler errors (type mismatches, borrow checker violations)
- Test framework output (cargo test stdout/stderr format)
- The difference between a "test bug" and an "application bug"
The critical prompt for the Test Fixer:
You are fixing a failing Rust test. Determine whether this is:
A) A TEST BUG: the test code is wrong (wrong assertion, missing import,
type mismatch, incorrect fixture setup). FIX the test.
B) AN APPLICATION BUG: the production code is wrong and the test correctly
detected it. DO NOT fix the test. REPORT the application bug.
Error output:
{cargo_test_stderr}
Test code:
{test_code}
Production code:
{source_code}
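Once the model answers, the fixer has to route on the A/B verdict. A sketch of that routing step; route_verdict and the escalation fallback are illustrative assumptions, not OpenObserve's implementation:

```python
def route_verdict(verdict: str) -> str:
    """Map the model's A/B verdict to the Test Fixer's next action."""
    v = verdict.strip().upper()
    if v.startswith("A"):
        return "fix_test"                # test bug: repair the test code
    if v.startswith("B"):
        return "report_application_bug"  # real bug: file a report, keep test
    return "escalate_to_human"           # unparseable verdict: fail safe

print(route_verdict("A) TEST BUG: wrong assertion on error variant"))
# fix_test
```

The fail-safe branch matters: an agent that silently "fixes" a test when the model's answer was ambiguous can mask a real application bug.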
4. Cost Control Required Explicit Budget Management
Without limits, agents would iterate endlessly on edge cases. They implemented a per-module token budget:
MODULE_BUDGETS = {
"src/handlers/": 100_000, # High-complexity, more budget
"src/models/": 50_000, # Medium complexity
"src/utils/": 25_000, # Low complexity, simple functions
}
When a module's budget was exhausted, agents stopped working on it and moved to the next module. This forced prioritization: high-complexity modules got more attention.
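Enforcing such a budget comes down to a per-module counter charged after every model call. A minimal sketch, assuming a charge() call is made after each request; the BudgetTracker class is illustrative, not OpenObserve's code:

```python
MODULE_BUDGETS = {
    "src/handlers/": 100_000,  # High-complexity, more budget
    "src/models/": 50_000,     # Medium complexity
    "src/utils/": 25_000,      # Low complexity, simple functions
}

class BudgetTracker:
    def __init__(self, budgets: dict[str, int]):
        self.remaining = dict(budgets)

    def charge(self, module: str, tokens: int) -> bool:
        """Deduct tokens; return False once the module's budget is spent."""
        if self.remaining.get(module, 0) <= 0:
            return False
        self.remaining[module] -= tokens
        return self.remaining[module] > 0

tracker = BudgetTracker(MODULE_BUDGETS)
print(tracker.charge("src/utils/", 20_000))  # True: 5_000 tokens remain
print(tracker.charge("src/utils/", 10_000))  # False: budget exhausted
```

When charge() returns False, the orchestrator moves the agents on to the next module, which is what forces the prioritization described above.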
Applying OpenObserve's Lessons to Your Projects
| OpenObserve Practice | How to Apply It | Minimum Team Size |
|---|---|---|
| Specialized agents | Start with 3: Analyzer, Generator, Runner | 1 engineer |
| Shared state file | Use a JSON file or SQLite database | 1 engineer |
| Human approval gates | Require PR review for agent-generated tests | 1 engineer |
| Incremental execution | Only analyze changed files since last run | 1 engineer |
| Flaky detection | Run tests 3-5x before merging | 1 engineer |
| Token budgets | Set per-module limits in config | 1 engineer |
Key Takeaway
The OpenObserve case study proves that multi-agent testing works in production at scale. The key factors were agent specialization (eight focused agents outperformed one general agent), human-in-the-loop gates (agents propose, humans approve), incremental execution (analyze deltas, not the whole codebase), and explicit cost controls (per-module token budgets). The 85% reduction in flaky tests was the single most impactful outcome.