Case Study: The OpenObserve Council of Sub-Agents
The Challenge
OpenObserve, an open-source observability platform written in Rust, faced a critical challenge: a codebase with only 380 tests, many of them flaky. Manual test writing could not keep pace with feature development. Their solution, a multi-agent system, became one of the best-documented real-world applications of agentic testing.
Architecture: Eight Specialized Agents
+------------------------+
| COUNCIL ORCHESTRATOR |
| (Test Plan Director) |
+-----------+------------+
|
+-------+-------+-------+---+---+-------+-------+-------+
| | | | | | | |
[1] [2] [3] [4] [5] [6] [7] [8]
Code Test Test Test Flaky Cov. Doc PR
Anal. Gen. Run. Fix. Detect Anal. Gen. Rev.
| Agent | Role | Key Capability |
|---|---|---|
| Code Analyzer | Reads source code, identifies testable functions | Maps function signatures, dependencies, error paths |
| Test Generator | Writes new test cases | Uses code analysis + specification to produce Rust tests |
| Test Runner | Executes tests, captures results | Runs cargo test, parses stdout/stderr, reports pass/fail |
| Test Fixer | Repairs failing tests | Reads error messages, identifies root cause, applies fix |
| Flaky Detector | Identifies non-deterministic tests | Runs each test 5x, flags inconsistent results |
| Coverage Analyzer | Measures code coverage, identifies gaps | Runs cargo tarpaulin, reports uncovered lines |
| Documentation Generator | Creates test documentation | Produces test plan documents from the test suite |
| PR Reviewer | Reviews test PRs before merge | Checks quality, coverage, style conformance |
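The orchestration pattern behind this table can be sketched as a simple dispatch loop: the orchestrator runs each agent in order over a shared state dict, recording each agent's output under its own section. This is an illustrative sketch, not OpenObserve's actual API; the agent interface (a callable taking the shared state) and the stub agents are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    agent: str
    output: dict

class Orchestrator:
    """Runs each specialized agent in order over a shared state dict."""

    def __init__(self, agents: list[tuple[str, callable]]):
        self.agents = agents  # (state key, agent function) pairs

    def run_cycle(self, state: dict) -> list[AgentResult]:
        results = []
        for name, agent_fn in self.agents:
            output = agent_fn(state)  # agent reads what it needs from state
            state[name] = output      # and writes under its own section
            results.append(AgentResult(name, output))
        return results

# Two stub agents standing in for the Code Analyzer and Test Generator
analyzer = lambda state: {"functions": ["execute_search"]}
generator = lambda state: {
    "tests": ["test_" + f for f in state["code_analysis"]["functions"]]
}
orchestrator = Orchestrator([("code_analysis", analyzer),
                             ("test_generation", generator)])
results = orchestrator.run_cycle({})
# results[1].output["tests"] == ["test_execute_search"]
```

The real system adds error handling, retries, and per-agent prompts, but the core shape is this: one loop, one shared state, one section per agent.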
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Total tests | 380 | 700+ | +84% |
| Flaky tests | n/a | n/a | -85% |
| Code coverage | 34% | 58% | +24 percentage points |
| Time per test cycle | 45 min (manual) | 8 min (automated) | -82% |
| Test PR review time | 2-3 hours | 30 min (agent pre-review) | -80% |
Key Design Decisions
1. Shared Memory with Bounded Context
Each agent writes to a shared JSON state file, but only reads the sections relevant to its role. This prevents context pollution -- the Test Runner does not need to see the Code Analyzer's full AST output.
{
  "code_analysis": {
    "module": "src/handlers/search.rs",
    "functions": [
      {
        "name": "execute_search",
        "params": ["query: SearchQuery", "org_id: &str"],
        "return_type": "Result<SearchResponse, Error>",
        "error_paths": ["InvalidQuery", "PermissionDenied", "Timeout"],
        "complexity": "high"
      }
    ]
  },
  "test_generation": {
    "generated_tests": ["..."],
    "pending_review": ["..."]
  },
  "flaky_detection": {
    "flagged_tests": ["test_search_timeout -- passed 3/5 runs"]
  }
}
Why bounded context matters: When Agent 2 (Test Generator) reads the state, it only loads code_analysis and test_generation. It does not load flaky_detection or coverage_analysis. This keeps each agent's prompt focused and within token limits.
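One way to implement that bounded read is a per-agent allowlist over the shared state file. A minimal sketch, assuming the section keys from the state example above; the AGENT_VIEWS mapping and function name are illustrative, not OpenObserve's actual code:

```python
import json

# Which sections of the shared state each agent may read (assumed mapping)
AGENT_VIEWS = {
    "test_generator": ["code_analysis", "test_generation"],
    "flaky_detector": ["flaky_detection"],
}

def load_bounded_state(state_json: str, agent: str) -> dict:
    """Return only the sections of the shared state this agent may see."""
    full_state = json.loads(state_json)
    return {k: v for k, v in full_state.items() if k in AGENT_VIEWS[agent]}

state = json.dumps({
    "code_analysis": {"module": "src/handlers/search.rs"},
    "test_generation": {"generated_tests": []},
    "flaky_detection": {"flagged_tests": ["test_search_timeout"]},
})
view = load_bounded_state(state, "test_generator")
# view contains code_analysis and test_generation, but not flaky_detection
```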
2. Human Approval Gates
Tests were not merged automatically. The PR Reviewer agent created a pull request with a structured summary, and a human made the final merge decision. This kept humans in the loop for quality control.
The PR summary format:
## Agent-Generated Test PR
**Module:** src/handlers/search.rs
**Tests Added:** 12
**Tests Modified:** 3
**Coverage Change:** 34% → 41% (+7 pp)
### New Tests
- test_execute_search_valid_query (happy path)
- test_execute_search_invalid_query (error: InvalidQuery)
- test_execute_search_permission_denied (error: PermissionDenied)
- ... (9 more)
### Flaky Tests Fixed
- test_search_timeout: added explicit timeout mock (was relying on real network)
- test_concurrent_search: added mutex for shared test state
### Reviewer Notes
- All tests pass 5/5 runs
- Naming convention matches existing tests
- Fixtures reuse existing test helpers
3. Incremental Execution
Agents did not regenerate the entire test suite each run. They analyzed what changed (new commits), identified what needed new tests, and generated incrementally.
class IncrementalAnalyzer:
    def identify_changes(self, since_commit: str) -> list[Change]:
        """Find what changed since the last agent run."""
        # git_diff is a helper wrapping git diff <since>..HEAD; each Change
        # carries a file path plus the hunks that touched it
        diff = git_diff(since_commit, "HEAD")
        changed_functions = []
        for file_change in diff:
            # Only Rust source files outside test paths need test updates
            if file_change.path.endswith(".rs") and "test" not in file_change.path:
                changed_functions.extend(
                    self.extract_changed_functions(file_change)
                )
        return changed_functions
This kept token costs manageable. Instead of analyzing the entire codebase (millions of tokens), each run analyzed only the delta (thousands of tokens).
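The filtering step inside identify_changes can be exercised in isolation. A minimal sketch that takes the kind of path list produced by git diff --name-only (the paths here are illustrative):

```python
def files_needing_tests(diff_output: str) -> list[str]:
    """Keep only Rust source files outside test paths from a path list."""
    changed = []
    for path in diff_output.splitlines():
        if path.endswith(".rs") and "test" not in path:
            changed.append(path)
    return changed

diff = "src/handlers/search.rs\ntests/search_test.rs\nREADME.md\n"
print(files_needing_tests(diff))  # ['src/handlers/search.rs']
```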
Lessons Learned
1. Flaky Test Reduction Was the Biggest Win
The Flaky Detector agent caught tests that humans had been ignoring for months. Running each test 5x and flagging inconsistencies eliminated accumulated tech debt.
How the Flaky Detector works:
class FlakyDetector:
    def detect(self, test_name: str, runs: int = 5) -> FlakyReport:
        results = []
        for i in range(runs):
            result = run_single_test(test_name)
            results.append(result)
        pass_count = sum(1 for r in results if r.passed)
        fail_count = runs - pass_count
        if 0 < fail_count < runs:  # Mix of pass and fail = flaky
            return FlakyReport(
                test_name=test_name,
                status="flaky",
                pass_rate=pass_count / runs,
                failure_reasons=self.analyze_failures(results)
            )
        return FlakyReport(test_name=test_name, status="stable")
Common flaky causes found:
- Tests depending on real network timeouts
- Tests sharing mutable state through global variables
- Tests depending on HashMap iteration order (Rust HashMaps are not ordered)
- Tests depending on file system timestamps
2. Agent Specialization Matters
Early attempts with a single "do everything" agent produced mediocre results. The single agent tried to analyze code, generate tests, run them, and fix failures all in one loop. It lost context, made inconsistent decisions, and produced lower-quality output than the specialized team.
- Single agent (early approach): average test quality score 62/100
- Eight specialized agents (final approach): average test quality score 84/100
3. The Test Fixer Agent Was the Most Complex
It needed to understand:
- Rust compiler errors (type mismatches, borrow checker violations)
- Test framework output (cargo test stdout/stderr format)
- The difference between a "test bug" and an "application bug"
The critical prompt for the Test Fixer:
You are fixing a failing Rust test. Determine whether this is:
A) A TEST BUG: the test code is wrong (wrong assertion, missing import,
type mismatch, incorrect fixture setup). FIX the test.
B) AN APPLICATION BUG: the production code is wrong and the test correctly
detected it. DO NOT fix the test. REPORT the application bug.
Error output:
{cargo_test_stderr}
Test code:
{test_code}
Production code:
{source_code}
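Once the model answers, the fixer has to route on the A/B verdict. A sketch of that routing step; route_verdict and the escalation fallback are illustrative assumptions, not OpenObserve's implementation:

```python
def route_verdict(verdict: str) -> str:
    """Map the model's A/B verdict to the Test Fixer's next action."""
    v = verdict.strip().upper()
    if v.startswith("A"):
        return "fix_test"                # test bug: repair the test code
    if v.startswith("B"):
        return "report_application_bug"  # real bug: file a report, keep test
    return "escalate_to_human"           # unparseable verdict: fail safe

print(route_verdict("A) TEST BUG: wrong assertion on error variant"))
# fix_test
```

The fail-safe branch matters: an agent that silently "fixes" a test when the model's answer was ambiguous can mask a real application bug.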
4. Cost Control Required Explicit Budget Management
Without limits, agents would iterate endlessly on edge cases. They implemented a per-module token budget:
MODULE_BUDGETS = {
"src/handlers/": 100_000, # High-complexity, more budget
"src/models/": 50_000, # Medium complexity
"src/utils/": 25_000, # Low complexity, simple functions
}
When a module's budget was exhausted, agents stopped working on it and moved to the next module. This forced prioritization: high-complexity modules got more attention.
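Enforcing such a budget comes down to a per-module counter charged after every model call. A minimal sketch, assuming a charge() call is made after each request; the BudgetTracker class is illustrative, not OpenObserve's code:

```python
MODULE_BUDGETS = {
    "src/handlers/": 100_000,  # High-complexity, more budget
    "src/models/": 50_000,     # Medium complexity
    "src/utils/": 25_000,      # Low complexity, simple functions
}

class BudgetTracker:
    def __init__(self, budgets: dict[str, int]):
        self.remaining = dict(budgets)

    def charge(self, module: str, tokens: int) -> bool:
        """Deduct tokens; return False once the module's budget is spent."""
        if self.remaining.get(module, 0) <= 0:
            return False
        self.remaining[module] -= tokens
        return self.remaining[module] > 0

tracker = BudgetTracker(MODULE_BUDGETS)
print(tracker.charge("src/utils/", 20_000))  # True: 5_000 tokens remain
print(tracker.charge("src/utils/", 10_000))  # False: budget exhausted
```

When charge() returns False, the orchestrator moves the agents on to the next module, which is what forces the prioritization described above.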
Applying OpenObserve's Lessons to Your Projects
| OpenObserve Practice | How to Apply It | Minimum Team Size |
|---|---|---|
| Specialized agents | Start with 3: Analyzer, Generator, Runner | 1 engineer |
| Shared state file | Use a JSON file or SQLite database | 1 engineer |
| Human approval gates | Require PR review for agent-generated tests | 1 engineer |
| Incremental execution | Only analyze changed files since last run | 1 engineer |
| Flaky detection | Run tests 3-5x before merging | 1 engineer |
| Token budgets | Set per-module limits in config | 1 engineer |
Key Takeaway
The OpenObserve case study proves that multi-agent testing works in production at scale. The key factors were agent specialization (eight focused agents outperformed one general agent), human-in-the-loop gates (agents propose, humans approve), incremental execution (analyze deltas, not the whole codebase), and explicit cost controls (per-module token budgets). The 85% reduction in flaky tests was the single most impactful outcome.