
Common AI Test Failures

Why AI-Generated Tests Fail

AI-generated tests fail in predictable ways. Understanding these failure modes lets you spot them instantly during review, rather than discovering them later in CI or -- worse -- in production when the test passes despite a real bug.

These failure modes are not random. They emerge from how LLMs work: they predict statistically likely text, not logically correct code. This means they produce patterns that look like tests but do not function as tests.


Failure Mode 1: The Tautology Test

A tautology test tests the mock, not the code. It asserts that a value you explicitly set up is returned -- which will always be true regardless of whether the actual code works.

# THE TAUTOLOGY -- this tests nothing
def test_get_user(mock_db):
    mock_db.get_user.return_value = {"name": "Alice"}
    result = get_user(1)
    assert result["name"] == "Alice"  # This tests the mock, not the code

Why AI generates this: The LLM sees a pattern of "set up data, call function, check data" and fills it in without considering whether the assertion is meaningful.

How to detect it: For every assertion, trace the value backward. If the expected value comes directly from the test's own setup (without passing through production code), it is a tautology.

How to fix it:

# FIXED: test the actual behavior
def test_get_user_returns_formatted_name(mock_db):
    mock_db.get_user.return_value = {"first_name": "Alice", "last_name": "Smith"}
    result = get_user(1)
    # Now we test that get_user() formats the name correctly
    assert result["display_name"] == "Alice Smith"
    assert result["initials"] == "AS"
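For context, here is a minimal sketch of what get_user() might look like for those assertions to bite. The explicit db parameter and field names are assumptions for illustration; the point is that display_name and initials are computed by production code, not echoed from the mock.

```python
# Hypothetical get_user implementation (db parameter and field names assumed).
# The fixed test is meaningful because these values are derived here,
# not copied straight from the mock's return value.
def get_user(user_id, db):
    record = db.get_user(user_id)
    first = record["first_name"]
    last = record["last_name"]
    return {
        "display_name": f"{first} {last}",
        "initials": f"{first[0]}{last[0]}".upper(),
    }
```

If get_user() stopped formatting names correctly, the fixed test would now fail.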

Tautology Variants

The echo tautology:

def test_create_order(client):
    payload = {"product_id": "abc", "quantity": 2}
    response = client.post("/orders", json=payload)
    body = response.json()
    assert body["product_id"] == "abc"    # Just echoing the input
    assert body["quantity"] == 2          # Just echoing the input
    # Missing: assert body["status"] == "pending"   (actual business logic)
    # Missing: assert body["total"] == 19.98         (calculated value)
    # Missing: assert "id" in body                   (generated value)

The pass-through tautology:

def test_transform_data(mock_api):
    mock_api.fetch.return_value = [1, 2, 3]
    result = transform_data()
    assert result == [1, 2, 3]   # If transform_data just returns the raw data,
                                  # this test catches nothing about the transform
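A self-contained sketch of a fix, assuming for illustration that transform_data() is supposed to square each fetched value (the squaring behavior and the api parameter are hypothetical):

```python
# Hypothetical transform: square each fetched value.
def transform_data(api):
    return [x * x for x in api.fetch()]

def test_transform_data_squares_values():
    class FakeAPI:
        def fetch(self):
            return [1, 2, 3]
    result = transform_data(FakeAPI())
    # A pure pass-through would return [1, 2, 3] and fail here.
    assert result == [1, 4, 9]
```

The expected value no longer equals the mocked input, so the test can only pass if the transform actually ran.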

Failure Mode 2: The Happy Path Only

AI generates 10 tests, all for successful cases. No error handling, no edge cases, no auth failures.

# AI generates these:
def test_create_order_valid(): ...
def test_create_order_with_coupon(): ...
def test_create_order_express_shipping(): ...
def test_create_order_multiple_items(): ...
def test_create_order_with_notes(): ...

# Missing: what happens with invalid product ID?
# Missing: what happens with out-of-stock product?
# Missing: what happens with payment failure?
# Missing: what happens with expired auth token?
# Missing: what happens with zero quantity?

Why AI generates this: LLMs are trained on code where the majority of test examples are happy-path tests. The statistical weight of positive examples outweighs negative ones.

How to detect it: Count the ratio of success assertions to error assertions. If more than 60% of tests assert status_code == 200/201, you have happy-path bias.
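That ratio can be approximated mechanically. A rough heuristic script (it counts status-code assertions with a regex rather than parsing the code, so treat the number as an estimate):

```python
# Heuristic happy-path ratio for a pytest suite: counts assertions on
# 2xx status codes vs. 4xx/5xx status codes across tests/*.py files.
import re
from pathlib import Path

def happy_path_ratio(test_dir="tests"):
    success = errors = 0
    for path in Path(test_dir).rglob("test_*.py"):
        text = path.read_text()
        success += len(re.findall(r"status_code\s*==\s*2\d\d", text))
        errors += len(re.findall(r"status_code\s*==\s*[45]\d\d", text))
    total = success + errors
    return success / total if total else 0.0
```

A return value above 0.6 suggests happy-path bias in the suite.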

How to fix it: Add an explicit prompt instruction:

For every happy-path test you generate, also generate:
- 1 test for missing required fields
- 1 test for invalid field types
- 1 test for authentication failure
- 1 test for authorization failure (wrong role)
- 1 test for resource-not-found

Failure Mode 3: The Overly Specific Assertion

The test asserts exact error message text, timestamps, or auto-generated IDs that change between runs.

# BRITTLE: depends on exact error message text
assert error.message == "Invalid email: 'notanemail' does not match RFC 5322 format"

# BRITTLE: depends on exact timestamp
assert order.created_at == "2026-02-09T10:30:00Z"

# BRITTLE: depends on auto-generated ID format
assert user.id == "usr_a1b2c3d4e5f6"

Why AI generates this: The LLM produces concrete values because they look like realistic test assertions. It does not reason about which values are deterministic and which are generated at runtime.

How to fix it:

# RESILIENT: checks type and content, not exact string
assert isinstance(error, ValidationError)
assert "email" in str(error).lower()

# RESILIENT: checks existence and recency
assert order.created_at is not None
assert (datetime.now(UTC) - order.created_at).total_seconds() < 5

# RESILIENT: checks format, not exact value
assert user.id is not None
assert user.id.startswith("usr_")
assert len(user.id) == 16

Failure Mode 4: The Hallucinated API

AI invents methods, endpoints, or parameters that do not exist in your codebase.

# AI invents methods that don't exist
result = client.orders.find_by_status("pending")  # find_by_status doesn't exist
user = UserService.get_or_create(email="test@test.com")  # get_or_create doesn't exist
response = client.patch("/api/v2/orders/1/cancel")  # this endpoint doesn't exist

Why AI generates this: The LLM has seen similar APIs in its training data (Django's get_or_create, Rails' find_by_*) and applies those patterns to your codebase.

How to detect it: Always grep your codebase for any function, method, or endpoint the AI references.

# Quick hallucination check
grep -rn "find_by_status" app/ --include="*.py"
grep -rn "get_or_create" app/ --include="*.py"
grep -rn "/cancel" app/routes/ --include="*.py"

If grep returns nothing, the API is hallucinated. Delete the test or replace the call with the actual API.

How to prevent it: Include the actual function signatures or endpoint list in your prompt context.
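One way to gather those signatures is with the standard library's inspect module. A minimal sketch: it lists every public function in a module so you can paste the output into your prompt.

```python
# Extract real function signatures to include as prompt context.
import inspect

def signatures_for_prompt(module):
    """Return one 'name(signature)' line per public function in a module."""
    lines = []
    for name, obj in inspect.getmembers(module, inspect.isfunction):
        if not name.startswith("_"):
            lines.append(f"{name}{inspect.signature(obj)}")
    return "\n".join(lines)
```

With the real signatures in context, the model has far less room to invent a find_by_status() that does not exist.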


Failure Mode 5: The Non-Deterministic Test

The test depends on time, random data, or external state that changes between runs.

# FLAKY: depends on current time
def test_expiry_check():
    token = create_token(expires_in=3600)
    assert not token.is_expired()    # Passes now, fails if test takes > 1 hour

# FLAKY: depends on random data
def test_random_assignment():
    user = create_user()
    assert user.group in ["A", "B"]  # Passes most times, fails if random assigns "C"

# FLAKY: depends on database state from another test
def test_user_count():
    assert User.count() == 5  # Depends on exact state left by previous tests

How to fix it:

# FIXED: mock time explicitly (freezer fixture from pytest-freezegun)
def test_expiry_check(freezer):
    freezer.move_to("2026-02-09T10:00:00Z")
    token = create_token(expires_in=3600)
    freezer.move_to("2026-02-09T10:30:00Z")  # 30 min later
    assert not token.is_expired()
    freezer.move_to("2026-02-09T11:01:00Z")  # 61 min later
    assert token.is_expired()

# FIXED: seed random
def test_random_assignment():
    random.seed(42)
    user = create_user()
    assert user.group == "A"  # Deterministic with seed

# FIXED: use relative assertions
def test_user_creation_increments_count():
    before = User.count()
    create_user(name="New User")
    assert User.count() == before + 1

Failure Mode 6: The Assertion-Free Test

The test runs code but never asserts anything. It only proves the code does not throw an exception.

# USELESS: no assertion
def test_process_order():
    order = create_order(product_id="abc", quantity=2)
    process_order(order)
    # Test passes as long as no exception is raised
    # But what if the order was not actually processed?

How to detect it: Search for test functions with no assert, raises, or expect statements.

# Find assertion-free tests
grep -rn "def test_" tests/ | while read line; do
    func=$(echo "$line" | sed 's/.*def \(test_[A-Za-z0-9_]*\).*/\1/')
    file=$(echo "$line" | cut -d: -f1)
    if ! grep -A 20 "def ${func}(" "$file" | grep -q "assert\|raises\|expect"; then
        echo "WARNING: $func in $file has no assertion"
    fi
done

How to fix it: Every test must assert something specific about the outcome.


Failure Mode Summary Table

Failure Mode      | Detection Speed             | Frequency           | Severity
Tautology         | Medium (requires tracing)   | 15-20% of AI tests  | High
Happy path only   | Fast (count error tests)    | 30-40% of AI suites | High
Overly specific   | Fast (look for literals)    | 10-15% of AI tests  | Medium
Hallucinated API  | Medium (requires grep)      | 5-10% of AI tests   | Critical
Non-deterministic | Slow (appears as flakiness) | 5-10% of AI tests   | High
Assertion-free    | Fast (automated check)      | 5-8% of AI tests    | Critical

The One-Line Review Question

For every AI-generated test, ask this single question:

"If the feature under test was completely broken, would this test fail?"

If the answer is "no" or "I'm not sure," the test needs to be revised or deleted. This single question catches tautologies, assertion-free tests, and happy-path-only coverage in one pass.


Key Takeaway

AI test failures are predictable and classifiable. Learn the six failure modes, and you can review 30 AI-generated tests in 15 minutes with high confidence. The most common failures -- tautologies and happy-path bias -- account for over half of all issues and are the easiest to detect.