Terraform Validation

The Validation Pyramid

Infrastructure validation follows a pyramid similar to the traditional test pyramid, but with its own layers. Each layer catches different classes of issues at different costs:

        /\
       /  \        Integration tests (Terratest, real cloud resources)
      /    \
     /------\      Plan-time analysis (terraform plan, drift detection)
    /        \
   /----------\    Static analysis (tflint, tfsec, checkov)
  /            \
 /--------------\  Syntax and format (terraform fmt, terraform validate)
/________________\

The bottom of the pyramid is free and fast. The top is expensive and slow but catches issues nothing else can. A mature IaC testing strategy uses all four layers, running the cheap checks on every commit and the expensive checks on critical module changes.
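
One way to express "cheap checks on every commit, expensive checks on critical module changes" is a small selector that maps changed file paths to pyramid layers. The sketch below is illustrative only: the layer names, the modules/ prefix, and the select_layers helper are assumptions made for this example, not part of Terraform or any CI tool.

```python
from pathlib import PurePosixPath

# Hypothetical layer names, cheapest first (mirrors the pyramid above).
CHEAP_LAYERS = ["format", "static", "plan"]
EXPENSIVE_LAYERS = ["integration"]

# Assumption: shared, critical modules live under this top-level directory.
CRITICAL_PREFIX = "modules"


def select_layers(changed_paths):
    """Return the validation layers to run for a set of changed files."""
    tf_paths = [p for p in changed_paths if p.endswith((".tf", ".tfvars"))]
    if not tf_paths:
        return []  # nothing Terraform-related changed

    layers = list(CHEAP_LAYERS)
    touches_critical = any(
        PurePosixPath(p).parts[:1] == (CRITICAL_PREFIX,) for p in tf_paths
    )
    if touches_critical:
        layers += EXPENSIVE_LAYERS
    return layers
```

A CI job could pass the output of git diff --name-only to select_layers and skip the Terratest stage whenever "integration" is not in the result.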

Static Validation: The First Gate

Every Terraform pipeline should start with zero-cost static checks. These run in seconds, require no cloud credentials, and catch a surprising number of issues.

Format and Syntax Checks

# Format check -- enforces consistent style
# -check returns a non-zero exit code if files need formatting
# -recursive scans all subdirectories
# -diff shows what would change
terraform fmt -check -recursive -diff

# Syntax and type validation -- catches typos, missing required fields
# -backend=false skips backend initialization (no credentials needed)
terraform init -backend=false
terraform validate

The distinction between fmt and validate matters. fmt enforces style -- consistent indentation, alignment, and spacing. validate checks structural correctness -- does this HCL parse? Are all required arguments present? Do type constraints match?

What terraform validate Catches

# Example: terraform validate catches a missing required argument
resource "aws_iam_role" "app" {
  name = "app-role"
  # Oops -- forgot assume_role_policy, which is required
  # terraform validate will catch this
}

# It also catches type mismatches:
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  count         = "three"  # ERROR: "three" cannot be converted to a number
}

# And references to undeclared resources:
resource "aws_security_group_rule" "allow_http" {
  security_group_id = aws_security_group.nonexistent.id  # ERROR
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}

Integrating Static Checks into Pre-Commit Hooks

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
        args:
          - --hook-config=--path-to-file=README.md
          - --hook-config=--add-to-existing-file=true
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl

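Pre-commit hooks catch malformed Terraform before it is committed, which makes them your cheapest quality gate. They only run where they are installed and can be bypassed with git commit --no-verify, so run the same checks in CI as well.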

Plan-Time Analysis: What Will Actually Change?

terraform plan is your integration contract -- it shows the delta between the configuration you declared and the infrastructure that actually exists. This is where you move from "is the code valid?" to "what will the code do?"

Generating and Analyzing Plans

# Generate a plan file for downstream analysis
terraform plan -out=tfplan -detailed-exitcode

# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes present (this is the important one)

# Convert plan to JSON for programmatic analysis
terraform show -json tfplan > tfplan.json
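
Because exit code 2 is a normal outcome with -detailed-exitcode, a wrapper has to avoid treating every non-zero status as a failure (a shell script running under set -e would abort on it). A minimal Python sketch of that handling follows; the function names and outcome labels are hypothetical, and run_plan assumes a terraform binary on PATH.

```python
import subprocess

# Meanings of `terraform plan -detailed-exitcode` return codes.
PLAN_OUTCOMES = {0: "no_changes", 1: "error", 2: "changes_present"}


def interpret_plan_exit_code(code):
    """Map a plan exit code to a label; unknown codes are treated as errors."""
    return PLAN_OUTCOMES.get(code, "error")


def run_plan(workdir=".", plan_file="tfplan"):
    """Run terraform plan and return the interpreted outcome.

    check=False matters here: exit code 2 means "changes present",
    which is a valid result rather than a failure.
    """
    result = subprocess.run(
        ["terraform", "plan", f"-out={plan_file}", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        check=False,
    )
    return interpret_plan_exit_code(result.returncode)
```

A pipeline could then run the plan assertions only when run_plan() returns "changes_present" and fail the job only on "error".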

Programmatic Plan Assertions

The JSON plan output is a goldmine for automated testing. You can write assertions against it just like any other test:

# scripts/validate_plan.py
import json
import sys

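def load_plan(path="tfplan.json"):
    with open(path) as f:
        plan = json.load(f)
    # resource_changes can be absent when the plan contains no resources
    plan.setdefault("resource_changes", [])
    return plan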

def check_no_destroys(plan):
    """No resources should be destroyed without explicit approval."""
    destroys = [
        change["address"]
        for change in plan["resource_changes"]
        if "delete" in change["change"]["actions"]
    ]
    if destroys:
        print(f"BLOCKED: Plan would destroy {len(destroys)} resources:")
        for r in destroys:
            print(f"  - {r}")
        return False
    return True

def check_no_replacements(plan):
    """Flag resources that will be replaced (destroy + create)."""
    replacements = [
        change["address"]
        for change in plan["resource_changes"]
        if change["change"]["actions"] == ["delete", "create"]
        or change["change"]["actions"] == ["create", "delete"]
    ]
    if replacements:
        print(f"WARNING: Plan will replace {len(replacements)} resources:")
        for r in replacements:
            print(f"  - {r}")
        return False
    return True

def check_no_public_s3(plan):
    """S3 buckets must never have public ACLs."""
    for change in plan["resource_changes"]:
        if change["type"] == "aws_s3_bucket":
            # "after" is null for resources being deleted, so fall back to {}
            after = change["change"].get("after") or {}
            # An unset or not-yet-known ACL is treated as the private default
            acl = after.get("acl") or "private"
            if acl != "private":
                print(f"BLOCKED: {change['address']} has ACL '{acl}' (must be 'private')")
                return False
    return True

if __name__ == "__main__":
    plan = load_plan()
    checks = [
        check_no_destroys(plan),
        check_no_replacements(plan),
        check_no_public_s3(plan),
    ]
    if not all(checks):
        sys.exit(1)
    print("All plan checks passed.")
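
To make the shape of the data these checks consume more concrete, here is a hand-written fragment that mimics the resource_changes section of the JSON plan, together with a small summary helper. The addresses and values are made up for illustration, and summarize_actions is a hypothetical helper, not part of the script above. Note that "after" is null for a resource that is only being deleted.

```python
from collections import Counter

# Hand-written fragment with the same shape as `terraform show -json` output.
SAMPLE_PLAN = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket.data",
            "type": "aws_s3_bucket",
            "change": {"actions": ["create"], "after": {"bucket": "example-data"}},
        },
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {"actions": ["delete", "create"], "after": {"instance_type": "t3.micro"}},
        },
        {
            "address": "aws_db_instance.legacy",
            "type": "aws_db_instance",
            "change": {"actions": ["delete"], "after": None},
        },
    ]
}


def summarize_actions(plan):
    """Count planned changes by action, treating delete+create as 'replace'."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if set(actions) == {"delete", "create"}:
            counts["replace"] += 1
        else:
            counts.update(actions)
    return dict(counts)
```

Printing such a summary at the top of a CI log gives reviewers a quick sense of the plan's size before they read the individual check results.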

Common Plan Assertions to Implement

Assertion	Why It Matters	Risk Level
No resource destroys	Prevents accidental data loss	Critical
No resource replacements on databases	RDS replacement = downtime + data risk	Critical
No public security group ingress	Prevents open network exposure	Critical
All S3 buckets encrypted	Compliance requirement	High
No oversized instances in non-prod	Cost control	Medium
All resources have required tags	Governance and cost allocation	Medium
No changes to IAM policies without review	Security posture	High
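
Two of the assertions in the table, required tags and public ingress, can be sketched against the same JSON plan structure. This is a simplified illustration under stated assumptions: the required tag keys are placeholders, only the tags attribute is inspected (not tags_all or provider default tags), and only standalone aws_security_group_rule resources are checked, not inline ingress blocks. The helper names are hypothetical.

```python
def _created_or_updated(plan):
    """Yield resource changes that will exist after apply."""
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "create" in actions or "update" in actions:
            yield rc


def find_missing_tags(plan, required=("owner", "environment")):
    """Return {address: [missing tag keys]} for taggable resources."""
    missing = {}
    for rc in _created_or_updated(plan):
        after = rc["change"].get("after") or {}
        if "tags" not in after:
            continue  # assumption: no "tags" attribute means not taggable
        tags = after.get("tags") or {}
        absent = [key for key in required if key not in tags]
        if absent:
            missing[rc["address"]] = absent
    return missing


def find_open_ingress(plan):
    """Return addresses of ingress rules open to 0.0.0.0/0."""
    offenders = []
    for rc in _created_or_updated(plan):
        if rc["type"] != "aws_security_group_rule":
            continue
        after = rc["change"].get("after") or {}
        cidrs = after.get("cidr_blocks") or []
        if after.get("type") == "ingress" and "0.0.0.0/0" in cidrs:
            offenders.append(rc["address"])
    return offenders
```

Returning the offending addresses rather than a bare boolean lets the caller decide whether a finding blocks the pipeline or only produces a warning.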

Terratest: Integration Testing with Real Infrastructure

Terratest (written in Go) deploys real infrastructure, runs assertions, and tears it down. This is the most thorough validation because it exercises the actual cloud APIs, but it is also the most expensive.

When to Use Terratest

Use Terratest for reusable modules that many teams depend on. Do not use it for one-off configurations -- the overhead is not justified.

package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/random"
    "github.com/stretchr/testify/assert"
)

func TestS3BucketIsEncrypted(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/s3-data-bucket",
        Vars: map[string]interface{}{
            "bucket_name": "test-" + random.UniqueId(),
            "environment": "test",
        },
    }

    // Deploy real infrastructure
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get the bucket name from Terraform output
    bucketName := terraform.Output(t, terraformOptions, "bucket_name")
    region := terraform.Output(t, terraformOptions, "region")

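    // Verify encryption is enabled on the actual AWS resource.
    // Note: confirm that GetS3BucketEncryption and GetS3BucketPublicAccessBlock
    // exist in your Terratest version; otherwise wrap the AWS SDK calls yourself.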
    encryption := aws.GetS3BucketEncryption(t, region, bucketName)
    assert.Equal(t, "aws:kms", encryption)

    // Verify versioning
    versioning := aws.GetS3BucketVersioning(t, region, bucketName)
    assert.Equal(t, "Enabled", versioning)

    // Verify public access is blocked
    publicAccess := aws.GetS3BucketPublicAccessBlock(t, region, bucketName)
    assert.True(t, publicAccess.BlockPublicAcls)
    assert.True(t, publicAccess.BlockPublicPolicy)
}

Terratest Best Practices

Always use t.Parallel() -- Terratest tests are slow. Running them in parallel reduces total execution time dramatically.
Always use defer terraform.Destroy() -- Place the destroy call immediately after creating options, before InitAndApply. This ensures cleanup happens even if the test fails.
Use unique names -- random.UniqueId() prevents naming collisions when tests run in parallel or if a previous cleanup failed.
Set timeouts -- Cloud resource creation can be slow. Set explicit timeouts rather than relying on Go's default test timeout of 10 minutes:

go test -v -timeout 30m ./test/

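Use test stages for faster iteration -- Terratest's test_structure package wraps each phase of a test in a named stage, and setting an environment variable of the form SKIP_<stage> (for example SKIP_deploy or SKIP_teardown) skips that stage during development:

// Requires: test_structure "github.com/gruntwork-io/terratest/modules/test-structure"
func TestVPC(t *testing.T) {
    workingDir := "../modules/vpc"

    // Destroy at the end unless SKIP_teardown is set
    defer test_structure.RunTestStage(t, "teardown", func() {
        terraformOptions := test_structure.LoadTerraformOptions(t, workingDir)
        terraform.Destroy(t, terraformOptions)
    })

    // Skip deploy if SKIP_deploy is set
    test_structure.RunTestStage(t, "deploy", func() {
        terraformOptions := &terraform.Options{TerraformDir: workingDir}
        test_structure.SaveTerraformOptions(t, workingDir, terraformOptions)
        terraform.InitAndApply(t, terraformOptions)
    })

    // Validation runs even when reusing existing infrastructure
    test_structure.RunTestStage(t, "validate", func() {
        terraformOptions := test_structure.LoadTerraformOptions(t, workingDir)
        vpcId := terraform.Output(t, terraformOptions, "vpc_id")
        subnets := aws.GetSubnetsForVpc(t, vpcId, "us-east-1")
        assert.Equal(t, 6, len(subnets)) // 3 public + 3 private
    })
}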

The Complete Static Validation Pipeline

#!/bin/bash
# scripts/validate-terraform.sh
set -euo pipefail

echo "=== Stage 1: Format ==="
terraform fmt -check -recursive -diff

echo "=== Stage 2: Init + Validate ==="
terraform init -backend=false
terraform validate

echo "=== Stage 3: TFLint ==="
tflint --recursive

echo "=== Stage 4: Security Scan ==="
tfsec . --minimum-severity HIGH

echo "=== Stage 5: Compliance Scan ==="
checkov -d . --framework terraform --compact

echo "=== All static checks passed ==="

On a small repository with providers already cached, this script typically finishes in under 30 seconds, and it catches many issues before any cloud resources are involved.