Terraform Validation
The Validation Pyramid
Infrastructure validation follows a pyramid similar to the traditional test pyramid, but with its own layers. Each layer catches different classes of issues at different costs:
        /\
       /  \           Integration tests (Terratest, real cloud resources)
      /    \
     /------\         Plan-time analysis (terraform plan, drift detection)
    /        \
   /----------\       Static analysis (validate, tfsec, checkov)
  /            \
 /--------------\     Syntax and format (terraform fmt, terraform validate)
/________________\
The bottom of the pyramid is free and fast. The top is expensive and slow but catches issues nothing else can. A mature IaC testing strategy uses all four layers, running the cheap checks on every commit and the expensive checks on critical module changes.
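The "cheap checks on every commit, expensive checks on critical module changes" rule can be expressed as a small path-based selector. This is only a sketch: the `modules/` directory layout, the layer names, the choice to run plan analysis on any Terraform change, and the `layers_for_change` function are all assumptions made for illustration, not part of any Terraform tooling.

```python
from pathlib import PurePosixPath

# Layers ordered from cheapest to most expensive, mirroring the pyramid.
CHEAP_LAYERS = ["format", "static-analysis"]
PLAN_LAYER = "plan-analysis"
INTEGRATION_LAYER = "integration-tests"


def layers_for_change(changed_files, module_root="modules"):
    """Return the validation layers to run for a set of changed files.

    Assumption: reusable modules live under `module_root`, and any
    change there counts as a "critical module change".
    """
    tf_files = [f for f in changed_files if f.endswith((".tf", ".tfvars"))]
    if not tf_files:
        return []  # nothing Terraform-related changed

    layers = CHEAP_LAYERS + [PLAN_LAYER]
    touches_module = any(
        PurePosixPath(f).parts[:1] == (module_root,) for f in tf_files
    )
    if touches_module:
        layers.append(INTEGRATION_LAYER)
    return layers


print(layers_for_change(["envs/dev/main.tf"]))
print(layers_for_change(["modules/vpc/main.tf", "README.md"]))
```

A CI job could feed this function the output of `git diff --name-only` and skip the integration stage whenever it is not in the returned list.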
Static Validation: The First Gate
Every Terraform pipeline should start with zero-cost static checks. These run in seconds, require no cloud credentials, and catch a surprising number of issues.
Format and Syntax Checks
# Format check -- enforces consistent style
# -check returns a non-zero exit code if files need formatting
# -recursive scans all subdirectories
# -diff shows what would change
terraform fmt -check -recursive -diff
# Syntax and type validation -- catches typos, missing required fields
# -backend=false skips backend initialization (no credentials needed)
terraform init -backend=false
terraform validate
The distinction between fmt and validate matters. fmt enforces style -- consistent indentation, alignment, and spacing. validate checks structural correctness -- does this HCL parse? Are all required arguments present? Do type constraints match?
What terraform validate Catches
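# Example: terraform validate catches this missing required argument
# (aws_s3_bucket would not work here: its bucket argument is optional,
# and Terraform generates a name when it is omitted)
resource "aws_subnet" "data" {
  # Oops -- forgot vpc_id, which is a required argument
  # terraform validate will catch this
  cidr_block = "10.0.1.0/24"
}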
# It also catches type mismatches:
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
  count         = "three" # ERROR: count must be a number, not a string
}

# And references to undeclared resources:
resource "aws_security_group_rule" "allow_http" {
  security_group_id = aws_security_group.nonexistent.id # ERROR
  type              = "ingress"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
}
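In CI it is often easier to consume these errors as data than to read console output. `terraform validate -json` emits a machine-readable result with a `valid` flag and a list of `diagnostics`. The sketch below shows one way to flatten that into one line per problem. The `summarize_validate` helper is hypothetical, and the sample input is hand-written to resemble the documented shape rather than captured from a real run.

```python
import json


def summarize_validate(json_text):
    """Turn `terraform validate -json` output into (ok, messages)."""
    result = json.loads(json_text)
    messages = []
    for diag in result.get("diagnostics", []):
        # `range` can be missing for diagnostics not tied to a source location
        rng = diag.get("range") or {}
        filename = rng.get("filename", "<unknown>")
        line = (rng.get("start") or {}).get("line", "?")
        severity = diag.get("severity", "error")
        summary = diag.get("summary", "")
        messages.append(f"{severity}: {filename}:{line}: {summary}")
    return bool(result.get("valid", False)), messages


# Hand-written sample shaped like validate's JSON output (illustrative only).
sample = json.dumps({
    "valid": False,
    "error_count": 1,
    "warning_count": 0,
    "diagnostics": [{
        "severity": "error",
        "summary": "Reference to undeclared resource",
        "range": {"filename": "main.tf", "start": {"line": 14, "column": 23}},
    }],
})

ok, messages = summarize_validate(sample)
print(ok)
for message in messages:
    print(message)
```

In a pipeline the input would come from the command itself, for example by passing `sys.stdin.read()` to the function and piping `terraform validate -json` into the script.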
Integrating Static Checks into Pre-Commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.88.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
        args:
          - --hook-config=--path-to-file=README.md
          - --hook-config=--add-to-existing-file=true
      - id: terraform_tflint
        args:
          - --args=--config=__GIT_WORKING_DIR__/.tflint.hcl
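Pre-commit hooks stop most malformed Terraform before it is committed, which makes them your cheapest quality gate. They only run on machines where the hooks are installed, and they can be bypassed with git commit --no-verify, so run the same checks in CI as well.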
Plan-Time Analysis: What Will Actually Change?
terraform plan is your integration contract -- it shows the delta between your declared configuration and the infrastructure that actually exists, as recorded in state and refreshed from the provider. This is where you move from "is the code valid?" to "what will the code do?"
Generating and Analyzing Plans
# Generate a plan file for downstream analysis
terraform plan -out=tfplan -detailed-exitcode
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes present (this is the important one)
# Convert plan to JSON for programmatic analysis
terraform show -json tfplan > tfplan.json
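Because exit code 2 means "changes present" rather than failure, any wrapper that treats every non-zero code as an error (including a shell script running under `set -e`) will abort on a perfectly good plan. The following sketch shows one way to handle the three codes explicitly. The function names and the `-input=false` choice are illustrative assumptions.

```python
import subprocess

# Meaning of `terraform plan -detailed-exitcode` return codes.
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes"}


def classify_plan_exit(returncode):
    """Map a plan return code to an outcome; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(returncode, "error")


def run_plan(workdir=".", plan_file="tfplan"):
    """Run terraform plan and return the outcome string.

    check=True is deliberately not used: exit code 2 is a successful
    plan that contains changes, not a failure.
    """
    proc = subprocess.run(
        ["terraform", "plan", f"-out={plan_file}", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
    )
    return classify_plan_exit(proc.returncode)


for code in (0, 1, 2):
    print(code, classify_plan_exit(code))
```

A pipeline could skip the JSON conversion and plan assertions entirely when `run_plan()` returns `"no-changes"`, and fail fast on `"error"`.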
Programmatic Plan Assertions
The JSON plan output is a goldmine for automated testing. You can write assertions against it just like any other test:
# scripts/validate_plan.py
import json
import sys


def load_plan(path="tfplan.json"):
    with open(path) as f:
        return json.load(f)


def resource_changes(plan):
    """Return the plan's resource changes (the key can be missing from an empty plan)."""
    return plan.get("resource_changes", [])


def check_no_destroys(plan):
    """No resources should be destroyed without explicit approval."""
    # Replacements also contain a "delete" action, so they are reported here too
    destroys = [
        change["address"]
        for change in resource_changes(plan)
        if "delete" in change["change"]["actions"]
    ]
    if destroys:
        print(f"BLOCKED: Plan would destroy {len(destroys)} resources:")
        for r in destroys:
            print(f"  - {r}")
        return False
    return True


def check_no_replacements(plan):
    """Flag resources that will be replaced (destroy + create)."""
    replacements = [
        change["address"]
        for change in resource_changes(plan)
        # Action order depends on create_before_destroy, so compare order-insensitively
        if sorted(change["change"]["actions"]) == ["create", "delete"]
    ]
    if replacements:
        print(f"WARNING: Plan will replace {len(replacements)} resources:")
        for r in replacements:
            print(f"  - {r}")
        return False
    return True


def check_no_public_s3(plan):
    """S3 buckets must never have public ACLs."""
    for change in resource_changes(plan):
        # AWS provider v4+ usually sets ACLs through aws_s3_bucket_acl
        if change["type"] in ("aws_s3_bucket", "aws_s3_bucket_acl"):
            # "after" is null for deleted resources, and "acl" may be null when unset
            after = change["change"].get("after") or {}
            acl = after.get("acl") or "private"
            if acl != "private":
                print(f"BLOCKED: {change['address']} has ACL '{acl}' (must be 'private')")
                return False
    return True


if __name__ == "__main__":
    plan = load_plan()
    # Build the list eagerly so every check runs and prints its findings
    checks = [
        check_no_destroys(plan),
        check_no_replacements(plan),
        check_no_public_s3(plan),
    ]
    if not all(checks):
        sys.exit(1)
    print("All plan checks passed.")
Common Plan Assertions to Implement
| Assertion | Why It Matters | Risk Level |
|---|---|---|
| No resource destroys | Prevents accidental data loss | Critical |
| No resource replacements on databases | RDS replacement = downtime + data risk | Critical |
| No public security group ingress | Prevents open network exposure | Critical |
| All S3 buckets encrypted | Compliance requirement | High |
| No oversized instances in non-prod | Cost control | Medium |
| All resources have required tags | Governance and cost allocation | Medium |
| No changes to IAM policies without review | Security posture | High |
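Two rows of this table, required tags and public security group ingress, can be sketched as functions over the plan JSON. Treat this as a starting point under stated assumptions: the helper names are hypothetical, the attribute names follow the AWS provider's `aws_security_group` and `aws_security_group_rule` resources, and values that are only known after apply (reported under `after_unknown`) are not inspected.

```python
def planned_resources(plan):
    """Yield (address, type, after) for resources that will exist after apply."""
    for rc in plan.get("resource_changes", []):
        after = rc["change"].get("after") or {}  # `after` is null for deletes
        if after:
            yield rc["address"], rc["type"], after


def find_missing_tags(plan, required_tags):
    """Return {address: [missing tag keys]} for resources that expose `tags`."""
    problems = {}
    for address, _rtype, after in planned_resources(plan):
        if "tags" not in after:
            continue  # resource type has no `tags` attribute
        tags = after.get("tags") or {}
        missing = sorted(set(required_tags) - set(tags))
        if missing:
            problems[address] = missing
    return problems


def find_open_ingress(plan):
    """Return addresses of security groups or rules that allow 0.0.0.0/0 ingress."""
    offenders = []
    for address, rtype, after in planned_resources(plan):
        if rtype == "aws_security_group_rule" and after.get("type") == "ingress":
            if "0.0.0.0/0" in (after.get("cidr_blocks") or []):
                offenders.append(address)
        elif rtype == "aws_security_group":
            for rule in after.get("ingress") or []:
                if "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                    offenders.append(address)
                    break
    return offenders


# Hand-written plan fragment for illustration (not real terraform output).
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "type": "aws_instance",
         "change": {"actions": ["create"], "after": {"tags": {"Owner": "platform"}}}},
        {"address": "aws_security_group_rule.allow_http", "type": "aws_security_group_rule",
         "change": {"actions": ["create"],
                    "after": {"type": "ingress", "cidr_blocks": ["0.0.0.0/0"]}}},
        {"address": "aws_s3_bucket.old", "type": "aws_s3_bucket",
         "change": {"actions": ["delete"], "after": None}},
    ]
}

print(find_missing_tags(sample_plan, ["Owner", "Environment"]))
print(find_open_ingress(sample_plan))
```

Returning the offending addresses instead of printing and returning a boolean keeps the checks easy to unit test; the caller decides whether a finding blocks the pipeline or only warns.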
Terratest: Integration Testing with Real Infrastructure
Terratest (written in Go) deploys real infrastructure, runs assertions, and tears it down. This is the most thorough validation because it exercises the actual cloud APIs, but it is also the most expensive.
When to Use Terratest
Use Terratest for reusable modules that many teams depend on. Do not use it for one-off configurations -- the overhead is not justified.
package test

import (
	"strings"
	"testing"

	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestS3BucketIsEncrypted(t *testing.T) {
	t.Parallel()

	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/s3-data-bucket",
		Vars: map[string]interface{}{
			// S3 bucket names must be lowercase; UniqueId() can return uppercase characters
			"bucket_name": "test-" + strings.ToLower(random.UniqueId()),
			"environment": "test",
		},
	}

	// Register cleanup first so it runs even if apply or an assertion fails
	defer terraform.Destroy(t, terraformOptions)

	// Deploy real infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Get the bucket name and region from Terraform outputs
	bucketName := terraform.Output(t, terraformOptions, "bucket_name")
	region := terraform.Output(t, terraformOptions, "region")

	// NOTE: confirm that the aws.GetS3Bucket* helpers used below exist in the
	// Terratest version you pin; where one is missing, query the AWS SDK directly.

	// Verify encryption is enabled on the actual AWS resource
	encryption := aws.GetS3BucketEncryption(t, region, bucketName)
	assert.Equal(t, "aws:kms", encryption)

	// Verify versioning
	versioning := aws.GetS3BucketVersioning(t, region, bucketName)
	assert.Equal(t, "Enabled", versioning)

	// Verify public access is blocked
	publicAccess := aws.GetS3BucketPublicAccessBlock(t, region, bucketName)
	assert.True(t, publicAccess.BlockPublicAcls)
	assert.True(t, publicAccess.BlockPublicPolicy)
}
Terratest Best Practices
- Always use t.Parallel() -- Terratest tests are slow. Running them in parallel reduces total execution time dramatically.
- Always use defer terraform.Destroy() -- Place the destroy call immediately after creating options, before InitAndApply. This ensures cleanup happens even if the test fails.
- Use unique names -- random.UniqueId() prevents naming collisions when tests run in parallel or if a previous cleanup failed.
- Set timeouts -- Cloud resource creation can be slow. Set explicit timeouts rather than relying on Go's default test timeout of 10 minutes:
go test -v -timeout 30m ./test/
- Use test stages for faster iteration -- Terratest supports skipping the deploy/destroy stages during development:
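// Requires: test_structure "github.com/gruntwork-io/terratest/modules/test-structure"
func TestVPC(t *testing.T) {
	moduleDir := "../modules/vpc"

	// Skipped when SKIP_destroy is set, so the infrastructure survives between runs
	defer test_structure.RunTestStage(t, "destroy", func() {
		terraform.Destroy(t, test_structure.LoadTerraformOptions(t, moduleDir))
	})

	// Skipped when SKIP_deploy is set
	test_structure.RunTestStage(t, "deploy", func() {
		terraformOptions := &terraform.Options{TerraformDir: moduleDir}
		test_structure.SaveTerraformOptions(t, moduleDir, terraformOptions)
		terraform.InitAndApply(t, terraformOptions)
	})

	// Validation runs even when reusing existing infrastructure
	terraformOptions := test_structure.LoadTerraformOptions(t, moduleDir)
	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	subnets := aws.GetSubnetsForVpc(t, vpcId, "us-east-1")
	assert.Equal(t, 6, len(subnets)) // 3 public + 3 private
}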
The Complete Static Validation Pipeline
#!/bin/bash
# scripts/validate-terraform.sh
set -euo pipefail
echo "=== Stage 1: Format ==="
terraform fmt -check -recursive -diff
echo "=== Stage 2: Init + Validate ==="
terraform init -backend=false
terraform validate
echo "=== Stage 3: TFLint ==="
tflint --recursive
echo "=== Stage 4: Security Scan ==="
tfsec . --minimum-severity HIGH
echo "=== Stage 5: Compliance Scan ==="
checkov -d . --framework terraform --compact
echo "=== All static checks passed ==="
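On a small repository with providers already cached, this script typically finishes in well under a minute and catches many issues before any cloud resources are involved. Note that terraform init -backend=false still needs network access to download provider plugins, even though it needs no cloud credentials.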
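One limitation of the shell version is that `set -e` stops at the first failing stage, so a formatting error hides any security findings until the next run. A possible alternative, sketched here in Python, runs every stage and reports all failures together. The `run_stages` function and its injectable `runner` parameter are assumptions made for illustration; the commands are the ones from the script above.

```python
import subprocess

STAGES = [
    ("Format", ["terraform", "fmt", "-check", "-recursive", "-diff"]),
    ("Init", ["terraform", "init", "-backend=false"]),
    ("Validate", ["terraform", "validate"]),
    ("TFLint", ["tflint", "--recursive"]),
    ("Security Scan", ["tfsec", ".", "--minimum-severity", "HIGH"]),
    ("Compliance Scan", ["checkov", "-d", ".", "--framework", "terraform", "--compact"]),
]


def run_stages(stages, runner=None):
    """Run every stage and return the names of the ones that failed.

    `runner` takes a command list and returns an exit code. It is
    injectable so the function can be exercised without the real tools.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    failed = []
    for name, cmd in stages:
        print(f"=== {name} ===")
        if runner(cmd) != 0:
            failed.append(name)
    return failed


# Demo with a stand-in runner that pretends only tflint fails.
fake_runner = lambda cmd: 1 if cmd[0] == "tflint" else 0
print(run_stages(STAGES, runner=fake_runner))
```

In real use, call `run_stages(STAGES)` with the default runner and exit non-zero when the returned list is not empty. Keep in mind that a failed Init stage makes the Validate result meaningless, so some teams may still prefer to stop early at that point.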