John

Senior Cloud Engineer & Technical Lead

The Terraform Plan Validation Gap That Caused a Production Outage

I’ve been digging into Terraform’s apply lifecycle recently, and I found a failure mode that genuinely scared me. A clean terraform plan – no warnings, no errors, nothing suspicious – can leave you in an irrecoverable state during apply. Not a failed apply you can retry. An apply that destroys a resource, fails to create its replacement, and leaves you with nothing. The plan said everything was fine. It wasn’t.

The Scenario

Here’s the setup. Imagine a Terraform-managed IAM inline policy attached to a role used by an API gateway. Someone adds several new permission statements to the policy document. terraform plan shows a clean update: old policy out, new policy in. Standard stuff.

What’s easy to miss is that Terraform’s update lifecycle for inline policies is destroy-then-create. During terraform apply, Terraform deletes the existing inline policy first, then attempts to create the replacement. But if the new policy document exceeds AWS’s 2,048-byte limit for inline policies – say it’s 2,891 bytes – AWS rejects the creation request.

The result: the IAM role is left with no policy attached. The API gateway can’t authenticate to any downstream service. Production outage. And the worst part? There’s no automatic rollback. The old policy is already gone. The new one never existed. I’ve seen this exact scenario play out in community incident reports, and I’ve reproduced it myself in a test environment. It’s real.

Why This Is Worse Than a Failed Apply

This isn’t the same as “apply failed, just retry it.” For resources where Terraform performs destroy-then-create (forced replacement), a failure on the create step leaves you in a degraded state with no rollback path. The old resource is already gone. The new one failed to materialize. You’re stuck.

And this pattern isn’t limited to IAM policies. Any resource marked -/+ (or annotated with "# forces replacement") in the plan output is vulnerable:

| Resource | Trigger | What Goes Wrong |
|---|---|---|
| RDS instances | Engine version or storage type change | Database destroyed, recreation fails – data gone |
| Security groups | Name change | All referencing resources lose network access |
| EKS node groups | Disk size, AMI, or instance type change | Compute capacity disappears |
| Launch templates | Name change | ASG references a deleted template |
| Lambda functions | Package exceeds 50MB zipped / 250MB unzipped | Function deleted, new one rejected |
| S3 bucket policies | Policy exceeds 20KB | Bucket loses all access controls |
| SQS/SNS policies | Size or format constraint violation | Queue/topic loses permissions |

Here’s the mental model to internalize:

graph TD
    A[terraform plan] --> B{Plan Output}
    B -->|"~ update in-place"| C[Lower Risk]
    B -->|"-/+ forces replacement"| D[HIGH RISK]
    D --> E[Step 1: Destroy Old Resource]
    E --> F{Step 2: Create New Resource}
    F -->|Success| G[Resource Replaced]
    F -->|"Failure (constraint violation)"| H[DEGRADED STATE]
    H --> I["Old resource: GONE"]
    H --> J["New resource: NEVER CREATED"]
    H --> K["Rollback: MANUAL"]
    style D fill:#ff6b6b,color:#fff
    style H fill:#ff6b6b,color:#fff
    style I fill:#ffaa00,color:#fff
    style J fill:#ffaa00,color:#fff
    style K fill:#ffaa00,color:#fff
    style C fill:#51cf66,color:#fff
    style G fill:#51cf66,color:#fff

The Root Cause: Terraform Plans Are Structural Diffs, Not Validation Engines

This is what forced me to reckon with a fundamental misconception about terraform plan. It’s easy to treat it as a safety net – if the plan looks good, the apply will succeed. That’s wrong.

terraform plan tells you what will change, not whether the change is valid. It’s essentially saying: “I will delete X and create Y.” It never asks: “Will Y actually succeed?”

Terraform doesn’t know about AWS service limits. It doesn’t validate policy document sizes, tag count limits, security group rule maximums, or resource name length restrictions. These are provider-specific constraints that only surface at apply time when the AWS API rejects the request.

graph LR
    subgraph "What terraform plan checks"
        A[State Diff] --> B[Resource Dependencies]
        B --> C[Provider Schema Validation]
        C --> D[HCL Syntax]
    end
    subgraph "What terraform plan does NOT check"
        E[AWS Service Limits]
        F[Policy Size Constraints]
        G[Tag Count Maximums]
        H[Name Length Restrictions]
        I[Resource Quota Limits]
    end
    style E fill:#ff6b6b,color:#fff
    style F fill:#ff6b6b,color:#fff
    style G fill:#ff6b6b,color:#fff
    style H fill:#ff6b6b,color:#fff
    style I fill:#ff6b6b,color:#fff

What Exists Today (And Why None of It Solves This)

After discovering this failure mode, I went deep on every tool in the ecosystem to see if something could catch it. The answer was sobering.

tflint + AWS ruleset

tflint with tflint-ruleset-aws offers 700+ rules for AWS resources – invalid instance types, deprecated AMIs, GovCloud ARN patterns. But it does not validate resource constraints like policy size limits. More importantly, tflint operates on HCL source code, not the resolved plan. So it can’t catch computed values – policies built with data.aws_iam_policy_document or jsonencode() are invisible to it. If the policy is dynamically assembled from multiple data sources, tflint never has a chance.

OPA / Conftest

OPA is a powerful policy engine that runs against plan JSON output. But it’s entirely DIY – you only have rules for things you’ve already been burned by. There’s no shared library of “AWS resource constraint” policies. Nobody writes the rule until after the outage.

Checkov / tfsec / Regula

These tools focus on security posture and compliance – is encryption enabled, is public access blocked, are tags present. They’re not designed for AWS service limits or resource creation constraints.

Terraform preconditions/postconditions

Built-in validation blocks that must be written per-resource by the module author. You can’t validate provider-level constraints that aren’t exposed as data, and someone has to have anticipated the constraint in advance.

Sentinel (HashiCorp)

Same DIY problem as OPA – you write the rules yourself. And it’s Terraform Cloud/Enterprise only, so if you’re running Atlantis or a self-hosted workflow, it’s not an option.

AWS CloudFormation Guard

Closest concept to what’s needed, but it’s CloudFormation-native, not Terraform plan-native.

The gap

Nobody is systematically validating Terraform plans against known AWS resource constraints. And nobody is auto-generating these validation rules from AWS API specifications.

The Opportunity: AWS Smithy Models Are Now Public

Here’s where things get interesting. As of June 2025, AWS publishes formal Smithy API models on GitHub for all public AWS services. These models contain machine-readable constraint traits: string length limits, list cardinality, regex patterns, required fields.

AWS explicitly states these models can be used to build developer tools like linting and auditing utilities. This means auto-generating validation rules from the API spec is now feasible.

The caveat: not all constraints live in the Smithy models. The IAM inline policy size limit (2,048 bytes), for example, is only documented in AWS documentation, not in the Smithy model. A real solution needs a hybrid approach – Smithy models for the long tail plus curated rules for the undocumented constraints.

What the Solution Should Look Like

Why Conftest (post-plan) over tflint (pre-plan)

This is an important architectural decision. tflint operates on HCL source code – it can’t see computed or dynamic values. Conftest operates on terraform show -json plan output – it sees the fully resolved values that will actually be sent to AWS.

The scenario I described involved a dynamically assembled policy document. Only post-plan validation would have caught it.

The pipeline should look like this:

graph LR
    A[terraform plan] --> B[terraform show -json]
    B --> C["conftest validate (HARD GATE)"]
    C -->|Pass| D[Human Approval]
    C -->|Fail| E[Block Apply]
    D --> F[terraform apply]
    style C fill:#339af0,color:#fff
    style E fill:#ff6b6b,color:#fff
    style D fill:#51cf66,color:#fff

Two categories of rules

Category 1: Deterministic constraint violations – “This apply WILL fail”

These are rules where you can say with certainty that the apply will be rejected by AWS:

  • Inline policy document > 2,048 bytes
  • Managed policy document > 6,144 bytes
  • More than 50 tags on a resource
  • Security group rule count exceeding limits
  • S3 bucket policy > 20KB

These are auto-generatable from AWS Smithy models combined with curated documentation.
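
To make that concrete, here’s a minimal sketch of one such rule for the tag limit in the list above. The package name is illustrative, and the rule only fires for resources whose resolved plan values actually contain a tags map:

package terraform.aws.tags

import rego.v1

# Deterministic constraint: AWS allows at most 50 tags per resource,
# so a plan that sets more than 50 is guaranteed to fail on apply.
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    tags := resource.values.tags
    is_object(tags)
    count(tags) > 50

    msg := sprintf(
        "BLOCKED: %s - %d tags requested (AWS limit: 50 per resource).",
        [resource.address, count(tags)]
    )
}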

Category 2: Risky replacement patterns – “This plan COULD leave you broken”

These require more nuanced analysis; a Rego sketch of the forced-replacement check follows the list:

  • Resource is being force-replaced (-/+) and it’s a critical-path resource
  • Resource has dependents that will lose access during the replacement window
  • Resource attributes are near a constraint threshold (e.g., 80%+ of max policy size)
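
The first pattern – detecting a forced replacement – can be read straight from the resource_changes section of the plan JSON, where a -/+ replacement shows up as the action pair ["delete", "create"]. A minimal sketch (the package name is illustrative; criticality scoring and dependency counting would layer on top of this):

package terraform.aws.replacement

import rego.v1

# Flag every resource Terraform plans to destroy before recreating.
# The window between the two steps is where a failed create leaves
# you with no resource and no automatic rollback.
warn contains msg if {
    change := input.resource_changes[_]
    change.change.actions == ["delete", "create"]

    msg := sprintf(
        "WARNING: %s - forces replacement (destroy-then-create); a failure on the create step leaves this resource gone with no rollback.",
        [change.address]
    )
}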

Example output

Here’s what the validation gate should produce:

BLOCKED: aws_iam_role_policy.api_gateway_policy
   Inline policy document is 2,891 bytes (limit: 2,048)
   This resource forces replacement -- failure will DELETE existing policy

WARNING: aws_db_instance.primary
   Forces replacement (engine_version change)
   Risk: destroy-then-create on production database

WARNING: aws_security_group.api
   Forces replacement (name change)
   14 resources reference this security group

PASS: 47 other resources validated

A practical Rego rule to start with

Here’s a concrete example for the IAM inline policy size check that would have caught this scenario:

package terraform.aws.iam

import rego.v1

# Hard gate: a policy document over 2,048 bytes will be rejected by AWS
# on apply. Note: count() returns characters, which matches bytes for
# the ASCII-only JSON that policy documents normally contain.
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_role_policy"

    policy_json := resource.values.policy
    byte_length := count(policy_json)
    byte_length > 2048

    msg := sprintf(
        "BLOCKED: %s - Inline policy is %d bytes (limit: 2,048). This will fail on apply and may leave the role without any policy.",
        [resource.address, byte_length]
    )
}

# Early warning: the policy still fits, but the next edit is likely to
# push it over the limit.
warn contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_role_policy"

    policy_json := resource.values.policy
    byte_length := count(policy_json)
    byte_length > 1638  # 80% of 2,048
    byte_length <= 2048

    msg := sprintf(
        "WARNING: %s - Inline policy is %d bytes (80%%+ of 2,048 limit). Consider migrating to a managed policy.",
        [resource.address, byte_length]
    )
}
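
To keep the rule honest, pair it with a unit test that feeds it a minimal fake plan document. Here’s a sketch, assuming it sits next to the policy above in a file such as policy_test.rego (the file name and resource address are illustrative); run it with opa test or conftest verify:

package terraform.aws.iam

import rego.v1

# Build a minimal fake planned_values document around a given policy string.
fake_plan(policy) := {"planned_values": {"root_module": {"resources": [{
    "address": "aws_iam_role_policy.example",
    "type": "aws_iam_role_policy",
    "values": {"policy": policy}
}]}}}

# A policy over 2,048 characters must trigger exactly one deny message.
test_oversized_inline_policy_is_denied if {
    oversized := concat("", ["x" | numbers.range(1, 3000)[_]])
    plan := fake_plan(oversized)
    count(deny) == 1 with input as plan
}

# A small policy must produce no deny or warn messages.
test_small_inline_policy_passes if {
    plan := fake_plan(`{"Version": "2012-10-17"}`)
    count(deny) == 0 with input as plan
    count(warn) == 0 with input as plan
}

One operational note: conftest only evaluates the main package by default, so running the real check against plan JSON needs --namespace terraform.aws.iam or --all-namespaces.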

Why an Open-Source Policy Library Is the Right Starting Point

Platform engineers are skeptical of vendor tooling – I know because I’m one of them. An open-source Rego policy library organized by AWS service is the right approach for a few reasons:

  1. The value compounds with community contributions. Every team that gets burned by an undocumented constraint adds a rule.
  2. Conftest already integrates with Atlantis natively via server-side policy checking. The adoption path is short.
  3. Each rule should include the constraint, the Rego policy, a test case, and a link to the AWS documentation. Self-contained and auditable (see the annotated sketch after this list).
  4. Auto-generation from Smithy models creates the long tail. Community curation catches the undocumented constraints.
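
As an illustration of point 3, OPA’s metadata annotations let a rule carry its constraint, severity, and documentation link alongside the policy itself. A sketch using the managed-policy limit from Category 1 (the package name and severity label are illustrative):

package terraform.aws.iam

import rego.v1

# METADATA
# title: IAM managed policy document size
# description: Managed policy documents over 6,144 bytes are rejected by AWS on apply.
# related_resources:
# - ref: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_iam-quotas.html
# custom:
#   severity: BLOCKED
deny contains msg if {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_policy"
    count(resource.values.policy) > 6144

    msg := sprintf(
        "BLOCKED: %s - Managed policy is %d bytes (limit: 6,144).",
        [resource.address, count(resource.values.policy)]
    )
}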

The Bigger Picture

The IaC market is projected to grow from roughly $1.3B in 2025 to $9.4B by 2034. Companies like Spacelift, env0, and Scalr are raising significant funding rounds. They all offer policy-as-code as a feature. But the policy content is still entirely DIY.

Nobody is framing the problem as “prevent irrecoverable state from failed replacements.” There’s a clear gap between “we have a policy engine” and “we have policies that actually prevent outages.”

Key Learnings

  • terraform plan is a structural diff, not a validation engine – it tells you what will change, not whether the change will succeed against AWS service limits
  • Resources showing -/+ (forces replacement) are your highest-risk changes – a failure on the create step after the destroy step leaves you with no resource and no rollback
  • Write a Rego rule for IAM policy size today – check planned_values in the plan JSON, measure byte length against 2,048 (inline) and 6,144 (managed)
  • Migrate inline policies to managed policies where possible – 3x the size limit and a safer update lifecycle
  • Add a conftest step as a hard gate in your Atlantis workflow – advisory-only mode gives you a false sense of security
  • Use create_before_destroy where possible – but know that many AWS resources don’t support having two instances with the same identifier simultaneously
  • AWS Smithy models are a goldmine for auto-generating validation rules – the API specs are public and machine-readable
  • Stop reinventing the wheel after every outage – contribute to and adopt shared policy libraries instead of writing one-off rules in isolation