The Terraform Plan Validation Gap That Caused a Production Outage
February 13, 2026
I’ve been digging into Terraform’s apply lifecycle recently, and I found a failure mode that genuinely scared me. A clean terraform plan – no warnings, no errors, nothing suspicious – can leave you in an irrecoverable state during apply. Not a failed apply you can retry. An apply that destroys a resource, fails to create its replacement, and leaves you with nothing. The plan said everything was fine. It wasn’t.
The Scenario
Here’s the setup. Imagine a Terraform-managed IAM inline policy attached to a role used by an API gateway. Someone adds several new permission statements to the policy document. terraform plan shows a clean replacement: old policy out, new policy in. Standard stuff.
What’s easy to miss is that this replacement is destroy-then-create. During terraform apply, Terraform deletes the existing inline policy first, then attempts to create the replacement. But if the new policy document exceeds AWS’s 10,240-byte limit for role inline policies – say it’s 11,286 bytes – AWS rejects the creation request.
The result: the IAM role is left with no policy attached. The API gateway can’t authenticate to any downstream service. Production outage. And the worst part? There’s no automatic rollback. The old policy is already gone. The new one never existed. I’ve seen this exact scenario play out in community incident reports, and I’ve reproduced it myself in a test environment. It’s real.
Why This Is Worse Than a Failed Apply
This isn’t the same as “apply failed, just retry it.” For resources where Terraform performs destroy-then-create (forced replacement), a failure on the create step leaves you in a degraded state with no rollback path. The old resource is already gone. The new one failed to materialize. You’re stuck.
And this pattern isn’t limited to IAM policies. Any resource showing -/+ (or the “# forces replacement” annotation) in the plan output is vulnerable:
| Resource | Trigger | What Goes Wrong |
|---|---|---|
| RDS instances | Engine version or storage type change | Database destroyed, recreation fails – data gone |
| Security groups | Name change | All referencing resources lose network access |
| EKS node groups | Disk size, AMI, or instance type change | Compute capacity disappears |
| Launch templates | Name change | ASG references a deleted template |
| Lambda functions | Package exceeds 50MB zipped / 250MB unzipped | Function deleted, new one rejected |
| S3 bucket policies | Policy exceeds 20KB | Bucket loses all access controls |
| SQS/SNS policies | Size or format constraint violation | Queue/topic loses permissions |
Every one of these failure modes traces back to the same misconception:
The Root Cause: Terraform Plans Are Structural Diffs, Not Validation Engines
This is what forced me to reckon with a fundamental misconception about terraform plan. It’s easy to treat it as a safety net – if the plan looks good, the apply will succeed. That’s wrong.
terraform plan tells you what will change, not whether the change is valid. It’s essentially saying: “I will delete X and create Y.” It never asks: “Will Y actually succeed?”
Terraform doesn’t know about AWS service limits. It doesn’t validate policy document sizes, tag count limits, security group rule maximums, or resource name length restrictions. These are provider-specific constraints that only surface at apply time when the AWS API rejects the request.
What Exists Today (And Why None of It Solves This)
After discovering this failure mode, I went deep on every tool in the ecosystem to see if something could catch it. The answer was sobering.
tflint + AWS ruleset
tflint with tflint-ruleset-aws offers 700+ rules for AWS resources – invalid instance types, deprecated AMIs, GovCloud ARN patterns. But it does not validate resource constraints like policy size limits. More importantly, tflint operates on HCL source code, not the resolved plan. So it can’t catch computed values – policies built with data.aws_iam_policy_document or jsonencode() are invisible to it. If the policy is dynamically assembled from multiple data sources, tflint never has a chance.
OPA / Conftest
OPA is a powerful policy engine that runs against plan JSON output. But it’s entirely DIY – you only have rules for things you’ve already been burned by. There’s no shared library of “AWS resource constraint” policies. Nobody writes the rule until after the outage.
Checkov / tfsec / Regula
These tools focus on security posture and compliance – is encryption enabled, is public access blocked, are tags present. They’re not designed for AWS service limits or resource creation constraints.
Terraform preconditions/postconditions
Built-in validation blocks that must be written per-resource by the module author. You can’t validate provider-level constraints that aren’t exposed as data, and someone has to have anticipated the constraint in advance.
Sentinel (HashiCorp)
Same DIY problem as OPA – you write the rules yourself. And it’s Terraform Cloud/Enterprise only, so if you’re running Atlantis or a self-hosted workflow, it’s not an option.
AWS CloudFormation Guard
Closest concept to what’s needed, but it’s CloudFormation-native, not Terraform plan-native.
The gap
Nobody is systematically validating Terraform plans against known AWS resource constraints. And nobody is auto-generating these validation rules from AWS API specifications.
The Opportunity: AWS Smithy Models Are Now Public
Here’s where things get interesting. As of June 2025, AWS publishes formal Smithy API models on GitHub for all public AWS services. These models contain machine-readable constraint traits: string length limits, list cardinality, regex patterns, required fields.
AWS explicitly states these models can be used to build developer tools like linting and auditing utilities. This means auto-generating validation rules from the API spec is now feasible.
The caveat: not all constraints live in the Smithy models. The IAM role inline policy size limit (10,240 bytes), for example, is only documented in the IAM quotas documentation, not in the Smithy model – the model’s policy document shape allows far more than the per-role quota. A real solution needs a hybrid approach – Smithy models for the long tail plus curated rules for the undocumented constraints.
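To make “feasible” concrete, here’s what a generated rule could look like. This is a hypothetical sketch: the 64-character bound comes from the length constraint on IAM’s RoleName shape, and the planned_values layout is the one terraform show -json produces:

```rego
package terraform.aws.iam_generated

# Hypothetical auto-generated rule: the IAM API model constrains
# RoleName to 1-64 characters, so anything longer is guaranteed to
# be rejected at apply time.
deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_role"
    count(resource.values.name) > 64
    msg := sprintf(
        "BLOCKED: %s - role name is %d characters (model limit: 64)",
        [resource.address, count(resource.values.name)]
    )
}
```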
What the Solution Should Look Like
Why Conftest (post-plan) over tflint (pre-plan)
This is an important architectural decision. tflint operates on HCL source code – it can’t see computed or dynamic values. Conftest operates on terraform show -json plan output – it sees the fully resolved values that will actually be sent to AWS.
The scenario I described involved a dynamically assembled policy document. Only post-plan validation would have caught it.
The pipeline should look like this – plain Terraform CLI plus Conftest, where policies/ is a placeholder for wherever your Rego rules live:
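```
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json --policy policies/ --all-namespaces

# exit 0     -> safe to proceed: terraform apply tfplan
# exit non-0 -> fail the PR check; the plan never reaches apply
```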
Two categories of rules
Category 1: Deterministic constraint violations – “This apply WILL fail”
These are rules where you can say with certainty that the apply will be rejected by AWS:
- Role inline policy document > 10,240 bytes (user inline policies cap at 2,048)
- Managed policy document > 6,144 bytes
- More than 50 tags on a resource
- Security group rule count exceeding limits
- S3 bucket policy > 20KB
These are auto-generatable from AWS Smithy models combined with curated documentation.
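For instance, the 50-tag limit compiles down to a one-screen rule. A minimal sketch, assuming tags appear under values.tags in the plan JSON (true for most AWS provider resources):

```rego
package terraform.aws.tags

# AWS allows at most 50 tags per resource, across services, so no
# resource-type filter is needed here.
deny[msg] {
    resource := input.planned_values.root_module.resources[_]
    tags := resource.values.tags
    count(tags) > 50
    msg := sprintf(
        "BLOCKED: %s - has %d tags (AWS limit: 50)",
        [resource.address, count(tags)]
    )
}
```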
Category 2: Risky replacement patterns – “This plan COULD leave you broken”
These require more nuanced analysis; a Rego sketch for the first pattern follows the list:
- Resource is being force-replaced (-/+) and it’s a critical-path resource
- Resource has dependents that will lose access during the replacement window
- Resource attributes are near a constraint threshold (e.g., 80%+ of max policy size)
Example output
Here’s what the validation gate should produce:
```
BLOCKED: aws_iam_role_policy.api_gateway_policy
  Inline policy document is 11,286 bytes (limit: 10,240)
  This resource forces replacement -- failure will DELETE existing policy

WARNING: aws_db_instance.primary
  Forces replacement (engine_version change)
  Risk: destroy-then-create on production database

WARNING: aws_security_group.api
  Forces replacement (name change)
  14 resources reference this security group

PASS: 47 other resources validated
```
A practical Rego rule to start with
Here’s a concrete example for the IAM inline policy size check that would have caught this scenario:
```rego
package terraform.aws.iam

deny[msg] {
    # Walks root-module resources only; resources in nested modules
    # live under planned_values.root_module.child_modules.
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_role_policy"
    policy_json := resource.values.policy

    # count() returns characters, which matches bytes for the ASCII
    # documents IAM policies almost always are. AWS also excludes
    # whitespace from the quota, so this check errs conservative.
    byte_length := count(policy_json)
    byte_length > 10240
    msg := sprintf(
        "BLOCKED: %s - Inline policy is %d bytes (limit: 10,240). This will fail on apply and may leave the role without any policy.",
        [resource.address, byte_length]
    )
}

warn[msg] {
    resource := input.planned_values.root_module.resources[_]
    resource.type == "aws_iam_role_policy"
    policy_json := resource.values.policy
    byte_length := count(policy_json)
    byte_length > 8192 # 80% of 10,240
    byte_length <= 10240
    msg := sprintf(
        "WARNING: %s - Inline policy is %d bytes (80%%+ of the 10,240 limit). Consider splitting it before the limit blocks an apply.",
        [resource.address, byte_length]
    )
}
```
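A note on wiring this up: Conftest only evaluates the main package namespace by default, so rules living in terraform.aws.iam need conftest test tfplan.json --all-namespaces (or an explicit --namespace terraform.aws.iam) before they’ll fire.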
Why an Open-Source Policy Library Is the Right Starting Point
Platform engineers are skeptical of vendor tooling – I know because I’m one of them. An open-source Rego policy library organized by AWS service is the right approach for a few reasons:
- The value compounds with community contributions. Every team that gets burned by an undocumented constraint adds a rule.
- Conftest already integrates with Atlantis natively via server-side policy checking. The adoption path is short.
- Each rule should include the constraint, the Rego policy, a test case (see the sketch after this list), and a link to the AWS documentation. Self-contained and auditable.
- Auto-generation from Smithy models creates the long tail. Community curation catches the undocumented constraints.
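Here’s what the test-case half of that contract could look like – a minimal sketch in OPA’s native test convention (runnable with conftest verify or opa test), sitting in the same terraform.aws.iam package as the size rule above. The comprehension is just a trick to manufacture a document longer than 10,240 bytes:

```rego
package terraform.aws.iam

# 1,025 copies of a 10-character chunk = 10,250 bytes, over the limit.
oversized_policy := concat("", ["0123456789" | numbers.range(1, 1025)[_]])

test_oversized_inline_policy_is_denied {
    fixture := {"planned_values": {"root_module": {"resources": [{
        "address": "aws_iam_role_policy.fixture",
        "type": "aws_iam_role_policy",
        "values": {"policy": oversized_policy}
    }]}}}
    count(deny) == 1 with input as fixture
}
```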
The Bigger Picture
The IaC market is projected to grow from roughly $1.3B in 2025 to $9.4B by 2034. Companies like Spacelift, env0, and Scalr are raising significant funding rounds. They all offer policy-as-code as a feature. But the policy content is still entirely DIY.
Nobody is framing the problem as “prevent irrecoverable state from failed replacements.” There’s a clear gap between “we have a policy engine” and “we have policies that actually prevent outages.”
Key Learnings
- `terraform plan` is a structural diff, not a validation engine – it tells you what will change, not whether the change will succeed against AWS service limits
- Resources showing `-/+` (forces replacement) are your highest-risk changes – a failure on the create step after the destroy step leaves you with no resource and no rollback
- Write a Rego rule for IAM policy size today – check `planned_values` in the plan JSON and measure byte length against 10,240 (role inline) and 6,144 (managed)
- Prefer managed policies where the lifecycle matters – updating one creates a new policy version rather than deleting the document first, and they give user-level principals 3x the inline size limit
- Add a `conftest` step as a hard gate in your Atlantis workflow – advisory-only mode gives you a false sense of security
- Use `create_before_destroy` where possible – but know that many AWS resources don’t support having two instances with the same identifier simultaneously
- AWS Smithy models are a goldmine for auto-generating validation rules – the API specs are public and machine-readable
- Stop reinventing the wheel after every outage – contribute to and adopt shared policy libraries instead of writing one-off rules in isolation