The Terraform repos we maintain across clients look broadly similar. No two are identical — we'd be lying if we said so — but four rules have held with practically no exceptions since 2019. We call them 'boring Terraform'. Literally. When a new engineer adds something elegant, our first question is whether we can do it slightly less elegantly.
Here is the short list of what we do — and the one thing we tried for a year and quietly rolled back.
Rule 1: one state file per environment
No shared state between staging and production. No switching between environments with Terraform workspaces. Each environment has its own state file in S3 with a DynamoDB lock. Yes, it means a few hundred lines of duplicated code. Yes, we tried smarter setups. Yes, each of them woke us up at night at least once.
The cost is the duplication. The benefit is that no morning has ever started with 'why did that apply against staging when I had production selected?'
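A minimal sketch of what that looks like, assuming one directory per environment. The paths, bucket names, table names, and region are hypothetical placeholders, not taken from a real repo:

```hcl
# environments/production/backend.tf (hypothetical path)
terraform {
  backend "s3" {
    bucket         = "example-tfstate-production"        # hypothetical bucket
    key            = "production/terraform.tfstate"
    region         = "eu-central-1"                      # assumed region
    dynamodb_table = "example-tfstate-lock-production"   # hypothetical lock table
    encrypt        = true
  }
}

# environments/staging/backend.tf (hypothetical path, a separate directory)
# The same block again with "staging" in every name. Nothing is shared,
# so an apply run from this directory cannot touch the production state.
```

The lock table needs a string partition key named `LockID`. The duplication is visible right here: two nearly identical files, and that is the point.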
Rule 2: no custom module until a third project asks for it
The temptation to write a generic module after the first use is huge, especially when the next problem looks 80% the same. The other 20% is where all the bugs live.
The rule: copy twice, abstract once. The second project always shows us differences we couldn't see after the first. The third project is where the code might become a module. If we never get to a third one, we never needed the module.
Rule 3: secrets never live in Terraform
No API key, no database password in a `.tf` file — not even in `terraform.tfvars`. Secrets always live in a secret manager (AWS Secrets Manager, GCP Secret Manager, Vault) and Terraform only reads their ARN or path. Provisioning a database password works like this: Terraform creates the resource, hands a random password to the secret manager, and a separate step writes it into the database.
This rule once saved a client from a real leak. Someone accidentally pushed a `.tfvars` file to a public GitHub repo. The file contained nothing sensitive — because nothing sensitive had ever been allowed into it.
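The "Terraform only reads their ARN" half of the rule can be sketched roughly like this for AWS Secrets Manager. The secret name and policy name are hypothetical, and the secret itself is assumed to exist already:

```hcl
# Look up an existing secret by name. Only metadata such as the ARN is read.
# The secret value is never fetched, so it never lands in the state file.
data "aws_secretsmanager_secret" "db_password" {
  name = "production/app/db-password" # hypothetical secret name
}

# Allow the application to fetch the value itself at runtime.
data "aws_iam_policy_document" "read_db_password" {
  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [data.aws_secretsmanager_secret.db_password.arn]
  }
}

resource "aws_iam_policy" "read_db_password" {
  name   = "read-db-password" # hypothetical policy name
  policy = data.aws_iam_policy_document.read_db_password.json
}
```

The design choice worth noting: reading the secret's value through Terraform (for example with the `aws_secretsmanager_secret_version` data source) would copy it into state, which is exactly what the rule is meant to avoid.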
Rule 4: the plan goes in the PR
No apply without code review. The pull request must include `terraform plan` output in the description — automatically, via GitHub Actions. The reviewer reads the plan, not the file diff. The diff is how it's written. The plan is what it will actually do.
A lot of bad changes look innocent in the diff. The plan exposes them immediately: a resource that 'will be destroyed', an attribute change that 'forces replacement', an IAM role re-mapped to something new. The twenty seconds CI spends generating that plan is the best-spent time in the whole workflow.
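As an illustration of a diff that looks innocent, here is a hypothetical one-line change to an S3 bucket. The resource and bucket names are made up for the example:

```hcl
resource "aws_s3_bucket" "uploads" {
  # In the file diff this is one renamed string.
  # Before: bucket = "example-uploads"
  bucket = "example-client-uploads" # hypothetical names
}

# A bucket cannot be renamed in place, so the plan reports something
# along these lines (abbreviated, exact formatting varies by version):
#
#   # aws_s3_bucket.uploads must be replaced
#   ~ bucket = "example-uploads" -> "example-client-uploads" # forces replacement
#
#   Plan: 1 to add, 0 to change, 1 to destroy.
```

A reviewer skimming the file diff sees a rename. A reviewer reading the plan sees a bucket being destroyed and recreated.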
The experiment we rolled back
We spent all of 2022 on Terragrunt. The idea was attractive: DRY for Terraform, shared backend configuration, generated provider blocks, environment hierarchies.
After a year we quietly walked away. Not because Terragrunt is bad — it isn't. We had a team of six senior engineers and only two of them really understood Terragrunt. When one was on holiday and the other on a different project, debugging hurt. 'Boring Terraform' was the only shared language everyone actually spoke.
The rule we derived: we don't add a tool that fewer than half the team can fully operate. It's very conservative. It saves us in precisely the moments when conservative rules pay off.
Elegant infrastructure is elegant right up until you have to debug it at three in the morning. After that, it's just infrastructure you're debugging at three in the morning.
When a new engineer wants to add a fancier helper or a custom Terraform provider, we don't say no by reflex. We ask: will you carry it on-call? If the answer is yes, there's usually a good reason. If the answer is 'um', we go back to the boring code.