Terraform and Terragrunt: Why We Added Terragrunt at Scale

Introduction

Terraform was not the thing that failed. Our repository did.

By the time we had more than 50 environments and more than 100 services, Terraform was mostly doing what it was asked to do. The problem was what we kept asking of it: duplicate backend blocks, copy-pasted module wiring, inconsistent environment inputs, and root modules that carried more responsibility than anyone wanted to review.

That is the useful way to frame Terraform and Terragrunt in 2025.

It was never really Terraform versus Terragrunt for us. That framing is tidy, but wrong. Terragrunt did not replace Terraform. Terraform still planned and applied the resources. Terragrunt gave us a way to organise the work so every new service did not start with another round of paste, edit, squint, and hope.

If you have a small estate with a few modules and a small number of environments, plain Terraform is probably fine. Adding Terragrunt there can be ceremony in search of a problem.

If you run a shared platform across many teams, environments, and service shapes, the question changes. It becomes: how do we keep Terraform units small, repeat the right conventions, and avoid one giant root module that nobody wants to touch?

That was the trade-off. We accepted another layer because the layer removed a worse one: hidden operational complexity spread across the repo.

What Actually Hurt

Writing Terraform was not the painful part. Operating a growing Terraform repo without enough boundaries was.

Before Terragrunt, our layout looked roughly like this:

infrastructure/
├── dev/
│   └── gke/
│       ├── main.tf
│       ├── variables.tf
│       └── backend.tf
├── staging/
│   └── gke/
│       ├── main.tf
│       └── backend.tf
└── prod/
    └── gke/
        ├── main.tf
        └── backend.tf

This layout looks harmless when there are three folders. It becomes expensive when the pattern is copied across dozens of environments and services.

Repeated backend configuration

Every environment had its own backend configuration. Most of it was identical. The state path changed, maybe the bucket or prefix changed, but the shape was the same.

That is tolerable at small scale. At 50 environments, it turns into bookkeeping. Rename a bucket, change a prefix convention, or split state differently, and the work is not intellectually hard. It is just the kind of dull edit that causes production incidents because the tenth file looks exactly like the ninth.
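
As a rough illustration of the repetition described above, a per-environment backend file might have looked like the sketch below. The bucket name and prefix values are assumptions chosen for the example, not our real ones.

```hcl
# infrastructure/dev/gke/backend.tf (illustrative; values are assumptions)
terraform {
  backend "gcs" {
    bucket = "my-terraform-state"
    prefix = "dev/gke" # the only line that differs per environment
  }
}

# infrastructure/staging/gke/backend.tf would repeat the same block
# with prefix = "staging/gke", and so on for every other folder.
```

One file like this is harmless. The cost comes from keeping dozens of near-identical copies in step by hand.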

Repeated module wiring

Spinning up a standard service stack should have meant choosing the module version and passing the environment-specific inputs.

Too often, it meant copying a large module block, changing a few values, and hoping the old provider settings, module version, tags, or naming convention were still correct.
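
A hypothetical example of what that copying tends to produce is sketched below. The project ID, region, `labels` input, and the stale `v1.1.0` pin are all invented for illustration; only the module source, `environment`, and `node_count` echo names used elsewhere in this post.

```hcl
# infrastructure/staging/gke/main.tf (hypothetical contents)
provider "google" {
  project = "my-staging-project" # assumption: copied from dev, then edited by hand
  region  = "europe-west1"       # assumption
}

module "gke_cluster" {
  # Stale version pin that came along with the copy-paste (hypothetical).
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.1.0"

  environment = "staging"
  node_count  = 3

  # Hypothetical input: tag conventions repeated in every copy.
  labels = {
    team = "platform"
  }
}
```

Nothing in this file is wrong on its own. The problem is that nothing in it tells a reviewer which lines were chosen deliberately and which were inherited from the folder it was copied from.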

This is where infrastructure as code becomes a bit ridiculous. The syntax is declarative, but the workflow is manual. You end up managing infrastructure with find and replace.

Too much blast radius

The real cost of repetition is not that the repo looks messy. The cost is that small changes become hard to reason about.

Reviewers stop reading repeated blocks carefully. Plans include unrelated noise. Engineers become conservative around routine changes because they cannot tell which part of the repo is actually in play.

When your team starts treating routine changes as high-risk events, that is the signal. The tooling has stopped giving you useful boundaries.

What Terragrunt Changed

Terragrunt made sense because it addressed the boring parts around Terraform. Those boring parts were exactly where our risk lived.

The useful changes were:

  1. We could define common configuration once and inherit it.
  2. We could keep environment-specific files focused on inputs and intent.
  3. We could split infrastructure into smaller units with clearer state boundaries.
  4. We could model dependencies between units without building one giant root module.

None of those features is magic on its own. Together, they changed the default shape of the repo.
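
To make the fourth point concrete, here is a minimal sketch of a Terragrunt `dependency` block. The sibling unit at `../network`, its `network_name` output, and the `network` module input are assumptions made up for the example.

```hcl
# dev/gke/terragrunt.hcl (sketch; unit and output names are hypothetical)
dependency "network" {
  config_path = "../network"

  # Placeholder values so validate and plan can run before the
  # network unit has been applied for the first time.
  mock_outputs = {
    network_name = "mock-network"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  # Read the output of the other unit instead of sharing one big state.
  network = dependency.network.outputs.network_name
}
```

The two units keep separate state, and the relationship between them is written down in one visible place rather than implied by everything living in the same root module.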

One place for shared configuration

We moved shared backend configuration into a root terragrunt.hcl and let child units inherit it:

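# Root terragrunt.hcl: remote state settings inherited by every child unit.
remote_state {
  backend = "gcs"
  config = {
    bucket = "my-terraform-state"
    # Each unit gets its own state path, derived from its folder relative to this file.
    prefix = "${path_relative_to_include()}/terraform.tfstate"
  }
}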

The backend is not the interesting part here. The convention is.

Instead of repeating backend shape everywhere, we made it a platform default. A new unit inherited the boring pieces and only had to declare what was different.
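
The same root file can carry other platform defaults as well. The sketch below is one way that might look, not a copy of our configuration: it asks Terragrunt to generate the backend file and a provider file for each unit. The project ID and region are assumptions.

```hcl
# Root terragrunt.hcl (sketch; project and region are hypothetical)
remote_state {
  backend = "gcs"

  # Write a backend.tf into each unit so modules do not need
  # their own backend block.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket = "my-terraform-state"
    prefix = "${path_relative_to_include()}/terraform.tfstate"
  }
}

# Generate an identical provider block for every unit.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"
  region  = "europe-west1"
}
EOF
}
```

The trade-off is that the generated files are invisible in the repo until Terragrunt runs, which is part of the learning cost of the extra layer.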

Environment files became smaller and more honest

Instead of full Terraform entrypoints in every folder, we ended up with smaller terragrunt.hcl files that mostly expressed intent:

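# Inherit the shared settings from the root terragrunt.hcl.
include {
  path = find_in_parent_folders()
}

# The implementation lives in a versioned module, pinned by git ref.
terraform {
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.2.0"
}

# Only the values that genuinely differ for this environment.
inputs = {
  environment = "dev"
  node_count  = 1
}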

That is a cleaner boundary.

The environment file no longer pretends to be the implementation. It points at the implementation and declares the local choices: environment name, sizing, feature flags, dependency outputs, and whatever else genuinely differs.
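
One way to keep those local choices in a single place per environment is a small shared file that each unit reads. This is a sketch under assumptions: the `env.hcl` file name and its contents are hypothetical, while `read_terragrunt_config` and `find_in_parent_folders` are standard Terragrunt functions.

```hcl
# dev/env.hcl (hypothetical file holding environment-wide values)
locals {
  environment = "dev"
  node_count  = 1
}

# dev/gke/terragrunt.hcl (a separate file, shown here for brevity)
include {
  path = find_in_parent_folders()
}

locals {
  # Load the nearest env.hcl found by walking up the folder tree.
  env = read_terragrunt_config(find_in_parent_folders("env.hcl"))
}

terraform {
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.2.0"
}

inputs = {
  environment = local.env.locals.environment
  node_count  = local.env.locals.node_count
}
```

With this shape, adding a unit to an environment means pointing at a module, not restating what the environment is.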

Smaller states, clearer ownership

The biggest improvement was state isolation.

Breaking large states into smaller units changed the behaviour of the whole workflow. Plans became shorter. Reviews became sharper. A change to one service stopped dragging half the platform into the conversation. Rollout order had to be explicit instead of implied by a huge root module.

This is the part people underweight. DRY is useful. Blast radius is more useful.
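
Explicit rollout order can be written down with a `dependencies` block, which Terragrunt uses to order units when a command runs across several of them. The unit paths below are hypothetical.

```hcl
# dev/service-a/terragrunt.hcl (sketch; sibling unit paths are hypothetical)
include {
  path = find_in_parent_folders()
}

# Apply these units before this one when running across the whole folder.
dependencies {
  paths = ["../network", "../gke"]
}
```

The exact multi-unit command depends on the Terragrunt version in use, so check the CLI reference for the release you run.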

Why This Still Makes Sense Today

The ecosystem has moved since the first wave of "use Terragrunt at scale" advice.

Terraform has Stacks support. OpenTofu continues to evolve. Module registries are better. Teams also have more CI/CD patterns for running infrastructure changes safely.

That does not make Terragrunt obsolete, but it does make the lazy argument weaker. "Use Terragrunt because scale" is not enough.

The question is whether your existing repo has a Terragrunt-shaped problem or a different one. Terragrunt is still the right answer when your pain is repo structure, state boundaries, and orchestration. It is not the right answer when your pain is something else, such as team process, module design, or a CI system that cannot reliably run terraform plan without timing out.

Knowing which problem you actually have is most of the work. Terragrunt is a solution to a specific problem, not a tax you pay for growing up.

That distinction matters.

The Migration Approach That Worked For Us

This kind of migration is mostly a sequencing problem.

Trying to redesign the whole estate in one pass is how you end up with two systems, two sets of conventions, and no clear owner for either. We took the dull route because dull was safer.

Stop creating new debt

First, new services had to use the new pattern.

There is no point cleaning up old duplication while still adding fresh duplication. This also gave us real usage quickly. If the pattern was awkward for a new service, it was going to be unbearable during migration.

Move the safer environments first

We started where mistakes were cheaper and feedback was faster.

Lower environments are not perfect replicas of production, but they are useful for testing the workflow: state paths, dependency outputs, CI behaviour, plan readability, and whether engineers can understand the new folder shape without a meeting.

Refactor state carefully

Splitting large states was the delicate part. We relied heavily on terraform state mv and moved blocks while moving resources and modules into cleaner boundaries.

The command is simple. The operation is not.

You need a quiet window, clear ownership, careful review, and a plan for what happens if someone is changing the same area while state is being reshaped. The risk is not that terraform state mv is hard to type. The risk is losing track of which state owns which resource.
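
For the address changes that stay inside a single state, a `moved` block records the rename in code so the next plan shows a move instead of a destroy and recreate. The resource addresses below are hypothetical.

```hcl
# Sketch with hypothetical addresses: a cluster resource that used to sit
# at the root of a configuration and now lives inside a module.
moved {
  from = google_container_cluster.primary
  to   = module.gke_cluster.google_container_cluster.primary
}
```

A `moved` block only describes renames within one state. Moving a resource from one state to another still needs terraform state mv or an equivalent step, which is where the quiet window and the careful review earn their keep.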

When I Would Not Reach for Terragrunt

Terragrunt is useful, but it is still another tool in the path between an engineer and an infrastructure change.

That has a cost. People need to understand terragrunt.hcl, includes, dependency blocks, generated config, and whatever wrapper scripts you add around it. CI has to run it correctly. Debugging now involves Terraform and Terragrunt behaviour.

So I would not add it because the internet likes saying "use Terragrunt at scale." I would add it when the symptoms are already visible:

  • Your environments mostly repeat the same module wiring.
  • Shared configuration is duplicated everywhere.
  • Your plans are noisy because state boundaries are too broad.
  • Engineers are afraid of routine changes for reasons that have nothing to do with the cloud resources themselves.

If none of that is true, plain Terraform is probably the better choice. The simplest layer is the one you do not add.

Conclusion

So, is it Terraform versus Terragrunt?

Not in any useful sense.

The honest framing is Terraform with Terragrunt, where Terraform remains the provisioning engine and Terragrunt handles the repo mechanics around it.

That trade-off was worth it for us because the old cost was already being paid. We were paying it through duplication, wide state boundaries, noisy plans, and slow reviews. Terragrunt made that cost explicit and gave us a better place to manage it.

That is why I still think Terragrunt has a place. Not because Terraform is weak. Not because every team needs another wrapper. Because once a Terraform estate gets large enough, the missing piece is often not more infrastructure code. It is a cleaner operating model for the infrastructure code you already have.