Terraform and Terragrunt in 2025: Why We Added Terragrunt at Scale

Introduction

Terraform was not the part that broke. Our repo shape was.

Once we had more than 50 environments and more than 100 services, plain Terraform started amplifying every bad decision we had made early on. Each new service meant more repeated module wiring, more backend configuration, and more chances to make a small change in dev that had no business touching half the repo.

That is the framing I would use in 2025.

This is not really a story about Terraform versus Terragrunt. That comparison sounds clean, but it is slightly dishonest. Terragrunt did not replace Terraform for us. It sat on top of it and gave us the structure that plain Terraform, in our repository layout, was missing.

If you run a small estate with a handful of modules, Terraform on its own is usually enough. If you run a shared platform across many environments, teams, and services, the question changes. It is no longer "Should I use Terraform?" It becomes "How do I stop repeating myself without creating a giant root module nobody wants to touch?"

For us, the answer was Terragrunt.

What Actually Hurt

The problem was not writing Terraform. The problem was operating it at scale without enough structure.

Before Terragrunt, our layout looked roughly like this:

infrastructure/
├── dev/
│   ├── main.tf
│   ├── variables.tf
│   └── backend.tf
├── staging/
│   ├── main.tf
│   └── backend.tf
└── prod/
    ├── main.tf
    └── backend.tf

On paper, this looked simple. In practice, it created three problems.

Repeated backend configuration

Every environment needed its own backend configuration. Even when the only meaningful difference was the state path, we still had to carry the same shape of config over and over again. That is manageable with three environments. It gets old very quickly with fifty.

If you need to rename a bucket, change a prefix convention, or adjust how state is laid out, the work is not conceptually hard. It is just repetitive, noisy, and error-prone.
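
To make the repetition concrete, here is a sketch of what each environment's backend.tf tends to look like in a layout like this one. The bucket name is borrowed from the Terragrunt example later in this post; the prefix value is an illustrative assumption, not our real convention.

```hcl
# dev/backend.tf (illustrative; staging/ and prod/ carry the same block
# with only the prefix changed)
terraform {
  backend "gcs" {
    bucket = "my-terraform-state" # same bucket in every environment
    prefix = "dev"                # assumed value; the only line that differs
  }
}
```

Multiply that by fifty directories and a bucket rename becomes fifty near-identical edits.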

Repeated module wiring

Spinning up a standard service stack should have felt like passing a few inputs into a known module. Instead, it often meant copying a large module block, editing a few values, and hoping nobody forgot to update the version or provider configuration in one environment.

This is where IaC gets weird. The code is declarative, but the maintenance pattern becomes manual and procedural. You end up with infrastructure by find-and-replace.
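
As a rough sketch of that pattern, this is the kind of block that got copied from one environment to the next. The module label gke_cluster is a hypothetical name; the source and inputs mirror the Terragrunt example later in this post.

```hcl
# dev/main.tf (illustrative sketch)
# staging/main.tf and prod/main.tf repeat this block almost verbatim,
# so the ref and the inputs can drift apart one copy at a time.
module "gke_cluster" {
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.2.0"

  environment = "dev"
  node_count  = 1
}
```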

Too much blast radius

The real cost of repetition is not ugliness. It is risk.

When too much infrastructure is wired together in the same shape, a routine change starts feeling bigger than it should. Review quality drops. Plans get noisy. People become conservative in the wrong places and reckless in the familiar ones. That is usually the sign that the tooling is no longer enforcing good boundaries.

What Terragrunt Changed

Terragrunt made sense for us because it solved the operational problems around Terraform, not because it replaced Terraform itself.

Its value was simple:

We could define common configuration once and inherit it.
We could keep environment-specific files focused on inputs and intent.
We could split infrastructure into smaller units with clearer state boundaries.
We could model dependencies between units without building one giant root module.

That combination mattered more than any single feature.

One place for shared configuration

We moved common backend configuration into a root terragrunt.hcl and let child units inherit it:

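remote_state {
  backend = "gcs"
  config = {
    bucket = "my-terraform-state"
    # The GCS backend treats prefix as a folder and writes <prefix>/<workspace>.tfstate,
    # so the prefix should not end in a file name.
    prefix = path_relative_to_include()
  }
}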

The point is not the exact backend. The point is that the convention lives once.

That alone removes a surprising amount of repo noise. New units start with the platform defaults instead of with a copy of whatever the last engineer happened to paste.
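
The same idea extends beyond state. As a sketch, a root file can also write the provider configuration into every unit using Terragrunt's generate block. The project and region values below are placeholders made up for illustration.

```hcl
# Root terragrunt.hcl (sketch): write a provider.tf into every child unit
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "google" {
  project = "my-project"   # placeholder
  region  = "europe-west1" # placeholder
}
EOF
}
```

With this in place, no environment needs its own hand-maintained provider file.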

Environment files became smaller and more honest

Instead of full Terraform entrypoints everywhere, we ended up with smaller terragrunt.hcl files that mostly expressed what changed between environments:

include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.2.0"
}

inputs = {
  environment = "dev"
  node_count  = 1
}

That is a better abstraction boundary.

The file is no longer pretending to be the implementation. It is declaring how this environment differs from the standard shape.
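
To show how little changes between environments, here is a sketch of what a production counterpart might look like. The file path and the node_count value are assumptions for illustration.

```hcl
# prod/gke-cluster/terragrunt.hcl (hypothetical path and values)
include {
  path = find_in_parent_folders()
}

terraform {
  source = "git::git@github.com:my-org/infra-modules.git//gke-cluster?ref=v1.2.0"
}

inputs = {
  environment = "prod"
  node_count  = 3 # assumed value; only the inputs differ from dev
}
```

A reviewer comparing the two files sees the real difference between the environments and nothing else.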

Smaller states, clearer ownership

The biggest improvement was not cosmetic. It was state isolation.

Breaking large states into smaller units changed how safe the repo felt. A plan became easier to reason about. A change touched less unrelated infrastructure. Rollout order became more explicit. We were no longer depending on one oversized root layout to coordinate everything.

That is the part people often underestimate. DRY matters, but blast radius matters more.
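
Splitting state raises the question of how units share values. A minimal sketch using Terragrunt's dependency block, assuming a hypothetical network unit that exposes a network_name output and a cluster module that accepts a network input:

```hcl
# dev/gke-cluster/terragrunt.hcl (sketch; unit, input, and output names are assumptions)
dependency "network" {
  config_path = "../network"

  # Placeholder values so validate and plan can run before the
  # network unit has ever been applied.
  mock_outputs = {
    network_name = "mock-network"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  network = dependency.network.outputs.network_name
}
```

When a command runs across several units, Terragrunt uses these blocks to work out the order, so the rollout sequence lives in the configuration rather than in someone's head.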

Why This Still Makes Sense in 2025

Yes, the landscape has moved.

HashiCorp now offers Stacks in HCP Terraform. OpenTofu continues to evolve. Module registries are better than they used to be. The ecosystem is not standing still.

But these tools are not answering exactly the same question.

Terraform Stacks are about managing stack configurations and deployments. Module registries are about publishing and versioning reusable modules. Those are useful capabilities, but they do not automatically clean up a repo that has grown around repeated environment wiring, repeated backend patterns, and awkward cross-unit coordination.

Terragrunt still earns its place when your main pain is operational structure around Terraform or OpenTofu:

You want inheritance for shared configuration.
You want clearer separation between module implementation and environment inputs.
You want dependency-aware orchestration between infrastructure units.
You want hooks and workflow glue around infra operations.
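
As a sketch of that last point, a hook can run a check before Terraform does anything. The hook name fmt_check is hypothetical, and the choice of a formatting check is only an example of the kind of glue this allows.

```hcl
terraform {
  # Run a formatting check before every plan or apply in this unit.
  # If the check fails, Terragrunt does not go on to run the command.
  before_hook "fmt_check" {
    commands = ["plan", "apply"]
    execute  = ["terraform", "fmt", "-check"]
  }
}
```

If the unit already has a terraform block for its source, the hook goes inside that same block.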

I would not call Terragrunt universally "the best" option. I would call it a very pragmatic option when your Terraform codebase is suffering from repetition and oversized execution boundaries.

That distinction matters.

The Migration Approach That Worked For Us

This kind of migration is mostly a sequencing problem.

Trying to redesign the whole estate in one go is how you create a second mess while still owning the first one. We took a more boring approach, which was also the correct one.

Stop creating new debt

First, all new services had to follow the new pattern. There is no point cleaning the old house while still extending the worst room.

Move the safer environments first

We started with lower-risk environments where mistakes were cheaper and feedback was faster. That gave us enough confidence to refine the pattern before it reached production.

Refactor state carefully

Breaking apart large states was the most delicate part of the migration. We used terraform state mv heavily while moving resources and modules into cleaner boundaries.

This is one of those operations that is straightforward in the docs and stressful in real life. The command itself is not the hard part. The hard part is coordination, review discipline, and making sure nobody is changing the same area while you are reshaping state ownership.
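
We did this with terraform state mv. For readers on newer versions, there is a declarative alternative worth knowing about, and it is not what we used: Terraform 1.5 added import blocks and Terraform 1.7 added removed blocks, which together let one state forget a resource while another adopts it. A rough sketch, with a hypothetical resource address and import ID:

```hcl
# In the old, oversized root: delete the resource block itself and add this,
# so the state stops tracking the resource without destroying it.
removed {
  from = google_container_cluster.primary

  lifecycle {
    destroy = false
  }
}

# In the configuration the new, smaller unit runs as its root module:
# adopt the existing resource into this unit's state.
import {
  to = google_container_cluster.primary
  id = "projects/my-project/locations/europe-west1/clusters/dev-cluster" # hypothetical ID
}
```

Either way, the goal is the same: both plans should show no infrastructure changes, only a change in which state owns the resource, before anyone applies.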

When I Would Not Reach for Terragrunt

Terragrunt is useful, but it is still another layer.

I would not add it just because the internet likes saying "use Terragrunt at scale." I would add it when the symptoms are clear:

Your environments mostly repeat the same module wiring.
Shared configuration is duplicated everywhere.
Your plans are noisy because state boundaries are too broad.
Engineers are afraid of routine changes for reasons that have nothing to do with the cloud resources themselves.

If none of that is true, plain Terraform is probably still the simpler and better choice.

Conclusion

So, does "Terraform versus Terragrunt" make sense?

Not really, at least not for this story.

The more accurate framing is Terraform with Terragrunt.

Terraform remained the provisioning engine. Terragrunt gave us the repo structure, inheritance model, dependency handling, and state boundaries that made the engine usable at our scale.

That is why I still think Terragrunt is relevant in 2025. Not because Terraform is weak, and not because Terragrunt is fashionable. Because once a codebase grows past a certain point, the missing thing is often not more IaC. It is better structure around the IaC you already have.