Mind the (infrastructure) Gap

Terraform is widely known for its declarative syntax and its stateful nature. Sometimes people assume this to mean that Terraform configurations serve as a perfect, codified reflection of your cloud environments. This is not the case — and such a misunderstanding is actually somewhat dangerous.

See, Terraform will only identify drift of resources it already manages. This is by design. Otherwise, what defines the boundary of what Terraform should scan? Probably an account, in the case of AWS. But it's quite common to use multiple state files for a single account. Perhaps you have a multi-tenanted environment with multiple IaC deployment pipelines. Perhaps certain resources can only be managed by certain privileged principals, and that creates a separation between your Terraform deployments by necessity. Either way, it would be above Terraform's station to assume such implementation details. Thus, if Terraform doesn't already manage it, it doesn't see it.

The trouble is that if a developer provisions something in your cloud environment by hand (or indeed, if a bad actor does, though preventive controls would help you much more in that case!), Terraform won't conveniently let you know in its next terraform plan.

Quite a glaring governance hole, right?

There actually aren't many solutions to this. The two viable options I've come across are cloud-concierge and Driftctl. The former still seems to be in its infancy and isn't heavily maintained. It does offer a lot more bells and whistles than Driftctl, such as automatic pull requests, but it's a far more heavyweight solution, not least because it runs as a container rather than as a CLI. The latter has been acquired by Snyk, though it's apparently both "in maintenance mode" and "in beta". However, I have done some testing on real environments with Driftctl myself, which I will discuss below, so hopefully you can go into it with a bit more confidence.

Driftctl

Driftctl is a CLI tool for scanning your Terraform state files against your live infrastructure. I've only used it with AWS, but it also supports Azure, GCP, and the GitHub provider. As such, this article will use AWS examples, but it should make sense in the context of other providers, too. It's pretty straightforward to kick off. One simple example:

driftctl scan \
 --from "tfstate+s3://my-bucket/my-statefile.tfstate" \
 --tf-provider-version 1.23.4 \
 --output json://my-driftctl-output.json

That will scan your whole account, and produce a JSON file that captures everything that isn't managed by Terraform, has drifted in configuration from Terraform, or is present in the state file but doesn't have a corresponding resource in your environment. Have multiple state files for a single account? No problem; you can specify those as separate --from flags, or use wildcarding. The example above uses S3 as the state source, but you can change that as required. You can specify multiple output mediums, too. The docs go into far more depth than this article ever intends to, but hopefully this makes apparent that it is fairly flexible.
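
As a rough sketch of the multi-state case, the snippet below assembles a scan command with one --from flag per state file. The build_scan_command helper, the bucket name, and the state file keys are all made up for illustration, and the command is only printed, so you can eyeball it before running anything.

```shell
# Hypothetical helper: prints a scan command with one --from flag per state key.
# Usage: build_scan_command <bucket> <state-key>...
# Assumes keys contain no spaces; bucket and keys are placeholders.
build_scan_command() {
  local bucket="$1"
  shift
  local cmd="driftctl scan"
  local key
  for key in "$@"; do
    cmd="$cmd --from tfstate+s3://${bucket}/${key}"
  done
  printf '%s --output json://my-driftctl-output.json\n' "$cmd"
}

# Printed rather than executed, so the command can be reviewed first.
build_scan_command my-bucket network.tfstate compute.tfstate
```

Keeping the list of state files in one place like this makes it harder to forget one, which matters because anything defined in a state file you didn't pass in will show up as unmanaged.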

Filtering

The other facet I'd like to talk a bit about is filtering/ignoring resources. Obviously, you want to minimise this as much as possible — generally, everything in your cloud environment should be matched by your IaC. However, there are times that you need to ignore resources. For example, AWS might provision service-linked roles/policies for you. That might be something you choose to ignore. So, in your .driftignore file (which you should keep in the directory from which you run Driftctl), you can add the following:

# Ignores policies/attachments managed by AWS. `/service-role/` is reserved by AWS, as is `AWSServiceRoleFor` at the beginning of a role name.
# Thus, a bad actor couldn't spoof an AWS-managed role/policy and match the patterns below.
aws_iam_policy.arn:aws:iam::*:policy/service-role/
aws_iam_policy_attachment.AWSServiceRoleFor*

You can find more information on .driftignore files, and on generating them, in the Driftctl documentation.

You can also use a --filter flag when you call Driftctl. This isn't as "neat" as a config file, but it's very powerful. You can use JMESPath expressions to filter resources based not just on their IDs/names, but on granular attributes as well. As another example of what might be valuable to filter: Driftctl will flag EC2 instances and EBS volumes managed by an autoscaling group, since in your Terraform configuration you've technically only declared the autoscaling group itself. So you might choose to filter those out:

# The below filters out any EC2 instance managed by an autoscaling group, as well as any EBS volumes that are attached to an EC2 instance.
# It'll still flag instances not in an ASG, or detached EBS volumes. You could use a similar trick for EIPs and other resources as required.
driftctl scan \
 --from "tfstate+s3://my-bucket/my-statefile.tfstate" \
 --tf-provider-version 1.23.4 \
 --filter \
 "!(Type=='aws_instance' && Tags.\"aws:autoscaling:groupName\"!=null) && \
 !(Type=='aws_ebs_volume' && Attachments!=null)" \
 --output json://my-driftctl-output.json

Under the hood, Driftctl runs describe API calls (in the case of AWS). So, if you see an attribute when you run, for example, aws ec2 describe-volumes, you should be able to use it in your filter. The Driftctl documentation has more information on filtering.

Caveats

There are definitely a handful of issues with Driftctl. They're minor or avoidable enough that I'd still endorse the tool, but they have led to some headaches. I wouldn't say forking it is off the cards if the need ever arises.

Maintenance Mode/Beta Shenanigans

This is the first and most pressing issue. The tool is being maintained, yes, though I wouldn't know for how long (hence the mention of forking it). It's not being actively developed, which is a shame, because with some tidying up, it could really be quite impeccable.

AWS Security Group Drift

One limitation I stumbled upon quite quickly was that Driftctl misdiagnoses security group rules as having drifted. This is a known limitation. It is preventable: the problem lies with using inline ingress and egress blocks to define rules in aws_security_group resources. If you use aws_security_group_rule resources instead, Driftctl no longer reports the false drift. Refactoring this from a code perspective is reasonably simple (albeit tedious). I've encountered such problems before, and I wrote a blog post on it a little while ago, here. Do feel free to pinch and tweak the example I discuss there. The bigger challenge is the state surgery you'll need to do. Changing security groups is most certainly something that could take your live environments down, so if you're doing this on anything in production, you will need to run terraform state rm/terraform state mv/terraform import commands (or use their corresponding configuration blocks) to make sure the change of configuration doesn't result in any changes to your live infra.
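
To give a flavour of the import side of that state surgery, here's a minimal sketch that builds the import ID for a single standalone rule and prints the commands you'd review. The security group ID, the resource address aws_security_group_rule.https_in, and the sg_rule_import_id helper are all hypothetical. The ID layout is my understanding of the AWS provider's import format for aws_security_group_rule, so do check it against the provider docs for your version.

```shell
# Hypothetical helper: builds the import ID for one aws_security_group_rule.
# Assumed format (verify against the AWS provider documentation):
#   <sg-id>_<type>_<protocol>_<from-port>_<to-port>_<source>
sg_rule_import_id() {
  printf '%s_%s_%s_%s_%s_%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder security group ID and rule details.
rule_id="$(sg_rule_import_id sg-0123456789abcdef0 ingress tcp 443 443 0.0.0.0/0)"

# Printed rather than executed: review each command before touching state.
echo "terraform import aws_security_group_rule.https_in ${rule_id}"
echo "terraform plan"
```

The plan at the end is the important bit: once every rule has been imported, it should show no changes to the security group or its rules before you let anything apply.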

Throttling

This is my biggest gripe, to be honest. Driftctl hits the AWS APIs aggressively, and presumably the APIs of any other provider you're working with. Both when running locally and in CI, I get ThrottlingExceptions thrown back at me during a Driftctl scan — which fails fast. It doesn't appear to have an exponential backoff mechanism. Realistically, you've got a few options. You can wrap your Driftctl scan in a retry mechanism of your own. This is fine, but given it fails fast, you'll be retrying an entire scan on account of potentially a single API call being throttled. The other option I've seen is to use cpulimit to restrict the CPU utilisation of the Driftctl process. Combining the two might be your best shot, but expect longer scans. This isn't mentioned in their known limitations page, but it effectively is a known problem — there's a reasonably busy GitHub issue on the topic.
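
For the retry route, a minimal sketch of a wrapper with exponential backoff might look like the following. The retry_on_status function is my own invention, not something Driftctl ships. It also assumes that a scan exits with 1 when it finds drift and 2 when it hits an error such as throttling, so that only genuine failures are retried; confirm that against the version you're running.

```shell
# Hypothetical wrapper: re-run a command with exponential backoff, but only
# when it exits with the given "retryable" status.
# Usage: retry_on_status <retryable_exit_code> <max_attempts> <initial_delay_seconds> <command...>
retry_on_status() {
  local retryable="$1" max_attempts="$2" delay="$3"
  shift 3
  local attempt=1 status
  while :; do
    status=0
    "$@" || status=$?
    # Success, a non-retryable exit code, or no attempts left: return as-is.
    if [ "$status" -ne "$retryable" ] || [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    echo "attempt ${attempt} exited ${status}; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example (commented out so nothing runs here): up to 5 attempts, starting at 30s.
# Exit code 2 is assumed to mean "scan error".
# retry_on_status 2 5 30 driftctl scan \
#   --from "tfstate+s3://my-bucket/my-statefile.tfstate" \
#   --output json://my-driftctl-output.json
```

Because the wrapper takes an arbitrary command, you could just as easily put a cpulimit-restricted invocation behind it and get both mitigations at once.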

How to (Really) Identify Drift

There's still one gotcha here that you may have noticed. Driftctl identifies drift between your live infrastructure and your state files. This isn't quite the same as identifying drift between your live infrastructure and your desired configuration. Fortunately, identifying drift between your configuration and your state files is very simple: that's literally what terraform plan does. So the final piece of the puzzle is to string these together. First, identify the drift between, say, the main branch of your repository and your state file(s) for the account with a terraform plan (maybe run terraform plan -out=tfplan, and then terraform show -json tfplan > plan.json to convert the plan into JSON). Then identify the drift between your state file(s) and your live infrastructure with Driftctl. With a bit of jq to combine these results, you can conjure yourself an exhaustive map of the drift between your desired configuration and what actually exists in your cloud environment.
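
As a sketch of that jq glue, the function below merges a plan JSON and a Driftctl JSON into one report. The combine_drift name and the report layout are hypothetical. The resource_changes structure comes from Terraform's JSON plan output, while the unmanaged, missing, and differences keys are my assumption about the Driftctl JSON output, so inspect a real scan file before relying on them.

```shell
# Inputs, produced roughly as follows:
#   terraform plan -out=tfplan && terraform show -json tfplan > plan.json
#   driftctl scan --from "tfstate+s3://my-bucket/my-statefile.tfstate" --output json://drift.json

# Hypothetical helper: merge both reports into a single JSON document.
# Usage: combine_drift <plan.json> <drift.json>
combine_drift() {
  jq -n --slurpfile plan "$1" --slurpfile drift "$2" '
    {
      # Configuration vs state: every planned change that is not a no-op.
      config_vs_state: [
        $plan[0].resource_changes[]?
        | select(.change.actions != ["no-op"])
        | {address, actions: .change.actions}
      ],
      # State vs live infrastructure (driftctl field names are assumed).
      state_vs_live: {
        unmanaged: [$drift[0].unmanaged[]? | {type, id}],
        missing:   [$drift[0].missing[]? | {type, id}],
        changed:   [$drift[0].differences[]? | .res | {type, id}]
      }
    }'
}
```

Running combine_drift plan.json drift.json > drift-report.json would then give you a single file to attach to a CI run or feed into an alert.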

And that's about all there is to it. While you shouldn't use Driftctl in place of preventive measures (like SCPs, IAM, etc.) or more proactive tools (like AWS Config), Driftctl is a solid way to quickly and exhaustively identify manual configurations that need straightening out. Try building it into a regularly-executed (e.g., daily) CI job, and I reckon you'll feel a lot more in control of your cloud environments.
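
If you do wire it into a scheduled job, a small gate like the one sketched below can turn the scan's exit code into a pass or fail for the pipeline. The drift_gate function is hypothetical, and it assumes the exit codes 0 (in sync), 1 (drift detected), and 2 (scan error), which is worth verifying for your Driftctl version.

```shell
# Hypothetical CI step: run a scan command and report the outcome.
# Assumes exit codes 0 (in sync), 1 (drift detected), 2 (scan error).
# Usage: drift_gate <command...>
drift_gate() {
  local status=0
  "$@" || status=$?
  case "$status" in
    0) echo "No drift detected." ;;
    1) echo "Drift detected: review the scan output." >&2 ;;
    *) echo "Scan failed with exit code ${status}." >&2 ;;
  esac
  # Hand the original exit code back so the CI job fails on drift or errors.
  return "$status"
}

# Example for a daily job (commented out so nothing runs here):
# drift_gate driftctl scan \
#   --from "tfstate+s3://my-bucket/my-statefile.tfstate" \
#   --output json://my-driftctl-output.json
```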
