Happy Halloween. It wouldn't be right for me to not make a Halloween-themed post, which is why today we're talking about 👻 networking on AWS 👻. If you need me, I'll be hiding in the cupboard, illuminated only by the faint glow of my Route 53-themed pumpkin.
When establishing networks, whether on AWS or in any other context, it's common to want to link networks together to privately share utilities, establish intranet-wide network appliances, and/or consolidate your internal traffic. On AWS, this can even save you a significant amount of money; for example, if you'd like to centralise your VPC interface endpoints. AWS can certainly accommodate that: VPC peering, Transit Gateway (TGW), or in some cases, subnet sharing, can all be used to achieve some manner of inter-network communication, depending on your use case. What often springs to mind then are route tables, security groups, and network access control lists (NACLs), the combination of which lets you securely route traffic between peered or TGW-attached VPCs. And for IP-level stuff, that's all good. But you might also want DNS, so that your systems can resolve your internal services by hostname, instead of having to use static IPs everywhere.
Most people are familiar with Route 53 public hosted zones; if you've ever run a public-facing web app, service, or anything else behind your own domain on AWS, you've probably set one up. Relatively few people have worked with private hosted zones (PHZs; side note: yes, you could argue "public hosted zones" should have the same acronym, but there seems to be a general assumption that if "private" isn't explicitly specified, it's a public hosted zone. So PHZ explicitly means "private"). Rather than being internet-facing, these are attached to your VPC, and their records can only be resolved by the DNS server for the VPC(s) they're associated with.
Some quick Q&As before we continue:
- Q: The DNS server for the VPC?
- A: Indeed. Any DNS queries made by resources in your VPC are routed here. It applies any resolver rules and resolves any private records you've associated with it (i.e., records in any associated PHZ), and falls back to the public DNS name servers if you don't have your own internal record for the name.
- Q: Is this actually in the VPC?
- A: Yes. Well, sort of. The VPC resolver is accessible at `169.254.169.253` (link-local), at the VPC CIDR base + 2 (e.g., `10.0.0.2` for a `10.0.0.0/16` VPC), and at `fd00:ec2::253` for IPv6. Also note that for VPCs with multiple CIDR blocks, the resolver IP is in the primary CIDR; AWS also reserves CIDR base + 2 for each subnet. See more info here.
- Q: Wait, so how can DNS queries even reach the DNS server? Do I need to configure security group/NACL rules to enable this?
- A: No. AWS abstracts that. You don't need to configure security groups or NACLs for the resolver, but you do for resolver endpoints (which we'll talk about shortly).
- Q: Can I use my own DNS server?
- A: Yes, by changing your VPC's associated DHCP option set (quick sketch after these Q&As).
- Q: You said "VPC(s)". You can associate a PHZ with multiple VPCs?
- A: Yes, and that'll be a major focal point for the remainder of this article.
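On the DHCP point, here's a minimal Terraform sketch of pointing a VPC at your own DNS servers. The resolver IPs and resource names here are hypothetical:

```hcl
# Hypothetical custom DNS servers, handed out to the VPC via DHCP.
resource "aws_vpc_dhcp_options" "custom_dns" {
  domain_name_servers = ["10.0.0.10", "10.0.1.10"]
}

# Swap the VPC onto the new option set.
resource "aws_vpc_dhcp_options_association" "main" {
  vpc_id          = var.vpc_id
  dhcp_options_id = aws_vpc_dhcp_options.custom_dns.id
}
```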
So, going back to the original point: when you link your networks together using VPC peering, TGW, and co., can you resolve DNS hostnames across VPCs as well? Yes, you can, by associating PHZs with multiple VPCs. You can even associate multiple PHZs with a single VPC (i.e., there's a many-to-many relationship between them), provided:
- The VPC has DNS support enabled.
- Your associated PHZs don't have overlapping namespaces (though you can use subdomains; the resolver will pick the most specific match).
There are a few different ways you can do this, though:
PHZ Association
I'd liken this approach to your "VPC peering" in the way of linking networks. Simple; free (well, the hosted zones themselves cost $0.50/mo, but associating them is free); good for a few networks; probably more of a faff to manage at scale. To offer something of an example in Terraform:
resource "aws_route53_vpc_association_authorization" "vpc_x_can_associate_phz_y" {
vpc_id = var.vpc_id
zone_id = var.route_53_phz_id
}
Very simple. Your provider for this would be the account that owns the PHZ. It's basically saying "I'll let this VPC associate me with it, so it can resolve my DNS records". So then to establish the association itself:
resource "aws_route53_zone_association" "vpc_x_phz_y_association" {
vpc_id = var.vpc_id
zone_id = var.route_53_phz_id
}
From a security standpoint, it would make sense to run these two blocks with separate IAM roles (i.e., completely isolated privilege boundaries). To be completely honest, PHZs aren't an access control mechanism in themselves; just because you can resolve a DNS record within one doesn't mean you can actually reach the target IP/resource. That's where your security groups/NACLs come in. So you could technically use a cross-account IAM role to do this, using the assume_role configuration block in the AWS provider, with a depends_on to ensure the authorisation is created before the association.
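To make that concrete, here's a minimal sketch of the cross-account pattern. The role ARN, account ID, and region are hypothetical; the default provider operates in the PHZ-owning account (Account A), while the aliased provider assumes a role in the VPC-owning account (Account B):

```hcl
# Default provider: the account that owns the PHZ (Account A).
provider "aws" {
  region = "eu-west-2"
}

# Aliased provider: assumes a hypothetical role in the VPC-owning account (Account B).
provider "aws" {
  alias  = "vpc_owner"
  region = "eu-west-2"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/phz-associator" # hypothetical
  }
}

# Account A authorises the foreign VPC to associate with its PHZ...
resource "aws_route53_vpc_association_authorization" "auth" {
  vpc_id  = var.vpc_id
  zone_id = var.route_53_phz_id
}

# ...then Account B establishes the association itself.
resource "aws_route53_zone_association" "assoc" {
  provider = aws.vpc_owner
  vpc_id   = var.vpc_id
  zone_id  = var.route_53_phz_id

  depends_on = [aws_route53_vpc_association_authorization.auth]
}
```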
```mermaid
flowchart TD
    subgraph AccountA["Account A"]
        PHZ["PHZ: example.com"]
        VPC_A["VPC A (Hub)"]
        VPC_B["VPC B"]
        IAM_A["IAM Role A; has association permissions in Account A"]
        VPC_A -->|Associated by IAM Role A| PHZ
        VPC_B -->|Associated by IAM Role A| PHZ
    end
    subgraph AccountB["Account B"]
        VPC_C["VPC C"]
        IAM_B["IAM Role B; has association permissions in Account B"]
    end
    PHZ -->|Association Authorisation sent by IAM Role A| VPC_C
    VPC_C -->|Association Established by IAM Role B| PHZ
```
Personally, I would say that in most (i.e., not mega-scale) instances, you can get away with making clever use of IaC tools to implement this in a relatively simple hub-and-spoke style way. However, it's not as simple to manage as the alternative which I discuss below...
Route 53 Profiles
Profiles essentially serve as a "box" in which to place various DNS bits (PHZs, interface VPC endpoints, resolver rules, and DNS firewall rule groups) and share them across accounts as a single, self-contained unit. Again, I'll emphasise: this just means you can resolve them. Interface VPC endpoints are, of course, interfaces, which have security groups and subnet NACLs. Sharing them with profiles doesn't mean a resource can route through to or use them; just that it can resolve queries against their associated DNS records.
If you have a lot of these sorts of resources that you'd like to share to a lot of VPCs, or you'd just like the simplest way of sharing/centralising DNS configurations, this is the way to go. Amazon specifically recommends that if you're looking at >300 PHZ associations, you might want to look at using profiles instead (source). That said, they are somewhat costly (see below), so if you have that many "spoke" VPCs and you'd rather keep things cheap at the cost of some more complexity, you can still request a quota increase and stick with direct associations.
Profiles cost $0.75 per hour for the first 100 VPC associations, then every additional VPC association over 100 is an extra $0.0014/hour ($0.14 per hour per 100).
So, let's say you've hit your soft ceiling of 300 VPC associations:
- $0.75/hour for your first 100 VPCs.
- $0.14/hour * 2 for your next 200 VPCs = $0.28/hour.
- $0.75/hour + $0.28/hour = $1.03/hour.
- $1.03 * 730 (average hours in a month) = ~$751.90/mo, pre-tax.
If you're looking at a large-scale cloud architecture, that may be well within affordability range. Where I wouldn't recommend it is if you're only looking at a few VPCs.
Let's say this time you have 5 VPCs you'd like to associate your profile with:
- $0.75/hour for your first 100 VPCs. Even if you only use 5 of that 100, it's still $0.75/hour.
- $0.75 * 730 (average hours in a month) = ~$547.50, pre-tax.
It really only becomes justifiable with scale. Remember what we discussed in the last section: IaC can be used to automate a lot of the tedium in handling direct PHZ-to-VPC associations (and from my own experimentation, that works reasonably nicely).
Some things to highlight here, though:
- You can use AWS RAM to share a profile with other accounts to use within the same region. The VPCs you associate with the profile count towards your billed number; e.g., 50 associated VPCs in account A and 50 associated VPCs in account B add up to 100 associated VPCs towards your billed amount. So that's still $0.75/hour.
- If you want multiple profiles (for example, if certain accounts need different DNS configurations), their associations also count towards your billed amount. So, if you have 5 profiles, each with 20 associations, you're simply charged $0.75/hour for the total of 100.
- If you want profiles in multiple regions, that's where the tally resets. Even if you have 20 associations in region A and 80 associations in region B, you'll be charged $1.50/hour: $0.75/hour for each region.
For more info, see here.
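If you'd like to see roughly what this looks like in Terraform, here's a minimal sketch. The names are mine, and I'm assuming a recent AWS provider version, since the Route 53 Profiles resources are comparatively new:

```hcl
# The profile: a shared "box" of DNS configuration.
resource "aws_route53profiles_profile" "shared_dns" {
  name = "example-profile" # hypothetical
}

# Place an existing PHZ inside the profile.
resource "aws_route53profiles_resource_association" "phz" {
  name         = "example-phz" # hypothetical
  profile_id   = aws_route53profiles_profile.shared_dns.id
  resource_arn = aws_route53_zone.example.arn # assumes an existing PHZ resource
}

# Associate the profile with a VPC, which can then resolve everything in it.
resource "aws_route53profiles_association" "hub_vpc" {
  name        = "hub-vpc" # hypothetical
  profile_id  = aws_route53profiles_profile.shared_dns.id
  resource_id = var.vpc_id
}
```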
```mermaid
flowchart TD
    subgraph AccountA["Account A"]
        R53Profile["Route 53 Profile: example-profile (PHZs + Resolver Rules + VPC DNS Records)"]
        VPC_A["VPC A (Hub)"]
        VPC_B["VPC B"]
        IAM_A["IAM Role A; has profile association & RAM sharing permissions in Account A"]
        VPC_A -->|Associated by IAM Role A| R53Profile
        VPC_B -->|Associated by IAM Role A| R53Profile
    end
    subgraph AccountB["Account B"]
        VPC_C["VPC C"]
        IAM_B["IAM Role B; has profile association permissions in Account B"]
    end
    R53Profile -->|IAM Role A shares profile via RAM| AccountB
    VPC_C -->|IAM Role B associates RAM-shared profile| R53Profile
```
Inbound/Outbound Resolver Endpoints
This one's a little different, since you're not sharing the hosted zone directly. The query isn't actually being resolved in your own VPC; it's passed along.
I wouldn't really recommend this one for pure AWS workloads; the two above are probably much simpler and/or cheaper if that's the case. This is more appropriate (and is really your only AWS-native option) for hybrid workloads, e.g., on-prem/AWS, where you have either a Site-to-Site VPN or a Direct Connect connection. In this example:
- If you'd like your on-premises workloads to be able to resolve DNS queries against your VPC (i.e., records in a PHZ), you want to configure inbound resolver endpoints in your VPC.
- If you'd like your AWS workloads to be able to resolve DNS queries against your on-premises DNS servers, you want to configure outbound resolver endpoints in your VPC.
These are just ENIs that sit in your VPC, and are by extension associated with your VPC's DNS resolver. However, for availability, you must specify two separate IP addresses, and you must therefore have two ENIs for any given inbound or outbound endpoint. The IP will either be allocated for you, or you can choose one from the pool available for the subnet you specify for the ENI (you can support IPv4, IPv6, or dual-stack on a single ENI). Make sure you provide subnets in two separate AZs, otherwise it's a bit pointless!
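As a rough Terraform sketch (the names, variables, and security group are all mine), an outbound endpoint might look like this:

```hcl
resource "aws_route53_resolver_endpoint" "outbound" {
  name      = "on-prem-forwarder" # hypothetical
  direction = "OUTBOUND"

  # Controls what the endpoint's ENIs can send/receive (more on this below).
  security_group_ids = [aws_security_group.resolver_endpoint.id]

  # Two IPs in subnets in different AZs, per the availability requirement.
  ip_address {
    subnet_id = var.private_subnet_a_id # AZ 1
  }

  ip_address {
    subnet_id = var.private_subnet_b_id # AZ 2
  }

  protocols = ["Do53"]
}
```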
Like essentially any other ENI, you can associate a security group with it, and associate a NACL with its parent subnet. As such, unlike the VPC resolver itself, you do need to configure them to enable DNS queries to flow through from your resources; likewise, your resources must allow outbound traffic on their own security groups. The rules you specify depend on the protocol you choose for your endpoints (I'll sketch a matching security group after this list), those being:
- Do53: Standard DNS on TCP/UDP port 53. So, you'll need to whitelist those.
- DoH: DNS wrapped in HTTPS; more secure, but also slightly slower and harder to monitor without TLS inspection and such. So, you'll need to support connections on TCP port 443.
- DoH-FIPS: DoH, but if you require FIPS compliance. Same config as DoH essentially (but probably a bit slower).
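For Do53, for instance, a matching security group might look like the following sketch. The CIDR variable is hypothetical; scope it to wherever your queries actually originate or terminate:

```hcl
resource "aws_security_group" "resolver_endpoint" {
  name        = "resolver-endpoint-do53" # hypothetical
  description = "DNS (Do53) to/from the on-premises network"
  vpc_id      = var.vpc_id

  # Inbound endpoints: allow queries arriving from on-prem.
  ingress {
    description = "DNS over UDP"
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = [var.on_prem_cidr] # hypothetical
  }

  ingress {
    description = "DNS over TCP"
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = [var.on_prem_cidr]
  }

  # Outbound endpoints: allow forwarded queries out to the on-prem resolver.
  egress {
    description = "DNS over UDP"
    from_port   = 53
    to_port     = 53
    protocol    = "udp"
    cidr_blocks = [var.on_prem_cidr]
  }

  egress {
    description = "DNS over TCP"
    from_port   = 53
    to_port     = 53
    protocol    = "tcp"
    cidr_blocks = [var.on_prem_cidr]
  }
}
```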
Okay, but how do my resources know when to send queries to a resolver endpoint?
That's where your resolver rules come in, which determine what should go where, and how. For instance, "I'd like queries for example.com to go to my outbound resolver endpoint, which forwards the request on to my on-premises DNS resolver". There are three different types of resolver rules:
- Forwarding Rules: Forward the query along to a resolver endpoint, if it ends in the domain specified in the rule (sketch after this list).
- System Rules: These override forwarding rules. So, if you wanted to forward queries to `example.com` to a resolver endpoint (using a forwarding rule), but not queries to `test.example.com`, you would create a system rule for the latter.
- Delegation Rules: This one's a bit tricky. Rather than telling your resolver to point any queries for a particular domain to a specific endpoint, it's more like saying, "if you ever need to resolve this nameserver, i.e., because you discovered it during resolution, then go through this endpoint". This might be useful if, for example, your parent zone is in Route 53, but you've delegated a subdomain to on-prem name servers.
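Here's a sketch of a forwarding rule, wired up to the hypothetical outbound endpoint from earlier (the target IP stands in for your on-premises resolver):

```hcl
resource "aws_route53_resolver_rule" "forward_example_com" {
  name                 = "forward-example-com" # hypothetical
  domain_name          = "example.com"
  rule_type            = "FORWARD"
  resolver_endpoint_id = aws_route53_resolver_endpoint.outbound.id

  target_ip {
    ip   = "192.0.2.53" # hypothetical on-prem resolver
    port = 53
  }
}

# A rule does nothing for a VPC until it's associated with it.
resource "aws_route53_resolver_rule_association" "this_vpc" {
  resolver_rule_id = aws_route53_resolver_rule.forward_example_com.id
  vpc_id           = var.vpc_id
}
```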
Price-wise, it's $0.125/hour/ENI, so $0.25/hour given the requirement for 2 ENIs/endpoint. That's ~$182.50/mo, pre-tax, for a single outbound or inbound resolver endpoint.
Fortunately, one endpoint can serve multiple VPCs across multiple accounts within one region (the resolver rules that point at it can be shared using AWS RAM), so if you only need to point to one resolver location, and operate in one region, one endpoint will be enough.
Bear in mind you still have to configure security groups/NACLs to allow traffic from wherever you share the endpoint, and you will need a corresponding resolver rule in the VPCs that point to it (yes indeed, you could associate a Route 53 profile, as we talked about above, to share these resolver rules, rather than creating a new one for every VPC).
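Sharing a rule with another account via RAM might look roughly like this; the account ID is hypothetical, and you could equally share to an OU or your whole organisation:

```hcl
resource "aws_ram_resource_share" "resolver_rules" {
  name                      = "shared-resolver-rules" # hypothetical
  allow_external_principals = false
}

# Put the forwarding rule from earlier into the share.
resource "aws_ram_resource_association" "forward_rule" {
  resource_arn       = aws_route53_resolver_rule.forward_example_com.arn
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}

# Grant a hypothetical spoke account access to the share.
resource "aws_ram_principal_association" "spoke_account" {
  principal          = "111111111111" # hypothetical account ID
  resource_share_arn = aws_ram_resource_share.resolver_rules.arn
}
```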
In sum:
- Your resource makes a DNS query for `example.com`.
- Your VPC's resolver consults any rules it has for `example.com`.
- That rule might then (in the case of a forwarding rule for `example.com`) forward the query to an outbound resolver endpoint.
- The endpoint forwards the DNS query over the Site-to-Site VPN/Direct Connect connection to your on-premises resolver.
- The resolved query comes back to your requesting resource.
If you wanted to do the same in reverse, you'd have an inbound resolver endpoint in your VPC, and a forwarding rule on your on-premises DNS server pointing the relevant queries to it.
Source for this image is here. Frankly, no diagram I could produce demonstrates it as well as this does, but it essentially captures what I've been describing.
...This has been a bit verbose. While all of this would absolutely work between VPCs in a pure AWS context, I'm not sure it would make as much sense as the first two options.
Miscellanea
Some supplemental information for completeness:
- Remember that you cannot have overlapping domain space on any VPC resolver. Further, some resources (such as VPC endpoints) can establish their own DNS records; when you create them, AWS adds entries to your VPC's DNS resolver under the hood. If you're using profiles, you can take advantage of this (because they allow you to share VPC endpoint DNS entries), but if not, you can't share those entries, because you don't control the underlying PHZs. You'll want to specifically disable private DNS on those resources, create your own PHZs with records that point to them, and then share those PHZs (sketch after this list). It's a bit of a gotcha.
- DNS firewall rule groups: These are also shareable with Route 53 profiles. They're a little out of scope for this post, but in essence, they allow you to control what DNS queries can leave your VPC (by default, anything can, even if the originating resources don't actually have a path to wherever the record resolves to). For instance, if there are certain domains you know to be malicious, you can block or alert on those queries being made.
- Outposts: Very much its own topic, but Outposts lets you run AWS infrastructure and services on-premises. Route 53 is among the many services on offer, primarily in the form of Route 53 Resolver and resolver endpoints. Both mirror what you'd see with your VPC's resolver and resolver endpoints, so handling hybrid DNS setups is much more seamless.
- Query logging: Exactly as it sounds, you can log all of the DNS queries made from your VPC's DNS resolver, and send them to CloudWatch/S3/Kinesis Data Streams.
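To illustrate that VPC endpoint gotcha from the first bullet, here's a rough sketch of the usual workaround for a centralised interface endpoint. The service, region, and variables are hypothetical: disable the endpoint's AWS-managed private DNS, stand up a PHZ you control on the service's domain, alias it to the endpoint, and then share that PHZ like any other:

```hcl
# Interface endpoint with AWS-managed private DNS disabled.
resource "aws_vpc_endpoint" "execute_api" {
  vpc_id              = var.hub_vpc_id
  service_name        = "com.amazonaws.eu-west-2.execute-api" # hypothetical
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.endpoint_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = false
}

# A PHZ we control, mirroring the service's private DNS name.
resource "aws_route53_zone" "execute_api" {
  name = "execute-api.eu-west-2.amazonaws.com" # hypothetical

  vpc {
    vpc_id = var.hub_vpc_id
  }
}

# Alias the zone apex to the endpoint's regional DNS name.
resource "aws_route53_record" "execute_api" {
  zone_id = aws_route53_zone.execute_api.zone_id
  name    = aws_route53_zone.execute_api.name
  type    = "A"

  alias {
    name                   = aws_vpc_endpoint.execute_api.dns_entry[0].dns_name
    zone_id                = aws_vpc_endpoint.execute_api.dns_entry[0].hosted_zone_id
    evaluate_target_health = false
  }
}
```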
Comparison
Here's a very quick summary of how these options stack up against each other:
| Solution | Implementation Complexity | Maintainability | Cost | Use Case |
|---|---|---|---|---|
| PHZ Associations | 🌤️ Simple | 🌦️ Could get painful at scale | 🌤️ Free* | Relatively small-scale, IaC-managed workloads |
| Route 53 Profiles | 🌤️ Simple | 🌤️ Fairly simple, regardless of scale | ⛈️ Quite costly | Large-scale workloads where simplicity and speed of implementation trumps cost |
| Resolver Endpoints | ⛈️ Comparatively complex | 🌦️ Scalable, but still a bit complicated | 🌦️ Somewhat costly | Hybrid workloads; resolving DNS queries against your VPCs from on-prem, or vice-versa |
*Hosted zones themselves cost money ($0.50/mo), but not the association of them
If you'd like a more tangible example of how each of these options would work in practice, I've compiled a repository that demonstrates the latter two (PHZ association is very simple, as you saw in the snippet above), with Terraform to spin it up yourself (though do bear in mind, profiles and inbound/outbound resolvers will run you up a bill). You can find it here.
And that's about it. There's plenty of other fun DNS bits I'd love to talk about. Cloud Map is one for another day, and I also haven't spoken much about how setting up your own DNS resolver on AWS would work (that's very much its own beast). For now, have a think about the architectures you manage: how and where does DNS resolution happen? What about your traffic in general; could you consolidate it in any way? I'm willing to bet you could, and not only improve control and observability, but save some big bucks along the way.