Mastering DNS Resilience: How to Assess and Ensure Network Survivability

“It’s not DNS.”
“There’s no way it’s DNS.”
“It was DNS.”
It’s as old as the internet. My age, basically.
For most people, DNS is like electricity. We don’t celebrate its existence; we just moan and complain when it’s mysteriously absent for even a second. Such first-world problems, it seems, get less and less of the attention at the core that it takes to get DNS right.
However, this is where we (ADAMnetworks) live. We protect people. Part of that protection is sparing them the frustration and disruption that come when this stuff doesn’t work.
Let’s take a step back. In the DNS world, the original design was about recursing right back to the root servers if and when needed.
Since most end users didn’t have recursive resolvers of their own, a standard emerged that remains in place to this day: ISPs stand up caching recursive DNS servers for their subscriber base to use. At the onset of ubiquitous connectivity and the rise of social media, we witnessed nation-state level disruption in which ISPs were required to selectively block DNS queries to social media platforms. That drove the popular adoption of Google Public DNS (8.8.8.8/8.8.4.4). While it was only a very temporary route around the blockage, it served as good marketing for fast, anycast DNS servers, and now every sysadmin defaults to one of these centralized non-ISP DNS servers.
As the internet grew in scale, a healthy tension emerged between not-for-profit and for-profit enterprise, and we saw this space evolve. Take DNS encryption, for example:
According to their official statement, root DNS servers do not, and will not, offer DoT, DoH or DoQ.
And yet every commercially operated “Protective DNS Resolver”, including those that are free to use, offers DNS encryption of one form or another.
Back to the purpose of this article: DNS resilience. The healthy tension described above is what has us thinking about the best approach to making sure DNS is resilient all of the time for the people we love.
Ideally, we can keep working in any of these scenarios:
No single silver bullet addresses all of these issues, but the bulk of them can be addressed with these proven redundancies:
A lot of these resilience efforts are within the control of the enterprise or its outsourced technology partner. However, what if your DNS supply chain is attacked?
The approach is to ensure there isn’t a single point of failure anywhere at any time. Let’s borrow a character trait of the original internet design, the ability to route around any obstacle, and ensure there’s redundancy everywhere. To that end, here’s a simple checklist you can use as an audit. It assumes everything else works correctly, including a mature, bug-free DNS resolver environment. We have taken the liberty of allocating a thoughtful weighting of points per check:
Let’s dive into each of the above with more detail and examples:
A convention used in many networks is to use the last octet to distinguish the real hosts from the virtual one in a typical /24 network:
10.128.1.1 is NODE1
10.128.1.2 is NODE2
10.128.1.254 is the VIP (and the VIP is the designated DNS server at the endpoints)
Using this convention, we see the endpoint is assigned like this:
A *NIX terminal shows us this:
Similarly in Windows, we see this:
Notice that the DHCP-assigned resolver is the VIP, yet all three addresses (the VIP and both nodes) answer DNS queries from macOS/Linux:
Let’s start by querying the one and only DHCP-assigned resolver:
For good measure, let’s check directly with the real nodes behind the VIP:
And the same goes for Windows:
Since all of them responded as expected, we have a 30-point score so far.
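For readers who want to script this check rather than run it by hand, here is a minimal sketch using only the Python standard library. It sends a plain UDP A-record query to the VIP and to each real node and reports which ones answer. The addresses follow the example convention above; the function names (`build_query`, `answers`, `audit`) and the test name `example.com` are illustrative assumptions, not part of any ADAMnetworks tooling.

```python
import random
import socket
import struct

# Addresses follow the article's example convention; adjust for your network.
RESOLVERS = {"VIP": "10.128.1.254", "NODE1": "10.128.1.1", "NODE2": "10.128.1.2"}


def build_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    txid = random.randint(0, 0xFFFF)
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` replies NOERROR with at least one answer."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    # The reply must carry a full header and echo our transaction ID.
    if len(reply) < 12 or reply[:2] != query[:2]:
        return False
    _, flags, _, ancount, _, _ = struct.unpack("!HHHHHH", reply[:12])
    return (flags & 0x000F) == 0 and ancount > 0


def audit(resolvers: dict, name: str = "example.com") -> dict:
    """Query every resolver directly and map its label to True/False."""
    return {label: answers(ip, name) for label, ip in resolvers.items()}
```

Calling `audit(RESOLVERS)` from an endpoint on the LAN returns one boolean per address; under the scoring described here, the check passes only when all three are `True`.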
An important design element in the enterprise is that endpoints never reference Active Directory DNS directly. Instead, designated resolvers know to consult AD DNS only for domains where AD DNS is authoritative. With that understanding, here’s how this check can be validated, depending on the environment:
We use an SRV query to validate that such records exist and are answered via our DHCP-assigned single DNS server:
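As a rough illustration of such an SRV check, the sketch below builds an SRV query for the well-known domain controller locator record `_ldap._tcp.dc._msdcs.<domain>` and counts the answers returned by the resolver. The helper names and the domain `corp.example.com` in the usage note are hypothetical, and the fixed transaction ID is only acceptable for a one-off audit, not for production use.

```python
import socket
import struct

SRV = 33  # DNS record type number for SRV


def srv_query(name: str, txid: int = 0x1A2B) -> bytes:
    """Build a DNS query packet asking for SRV records of `name`."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in name.split("."))
    return header + qname + b"\x00" + struct.pack("!HH", SRV, 1)


def srv_answer_count(reply: bytes, txid: int = 0x1A2B) -> int:
    """Return the number of answer records, or 0 on error or mismatch."""
    if len(reply) < 12:
        return 0
    rid, flags, _, ancount, _, _ = struct.unpack("!HHHHHH", reply[:12])
    if rid != txid or (flags & 0x000F) != 0:  # wrong ID or non-zero RCODE
        return 0
    return ancount


def check_ad_srv(resolver: str, ad_domain: str, timeout: float = 2.0) -> int:
    """Ask `resolver` for the AD domain controller SRV records."""
    name = f"_ldap._tcp.dc._msdcs.{ad_domain}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(srv_query(name), (resolver, 53))
            reply, _ = sock.recvfrom(4096)
        except OSError:  # timeout or unreachable resolver
            return 0
    return srv_answer_count(reply)
```

For example, `check_ad_srv("10.128.1.254", "corp.example.com")` would return the number of SRV records answered via the VIP. A non-zero count shows the records resolve through the designated resolver; it does not by itself prove that both domain controllers were consulted, which is what the resolver logs are for.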
The important observation is whether both AD servers were queried for the DNS query itself, which is captured in the logs of adam:ONE and/or a centralized SIEM. Notice that the query was forwarded to two (2) domain controllers:
In DNSharmony, multiple resolver sets can be created and then used to harmonize (if any protective resolver blocks an FQDN, the answer is blocked). Here’s an example:
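The "any block wins" rule can be sketched in a few lines. This is not DNSharmony's implementation or configuration format, only an illustration of the logic. It assumes a protective resolver signals a block by answering with the unspecified address (`0.0.0.0` or `::`); real resolvers signal blocks in different ways, and timeouts should be filtered out by the caller so that an unreachable resolver is not mistaken for a block.

```python
from typing import Iterable, Optional

# Assumption: these answers mean "blocked". Real protective resolvers vary.
BLOCK_MARKERS = {"0.0.0.0", "::"}


def harmonize(answers: Iterable[str]) -> Optional[str]:
    """Return the first answer, or None if ANY resolver blocked the name.

    `answers` holds one answer per resolver in the set that responded.
    """
    answers = list(answers)
    if not answers or any(a in BLOCK_MARKERS for a in answers):
        return None
    return answers[0]
```

With two resolvers in a set, `harmonize(["93.184.216.34", "0.0.0.0"])` yields `None` because one of them blocked the name, while two clean answers yield the first of them.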
Here’s how it is then applied to the policy itself:
Let’s use VyOS as an example where a load-balanced WAN setup can be verified:
Using the above example, WAN1 runs over eth0 and WAN2 runs over eth1, and as long as the upstream resolver pairs are split-routed, no DNS outage is experienced.
When more than one ISP is being used, the DNS resolver sets should also be distributed across them (in most cases). Here’s an example of how they are split between two ISPs for the same resolver set (9.9.9.9 first):
Second, let’s review the path to 149.112.112.112:
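One way to automate a split-routing check on a Linux-based router is to compare the egress interface the kernel picks for each resolver. The sketch below parses the output of `ip route get` (iproute2) and is a different technique from reviewing the path hop by hop. It assumes each resolver is pinned to a WAN by ordinary routes; where a load balancer steers traffic with policy rules or packet marks, `ip route get` may not reflect the real path. The function names are illustrative.

```python
import re
import subprocess
from typing import Callable, Optional


def egress_interface(route_output: str) -> Optional[str]:
    """Extract the interface name after `dev` in `ip route get` output."""
    match = re.search(r"\bdev\s+(\S+)", route_output)
    return match.group(1) if match else None


def route_get(ip: str) -> str:
    """Run `ip route get <ip>` (Linux, iproute2) and return its output."""
    return subprocess.run(
        ["ip", "route", "get", ip], capture_output=True, text=True, check=True
    ).stdout


def is_split_routed(
    primary: str, secondary: str, lookup: Callable[[str], str] = route_get
) -> bool:
    """True if the two resolvers leave through different interfaces."""
    a = egress_interface(lookup(primary))
    b = egress_interface(lookup(secondary))
    return a is not None and b is not None and a != b
```

On the router itself, `is_split_routed("9.9.9.9", "149.112.112.112")` should return `True` when one resolver leaves via eth0 and the other via eth1. The `lookup` parameter exists so the parsing can be exercised with captured output instead of a live routing table.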
Finally, if and when such redundancies fail at any stage, there must be a monitoring instance that alerts the network engineering team to the failure. What better way to test than to repeatedly confirm that the DNS services are running on each node?
For this reason, our managed clients have a listener on localhost that can be systematically monitored by querying the domain mytools.management, which resolves to the LAN interface from which it was queried. Any non-answer or public IP answer means the DNS services are failing. This can be automated by integrating tools such as Zabbix or Cronitor alerts.
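The pass/fail rule described here (a private LAN address is healthy; no answer or a public address is a failure) can be sketched as a small probe that a scheduler or monitoring agent could call. The function names are hypothetical, and `socket.gethostbyname` goes through the operating system's resolver configuration, so this assumes the host is configured to use the resolver under test. Note that Python's `is_private` also treats loopback addresses as private.

```python
import ipaddress
import socket
from typing import Optional


def is_healthy_answer(answer: Optional[str]) -> bool:
    """Healthy only if the answer parses as a private (LAN) IP address."""
    if answer is None:
        return False
    try:
        return ipaddress.ip_address(answer).is_private
    except ValueError:  # not an IP address at all
        return False


def probe(name: str = "mytools.management") -> Optional[str]:
    """Resolve `name` via the system resolver; None means no answer."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # resolution failure or timeout
        return None
```

A monitoring job could then alert whenever `is_healthy_answer(probe())` is `False`.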
One more feature of adam:ONE, the on-premises caching resolver service, is an automatic back-off for non-responsive resolvers. This allows for redundancy without sacrificing performance. Most importantly, it facilitates resilience even during typical supply-chain problems (ISPs, public resolvers, etc.).
UPDATE: including a diagram that shows a sample of a fully resilient network that meets all the requirements above:
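To make the idea of back-off concrete, here is a generic sketch of exponential back-off for upstream resolvers. It is not adam:ONE's actual algorithm, which the article does not describe. The class name, the doubling schedule, and the `base` and `cap` values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterable, List


class ResolverBackoff:
    """Track failing resolvers and skip them for a growing interval."""

    def __init__(
        self,
        resolvers: Iterable[str],
        base: float = 1.0,   # first back-off delay in seconds (assumed)
        cap: float = 60.0,   # longest back-off delay in seconds (assumed)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolvers = list(resolvers)
        self._base, self._cap, self._clock = base, cap, clock
        self._failures = {r: 0 for r in self._resolvers}
        self._retry_at = {r: 0.0 for r in self._resolvers}

    def available(self) -> List[str]:
        """Resolvers that are not currently backed off, in original order."""
        now = self._clock()
        return [r for r in self._resolvers if self._retry_at[r] <= now]

    def record_failure(self, resolver: str) -> None:
        """Double the delay on each consecutive failure, up to the cap."""
        self._failures[resolver] += 1
        delay = min(self._cap, self._base * 2 ** (self._failures[resolver] - 1))
        self._retry_at[resolver] = self._clock() + delay

    def record_success(self, resolver: str) -> None:
        """A good reply clears the failure count and any pending delay."""
        self._failures[resolver] = 0
        self._retry_at[resolver] = 0.0
```

A forwarding loop would send each query only to `available()` resolvers, so a dead upstream stops costing a timeout on every lookup while still being retried periodically.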
*** This is a Security Bloggers Network syndicated blog from ADAMnetworks® Blog – ADAMnetworks authored by David. Read the original post at: https://support.adamnet.works/t/mastering-dns-resilience-how-to-assess-and-ensure-network-survivability/1267