DNS issues? Five practical strategies to remove single points of failure
Michael Ade-Kunle
Posted on November 19, 2019
June/July 2019’s Cloudflare incidents got the world thinking about additional safeguards against ‘unlikely’ DNS failure. The article below sheds light on five strategies for coping with these unlikely - but possible - DNS failures, as well as general advice for service reliability.
You may think a domain name system (DNS) outage is a rare possibility for your company. However, recent statistics paint a different picture. For example, a survey published in October 2018 found that 68% of the top 50 Fortune 500 companies used only a single provider to serve their DNS records.
The research concluded that nearly three-quarters (72%) of companies polled were vulnerable to DNS attacks. Besides businesses using only one provider, some only have DNS servers on an internal network.
Political issues can also cause DNS problems. This happened in 2004 with the national domain in Slovakia, which got compromised.
The article below describes five things you can do to get rid of single points of failure associated with a DNS.
Strategy 1: Understand root server issues
While root server issues are not really possible to resolve, the likelihood of them occurring is incredibly low. Knowing what could go wrong aids decision-making.
The first thing we need to understand about DNS is that, at the core, there are 13 root name servers that are ultimately responsible for delegation of every single domain. If these go down, once the time-to-live (TTL) dates for domains expire, the domain name system as we know it is down. The TTL is like an expiration date that tells a local resolver or recursive server how long to keep a DNS record in its cache.
The Internet Corporation for Assigned Names and Numbers (ICANN) is the group that oversees the interconnected link of unique identifiers that lets all internet-connected computers find each other. However, as you can see from the list below, it’s responsible for providing service for only one of these 13 root name servers. Otherwise, it has delegated resolution to 12 independent organizations spread around the globe.
List of Root Servers
Source https://www.iana.org/domains/root/servers, visually displayed at https://root-servers.org/
While the risk of the entire domain name system being unavailable is low, it is not impossible due to the following reasons:
i) Governmental control
While theoretically as of 2016 this is no longer possible since the US passed over control, given 10 out of 13 root servers are in the US, it’s still plausible. Only Netnod (Swedish), RIPE (European), and WIDE (Japanese) are domiciled outside the US.
ii) DDoS attacks
Given the scale of commercial organisations such as Verisign and Cogent, and the fact they have American government backing, this seems unlikely to be achievable without tremendous cost. All IP addresses support multicast which means each IP may represent any number of server endpoints. This allows traffic to be serviced close to the request, and also allows for attacks to spread across multiple servers. In 2007, when Anycast was only partially implemented on root servers, servers running with Anycast saved the day.
iii) Spoof / IP hijacking
This is hard to do at scale, easier in smaller closed networks. In these cases, such problems are not feasibly resolvable at an organizational level. Similarly, the likelihood of them happening is far lower than other possibilities.
Given the above, the likelihood of a root-level domain resolution for a gTLD (generic) such as .com or .info or a ccTLD (countries, two-letter codes) .uk or .jp domain being resolved to an authoritative DNS server is pretty much guaranteed.
Apart from closed network attacks, having any strategy to circumvent an attack on the global root servers is probably not worth it, as all the other services a user/server relies on will probably be unavailable anyway, i.e., the end-to-end service will still be down even if your service is somehow still up.
To summarize Strategy 1: The best approach to take in this case is to be aware of the root server issues that are possible. Outside of the broad categories for such attacks, stay mindful of specific trends that may pose higher-than-average risks for this kind of failure.
Strategy 2: Know your issues with TLD (top-level domain) authoritative servers
The top-level domain is the part of an internet address that appears after the period in the address, such as .com, .gov or .edu. In terms of availability, not all domains are equal - Chinese root level domain went down in Aug 2013. The key thing to bear in mind is the smaller the domain, the less likely it has infrastructure robust enough to defend against attacks.
Conversely, Verisign operates .com and has clearly invested heavily in its Internet service infrastructure, for the two root servers (A&J), and as well as .com, .name, .net
Once again, you can’t really do much about a TLD going down, but you can choose a TLD that is more likely to remain up under a large-scale attack, or even as a result of a software fault.
In this case, you can and should put your resources toward researching which TLDs are most likely to go down versus which ones are most reliable. The .io domain has a few horror stories about going down, being compromised etc. The domain is also controversial politically. In fact, researching this article has led us to purchase the ably.com and ably.net domains, and we’ll be migrating over domains from *.ably.io to these domains in due course.
In general terms, it's worth remembering that you are in control of your DNS, and decisions you make may affect reliability and availability of your domain. Historically, companies ran their own authoritative name servers. Given reliability issues of running DNS servers yourself, it was common for companies to run a primary name server within their network, and run a secondary for another business. That business, in turn, would run your secondary. Once cloud infrastructure emerged, DNS-as-a-service provided DNS resolution and authoritative name servers.
By using Anycast, DNS providers were able to dramatically improve the reliability of the authoritative name servers by resolving DNS queries close to the originating DNS client and also resolving a few authoritative name server IPs to numerous backend servers. Anycast, used for both root and authoritative servers means performance and reliability if implemented correctly.
Choosing a TLD that is unlikely to change hands and is powered by some of the biggest domain infrastructure providers in the world would be a wise safeguard. Moreover, you have the option of using two TLDs as another precautionary measure against this kind of failure.
To sum up Strategy 2: Consider that the likelihood of this kind of DNS failure is incredibly low as long as you pick a reliable TLD. If you don’t take the time to learn about the options and go with trustworthy TLDs, the chances of an outage go up.
Strategy 3: Understand name server issues
Here, we’ll take a closer look at how your DNS management choices can make a difference regarding your likelihood of DNS failure. Being mindful of some characteristics can help you make smarter DNS provider choices.
First, your DNS provider must use Anycast. It allows routers to choose the desired path based on several factors. They include the least-congested route, the closest one or the option with the least latency. Anycast speeds up the DNS resolution process for users. Besides using Anycast, your DNS provider must have tremendous scale and be large. Then, it can manage increased web traffic as your internet presence grows.
Secondly, your DNS provider should not be the same company that services your endpoints. Cloudflare’s incident showed us what can happen with that setup. Plus, aim to use a multi-DNS provider with APIs to keep the two synchronized. If you’re an endpoint provider that offers client software, your goal is to have multiple endpoints on different domains.
Finally, don’t overlook the need to regularly check for upcoming domain expiration or SSL certificates going out of date.
When you outsource your DNS management to a provider that uses Anycast, and ideally work with multiple providers, you are better likely to prevent failures. Remember that the likelihood of them happening depends on which choices you make. The suggestions above should help you avoid costly mistakes.
Strategy 4: Take a preventive approach to avoid renewal issues
If your domain name expires, it could open an assortment of other problems that cause headaches and wasted time. For example, there is normally a grace period of approximately a month when you can proceed with the renewal.
However, if you don’t, another person could buy the domain once it expires. After all, you no longer own it then, so it’s up for grabs. That could mean the website people have associated with your internet destination for years is no longer under your control. Thus, all the time and money spent building your brand and gaining name recognition in the marketplace become useless.
If your SSL certificate expires, it could make your website look illegitimate. For example, if a person using the Google Chrome browser navigates to a website with an expired SSL certificate, they see a warning that says it’s not advisable to continue to the site.
Google advises the internet user to go “back to safety.” However, they can override that choice and continue to look at the website by selecting an advanced option provided on the same page. Many people understandably get scared off and decide not to proceed, especially if the site collects payment information or other sensitive details.
It should be clear why it’s necessary to devote significant resources to stopping these kinds of renewal issues from happening. Fortunately, the biggest thing you must make available is commitment rather than time. That’s because there are handy domain name monitoring services that let you track expiration dates and prevent hijacking.
Expirations represent a kind of DNS failure that’s wholly preventable, provided you stay on top of the relevant dates and decide to renew them in time. On the other hand, the risk is substantial if you don’t take expiration dates and renewal timeframes seriously.
In addition to the expirations of your externally facing domains and security certificates — the ones members of the public see — don’t forget to keep tabs on any internal domains and certificates. Your company may have some of those for its intranet.
Strategy 4 in summary: Keep track of domain name expiry dates.
Strategy 5: Safeguard your ecosystem
We’ve looked at some specific things that can be done to reduce failure points with a DNS. In this final main section, let’s explore some of the reliable things you can do to safeguard your ecosystem.
As a start, use Anycast for everything. We’ve already gone over how and why Anycast shortens the DNS resolution time. It also provides a better experience for end-users by reducing bottlenecks.
Next, remember that coupling your endpoints and DNS zone control in one provider is not a good idea. Yes, Cloudflare, we’re looking at you and will learn lessons from the associated incidents.
If you must use Cloudflare and their DNS, then use their CNAME setup approach. Note that this has the downside of one additional DNS lookup, but that’s arguably a fair compromise.
You may be wondering about using a DNS service that manages it through one or more providers. It’s difficult to give a clear-cut answer regarding that option that applies to every situation. That’s because any such service would come with numerous features. You then have a tradeoff where you decide whether the features are worth the possible risks of a company that lets outside providers assume management responsibilities.
The alternative is an Anycast provider that routes DNS to one of two providers. The issue with that, though, is that it introduces a new single point of failure...
Safeguarding your ecosystem should be an ongoing concern instead of something you prioritize for a while and then forget about soon afterward. When you take these suggestions to heart, you’re making meaningful progress in reducing your risk of experiencing single points of failure with your DNS.
DNS crisis prevention: Three ideas for service providers
The information above relates to consumer sites. However, it’s not necessarily applicable to service providers. If you’re in that category, here are some specific tips.
1) Use multiple DNS endpoints with failover strategies in your clients. Complexity is high, as you need to deal with additional logic in all your client software to handle failures and know what failures require routing to alternative endpoints. Plus, other endpoints need to be hardcoded, and clients need a retry and backoff strategy.
This is the approach we have taken at Ably for Ably SDKs, but it does not work for open protocols. Stream.io had a similar issue and came to the same conclusion.
Stream originally mentioned it would implement client-side failover, which would mean clients would skip failing servers. However, according to the data from GitHub, it looks like it decided against that approach after all.
2) Having a third party manage failover DNS is not really a viable solution in all cases. How confident are you that the system that controls the failover will be available and operational at the time it’s needed?
3) You should also think about offering your service without a DNS lookup at all, and instead use Anycast to route traffic by IP to the closest healthy server. You will almost certainly end up with issues with devices that connect over TLS and don’t support custom server name.
Conclusion
You now know that single points of failure with your DNS are possible, but also largely preventable. Doing thorough research about your domain name options and providers is an excellent early step to take as you determine which companies are best equipped to meet your needs. Then you just need to devote resources to consistently minimising your risk, as suggested in the list above.
If you’re interested in learning more about any of the topics mentioned here, browse the section below.
The above article was written based on Ably Realtime's experience providing cloud infrastructure and APIs that help developers simplify complex realtime engineering. Ably makes it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs. To get started create a free Ably account or talk to sales.
Further Reading
Resources about major domains or provider issues:
- A list encompassing literally hundreds of issues (Ianix)
- Cloudflare's detailed post-mortem of the 2019 incidents
- A topline overview of the .de domain outage (Sans Technology Institute)
- Details of the .st domain outage (Blorn.com)
- Some coverage of the puri.sm domain outage that occurred in 2018 (from puri.sm)
- A UK tech blogger describes his experience with a .io outage (haydenjames.io)
- Overview of a DNS outage that affected Microsoft Azure in May 2019 due to a migration mistake (Build Azure)
- Details about another Microsoft issue that resulted in the deletion of - Microsoft Azure server records
- Information on The Dyn Attack - probably biggest in history in terms of impact (Wikipedia as starting point)
- Two articles about safeguards through a multi-DNS strategy: perspectives from GlobalDots and InfoQ
- A perspective on why using two DNS providers could help with surviving an attack, and why it’s not a widely adopted practice - yet (Internet Society)
- Some generally good advice about DNS tips to avoid pitfalls (although the one about increasing the TTL length is not ideal) (Canopy.co blogpost)
Other reading:
- Most companies still vulnerable to DNS attacks - a view expressed in articles in Silicon Republic and The Register. Interestingly, 44% of SaaS platforms still use one DNS provider. Looking at BBC for example, all DNS servers are in their own network, HubSpot uses Cloudflare exclusively (two DNS endpoints), Salesforce has their own DNS, which route to Dyn and Neustar, Amazon apparently use Dyn and UltraDNS.
- A good source of DNS stats. What’s interesting of course is the number of reported issues with TLDs is disproportionately by the tiny TLDs i.e. anecdotally, the smaller the TLD, the more likely you will be affected.
- Details on Anycast roll out with RIPE and impact (Ripe.net)
Posted on November 19, 2019
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.