On Sept. 10, 2012, many Go Daddy customers experienced intermittent outages that lasted for several hours. There was immediate speculation about whether we were hacked. It was being reported as “fact” before our engineers had identified the root cause. The service disruption was not the work of an external source, but rather an internal network event triggered by a number of factors.
Now that we have analyzed the data and conducted a postmortem, I want to share an explanation with our customers about what happened and what we have done to ensure it doesn’t happen again. Our goal is to provide transparency and detail the specific elements we have implemented to prevent another such occurrence. This article may also serve our industry colleagues by providing insight into lessons learned.
Go Daddy DNS Service
First, a little background. Go Daddy’s DNS infrastructure is deployed at key peering locations around the world. The infrastructure at each location is substantial: many servers connected to a redundant, high-end routing and switching infrastructure. Each DNS center is co-located with our Internet edge router and connects directly into our own service provider backbone infrastructure. Each of our DNS centers has direct access to numerous points of ingress and egress to the Internet, which enables us to on-ramp and off-ramp DNS queries as efficiently as possible. For example, inbound queries are typically diverted to the nearest regional DNS pod, serviced locally, and answered directly through the best available network interconnection point. This architecture enables us to provide a highly available and highly responsive DNS service.
Our global DNS infrastructure answers, on average, approximately 10 billion DNS queries per day, every day, across 41 million DNS zones (that’s 115,000 queries every second). It is distributed globally and peer-connected through anycast BGP routing. Anycast allows the servers in all data centers to exist on the same IP addresses. When a DNS query is sent from a client to one of our DNS servers, the packet is automatically routed to the data center that is closest to that client.
Anycast also provides protection and isolation from many technical and Internet–related issues. If the DNS servers in one data center are unable to respond to queries, the problem is contained within a region rather than affecting us globally. This allows us to route clients to other data centers and continue operating normally. For example, we periodically take a DNS center offline for system maintenance and all traffic is transparently routed to the next closest DNS center. We built this multi-redundant global infrastructure to provide a high level of service availability for our customers.
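To make the anycast behavior concrete, here is a minimal sketch in Python. It is purely illustrative: the site names, address, and distances are hypothetical, and real anycast selection is performed by BGP in the network, not in application code.

```python
# Illustrative sketch (not production code) of how anycast failover behaves
# conceptually. Every site announces the same service IP; a client's query
# is delivered to the "closest" reachable site, and traffic shifts
# transparently when a site is withdrawn for maintenance.

SERVICE_IP = "192.0.2.53"  # example address from the RFC 5737 documentation range

# Hypothetical sites with a relative "distance" standing in for BGP path preference
sites = {
    "us-west": {"distance_from_client": 12, "online": True},
    "us-east": {"distance_from_client": 48, "online": True},
    "eu-west": {"distance_from_client": 95, "online": True},
}

def resolve_anycast(sites):
    """Return the nearest online site announcing SERVICE_IP, or None if all are down."""
    candidates = [(s["distance_from_client"], name)
                  for name, s in sites.items() if s["online"]]
    return min(candidates)[1] if candidates else None

print(resolve_anycast(sites))          # -> us-west (closest site answers)
sites["us-west"]["online"] = False     # take a pod offline for maintenance
print(resolve_anycast(sites))          # -> us-east (next closest takes over)
```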
What Happened Sept. 10
The service disruption was not caused by any single issue, but by a combination of factors:
- Router memory exhaustion
- Router hardware failure modes
Network hardware, like any server or software, has limitations. A given device has a finite amount of memory and hardware resources. Careful planning goes into the design and configuration of each device and how it works within the network so the limitations are not exceeded. This is core to providing a highly available network. On Sept. 10, we experienced an event that pushed many of our routers beyond their capabilities.
Once routes are learned, the router’s “brain,” a CPU called the Route Processor (RP), programs the hardware with the selected routes. These selected routes are kept in hardware memory in a table called a Forwarding Information Base, or FIB. This is done to maximize performance: all forwarding happens at the hardware level, and the CPU is rarely, if ever, involved. On Sept. 10, however, the hardware could not fit the entire forwarding table, which had swelled to roughly 210 times the normal number of routes, into the FIB memory, and the routers fell back to “software switching mode.” At that point, the hardware was minimally involved in forwarding decisions, and the CPUs could not keep up with deciding where every packet transiting the router should go.
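As a rough illustration of that failure mode (not of any specific router’s implementation), the sketch below models a FIB with a fixed capacity: while the learned routes fit, lookups stay on the hardware fast path; once the table overflows, the router drops into software switching and every packet requires a CPU lookup. The capacity and prefixes are made-up numbers.

```python
# Simplified model of hardware forwarding vs. software switching fallback.
# Names and capacities are illustrative; real routers program FIB entries
# into specialized memory (e.g. TCAM), not Python dictionaries.

FIB_CAPACITY = 10_000   # hypothetical hardware table size

class Router:
    def __init__(self):
        self.rib = {}                 # routes learned by the Route Processor
        self.fib = {}                 # routes programmed into hardware
        self.software_switching = False

    def learn_routes(self, routes):
        self.rib.update(routes)
        if len(self.rib) <= FIB_CAPACITY:
            self.fib = dict(self.rib)           # forward in hardware: fast path
            self.software_switching = False
        else:
            self.fib = {}                       # table no longer fits in hardware
            self.software_switching = True      # every packet now hits the CPU

    def forward(self, prefix):
        if self.software_switching:
            return "CPU lookup (slow, easily overwhelmed)"
        return f"hardware lookup -> {self.fib.get(prefix, 'default route')}"

r = Router()
r.learn_routes({f"10.{i // 256}.{i % 256}.0/24": "eth0" for i in range(9000)})
print(r.forward("10.0.1.0/24"))   # hardware lookup

# An errant flood of routes (~210x normal in the Sept. 10 event) exceeds
# the FIB and forces the fallback:
r.learn_routes({f"172.{i // 256}.{i % 256}.0/24": "eth1" for i in range(50000)})
print(r.forward("10.0.1.0/24"))   # CPU lookup (slow)
```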
Within minutes of the beginning of the event, a recovery procedure was executed and the errant routes were removed from the routing protocol on all of our routers. The procedure relied on a standard response from the routers’ software: remove the routes from the FIB and resume forwarding in hardware. This, coupled with normal tiered DNS caching, should have minimized any service disruption caused by the change. However, this recovery mechanism did not execute as expected.
Our network is equipped with extensive filters and partitioned to proactively prevent service disruptions from spreading beyond a single location. In this case, our BGP route reflector filters viewed the routes as legitimate and advertised them to the network. As our routers fell into software switching mode, they were unable to forward incoming and outgoing DNS traffic fast enough.
To remove the errant routes that were flooding our routers, we implemented additional filters to suppress the route advertisements. Shortly after the filters were in place and the additional routes were removed from the network, we identified the routers in our global network that were unable to recover gracefully. They were still relying on their CPUs to forward all traffic. We resolved this by restoring the routing table and then rebooting the impacted routers.
The next step was to resolve the large spike in DNS queries coming into our network due to cache timeouts. When a DNS pod was brought online, DNS queries around the world immediately attempted to resolve against that DNS center. DNS, as a system, is resilient and was designed at the outset to be highly tolerant of failures. A DNS request tries repeatedly against all possible authoritative servers. As our network infrastructure began failing, DNS resolutions against our infrastructure began to slow. Clients timed out, and retries began. As the caches expired and queries escalated, our routers and systems became overwhelmed and slowed further. Retries increased, and so on. Over the next several hours, we saw a DNS traffic spike above 14x normal loads. We needed to throttle the DNS traffic demand as we brought up our DNS centers around the world.
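The back-of-the-envelope model below shows how that feedback loop compounds. The answered fraction and retry count are hypothetical values chosen only to illustrate the shape of the growth; they are not measurements from the event.

```python
# Illustrative model of retry amplification: each unanswered query is retried
# by resolvers, and expired caches add fresh demand on top of the retries.
# All numbers are hypothetical, for demonstration only.

base_qps = 115_000          # normal steady-state query rate (approximate)
answer_fraction = 0.3       # hypothetical fraction answered while degraded
retries_per_failure = 2     # hypothetical retries a resolver issues per timeout

load = base_qps
for wave in range(1, 6):
    unanswered = load * (1 - answer_fraction)
    load = base_qps + unanswered * retries_per_failure   # fresh demand + retries
    print(f"after retry wave {wave}: offered load ~ {load / base_qps:.1f}x normal")
```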
We brought our systems back online by throttling the DNS queries with traffic rate-limiters on all of our Internet connection points around the world. As the limiters took effect, we started to bring up each DNS data center and continually increased traffic with each new DNS pod coming online.
The analogy here is similar to electricity failing in a city and the need to manage demand (everybody wants their electricity back at the same time) when a single power station is restored. If the entire town attempts to regain power at the same time against only one part of the grid infrastructure, that component is often overwhelmed and the system fails to regain full strength.
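Conceptually, the rate-limiters behave like a token bucket: a sustained rate plus a bounded burst, with excess queries dropped so that normal client retry behavior spreads the demand out over time. The sketch below is a simplified Python model with illustrative numbers; the production limiters run on our edge network equipment, not in software like this.

```python
# Minimal token-bucket sketch of edge rate limiting while DNS pods
# were brought back online. Rates and burst sizes are illustrative.

import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec        # sustained queries allowed per second
        self.capacity = burst           # short bursts tolerated above the rate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at the burst capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                 # forward the query to the DNS pod
        return False                    # drop it; the client will retry later

limiter = TokenBucket(rate_per_sec=50_000, burst=5_000)
forwarded = sum(limiter.allow() for _ in range(10_000))
print(f"{forwarded} of 10000 queries admitted in this burst")
```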
Shortly after 4:00 p.m., our monitoring systems confirmed a full recovery had been achieved.
Lessons Learned & Actions
Our internal BGP (IBGP) infrastructure leverages something called a “route reflector.” These route reflectors receive routes from each of our edge devices, apply policies as necessary, and propagate the changes to all other routers that need the route information. These route reflectors can act as a “firewall” mechanism, in that the policies they apply to incoming routes restrict and/or eliminate unwanted changes to the downstream IBGP peers. Specific to our topologies, we now impose additional limits on the number of routes allowed at the route reflector layer. This is a key lesson we learned from this event and we recommend it for all similar networks.
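Conceptually, the new safeguard looks like the sketch below: the route reflector tracks how many prefixes each client has contributed and suppresses any update that would push a client past its ceiling, rather than reflecting it to every IBGP peer. The class, limit, and prefixes are illustrative; on real routers this is accomplished with built-in features such as BGP maximum-prefix limits.

```python
# Illustrative sketch of a route reflector that enforces a per-client
# prefix ceiling before propagating updates. Not a real routing stack.

MAX_ROUTES_PER_CLIENT = 1_000   # hypothetical per-client ceiling

class RouteReflector:
    def __init__(self, max_routes=MAX_ROUTES_PER_CLIENT):
        self.max_routes = max_routes
        self.routes_by_client = {}      # client -> set of prefixes it has advertised

    def receive_update(self, client, prefixes):
        known = self.routes_by_client.setdefault(client, set())
        if len(known | set(prefixes)) > self.max_routes:
            # Exceeding the limit is treated as an errant flood: suppress the
            # advertisement instead of reflecting it to every IBGP peer.
            return []
        known.update(prefixes)
        return list(prefixes)           # routes to reflect to the other peers

rr = RouteReflector()
print(len(rr.receive_update("edge-1", [f"10.0.{i}.0/24" for i in range(200)])))
# -> 200 routes reflected normally
print(len(rr.receive_update("edge-1", [f"172.{i // 256}.{i % 256}.0/24" for i in range(5000)])))
# -> 0: the flood is suppressed at the reflector layer
```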
In addition, we are implementing enhancements to our technology in order to prevent such a “perfect storm” from impacting our network again. Our commitment to our customers is the key principle on which we have built our business and we will use the lessons from this incident to strengthen all facets of the services we provide.
For more than a decade, we have provided “five-9s” (99.999%) uptime in our DNS infrastructure. We view any disruption as a serious concern, and we are confident we have improved our systems in the wake of this event. We know we have a responsibility to our customers and the entire Internet ecosystem, given the volume of DNS traffic we handle every day.
At Go Daddy, we take pride in delivering high-quality services for our customers and we sincerely apologize for the service disruption. We want you to know we are continuously working to better serve our customers.
Chief Infrastructure Officer