On Sept. 10, 2012, many Go Daddy customers experienced intermittent outages that lasted for several hours. There was immediate speculation about whether we were hacked. It was being reported as “fact” before our engineers had identified the root cause. The service disruption was not the work of an external source, but rather an internal network event triggered by a number of factors.
Now that we have analyzed the data and conducted a postmortem, I want to share an explanation with our customers about what happened and what we have done to ensure it doesn’t happen again. Our goal is to provide transparency and detail the specific elements we have implemented to prevent another such occurrence. This article may also serve our industry colleagues by providing insight into lessons learned.
Go Daddy DNS Service
First, a little background. Go Daddy’s DNS infrastructure is deployed at key peering locations around the world. The infrastructure at each location is substantial – it consists of many servers all connected to a redundant, high-end routing and switching infrastructure. Each DNS center is co-located at our Internet edge router and connects directly into our own service provider backbone infrastructure. Each of our DNS centers has direct access to numerous points of ingress and egress to the Internet. This enables us to on-ramp and off-ramp our DNS queries as efficiently as possible. For example, inbound queries are typically diverted to the nearest regional DNS pod, serviced locally, and answered directly through the best available network interconnection point. This architecture enables us to provide a highly available and highly responsive DNS service.
Our global DNS infrastructure answers, on average, approximately 10 billion DNS queries per day, every day, across 41 million DNS zones (that’s 115,000 queries every second). It is distributed globally and peer-connected through anycast BGP routing. Anycast allows the servers in all data centers to exist on the same IP addresses. When a DNS query is sent from a client to one of our DNS servers, the packet is automatically routed to the data center that is closest to that client.
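As a quick sanity check on the figures above, the following minimal Python sketch recomputes the average query rate from the daily total. The variable and function names are illustrative only; the two input numbers are the ones quoted in this article.

```python
# Back-of-the-envelope check of the traffic figures quoted above.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(total_per_day: float) -> float:
    """Convert a daily total into an average per-second rate."""
    return total_per_day / SECONDS_PER_DAY


queries_per_day = 10_000_000_000  # "approximately 10 billion DNS queries per day"
dns_zones = 41_000_000            # "41 million DNS zones"

queries_per_second = average_rate(queries_per_day)
queries_per_zone_per_day = queries_per_day / dns_zones

print(f"{queries_per_second:,.0f} queries per second on average")
print(f"{queries_per_zone_per_day:,.0f} queries per zone per day on average")
```

The result is a little under 116,000 queries per second, in line with the rounded 115,000 figure. These are averages; peak load can be much higher.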
Anycast also provides protection and isolation from many technical and Internet-related issues. If the DNS servers in one data center are unable to respond to queries, the problem is contained within a region rather than affecting us globally. This allows us to route clients to other data centers and continue operating normally. For example, we periodically take a DNS center offline for system maintenance and all traffic is transparently routed to the next closest DNS center. We built this multi-redundant global infrastructure to provide a high level of service availability for our customers.
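The failover behavior described above can be illustrated with a small toy model. This is a sketch under simplifying assumptions, not how BGP actually selects paths: the site names are hypothetical, and a single integer "path cost" per site stands in for the routing decision as seen from one client's network.

```python
from typing import Dict, Optional, Set


def pick_data_center(path_cost: Dict[str, int], announcing: Set[str]) -> Optional[str]:
    """Return the lowest-cost site that still announces the anycast address.

    path_cost:  hypothetical cost from one client network to each site.
    announcing: sites currently advertising the shared anycast IP.
    """
    candidates = [site for site in path_cost if site in announcing]
    if not candidates:
        return None  # no site is reachable for this address
    return min(candidates, key=lambda site: path_cost[site])


# Hypothetical view from a single client network.
costs = {"site-a": 1, "site-b": 3, "site-c": 5}

all_sites = {"site-a", "site-b", "site-c"}
print(pick_data_center(costs, all_sites))               # nearest site answers
print(pick_data_center(costs, all_sites - {"site-a"}))  # site-a in maintenance
```

When a site stops announcing the address (for maintenance, say), the same client is simply served by the next closest site, which is the property the paragraph above relies on.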
What Happened Sept. 10
There was not a single issue that caused the service disruption. Rather, it was the combination of multiple factors. The combined factors that contributed to the service disruption were:
- Router memory exhaustion
- Router hardware failure modes
- Containment
Network hardware, like any server or software, has limitations. A given device has a finite amount of memory and hardware resources. Careful planning goes into the design and configuration of each device and how it works within the network so the limitations are not exceeded. This is core to providing a highly available network. On Sept. 10, we experienced an event that pushed many of our routers beyond their capabilities.
Once routes are learned, the CPU in the “brain” of the router, called a Route Processor (RP), programs the hardware with the selected routes. These selected routes are kept in hardware memory in a table called a Forwarding Information Base, or FIB. This is done to maximize performance – all forwarding is thus done at the hardware level, and the CPU is rarely, if ever, involved. In the event of Sept. 10, however, the forwarding table grew to 210x its normal number of routes and no longer fit into FIB memory, so the routers fell back to “software switching mode.” At this point, the hardware was minimally involved in the forwarding decisions, and the CPUs could not keep up with the task of deciding where every packet transiting the router should go.
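To make the failure mode concrete, here is a deliberately simplified Python model of a router whose hardware table has a fixed capacity. Everything in it is an assumption for illustration: the class name, the capacity of 1,000 entries, the route labels, and the `recovers_gracefully` flag, which stands in for the difference between routers that return to hardware forwarding on their own and those that need a reboot. It does not model any vendor's real behavior.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class ToyRouter:
    fib_capacity: int                  # hypothetical hardware table size
    recovers_gracefully: bool = True   # assumption: some routers do not
    fib: Set[str] = field(default_factory=set)
    software_switching: bool = False

    def program_routes(self, routes: Iterable[str]) -> None:
        """Route Processor tries to install the selected routes in hardware."""
        selected = set(routes)
        if len(selected) > self.fib_capacity:
            # Table does not fit: the CPU now forwards packets itself.
            self.software_switching = True
            return
        self.fib = selected
        if self.recovers_gracefully:
            self.software_switching = False

    def reboot(self, routes: Iterable[str]) -> None:
        """Clear all state, then reinstall routes from scratch."""
        self.fib = set()
        self.software_switching = False
        self.program_routes(routes)


normal = {f"route-{i}" for i in range(100)}
leaked = {f"route-{i}" for i in range(100 * 210)}  # 210x the normal count

router = ToyRouter(fib_capacity=1_000, recovers_gracefully=False)
router.program_routes(normal)   # fits in hardware
router.program_routes(leaked)   # overflows, falls back to the CPU
router.program_routes(normal)   # errant routes withdrawn, still stuck
router.reboot(normal)           # hardware forwarding restored
```

The third call is the interesting one: withdrawing the extra routes is not sufficient in this model for a router that does not recover gracefully, which mirrors the need to reboot some devices.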
Within minutes of the beginning of the event, a recovery procedure was executed and the errant routes were removed from the routing protocol of all of our routers. The procedure relied on a standard response from the routers’ software – remove the routes from the FIB and begin forwarding in hardware again. This, coupled with normal tiered DNS caching, should have minimized any service disruption caused by the change. However, this timeout mechanism did not execute.
Our network is equipped with extensive filters and partitioned to proactively prevent service disruptions from spreading beyond a single location. In this case, our BGP route reflector filters viewed the routes as legitimate and advertised them to the network. As our routers fell into software switching mode, they were unable to forward incoming and outgoing DNS traffic fast enough.
Restoring Service
To remove the errant routes that were flooding our routers, we implemented additional filters to suppress the route advertisements. Shortly after the filters were in place and the additional routes were removed from the network, we identified the routers in our global network that were unable to recover gracefully. They were still relying on their CPUs to forward all traffic. We resolved this through a combination of restoring the routing table, followed by the rebooting of the impacted routers.
The next step was to resolve the large spike in DNS queries coming into our network due to cache timeouts. When a DNS pod was brought online, DNS queries around the world immediately attempted to resolve against that DNS center. DNS, as a system, is resilient and was designed at the outset to be highly tolerant of failures. A DNS request tries repeatedly against all possible authoritative servers. As our network infrastructure began failing, DNS resolutions against our infrastructure began to slow. Clients timed out, and retries began. As the caches expired and queries escalated, our routers and systems became overwhelmed and slowed further. Retries increased, and so on. Over the next several hours, we saw a DNS traffic spike above 14x normal loads. We needed to throttle the DNS traffic demand as we brought up our DNS centers around the world.
We brought our systems back online by throttling the DNS queries with traffic rate-limiters on all of our Internet connection points around the world. As the limiters took effect, we started to bring up each DNS data center and continually increased traffic with each new DNS pod coming online.
The situation is analogous to electricity failing in a city and the need to manage demand (everybody wants their electricity back at the same time) when a single power station is restored. If the entire town attempts to regain power at the same time against only one part of the grid infrastructure, this often overwhelms that component and the system fails to regain full strength.
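The throttling described above can be sketched with a token bucket, one common way to build a rate limiter. This is an assumption for illustration only: the article does not say which mechanism the rate-limiters used, and the class name, rates, and burst sizes below are hypothetical. Time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start with a full bucket
        self.last = 0.0      # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def set_rate(self, rate: float, burst: float) -> None:
        """Open the throttle, for example after another DNS pod comes online."""
        self.rate = rate
        self.burst = burst


limiter = TokenBucket(rate=10, burst=5)

# 20 queries arrive at once; only the burst allowance gets through.
admitted_first = sum(limiter.allow(now=0.0) for _ in range(20))

# More capacity is online, so the limit is raised before the next wave.
limiter.set_rate(rate=20, burst=10)
admitted_second = sum(limiter.allow(now=1.0) for _ in range(20))

print(admitted_first, admitted_second)
```

Raising the limit step by step as capacity returns is the same idea as restoring power to a city one district at a time.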
Shortly after 4:00 p.m., our monitoring systems confirmed a full recovery had been achieved.
Lessons Learned & Actions
Our internal BGP (IBGP) infrastructure leverages something called a “route reflector.” These route reflectors receive routes from each of our edge devices, apply policies as necessary, and propagate the changes to all other routers that need the route information. These route reflectors can act as a “firewall” mechanism, in that the policies they apply to incoming routes will restrict or eliminate unwanted changes to the downstream IBGP peers. Specific to our topologies, we now impose additional limits on the number of routes allowed at the route reflector layer. This is a key lesson we learned from this event and we recommend it for all similar networks.
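The idea of a route-count limit at the route reflector layer can be sketched as follows. This is a toy model under stated assumptions, not our configuration: the class names, the peer name, and the limit of 500 are hypothetical, and the sketch simply rejects an oversized update and keeps the last accepted routes. Real BGP implementations typically offer a per-peer maximum-prefix setting whose reaction (warn or shut down the session) depends on configuration.

```python
from typing import Dict, Iterable, Set


class PrefixLimitExceeded(Exception):
    """Raised when a peer offers more routes than policy allows."""


class ToyRouteReflector:
    def __init__(self, max_prefixes_per_peer: int) -> None:
        self.max_prefixes_per_peer = max_prefixes_per_peer
        self.routes_by_peer: Dict[str, Set[str]] = {}

    def receive(self, peer: str, prefixes: Iterable[str]) -> None:
        """Accept a peer's full route set only if it is within the limit."""
        offered = set(prefixes)
        if len(offered) > self.max_prefixes_per_peer:
            # Reject the update; previously accepted routes stay in place.
            raise PrefixLimitExceeded(
                f"{peer} offered {len(offered)} routes, "
                f"limit is {self.max_prefixes_per_peer}"
            )
        self.routes_by_peer[peer] = offered

    def reflected_routes(self) -> Set[str]:
        """Routes that would be propagated to downstream IBGP peers."""
        result: Set[str] = set()
        for routes in self.routes_by_peer.values():
            result |= routes
        return result


reflector = ToyRouteReflector(max_prefixes_per_peer=500)
reflector.receive("edge-1", {f"route-{i}" for i in range(100)})

try:
    # A leak of 210x the normal route count is stopped at the reflector.
    reflector.receive("edge-1", {f"route-{i}" for i in range(100 * 210)})
except PrefixLimitExceeded as err:
    print("rejected:", err)

print(len(reflector.reflected_routes()), "routes still reflected downstream")
```

The point of placing the check here is containment: an oversized update is stopped at one layer instead of being propagated to every downstream router.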
In addition, we are implementing enhancements to our technology in order to prevent such a “perfect storm” from impacting our network again. Our commitment to our customers is the key principle on which we have built our business and we will use the lessons from this incident to strengthen all facets of the services we provide.
For more than a decade, we have provided an uptime of “five-9s” in our DNS infrastructure. We view any disruption as a serious concern and we are confident we have improved our system on the heels of this event. We know we have a responsibility to our customers and the entire Internet ecosystem with the volume of DNS traffic we handle every day.
At Go Daddy, we take pride in delivering high-quality services for our customers and we sincerely apologize for the service disruption. We want you to know we are continuously working to better serve our customers.
Auguste Goldman
Chief Infrastructure Officer
GoDaddy.com
Kudos to the GoDaddy team for handling this issue so wisely.
Thanks for the information on what exactly happened. I did have a follow-up question though: it appears that the root issue was the uncommonly large routing table (210x larger than normal) that the routers tried to load into the hardware. I am curious what event created this routing table in the first place?
Good to have some more details – but this write-up does not actually state the root cause and source of the extra routing information that overflowed the FIB. This is a critical piece of information for allowing customers to understand the actual root cause and what is being done to avoid triggering a similar event.
I’m assuming in an environment your size full mesh is out of the question?
I am curious what the 210x the normal routes was caused by? Being these must be public route servicing router reflectors and the public v4 table being around 420k routes, how did you get 210x that? Obviously we have all seen route leaks, which is why we apply max-prefix limits, but 210x 420,000 prefixes is insane.
Thank you for sharing this information.
We really appreciate hearing the details.
It is so good to hear this from the horse’s mouth, Godaddy. Rumors were rampant but I knew you would resolve everything. Yay for you all!
I’ve never seen so many words used to describe “something leaked, and we blew CEF max” before. I don’t suppose you have any details about what caused the extra routes in the first place?
An honest answer is a good answer. I think we’ve all flooded the TCAM at one point in our careers to understand the impact. There’s no easy way to recover it once you get affected too. Especially on the BGP RR that spread it to other nodes. Yes, max-prefix is your new friend now. Thanks for sharing.
I assume it was 210x on the IPv6 table, or?
I wonder if the RR’s reflected all routes learned at the edge rather than running a best path and only advertising the ‘best route’ to all other edge devices within the iBGP?
I thought: “Why didn’t GoDaddy send me an email warning me of this issue?” but it was resolved quickly… then I thought: “Houston, we HAD a problem… but it’s solved”
If that were my company, I would give a pay rise to this great technical team for the quick response to that big problem.
Great job, great service.
Congratulations from Colombia.
Excellent article. I appreciate the details. Will have to re-read to take it all in!
“The service disruption was not the work of an external source.”
“On Sept. 10, we experienced an event that pushed many of our routers beyond their capabilities.”
So what was this event? Unless you reveal that, one can only assume that it was indeed a distributed denial of service attack initiated by a hacker group. In other words, an external source.
One more request for info about how you ended up with 210x your normal routes!
Another person wondering how your routing table grew to 210x as well? Sounds like a careless mistake made by someone?
Sorry, “router memory exhaustion” doesn’t just happen, especially not with thousands of devices that apparently got installed as you pointed out: “Careful planning goes into the design and configuration of each device”.
Did you carefully set up all the routers to be exactly the same, so that they fail all at the same time when the routing table got too big? What happened to alerts when memory is low?
I appreciate the long article, but when you finally get to the point, it just sounds unbelievable – or let’s put it that way: DDoS sounds much more believable than your story.
Count me and my team in for wanting to know the root cause of the 210x as well! The explanation given here is bloated and yet woefully inadequate at the same time.
Dear Auguste!
Thank you for giving us this explanation. I’m one of the GoDaddy customers who was very concerned on September 10, when our website was totally down for the first time since we started in June 2010. I hope this problem never appears again. Thank you.
Greetings,