Ask any system administrator how they feel about rebuilding a production server and you’ll probably hear a groan. Now imagine asking them to rebuild about 1,000 production servers that are hosting customer content, within 3 months, with zero downtime. You might expect pitchforks and torches to come out. I certainly did, because I had to ask my team to do just that. Here’s the crazy part: They didn’t complain. They got busy and actually pulled it off!
An abridged history of Go Daddy Linux Hosting
Back in 2008, when Go Daddy was developing our new clustered web hosting platform, we started using CentOS 5 on our shared hosting web servers. This served us well for several years, but it wasn’t without its challenges. Even though our web servers are built with 48+GB of RAM, we were running a 32-bit Linux kernel, which limits the amount of memory the kernel can address to 4GB. To overcome this, we used a Physical Address Extension (http://en.wikipedia.org/wiki/Physical_Address_Extension) version of the 32-bit kernel. Over time, we found that some updates to the PAE kernel had bugs that surfaced under heavy load or on high-transaction servers (which hosting servers are), causing instability. That meant we had to keep running older kernel versions longer than we wanted, hoping for bug fixes to arrive. From both a security and a performance standpoint, this was not an ideal situation.
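On a running box, this situation is easy to spot. A quick sketch of the checks involved (illustrative; the output will of course vary per machine):

```shell
#!/bin/sh
# Quick checks for the 32-bit/PAE situation described above.

uname -m    # i686 means a 32-bit kernel, x86_64 means 64-bit
uname -r    # on CentOS 5, a PAE kernel carries a "PAE" suffix in its version string

# Does the CPU support PAE at all?
grep -q -w pae /proc/cpuinfo && echo "CPU supports PAE" || echo "no PAE flag"

# How much RAM can the running kernel actually see?
free -m | awk '/^Mem:/ {print $2 " MB visible to the kernel"}'
```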
Luckily, our Hosting Development and Engineering teams were already hard at work preparing us for the (then) new CentOS 6 distribution. Going with the 64-bit version was a given, so while developers updated our provisioning systems, our engineers went through every package we used and updated them all to work with a 64-bit operating system. There was, however, some bad news: going from CentOS 5 to CentOS 6 isn’t an in-place upgrade; it requires a complete rebuild of the server. And we had around 1,000 web hosting servers* to upgrade.
*In 4GH web hosting. Our legacy 2GH hasn’t yet been upgraded.
How to rebuild servers without disturbing anyone
So there was our challenge: rebuild all our web hosting servers, as fast as possible, without creating any downtime for our customers. A difficult hand for an operations team to be dealt, but we had a few aces up our sleeve.
Our Ace of Clubs: Among the many advantages of Go Daddy’s web hosting being built on clustered, load-balanced servers is that we can pull a server out of the load balancer without disrupting traffic to any of our customers’ web sites. So downtime was not going to be an issue. But rebuilding a server takes time and effort. Multiply that by 1,000 and you’ve got a task that sounds impossible to complete in 90 days or less.
Our Ace of Hearts: Go Daddy internally developed a robust server management database that not only tracks the status and location of all our servers, but also handles build and retirement automation. With a little bit of work, we were able to leverage this system to, in effect, “roll back” the build process of a server and start it over as if it were a freshly racked server: new operating system installed, ready to configure. But still, taking all those servers and configuring them wasn’t going to be easy.
Our Ace of Diamonds: Configuration management in a large environment is important, and rebuilding all of the web servers gave us the opportunity to upgrade and revamp our Puppet environment with some consulting help from Puppet Labs. Once that infrastructure was updated, it was “all hands on deck” for our operations and engineering teams to write new Puppet manifests, with the goal of taking a blank-slate server and automatically configuring it as a web server. With a lot of time, caffeine, tears, and a couple of Puppet hack-a-thons behind us, we were ready to build boxes en masse.
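For readers unfamiliar with Puppet: a manifest declares the state a server should be in, and the Puppet agent makes it so. A minimal sketch of the kind of resource declarations involved (not one of our actual manifests; the module path and file names here are invented for illustration):

```puppet
# Illustrative only: keep Apache installed, configured, and running.
package { 'httpd':
  ensure => installed,
}

file { '/etc/httpd/conf/httpd.conf':
  ensure  => file,
  source  => 'puppet:///modules/webserver/httpd.conf',  # hypothetical module
  require => Package['httpd'],
  notify  => Service['httpd'],                          # restart on config change
}

service { 'httpd':
  ensure  => running,
  enable  => true,
}
```

Declarations like these are what let a blank-slate server converge to a fully configured web server with no hand-editing.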
Our Ace of Spades: We were left with only one more challenge: the human factor. To rebuild these servers, we’d need to orchestrate resources from our data center, networking, monitoring, and operations teams. As anyone who has run a cross-team project can attest, resource wrangling is often the most difficult part. So, we reached out to all these teams and got access to the tools and systems needed to complete the process, saving them resources and us time. With this access, our Ace of Spades was in play, and his name is Guillermo Lopez, Linux administrator extraordinaire and server-rebuild guru. With an aggressive timeline and all the tools in hand, Guillermo set about rebuilding servers, first pulling one node from each of five clusters and shepherding the process along. Day by day, as our confidence in the process grew, so too did our pace. Soon, we were rebuilding 10 servers a day, then 15, then 20, finally settling on a pace of 25 production servers rebuilt on CentOS 6, configured automatically via Puppet, and put back into active service every single day!
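Stripped of our internal tooling, the per-node cycle looked roughly like the loop below. The lb_drain, reimage, and lb_enable functions are hypothetical stand-ins for Go Daddy’s internal load-balancer and provisioning systems (stubbed here so the sketch runs); only the Puppet step reflects a real command.

```shell
#!/bin/sh
# Sketch of the per-node rebuild cycle. lb_drain/reimage/lb_enable are
# placeholders for internal tooling, not real commands.
lb_drain()  { echo "draining $1 from the load balancer"; }
reimage()   { echo "rolling back build state and reinstalling $1 with $2"; }
lb_enable() { echo "returning $1 to active service"; }

for node in web101 web102 web103; do
    lb_drain  "$node"                 # traffic keeps flowing to the other nodes
    reimage   "$node" "centos6-x86_64"
    # On the real node: puppet agent --test  (apply the catalog once, verbosely)
    lb_enable "$node"
done
```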
Performance of our hosting servers is very important to us at Go Daddy, and we knew that CentOS 6 would help improve our performance, but to what extent? We wanted a real-life, relevant test. So, we decided to build a website that mirrors our typical hosting customer. We used WordPress with some common plugins, filled it with content and images, and placed a copy on a large sample of our servers. We then leveraged Gomez Networks to poll these sites from various geographical locations (Seattle, Los Angeles, New York, and Chicago) every hour. The results were significant. A WordPress page that took about 3 seconds to load on CentOS 5 loaded in just 1.7 seconds on CentOS 6!
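Gomez did the real polling for us, but the same kind of measurement can be approximated from any shell using curl’s standard write-out variables (the URL below is a placeholder, not one of our test sites):

```shell
#!/bin/sh
# Report how long a full page fetch takes, similar to what an
# external poller measures. URL is a placeholder.
URL="${URL:-http://example.com/}"
t=$(curl -o /dev/null -s -w '%{time_total}' "$URL")
echo "fetched $URL in ${t}s"
```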
But why was there such a big improvement? Was it simply because we moved to a 64-bit operating system? Without getting overly technical, CentOS 6 ships a newer Linux kernel with improvements to the TCP stack for better networking. Also, since the 64-bit kernel can address all of the server’s RAM directly, far more memory is available for kernel caching, which let us take advantage of NFS client-side caching. All around, it provides a much better performing, more stable platform for our shared hosting customers.
We had a much easier time with this project because we started with an aggressive, yet realistic, project plan. In the beginning, we spent a good amount of time identifying every possible activity that would be needed, which teams would need to participate, and what tools or other resources we’d need to use. Having access to the tools and knowledge necessary to handle the entire project from start to finish helped us keep up with a tight schedule. But unforeseen challenges did arise, and we hit our fair share of roadblocks. When that happened, we got all hands on deck and didn’t rest until the issues were overcome. At one point, the new 64-bit version of a package that we’d used for years in 32-bit wasn’t behaving well and was causing kernel panics. Getting all of our engineers, developers, and administrators into a room to brainstorm produced a workaround that kept us on schedule while the package was being updated by the vendor. It’s this type of tenacity and perseverance that helped us deliver on time.
Then came the most important part: celebrating our win! Is there a better way to reward hard work than with beer and sausages? You’re right, there isn’t. We rounded up the teams and headed to BrautHaus in Scottsdale, AZ for our fill of delicious German bratwurst (my favorite was the duck and date, though the rabbit and hops was popular as well), amazingly good German pretzels, some of the best Belgian-style fries ever, and, of course, plenty of good beer.