Cloud Processing with Gearman

Why Doesn’t PHP have Threads?

If you’ve ever had a zealous conversation about PHP as a programming language with another engineer, PHP’s lack of thread support has probably come up. Regardless of which side of the debate you were on, you were almost certainly talking about how technologies like Java, C, .NET, and others support concurrent processing with threads.

This article won’t be about PHP’s lack of threads. However, before you get started reading, I want to stipulate a couple of things up front:

  1. PHP doesn’t have threads, though it’d be cool if it did.
  2. Gearman is not the only message queue, event queue, or job queue technology out there.

And I ask that you, as the reader, keep an open mind. Consider that threads add complexity and Gearman brings some things to the table that threads don’t.

The Problem Statement

Some problems are just easier when you have more resources to throw at them. Maybe your project isn’t processing a data set worthy of a 40-node Hadoop cluster, but maybe your project could benefit from a little concurrent processing. Examples include:

  • Fetching data from multiple sources
  • Crawling a website
  • Processing log files
  • Sending text messages or e-mails
  • Batch resizing or watermarking images
  • Job / server / process / status monitoring
  • Cache warming
  • Executing unrelated background tasks (maybe your app doesn’t use the system cron)

In the very real and literal halls of Go Daddy, we’ve had these discussions about how to get concurrent processing into our PHP apps cleanly, but we never had a consensus until Gearman.

The Aggregator App

We have an internal hosting support application that presents just this type of problem. The core function of the app is to search for accounts and display them to the user. But the account information is scattered across many different systems. The aggregator app pulls data from all of these disparate sources, applies some business logic, and presents everything in a unified interface. However, the search piece of the app was painfully slow, so in October 2009 we started a project to speed it up.

When someone searches for a hosting account in our internal hosting support system, a series of searches is performed. For example, if the user enters 1234567, this could identify a customer, a dedicated server, a virtual server, or a shared hosting account. To present the user with a complete result set, the app must make up to 13 queries to search APIs across disparate systems.

The first option we explored was to reduce the number of queries. If the user enters a number, up to 11 queries are sent just to the billing system search API. After some research, we found that reducing these queries was not a viable option because of our system architecture.

Our second thought was a caching layer. We threw this out for two reasons. First, we always needed up-to-date results. Second, when we looked at the search queries, they were not repeats. There would not be enough cache hits to justify a cache at all.

Our next thought was speeding up how the data got to our app. From our research with the billing search API team, we found that the API could scale horizontally, but not vertically. This meant we could hit their service with several concurrent connections under our volume and the service would not get slower.

This led very nicely into our last thought: load the searches in parallel. We started looking around for a good solution and found that many engineers have tried to get parallel processing in PHP by spawning separate processes with pcntl_fork (which is not supported in mod_php) or proc_open, by calling exec('php script.php &'), or by using curl_multi_exec() to load multiple URLs.
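
To make "load the searches in parallel" concrete, here is a minimal sketch of fanning out several searches at once and merging the results. It is written in Python with the standard library's thread pool, not PHP, and it is not the code from our app. The three search_* functions and what they return are hypothetical stand-ins for calls to the real search APIs.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-ins for the search APIs; each sleeps to simulate latency.
def search_customers(term):
    time.sleep(0.2)
    return [("customer", term)]

def search_dedicated_servers(term):
    time.sleep(0.2)
    return []

def search_shared_hosting(term):
    time.sleep(0.2)
    return [("shared_hosting", term)]

SEARCHES = [search_customers, search_dedicated_servers, search_shared_hosting]

def search_all(term):
    """Run every search concurrently and merge the hits into one list."""
    with ThreadPoolExecutor(max_workers=len(SEARCHES)) as pool:
        # map() yields results in the order of SEARCHES, not completion order.
        result_lists = pool.map(lambda search: search(term), SEARCHES)
        return [hit for hits in result_lists for hit in hits]

if __name__ == "__main__":
    start = time.perf_counter()
    print(search_all("1234567"))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")
```

Run one after another, the three calls would take roughly the sum of their latencies. Run concurrently, they should take about as long as the slowest one.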

And Then There was Gearman

The lead engineer on this project came across a blog post by Rasmus Lerdorf (the PHP guy) entitled Playing with Gearman. This was the answer to everything we wanted and more. Within a few days we had a prototype for a search backend up and running, powered by Gearman workers, that was noticeably more responsive. Some of our test customers had more than 100 instances of each product in their accounts. Previously, the search page would time out. With Gearman-enabled concurrency, though, the page loaded in about 10 seconds.

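Our process now looks like this: the app hands each search to Gearman as a job, the workers run those jobs concurrently, and the app merges the results into one result set.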

Gearman required remarkably few modifications to the search code. It’s been over 18 months since we deployed the new search backend. The kudos and “attaboys” have been relatively sparse. Not many people actually noticed the change (nothing looked different, after all), but we noticed usage going up and search times going down. We’ve since converted a few long-running tasks to Gearman, and it’s now seamlessly processing over 40,000 jobs per day, with no hiccups.

Gearman is Mega-fork

“The way I like to think of Gearman is as a massively distributed, massively fault tolerant fork mechanism.”
Joe Stump

We already had a cluster for the aggregator app. Gearman gives you FREE multi-server processing. We run multiple workers and a Gearman job server on each server, and then tell the app to connect to both Gearman job servers. We can scale the load up or down by changing the number of workers that are running. If a server goes down, or workers die, Gearman will adapt. In the future, we can expose APIs via Gearman’s built-in HTTP module. We can even write workers in other languages and extend our app if other teams adopt Gearman.

In Case You Missed It

I said we “start workers” on each server. That means we have a process to start and manage a group of processes that maintain a connection to the Gearman job servers. We chose to write this in PHP to take advantage of the existing framework (i.e., logging and configuration) in our app. If you’re interested in using Gearman in your project, I would suggest checking out GearmanManager. It’s a full-featured Gearman worker manager you can plug your code into.

The Sales Pitch

Gearman has some really cool features.

  • Cloud processing. Just make sure the workers and clients know about each other. Gearman handles failover and load balancing. If you need to add more workers, more clients, or more servers, just update everybody’s configuration and (maybe) restart a few services.
  • Job retries. A failed job is retried and isn’t dropped until it fails more times than the threshold you set.
  • Persistent job queue. If your server has a sudden reboot, your jobs will restart where they left off. You can use different storage engines for the persistent queue, too. Gearman supports Drizzle, Tokyo Cabinet, SQLite, memcached (or anything that’s protocol compliant), PostgreSQL, or MySQL.
  • MySQL UDFs. You can start Gearman jobs from MySQL. Imagine working with an SSO system where you want to create / validate tokens for users right from the database query:

    mysql> SELECT gman_do('generate_sso_token', '12345|testuser');
    +----------------------------------------------------------------------------------------------+
    | gman_do('generate_sso_token', '12345|testuser')                                              |
    +----------------------------------------------------------------------------------------------+
    | AUAAv/8Q8MsQsNUg7QV8mBotpcmVpSbnJLAJ8gmJSysi8QHZTlj/bsJmq/oixPEpj95n99Anf7v5m2HdQGNjb/gn+4fU |
    +----------------------------------------------------------------------------------------------+
    1 row in set (0.01 sec)

  • HTTP module. Gearman even exposes functions via REST.

    curl -i -XPOST http://10.1.1.5:8120/generate_sso_token -d '12345|testuser'

    HTTP/1.0 200 OK
    X-Gearman-Job-Handle: H:10.1.1.5:22
    Content-Length: 92
    Server: Gearman/0.22

    AUAAv//UqMSkP9U6cRURM/KuPimqyv+gb9vHw/JiH3U/5f6/k7PX86CqcfJvbuODiXD3TQorpJkhxceisyNbsw8/H6lq

  • Gearman command. You can start Gearman jobs right from bash!

    [user@host]$ gearman -h 10.1.1.5 -p 4730 -f decrypt_sso_token "AUAAv/904L9uOyVRm6hF0R0gGCgx3QyhWMcajhWiuXklXm0z4yL/Xyn6hy8wdCBUvD5nSFPIS5P/afCP5uI0mgvJciWi"

    12345|testuser

  • Job queues can have maximum sizes. You control resource utilization.
  • Workers can put new jobs on the queue. You can do neat things like map / reduce with Gearman.

You can use this pattern to answer questions like: What are the 10 most common 6+ character words in Moby Dick, taking case sensitivity into account?

As your demand increases, you can just spin up new workers.
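
Here is a minimal sketch of the map / reduce shape behind the Moby Dick question. It is plain Python, not Gearman: the thread pool only stands in for a pool of workers, which with Gearman would be separate processes, possibly on other machines. The function names, the chunk size, and the word-matching rule are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import re

def map_chunk(chunk, min_length=6):
    """Map step: count the words of at least min_length characters in one chunk."""
    words = re.findall(r"[A-Za-z]+", chunk)
    # No lowercasing, so "Whale" and "whale" are counted separately.
    return Counter(word for word in words if len(word) >= min_length)

def reduce_counts(partial_counts):
    """Reduce step: merge the per-chunk counters into one."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def most_common_words(lines, top=10, chunk_size=1000, workers=4):
    # Split on line boundaries so no word is cut in half between chunks.
    chunks = ["\n".join(lines[i:i + chunk_size])
              for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce_counts(partial_counts).most_common(top)
```

With the book loaded as a list of lines, most_common_words(lines) would return the ten most frequent case-sensitive words of six or more letters, along with their counts.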

But I Have to Warn You

Gearman doesn’t have a security model. This can be a feature (it’s fast!) or a pain (it’s not enterprisey) depending on your project and environment. Anybody can connect and issue any command to the server. There is no authentication (and consequently no authorization) and no encryption.

What does this mean? Anybody with a connection can…

  • Connect, get a list of workers, then
    • Register as a listed worker and intercept work (for spying)
    • Register as a listed worker and repeatedly fail jobs (for denial of service)
    • Register as a listed worker and return bad data (poison worker)
    • Send bad data to all listed workers (poison job)
    • Set “maxqueue” on all listed workers to 0
  • Issue the “shutdown” command

You can limit access with iptables and limit eavesdropping with an stunnel proxy.

The Limitations

Gearman is limited by a few factors, though. Before you invest too heavily, you should be aware of the following:

  • There are few built-in administration tools and we haven’t found statistics that we like. If you want a point-and-click interface, be prepared to make one.
  • Jobs don’t expire. Consider a job that needs to be completed in the next 60 seconds or else it’s invalid (e.g. warm up the cache before the maintenance mode is disabled). Gearman won’t handle this properly. If your job is number 200 in line and won’t be handled for 300 seconds, it will still be run. There is no way to flag it as “handle within 60 seconds or discard.”
  • The server can have a maximum number of retries per job, but this is server wide and can’t be adjusted per job. Some jobs, like deleting temp files, can be abandoned after one try because there’s a high chance that it will be taken care of later. Other jobs, like backing up the database, should be retried several times before declaring failure.
  • The HTTP module isn’t a full-featured web server. If you ever find yourself in a situation where you need to set a header (like Content-Type) or use authorization, you’re going to need to route traffic through a proxy first.
  • If a job is dropped (e.g. the queue is full or the retry limit is reached), the client is never notified.
  • It’s a young project. The lead dev on our integration project found a bug during our testing. Later versions of Gearman don’t build easily on CentOS, but Nagios has posted a very easy workaround for this. Some of the documentation still doesn’t exist yet.

And for the pecl-gearman extension specifically:

  • Assigning a unique job handle doesn’t work.
  • Throwing exceptions doesn’t work, so be careful here. Using sendException() and setExceptionCallback() doesn’t seem to work, either. Post a comment here if you have a fix!
  • Class methods like do and echo are impossible to extend.

The best way we found to deal with some of these limitations was to use a standard data structure for input that included an expiration timestamp and to use a different standard data structure for output that included an exception flag / message.
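
As a rough illustration of that approach, here is a minimal Python sketch of such a pair of envelopes. It is not our actual format: the field names (expires_at, payload, error, message, result) and both helper functions are hypothetical, and the serialized strings stand in for whatever would be sent to and returned from a Gearman worker.

```python
import json
import time

def make_job(payload, ttl_seconds, now=None):
    """Client side: wrap the payload with an absolute expiration timestamp."""
    now = time.time() if now is None else now
    return json.dumps({"expires_at": now + ttl_seconds, "payload": payload})

def handle_job(raw_job, task, now=None):
    """Worker side: skip expired jobs and report failures in the output envelope."""
    now = time.time() if now is None else now
    job = json.loads(raw_job)
    if now > job["expires_at"]:
        # The job sat in the queue too long, so discard it without running it.
        return json.dumps({"error": True, "message": "job expired", "result": None})
    try:
        result = task(job["payload"])
    except Exception as exc:
        # Failures travel back as data instead of as a thrown exception.
        return json.dumps({"error": True, "message": str(exc), "result": None})
    return json.dumps({"error": False, "message": "", "result": result})
```

The client then checks the error flag on every response instead of relying on exceptions, and a job that waited in the queue past its deadline is discarded by the worker rather than run late.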

Check Out the Competition

If Gearman is interesting to you, but you want to check out other technologies in the arena, take a look at these other message queuing technologies.

Conclusion

Next time you start arguing about PHP and its lack of thread support, just pause, stroke your beard (or bald chin) thoughtfully, and pretend to give the situation some thought. Then announce that PHP can use Gearman, which is way more awesomer than threads!

Kurt started at Go Daddy in April 2007 in the hosting department as an internal hosting tools developer. He quickly moved up to lead developer and eventually became the team's Software Development Manager. Kurt has also worked on internal tools for the productivity apps team. In June 2011, Kurt started contributing to the WordPress core.

6 Comments on "Cloud Processing with Gearman"

  1. This post on implementing Gearman is a good example of the build vs. integrate an off-the-shelf solution debate. In this case, getting an off-the-shelf solution worked out well. I have been in several discussions where most lean towards building in-house.

    I think both solutions work, but Gearman is a great example of why it was successful. It does one thing and it does it well. So often vendors sell products that try to do everything and do it all in a mediocre way. Granted, it does lack some critical features such as a security model, but as you stated there are some workable solutions to this (all of which have been around for decades, I might add).

    One last comment: when I first started reading this, the first thing that came to my mind was “what about map-reduce”. I am glad you touched on this, because usually when I think about aggregating disparate data from multiple sources and searching through it… map-reduce pops into my mind.

  2. Ellsworth Mega says:

    Excellent detailed explanation

  3. The fact that gearman is still at version 0.33 can be a potential downer for many people. There are bugs and all. We downloaded the 0.33 version and found that libmysql was not compiled because of an autoconf issue (https://bugs.launchpad.net/gearmand/+bug/1001362) and then the unit tests were failing (https://bugs.launchpad.net/gearmand/+bug/1009148).

    It is good to know that huge installations like GoDaddy are using gearman in production. That gives a lot of folks like us belief in gearman. I believe it is an excellent piece of software that will get better with time.

  4. Alexis Padget says:

    Great write-up! You definitely got my SU thumbs upwards!

  5. It’s really a nice and useful piece of information. I am glad that you just shared this helpful info with us. Please keep us informed like this. Thank you for sharing.

  6. Odelia Renick says:

    Really impressive

 