Why Doesn’t PHP have Threads?
If you’ve ever had a zealous conversation with another engineer about PHP as a programming language, PHP’s lack of thread support has come up. Regardless of which side of the debate you were on, you were definitely talking about how technologies like Java, C, .NET, and others support concurrent processing with threads.
This article won’t be about PHP’s lack of threads. Before you get started reading, though, I want to stipulate a couple of things up front:
- PHP doesn’t have threads, but that’d be cool.
- Gearman is not the only message queue, event queue, or job queue technology out there.
And I ask that you, as the reader, keep an open mind. Consider that threads add complexity and Gearman brings some things to the table that threads don’t.
The Problem Statement
Some problems are just easier when you have more resources to throw at them. Maybe your project isn’t processing a data set worthy of a 40-node Hadoop cluster, but maybe your project could benefit from a little concurrent processing. Examples include:
- Fetching data from multiple sources
- Crawling a website
- Processing log files
- Sending text messages or e-mails
- Batch resizing or watermarking images
- Job / server / process / status monitoring
- Cache warming
- Executing unrelated background tasks (maybe your app doesn’t use the system cron)
In the very real and literal halls of Go Daddy, we’ve had these discussions about how to get concurrent processing into our PHP apps cleanly, but we never reached a consensus until Gearman.
The Aggregator App
We have an internal hosting support application that presents just this type of problem. The core function of the app is to search for accounts and display them to the user, but the account information is scattered across many different systems. The aggregator app pulls data from all of these disparate sources, applies some business logic, and presents everything in a unified interface. The search piece of the app, however, was painfully slow, so we started a project to speed it up in October 2009.
When someone searches for a hosting account in our internal hosting support system, a series of searches are performed. For example, if the user enters 1234567, this could identify a customer, a dedicated server, a virtual server, or a shared hosting account. So that we may present the user with a complete result set, the app must make up to 13 queries to search APIs across disparate systems.
The first option we explored was to reduce the number of queries. If the user enters a number, up to 11 queries are sent just to the billing system search API. After some research, we found that reducing these queries was not a viable option because of our system architecture.
Our second thought was a caching layer, which we threw out for two reasons. First, we always needed up-to-date results. Second, when we looked at the search queries, they were rarely repeated. There would not be enough cache hits to justify a cache at all.
Our next thought was speeding up how the data got to our app. Working with the billing search API team, we found that the API could scale horizontally, but not vertically: we couldn’t make any single query faster, but we could hit their service with several concurrent connections at our volume and the service would not get slower.
This led very nicely into our last thought: run the searches in parallel. We started looking around for a good solution and found that many engineers have tried to get parallel processing in PHP by spawning separate processes with pcntl_fork() (which is not supported in mod_php), using proc_open(), calling exec('php script.php &'), or using curl_multi_exec() to load multiple URLs.
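For reference, here’s the shape of the curl_multi approach; a minimal sketch, assuming the search APIs are plain HTTP endpoints (the URLs are made up):

```php
<?php
// Fetch several search APIs concurrently with curl_multi (URLs are illustrative).
$urls = [
    'billing'   => 'https://billing.example.com/search?q=1234567',
    'dedicated' => 'https://dedicated.example.com/search?q=1234567',
    'shared'    => 'https://shared.example.com/search?q=1234567',
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $name => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$name] = $ch;
}

// Drive all transfers until every handle has finished.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

$results = [];
foreach ($handles as $name => $ch) {
    $results[$name] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```

This works, but it only parallelizes HTTP fetches, and all the orchestration lives in the page request.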
And Then There Was Gearman
The lead engineer on this project came across a blog post by Rasmus Lerdorf (the PHP guy) entitled Playing with Gearman. This was the answer to everything we wanted and more. Within a few days we had a prototype for a search backend up and running, powered by Gearman workers, that was noticeably more responsive. Some of our test customers had more than 100 instances of each product in their accounts. Previously, the search page would time out. With Gearman-enabled concurrency, though, the page loaded in about 10 seconds.
Our process now fans each search out to a pool of Gearman workers and gathers the results as they complete.
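On the client side, that fan-out looks something like this; a minimal sketch using the pecl-gearman extension, with illustrative worker names rather than our real ones:

```php
<?php
// Fan the searches out to Gearman workers and run them in parallel.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730); // default gearmand port

// Collect each worker's response as it completes.
$results = [];
$client->setCompleteCallback(function (GearmanTask $task) use (&$results) {
    $results[$task->functionName()] = json_decode($task->data(), true);
});

// One task per search backend; workers process them concurrently.
foreach (['search_billing', 'search_dedicated', 'search_shared'] as $fn) {
    $client->addTask($fn, json_encode(['query' => '1234567']));
}

$client->runTasks(); // returns once every task has completed
```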
Gearman required remarkably few modifications to the search code. It’s been over 18 months since we deployed the new search backend. The kudos and “attaboys” have been relatively sparse. Not many people actually noticed the change (nothing looked different, after all), but we noticed usage going up and search times going down.
We’ve converted a few long-running tasks to Gearman, and it now seamlessly processes over 40,000 jobs per day with no hiccups.
Gearman is Mega-fork
“The way I like to think of Gearman is as a massively distributed, massively fault tolerant fork mechanism.”
We already had a cluster for the aggregator app, and Gearman gives you multi-server processing for FREE. We run multiple workers on each server, a Gearman job server on each server, and then tell the app to connect to both Gearman job servers. We can scale the load up or down depending on how many workers are running. If a server goes down, or workers die, Gearman will adapt. In the future, we can expose APIs via Gearman’s built-in HTTP module. We can even write workers in other languages and extend our app if other teams adopt Gearman.
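Pointing a client (or worker) at the whole cluster is a one-liner; the addresses below are made up:

```php
<?php
// Register every job server; the extension handles distribution and failover.
$client = new GearmanClient();
$client->addServers('10.1.1.5:4730,10.1.1.6:4730'); // comma-separated host:port list
```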
In Case You Missed It
I said we “start workers” on each server. That means we have a process to start and manage a group of processes that maintain connections to the Gearman job servers. We chose to write this in PHP to take advantage of the existing framework (i.e., logging and configuration) in our app. If you’re interested in using Gearman in your project, I suggest checking out GearmanManager. It’s a full-featured Gearman worker manager you can plug your code into.
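If you want to see what a manager like that keeps alive, the skeleton of a single worker process is tiny; a minimal sketch with an illustrative function name:

```php
<?php
// A long-running worker process: register a function, then wait for jobs.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);

$worker->addFunction('search_billing', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ...call the billing search API with $params here...
    return json_encode(['matches' => []]); // the return value is sent to the client
});

// Blocks on each call, handling one job at a time; run several of these per server.
while ($worker->work());
```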
The Sales Pitch
Gearman has some really cool features.
- Cloud processing. Just make sure the workers and clients know about the job servers; Gearman handles failover and load balancing. If you need to add more workers, more clients, or more servers, just update everybody’s configuration and (maybe) restart a few services.
- Job retries. Failed jobs are retried automatically and won’t get lost until they fail beyond the threshold you set.
- Persistent job queue. If your server reboots unexpectedly, your jobs will pick up where they left off. You can use different storage engines for the persistent queue, too: Gearman supports Drizzle, Tokyo Cabinet, SQLite, Memcached (or anything protocol-compliant), PostgreSQL, and MySQL.
- MySQL UDFs. You can start Gearman jobs from MySQL. Imagine working with an SSO system where you want to create / validate tokens for users right from the database query:
```
mysql> SELECT gman_do('generate_sso_token', '12345|testuser');
+----------------------------------------------------------------------------------------------+
| gman_do('generate_sso_token', '12345|testuser')                                              |
+----------------------------------------------------------------------------------------------+
| AUAAv/8Q8MsQsNUg7QV8mBotpcmVpSbnJLAJ8gmJSysi8QHZTlj/bsJmq/oixPEpj95n99Anf7v5m2HdQGNjb/gn+4fU |
+----------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
```
- HTTP module. You can submit jobs over Gearman’s built-in HTTP interface:

```
curl -i -XPOST http://10.1.1.5:8120/generate_sso_token -d '12345|testuser'
HTTP/1.0 200 OK
X-Gearman-Job-Handle: H:10.1.1.5:22
Content-Length: 92
Server: Gearman/0.22

AUAAv//UqMSkP9U6cRURM/KuPimqyv+gb9vHw/JiH3U/5f6/k7PX86CqcfJvbuODiXD3TQorpJkhxceisyNbsw8/H6lq
```
- Command-line client. You can submit jobs straight from the shell:

```
[user@host]$ gearman -h 10.1.1.5 -p 4730 -f decrypt_sso_token "AUAAv/904L9uOyVRm6hF0R0gGCgx3QyhWMcajhWiuXklXm0z4yL/Xyn6hy8wdCBUvD5nSFPIS5P/afCP5uI0mgvJciWi"
12345|testuser
```
You can use this pattern to answer questions like: What are the 10 most common 6+ character words in Moby Dick, taking case sensitivity into account?
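Here’s a rough sketch of how that could work: split the book into chunks, fan a counting task out to the workers, and merge the partial tallies. Everything here (the function name, file name, and chunk size) is illustrative, and it naively ignores words split at chunk boundaries:

```php
<?php
// worker.php — tally 6+ character words (case-sensitive) in one chunk of text.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('count_words', function (GearmanJob $job) {
    preg_match_all('/\b\w{6,}\b/', $job->workload(), $matches);
    return json_encode(array_count_values($matches[0]));
});
while ($worker->work());
```

```php
<?php
// client.php — fan the chunks out, then merge and rank the partial counts.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$totals = [];
$client->setCompleteCallback(function (GearmanTask $task) use (&$totals) {
    foreach (json_decode($task->data(), true) as $word => $count) {
        $totals[$word] = (isset($totals[$word]) ? $totals[$word] : 0) + $count;
    }
});

foreach (str_split(file_get_contents('mobydick.txt'), 1048576) as $chunk) {
    $client->addTask('count_words', $chunk); // one task per ~1MB chunk
}
$client->runTasks();

arsort($totals);
print_r(array_slice($totals, 0, 10, true)); // the ten most common 6+ character words
```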
As your demand increases, you can just spin up new workers.
But I Have to Warn You
Gearman doesn’t have a security model. This can be a feature (it’s fast!) or a pain (it’s not enterprisey), depending on your project and environment. Anybody can connect and issue any command to the server. There is no authentication (and consequently no authorization) and no encryption.
What does this mean? Anybody with a connection can…
- Connect, get a list of workers, and then:
  - Register as a listed worker and intercept work (spying)
  - Register as a listed worker and repeatedly fail jobs (denial of service)
  - Register as a listed worker and return bad data (a poison worker)
- Send bad data to all listed workers (a poison job)
- Set “maxqueue” on all listed workers to 0
- Issue the “shutdown” command
You can limit access with iptables and limit eavesdropping with an stunnel proxy.
Gearman is limited by a few factors, though. Before you invest too heavily, you should be aware of the following:
- There are few built-in administration tools, and we haven’t found statistics reporting that we like. If you want a point-and-click interface, be prepared to build one.
- Jobs don’t expire. Consider a job that must be completed within the next 60 seconds or else it’s invalid (e.g., warming the cache before maintenance mode is disabled). Gearman won’t handle this properly: if your job is number 200 in line and won’t be handled for 300 seconds, it will still be run. There is no way to flag it as “handle within 60 seconds or discard.”
- The server can enforce a maximum number of retries per job, but this is server-wide and can’t be adjusted per job. Some jobs, like deleting temp files, can be abandoned after one try because there’s a high chance they’ll be taken care of later. Other jobs, like backing up the database, should be retried several times before declaring failure.
- The HTTP module isn’t a full-featured web server. If you ever find yourself needing to set a header (like Content-Type) or use authorization, you’re going to need to route traffic through a proxy first.
- If a job is dropped (e.g., the queue is full or the retry limit is reached), the client is never notified.
- It’s a young project. The lead dev on our integration project found a bug during our testing. Later versions of Gearman don’t build easily on CentOS, though Nagios has posted a very easy workaround. Some of the documentation doesn’t exist yet.
And for the pecl-gearman extension specifically:
- Assigning a unique job handle doesn’t work.
- Throwing exceptions doesn’t work, so be careful here. Using sendException() and setExceptionCallback() doesn’t seem to work, either. Post a comment here if you have a fix!
- Class methods like do and echo are impossible to extend (both are reserved words in PHP).
The best way we found to deal with some of these limitations was to use a standard data structure for input that includes an expiration timestamp, and a different standard data structure for output that includes an exception flag and message.
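In code, that envelope might look like this; a sketch of the idea rather than our exact format (runSearch() is a made-up stand-in for the real work):

```php
<?php
// Input envelope: the client stamps an expiration time on every job.
$workload = json_encode([
    'expires_at' => time() + 60, // discard the job if it's still queued after this
    'payload'    => ['query' => '1234567'],
]);

// Inside the worker: honor the expiration, and report errors in-band.
$job = json_decode($workload, true);
if (time() > $job['expires_at']) {
    $response = ['is_exception' => true, 'message' => 'job expired', 'data' => null];
} else {
    try {
        $response = ['is_exception' => false, 'message' => '', 'data' => runSearch($job['payload'])];
    } catch (Exception $e) {
        // Exceptions can't cross the wire reliably, so flag them in the output instead.
        $response = ['is_exception' => true, 'message' => $e->getMessage(), 'data' => null];
    }
}
return json_encode($response); // in a real worker, this is the job callback's return value
```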
Check out the competition
If Gearman is interesting to you, but you want to check out other technologies in the arena, take a look at these other message queuing technologies.
- Beanstalkd – not to be confused with AWS Elastic Beanstalk. It lacks a persistent job queue, but adds a “process at” time
- Apache ActiveMQ – an enterprise-grade message broker
- RabbitMQ and other AMQP brokers – alternatives to ActiveMQ
- 0MQ (ZeroMQ) – a brokerless messaging library in the same arena (not an AMQP broker)
- Peafowl, Starling – message queues that speak the memcached protocol
Next time you start arguing about PHP and its lack of thread support, just pause, stroke your beard (or bald chin) thoughtfully, and pretend to give the situation some thought. Then announce that PHP can use Gearman, which is way more awesomer than threads!