Thursday, October 20, 2011

Parallel Processing benchmark in PHP ( CURL version )

Introduction
There are quite a few tools that give PHP parallel-processing capabilities.  Well, not multiple threads, but mostly multiple processes (or, in the case of curl_multi, several transfers running concurrently inside one process). The outcome is very similar though: PHP does several things at the same time without blocking.

Tools available today that I can think of:
  • Gearman
  • Fork (pcntl_fork)
  • curl_multi
  • rolling curl ( just another flavor of curl_multi )
  • various message queues, such as RabbitMQ

This post focuses on benchmarking "curl_multi" against "rolling curl", since "rolling curl" is generally believed to be more efficient than "curl_multi" without the rolling. However, the benchmark result confuses me.

Note: "Rolling Curl" still uses curl_multi underneath. The difference is that it swaps out (rolls out) each finished job as soon as it completes instead of waiting for the longest job to return.  That way, you can start processing returned data without waiting for every request to finish.
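To make the "rolling" idea concrete, here is a minimal sketch of it built directly on curl_multi. This is not the Rolling Curl library's code or API: the function name rolling_curl_sketch, the $window parameter and the $onDone callback are all hypothetical names chosen for illustration. It assumes the curl extension is loaded.

```php
<?php
// Hypothetical helper (NOT the Rolling Curl library's API): keeps at most
// $window transfers in flight and hands each response to $onDone as soon
// as that transfer completes.
function rolling_curl_sketch(array $urls, int $window, callable $onDone): void
{
    $mh = curl_multi_init();
    $queue = array_values($urls);
    $inFlight = 0;
    $window = max(1, $window);

    // Start the next queued URL, if there is one.
    $addNext = function () use (&$queue, &$inFlight, $mh) {
        if (!$queue) {
            return;
        }
        $url = array_shift($queue);
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember which URL this is
        curl_multi_add_handle($mh, $ch);
        $inFlight++;
    };

    // Fill the initial window.
    for ($i = 0; $i < $window; $i++) {
        $addNext();
    }

    while ($inFlight > 0) {
        curl_multi_exec($mh, $running);

        // Roll out every transfer that has finished so far.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $onDone($url, curl_multi_getcontent($ch), $info['result']);
            curl_multi_remove_handle($mh, $ch);
            $inFlight--;
            $addNext(); // refill the freed slot
        }

        // Wait for socket activity; fall back to a short sleep if
        // curl_multi_select() reports -1.
        if ($inFlight > 0 && curl_multi_select($mh, 0.5) === -1) {
            usleep(1000);
        }
    }
    curl_multi_close($mh);
}
```

A call such as rolling_curl_sketch($urls, 5, function ($url, $body, $errno) { /* process $body */ }); would process each body while the remaining requests are still running, which is the behaviour described in the note above.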

Benchmark Starts Here
I benchmarked Rolling Curl against a normal implementation of curl_multi (without the rolling).  The results are very close, but quite often they show that Rolling Curl is actually slower.

I had both scripts do the same thing: "curl" 20 different URLs and count from 0 to 500 after a job is done. I simulated 5 concurrent users, each calling the script 10 times (50 transactions in total).  Here are the results I got back.

Rolling Curl

Transactions:            50 hits
Availability:        100.00 %
Elapsed time:         44.57 secs
Data transferred:         0.00 MB
Response time:          4.08 secs
Transaction rate:         1.12 trans/sec
Throughput:          0.00 MB/sec
Concurrency:          4.58
Successful transactions:          50
Failed transactions:            0
Longest transaction:         9.54
Shortest transaction:         1.78

curl_multi without rolling

Transactions:            50 hits
Availability:        100.00 %
Elapsed time:         39.11 secs
Data transferred:         0.00 MB
Response time:          3.51 secs
Transaction rate:         1.28 trans/sec
Throughput:          0.00 MB/sec
Concurrency:          4.49
Successful transactions:          50
Failed transactions:            0
Longest transaction:         8.96
Shortest transaction:         1.94
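
As a quick sanity check on the two reports above, the snippet below recomputes the transaction rate and the gap in mean response time from the reported figures. It assumes the transaction rate is simply hits divided by elapsed time, which is consistent with the numbers shown; the array keys are my own labels.

```php
<?php
// Figures copied from the two reports above.
$runs = [
    'rolling'     => ['hits' => 50, 'elapsed' => 44.57, 'response' => 4.08],
    'non_rolling' => ['hits' => 50, 'elapsed' => 39.11, 'response' => 3.51],
];

$rates = [];
foreach ($runs as $name => $r) {
    // Assumed formula: completed transactions / elapsed wall-clock seconds.
    $rates[$name] = round($r['hits'] / $r['elapsed'], 2);
    printf("%-12s %.2f trans/sec\n", $name, $rates[$name]);
}

// Relative gap in mean response time, rolling vs. non-rolling.
$gapPercent = round(
    ($runs['rolling']['response'] - $runs['non_rolling']['response'])
    / $runs['non_rolling']['response'] * 100,
    1
);
printf("rolling is %.1f%% slower on mean response time\n", $gapPercent);
```

The recomputed rates match the reported 1.12 and 1.28 trans/sec, and the gap in mean response time works out to roughly 16%. With only 50 transactions per run against live URLs, a gap of that size could plausibly be network noise rather than a real difference between the two approaches.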

At first I thought Rolling Curl would be faster, since it does the counting as soon as a particular "curl" is done and rolled out.  However, the benchmark seems to be telling us that it is faster without rolling the curl.  The following is the test script I used for the comparison with Rolling Curl.

Is there something I'm missing in this test script?  Any thoughts?



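// Testing function: plain curl_multi without rolling.
// Fetches every URL in $nodes in parallel and returns the bodies keyed by URL.
function multiple_curl_request($nodes){
        $mh = curl_multi_init();
        $curl_array = array();
        foreach($nodes as $i => $url)
        {
            $curl_array[$i] = curl_init($url);
            curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($mh, $curl_array[$i]);
        }

        // Poll until every transfer has finished, sleeping 1 ms between polls.
        $running = NULL;
        do {
            usleep(1000);
            curl_multi_exec($mh, $running);
        } while($running > 0);

        // Only now, after the slowest request has returned, process the results.
        $res = array();
        foreach($nodes as $i => $url)
        {
            $res[$url] = curl_multi_getcontent($curl_array[$i]);
            // $j instead of $i, so the loop does not overwrite the foreach key
            for($j = 0; $j < 500; $j++) {
                // just counting and do nothing
            }
        }

        foreach($nodes as $i => $url){
            curl_multi_remove_handle($mh, $curl_array[$i]);
        }
        curl_multi_close($mh);
        return $res;
}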
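
One way to take the web server and the load-testing tool out of the picture is to time the function directly inside PHP. The sketch below is only an illustration: time_runs is a hypothetical helper, and the usleep() call is a stand-in for the real request batch.

```php
<?php
// Hypothetical helper: runs $fn $runs times and reports wall-clock seconds.
function time_runs(callable $fn, int $runs): array
{
    $samples = [];
    for ($n = 0; $n < $runs; $n++) {
        $start = microtime(true);
        $fn();
        $samples[] = microtime(true) - $start;
    }
    return [
        'min'  => min($samples),
        'mean' => array_sum($samples) / count($samples),
        'max'  => max($samples),
    ];
}

// Stand-in workload; swap in a closure that calls
// multiple_curl_request($nodes) or the rolling version.
$stats = time_runs(function () {
    usleep(2000); // placeholder for the real request batch
}, 5);

printf(
    "min %.4fs  mean %.4fs  max %.4fs\n",
    $stats['min'],
    $stats['mean'],
    $stats['max']
);
```

Running both versions through the same helper, against the same URL list and several times each, gives min, mean and max figures that can be compared side by side.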



Reply from Josh ( Author of Rolling Curl )
I benchmarked Rolling Curl when I first wrote it and it was significantly faster.  Of course, since you're measuring things on the open web, there are lots of variables that come into play: is the network just slow, are you overloading your own server, etc.  Keep in mind, the benefits of Rolling Curl mostly show up when you are dealing with large data sets.

I'd be interested to see the full code you used for your benchmark, although I won't have time to debug it for you.

For a good benchmark, I would suggest you download the Alexa top 1,000 sites using regular curl_multi and compare the results with downloading the same list using Rolling Curl.  I think you will see the difference, i.e. regular curl_multi will probably choke on you.
