Unix HOWTOs and Tips Short unix command line administration tips and scripts


PHP and curl: download multiple URLs using multiple http proxies, in parallel

Sometimes you may want to download content from an HTTP server that has an overly restrictive banning policy (for example, to scrape, crawl, or data-mine the content for later processing).

If that server bans your IP address after too many requests in a short time, or accepts only "human" clients (modern browsers) and rejects all others, you will quickly be blocked. To get around the ban, you may want to send your requests through public HTTP proxies to anonymize them.

The problem with using proxies is that most public HTTP proxies have at least one of these drawbacks:
1) They offer only limited bandwidth to the public.
2) They sometimes return ads instead of the content that you want.
3) They are located in a foreign country, so the latency of requests and responses is usually high. That can cause timeouts and failed downloads, which you have to detect so that you can retry the same URL later through a different proxy server.
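
The retry step described in point 3 can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: the function names nextProxy and downloadWithRetries, the $fetch callback, and the limit of three attempts are all hypothetical and are not taken from any particular project. The callback is assumed to return the page body on success and null on a timeout, an ad page, or any other failure.

```php
<?php
// Hypothetical helper: pick a random proxy that this URL has not tried yet.
// Returns null when every proxy in the list has already been used.
function nextProxy(array $proxies, array $alreadyTried): ?string
{
    $remaining = array_values(array_diff($proxies, $alreadyTried));
    if ($remaining === []) {
        return null;
    }
    return $remaining[array_rand($remaining)];
}

// Hypothetical retry wrapper: $fetch is any callable taking (url, proxy)
// and returning the body as a string, or null if the download failed.
function downloadWithRetries(string $url, array $proxies, callable $fetch, int $maxAttempts = 3): ?string
{
    $tried = [];
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $proxy = nextProxy($proxies, $tried);
        if ($proxy === null) {
            break; // no untried proxy left for this URL
        }
        $tried[] = $proxy;
        $body = $fetch($url, $proxy);
        if ($body !== null) {
            return $body; // success, stop retrying
        }
    }
    return null; // all attempts failed
}
```

Passing the download step in as a callback keeps the retry policy separate from the transport, so the same wrapper works whether the actual request is made sequentially or as part of a parallel batch.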

If you have many URLs to process, you can download them sequentially, but then the total time is the sum of the individual download times, and, as already said, each of those can be quite long.

The solution, of course, is to download the URLs in parallel. In that case the total time is roughly equal to the maximum of the individual download times (provided all downloads really run at once), which may be many times smaller than their sum.
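
To make the difference concrete, here is a small worked example in PHP. The four download times are made-up figures chosen for illustration, not measurements.

```php
<?php
// Hypothetical per-URL download times in seconds (illustrative values only).
$seconds = [4.0, 12.5, 7.0, 9.5];

// Sequential: each download waits for the previous one, so the times add up.
$sequential = array_sum($seconds);

// Parallel: all downloads run at once, so the slowest one sets the total.
$parallel = max($seconds);

printf("sequential: %.1f s, parallel: about %.1f s\n", $sequential, $parallel);
```

With these figures the sequential run takes 33.0 seconds, while the parallel run takes about 12.5 seconds, the time of the slowest single download.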

PHP and its CURL extension can help you with this task. Here is a project that does just that: Efficient PHP Parallel Downloader using multiple public http proxies

It encapsulates the logic of retrying incomplete downloads and handles the parallel downloading of the URLs with an efficient cURL loop that waits for the proxy responses using curl_multi_select instead of busy polling. Using it, you can download hundreds of URLs in parallel with CPU usage under 1%.
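
For readers who have not used the curl_multi functions before, a loop of this kind might look roughly like the following. This is a minimal sketch written for this article, not the code of the project mentioned above: the function name fetchAllInParallel, the timeout values, and the shape of the input array (URL as key, proxy address as value) are all assumptions. It requires the PHP curl extension.

```php
<?php
// Hypothetical sketch: download every URL through its assigned proxy, in parallel.
// Input:  ['http://example.com/a' => 'proxyhost:port', ...]
// Output: [url => body string, or null if that download failed]
function fetchAllInParallel(array $urlToProxy, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = []; // keep the handles referenced while the loop runs
    foreach ($urlToProxy as $url => $proxy) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,            // return the body, do not print it
            CURLOPT_PROXY          => $proxy,          // send the request through this proxy
            CURLOPT_CONNECTTIMEOUT => 10,              // assumed value, tune as needed
            CURLOPT_TIMEOUT        => $timeoutSeconds, // overall limit per download
            CURLOPT_PRIVATE        => $url,            // tag the handle with its URL
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[] = $ch;
    }

    $results = [];
    do {
        $status = curl_multi_exec($multi, $stillRunning);

        // Collect every transfer that has finished since the last pass.
        while (($info = curl_multi_info_read($multi)) !== false) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            $results[$url] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null; // timeout, refused connection, etc.
            curl_multi_remove_handle($multi, $ch);
        }

        // Sleep until a socket has activity (at most 1 second) instead of spinning.
        if ($stillRunning > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(100000); // select could not wait; back off briefly
        }
    } while ($stillRunning > 0 && $status === CURLM_OK);

    curl_multi_close($multi);
    // Any URL that never reported a result is marked as failed.
    return $results + array_fill_keys(array_keys($urlToProxy), null);
}
```

The call to curl_multi_select is what keeps CPU usage low: the process blocks until at least one connection has data, rather than calling curl_multi_exec in a tight loop. A URL whose result is null is a candidate for a later retry through a different proxy.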
