Unix HOWTOs and Tips Short unix command line administration tips and scripts


PHP and curl: download multiple URLs using multiple http proxies, in parallel

Sometimes you may want to download content from an HTTP server that has an overly restrictive banning policy (for example, to scrape, crawl, or data-mine the content for later processing).

If that server bans your IP after too many requests in a short time, or allows only "human" clients (modern browsers) and rejects all others, you will quickly be stopped. In that case, to get around the ban, you may want to use public HTTP proxies to anonymize your requests.

The problem with public HTTP proxies is that most of them:
1) have restricted bandwidth.
2) may sometimes return ads instead of the content that you want.
3) are located in a foreign country, so the latency of requests and responses is usually high. That may cause timeouts and failed downloads, which you have to detect and then retry later for the same URL with a different proxy server.

If you have to process many URLs, you can do it sequentially, but then the total processing time is the sum of the individual download times and, as already said, each of them can be quite large.

The solution, of course, is to download the URLs in parallel. In that case the total processing time is roughly equal to the longest individual download (provided there is enough bandwidth to run them all at once), which may be many times smaller than the sum.
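
The sum-versus-maximum argument can be checked with a small simulation. This is only a sketch in Python (not the PHP approach discussed in this post): `fake_download` is a made-up stand-in that sleeps instead of touching the network, and the delays are arbitrary illustrative values.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Hypothetical stand-in for one proxied download: just sleeps `delay` seconds."""
    time.sleep(delay)
    return delay


def run_sequential(delays):
    """Run the downloads one after another and return the elapsed time."""
    start = time.monotonic()
    for delay in delays:
        fake_download(delay)
    return time.monotonic() - start


def run_parallel(delays):
    """Run all downloads at once (one thread each) and return the elapsed time."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_download, delays))
    return time.monotonic() - start


if __name__ == "__main__":
    delays = [0.3, 0.1, 0.2]  # arbitrary example latencies, in seconds
    print("sequential: %.2fs" % run_sequential(delays))
    print("parallel:   %.2fs" % run_parallel(delays))
```

The sequential run should take about the sum of the delays, and the parallel run about the largest one plus a little thread start-up overhead.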

PHP and its cURL extension can help you with this task. Here is a project that does just that: Efficient PHP Parallel Downloader using multiple public http proxies

It encapsulates the logic of retrying when a download is incomplete, and handles the parallel downloading of the URLs with an efficient cURL loop that waits for the proxy responses using curl_multi_select instead of busy polling. Using it, you can process hundreds of URLs downloading in parallel with CPU usage below 1%.
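
To make the retry-through-another-proxy idea concrete, here is a rough Python sketch. It is not the code of the project mentioned above, and it swaps in a thread pool where the project uses a curl_multi_select loop. The function names, the `max_attempts` and `max_workers` defaults, and the proxy strings are all assumptions made for illustration.

```python
import itertools
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_via_proxy(url, proxy, timeout=10):
    """Download `url` through one HTTP proxy, given as e.g. "http://host:port"."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def download_with_retries(url, proxies, fetch=fetch_via_proxy, max_attempts=3):
    """Try up to `max_attempts` proxies in turn; return the body, or None if all fail."""
    for proxy in itertools.islice(itertools.cycle(proxies), max_attempts):
        try:
            return fetch(url, proxy)
        except Exception:
            # Timeout, refused connection, truncated response...: try the next proxy.
            continue
    return None


def download_all(urls, proxies, fetch=fetch_via_proxy, max_workers=20):
    """Download all URLs in parallel threads; return a dict mapping url -> body or None."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = pool.map(
            lambda url: download_with_retries(url, proxies, fetch), urls)
        return dict(zip(urls, bodies))
```

A call would look like `download_all(list_of_urls, ["http://proxy1:8080", "http://proxy2:3128"])`, where the proxy addresses are placeholders. The sketch does not detect proxies that return ads instead of the requested content, and every URL starts with the same first proxy, so a real downloader would want to validate responses and spread the load.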

Filed under: Uncategorized No Comments

How to conveniently serve a folder of files over http

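It is very simple actually (although not as fast as using a dedicated web server like nginx). Just use Python. The command below is for Python 2; on Python 3 the module was renamed, so the equivalent is python3 -m http.server: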

cd YourFolder
python -m SimpleHTTPServer

That's it! After running this command, the files in your current folder are accessible over port 8000, so you can send your peers a link like http://YourIPHere:8000/ and they will be able to browse the shared folder.

You can change the port very easily too. Just append it to the end of the command, like this:

python -m SimpleHTTPServer 11001

That will run a web server for the current folder on port 11001, instead of the default port 8000.

Stopping the Python web server process is simple too: just press Ctrl-C, as you would for any other long-running shell process.
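
The same thing can be done from a script, which is handy if you want to pick the folder and port programmatically. The following is a minimal sketch, assuming Python 3.7 or newer (where the module is called `http.server`); `make_server` is a helper name invented for this example.

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Build (but do not start) a web server that serves files from `directory`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    # An empty host string means "listen on all network interfaces".
    return http.server.ThreadingHTTPServer(("", port), handler)


if __name__ == "__main__":
    server = make_server(".", 8000)
    try:
        server.serve_forever()  # runs until interrupted
    except KeyboardInterrupt:
        pass  # Ctrl-C stops the server, as with the one-liner
    finally:
        server.server_close()
```

As with the one-liner, anyone who can reach the chosen port can read every file in the folder, so stop the server when you are done sharing.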


How to make tab switching snappy and fast again in Google Chrome

How to make tab switching in Google Chrome or Chromium fast again after upgrading to version 14:

Switching tabs is something you do many times each day if you are a power user and read many web pages. So slow tab switching really affects the perception of Google Chrome as a snappy, fast, bug-free and useful browser.

So without further ado, here is the quick solution for the slow tab switching:

Move your bookmarks from the 'Bookmarks Bar' to a new folder under 'Other Bookmarks', and tab switching should be snappy again in both Google Chrome and Chromium.

The 'Bookmarks Bar' is the default folder to which new bookmarks are added. People like me, who habitually bookmark every interesting page they visit for better personalization and later searchability, soon collect a large number of bookmarks in this folder, slowing Chrome more and more.

I cannot remember the exact version where the slowdown started, but I think it was around Google Chrome version 14. Before that version I had approximately the same large number of bookmarks, but Chrome was snappy. At first I thought it was a problem with the development version of Chrome, so I used Chromium for a while, because it remained fast (it lags several versions behind on my machine). A couple of weeks ago I updated Chromium to version '14.0.835.202 (Developer Build 103287 Linux) Ubuntu 11.04' and it started to lag, just like Google Chrome.

Moving the bookmarks to a different folder, as suggested in http://code.google.com/p/chromium/issues/detail?id=87235 , really helped, and Google Chrome is fast again :-) .

