80legs Is a Pain in the Neck

Andrew Stephens, Friday the 24th of August, 2012

This post was automatically imported from my old sandfly.net.nz blog. It may look a little weird since it was not originally written for this format.

I got an email from my hosting company (OpenHost - I like them) a couple of days ago telling me that this site had exceeded its allocated bandwidth for the month. I found that very unlikely since I pay for 2 GB a month and never even get close to that. But investigation revealed it to be true:

193.107.176.92 - - [24/Aug/2012:05:59:03 +1200] "GET /blog/wp-content/uploads/2012/07/muriwai_gannets-1024x494.jpg HTTP/1.1" 200 19040 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
94.124.246.106 - - [24/Aug/2012:05:59:04 +1200] "GET /experiments/sketchthispage HTTP/1.1" 301 492 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
188.134.41.144 - - [24/Aug/2012:05:59:04 +1200] "GET /blog/tag/48hours/ HTTP/1.1" 200 13939 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
77.120.47.229 - - [24/Aug/2012:05:59:04 +1200] "GET /blog/tag/programming/ HTTP/1.1" 200 33193 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
77.120.47.229 - - [24/Aug/2012:05:59:06 +1200] "GET /blog/tag/wii/ HTTP/1.1" 200 9771 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"
188.134.41.144 - - [24/Aug/2012:05:59:06 +1200] "GET /blog/tag/horror/ HTTP/1.1" 200 8668 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"

This is a small fragment of the logs generated by a web crawler that I had never heard of before - 80legs. A large proportion of the traffic this site gets comes from bots, and normally I don't mind. It is usually search engines indexing the contents - something I want to encourage. But 80legs decided to download this entire site over and over, for no reason that I can see.
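For anyone wanting to do the same kind of investigation, here is a rough sketch of how the bytes served could be totalled per user agent. It assumes the log is in Apache's "combined" format, as in the fragment above. The function names and the two embedded sample lines (copied from that fragment) are just for illustration, not the script I actually used.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# Two lines copied from the log fragment above, used only as demo input.
SAMPLE_LINES = [
    '193.107.176.92 - - [24/Aug/2012:05:59:03 +1200] "GET /blog/wp-content/uploads/2012/07/muriwai_gannets-1024x494.jpg HTTP/1.1" 200 19040 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"',
    '77.120.47.229 - - [24/Aug/2012:05:59:04 +1200] "GET /blog/tag/programming/ HTTP/1.1" 200 33193 "-" "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"',
]


def bytes_by_agent(lines):
    """Sum the response sizes per user-agent string, skipping unparseable lines."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        size = match.group("size")
        # Apache logs "-" when no body was sent (e.g. some 304 responses).
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return dict(totals)


def top_agents(lines, limit=10):
    """Return (agent, total_bytes) pairs, biggest consumers first."""
    totals = bytes_by_agent(lines)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    for agent, total in top_agents(SAMPLE_LINES):
        print(f"{total:>12}  {agent}")
```

To run it against a real log, pass an open file handle instead of SAMPLE_LINES, since the functions accept any iterable of lines. The location of the access log depends on the host.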

They are very proud of their "grid computing platform", but I think that the individual members don't talk to each other much, since I see different machines downloading the same information simultaneously. Most bots take the trouble to slowly spider a site over a few days, but 80legs just seems to charge ahead. Worse, it re-downloads data it has already seen, such as images linked from multiple pages. Whatever the reasons, 80legs might end up costing me money and I can live without it.

User-agent: 008
Disallow: /
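As a sanity check, the rule above can be fed to Python's standard urllib.robotparser to confirm it parses the way I expect. This is only a sketch: it shows that the rule blocks the "008" token and leaves other bots alone, not that the crawler itself will obey it. The helper name and the test paths are made up for the example. Note that Python's parser matches on the product token in front of the first "/", so the check passes "008" rather than the full Mozilla-style header from the logs.

```python
from urllib.robotparser import RobotFileParser

# The same two lines as the robots.txt rule above.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /
"""


def is_blocked(agent_token, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt disallows agent_token from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    # "008" is the token 80legs uses in its user-agent string.
    print("008 blocked:", is_blocked("008", "/blog/tag/programming/"))
    # Other crawlers have no matching entry, so they stay allowed.
    print("Googlebot blocked:", is_blocked("Googlebot", "/blog/tag/programming/"))
```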

So long, 80legs.