You may have noticed that Clagnut was down for most of last week. The culprits were referrer spam robots maxing out my database connections, resulting in my ISP putting my account on hold. Unfortunately my ISP also prevented email and FTP access which didn’t exactly help the situation.
The main culprits were pharmaceutical pedlars using sub-domains such as buy-fioricet.drop.to. Fortune City owns the drop.to domain (and others like it), and flogs the sub-domains to spammers. No doubt Fortune City would claim they don’t knowingly sell to spammers, but they are well aware these lowlifes buy their services as the offending web sites have been shut down. Unfortunately shutting the sites down doesn’t stop the robots effectively inflicting a denial of service attack on sites like mine.
Clagnut suffers from the robots because it has dynamic pages – each blog entry is pulled from a database when you view the page. The queries and database tables are well optimised so it’s not normally a problem, but if an army of robots turns up, as happened last week, then they use up all the available database connections – in my case maxing out at 200 simultaneous connections per second. The irony is that all referrals listed here carry rel='nofollow' attributes so they won’t even gain any benefit from being shown.
As of this weekend I’m successfully fending off referrer spam robots with a blacklist of referrers which is checked before a database connection is made. In addition, a 403 Forbidden is given to all user agents claiming to have come from a .to domain, by using an .htaccess rule like this:
# Forbid requests whose referrer host ends in .to
RewriteCond %{HTTP_REFERER} ^https?://[^/]+\.to(:[0-9]+)?(/|$) [NC]
RewriteRule .* - [F]
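The same idea, checking the referrer before any database connection is opened, can be sketched in application code. This is only an illustration in Python, not the code Clagnut actually runs: the names `is_spam_referrer` and `handle_request`, the `load_page` callback and the blacklist entries are all assumptions made for the example.

```python
from typing import Callable, Tuple
from urllib.parse import urlsplit

# Illustrative entries only; the real blacklist is not shown in the post.
BLACKLISTED_FRAGMENTS = ("buy-fioricet",)
BLOCKED_TLDS = ("to",)


def is_spam_referrer(referrer: str) -> bool:
    """Return True if the Referer header points at a blacklisted host."""
    if not referrer:
        return False  # direct visits send no referrer at all
    try:
        host = (urlsplit(referrer).hostname or "").lower()
    except ValueError:
        return False  # malformed header; leave it to other checks
    if not host:
        return False
    # Compare the last label of the host name, so www.toyota.com is not caught.
    if host.rsplit(".", 1)[-1] in BLOCKED_TLDS:
        return True
    return any(fragment in host for fragment in BLACKLISTED_FRAGMENTS)


def handle_request(referrer: str, load_page: Callable[[], str]) -> Tuple[int, str]:
    """Reject spam referrers before load_page (the database work) is called."""
    if is_spam_referrer(referrer):
        return 403, "Forbidden"
    return 200, load_page()
```

Because the check only inspects a request header, a rejected robot costs a string comparison rather than one of the limited database connections.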
Clearly the blacklist approach is not scalable, but what else can I do? The robots usually identify themselves as IE6 so I can’t filter that way, and I wouldn’t want to keep out legitimate robots such as search engines, so I’m not really sure what my next steps can be. Is there something I should be getting my ISP to do as well? Any help gratefully received…
Jon Hicks wrote:
Maybe have a word with Shaun Inman. For Mint, he was able to block out spam robots in the visits stats.
Mats Lindblad wrote:
Or we could hire some bloodhounds and hunt them down… ;)
No seriously, there should be a blacklist that one could subscribe to to block half-wits like this.
Roger Johansson wrote:
My .htaccess file contains a blacklist of about 150 referrers that are redirected back to wherever they came from. It works well and I haven’t had any performance problems. Yet ;-).
Btw, my referrals aren’t even made public, so spamming me is completely pointless. But the robots don’t know that.
Chris Winfield wrote:
http://www.modsecurity.org/
Only a handful of hosts will have it installed but you could always suggest they give it a whirl.
bpt wrote:
If I am not mistaken, Mint “blocks” referer spam by using JavaScript to track visitors. Spambots don’t support JavaScript, so they aren’t counted as visitors (neither are Lynx/w3m/etc users, people with JavaScript disabled, etc). This usefully blocks referer spam from logs (if almost all users have JavaScript enabled), but it doesn’t do anything to reduce server load from spambot traffic.
If the bots are coming from a few IP addresses, you could implement surge protection.
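One way to read "surge protection" is a per-IP request limit over a short time window. The following is a minimal Python sketch under that assumption; the class name `SurgeProtector` and the default limits are invented for illustration, not taken from any particular product.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SurgeProtector:
    """Allow at most max_requests per window_seconds for each IP address."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: refuse before doing any real work
        hits.append(now)
        return True
```

A caller would answer with an error status whenever `allow(ip)` returns False. Note that this sketch keeps an entry for every IP address it has seen, so a long-running process would also need to prune idle addresses.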
Dustin Diaz wrote:
Just call the mob and have ‘em off’d.
Jens Meiert wrote:
Huh, wth. And then a nice provider – I’d go bananas.
Roger, I’m curious about your .htaccess ;)
Rich wrote:
Me too Roger. How exactly does one redirect a spammer whence they came?
Tom wrote:
Could you not create a funky cache, so the pages don’t have to be pulled from the database every time?
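A cache of this kind might look something like the Python sketch below: rendered pages are kept in memory for a few minutes, and the database is only consulted when an entry is missing or stale. The `PageCache` class, its `render` callback and the five-minute default are assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PageCache:
    """Keep rendered pages in memory for ttl_seconds before re-rendering."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, html)

    def get(self, key: str, render: Callable[[], str],
            now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # fresh copy: no database query needed
        html = render()  # the only place the database would be touched
        self._store[key] = (now, html)
        return html

    def invalidate(self, key: str) -> None:
        """Drop a page, for example after a new comment is posted."""
        self._store.pop(key, None)
```

This reduces database load but not bandwidth, since every request still receives the full page.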
Roger Johansson wrote:
The last few lines from my block of referrer spammers:
RewriteCond %{HTTP_REFERER} zindagi [OR]
RewriteCond %{HTTP_REFERER} zoker9 [OR]
RewriteCond %{HTTP_REFERER} zone-b51
RewriteRule ^.* h-t-t-p://%{REMOTE_ADDR}/ [L]
(http replaced with h-t-t-p to avoid auto-linking in this comment)
I think I found the technique at Caveat lector: http://cavlec.yarinareth.net/archives/2005/01/11/killing-referrer-spam/ .
More links to ways of dealing with referrer spam:
http://www.456bereastreet.com/movabletype/mt-search.cgi?IncludeBlogs=1&search=referrer+spam
Jim wrote:
> The robots usually identify themselves as IE6 so I can’t filter that way
Are you sure about that? How many visitors do you get that use Internet Explorer?
In any case, I concur with Tom; hitting the database each time when you don’t need to seems very wasteful.
Rich wrote:
Yes, because I have been tracking the user agent strings.
That’s entirely my point. There’s nothing wrong with database-driven dynamic pages – plenty of sites work that way including hugely popular ones like Multimap and everything is hunky dory for normal amounts of traffic. The attack I was under from robots effectively increased my traffic by 1000 times!
Even if I had static pages I would have shot through my bandwidth limit in about an hour, so that’s why I was asking for advice for keeping out the robots at a higher level.
Ben wrote:
The real problem as someone said is that you hit the db each time. There is no such thing as ‘optimized’ code that relies on redundant db querying.
Caching would be one solution; another would be to simply write your posts out after X days so they become flat files. You could simply dump them out as XML so you still have the data separate.
You could even just dump them out as XML from the very start, when someone posts a comment you republish ‘1596_comments.xml’ from the data in the db. That would mean the majority of visitors won’t even touch your database.
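Republishing on each new comment could be sketched roughly as follows in Python. The function names, the XML layout and the `(author, body)` input shape are all assumptions; only the `1596_comments.xml` naming pattern comes from the comment above.

```python
import os
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def render_comments_xml(entry_id: int, comments: Iterable[Tuple[str, str]]) -> bytes:
    """Build the XML document for one entry from (author, body) pairs."""
    root = ET.Element("comments", {"entry": str(entry_id)})
    for author, body in comments:
        node = ET.SubElement(root, "comment", {"author": author})
        node.text = body  # ElementTree escapes &, < and > when serialising
    return ET.tostring(root, encoding="utf-8")


def publish_comments(entry_id: int, comments: Iterable[Tuple[str, str]],
                     out_dir: str) -> str:
    """Write <entry_id>_comments.xml into out_dir and return its path."""
    path = os.path.join(out_dir, f"{entry_id}_comments.xml")
    tmp_path = path + ".tmp"
    # Write to a temporary name first, then swap it in, so visitors never
    # fetch a half-written file.
    with open(tmp_path, "wb") as handle:
        handle.write(render_comments_xml(entry_id, comments))
    os.replace(tmp_path, path)
    return path
```

The rows would be read from the database once, at the moment a comment is posted, and every later page view would be served from the file.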
Rich wrote:
Ben – thanks for your input, but using flat files instead of dynamic files would just push the problem somewhere else. Sure it means the site would be less likely to go down due to database failure, but it would mean I get hammered by bandwidth fees. This robot attack meant that my normal monthly bandwidth usage would be consumed in an hour – that’s nearly 1000 times the expected amount of traffic.
And like I said, huge sites like Multimap query the database for each of their 8 million daily page views (although admittedly there won’t be a new db connection opened for each of those). The root cause is therefore not that I hit the db each time, but the robots. Keep the robots out and the site behaves. Don’t get me wrong – I do understand that flat files are more server-efficient – that’s why my RSS files are static pages – but changing that won’t solve the problem.