Callimachus,
When your response was quoted, "AFAIK" wasn't omitted. This thread has become very informative for site owners, editors and anyone reading who is willing to try to understand our use of tools. Neither of our respective uses of tools is perfect, nor is anyone claiming they are. They do, however, free up needed resources for those using them.
"I have a bot trap on my own website .....". I started with a bot trap in the early stages of the site. However, as the site grew we were picked up by Google News, Slashdotted many times, and linked from Digg and others. The log files revealed that regular visitors were using only 1/4 of the total bandwidth consumed each month. The other 3/4 was being used by bots, scrapers, proxies, botnets and other undesirables. This was too much for me, and the hosting provider wanted the equivalent of an entry-level dedicated server or the account was going to be closed. A server was selected to provide the site with the tools and resources needed to combat what the undesirables had done. The site had been copied and was available from many different data centers, with my AdSense code replaced with someone else's. In my eyes this meant war, nothing short of full scale combat.
Time For Tools (Combat Fatigues Optional)
A script that was designed to stop comment spam, but has many other capabilities, was selected first. Another script was implemented to combat scrapers and site rippers (scrapers/rippers copy all or part of a page or site). The scripts were, and are, a very good starting point, but protection/security/combat is an ongoing building process, and security was identified as the next level that needed to be addressed. Mod_Security, an Apache module, was added to the arsenal.
Log Files
The log files from all the scripts and the module were compared. The comparison would surprise those who thought they knew, me included. 90% of the recurring problems were coming from a select few easy-to-identify locations:
- Hosting data centers worldwide (hosting includes website hosts, co-location servers and dedicated servers)
- Non-democratic countries (that isn't a political statement)
- Countries with "shaky" (I can't think of another way to describe this group) environments.
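For anyone wondering how a comparison like that can be done, here is a minimal sketch in Python. It assumes the blocked IP addresses have already been pulled out of the various logs into a list; the function names, the /24 grouping and the sample addresses are mine for illustration, not what was actually run on the site.

```python
from collections import Counter
import ipaddress


def network_of(ip, prefix=24):
    """Return the enclosing network (as a string) for an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def top_ranges(blocked_ips, prefix=24, limit=3):
    """Count blocked requests per network and return the busiest networks."""
    counts = Counter(network_of(ip, prefix) for ip in blocked_ips)
    return counts.most_common(limit)


# Hypothetical sample using documentation-only addresses, not real offenders.
sample = [
    "203.0.113.5", "203.0.113.77", "203.0.113.200",
    "198.51.100.9", "198.51.100.10",
    "192.0.2.1",
]

for network, hits in top_ranges(sample):
    print(network, hits)
```

Grouping by network rather than by single address is what makes the recurring ranges stand out, since scrapers in a data center tend to rotate through neighbouring addresses.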
The Arsenal Expands
The last addition has been a firewall with a very basic configuration. The server (hardware) was now going to be protected. I decided that the site was for regular, human visitors. The scripts and module mentioned above don't stop server access, and successful attacks are still possible if a server can be reached. The IP ranges of the recurring problems were loaded into the firewall. The firewall is configured to "Drop Connection" when an undesirable IP range tries to gain access. Drop Connection means no server response; it is like hitting an unknown black hole in cyberspace. Full scale combat.
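To illustrate the idea of loading ranges into a firewall, here is a rough Python sketch. I haven't said which firewall is in use, so the iptables command format below is only an assumption chosen because it is widely known, and the ranges are documentation-only placeholders rather than the real list.

```python
import ipaddress

# Hypothetical blocked ranges (placeholder addresses, not the real list).
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("203.0.113.0/24", "198.51.100.0/24")
]


def is_dropped(ip):
    """Return True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_RANGES)


def iptables_rules(ranges):
    """Build one DROP rule per range (assumes an iptables-style firewall)."""
    return [f"iptables -A INPUT -s {net} -j DROP" for net in ranges]


for rule in iptables_rules(BLOCKED_RANGES):
    print(rule)
```

A DROP rule discards the packets silently, so the client gets no response at all, which is the "blackhole" behaviour described above. The trade-off is that everything in the range is cut off, legitimate or not.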
Oops, The End Result Of Full Scale Combat Includes Collateral Damage
The scripts blocked "zilla" and later the firewall blocked pmoz and anything that originated from a known recurring problem IP range. Can I change the past? No. Should an exception be made for a small group of sites that have restrictive/non-human access controls? That isn't my decision to make. Did I block more than just the "directory" trying to do the best they can? Yes, take a look:
Mod_Security log file:
Request:
www.mysite.com 204.146.25.202 - - [24/May/2008:06:12:55 +0000] "HEAD /node/85$
----------------------------------------
HEAD /node/852 HTTP/1.1
Accept: application/octet-stream, application/*, audio/*, image/gif, image/jpeg, image/pjpeg, image$
Accept-Language: en-us
Cache-Control: no-cache
Connection: Keep-Alive
Content-Length: 0
Host: www.mysite.com
Pragma: no-cache
Referer: http://www.ibm.com/developerworks/blogs/page/woolf?tag=j2ee
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
mod_security-action: 403
mod_security-message: Access denied with code 403. Pattern match "!^$" at HEADER("Content-Length") $
HTTP/1.1 403 Forbidden
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
--06a71e5c--
_____________________________________________________________________________________
The above is IBM trying to verify if a page linked to from one of their developer's blogs is available.
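For the non-tech readers, the reason that request was refused can be sketched in a few lines of Python. The log message shows the negated pattern "!^$" being applied to the Content-Length header, meaning the rule fires whenever that header is not empty. The function below only mimics that one check; the full rule isn't shown in the log, so any other conditions it may have are unknown, and treating a missing header as empty is my assumption.

```python
import re

# "^$" matches only an empty string; the rule in the log negates it with "!".
EMPTY = re.compile(r"^$")


def denied(headers):
    """Mimic the negated pattern: deny when Content-Length is non-empty."""
    value = headers.get("Content-Length", "")  # assumption: missing == empty
    return EMPTY.search(value) is None


# Headers taken from the logged HEAD request above.
request_headers = {"Content-Length": "0", "Connection": "Keep-Alive"}
print(denied(request_headers))
```

The link checker sent "Content-Length: 0" with a HEAD request. A value of "0" is not an empty string, so the pattern fired and the request got a 403, even though nothing malicious was going on.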
Conclusion
I learn, listen and go forward knowing that one day I am going to "shoot myself in the foot" again. I hope the non-tech oriented who have chosen to read the above understand it.
Regards