ODP link checker and site removal

Stern123 · Jul 17, 2008

pvgool said:
For me I prefer a few sites to be removed uncorrectly above a few sites to be not removed uncorrectly.

You evade the point. You make it sound like it's a choice of either a) a few sites incorrectly removed, or b) a few sites not removed correctly, and that's misleading.

The point is that developers can create a well-designed bot that is less likely to get banned. By doing so, you *still* remove all the incorrect sites *and* reduce the false positives.

As I've discussed ad nauseum, these measures include (but not limited to) moving away from spammy IPs, moving to AOL/DMOZ IPs, and reading robots.txt like a polite bot. The chance of the bot actually being blocked by robots.txt is probably miniscule, it's not a big deal like you make it out to be. In that very rare case, there could be a manual review or a 'Plan B' bot, the link checkers that pretend to be browsers, that check up on the sites that pmoz can't visit. I don't know, these are suggestions, but the point is even a single one (1) of these measures can be enacted without making it out to be some sort of degradation of the directory.

But I think you know that, and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

I would invite you to come over to a webmaster forum to discuss the lazy design of pmoz bot and make your arguments and you'll probably be overwhelmed by a tidal wave of disagreement and you'll be forced to retreat to the safe haven of this forum.

hutcheson · Jul 17, 2008

You are, of course, always welcome to take whatever means think think necessary to protect themselves from spam. You're even welcome to describe those means in this forum (at least, so far as they relate to the ODP tools.) And in return, you're likely to get some information about non-confidential aspects of the ODP tools.

Both you and the ODP tool developers are free to use that information however they wish.

It is not, however, reasonable to expect you to justify your failure to cooperate with any particular ODP tool. Nor is it reasonable for you to expect anyone to justify their tool design to. Both your website and their tools are what they are. And everyone can choose for themselves to use them or cooperate with them, or not.

There's no basis for an "argument" (here or anywhere else), because my priorities are probably as irrelevant to you as yours are to me. Here, it's just an opportunity to share information.

Tech Owner · Jul 17, 2008

I don't have any problem with letting whoever is responsible for "http://pmoz.info/doc/botinfo.htm" (the info page isn't up to date) know how the bot is being detected. Something is leading me to believe it is known to the developer and the developer also knows the difference between a 301 (moved permantently) and a 403 (access denied). Rob am I correct ( if you are maintianing this link checker)?

Can anyone let me know where a dmoz policy states that I must allow bots to access my site, other than "zilla" inorder to ensure my site won't be removed.

hutcheson · Jul 17, 2008

No, there's no such policy.

And there's no policy against ODP'ers using tools that unremove sites "highly likely to be among the dead".

Nobody is telling you a site won't ever be listed if it don't allow robot visitors. We can't know that. And nobody can tell you a site will be listed so long as it allows robots X, Y, and Z to visit. We can't know that either.

This isn't a courtroom, where it's better to let a hundred serial killers free than to execute one innocent man. This is a question of -- what's the best way to allocate ODP resources? Experience has shown, it's better to unreview a live site than to leave 19 dead sites listed; and it's better to delete a live site outright than to leave 99 sites listed. Because the automatic tools free the editors from constant patrolling of existing sites, to review more unlisted sites (including, of course, the live sites unnecessarily removed.)

plantrob · Jul 18, 2008

Nope, I'm not about to double the number of http requests I make, just to read robots.txt for each URL. My bot does not crawl sites, does not cause a high server load, or otherwise make a nuisance of itself. I don't see any purpose whatsoever being served by reading robots.txt.

As techinfo noted, I use two different IP addresses (not just the perfora.net one) to check bad responses; only because of an overly restrictive access strategy did all three attempts yield an error response. Nearly a thousand listings go bad every day in the directory. In removing those, mistakes are inevitable - and reversible, by category editors. Overall, the system works quite well

Conversely, you could configure your site so that simple access requests are not blocked, but state-changing requests (e.g., POSTs) are subjected to greater scrutiny.

Tech Owner · Jul 18, 2008

Thanks for the reply Rob;

The use of ours respective tools is clashing even though they are designed to achieve very similar end results. The tools utilised on my site are designed to block unknown bots. If the bot originates from a known problem data center the connection is dropped at the firewall. Most attempts of "defacing" a site, hacking, spam and botnets have been tracked to these data centers. A bot originating from an ISP will get a 403 (access denied) page, this isn't the generic apache 403 response page, the page states the reason for denial of access and includes an email for those willing to proceed.

Log Files: (hutcheson, I know "His Honor" isn't present and the basics of the analogy actually applies to both sides

grep 216.15.74.85 /home/web/tech-httpd-access.log

216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Jan/2008:00:29:06 +0000] "GET / HTTP/1.1" 403 897
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Mar/2008:18:00:40 +0000] "GET /node/7119 HTTP/1.1" 403 906
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [19/Apr/2008:02:20:04 +0000] "GET /node/feed HTTP/1.1" 403 906

The bot in the log files received the "reason for denial of access" page. If a human checked the urls using a tool that masks their indentity, eg;http://passport.dmoz.org/?pp_cat=pmoz.info&pp_tool=/index.php5, the human would have received the "reason for denial of access" page.

Should an editor verifying a url "jump through the hoops" I've put in place. (this ones a basic no brainer) No! You shouldn't. The tools in place were never designed and/or used to incumber regular visitors or editors. The tools are being used to free up resources, human and hardware.

(Let's see if we can reduce the these drops in the bucket) There is open source software available that is almost impossible to detect when used to access a site to "GET" a few urls providing it doesn't originate from a known problem data center. It does leave a detectable "foot print" when used for malicious purposes.

Re: "Conversely, you could configure your site so that simple access requests are not blocked, but state-changing requests (e.g., POSTs) are subjected to greater scrutiny."

Rob most of the tools I have in use are open source and are becoming popular on open source Blogs and CMSs. The one trapping the "bot" is actually designed to catch comment spam. I could make changes as per your suggestion but, I'm just "a drop in the bucket

and I'm trying to reduce many drops in the bucket.

Regards

pvgool · Jul 18, 2008

plantrob said:
Nearly a thousand listings go bad every day in the directory. In removing those, mistakes are inevitable - and reversible, by category editors. Overall, the system works quite well

From my experience looking at many of those removed sites my estimation is that we don't even have one mistake every day and probably not one every week.
I have also seen that if a site is marked several times by the tool as unavailable but available to a human editor that rob takes special mesures for the tool not to mark it anymore.
As from the ca 4.500.000 listed sites only some 1000 per day go bad and are removed it seems to me that most sites don't have those extreme protections which would remove them wrongly. The same conclusion can I deduct from the very low number of wrongly removed sites amongs all the rightly removed sites.

and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

Yes, for me personaly (but I can't speek for any other editor or for the DMOZ as an organsiation) I don't care about a few false positives.

Tech Owner · Jul 18, 2008

Tools In Action

This thread is actually about the tools used by websites and, in this case, dmoz. I'm not claiming my use of tools is the only way they can be utilised. Having said that, I'd like you to see the output of a script that is designed to combat scrapers and site rippers. The IP used in this attempt will be denied access of a period of time. The time period is a random number of hours between 12 and 24. When and if the IP in question tries to access the site during the denied time period they will recieve a page letting them know when they will be allowed to access the site.

Scraper File:

18/07/2008 12:32:18 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:23 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:23 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:33 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:33 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:36 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper

plantrob · Jul 19, 2008

Rob most of the tools I have in use are open source and are becoming popular on open source Blogs and CMSs. The one trapping the "bot" is actually designed to catch comment spam.

Interesting... I would think that malicious bots are the ones that try to be sneaky and assume a common browser's useragent - while above-board bots provide a means for tracing/explaining their purpose. But I've never ventured into comment spam practices, so what do I know

Tech Owner · Jul 19, 2008

We tried to feed pmoz. Pmoz wasn't hungry

Callimachus · Jul 19, 2008

Please note the "AFAIK" in my previous response. It stands for "as far as I know". Only AOL tech and some admin level editors have access to the code for Robozilla and those editors who write their own tools may or may not have released it to general editorial access.

I have a bot trap on my own website that works well and simply. It will catch ill behaved bots, even those that ignore robots.txt (the reading of which does not guarantee proper behaviour - indeed it can be used for the opposite purpose), and even some ill behaved visitors.

Tech Owner · Jul 20, 2008

Callimachus,

When your response was quoted "AFAIK" wasn't omitted. This thread has become very informative for site owners, editors and anyone reading who is willing to try and understand our use of tools. Neither use of our respective tools is perfect, nor is anyone claiming they are. They do however free up needed resources for those using them.

"I have a bot trap on my own website .....". I started with a bot trap in the early stages of the site. However, as the site grew we were pickup by Google News, Slash dotted many times, Digg and others. The log files revealed that regular visitors were only using 1/4 of the total bandwith used each month. 3/4s of the bandwith was being used by bots, scraper, proxies, botnets and other undesirables. This was too much for me and the hosting provider wanted the equivalent of an entry level dedicated server or, the account was going to be closed. A server was selected to provide the site with the tools and resources needed to combat what the undesirables had done. The site had been copied, was available on many different data centers with my adsense code replaced with that of someone else. In my eyes this ment war, nothing short of full scale combat

.

Time For Tools (Combat Fatigues Optional

A script that was designed to stop comment spam but, does have many other capabilities was selected first. Another script was implemented to combat scrapers and site rippers ( scrapers/rippers copy all and or part of a page or site). The scripts were and are a very good starting point but, protection/security/combat is an ongoing building process and security was identifies as the next level that needed to be addressed. Mod_Security, an Apache module was added to the arsenal.

Log Files

The log files from all the scripts and module were compared. The comparison would surprise those who thaught they knew, me included. 90% of the recurring problems were all coming from select few easy to identify locations.

- Hosting data centers world wide (Hosting includes, website hosts, co-location servers and dedicated servers)

- Non democratic countries (that isn't a political statement)
- Countries with "Shakey" (I can't think of another way to discribe this group) enviroments.

The Arsenal Expands

The last addition has been a firewall with a very basic configuration. The server(hardawre) was now going to be protected. A decession was made, by me, that the site was for regular/human visitors. The scripts and module mentioned above don't stop server access. Successful attacks are still possible if a server can be reached. IP ranges of the recurring problems were loaded in to the firewall. The configuration of the firewall is to "Drop Connection" when an undesirable IP ranges try to gain access. Drop Connection, no server response, is like you hit an unknown blackhole in cyberspace, full scale combat.

Oops, The End Result Of Full Scale Combat Includeds Collaterial Damage

The scripts blocked "zilla" and later the firewall blocked pmoz and anything that originated from a known recurring problem IP range. Can I change the past? No. Should an exception be made for a small group of sites that have restrictive/non-human access controls. That isn't my decession to make. Did I block more than just the "directory" trying to do the best they can? Yes, take a look:

Mod_Security log file:

Request: www.mysite.com 204.146.25.202 - - [24/May/2008:06:12:55 +0000] "HEAD /node/85$
----------------------------------------
HEAD /node/852 HTTP/1.1
Accept: application/octet-stream, application/*, audio/*, image/gif, image/jpeg, image/pjpeg, image$
Accept-Language: en-us
Cache-Control: no-cache
Connection: Keep-Alive
Content-Length: 0
Host: www.mysite.com
Pragma: no-cache
Referer: http://www.ibm.com/developerworks/blogs/page/woolf?tag=j2ee
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
mod_security-action: 403
mod_security-message: Access denied with code 403. Pattern match "!^$" at HEADER("Content-Length") $

HTTP/1.1 403 Forbidden
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
--06a71e5c--

_____________________________________________________________________________________

The above is IBM trying to verify if a page linked to from one of their developer's blogs is available.

Conclusion

I learn, listen and go forward knowing that one day I am going to "shoot myself in the foot" again. I hope the non-tech oriented who have chosen to read the above understand it.

Regards

Stern123 · Apr 17, 2009

Stern123 said:
Hello... Newman!

The more thoughtful developers of spiders out there are taking the extra step of reassuring webmasters with identifiable IP ranges and reverse dns checks and all that. While you may not have the resources to take it that far, wouldn't it be better to use a DMOZ or AOL IP at the very least? Sure, pmoz may have a unique agent, but we also saw a fraudulent Googlebot from the same hosting range, so clearly, the user agent (which can so easily be faked) is not a sign of trust whatsoever.

In my opinion, it's actually quite Newman-like to shrug off any culpability for your dubious neighbors and see us as just a drop in the bucket

Update: Our server continues to receive suspicious activity from pmoz's hosting server -- including a hit by "Morfeus F***ing Scanner" ... No, I'm not being obscene -- that is the exact (albeit censored) name and user agent of a hacking tool that is scanning for php vulnerabilities.

To summarize, ODP relies on a scanning tool from a bad "neighborhood" containing the likes of "Morfeus F***ing Scanner" and other potential sources of spamming and hacking activity. BUT the manager of that tool has expressed a total lack of concern, and so webmasters cannot take the precaution of denying visits from that bad neighborhood IF they wish to remain listed in DMOZ.

Please note that no reputable global web-based organization would use a tool like pmoz.info and risk associating itself with shared hosting and Morfeus F***ing Scanners.

I once again beseech the powers-that-be to review this thread and eliminate DMOZ's dependency on pmoz.info. There is the alternative -- use only your reliable and professional tool called "Robozilla" which comes from dmoz.aol.com and which can be safely accepted by all websites.

tschild · Apr 17, 2009

Does "from pmoz's hosting server" mean "from the same IP that pmoz.info resolves to"?

pvgool · Apr 17, 2009

As Morfeus is a well known and old scanner that is running from many servers from all over the world you should either block that specific scanner (and not some ipnumber from which one of the scanner is running at the moment) or better yet you should keep your server upto date and save from possible attacks.
Blaming others for things they can not control will be of no help.
pmoz is doing a very good job for DMOZ. Accidentaly marking a few sites as not accesible is not a problem for us. Much better than letting the real errors exist. But that has been explained to you before.

Stern123 · Apr 17, 2009

pvgool said:
Blaming others for things they can not control will be of no help.

I don't comprehend the defensive and defiant attitude. I am not blaming you for your neighbors. I am questioning the choice of setting up shop in a 3rd party spammy neighborhood, instead of better alternatives and suggested improvements. I've written so many posts about this, I don't know how else to clarify myself and stop you from running me around in circles.

Much better than letting the real errors exist. But that has been explained to you before.

Yes, but I already addressed that when I wrote:

You evade the point. You make it sound like it's a choice of either a) a few sites incorrectly removed, or b) a few sites not removed correctly, and that's misleading.
<SNIP>
...But I think you know that, and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

Which you acknowledged when you wrote:

Yes, for me personaly (but I can't speek for any other editor or for the DMOZ as an organsiation) I don't care about a few false positives.

So for the sake of everyone's time and patience, please stop misleading everyone that this either a black-and-white choice of either "accidentaly marking a few sites as not accesible" OR "letting the real errors exist". Because that's simply evasive and dishonest.

jimnoble · Apr 17, 2009

There are several possibilities here including:

The volunteer author of the excellent pmoz.info tools goes to the trouble of moving them to a better hosting neighbourhood - which might or might not remain that way in perpetuity.
You adjust your blocking strategy to permit access from pmoz.info.
Nobody does anything

Please don't assume that the tools' volunteer author has the resources available to make changes in any meaningful time scale. You're the guy motivated to get something done so your most realistic solution is to take option 2.

That isn't a defensive or dishonest attitude, it's logic

.

Stern123 · Apr 17, 2009

jimnoble said:
You're the guy motivated to get something done so your most realistic solution is to take option 2.

Agreed, and I have already implemented that. But the point is that it's a bandaid solution, and I'm here to make a reasonable criticism of the situation plus suggestions for improvements.

That isn't a defensive or dishonest attitude, it's logic .

I consider it defensive/dishonest if the primary responses are 'no, you change' or 'we're blameless, stop blaming us' and 'we don't care' and so forth. More honest is a balanced response (like yours) along the lines of 'Yes, we acknowledge this, and we are interested in moving forward, but it will take time, so in the meantime, please consider this temporary solution'

Unfortunately, many came across to me as overly-defensive, lazy and/or uncaring -- a far cry from progressive thinking.

To expand on your list above, I would wish to suggest one or more of the following:

-Move pmoz to any other shared hosting which genuinely enforces a no-spam policy

-Move pmoz to dmoz.org or dmoz.aol.com (like Robozilla) for authenticity and verifiable reverse dns checks (like yahoo, google, or any other professional search engine/directory)

-Just stop running pmoz BUT continue to use Robozilla to auto-tag websites

-Use pmoz as a secondary authority; only auto-remove sites if Robozilla also "agrees"

-Use pmoz to auto-remove site only if it gets consecutive 404s (page not found)

-If pmoz gets a 403 (forbidden) or 50x (error) response code, always defer to Robozilla as the higher authority

Some of these changes are more time and resource intensive, others can be relatively quick and easy to implement, but any of the above is worthwhile, unlike the time that it takes to argue in circles over non-sequiturs. To those who protest about protecting the integrity of DMOZ, I can't imagine why any of these improvements would compromise ODP's ability to weed out bad websites.

pvgool · Apr 17, 2009

> Because that's simply evasive and dishonest
No it is not.
We have the choice between running tools developed by editors (like the one you see from pmoz) or not running them. We don't have access to the aol servers so can not put our tools on those servers.
Than the choice for us is easy. Run the tools. Accept that they are not perfect. Each month several 1000 sites get removed by these tools. That is good. If they also remove some sites a few sites that by accident, so be it. All sites marked by the tools are marked for an editor to review. If during such a review the editor noitces that the tool was wrong he can mark the site so the tools will know about their mistake. That is all fine for us. And we have no problem with this process.
Will the owner of such a site think we have a problem. Possibly yes, as you have shown. But with several millions sites listed, several millions sites being rejected over the years and millions of sites waiting review (either being suggested or not) the status of a single site is irrelevant for us.

Stern123 · Apr 17, 2009

pvgool said:
> Because that's simply evasive and dishonest
No it is not.

Yes, it is, because...

pvgool said:
> We have the choice between running tools developed by editors (like the one you see from pmoz) or not running them

3rd option: Improve the tools (see above)
4th option: Run the tools from a better host (see above)
5th option: Defer to Robozilla (see above)
etc. etc.

With your black-and-white stasist attitude, we'd only be allowed to use Windows 95, Apple Macintosh, or no computer at all! I hope that Zhuo Zhang et. al. will take a more progressive open-minded approach to DMOZ 2.0!

ODP link checker and site removal

Stern123

Member

hutcheson

Tech Owner

Member

hutcheson

plantrob

Curlie Admin

Tech Owner

Member

pvgool

kEditall/kCatmv

Tech Owner

Member

plantrob

Curlie Admin

Tech Owner

Member

Callimachus

Member

Tech Owner

Member

Stern123

Member

tschild

kEditall/kCatmv

pvgool

kEditall/kCatmv

Stern123

Member

jimnoble

DMOZ Meta

Stern123

Member

pvgool

kEditall/kCatmv

Stern123

Member