Baffled here - DMOZ identifiers needed a la robots, crawlers, etc.

WhiteHat

Member
Joined
May 8, 2004
Messages
4
:confused: We're thoroughly baffled. And our apologies in advance for being long-winded, but we think it is warranted (the windiness, as well as the apology :) )

Sometime in the last week to ten days, our site (http://football.refs.org) was deleted from DMOZ. Baffling to us, considering that less than three years ago, our site was THE starred site in the DMOZ category of Sports: Football: American: Officiating. (As proof, look at IA's Wayback Machine for DMOZ's listing for August 2001 - http://web.archive.org/web/20010802200232/dmoz.org/Sports/Football/American/Officiating/).

We couldn't imagine why we were suddenly removed from DMOZ until we considered the following:

In the last three months, we have taken a significantly more aggressive approach in securing our server. Malicious harvesters, spiders, and many unacceptable user-agents are being denied any access to our site on our server. We utilize a comprehensive list of banned "bad bots", banned IP addresses, and the like that are denied access to our server for security reasons. As an example, web crawlers that have violated or are known to violate the robots exclusion standard are forbidden from any and ALL access to our server; we don't even let them access the robots.txt file, since they violate it anyway.
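For context, a ban list of this sort is typically implemented at the server level. Here is a simplified Apache-style sketch (assuming an Apache 2.x setup; the user-agent strings below are examples only, not our actual list):

```apache
# Flag requests whose User-Agent matches a known-bad pattern
# (these names are illustrative, not our full ban list)
SetEnvIfNoCase User-Agent "Zeus"         bad_bot
SetEnvIfNoCase User-Agent "WebDAV"       bad_bot
SetEnvIfNoCase User-Agent "Indy Library" bad_bot

# Deny flagged requests site-wide, including robots.txt itself
<Location />
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>
```

A flagged request receives a 403 Forbidden before it ever reaches site content.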

We can only speculate that someone from DMOZ attempted to reach our site using a banned user-agent and, as such, was denied access and sent to our "forbidden" page.

Our sites are fully reachable using 99.8% of available browsers - Netscape (all versions), IE (all versions), Opera (all versions), Lynx (for you die-hards); however, scores of bad user-agents, such as WebDAV, Zeus, Nutch, Indy Library, etc., are not acceptable and are forbidden access.

One can choose to allow or deny Googlebot (or any bot) access to one's server via the robots.txt file, using the robots exclusion standard. Unfortunately, there appears to be a total lack of any semblance of a DMOZ standard. In our opinion, DMOZ's use of human editors rather than bots/spiders does not eliminate the need for standardized browser/user-agent identifiers. We believe that needs to be addressed and corrected by DMOZ.
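To illustrate how precise the robots exclusion standard lets one be (an illustrative policy, not our own):

```text
# robots.txt - allow Googlebot everywhere; keep all other
# compliant robots out of the entire site
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow means "nothing is off-limits" for that agent; a bare `/` blocks the whole site. No such named identifier exists for DMOZ editors.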

The lack of a specific official "search bot" or standardized, identifiable user-agent tag used by DMOZ editors does a disservice to DMOZ, because server and/or site owners are unable to clearly identify DMOZ inquiries and, therefore, cannot choose to allow or deny a DMOZ review of a site. Our situation may exemplify how this lack of a standard unfairly penalizes server owners who take security seriously, which in turn penalizes DMOZ with misinformation, which ultimately penalizes DMOZ users, who are left with incomplete information.

Can anyone explain any other logical reason why we would go from the premier site in our category to non-existent on DMOZ?
Heck, our site's links section lists over 130 sites, dwarfing DMOZ's listing of only 35 in the category to which we belong (or belonged, until a week ago).

Interesting aside, FWIW - if one goes to http://www.dmoz.org/Sports/Officiating/, our category of Football, American shows "@40", but clicking the link shows "Sports: Football: American: Officiating (35)" - it appears to us there are several errors here in DMOZ.

Does dmoz.org/Sports/Football/American/Officiating/ have an editor? More than one? Is there any way for us to make our server DMOZ-friendlier? If we knew how, we would attempt to accommodate.

Thanks to those who waded through this post. Any thoughts - anyone?
 

theseeker

Curlie Meta
Joined
Mar 26, 2002
Messages
613
It looks like you are blocking our own link-checking spider (which is certainly not aggressive, by the way - it checks only one link, and only to see if the link responds). The link-checking spider comes back with an error and the site is unreviewed until an editor has a chance to look at it. This is the second time this has happened to your site in the last 6 months.

I verified the site was working and re-added it, but to avoid this in the future, you may want to check your logs for Robozilla, and let it through.
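A quick log check might look something like this (the log line below is a made-up sample so the command can be demonstrated; in practice you would grep your server's real access log, whose path varies by setup):

```shell
# Illustrative only: write one sample combined-format log line
printf '%s\n' '203.0.113.5 - - [08/May/2004:12:00:00 +0000] "GET / HTTP/1.0" 403 512 "-" "Robozilla/1.0"' > sample_access.log

# Case-insensitive search for the ODP link checker's user-agent
grep -i 'robozilla' sample_access.log
```

A 403 status next to a Robozilla hit would confirm the spider is being bounced.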

Football, American shows "@40"

The @ symbol indicates that this is not really a category, but a link to a category in another part of the directory. Dead sites in that category were recently cleaned up, causing the count in the category to drop to 35, but at the moment our page generation process is slow. With the 1-4 day lag time between edited pages and the public server, some pages that display links to categories will not have caught up yet.

This should automatically clear itself within a week. :monocle:
 

hutcheson

Curlie Meta
Joined
Mar 23, 2002
Messages
19,136
(1) Your speculation about the link-checking spider was correct. The site bounced the spider, and an editor moved the site into unreviewed until it could be rechecked by hand.

(2) The ODP link checker, user-agent "Robozilla", is very well-behaved, and you should let it in: that will keep this same thing from happening regularly. Robozilla will only check your home page, and it will only do so about once a month on average. It does obey robots.txt, so you can keep it out that way if you wish -- but you have no reason to do so.

(3) ODP editors use whatever browser they want, and you need to treat them just like any other surfer. And, given that you are treating them just like any other surfer, there is no conceivable honest reason to try to distinguish them from all other surfers. You can choose to support whatever browsers you want; but so long as you respond to unsupported browsers with a recognizable "unsupported browser" message, editors should not delete the site indiscriminately. (For the record, this is a more generous treatment of browser-restricted sites than Yahoo's.) In any case, I don't see that there has been any problem along these lines.

(4) No need for the more elaborate conspiracy theories -- it's robozilla.

(5) Really, I think all you need to do is let Robozilla in. An editor will eventually recheck the sites that Robozilla couldn't reach, and re-add the ones that still work (which, if what you are saying is correct, will include yours). And this will happen regardless of how you treat our brave (but dangerous) mascot. It's just that when Robozilla returns, the whole cycle may start over unless you fix that.
 

WhiteHat

Member
Joined
May 8, 2004
Messages
4
Thank you to both of you for your quick and detailed responses.

We have modified the appropriate conf file on our server per your identification of Robozilla. I'm not sure when or why Robozilla was included for exclusion; regardless, as I type, Robozilla no longer appears in our exclusion list.
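In case it helps others, the change is conceptually this simple (an Apache-style sketch, not our exact configuration):

```apache
# The line that was removed - Robozilla is no longer flagged:
# SetEnvIfNoCase User-Agent "Robozilla" bad_bot

# Alternatively, keep the ban list intact and explicitly clear
# the flag for Robozilla afterwards (! unsets the variable):
SetEnvIfNoCase User-Agent "Robozilla" !bad_bot
```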

Much appreciation from us to you for such timely and informative responses.
 

WebAlert

Member
Joined
Mar 27, 2004
Messages
10
Is there a comprehensive list anywhere on the net of good spiders?

I have searched the net, but all you find is how to write robots.txt files - nothing on who is good and who is bad!

Even looking at a couple of examples just causes confusion:

http://dmoz.org/robots.txt
- allows every robot in
http://www.searchengineworld.com/robots.txt
- a massive list of allowed robots, but it is a different list from their published 'good guy' robots list, http://www.searchengineworld.com/robots/robots4.txt
(which doesn't validate on their own validator!)

I also took the liberty of looking at football.refs.org's robots.txt and found another long list - it was hard to tell if they were the same, as SEW's wasn't as neat and tidy (i.e. in alphabetical order) as footie ref's list!

Basically my 2 questions are:
1) Where did you (WhiteHat) get your list from?
2) (but only out of curiosity) ... how come DMOZ allows all spiders, even the bad 'uns, into its site?

Cheers,
Nathan
 

thehelper

Member
Joined
Mar 26, 2002
Messages
4,996
We want to make our data fully available to everyone - we don't mind bots. If a bot causes us a problem, we deal with it on a case-by-case basis, but since we have a good dedicated server here, it is not a problem. If you have bandwidth problems or are hosting on a DSL line or whatever, then you should worry about bad bots, but don't worry about us - we are fine :)

Good luck on your good-bot and bad-bot search - my site is crawled by all of them, and I have never had a problem with them at all.
 
This site has been archived and is no longer accepting new content.