A strange error message from dmoz

bjorn

Member
Joined
Jun 30, 2004
Messages
10
A user of my script (phpODP) has reported getting this message:

Code:
<center>
<table width="75%"><tr><td>
<b><font size="+1">Access denied (for you)</font></b>
<p>
Possible reasons are:
<ul>
<li><b>You are trying to mirror the page via HTTP.
We dislike this!</b> It's a total waste of bandwidth, CPU and time!
Please <a href="mailto:dmoz@teamix.net">contact</a> us and use <a href=http://rsync.samba.org/>rsync</a> in the future.


<li><b>We blocked you by mistake.</b>
Sorry for that. Please get in <a href="mailto:dmoz@teamix.net">contact</a> with us und we deblock you immediately.


</ul>
</center>

Why does this happen? I've never experienced getting this message, and this is the first time I've seen it. From the message, it seems dmoz doesn't like that people are using their content directly?

See detailed discussion here:
http://www.bie.no/forum/index.php?act=ST&f=2&t=119&st=0

- bjorn
 

Sunanda

Member
Joined
Jun 15, 2003
Messages
248
You'll need to talk to the ODP staff for a definite answer.

But there is some advice in the robots.txt:
http://www.dmoz.org/robots.txt

# Please do not crawl us faster than 1 hit/second.
#
# If you need to examine many dmoz pages, please download the rdf file from
# http://rdf.dmoz.org/ instead of crawling us.

A better design for your system might be for you to download the RDF dump regularly, and for your clients to crawl the site you create from that.

It won't be as up-to-date as DMOZ. But if you updated the 1000 most changeable categories every day, that would be almost as good. And you'd be crawling DMOZ only about once per minute, which is well within the requested limits.
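
To make the arithmetic concrete, here is a minimal sketch in Python (phpODP itself is PHP, so this is illustration only). The names `Throttle` and `seconds_between_requests` are made up for this example and are not part of any ODP tool. It assumes the only rule that matters is the 1 hit/second limit quoted from robots.txt above.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 1.0 matches "1 hit/second"
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds actually slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


def seconds_between_requests(pages_per_day):
    """Average spacing, in seconds, if requests are spread evenly over a day."""
    return 86400 / pages_per_day
```

Calling `throttle.wait()` before each fetch keeps a script under the limit. For the 1000-categories-a-day idea, `seconds_between_requests(1000)` gives 86.4 seconds, so a bit slower than once per minute.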

Though you and your clients may also have issues with the license conditions of the ODP data. Again, best check that with the ODP staff.
 

hutcheson

Curlie Meta
Joined
Mar 23, 2002
Messages
19,136
Well, the ODP public servers have been bogged down lately -- I've noticed it, and I've seen it mentioned in public forums.

So I'd say this message means that ODP staff have figured out (at least part of) WHY the public servers have been so bogged down lately.

Whether it's the script itself, or some single idiot user, or just that too many thoughtless users have done what amounts to conspiring to launch a DDoS attack against the ODP servers using the script, really doesn't matter at this point. ("We're techs, not psychiatrists. So long as we know WHO, we don't care WHY.")

DMOZ doesn't mind people (and I emphasize the word "people") using their content. DMOZ doesn't even mind 'bots using their content, so long as they don't interfere with the people. But the 'bots got way too greedy, and it's whack-a-bot time.
 

bjorn

Member
Joined
Jun 30, 2004
Messages
10
wrathchild; I have no idea - the user claims to have gotten this message when accessing dmoz.org.

Thanks for your insight, people.

Sunanda; AFAIK, the ODP license has no problems with this use of the data, as long as the proper attribution is in place.

I think maybe dmoz's problem is with individual high-traffic sites that didn't have caching enabled (my scripts support caching of ODP data) - and that have been requesting a lot of pages from dmoz.
 

pvgool

kEditall/kCatmv
Curlie Meta
Joined
Oct 8, 2002
Messages
10,093
Ask your customer if they know who teamix is. It might be that their own hosting provider is blocking them.
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
I've been playing with your script over the last week, and it's been failing a lot, and from what I can see it's because dmoz.org is stressed out. I think that a large number of additional sites using ODP data have been created over the last year, and many of these are taking bandwidth from ODP directly using scripts like this, as opposed to downloading the RDF and working from it directly.

I would think if this problem continues, then it would be inevitable that scripts will be blocked. I'm not sure whether a policy is spelled out clearly as to the conditions of taking data feeds directly from DMOZ.ORG, as opposed to using the RDF.

See this thread from last year http://www.resource-zone.com/forum/showthread.php?p=49572#post49572
 

bjorn

Member
Joined
Jun 30, 2004
Messages
10
Well, as said - if the sites enable cache then the page is only requested once, which shouldn't be more of a problem for dmoz than a normal user requesting it.
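
For anyone wondering what that amounts to, here is a rough sketch in Python. This is not phpODP's actual cache code (that is PHP and works with files). `PageCache`, its `fetch` argument and the one-day default lifetime are all assumptions made up for this example.

```python
import time


class PageCache:
    """Keep fetched pages in memory and reuse them until they expire (illustration only)."""

    def __init__(self, fetch, ttl=86400, clock=time.time):
        self._fetch = fetch    # function that takes a URL and returns the page body
        self._ttl = ttl        # seconds a cached copy stays valid (assumed: one day)
        self._clock = clock    # injectable so expiry can be tested
        self._store = {}       # url -> (time fetched, body)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # cache hit: no request goes upstream
        body = self._fetch(url)        # cache miss or expired: one upstream request
        self._store[url] = (now, body)
        return body
```

With this in front of the fetcher, each page hits the upstream server at most once per `ttl` period, however many visitors the site has.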

About teamix, I just found out that they are the ones hosting the German mirror:
http://de.dmoz.org/

- bjorn
 

Sunanda

Member
Joined
Jun 15, 2003
Messages
248
Though nowhere near as extreme yet (that'll depend on how successful / misused your product is), you are in a situation similar to that of Netgear's approach to the University of Wisconsin.

It does not seem a sound business decision to assume a non-profit organization will provide adequate bandwidth to support a commercial product sold by a third party.

As I suggested before, you can really only guarantee your paying users a fixed service level if they are accessing a DMOZ clone of your own.
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
About teamix, I just found out that they are the ones hosting the german mirror:

At a guess, the German ISP has noticed significant bandwidth usage from "screen scraping" scripts/bots across the directory and decided that they don't care to further subsidize those bandwidth costs. While there are uses for such activities, if abused (either intentionally or by incorrect configuration) they can create major problems in bandwidth usage.

It's similar to people who link directly to images on someone else's server and leech their bandwidth to display them. There was a time when this practice had become so widespread that many hosts and sites prevented, and still prevent, linking to images except from within their own networks/IP blocks/domains.
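
As a rough idea of how that kind of block usually works, here is a small Python sketch of a Referer check. It is an assumption about the general technique, not how teamix or any particular host does it, and `is_allowed_referer` is a made-up name.

```python
from urllib.parse import urlparse


def is_allowed_referer(referer, allowed_hosts):
    """Return True if a request for an image should be served.

    referer: value of the HTTP Referer header, or None/"" if absent.
    allowed_hosts: set of lowercase host names allowed to embed the image.
    """
    if not referer:
        # Many browsers and proxies omit the header; letting these
        # through is a policy choice made for this sketch.
        return True
    host = urlparse(referer).hostname  # None if the value is not a full URL
    return host is not None and host in allowed_hosts
```

A server would call this for each image request and refuse the ones that fail. The header is easy to fake, so this only stops casual hotlinking.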
 