
Posted

A user of my script (phpODP) has reported getting this message:

 

<center>
<table width="75%"><tr><td>
<b><font size="+1">Access denied (for you)</font></b>
<p>
Possible reasons are:
<ul>
<li><b>You are trying to mirror the page via HTTP.
We dislike this!</b> It's a total waste of bandwidth, CPU and time!
Please <a href="mailto:dmoz@teamix.net">contact</a> us and use <a href=http://rsync.samba.org/>rsync</a> in the future.


<li><b>We blocked you by mistake.</b>
Sorry for that. Please get in <a href="mailto:dmoz@teamix.net">contact</a> with us und we deblock you immediately.


</ul>
</center>

 

Why does this happen? I've never run into this message myself; this is the first time I've seen it. From the message, it seems dmoz doesn't like people using their content directly?

 

See detailed discussion here:

http://www.bie.no/forum/index.php?act=ST&f=2&t=119&st=0

 

- bjorn

Posted

You'll need to talk to the ODP staff for a definite answer.

 

But there is some advice in the robots.txt:

http://www.dmoz.org/robots.txt

 

# Please do not crawl us faster than 1 hit/second.

#

# If you need to examine many dmoz pages, please download the rdf file from

# http://rdf.dmoz.org/ instead of crawling us.
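For anyone who still has to fetch a few pages directly, the "1 hit/second" request above could be honored with a small throttle along these lines. This is only a sketch in Python, not part of phpODP; the `Throttle` class and its parameter names are hypothetical, and the one-second default is taken from the robots.txt comment.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive requests (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; 1.0 matches "1 hit/second"
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until min_interval has passed since the previous call.

        Returns the number of seconds slept.
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `wait()` right before each HTTP request keeps a single process at or below the requested rate. It does nothing about many independent sites each crawling at that rate, which is why the RDF download is still the better route.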

 

A better design for your system might be for you to download the RDF dump at regular intervals, build your own site from it, and have your clients crawl that site instead.

 

It won't be as up-to-date as DMOZ. But if you updated the 1000 most changeable categories every day, that would be almost as good. And you'd be hitting DMOZ only about once every minute and a half (1000 requests spread over the 1440 minutes in a day), which is well within the requested limits.
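The arithmetic behind that estimate, as a quick Python check. The function name is just for illustration; the figure of 1000 categories per day is the one suggested above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def refresh_interval(categories_per_day):
    """Seconds between requests if the refreshes are spread evenly over a day."""
    return SECONDS_PER_DAY / categories_per_day


interval = refresh_interval(1000)
print(f"{interval:.1f} seconds between requests")  # prints "86.4 seconds between requests"
```

That is far slower than the 1 hit/second ceiling in robots.txt, so even a much larger daily refresh list would stay inside the limit.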

 

Though you and your clients may also have issues with the license conditions of the ODP data. Again, best check that with the ODP staff.

  • Meta
Posted

Well, the ODP public servers have been bogged down lately -- I've noticed it, and I've seen it mentioned in public forums.

 

So I'd say this message means that ODP staff have figured out (at least part of) WHY the public servers have been so bogged down lately.

 

Whether it's the script itself, or some single idiot user, or just that too many thoughtless users have done what amounts to conspiring to launch a DDOS attack against the ODP servers using the script, really doesn't matter at this point. ("We're techs, not psychiatrists. So long as we know WHO, we don't care WHY.")

 

DMOZ doesn't mind people (and I emphasize the word "people") using their content. DMOZ doesn't even mind 'bots using their content, so long as they don't interfere with the people. But the 'bots got way too greedy, and it's whack-a-bot time.

Posted

wrathchild; I have no idea - the user claims to have gotten this message when accessing dmoz.org.

 

Thanks for your insight, people.

 

Sunanda; AFAIK, the ODP license has no problems with this use of the data, as long as the proper attribution is in place.

 

I think maybe dmoz's problem is with individual high-traffic sites that didn't have caching enabled (my scripts support caching of ODP data) and that have been requesting a lot of pages from dmoz.

  • Meta
Posted
Ask your customer if they know who teamix is. It might be that their own hosting provider is blocking them.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.

Posted

I've been playing with your script for the last week, and it's been failing a lot. From what I can see, that's because dmoz.org is stressed out. I think a large number of additional sites using ODP data have been created over the last year, and many of these are taking bandwidth from ODP directly using scripts like this, as opposed to downloading the RDF and working from that.

 

I would think if this problem continues, then it would be inevitable that scripts will be blocked. I'm not sure whether a policy is spelled out clearly as to the conditions of taking data feeds directly from DMOZ.ORG, as opposed to using the RDF.

 

See this thread from last year http://www.resource-zone.com/forum/showthread.php?p=49572#post49572

Posted

Well, as said - if the sites enable caching then each page is only requested once, which shouldn't be more of a problem for dmoz than a normal user requesting it.
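To illustrate the idea (this is not phpODP's actual cache, just a minimal Python sketch), a time-limited page cache could look like the following. `PageCache`, the `fetch` callable, and the one-day default lifetime are all assumptions made for the example.

```python
import time


class PageCache:
    """Keep fetched pages in memory and reuse them until they expire."""

    def __init__(self, fetch, ttl_seconds=86400, clock=time.time):
        self._fetch = fetch        # callable: url -> page content (assumed)
        self._ttl = ttl_seconds    # how long a cached copy stays valid
        self._clock = clock
        self._store = {}           # url -> (fetched_at, content)

    def get(self, url):
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]        # cache hit: no upstream request
        content = self._fetch(url)  # miss or expired: one upstream request
        self._store[url] = (now, content)
        return content
```

With something like this in front of the fetch code, repeated visits to the same category within the lifetime cost dmoz a single request. A real deployment would more likely store pages on disk so the cache survives between page loads.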

 

About teamix, I just found out that they are the ones hosting the German mirror:

http://de.dmoz.org/

 

- bjorn

Posted

Though nowhere near as extreme yet (that will depend on how successful or misused your product is), you are in a situation similar to Netgear's with the University of Wisconsin, where Netgear routers were hard-coded to query the university's time server.

 

It does not seem a sound business decision to assume a non-profit organization will provide adequate bandwidth to support a commercial product sold by a third party.

 

As I suggested before, you can really only guarantee your paying users a fixed service level if they are accessing a DMOZ clone of your own.

  • Editall
Posted
About teamix, I just found out that they are the ones hosting the German mirror:

 

At a guess, the German ISP has noticed significant bandwidth usage from "screen scraping" scripts/bots across the directory and decided that they don't care to further subsidize those bandwidth costs. While there are uses for such activities, if abused (either intentionally or through incorrect configuration) they can create major problems in bandwidth usage.

 

It's similar to people who link directly to images on someone else's server and leech their bandwidth to display them. There was a time when this practice had become so widespread that many hosts and sites prevented, and still prevent, linking to images except from within their own networks/IP blocks/domains.

ODP Editor callimachus

Any opinions expressed are my own, and do not represent an official opinion or communication from the ODP.

Private messages asking for submission status or preferential treatment will be ignored.
