All sites that use live ODP data are down now...



Guest browser007
Posted

Re: All sites that use live ODP data are down now.

 

Are you suggesting that if all these sites used ch.dmoz.org, there would be no problems at all?

Guest browser007
Posted

Apparently, a techie has blocked access for sites that use ODP data since Saturday, September 27 (the last time my server cached a file was 27-Sep-2003, 3:40 PM EST).

 

P.S. It's not only about my site, but about ALL sites that use ODP live data :-(

  • Meta
Posted

Did you try to access dmoz.org by hand? Most likely it was not a "techie" who blocked access, but server load that is preventing you from reaching it.

 

ch.dmoz.org is a bit out of date, as is de.dmoz.org. Both are external mirrors hosted at different places (hint: those are country codes in front of dmoz.org ;) ), so they don't put load on our main server.

Curlie Meta/kMeta Editor windharp

 


  • Meta
Posted

When these types of scripts that use live data first showed up, dmoz staff asked the writers of the scripts to aim them at some other server, like the Netscape servers (though I suspect that may not be allowed anymore either). But the programmers making the scripts wanted the most up-to-date data, and so eventually ignored that request.

 

Since the dmoz.org system was not made to handle a lot of traffic (that's why the data is distributed through the RDF), the growing number of robots and screen-scraping programs has been slowing the servers down for quite some time. I'm quite surprised that it's taken this long, but from all the signs, I would say that sites taking data directly from the public servers are going to be blocked now.

 

I suggest exploring other avenues, like processing the RDF. The mirror servers are probably not the solution. They are provided free of charge, and the people running them are probably not going to keep supplying the kind of resources it would take to satisfy every site that wants live data.
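To make that concrete, here is a rough sketch of how the RDF dump could be processed with a streaming parser so the whole file never has to sit in memory. The file name content.rdf.u8 and the element names are assumptions based on the ODP dumps of that era (which also had encoding quirks, so a forgiving setup may be needed):

# Sketch only: stream the ODP content dump instead of requesting live
# pages from dmoz.org. File name and element names are assumptions.
import xml.etree.ElementTree as ET

def local_name(tag):
    # ElementTree reports tags as "{namespace}Name"; keep just "Name".
    return tag.rsplit('}', 1)[-1]

def iter_external_pages(path):
    # iterparse walks the file incrementally; clearing the root after each
    # record keeps memory use roughly constant even for a very large dump.
    context = ET.iterparse(path, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and local_name(elem.tag) == 'ExternalPage':
            entry = {'url': elem.get('about') or next(
                (v for k, v in elem.attrib.items() if k.endswith('about')), None)}
            for child in elem:
                entry[local_name(child.tag)] = (child.text or '').strip()
            yield entry
            root.clear()

if __name__ == '__main__':
    for i, page in enumerate(iter_external_pages('content.rdf.u8')):
        print(page.get('url'), '-', page.get('Title', ''))
        if i >= 9:  # just peek at the first few records
            break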

 

:monocle:

Guest browser007
Posted

Thank you for this information.

I'm trying to find a way to chop these huge RDF dumps into manageable pieces instead of using ODP live data.
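For what it's worth, a minimal sketch of one way to do the chopping, assuming the dump is the usual content file and that every category starts with a "<Topic" line (both are assumptions; adjust the marker and file name for the dump you actually have). Each output chunk then holds a fixed number of whole Topic records:

# Sketch only: split a huge ODP RDF dump into smaller numbered files.
# The dump name and the "<Topic" record marker are assumptions; the
# chunks are plain fragments, not standalone well-formed XML documents.

DUMP = 'content.rdf.u8'
TOPICS_PER_CHUNK = 5000

def split_dump(path, topics_per_chunk=TOPICS_PER_CHUNK):
    chunk_no = 0
    topics_in_chunk = 0
    out = open('chunk_%04d.rdf' % chunk_no, 'w', encoding='utf-8')
    with open(path, encoding='utf-8', errors='replace') as dump:
        for line in dump:
            # Start a new output file whenever the current one is "full",
            # but only at a record boundary so Topics are never cut in half.
            if line.lstrip().startswith('<Topic '):
                topics_in_chunk += 1
                if topics_in_chunk > topics_per_chunk:
                    out.close()
                    chunk_no += 1
                    topics_in_chunk = 1
                    out = open('chunk_%04d.rdf' % chunk_no, 'w', encoding='utf-8')
            out.write(line)
    out.close()

if __name__ == '__main__':
    split_dump(DUMP)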

 

P.S. I know what the problem is with these spiders:

 

When Googlebot or another search-engine spider comes along to index pages on my site, the bot leaves traces in MY server log files. But its requests go through my program to fetch pages from dmoz.org, so in dmoz.org's log files there is no trace of Googlebot; instead, my site appears to be an unknown robot abusing the dmoz.org server - and they block access to my site :-(

  • Meta
Posted

At least Googlebot is very obedient regarding robots.txt files. Since it is unnecessary to spider borrowed content anyway, it should be excluded by default. [My opinion!]
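For example, a couple of lines in robots.txt would have kept well-behaved spiders like Googlebot away from the borrowed pages; the /odp/ path here is just a placeholder for wherever the mirrored ODP content actually lives:

User-agent: *
Disallow: /odp/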

Curlie Meta/kMeta Editor windharp

 


Guest browser007
Posted

windharp, I agree with you: spidering "borrowed content" is not quite fair. It's too late now to exclude bots, because my site has no access to dmoz.org anymore.

Wish I had done that from the beginning...
