All sites that use live ODP data are down now...



Guest browser007
Posted

Re: All sites that use live ODP data are down now.

 

Are you suggesting that if all these sites used ch.dmoz.org, there would be no problems at all?

Guest browser007
Posted

Apparently, a techie has blocked access for sites that use ODP data since Saturday, September 27 (the last time my server cached a file was 27-Sep-2003, 3:40 PM EST).

 

P.S. It's not only about my site, but about ALL sites that use ODP live data :-(

  • Meta
Posted

Did you try to access dmoz.org by hand? Most likely it was not a "techie" who blocked access, but server load that is preventing you from reaching it.

 

ch.dmoz.org is a bit out of date, as is de.dmoz.org. Both are external mirrors hosted at different places (hint: those are country codes in front of dmoz.org ;) ), so they don't put load on our main server.

Curlie Meta/kMeta Editor windharp

 


  • Meta
Posted

When these types of scripts that use live data first showed up, dmoz staff asked the writers of the scripts to aim them at some other server, like the Netscape servers (though I suspect that may not be allowed anymore either). But the programmers making the scripts wanted the most up-to-date data, and so eventually ignored that request.

 

Since the dmoz.org system was not made to handle a lot of traffic (that's why the data is distributed through the RDF), the growing number of robots and screen-scraping programs has been slowing the servers down for quite some time. I'm quite surprised that it's taken this long, but from all the signs, I would say that sites taking data directly from the public servers are going to be blocked now.

 

I suggest exploring other avenues, like processing the RDF. The mirror servers are probably not the solution. They are provided free of charge, and the people running them are probably not going to keep supplying the kind of resources it would take to satisfy every site that wants live data.
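To make that concrete, here is a rough sketch of how the RDF dump could be processed with a streaming parser so the whole file never has to sit in memory. The file name content.rdf.u8 and the element names are assumptions based on the ODP dumps of that era (which also had encoding quirks, so a forgiving setup may be needed):

# Sketch only: stream the ODP content dump instead of requesting live
# pages from dmoz.org. File name and element names are assumptions.
import xml.etree.ElementTree as ET

def local_name(tag):
    # ElementTree reports tags as "{namespace}Name"; keep just "Name".
    return tag.rsplit('}', 1)[-1]

def iter_external_pages(path):
    # iterparse walks the file incrementally; clearing the root after each
    # record keeps memory use roughly constant even for a very large dump.
    context = ET.iterparse(path, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and local_name(elem.tag) == 'ExternalPage':
            entry = {'url': elem.get('about') or next(
                (v for k, v in elem.attrib.items() if k.endswith('about')), None)}
            for child in elem:
                entry[local_name(child.tag)] = (child.text or '').strip()
            yield entry
            root.clear()

if __name__ == '__main__':
    for i, page in enumerate(iter_external_pages('content.rdf.u8')):
        print(page.get('url'), '-', page.get('Title', ''))
        if i >= 9:  # just peek at the first few records
            break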

 

:monocle:

Guest browser007
Posted

Thank you for this information.

I'm trying to find a way to chop these huge RDF dumps into manageable pieces instead of using ODP live data.
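For what it's worth, a minimal sketch of one way to do the chopping, assuming the dump is the usual content file and that every category starts with a "<Topic" line (both are assumptions; adjust the marker and file name for the dump you actually have). Each output chunk then holds a fixed number of whole Topic records:

# Sketch only: split a huge ODP RDF dump into smaller numbered files.
# The dump name and the "<Topic" record marker are assumptions; the
# chunks are plain fragments, not standalone well-formed XML documents.

DUMP = 'content.rdf.u8'
TOPICS_PER_CHUNK = 5000

def split_dump(path, topics_per_chunk=TOPICS_PER_CHUNK):
    chunk_no = 0
    topics_in_chunk = 0
    out = open('chunk_%04d.rdf' % chunk_no, 'w', encoding='utf-8')
    with open(path, encoding='utf-8', errors='replace') as dump:
        for line in dump:
            # Start a new output file whenever the current one is "full",
            # but only at a record boundary so Topics are never cut in half.
            if line.lstrip().startswith('<Topic '):
                topics_in_chunk += 1
                if topics_in_chunk > topics_per_chunk:
                    out.close()
                    chunk_no += 1
                    topics_in_chunk = 1
                    out = open('chunk_%04d.rdf' % chunk_no, 'w', encoding='utf-8')
            out.write(line)
    out.close()

if __name__ == '__main__':
    split_dump(DUMP)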

 

P.S. I know what the problem is with these spiders:

 

When Googlebot or another search-engine spider comes along to index pages on my site, the bot leaves traces in MY server log files. But its requests go through my program to fetch pages from dmoz.org, so in dmoz.org's log files there is no trace of Googlebot; instead, my site appears to be an unknown robot abusing the dmoz.org server - and they block access to my site :-(

  • Meta
Posted

At least Googlebot is very obedient regarding robots.txt files. Since it is unnecessary to spider borrowed content anyway, it should be excluded by default. [My opinion!]
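For example, a couple of lines in robots.txt would have kept well-behaved spiders like Googlebot away from the borrowed pages; the /odp/ path here is just a placeholder for wherever the mirrored ODP content actually lives:

User-agent: *
Disallow: /odp/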

Curlie Meta/kMeta Editor windharp

 


Guest browser007
Posted

windharp, I agree with you: spidering "borrowed content" is not quite fair. It's too late now to exclude bots, because my site has no access to dmoz.org anymore.

Wish I had done that from the beginning...
