Guest browser007 Posted September 28, 2003
... while you can easily access dmoz.org itself. What's happening? Are we no longer authorized to use ODP data? You can test this yourself. Go to http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Sites_Using_ODP_Data/full-index.html and pick a site of your choice: unless it caches some pages on its own server, it shows no information from dmoz.org.
giz Posted September 28, 2003
Re: All sites that use live ODP data are down now.
Maybe they should be using http://ch.dmoz.org/ instead?
Guest browser007 Posted September 28, 2003
Re: All sites that use live ODP data are down now.
Are you suggesting that if all these sites use ch.dmoz.org, there are no problems at all?
Guest browser007 Posted September 28, 2003
Apparently, a techie has blocked access for sites that use ODP data since Saturday, September 27 (the last time my server cached a file was 27-Sep-2003 3:40 PM EST).
PS. It's not only about my site, but about ALL sites that use live ODP data :-(
Meta windharp Posted September 29, 2003
Did you try to access dmoz.org by hand? Most likely it was not a "techie" who blocked access but server load that prevents you from accessing it. ch.dmoz.org is a bit out of date, as is de.dmoz.org. Both are external mirrors hosted at different places (hint: those are country codes in front of dmoz.org), so they don't put load on our main server.
Curlie Meta/kMeta Editor windharp
Meta theseeker Posted September 29, 2003
When these types of scripts that use live data first showed up, dmoz staff asked the writers of the scripts to aim them at some other server, like the Netscape servers (though I suspect that may not be allowed anymore either). But the programmers making the scripts wanted the most up-to-date data, and so eventually ignored that request.
Since the dmoz.org system was not made to handle a lot of traffic (that's why the data is distributed through the RDF), the robots and screen-scraping programs have been slowing the servers down for quite some time. I'm quite surprised that it's taken this long, but from all the signs, I would say that sites taking data directly from the public servers are going to be blocked now.
I suggest exploring other avenues, like processing the RDF. The mirror servers are probably not the solution. They are provided free of charge, and the people providing them are probably not going to keep providing the kind of resources it would take to satisfy all the sites that want live data.
Guest browser007 Posted September 29, 2003
Thank you for this information. I'm trying to find a way to chop these huge RDF dumps into manageable pieces, instead of using live ODP data.
P.S. I know what the problem is with these spiders: when Googlebot or another search engine spider comes to my site to index pages, the bot leaves traces in MY server log files. But the bot uses my program to request pages on dmoz.org, so in the dmoz.org log files there is no trace of Googlebot. Instead, it appears that my site is an unknown robot abusing the dmoz.org server, and they block access to my site :-(
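One possible way to chop a large dump into manageable pieces is to stream it line by line and cut only at record boundaries, so the whole file never has to fit in memory. The following is a minimal sketch, not a tested ODP importer: the function name `split_records`, the `max_records` parameter, and the assumption that each record begins on a line starting with `<Topic ` are all illustrative and should be checked against the actual dump.

```python
def split_records(lines, start_marker="<Topic ", max_records=1000):
    """Yield lists of lines, each holding at most max_records records.

    Assumption: a record begins on a line that starts with start_marker.
    Lines before the first marker (e.g. a file header) stay in the first chunk.
    """
    chunk, count = [], 0
    for line in lines:
        if line.lstrip().startswith(start_marker):
            # Cut before starting a new record once the chunk is full.
            if count >= max_records:
                yield chunk
                chunk, count = [], 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(path, prefix, max_records=1000):
    """Write each chunk of the file at `path` to prefix-0000.txt, prefix-0001.txt, ..."""
    # errors="replace" keeps the loop going if the dump has stray bytes.
    with open(path, encoding="utf-8", errors="replace") as src:
        for i, chunk in enumerate(split_records(src, max_records=max_records)):
            with open(f"{prefix}-{i:04d}.txt", "w", encoding="utf-8") as out:
                out.writelines(chunk)
```

Because the pieces are cut by line rather than by an XML parser, each piece is a plain text fragment and not a complete XML document, so whatever reads the pieces has to tolerate that.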
Meta windharp Posted September 29, 2003
At least Googlebot is very obedient regarding robots.txt files. Since it is unnecessary to spider borrowed content anyway, such content should be excluded by default. [My opinion!]
Curlie Meta/kMeta Editor windharp
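As a rough illustration of excluding borrowed content through robots.txt, the sketch below checks a rule set with Python's standard `urllib.robotparser`. The `/odp/` path, the `example.com` host, and the helper name `is_allowed` are assumptions made up for the example; the real path depends on where a site serves its directory pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every well-behaved crawler out of the
# section of the site that only mirrors directory data.
ROBOTS_TXT = """\
User-agent: *
Disallow: /odp/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `is_allowed("Googlebot", "http://example.com/odp/Computers/")` is false while pages outside `/odp/` stay crawlable. This only helps with crawlers that honor robots.txt.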
Guest browser007 Posted September 29, 2003
windharp, I agree with you: spidering "borrowed content" is not quite fair. It's too late to exclude bots now, because my site no longer has access to dmoz.org. I wish I had done that from the beginning...