Guest Dare2 Posted July 8, 2003
Just realised what the project is, and I am rapt that this exists. Great idea. Okay, first the blush :o because some of this is probably obvious to all but me. I would like to offer directory listings on some of my sites. I understand that if I follow the terms of use and provide the correct attributions, etc., this is acceptable. Now the Qs.
1: Can I pull info from the Open Directory using the XMLHTTP object with ASP, or is it required that the RDFs are downloaded on a regular basis?
2: If I can pull the pages, do I need to pull the HTML, or is there a way to just get the data, in XML or even in a simple (e.g. comma-delimited list) format?
Thanks in advance.
Guest Dare2 Posted July 9, 2003
Okay, that was obviously way too easy and not deserving of a response. Another Q: how do you prevent the files at http://dmoz.org/rdf/ from expanding when attempting to download them via a browser? If this is also too easy, is there a forum where people will answer newb questions on getting the data?
giz Posted July 9, 2003
It is probably best for you to take the RDF files. A new version usually appears about 3 or 4 times per month.
Screen scrapers use up a lot of the ODP's resources, slowing the system down for both submitters and editors. This has more or less forced the ODP to move all of the editor functions onto a separate machine to get away from the problems. The next phase of upgrades starts this coming weekend. Expect downtime of about 1 week for submitting, editing, editor applications and so on.
Setting up datafeeds has been suggested and discussed in the past. It may happen in the future, but there is a lot of more important stuff to do first, not least the current server upgrades.
If you are going to read in HTML pages, at least do it from a mirror like ch.dmoz.org or de.dmoz.org (the latter is offline at the moment) so that the ODP main machine is not slowed down.
The RDFs at http://dmoz.org/rdf/ are very old versions. Use the new URL at http://rdf.dmoz.org/rdf/ instead. To stop the browser opening the files, right-click the link, and then select the appropriate "Save" option.
Easy questions. Easy answers. You just needed to allow for time zones, working hours, and sleep time.
Guest Dare2 Posted July 9, 2003
Hi giz, Thanks for that, really appreciate it. (You guys sleep?)
The download of, for example, the structure file (45 MB) using IE, with right-click, save-target-as, looks like it is downloading compressed, gets to 45 MB and keeps going forever ...
99% complete, downloading 45mb of 45mb
99% complete, downloading 49mb of 49mb
99% complete, downloading 55mb of 55mb
etc.
Is there an FTP area? Thanks again.
Meta theseeker Posted July 9, 2003
I've never been able to download them using IE, as it always tries to uncompress them on the fly. I used to use Opera, which worked fine. You can also use wget.
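For anyone scripting the download instead of using a browser or wget, here is a minimal Python sketch that saves a compressed file exactly as served, with no on-the-fly expansion. The base URL is the one giz gave above; the file name and the helper names `save_raw` and `download` are assumptions for illustration only, so substitute whatever name the RDF directory actually lists.

```python
import urllib.request

# Base URL quoted in the thread. The file name below is a placeholder
# (an assumption), not a confirmed name from the RDF directory listing.
RDF_BASE = "http://rdf.dmoz.org/rdf/"
EXAMPLE_FILE = "structure-file-name.gz"  # hypothetical


def save_raw(stream, dest_path, chunk_size=64 * 1024):
    """Copy a binary stream to disk unchanged; return the bytes written."""
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


def download(url, dest_path):
    """Fetch url and store the response body byte for byte."""
    # urllib does not decompress response bodies itself, so a .gz file
    # stays compressed on disk, which is the behaviour wanted here.
    with urllib.request.urlopen(url) as response:
        return save_raw(response, dest_path)


# Example call (not run here, since it needs network access):
# download(RDF_BASE + EXAMPLE_FILE, EXAMPLE_FILE)
```

Reading in fixed-size chunks keeps memory use flat, which matters for files in the tens of megabytes.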
Guest Dare2 Posted July 9, 2003
Hi theseeker, Thanks for that. I now have them, and wget as well.
Meta hutcheson Posted July 10, 2003
I wish to demur from my distinguished colleague's statement.
(1) Whether an RDF download or a screen scrape is more expensive depends on how much activity your site gets. Google should get the RDF (which they do, of course); geocities.com/Ulan_Bator/mypurlingsite/index.htm, with 600 page views a month, can scrape the Purling category from the ODP all month long, no problem. Do the math, and see how much screen scraping will cost. If the costs are anywhere close to even, the ODP would surely prefer you go the RDF approach.
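To make "do the math" concrete, here is a rough Python sketch of the comparison. The 600 page views, the 45 MB file size and the 3 or 4 releases per month come from this thread; the 20 KB page size is purely an assumption, as is the simplification that every page view triggers one scrape of one ODP page (no caching) and that only the one 45 MB file is fetched per release. The function names are made up for the example.

```python
KB = 1024
MB = 1024 * KB


def monthly_scrape_bytes(page_views, page_size_bytes):
    """Bytes pulled per month if each page view scrapes one ODP page."""
    return page_views * page_size_bytes


def monthly_rdf_bytes(dump_size_bytes, releases_per_month):
    """Bytes pulled per month if every new RDF release is downloaded."""
    return dump_size_bytes * releases_per_month


def break_even_page_views(dump_size_bytes, releases_per_month, page_size_bytes):
    """Page views per month at which scraping costs as much as the RDF."""
    total = monthly_rdf_bytes(dump_size_bytes, releases_per_month)
    return total // page_size_bytes


if __name__ == "__main__":
    page_size = 20 * KB   # assumption: size of one scraped category page
    dump_size = 45 * MB   # structure file size mentioned earlier in the thread
    releases = 4          # "about 3 or 4 times per month"

    scrape = monthly_scrape_bytes(600, page_size)
    rdf = monthly_rdf_bytes(dump_size, releases)
    print(f"scraping: {scrape / MB:.1f} MB/month")
    print(f"RDF:      {rdf / MB:.1f} MB/month")
    print("break-even page views:",
          break_even_page_views(dump_size, releases, page_size))
```

Under these assumed numbers the small site is far below the break-even point, which is hutcheson's argument; a different page size or a larger set of RDF files shifts the threshold accordingly.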
giz Posted July 10, 2003
Thanks for the additional clarification to the message. I used the word "probably" when speaking of the RDF as there were bound to be a few instances when it might not be true, and your example may be one such case. However, the effect of one screen scraper taking 1000 pages an hour is the same as 1000 screen scrapers taking one page each; but I believe the hungrier diners have been denied service these days, leaving just the small players and those who play fair with the resources.
You might also be interested in http://rodan.ncc.com/rdf/ and some of the snippets in other pages linked from there.
Guest Dare2 Posted July 22, 2003
Hi guys. Regarding bandwidth, load speed, congestion, etc.:
A really useful feature would be to have dated update files that list only the items that have been dropped, moved, added or otherwise altered. This would make it more acceptable for low- and middle-order (hitwise) sites to hold the data. It would reduce the load on DMOZ servers both ways (fewer calls to the directory, as the option is more attractive to more sites, and smaller downloads for sites wanting to stay up to date).
Just a thought, and as I don't know all the ins and outs it may be a poor one. But ...
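The update-file idea can be sketched in a few lines of Python. This is only an illustration of the concept, not anything the ODP provides: it assumes each snapshot has already been reduced to a mapping from listing URL to category path, it treats "moved" as a change of category and ignores other alterations such as edited titles, and the function names and sample URLs are invented for the example.

```python
def diff_snapshots(old, new):
    """Compare two {url: category} snapshots and return only the changes."""
    added = {url: new[url] for url in new.keys() - old.keys()}
    dropped = sorted(old.keys() - new.keys())
    moved = {
        url: (old[url], new[url])
        for url in old.keys() & new.keys()
        if old[url] != new[url]
    }
    return {"added": added, "dropped": dropped, "moved": moved}


def apply_delta(old, delta):
    """Bring a local copy up to date using a delta from diff_snapshots."""
    updated = dict(old)
    for url in delta["dropped"]:
        updated.pop(url, None)
    for url, category in delta["added"].items():
        updated[url] = category
    for url, (_, new_category) in delta["moved"].items():
        updated[url] = new_category
    return updated


if __name__ == "__main__":
    # Hypothetical sample data.
    last_month = {
        "http://example.com/a": "Top/Arts/Crafts",
        "http://example.com/b": "Top/Arts/Crafts",
        "http://example.com/c": "Top/Games",
    }
    this_month = {
        "http://example.com/a": "Top/Arts/Crafts",
        "http://example.com/b": "Top/Arts/Crafts/Knitting",
        "http://example.com/d": "Top/Games",
    }
    delta = diff_snapshots(last_month, this_month)
    print(delta)
    assert apply_delta(last_month, delta) == this_month
```

A site holding last month's data would then only need the delta, which is typically much smaller than a full dump when few listings change between releases.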
Guest Dajuroka Posted October 3, 2003
Same here ... really scary when you can only have 500 MB a month!