Some questions re access via ASP


Guest Dare2
Posted

Just realised what the project is, and I am rapt that this exists. Great idea.

 

Okay, first the blush :o because some of this is probably obvious to all but me.

 

I would like to offer directory listings on some of my sites. I understand that if I follow the terms of use and provide the correct attributions, etc, this is acceptable.

 

Now the Qs.

 

1: Can I pull info from the open directory using the XMLHTTP object with ASP, or is it required that the RDFs are downloaded on a regular basis?

 

2: If I can pull the pages, do I need to pull the HTML, or is there a way to just get the data, in XML or even in a simple (e.g. comma-delimited list) format?

 

Thanks in advance.

 

:)

Guest Dare2
Posted

Okay, that was obviously way too easy and not deserving of a response. :p

 

Another Q: how do you prevent the files at http://dmoz.org/rdf/ from expanding when attempting to download them via a browser?

 

If this is also too easy, is there a forum where people will answer newb questions on getting the data? :)

Posted

It is probably best for you to take the RDF files. A new version usually appears about 3 or 4 times per month.

 

 

Screen scrapers use up a lot of the ODP resources, slowing the system down for both submitters and editors. This has sort of forced the ODP into moving all of the editor functions on to a separate machine to get away from the problems. The next phase of upgrades starts this coming weekend. Expect downtime of about 1 week for submitting, editing, editor applications and so on.

 

 

Setting up datafeeds has been suggested and discussed in the past. It may happen in the future but there is a lot more important stuff to do first, not least the current server upgrades.

 

 

If you are going to read in HTML pages, at least do it from a mirror like ch.dmoz.org or de.dmoz.org (the latter is offline at the moment) so that the ODP main machine is not slowed down.

 

 

The RDFs at http://dmoz.org/rdf/ are very old versions. Use the new URL at http://rdf.dmoz.org/rdf/ instead.

 

 

To stop the browser opening the files, right click the link, and then select the appropriate "Save" option.

 

 

 

Easy questions. Easy answers. You just needed to allow for time zones, working hours, and sleep time.

Guest Dare2
Posted

Hi giz,

 

Thanks for that, really appreciate it.

 

(You guys sleep?)

 

The download of, for example, the structure file (45mb) using IE, with right-click, save-target-as looks like it is downloading compressed, gets to 45mb and keeps going forever ...

99% complete, downloading 45mb of 45mb

99% complete, downloading 49mb of 49mb

99% complete, downloading 55mb of 55mb

etc.

 

:confused:

 

Is there an FTP area?

 

Thanks again.

  • Meta
Posted
I've never been able to download them using IE, as it always tries to uncompress them on the fly. I used to use Opera, which worked fine. You can also use wget.
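
If you would rather script the download than fight the browser, here is a minimal Python sketch of the same idea as wget: copy the bytes to disk exactly as they arrive, without uncompressing anything. The function names are just illustrative, and the file name in the usage comment is a placeholder, not a real path; substitute whichever file you want from the RDF directory. As far as I know, `urllib.request` does not gunzip on the fly, so the saved file should stay compressed.

```python
import urllib.request


def save_stream(source, dest_path, chunk_size=64 * 1024):
    """Copy bytes from a file-like object to dest_path unchanged.

    Reads in fixed-size chunks so a large dump never has to fit in memory.
    Returns the number of bytes written.
    """
    total = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total


def download_raw(url, dest_path):
    """Fetch url and save the response body as-is (still compressed)."""
    with urllib.request.urlopen(url) as response:
        return save_stream(response, dest_path)


# Hypothetical usage; "<file>" is a placeholder for a real file name:
# download_raw("http://rdf.dmoz.org/rdf/<file>", "<file>")
```

The returned byte count can be compared with the size shown in the directory listing to confirm the download actually finished.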
Guest Dare2
Posted

Hi theseeker,

 

Thanks for that. I now have them, and wget as well. :)

  • Meta
Posted

I wish to demur from my distinguished colleague's statement.

 

(1) Whether an RDF download or a screen-scrape is more expensive depends on how much activity your site gets. Google should get the RDF (which they do, of course); geocities.com/Ulan_Bator/mypurlingsite/index.htm, with 600 page views a month, can scrape the Purling category from the ODP all month long, no problem. Do the math, and see how much screenscraping will cost.

 

If the costs are anywhere close to even, the ODP would surely prefer you go the RDF approach.
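
For anyone who wants to do the math, here is a rough Python sketch. The 600 page views, the 45 MB file size and the 3 refreshes a month come from this thread; the 20 KB per scraped page is purely an assumption, and the 45 MB is only the structure file, so plug in your own numbers.

```python
def monthly_scrape_mb(page_views, page_kb):
    """Data pulled from the directory if every page view triggers one scrape."""
    return page_views * page_kb / 1024


def monthly_rdf_mb(dump_mb, refreshes_per_month):
    """Data pulled if you fetch every new dump as it appears."""
    return dump_mb * refreshes_per_month


def break_even_views(dump_mb, refreshes_per_month, page_kb):
    """Page views per month at which scraping costs as much as the dumps."""
    return monthly_rdf_mb(dump_mb, refreshes_per_month) * 1024 / page_kb


PAGE_KB = 20    # assumption: average size of one scraped category page
DUMP_MB = 45    # structure file size quoted earlier in the thread
REFRESHES = 3   # "about 3 or 4 times per month"

print(f"scrape: {monthly_scrape_mb(600, PAGE_KB):.1f} MB/month")
print(f"rdf:    {monthly_rdf_mb(DUMP_MB, REFRESHES):.1f} MB/month")
print(f"break-even at {break_even_views(DUMP_MB, REFRESHES, PAGE_KB):.0f} views/month")
```

With those figures the small site scrapes about 11.7 MB a month against 135 MB for the dumps, and the two only meet at roughly 6,900 page views a month. Caching scraped pages on your side would tilt it further toward scraping; a bigger dump would tilt it further still.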

Posted

Thanks for the additional clarification to the message. I used the word "probably" when speaking of the RDF as there were bound to be a few instances when it might not be true, and your example may be one such case. However, the effect of one screen scraper taking 1000 pages an hour is the same as 1000 screen scrapers taking one page each; but I believe the hungrier diners have been denied service these days, leaving just the small players and those who play fair with the resources.

 

 

You might also be interested in http://rodan.ncc.com/rdf/ and some of the snippets in other pages linked from there.

  • 2 weeks later...
Guest Dare2
Posted

Hi guys.

 

Regarding bandwidth, load speed, congestion, etc.: A really useful feature would be to have dated update files that list only the items that have been dropped, moved, added or otherwise altered.

 

This would make it more acceptable for low- and middle-order (hitwise) sites to hold the data. It would reduce the load on DMOZ servers both ways (fewer calls to the directory as the option is more attractive to more sites, and smaller downloads for sites wanting to stay up to date).
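
To make the idea concrete, here is a small Python sketch of what such an update file could contain. It assumes each snapshot has already been boiled down to a plain mapping of URL to category path; that mapping and the example URLs are hypothetical, not the actual RDF format.

```python
def diff_snapshots(old, new):
    """Compare two {url: category} snapshots and list what changed.

    Returns a dict with three entries:
      added   - URLs only in the new snapshot, with their category
      dropped - URLs only in the old snapshot, with their old category
      moved   - URLs in both whose category changed, as (old, new) pairs
    """
    added = {url: cat for url, cat in new.items() if url not in old}
    dropped = {url: cat for url, cat in old.items() if url not in new}
    moved = {
        url: (old[url], new[url])
        for url in old.keys() & new.keys()
        if old[url] != new[url]
    }
    return {"added": added, "dropped": dropped, "moved": moved}


# Made-up example data, for illustration only.
last_month = {
    "http://a.example/": "Arts/Crafts/Purling",
    "http://b.example/": "Arts/Crafts",
    "http://c.example/": "Arts",
}
this_month = {
    "http://a.example/": "Arts/Crafts/Purling",
    "http://b.example/": "Arts/Crafts/Purling",
    "http://d.example/": "Arts",
}

changes = diff_snapshots(last_month, this_month)
```

A site holding last month's data would then only need the `changes` part, which should be far smaller than a full dump when most listings stay put.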

 

Just a thought, and as I don't know all the ins and outs it may be a poor one. But ...

 

:)

  • 2 months later...
Guest Dajuroka
Posted
Same here .... really scary when you can only have 500MB a month! :confused:
