Posted

Hi,

 

I want to use the categorized ODP data, or more precisely, the web pages that are listed in the ODP. For this I need to download the pages behind the links in the ODP listing, category by category.

 

I have downloaded the RDF dump from the ODP website. The problem is that the dump is too large: a single 1.85 GB file on disk. How should I go about processing it? There are parsers, but isn't the file too large for them? Is there a way to split the dump by category, or at least into parts, to make it more manageable?

 

Thanks!

Rahul.

Posted

The two best tools I've found so far for parsing ODP data aren't listed in that part of the directory.

 

There's a Perl module called Catalog-1.03, available at http://senga.org or downloadable from CPAN, that makes it fairly easy to extract the entire contents of the RDF into a MySQL database and browse it locally with a web front end. I've been working on updating some of it since I found it, but I haven't figured out how to put my changes back into Savannah yet (though I have been in contact with the original author). In its currently available state it's not compliant with the DMOZ acknowledgement license, so you can't use it in public out of the box, but that's not hard to fix.

 

On an 800 MHz G4 Mac it will extract the entire RDF overnight, and then a couple more hours will get it into the MySQL database. The ODP parser creates several files, one of which contains just the URLs, their descriptions, etc. in an easily extractable format, so if you just want URLs, that's a good way to get them.

 

If you want to pull out chunks from the RDF (say one or two subcategories), there's a Perl module called DMOZ::ParseRDF (available at CPAN) that is *really* fast. It pulled out the entire music section in seconds. If you then feed that to the RDF parser from Catalog, you can pretty quickly get a file that has just the URLs.
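
If you'd rather not install either module, the same idea (pull one category's URLs out of the dump without loading the whole file into memory) can be sketched as a streaming pass in Python. This is only a rough illustration, not how Catalog or DMOZ::ParseRDF work internally. It assumes the dump is well-formed XML and that each listing is an `ExternalPage` element with an `about` attribute and `Title` and `topic` children; those names, and the function `iter_external_pages`, are assumptions made for the example, so check them against your copy of the dump.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element or attribute name."""
    return tag.rsplit("}", 1)[-1]


def iter_external_pages(source, category):
    """Yield (url, title, topic) for listings in `category` or below it.

    `source` is a filename or a binary file object. The element and
    attribute names (ExternalPage, about, Title, topic) are assumptions
    about the dump layout, not something this code verifies.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "ExternalPage":
            continue
        # The URL is read from the 'about' attribute, with or without a namespace.
        url = next(
            (v for k, v in elem.attrib.items() if local_name(k) == "about"),
            None,
        )
        fields = {
            local_name(child.tag): (child.text or "").strip() for child in elem
        }
        topic = fields.get("topic", "")
        # Match the category itself or anything under it, but not siblings
        # that merely share a prefix (e.g. 'Top/Arts/Musicals').
        if url and (topic == category or topic.startswith(category + "/")):
            yield url, fields.get("Title", ""), topic
        # Drop everything parsed so far so memory use stays roughly constant.
        root.clear()
```

Usage would look something like `for url, title, topic in iter_external_pages("content.rdf.u8", "Top/Arts/Music"): print(url)`, where the filename is whatever you saved the dump as. If the dump contains bytes the XML parser rejects, `iterparse` raises `ParseError`, and the file would need cleaning up first.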

Posted
I think I suggested linking Catalog to that category some time ago (it's listed somewhere else because it's actually a more general tree-like catalog tool). I'll submit DMOZ::ParseRDF though...
Posted

Yes. Please submit any such sites to the relevant categories. :-)

 

 

The submit function is usually of far more use to us when surfers submit sites that they have found to be useful, rather than when webmasters submit sites that they have made themselves.
