How to Use ODP Data for Research

rj365

Member
Joined
Mar 27, 2005
Messages
6
Hi,

I want to use the categorized ODP data, or more precisely the web pages that are listed in the ODP. For this I need to download the pages behind the links in the ODP listing, category by category.

I have downloaded the RDF dump from the ODP website. The problem is that the dump is too large: a single 1.85 GB file on disk. The question is: how should I go about processing it? There are parsers, but isn't the file too large for them? Is there a way to split the dump into categories, or at least into parts, to make it more manageable?

Thanks!
Rahul.
 

bitingduck

Member
Joined
Mar 19, 2005
Messages
8
The two best tools I've found so far for parsing odp data aren't listed in that part of the directory.

There's a Perl module called Catalog-1.03, available at http://senga.org or downloadable from CPAN, that can be used to fairly easily extract the entire contents of the RDF into a MySQL database and browse it locally with a web front end. I've been working on updating some of it since I found it, but I haven't figured out how to put stuff back into Savannah yet (though I have been in contact with the original author). In its currently available state it's not compliant with the DMOZ acknowledgement license, so you can't use it in public out of the box, but that's not hard to fix.

On an 800 MHz G4 Mac it will extract the entire RDF overnight, and then a couple of hours will get it into the MySQL database. The ODP parser creates several files, one of which is just the URLs, their descriptions, etc. in an easily extractable format, so if you just want URLs, that's a good way to get them.
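For anyone who would rather not set up Catalog and MySQL, here is a rough sketch of the same idea in Python, with SQLite swapped in for MySQL. It is not how Catalog works internally. It assumes the dump uses the usual `<ExternalPage about="...">` records with `<d:Title>`, `<d:Description>` and `<topic>` children, each on a single line; the names `iter_pages`, `load_pages` and the `pages` table are made up for this example. Reading line by line means the 1.85 GB file never has to fit in memory.

```python
import html
import re
import sqlite3

# Assumed record layout (check it against your own copy of the dump):
#   <ExternalPage about="http://...">
#     <d:Title>...</d:Title>
#     <d:Description>...</d:Description>
#     <topic>Top/Arts/Music</topic>
#   </ExternalPage>
PAGE_OPEN = re.compile(r'<ExternalPage about="([^"]*)"')
FIELD = re.compile(r'<(d:Title|d:Description|topic)>(.*?)</\1>')
COLUMN = {"d:Title": "title", "d:Description": "description", "topic": "topic"}


def iter_pages(lines):
    """Yield one dict per ExternalPage record, reading one line at a time."""
    page = None
    for line in lines:
        opened = PAGE_OPEN.search(line)
        if opened:
            page = {"url": html.unescape(opened.group(1)),
                    "title": "", "description": "", "topic": ""}
            continue
        if page is None:
            continue  # outside a page record (e.g. inside a <Topic> block)
        if "</ExternalPage>" in line:
            yield page
            page = None
            continue
        # Only fields that open and close on the same line are captured.
        field = FIELD.search(line)
        if field:
            page[COLUMN[field.group(1)]] = html.unescape(field.group(2))


def load_pages(conn, lines, batch_size=1000):
    """Insert every page record into a hypothetical `pages` table in batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages "
                 "(url TEXT, title TEXT, description TEXT, topic TEXT)")
    insert = "INSERT INTO pages VALUES (?, ?, ?, ?)"
    batch, total = [], 0
    for page in iter_pages(lines):
        batch.append((page["url"], page["title"],
                      page["description"], page["topic"]))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
        total += len(batch)
    conn.execute("CREATE INDEX IF NOT EXISTS pages_topic ON pages (topic)")
    conn.commit()
    return total
```

To run it on the real dump, open the file with `open(path, encoding="utf-8", errors="replace")` and pass the file object as `lines`; the `errors="replace"` part is there because the dump may contain a few bytes that are not valid UTF-8.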

If you want to pull out chunks from the RDF (say one or two subcategories) there's a Perl module called DMOZ::parseRDF (available at CPAN) that is *really* fast. It pulled out the entire music section in seconds. If you then feed that to the RDF parser from Catalog, you can pretty quickly get a file that has just the URLs.
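If you only need the URLs for one branch and don't want to install the Perl modules, the same kind of extraction can be sketched in a few lines of Python. This is only an illustration of the streaming idea, not what DMOZ::parseRDF does internally. It assumes each `<ExternalPage about="...">` record carries a `<topic>Top/...</topic>` line naming its category; the function name `pages_in_category` is made up for this example.

```python
import html
import re

# Assumed layout: the page URL is in the `about` attribute and the category
# path is in a lowercase <topic> element inside the same record.
PAGE_OPEN = re.compile(r'<ExternalPage about="([^"]*)"')
TOPIC = re.compile(r'<topic>([^<]*)</topic>')


def pages_in_category(lines, prefix):
    """Yield (topic, url) for pages filed under `prefix` or any subcategory.

    `lines` can be any iterable of strings, for example an open file, so the
    dump is processed one line at a time and never loaded whole.
    """
    url = None
    for line in lines:
        opened = PAGE_OPEN.search(line)
        if opened:
            # Attribute values are XML-escaped (&amp; etc.), so unescape them.
            url = html.unescape(opened.group(1))
            continue
        found = TOPIC.search(line)
        if found and url is not None:
            topic = found.group(1)
            # Match the category itself or anything below it, but not a
            # sibling that merely shares the prefix (Music vs. Musicals).
            if topic == prefix or topic.startswith(prefix + "/"):
                yield topic, url
            url = None
```

For example, `pages_in_category(open(path, encoding="utf-8", errors="replace"), "Top/Arts/Music")` would walk the whole dump once and hand back only the music listings, which you can then write to a file or feed to a downloader.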
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
(You did suggest those links to the above-mentioned category, didn't you? ;-) )
 

bitingduck

Member
Joined
Mar 19, 2005
Messages
8
I think I suggested linking Catalog to that category some time ago (it's listed somewhere else because it's actually a more general tree-like catalog tool). I'll submit the DMOZ::parse though...
 

giz

Member
Joined
May 26, 2002
Messages
3,112
Yes. Please submit any such sites to the relevant categories. :)


The submit function is usually of far more use to us when surfers submit sites that they have found to be useful, rather than when webmasters submit sites that they have made themselves.
 