RDF Content Format

rj365 · September 8, 2005

Hi!

I looked at the sample content file (content.example) available on the ODP site. From the point of view of extracting hyperlinks, the format appears to repeat each link exactly twice: once in the <link r:resource="..."> tag under <Topic>/<catid>/<link>, and once in the <ExternalPage> tag.

Can one safely assume that this repetition holds for the entire real dump, and therefore ignore the links in one of the two places?

Thanks!

Rahul.

rrshwrk · October 19, 2005

Hi,

I am new to ODP, hence this question. I found this question in one of the threads but couldn't find an answer relevant to me. I do not want to store the information in a database, as I have to index it using Lucene.

Here is the problem:

I need documents along with category information, which is why I am using ODP. Now I need to download the actual documents (the content) along with their category information. After downloading, these will be passed to Lucene, which will index them for use by my system.

Please help me with how I should do it.

Thank you.

addy · November 1, 2005

You can use Nutch, which allows injection of dmoz-like dumps and subsequent fetching of the links. Anyway, it's very easy to do what you need yourself.

You just have to parse the RDF dump, filtering only the category URLs, and build the subhierarchy of the category you are interested in, using its prefix to find the subcategories and their catids. At that point you have a table with two fields: one for the subcategory and one for its catid. Then you filter content.rdf.u8 to extract only the links with those catids. After that you have a table with at least two fields: one for the link and one for its catid. Then you can write a shell script that uses a program like wget to download the content of each link, saving it with its catid as a filename prefix. Once you have all your files, you can index them with Lucene and keep the category information as a keyword alongside the indexed content of the page (stored as a Lucene document field).
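
A rough Python sketch of the filtering step, in case it helps. It is not tested against the real dump. It assumes each <Topic> block in the content file carries both the <catid> and its <link r:resource="..."> entries, as the sample file seems to suggest, so the (catid, link) table can be built in one pass. The SAMPLE data, its namespace URIs, and the function names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the shape of content.example (an assumption).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Arts">
    <catid>2</catid>
    <link r:resource="http://example.com/arts"/>
  </Topic>
  <ExternalPage about="http://example.com/arts">
    <topic>Top/Arts</topic>
  </ExternalPage>
  <Topic r:id="Top/Arts/Music">
    <catid>7</catid>
    <link r:resource="http://example.com/music"/>
  </Topic>
  <Topic r:id="Top/Artsy">
    <catid>9</catid>
    <link r:resource="http://example.com/other"/>
  </Topic>
</RDF>"""


def local(name):
    """Strip a '{namespace}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def attr(elem, wanted):
    """Return the attribute whose local name is `wanted`, ignoring namespaces."""
    for key, value in elem.attrib.items():
        if local(key) == wanted:
            return value
    return None


def links_by_catid(source, prefix):
    """Yield (catid, url) for each link in `prefix` and its subcategories."""
    # iterparse streams the file, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "Topic":
            topic_id = attr(elem, "id") or ""
            # Match the category itself or anything below it, but not
            # siblings that merely share the prefix (e.g. Top/Artsy).
            if topic_id == prefix or topic_id.startswith(prefix + "/"):
                catid, urls = None, []
                for child in elem:
                    child_name = local(child.tag)
                    if child_name == "catid":
                        catid = (child.text or "").strip()
                    elif child_name == "link":
                        url = attr(child, "resource")
                        if url:
                            urls.append(url)
                for url in urls:
                    yield catid, url
            elem.clear()  # drop the processed block's children
        elif name == "ExternalPage":
            elem.clear()  # not needed here; clear it to limit memory use


if __name__ == "__main__":
    for catid, url in links_by_catid(io.BytesIO(SAMPLE), "Top/Arts"):
        print(f"{catid}\t{url}")
```

For the real dump you would pass an opened content.rdf.u8 file instead of the in-memory sample. The tab-separated output could then be fed to a shell script that calls wget on each link and saves the result with the catid as a filename prefix.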
