Posted

Hi!

 

I looked at the pattern of the sample content file (content.example) available on the ODP site. For the purpose of extracting the hyperlinks, the format appears to repeat each link exactly twice: once in the r:resource attribute of a <link> tag inside the <Topic> block (alongside <catid>), and once in the <ExternalPage> tag.

 

Can one safely assume that this repetition holds for the entire real dump, and therefore ignore the links in one of the two places?

 

Thanks!

Rahul.

  • 1 month later...
Posted

Hi,

 

I am new to ODP, hence this question. I found a similar question in one of the threads but couldn't find an answer relevant to me. I do not want to store the information in a database, as I have to index it using Lucene.

 

Here is the problem:

I need documents along with category information, which is why I am using ODP. I now need to download the actual documents (the content) along with their category information. After downloading, these will be passed to Lucene, which will index them for use by my system.

 

Please help me with how I should do this.

Thank you.

  • 2 weeks later...
Posted

You can use Nutch, which can inject DMOZ-style dumps and then fetch the links. That said, it is easy to do what you need yourself. Parse the RDF dump, keeping only the category URLs, and build the subhierarchy of the category you are interested in by using its prefix to find the subcategories and their catids. At that point you have a table with two fields: one for the subcategory and one for its catid.

Then filter content.rdf.u8 to extract only the links with those catids. After that you have a table with at least two fields: one for the link and one for its catid. Next, write a shell script that uses a program like wget to download the content of each link, and save each file with its catid as a prefix. Once you have all the files, you can index them with Lucene and keep the category information as a keyword alongside the indexed page content (stored as a Lucene document field).
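
The filtering step could be sketched roughly like this in Python. This is only a minimal sketch under some assumptions: it assumes the dump keeps one tag per line as in the sample file, that each <Topic> block lists its <catid> before its <link> tags, and it reads the links from the <Topic> blocks only. The function name links_for_category is made up for illustration.

```python
import html
import re
from typing import Iterable, Iterator, Tuple

# Assumed line layout, based on the sample content file:
#   <Topic r:id="Top/Arts/Animation">
#     <catid>423945</catid>
#     <link r:resource="http://www.awn.com/"/>
#   </Topic>
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)"')
CATID_RE = re.compile(r'<catid>(\d+)</catid>')
LINK_RE = re.compile(r'<link r:resource="([^"]*)"')


def links_for_category(lines: Iterable[str], prefix: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (catid, topic, url) for links in the category `prefix` and its subcategories."""
    topic = None   # current topic path, or None when outside a wanted <Topic> block
    catid = None   # catid of the current topic, once seen
    for line in lines:
        m = TOPIC_RE.search(line)
        if m:
            name = m.group(1)
            # Match the category itself or anything below it, but not "Top/Artsy".
            wanted = name == prefix or name.startswith(prefix + "/")
            topic = name if wanted else None
            catid = None
            continue
        if topic is None:
            continue  # skip <ExternalPage> blocks and unwanted topics
        if "</Topic>" in line:
            topic = None
            continue
        m = CATID_RE.search(line)
        if m:
            catid = m.group(1)
            continue
        m = LINK_RE.search(line)
        if m and catid is not None:
            # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
            yield catid, topic, html.unescape(m.group(1))
```

To run it on the real dump, something like `open("content.rdf.u8", encoding="utf-8", errors="replace")` can be passed as `lines`, since a file object iterates line by line and the whole dump never has to fit in memory. Each (catid, url) pair can then be written out to a text file for the wget script, or downloaded directly and saved under a name that starts with the catid.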
