rj365 Posted September 8, 2005
Hi! I looked at the sample content file (content.example) available on the ODP site. From the point of view of extracting the hyperlinks, it appears that each link occurs exactly twice in the file: once in the r:resource attribute of the <link> tag under <Topic>/<catid>/<link>, and once in the <ExternalPage> tag. Can one safely assume that this repetition holds for the entire real dump, and therefore ignore the links in one of the two places? Thanks! Rahul.
rrshwrk Posted October 19, 2005
Hi, I am new to ODP, hence this question. I found a similar question in one of the threads but couldn't find an answer relevant to me. I do not want to store the information in a database, as I have to index it using Lucene. Here is the problem: I need documents along with their category information, which is why I am using ODP. I need to download the actual documents (the content) together with their category information. After downloading, these will be passed to Lucene, which will index them for use by my system. Please help me with how I should do this. Thank you.
addy Posted November 1, 2005
You can use Nutch, which allows injection of dmoz-like dumps and subsequent fetching of the links. Anyway, it's very easy to do what you need yourself. You just have to parse the RDF dump, keeping only the category URLs, and build the subhierarchy of the category you are interested in by using its prefix to find its subcategories and their catids. At that point you have a table with two fields: one for the subcategory and one for its catid. Then you filter content.rdf.u8 to extract only the links with those catids. After that you have a table with at least two fields: one for the link and one for its catid. Then you can write a shell script that uses a program like wget to download the content of each link and saves it with its catid as a filename prefix. Once you have all your files, you can index them with Lucene and keep the category information as a keyword alongside the indexed content of the page (stored as a Lucene document field).
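The filtering steps described above could be sketched roughly as follows in Python. This is only an illustration under some assumptions: it takes the layout from the first post at face value (a <Topic r:id="..."> block containing a <catid> and one or more <link r:resource="..."/> tags, each tag on its own line), and it simplifies the two-table approach by matching the category prefix directly against the Topic's r:id instead of first building a separate subcategory table. The names links_for_category, output_name, and fetch, the sample data, and the "catid_n.html" naming scheme are all made up for this example.

```python
import re
import urllib.request

# Assumed tag shapes, based on the sample layout described in this thread.
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)">')
CATID_RE = re.compile(r'<catid>(\d+)</catid>')
LINK_RE = re.compile(r'<link r:resource="([^"]*)"\s*/>')


def links_for_category(lines, prefix):
    """Yield (catid, url) for links in Topics at or below the given category."""
    topic = None
    catid = None
    for line in lines:
        m = TOPIC_RE.search(line)
        if m:
            topic, catid = m.group(1), None
            continue
        if "</Topic>" in line:
            topic = catid = None
            continue
        # Skip everything outside a Topic of interest (including ExternalPage).
        if topic is None:
            continue
        if topic != prefix and not topic.startswith(prefix + "/"):
            continue
        m = CATID_RE.search(line)
        if m:
            catid = m.group(1)
            continue
        m = LINK_RE.search(line)
        if m and catid is not None:
            yield catid, m.group(1)


def output_name(catid, n):
    """Hypothetical naming scheme: the catid is used as the filename prefix."""
    return f"{catid}_{n}.html"


def fetch(url, path):
    """Download one page to disk (the role wget plays in the shell-script version)."""
    urllib.request.urlretrieve(url, path)


# Tiny made-up sample in the assumed layout; not taken from the real dump.
SAMPLE = """<Topic r:id="Top/Arts">
  <catid>2</catid>
  <link r:resource="http://www.example.com/a"/>
</Topic>
<ExternalPage about="http://www.example.com/a">
  <topic>Top/Arts</topic>
</ExternalPage>
<Topic r:id="Top/Arts/Music">
  <catid>3</catid>
  <link r:resource="http://www.example.com/b"/>
</Topic>
<Topic r:id="Top/Business">
  <catid>4</catid>
  <link r:resource="http://www.example.com/c"/>
</Topic>
"""

pairs = list(links_for_category(SAMPLE.splitlines(), "Top/Arts"))
names = [output_name(catid, n) for n, (catid, _url) in enumerate(pairs)]
```

For the real dump you would pass an open file object for content.rdf.u8 instead of SAMPLE.splitlines(), so the file is read line by line rather than loaded into memory, and then call fetch(url, name) for each pair. The catid in the filename can later be read back and stored as a keyword field when the page is indexed.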