Posted

Hi @LL,

 

I have tried parsing the example portion of the structure RDF file from dmoz (rdf.dmoz.org), but the parser reports "Fail to parse RDF" and "XML parsing failed".

 

I am hoping to parse the whole structure file correctly, but since it is large, I am so far using the example (a portion of structure.rdf.u8) provided on the website.

 

Has anyone managed to parse the structure example successfully?

 

Would you care to explain or share how you did it, please?

I have made some changes to the code, but it still fails. :mad:

 

thanks!

  • Meta
Posted

I usually try my tools on the Kids & Teens files. They are much smaller, so it's a lot easier to parse them. I never tried to parse the example though.

 

Please note that the RDF dump is not valid RDF code, because it was designed at a time when the RDF specification was not yet fully finished. So using a standard RDF parser might throw a lot of errors, but in general the files should be readable.

Curlie Meta/kMeta Editor windharp

 


Posted

In that case, what should I do? Is there any tool that helps to fix and debug it?

My project requires the structure and content dumps from dmoz. Once they are parsed, I have to load the data into a relational database.

 

Please help.

  • Meta
Posted

It is valid XML, so you can use XML tools.

 

Alternatively, you can process the files line by line by your own script (statefully, as you need to know what sort of item the line in question is part of). That's what I do for my own purposes.
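
For readers wondering what "statefully" means in practice, here is a minimal sketch in Java. It is not the poster's actual script. It assumes the structure file puts each tag on its own line, that categories open with a `<Topic r:id="...">` line, and that child links look like `<narrow r:resource="..."/>`; those tag and attribute names, and the class and method names (`LineByLineSketch`, `scan`, `attr`), are illustrative assumptions. The only state kept is which Topic the current line belongs to.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LineByLineSketch {

    /** Maps each Topic id to the link targets listed inside that Topic. */
    static Map<String, List<String>> scan(BufferedReader reader) throws IOException {
        Map<String, List<String>> children = new LinkedHashMap<>();
        String currentTopic = null; // the state: which Topic we are inside, if any
        String line;
        while ((line = reader.readLine()) != null) {
            String t = line.trim();
            if (t.startsWith("<Topic ")) {
                currentTopic = attr(t, "r:id");
                if (currentTopic != null) {
                    children.put(currentTopic, new ArrayList<>());
                }
            } else if (t.startsWith("</Topic>")) {
                currentTopic = null;
            } else if (currentTopic != null && t.startsWith("<narrow")) {
                // Prefix match, so any tag whose name begins with "narrow" counts.
                String target = attr(t, "r:resource");
                if (target != null) {
                    children.get(currentTopic).add(target);
                }
            }
        }
        return children;
    }

    /** Returns the raw value of name="..." on this line, or null if absent.
     *  XML entities such as &amp; are NOT decoded in this sketch. */
    static String attr(String line, String name) {
        String key = name + "=\"";
        int start = line.indexOf(key);
        if (start < 0) {
            return null;
        }
        start += key.length();
        int end = line.indexOf('"', start);
        return end < 0 ? null : line.substring(start, end);
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample in the assumed layout, not real dump data.
        String sample = "<Topic r:id=\"Top/Arts\">\n"
                + "  <narrow r:resource=\"Top/Arts/Movies\"/>\n"
                + "</Topic>\n";
        System.out.println(scan(new BufferedReader(new StringReader(sample))));
    }
}
```

For the real file you would wrap a `FileReader` (or an `InputStreamReader` with UTF-8) in the `BufferedReader`, so that only one line is held in memory at a time.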

Posted

Thanks for the input.

Which XML tools would you suggest?

Going through the files line by line would be the hard way. Are there any other alternatives? The RDF dump has a lot of errors. Is there a newer, fixed RDF dump available now?

 

/H

Posted

I've had good luck with a SAX parser in Java. Don't try a DOM parser, the tree is too big.

 

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
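
The post gives only the imports, so here is a minimal sketch of how they might be put together. It is not the poster's actual code. It assumes categories appear as `<Topic r:id="...">` elements. That element and attribute name, the class and method names (`TopicSaxSketch`, `TopicHandler`, `parseTopicIds`), and the sample data are illustrative assumptions. The handler only records each Topic id as the parser streams past it, so no tree is ever built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class TopicSaxSketch {

    /** Collects the r:id attribute of every Topic element the parser reports. */
    static class TopicHandler extends DefaultHandler {
        final List<String> topicIds = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // The factory below is not namespace-aware (the JAXP default),
            // so qName carries the raw tag name and "r:id" is a plain
            // attribute name.
            if ("Topic".equals(qName)) {
                String id = atts.getValue("r:id");
                if (id != null) {
                    topicIds.add(id);
                }
            }
        }
    }

    /** Streams the input through a SAX parser and returns the Topic ids in order. */
    static List<String> parseTopicIds(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TopicHandler handler = new TopicHandler();
        parser.parse(in, handler);
        return handler.topicIds;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample in the assumed layout, not real dump data.
        String sample = "<RDF xmlns:r=\"http://www.w3.org/TR/RDF/\">"
                + "<Topic r:id=\"Top/Arts\"><catid>2</catid></Topic>"
                + "</RDF>";
        InputStream in = new ByteArrayInputStream(
                sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(parseTopicIds(in));
    }
}
```

For the real dump you would pass a `FileInputStream` instead of the in-memory sample, and you would probably write rows out as you go rather than collecting everything in a list.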

Posted
The RDF dump has a lot of errors...

What do you mean by errors? If you're referring to it not being valid RDF, then this is a known feature. The ODP format was decided before the RDF spec was finalised. (The issue is more that it's erroneously referred to as an "RDF dump", when it should really be described as an "XML Data Dump".) If you are referring to other problems, we would like to know about them.

 

As for tools to use, have you tried looking at http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/ ?
