heroine Posted April 1, 2007 Hi all, I have tried parsing the example portion of the structure RDF file from dmoz (rdf.dmoz.org), but the parser reports "Fail to parse RDF" and "XML parsing failed". I want to parse the whole structure file eventually, but since it is large, I am so far using only the example (a portion of structure.rdf.u8) provided on the website. Has anyone managed to parse the structure example successfully? Would you care to explain or share how? I have made some changes to the code, but it still fails. Thanks!
Meta windharp Posted April 2, 2007 I usually try my tools on the Kids & Teens files. They are much smaller, so they are a lot easier to parse. I have never tried to parse the example, though. Please note that the RDF dump is not valid RDF, because the format was designed at a time when the RDF specification was not yet finished. A standard RDF parser might therefore throw a lot of errors, but in general the files should be readable. Curlie Meta/kMeta Editor windharp
heroine Posted April 2, 2007 Author In that case, what should I do? Is there any tool that helps to fix and debug it? My project requires the structure and content dumps from dmoz. Once parsing works, I have to load the data into a relational database. Please help.
Meta tschild Posted April 2, 2007 It is valid XML, so you can use XML tools. Alternatively, you can process the files line by line with your own script (statefully, as you need to know what sort of item the line in question is part of). That's what I do for my own purposes.
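As a rough illustration of the stateful line-by-line idea (not tschild's actual script, which is not shown in the thread), here is a minimal Java sketch. It assumes the layout used in the published dump example: each tag on its own line, topics opened by `<Topic r:id="...">` and carrying a `<catid>` child. Those tag names and the class `OdpLineScanSketch` are assumptions for illustration; check them against the file you actually have.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans a dump line by line and remembers which
// <Topic> block the current line belongs to.
class OdpLineScanSketch {

    // Assumed tag layout; adjust if your dump differs.
    private static final Pattern TOPIC_OPEN = Pattern.compile("<Topic r:id=\"([^\"]*)\"");
    private static final Pattern CATID = Pattern.compile("<catid>(\\d+)</catid>");

    /** Returns one "catid TAB topic-path" row per topic that has a catid. */
    static List<String> scan(Reader source) throws IOException {
        List<String> rows = new ArrayList<>();
        String currentTopic = null; // state: the topic we are inside, or null
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher open = TOPIC_OPEN.matcher(line);
            if (open.find()) {
                currentTopic = open.group(1); // entering a topic block
                continue;
            }
            if (line.contains("</Topic>")) {
                currentTopic = null; // leaving the topic block
                continue;
            }
            Matcher catid = CATID.matcher(line);
            if (currentTopic != null && catid.find()) {
                rows.add(catid.group(1) + "\t" + currentTopic);
            }
        }
        return rows;
    }
}
```

Rows in this shape could be bulk-loaded into a database table. Note that this approach reads raw text, so XML entities such as `&amp;` are not decoded; a real XML parser handles that for you.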
heroine Posted April 5, 2007 Author Thanks for the input. Which XML tools would you suggest, then? I could take the hard way of going through the file line by line, but are there any other alternatives? The RDF dump has a lot of errors. Is there a new, fixed RDF dump available now? /H
brmehlman Posted April 5, 2007 I've had good luck with a SAX parser in Java. Don't try a DOM parser; the tree is too big.
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
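The post gives only the imports, so the following is a guess at how they might be used, not brmehlman's code. It is a minimal sketch of a streaming SAX handler that collects topic titles without building a tree. The element and attribute names (`Topic`, `r:id`, `d:Title`) are assumed from the published dump example, and `OdpSaxSketch`, `TopicHandler`, and `parseTopics` are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class OdpSaxSketch {

    /** Collects topic id -> title while streaming through the document. */
    static class TopicHandler extends DefaultHandler {
        final Map<String, String> titles = new LinkedHashMap<>();
        private String currentTopic; // r:id of the enclosing <Topic>, or null
        private StringBuilder text;  // non-null only while inside <d:Title>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("Topic".equals(qName)) {
                currentTopic = atts.getValue("r:id");
            } else if ("d:Title".equals(qName) && currentTopic != null) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append each chunk.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("d:Title".equals(qName) && text != null) {
                titles.put(currentTopic, text.toString());
                text = null;
            } else if ("Topic".equals(qName)) {
                currentTopic = null;
            }
        }
    }

    static Map<String, String> parseTopics(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        // The default factory is not namespace-aware, so qName holds the
        // prefixed name exactly as written in the file (e.g. "d:Title").
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TopicHandler handler = new TopicHandler();
        parser.parse(in, handler);
        return handler.titles;
    }
}
```

Calling `OdpSaxSketch.parseTopics(new FileInputStream("structure.rdf.u8"))` would be one way to try it on the real file. For the full dump you would probably write each topic to the database inside `endElement` rather than keep everything in a map.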
chaos127 Posted April 5, 2007 "The RDF dump has a lot of errors..." What do you mean by errors? If you're referring to it not being valid RDF, then this is a known feature. The ODP format was decided before the RDF spec was finalised. (The issue is more that it's erroneously referred to as an "RDF dump", when it should really be described as an "XML Data Dump".) If you are referring to other problems, we would like to know about them. As for tools to use, have you tried looking at http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/ ?