dermotz Posted March 18, 2004 What kinds of errors in the RDF dump need to be filtered out so that all the data can be read without losing any? Will a Jena parser be able to read everything without filtering? What would be the fastest approach to retrieve only a sub-branch of the whole dmoz dump? Does anyone happen to know a quick-and-dirty grep pattern that filters out all data NOT under a certain directory, e.g. Top/Science/blabla, so that only a specific branch remains? Thanks!
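A plain grep keeps only the matching lines, not the whole record around them, so a branch filter probably has to work block by block. The following is a minimal sketch, not a tested tool. It assumes the dump is line-oriented, that each record is a `<Topic r:id="...">` or `<ExternalPage ...>` block closed on its own line, and that an external page names its category in a `<topic>` element. The function name `filter_branch` is made up for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed record layout: blocks open with <Topic ...> or <ExternalPage ...>
# and close with </Topic> or </ExternalPage> at the start of a line.
BLOCK_START = re.compile(r'^\s*<(Topic|ExternalPage)\b')
BLOCK_END = re.compile(r'^\s*</(Topic|ExternalPage)>')


def filter_branch(lines: Iterable[str], branch: str) -> Iterator[str]:
    """Yield only the Topic/ExternalPage blocks that belong to `branch`."""
    # Match the branch itself or anything below it, but not a sibling
    # whose name merely starts the same way (e.g. Top/Sciences).
    in_branch = re.compile(
        r'(?:r:id="|<topic>)' + re.escape(branch) + r'(?:/|"|<)'
    )
    block = []
    inside = False
    for line in lines:
        if not inside and BLOCK_START.match(line):
            inside = True
            block = []
        if inside:
            block.append(line)
            if BLOCK_END.match(line):
                # Emit the buffered block only if some line ties it to the branch.
                if any(in_branch.search(item) for item in block):
                    yield from block
                inside = False
                block = []
```

Lines outside any block (the XML declaration and the root element) are dropped, so the output is a fragment rather than a complete RDF document; the root element would have to be re-added before handing the result to a parser. Reading the file with `open(path, encoding="utf-8", errors="replace")` keeps the loop from stopping on bad bytes.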
giz Posted March 19, 2004 One of the editors puts an error list on his website each week when the new RDF is published. There are also some notes about known issues with the dump, and some ideas for future features. A Google search should find it.
bobrat Posted March 19, 2004 Try http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=deskbar&q=rdf+dump+errors The URL starting rainwaterreptileranch.org would be a good start, I think, but I have not used anything from those sites. I do some playing with RDF dumps, but I just hack around the errors as I encounter them.
giz Posted March 20, 2004 No RDF dump this week (due to a minor technical problem with one category). By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data. It might take a few weeks for all of the oddities to be found and expunged.
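Until the dump itself is corrected, the illegal sequences can be worked around on the reading side. The sketch below rests on an assumption: that "illegal sequences" means bytes that are not valid UTF-8 plus control characters that XML 1.0 does not allow. The thread does not list the actual error types, and the helper name `clean_line` is made up for illustration.

```python
import re

# Characters that XML 1.0 forbids: C0 controls other than tab, LF and CR,
# plus the noncharacters U+FFFE and U+FFFF.
ILLEGAL_XML = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]')


def clean_line(raw: bytes) -> str:
    """Decode one raw line of the dump and drop characters XML 1.0 rejects."""
    # Invalid UTF-8 bytes become U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    return ILLEGAL_XML.sub("", text)
```

Applied line by line to a file opened in binary mode (`open(path, "rb")`), this marks the damaged spots with a replacement character instead of deleting them silently, so no record disappears. Whether Jena then reads the cleaned file without complaint has not been checked here.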