What kinds of errors in the RDF dump need to be filtered out in order to parse all of the data without losing any?
Will a Jena parser be able to read everything without filtering?
What would be the fastest approach to retrieve only a sub-branch of the whole dmoz dump?
Does anyone happen to know a quick-and-dirty grep pattern to filter out all data that does NOT belong to a certain directory, e.g. Top/Science/blabla, so that only a specific branch remains?
Thanks!