Processing the RDF dump

dermotz

Member
Joined
Mar 18, 2004
Messages
112
What kinds of errors in the RDF dump need to be filtered out in order to get all the data without losing any?

Will a Jena parser be able to read everything without filtering?

What would be the fastest approach to retrieving only a sub-branch of the whole dmoz dump?

Does anyone happen to know a quick-and-dirty grep pattern that filters out all data that does NOT belong to a certain directory, e.g. Top/Science/blabla, so that only a specific branch is left?

Thanks!
 

giz

Member
Joined
May 26, 2002
Messages
3,112
One of the editors puts an error list on his website each week when the new RDF is published. There are also some notes about known issues with the dump, and some ideas for future features. A Google search should find it.
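
On the sub-branch question: a single grep pattern tends to fall short because each record spans several lines, so a block-wise filter may be easier. The following is only a rough sketch in Python, not a tested tool for the real dump. It assumes (this is not confirmed anywhere in this thread) that every record opens with a <Topic r:id="..."> or <ExternalPage about="..."> line, ends with the matching closing tag, and that pages carry their category in a <topic>...</topic> line. The name filter_branch is made up for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed record layout (not confirmed by this thread): a record starts at a
# <Topic ...> or <ExternalPage ...> line and ends at the matching closing tag.
OPEN = re.compile(r'^\s*<(Topic|ExternalPage)\b')
CLOSE = re.compile(r'</(Topic|ExternalPage)>')


def filter_branch(lines: Iterable[str], branch: str) -> Iterator[str]:
    """Yield only the records that belong to `branch` or a category below it."""
    # The trailing (/|"|<) stops "Top/Science" from also matching "Top/Sciences".
    wanted = re.compile(r'(r:id="|<topic>)' + re.escape(branch) + r'(/|"|<)')
    block = []
    for line in lines:
        if block or OPEN.match(line):
            block.append(line)
            if CLOSE.search(line):
                if any(wanted.search(item) for item in block):
                    yield from block
                block = []
        # Lines outside any record (XML header, RDF wrapper) are dropped,
        # so the output is a list of records, not a complete RDF document.
```

It works on any iterable of lines, so a file opened with errors="replace" can be passed in directly, which keeps it going even where the dump contains bytes that are not valid UTF-8. Because it never parses the XML, it does not care whether the dump is well-formed.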
 

giz

Member
Joined
May 26, 2002
Messages
3,112
No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

Might take a few weeks for all of the oddities to be found and expunged.
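
Until then, a pre-filter along these lines might help a strict parser get through the file. It is only a sketch, and it assumes that "illegal sequences" means bytes that are not valid UTF-8 plus characters outside the range XML 1.0 allows; the actual problems in a given dump may differ. The function name clean_line is made up for illustration.

```python
import re

# Characters allowed by XML 1.0: tab, LF, CR, U+0020-U+D7FF,
# U+E000-U+FFFD and U+10000-U+10FFFF. Everything else gets removed.
_ILLEGAL_XML = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]'
)


def clean_line(raw: bytes) -> str:
    """Decode one raw line and strip characters an XML parser would reject."""
    # Invalid UTF-8 bytes become U+FFFD (the replacement character), which is
    # legal XML, so the damage stays visible instead of silently vanishing.
    text = raw.decode("utf-8", errors="replace")
    return _ILLEGAL_XML.sub("", text)
```

Reading the dump in binary mode and passing each line through clean_line before writing it back out should give a file with no character-level errors. Structural problems such as unbalanced tags would still need separate handling.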
 