Processing the RDF dump

dermotz · Mar 18, 2004

What kinds of errors in the RDF dump need to be filtered out so that all of the data can be read without losing any?

Will a Jena parser be able to read everything without filtering?

What would be the fastest approach to retrieve only a sub-branch of the whole dmoz dump?

Does anyone happen to know a quick-and-dirty grep pattern that filters out all data that does NOT belong to a certain directory, e.g. Top/Science/blabla, so that only a specific branch remains?

Thanks!

giz · Mar 18, 2004

One of the editors puts an error list on his website each week when the new RDF is published. There are also some notes about known issues with the dump, and some ideas for future features. A Google search should find it.

bobrat · Mar 18, 2004

Try http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=deskbar&q=rdf+dump+errors

The URL starting with rainwaterreptileranch.org would be a good start, I think, but I have not used anything from those sites. I do some playing with RDF dumps, but I just hack around the errors as I encounter them.
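For anyone wanting a starting point for that kind of hacking around, here is a rough Python sketch. It assumes the errors are of two kinds, bytes that are not valid UTF-8 and characters that XML 1.0 does not allow; the thread does not confirm that this covers everything in the dump. The names clean_line and clean_file are made up for this example. Bad bytes are replaced rather than repaired, so this is lossy for exactly those bytes.

```python
import re

# Anything outside the XML 1.0 "Char" production (mostly control characters).
_ILLEGAL_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def clean_line(raw: bytes) -> str:
    """Decode one line of the dump, tolerating bad bytes (hypothetical helper)."""
    # Invalid UTF-8 becomes U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # Drop characters that an XML 1.0 parser would reject.
    return _ILLEGAL_XML.sub("", text)

def clean_file(src_path: str, dst_path: str) -> None:
    """Write a cleaned copy of src_path to dst_path, one line at a time."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "rb") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            dst.write(clean_line(raw))
```

Working line by line keeps memory use flat, which matters for a file the size of the full dump.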

giz · Mar 19, 2004

No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

Might take a few weeks for all of the oddities to be found and expunged.
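On the sub-branch question from the first post: a plain grep is awkward because each record spans several lines, so a small script that buffers one record at a time may be easier. The Python sketch below assumes a layout that nobody in this thread has confirmed: every record starts on a line with <Topic r:id="..."> or <ExternalPage about="..."> and ends with the matching closing tag, and an ExternalPage names its category in a <topic> element. The function names are invented for this example.

```python
import re
from typing import Iterable, Iterator

# Assumed record layout, not confirmed in this thread (see the note above).
_START = re.compile(r"<(Topic|ExternalPage)\b")
_END = re.compile(r"</(Topic|ExternalPage)>")
_CATEGORY = re.compile(r'<Topic r:id="([^"]*)"|<topic>([^<]*)</topic>')

def in_branch(category: str, branch: str) -> bool:
    """True for the branch itself and anything below it."""
    # Appending "/" stops Top/Science from matching Top/Science_Fiction.
    return category == branch or category.startswith(branch + "/")

def filter_branch(lines: Iterable[str], branch: str) -> Iterator[str]:
    """Yield only the records that belong to the given branch."""
    record = []      # lines of the record currently being buffered
    inside = False   # are we between a start tag and its closing tag?
    for line in lines:
        if not inside:
            if not _START.search(line):
                yield line   # header, footer and other non-record lines
                continue
            inside, record = True, []
        record.append(line)
        if _END.search(line):
            inside = False
            found = _CATEGORY.findall("".join(record))
            if any(in_branch(a or b, branch) for a, b in found):
                yield from record
```

Something like `open(path, encoding="utf-8", errors="replace")` can be passed in as `lines`, with the result written out line by line. Lines outside any record are passed through unchanged, so the enclosing root element survives if the layout assumption holds.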
