Processing the RDF dump

dermotz · March 18, 2004

What kinds of errors in the RDF dump need to be filtered out so that all of the data can be read without losing any?

Will a Jena parser be able to read everything without filtering?

What would be the fastest approach to retrieve only a sub-branch of the whole dmoz dump?

Does anyone happen to know a quick-and-dirty grep pattern that filters out all data NOT under a certain directory, e.g. Top/Science/blabla, so that only a specific branch remains?

Thanks!

giz · March 19, 2004

One of the editors puts an error list on his website each week when the new RDF is published. There are also some notes about known issues with the dump, and some ideas for future features. A Google search should find it.

bobrat · March 19, 2004

Try http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=deskbar&q=rdf+dump+errors

The URL starting rainwaterreptileranch.org would be a good start, I think, though I have not used anything from those sites. I do some playing with RDF dumps, but just hack around the errors as I encounter them.
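On the sub-branch question: a plain grep works one line at a time, so it tends to cut records apart. Below is a rough Python sketch of a line-based filter that keeps whole records instead. It assumes the dump stores each record as a `<Topic r:id="Top/...">` or `<ExternalPage about="...">` block, that each opening and closing tag sits on its own line, and that external pages carry a `<topic>Top/...</topic>` line. Check those assumptions against your copy of the dump. The function names `filter_branch` and `_in_branch` are made up for this example.

```python
def _in_branch(line, prefix):
    """True if the line names a category equal to or below `prefix`."""
    for marker in ('r:id="', "<topic>"):
        start = line.find(marker + prefix)
        if start != -1:
            rest = line[start + len(marker) + len(prefix):]
            # Require a boundary so Top/Science does not match Top/Science_Fiction.
            if rest[:1] in ('"', "/", "<"):
                return True
    return False


def filter_branch(lines, prefix):
    """Yield the lines of every <Topic>/<ExternalPage> block under `prefix`.

    Assumed layout (not verified against a real dump): opening and closing
    tags each start their own line. Lines outside such blocks are dropped.
    """
    block, keep, inside = [], False, False
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(("<Topic ", "<ExternalPage ")):
            inside, keep, block = True, False, []
        if inside:
            block.append(line)
            if _in_branch(line, prefix):
                keep = True
            if stripped.startswith(("</Topic>", "</ExternalPage>")):
                if keep:
                    yield from block
                inside = False
```

To try it, open the dump with `open(path, encoding="utf-8", errors="replace")` and write the yielded lines to a new file. The output contains only the matching blocks, without the XML header or root element, so wrap it again before handing it to a real RDF parser.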

giz · March 20, 2004

No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

It might take a few weeks for all of the oddities to be found and expunged.
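Until then, one way to hack around the illegal sequences yourself is to clean each line before it reaches a parser. The sketch below assumes "illegal sequences" means invalid UTF-8 bytes plus characters that XML 1.0 does not allow (most control characters). That is a guess about the dump's problems, not a confirmed list. `clean_line` is a made-up name.

```python
import re

# Everything outside the character ranges that XML 1.0 permits.
_ILLEGAL_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_line(raw: bytes) -> str:
    """Decode one raw line and drop characters an XML parser would reject."""
    # Invalid UTF-8 bytes become U+FFFD instead of raising an error.
    text = raw.decode("utf-8", errors="replace")
    return _ILLEGAL_XML.sub("", text)
```

Read the dump in binary mode (`open(path, "rb")`) and pass each line through `clean_line`. Replacing bad bytes with U+FFFD alters the affected titles and descriptions, so this keeps every record but not every character.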
