Posted

What kinds of errors are in the RDF dump, and what has to be filtered out so that all the data can be read without losing any?

Will a Jena parser be able to read everything without filtering?

What would be the fastest approach to retrieve only a sub-branch of the whole dmoz dump?

Does anyone happen to know a quick-and-dirty grep pattern to filter out all data that does NOT belong to a certain directory, e.g. Top/Science/blabla, so that only a specific branch is left?

Thanks!

Posted
One of the editors puts an error list on his website each week when the new RDF is published. There are also some notes about known issues with the dump, and some ideas for future features. A Google search should find it.
Posted

No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.
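
As a rough sketch of what such a checking script might look like: the Python below assumes the "illegal sequences" are invalid UTF-8 bytes and characters outside the XML 1.0 character range, which the thread does not actually confirm. The function name `clean_line` is made up for illustration. It scrubs one line at a time, so it can sit in front of a parser such as Jena without loading the whole dump.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_line(raw: bytes) -> str:
    """Decode one line of the dump and drop characters XML 1.0 forbids.

    Invalid UTF-8 bytes are replaced with U+FFFD instead of raising,
    so no line is lost; only the offending bytes are affected.
    """
    text = raw.decode("utf-8", errors="replace")
    return _ILLEGAL_XML.sub("", text)
```

Reading the file in binary mode and calling `clean_line` on each line is safe because a newline byte never occurs inside a multi-byte UTF-8 sequence.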

Might take a few weeks for all of the oddities to be found and expunged.
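
On the original question about pulling out a single branch such as Top/Science: a plain grep works line by line, so it tends to cut entries in half. A small streaming filter is one alternative. The sketch below assumes the dump is laid out as `<Topic r:id="Top/...">` blocks and `<ExternalPage ...>` blocks that carry a `<topic>Top/...</topic>` line, with each opening and closing tag on its own line. Check that against your copy of the dump before relying on it. `filter_branch` is a hypothetical helper name, not part of any tool.

```python
import re
from typing import Iterable, Iterator

# Block-level elements assumed to make up the body of the dump.
BLOCK_TAGS = ("Topic", "ExternalPage")


def filter_branch(lines: Iterable[str], branch: str) -> Iterator[str]:
    """Yield only the blocks filed under `branch`, plus any non-block lines."""
    # Match the branch as a whole path prefix, so that "Top/Science"
    # does not also match "Top/Science_Fiction".
    wanted = re.compile(
        r'(?:<Topic r:id="|<topic>)' + re.escape(branch) + r'(?=[/"<])'
    )
    block = []       # lines of the block currently being buffered
    closing = None   # closing tag we are waiting for; None outside a block
    keep = False
    for line in lines:
        if closing is None:
            stripped = line.lstrip()
            tag = next(
                (t for t in BLOCK_TAGS if stripped.startswith("<" + t + " ")),
                None,
            )
            if tag is None:
                yield line  # header, footer, blank lines pass through
                continue
            closing = "</" + tag + ">"
            block, keep = [], False
        block.append(line)
        if wanted.search(line):
            keep = True
        if closing in line:
            if keep:
                yield from block
            closing = None
```

Only one block is held in memory at a time, so this should cope with the full dump. Passing the header and footer lines through keeps the output a complete document that can be handed to a parser afterwards.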
