fizyk Posted October 19, 2010

I wrote a Python program that parses the structure.rdf.u8 file. Surprisingly, Python hasn't crashed on any string operation, as it should have without an encoding being forced on the strings. Looking into it further, I discovered that there are no non-ASCII characters in that data dump at all, and that over 50 thousand records (out of over 760,000) that are supposed to contain Unicode characters (usually national letters or accents) have question marks instead. At first I thought my Python program had a problem converting the data, but then I looked into the structure.rdf.u8 data dump itself and found the question marks there as well.

Here are a few examples of exactly how the data looks:

3??3 Eyes; dmoz_uri: Top/Arts/Animation/Anime/Titles/3/3??3_Eyes; dmoz-id: 425470
DNA??; dmoz_uri: Top/Arts/Animation/Anime/Titles/D/DNA??; dmoz-id: 426485
Kamikaze Kait?? Jeanne; dmoz_uri: Top/Arts/Animation/Anime/Titles/K/Kamikaze_Kait??_Jeanne; dmoz-id: 426017

Notice the title and dmoz_uri for those categories. On the ODP site, the first category has an x-like character in place of those two question marks (see http://www.dmoz.org/Arts/Animation/Anime/Titles/3/3×3_Eyes/).

Since that data looks... well... terrible, I have a small question: is it possible to get hold of the Unicode-encoded data rather than the ASCII version?
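For anyone who wants to repeat the check described in the post above, here is a minimal sketch in Python 3. It is not the original program: the function names and the question-mark pattern are assumptions made for illustration, and the pattern is only a rough heuristic (a run of `?` followed by a letter or digit, or two or more `?` in a row), so it can miss single replaced characters and can flag genuine punctuation.

```python
import re

# Heuristic (an assumption, not part of the dump format): a run of '?'
# directly followed by a letter or digit, or two or more '?' in a row.
SUSPECT = re.compile(r"\?+\w|\?{2,}")


def has_non_ascii(data: bytes) -> bool:
    """Return True if any byte lies outside the 7-bit ASCII range."""
    return any(b > 0x7F for b in data)


def count_suspect_lines(lines) -> int:
    """Count the lines that match the question-mark heuristic."""
    return sum(1 for line in lines if SUSPECT.search(line))


def scan_dump(path: str):
    """Scan a dump file line by line.

    Returns (found_non_ascii, suspect_line_count).
    """
    found_non_ascii = False
    suspect = 0
    with open(path, "rb") as fh:
        for raw in fh:
            if has_non_ascii(raw):
                found_non_ascii = True
            # errors="replace" keeps the scan going even on invalid UTF-8.
            if SUSPECT.search(raw.decode("utf-8", errors="replace")):
                suspect += 1
    return found_non_ascii, suspect
```

Calling `scan_dump("structure.rdf.u8")` on a damaged dump would be expected to report no non-ASCII bytes together with a large suspect count; on a correctly encoded dump the first value should be `True`.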
Meta pvgool Posted October 19, 2010

There is a bug with the RDF; see also http://www.resource-zone.com/forum/index.php?showtopic=52989

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
RZ Admin photofox Posted October 20, 2010

The latest files, dated October 19th, should be free of the encoding errors.

Curlie Admin photofox