Illegal byte sequences, Invalid UTF-8

phasevar

Member
Joined
Mar 17, 2004
Messages
4
I'm attempting to parse the RDF dumps but I'm running into illegal UTF-8 byte sequences. Is anything being done to correct this? What workarounds are possible to get an XML parser to work with these files?

I read the Errata and saw the filtering change done on 2003-03-12, but apparently something is not working quite right today because even the link provided in that update reports an overwhelming number of UTF-8 errors.

http://rodan.ncc.com/rdf/


I'm using the Python XML libraries to parse the files and I can't keep it from throwing exceptions. Expat's test utilities throw errors as well.

Could someone who is currently parsing these files share a bit of wisdom?

Thanks!
-Dan
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
I don't know if this helps you, but the entire ODP has changed over to UTF-8 in the last month. The changes are still in progress, but most categories have been converted. Here and there, some sites and descriptions will have to be converted manually.
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
phasevar said:
Is anything being done to correct this?
Yes, we are working on this. Due to the amount of errors and limitations to staff time it will take some time till they are eliminated. No schedule yet, sorry to say.
 

phasevar

Member
Joined
Mar 17, 2004
Messages
4
windharp said:
Yes, we are working on this. Due to the amount of errors and limitations to staff time it will take some time till they are eliminated. No schedule yet, sorry to say.

Any hints for processing these files in the interim?
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
Sorry, I occasionally play around with RDF processing, but not recently.

I do notice that Google has copied the changes for one of my categories where I manually fixed up some UTF-8 characters, and that page is marked as UTF-8, but all the characters I changed do not display correctly. However, in an equivalent World category, they seem to have done it correctly. I only checked a few categories but that seems to be a general rule.

So they haven't figured it out either.

Roughly speaking, ODP has automatically converted all the World categories to UTF-8 [some glitches crept in and are still being fixed]. In other categories, which will occasionally have accents in titles and descriptions, most of the conversion is being done manually, and I think it may take a while.
 

dermotz

Member
Joined
Mar 18, 2004
Messages
112
DMOZ internal processing

I was just wondering.....how do you cope with your own errors yourself?

Does this mean you need to rely on your own software rather than using standard rdf classes ?


just curious...........
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
Sorry, what I meant was that most of the sites in ODP were converted automatically to UTF-8. However, for various reasons, mostly in the categories outside World, there are site descriptions that were not converted correctly. It turns out that in many cases the old descriptions contained invalid characters that looked OK: for example, a character that looked like a single apostrophe, but wasn't, now displays as a question mark and must be fixed. And a lot of the accents did not get converted and display as the wrong character. So this is what is in the latest RDF dump.

The invalid characters are being fixed by editors in the same way we would fix a spelling error. So when I said that I manually fixed up some UTF-8 characters, I meant that as an editor I fixed the errors in the directory [not in my RDF processing software]. The next RDF will pick up those changes, but it may take a while before all those manual changes show up, and until that is complete the RDF will contain a mixed-up set of characters. However, the fixup is considered a priority.
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
phasevar said:
Any hints for processing these files in the interim?
I assume that running every line of the file that contains an error through a UTF-8 encoder should fix it. The problem is simply that a lot of characters were not converted to UTF-8. Don't blame me if it doesn't work, though. There might be some other errors as well (I could imagine double-encoded characters, for example; you would have to decode them to remove the error. I don't know how to detect that, to be honest).
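As a rough illustration of that suggestion in Python (the language from the original question), the sketch below decodes each line as UTF-8 and only falls back to a legacy encoding when that fails. The function names are made up for this example, and the choice of Windows-1252 as the fallback is an assumption about what the unconverted text was, not something confirmed in this thread. The second function is one possible heuristic for the double-encoding case; it only covers text that was mis-read as Latin-1.

```python
def repair_line(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode one line, falling back to a legacy encoding if it is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: unconverted lines were Windows-1252 / Latin-1 text.
        # errors="replace" covers the few byte values cp1252 leaves undefined.
        return raw.decode(fallback, errors="replace")


def undo_double_encoding(text: str) -> str:
    """Reverse one round of UTF-8-read-as-Latin-1, if the text looks like that."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not (cleanly) double encoded, so leave the text untouched.
        return text


def repair_file(src_path: str, dst_path: str) -> None:
    """Rewrite src_path as valid UTF-8, one line at a time."""
    with open(src_path, "rb") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            dst.write(repair_line(raw))
```

The double-encoding check can misfire on text that legitimately contains sequences such as "Ã©", so it is probably safer to run it only on entries that have been inspected by eye.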
 

giz

Member
Joined
May 26, 2002
Messages
3,112
No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

Might take a few weeks for all of the oddities to be found and expunged.
 

giz

Member
Joined
May 26, 2002
Messages
3,112
We believe that site titles and descriptions are now all encoded in proper UTF-8, but the next RDF will be used to prove or disprove that guess.

There are still some category descriptions awaiting an edit to clear encoding errors in them, and a lot of work still to be done on the FAQs.

The last RDF on 2004-04-03 had 10% of the errors of the previous one. No RDF last week due to other issues. Another one is coming soon.
 

clevariant

Member
Joined
Apr 21, 2004
Messages
6
Byte-by-byte, Baby

I'm parsing the file in Java, and I had to extend FileReader and filter all characters over 255, like so:

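// Goes in a subclass of java.io.FileReader; needs java.io.IOException.
@Override
public int read() throws IOException
{
    int character = super.read();
    // Skip every character above 255; -1 (end of stream) passes through.
    while (character > 255)
        character = super.read();
    return character;
}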

I just passed this FileReader subclass to a SAX parser, and that did the trick.
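For the Python XML libraries mentioned in the first post, a loosely comparable approach is to sanitize the byte stream just before the parser sees it. The sketch below is only an illustration: `parse_lenient` and its parameters are hypothetical names, and unlike the Java filter above it drops invalid byte sequences rather than characters above 255.

```python
import codecs
import xml.parsers.expat


def parse_lenient(path, start_element, chunk_size=1 << 16):
    """Stream a file through expat, silently dropping invalid UTF-8 bytes."""
    # The incremental decoder buffers multi-byte sequences that are
    # split across chunk boundaries, so valid characters are not lost.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    parser = xml.parsers.expat.ParserCreate("UTF-8")
    parser.StartElementHandler = start_element
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            text = decoder.decode(chunk)
            if text:
                parser.Parse(text.encode("utf-8"), False)
    # Flush the decoder and tell expat the document is complete.
    tail = decoder.decode(b"", final=True)
    parser.Parse(tail.encode("utf-8"), True)
```

A call might look like `parse_lenient("content.rdf.u8", lambda name, attrs: print(name))`, where the file name is just a placeholder. Dropping bytes silently means damaged titles and descriptions lose characters, so this suits bulk processing better than anything that needs exact text.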
 

giz

Member
Joined
May 26, 2002
Messages
3,112
The latest content dump had only 56 errors in over 1 800 000 000 characters, and those have all been fixed by hand now. Looking for zero errors in that dump file next week.
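Anyone who wants to check a downloaded dump for themselves could count the invalid sequences with something like the Python sketch below. It is not the script used to produce the figures above; the function names are invented for this example, and it assumes the dump is line-oriented so that reading one line at a time keeps memory use small.

```python
def invalid_sequences(line: bytes):
    """Yield (start, end) byte offsets of each invalid UTF-8 sequence in line."""
    pos = 0
    while pos < len(line):
        try:
            line[pos:].decode("utf-8")
            return  # the rest of the line is valid
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice we decoded.
            yield pos + exc.start, pos + exc.end
            pos += exc.end


def count_errors(path: str) -> int:
    """Count invalid UTF-8 sequences in a file, reading it line by line."""
    total = 0
    with open(path, "rb") as fh:
        for raw in fh:
            total += sum(1 for _ in invalid_sequences(raw))
    return total
```

Splitting on newlines is safe here because a newline byte can never occur inside a valid multi-byte UTF-8 sequence.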

Still some complex work to do on the structure file, and that will probably take at least a few more weeks to get right.
 

giz

Member
Joined
May 26, 2002
Messages
3,112
We're getting there. It will take at least a few more weeks, maybe a lot longer if any extra problems surface.
 

xell

Member
Joined
Apr 22, 2004
Messages
2
I'm also parsing the RDF files in Java. I solved the UTF-8 problem by using iconv. If you have Linux on your computer, run this command:

iconv -c -f UTF-8 -t UTF-8 structure.rdf.u8 > cleaned_structure.rdf.u8

This will clean the RDF file of any illegal UTF-8 sequences, because the -c option tells iconv to omit anything it cannot convert from the output.
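On a machine without iconv, roughly the same cleanup can be sketched in Python. This is an approximation rather than a drop-in replacement: the function name is made up for this example, and Python's decoder may not agree with every iconv build on exactly which bytes count as invalid.

```python
import shutil


def strip_invalid_utf8(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, omitting bytes that are not valid UTF-8."""
    # errors="ignore" drops undecodable bytes; newline="" keeps the
    # original line endings untouched on both the read and write side.
    with open(src_path, "r", encoding="utf-8", errors="ignore", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        shutil.copyfileobj(src, dst)
```

For the command above, the equivalent call would be `strip_invalid_utf8("structure.rdf.u8", "cleaned_structure.rdf.u8")`.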
 