Illegal byte sequences, Invalid UTF-8

phasevar · Mar 17, 2004

I'm attempting to parse the RDF dumps but I'm running into illegal UTF-8 characters. Is anything being done to correct this? What workarounds are possible to get an XML parser to work with these files?

I read the Errata and saw the filtering change done on 2003-03-12, but apparently something is not working quite right today because even the link provided in that update reports an overwhelming number of UTF-8 errors.

http://rodan.ncc.com/rdf/

I'm using the Python XML libraries to parse the files and I can't keep it from throwing exceptions. Expat's test utilities throw errors as well.

Could someone who is currently parsing these files share a bit of wisdom?

Thanks!
-Dan

bobrat · Mar 17, 2004

I don't know if this helps you, but the entire ODP has changed over to UTF-8 in the last month. The changes are still in progress, but most categories have been changed. Here and there there will be sites and descriptions that will have to be manually converted.

windharp · Mar 17, 2004

phasevar said:
Is anything being done to correct this?

Yes, we are working on this. Due to the amount of errors and limitations to staff time it will take some time till they are eliminated. No schedule yet, sorry to say.

phasevar · Mar 17, 2004

windharp said:
Yes, we are working on this. Due to the amount of errors and limitations to staff time it will take some time till they are eliminated. No schedule yet, sorry to say.

Any hints for processing these files in the interim?

bobrat · Mar 17, 2004

Sorry, I occasionally play around with RDF processing, but not recently.

I do notice that Google has copied the changes for one of my categories where I manually fixed up some UTF-8 characters, and that page is marked as UTF-8, but all the characters I changed do not display correctly. However, in an equivalent World category, they seem to have done it correctly. I only checked a few categories but that seems to be a general rule.

So they haven't figured it out either.

Roughly speaking, ODP has automatically converted all the World categories to UTF-8 [some glitches crept in, which are still being fixed]. In other categories, which will occasionally have accents in titles and descriptions, most of the conversion is being done manually, and I think it may take a while.

dermotz · Mar 18, 2004

DMOZ internal processing

I was just wondering.....how do you cope with your own errors yourself?

Does this mean you need to rely on your own software rather than using standard rdf classes ?

just curious...........

bobrat · Mar 19, 2004

Sorry, what I meant was that most of the sites in ODP were converted automatically to UTF-8. However, for various reasons, mostly in the categories outside World, there are site descriptions that were not converted correctly to UTF-8. It turns out that in many cases there were invalid characters in the old descriptions that looked OK. For example, a character that looked like a single apostrophe, but wasn't, now displays as a question mark and must be fixed. A lot of the accents also did not get converted and display as the wrong character. So this is what is in the latest RDF dump.

The invalid characters are being fixed by editors in the same way we would fix a spelling error. So when I said "where I manually fixed up some UTF-8 characters", I meant that as an editor I have fixed the errors [not in my RDF software processing]. The next RDF will pick up those changes, but it may take a while before all those manual changes show up, and until that is complete the RDF will contain a mixed-up set of characters. However, the fixup is considered a priority.

windharp · Mar 19, 2004

phasevar said:
Any hints for processing these files in the interim?

I assume that running every line that contains an error through a UTF-8 encoder should fix it. The problem is simply that a lot of characters were not converted to UTF-8. Don't blame me if it doesn't work, though. There might be some other errors as well (I could imagine double-encoded characters, for example; you would have to decode them to remove the error. I don't know how to detect that, to be honest).
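The per-line idea above could be sketched roughly as follows, in Java since that is what other posters in this thread are using. This is only a sketch under assumptions: the names `Utf8LineRepair`, `isValidUtf8` and `repairLine` are made up for illustration, and it assumes the unconverted bytes were originally ISO-8859-1, which nothing in this thread confirms. Double-encoded characters are themselves valid UTF-8, so this check does not catch them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8LineRepair {

    /** Returns true if the bytes decode as well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Leaves valid lines untouched; re-encodes invalid lines to UTF-8.
     * ASSUMPTION: an invalid line is entirely in ISO-8859-1.
     */
    static byte[] repairLine(byte[] line) {
        if (isValidUtf8(line)) {
            return line;
        }
        String text = new String(line, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

One limitation worth keeping in mind: a line that mixes correct UTF-8 with unconverted bytes would have its correct parts double-encoded by `repairLine`, so a real cleanup would probably need to work on smaller units than a whole line.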

giz · Mar 19, 2004

No RDF dump this week (due to a minor technical problem with one category).

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

Might take a few weeks for all of the oddities to be found and expunged.

giz · Apr 11, 2004

We believe that site titles and descriptions are now all encoded in proper UTF-8, but the next RDF will be used to prove or disprove that guess.

There are still some category descriptions awaiting an edit to clear encoding errors in them, and a lot of work still to be done on the FAQs.

The last RDF on 2004-04-03 had 10% of the errors of the previous one. No RDF last week due to other issues. Another one is coming soon.

clevariant · Apr 21, 2004

Byte-by-byte, Baby

I'm parsing the file in Java, and I had to extend FileReader and filter all characters over 255, like so:

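// Inside a class that extends java.io.FileReader
@Override
public int read() throws IOException
{
    int character = super.read();
    // Skip every character whose value is above 255
    while (character > 255)
        character = super.read();
    return character;
}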

I just passed this FileReader subclass to a SAX parser, and that did the trick.

giz · Apr 22, 2004

The latest content dump had only 56 errors in over 1 800 000 000 characters, and those have all been fixed by hand now. Looking for zero errors in that dump file next week.

Still some complex work to do on the structure file, and that will probably take at least a few more weeks to get right.

clevariant · Apr 22, 2004

The "structure" file is what I'm parsing.

giz · Apr 22, 2004

We're getting there. It will take at least a few more weeks, maybe a lot longer if any extra problems surface.

xell · Apr 22, 2004

I'm also parsing the RDF files in Java. I solved the UTF-8 problem by using iconv. If you have Linux on your computer, run this command:

iconv -c -f UTF-8 -t UTF-8 structure.rdf.u8 > cleaned_structure.rdf.u8

This strips any illegal UTF-8 sequences from the RDF file (the -c option tells iconv to drop characters it cannot convert).
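For anyone without iconv at hand, roughly the same effect might be had inside Java itself by asking the UTF-8 decoder to ignore malformed input. The following is a minimal sketch, not a tested replacement for the command above; the names `LenientUtf8`, `lenientReader` and `decodeLenient` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LenientUtf8 {

    /** Wraps a byte stream in a Reader that silently drops malformed UTF-8. */
    static Reader lenientReader(InputStream in) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        return new InputStreamReader(in, decoder);
    }

    /** Small helper for trying the reader out on an in-memory byte array. */
    static String decodeLenient(byte[] bytes) {
        try (Reader reader = lenientReader(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reader returned by `lenientReader` could then be handed to a SAX parser through `new org.xml.sax.InputSource(reader)`. As with `iconv -c`, the bad bytes are discarded rather than repaired, so an accented character that was never converted to UTF-8 simply disappears from the text.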
