Posted

I'm attempting to parse the RDF dumps, but I'm running into illegal UTF-8 characters. Is anything being done to correct this? What workarounds are there for getting an XML parser to accept these files?

 

I read the Errata and saw the filtering change done on 2003-03-12, but apparently something is not working quite right today because even the link provided in that update reports an overwhelming number of UTF-8 errors.

 

http://rodan.ncc.com/rdf/

 

 

I'm using the Python XML libraries to parse the files and I can't keep it from throwing exceptions. Expat's test utilities throw errors as well.

 

Could someone who is currently parsing these files share a bit of wisdom?

 

Thanks!

-Dan

Posted
I don't know if this helps you, but the entire ODP has changed over to UTF-8 in the last month. The changes are still in progress, but most categories have been changed. Here and there there will be sites and descriptions that will have to be manually converted.
  • Meta
Posted
Is anything being done to correct this?

Yes, we are working on this. Given the number of errors and limited staff time, it will take a while to eliminate them. No schedule yet, sorry to say.

Curlie Meta/kMeta Editor windharp


Posted
Yes, we are working on this. Given the number of errors and limited staff time, it will take a while to eliminate them. No schedule yet, sorry to say.

 

Any hints for processing these files in the interim?

Posted

Sorry, I occasionally play around with RDF processing, but I haven't recently.

 

I do notice that Google has copied the changes for one of my categories where I manually fixed up some UTF-8 characters, and that page is marked as UTF-8, but all the characters I changed do not display correctly. However, in an equivalent World category, they seem to have done it correctly. I only checked a few categories but that seems to be a general rule.

 

So they haven't figured it out either.

 

Roughly speaking, ODP automatically converted all the World categories to UTF-8 (some glitches crept in and are still being fixed). In other categories, which occasionally have accents in titles and descriptions, most of the conversion is being done manually, and I think it may take a while.

Posted

DMOZ internal processing

 

I was just wondering: how do you cope with these errors yourselves?

 

Does this mean you need to rely on your own software rather than using standard RDF classes?

 

 

Just curious.

Posted

Sorry, what I meant was that most of the sites in ODP were converted to UTF-8 automatically. However, for various reasons, mostly in the categories outside World, some site descriptions were not converted correctly. It turns out that in many cases the old descriptions contained invalid characters that looked fine. For example, a character that looked like a single apostrophe, but wasn't one, now displays as a question mark and must be fixed. A lot of the accents also did not get converted and display as the wrong character. So this is what is in the latest RDF dump.

 

The invalid characters are being fixed by editors in the same way we would fix a spelling error. So when I said I "manually fixed up some UTF-8 characters", I meant that I fixed the errors as an editor, not in my RDF processing software. The next RDF will pick up those changes, but it may take a while before all those manual changes show up, and until that is complete the RDF will contain a mixed-up set of characters. However, the fixup is considered a priority.

  • Meta
Posted
Any hints for processing these files in the interim?

I assume that running every line that contains an error through a UTF-8 encoder should fix it. The problem is simply that a lot of characters were never converted to UTF-8. Don't blame me if it doesn't work, though. There might be other errors as well. I could imagine double-encoded characters, for example, which you would have to decode once to remove the error. To be honest, I don't know how to detect those.
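A minimal Python sketch of that idea, in case it helps. It assumes that the unconverted lines are in Windows-1252, which is a guess and not something confirmed in this thread. The function names `repair_line` and `undo_double_encoding` are made up for illustration. The second function shows one common heuristic for spotting double encoding: if the text survives a round trip through Latin-1 bytes and back through a UTF-8 decode, it was probably encoded twice.

```python
def repair_line(raw: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode one line as UTF-8, falling back to a legacy encoding.

    Assumption: a line that is not valid UTF-8 was never converted and is
    still in `legacy_encoding` (cp1252 here is only a guess).
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes that are undefined in the legacy encoding become U+FFFD.
        return raw.decode(legacy_encoding, errors="replace")


def undo_double_encoding(text: str) -> str:
    """Undo one layer of UTF-8-read-as-Latin-1 double encoding, if present.

    Heuristic only: text that cannot be mapped back to Latin-1 bytes, or
    whose bytes are not valid UTF-8, is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Two limits worth knowing: a line that mixes valid UTF-8 with legacy bytes is decoded entirely with the fallback, so its valid multi-byte characters come out garbled, and the double-encoding check can misfire on text that legitimately contains sequences such as "Ã©".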

Curlie Meta/kMeta Editor windharp


Posted

No RDF dump this week (due to a minor technical problem with one category).

 

By the time the next one appears, a lot of the illegal sequences will have been fixed by hand, or by scripts checking and correcting the data.

 

Might take a few weeks for all of the oddities to be found and expunged.

  • 4 weeks later...
Posted

We believe that site titles and descriptions are now all encoded in proper UTF-8, but the next RDF will be used to prove or disprove that guess.

 

Some category descriptions are still awaiting an edit to clear an encoding error, and a lot of work remains to be done on the FAQs.

 

The last RDF on 2004-04-03 had 10% of the errors of the previous one. No RDF last week due to other issues. Another one is coming soon.

  • 2 weeks later...
Posted

Byte-by-byte, Baby

 

I'm parsing the file in Java, and I had to extend FileReader and filter all characters over 255, like so:

 

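// Goes inside a subclass of java.io.FileReader; needs import java.io.IOException.
@Override
public int read() throws IOException {
    int character = super.read();
    // Skip every character above 255; -1 (end of stream) falls through.
    // Note: this also discards valid characters outside Latin-1.
    // Caveat: a parser that calls read(char[], int, int) bypasses this override.
    while (character > 255) {
        character = super.read();
    }
    return character;
}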

 

I just passed this FileReader subclass to a SAX parser, and that did the trick.
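For anyone doing this with the Python XML libraries mentioned at the top of the thread, here is a rough analogue. It is a different technique from the Java filter above: instead of dropping characters over 255, it replaces each invalid byte sequence with U+FFFD and feeds the cleaned text to the standard `xml.sax` parser in chunks. The names `ElementCounter` and `parse_lenient` are hypothetical, and the sketch assumes the dump declares itself as UTF-8.

```python
import codecs
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that only counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def parse_lenient(stream, handler, chunk_size=65536):
    """Parse a binary stream, replacing invalid UTF-8 with U+FFFD first."""
    # The incremental decoder buffers multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        chunk = stream.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        if text:
            # Re-encode so the parser always receives well-formed UTF-8.
            parser.feed(text.encode("utf-8"))
        if not chunk:
            break
    parser.close()
```

Usage would be along the lines of `parse_lenient(open(path, "rb"), ElementCounter())`. This only handles encoding errors; characters that are valid UTF-8 but not allowed in XML would still make the parser raise.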

Posted

The latest content dump had only 56 errors in over 1 800 000 000 characters, and those have all been fixed by hand now. Looking for zero errors in that dump file next week.
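If you want to check a dump yourself, a count like that can be approximated with a short script. This is only a sketch, not the tool used to produce the figures above, and `count_invalid_utf8` is a made-up name. It decodes the file in chunks and counts the U+FFFD replacement characters the decoder inserts.

```python
import codecs


def count_invalid_utf8(stream, chunk_size=1 << 20):
    """Return (characters, replacements) for a binary stream read as UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    chars = 0
    replacements = 0
    while True:
        chunk = stream.read(chunk_size)
        text = decoder.decode(chunk, final=not chunk)
        chars += len(text)
        # Each invalid sequence shows up as at least one U+FFFD.
        replacements += text.count("\ufffd")
        if not chunk:
            break
    return chars, replacements
```

The result is an estimate: a U+FFFD that is genuinely present in the data is counted too, and one damaged sequence can produce more than one replacement character.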

 

Still some complex work to do on the structure file, and that will probably take at least a few more weeks to get right.

Posted

I'm also parsing the RDF files in Java. I solved the UTF-8 problem with iconv. If iconv is installed (it ships with most Linux systems), run this command:

 

iconv -c -f UTF-8 -t UTF-8 structure.rdf.u8 > cleaned_structure.rdf.u8

 

This removes every illegal UTF-8 sequence from the RDF file. The -c flag tells iconv to drop invalid input instead of stopping at the first error.
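Where iconv is not available, roughly the same cleanup can be sketched in Python. This is an approximation of the command above, not a byte-for-byte reimplementation, and the function name `drop_invalid_utf8` is made up. It streams the file through an incremental UTF-8 decoder that silently ignores invalid bytes, then writes the result back out as UTF-8.

```python
import codecs


def drop_invalid_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, omitting bytes that are not valid UTF-8."""
    # errors="ignore" discards invalid sequences instead of raising.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            text = decoder.decode(chunk, final=not chunk)
            dst.write(text.encode("utf-8"))
            if not chunk:
                break
```

For the file named above, the call would be `drop_invalid_utf8("structure.rdf.u8", "cleaned_structure.rdf.u8")`. As with iconv, the damaged characters are simply lost, so titles and descriptions that contained them will have gaps.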
