Surprise! Different charsets!!

hughprior

I've been trying to sort out the accented characters (for www.localpin.com) when I discovered to my surprise (read: absolute horror!) that different pages of DMOZ use different charsets, e.g. compare the '%e8' character in:

[sorry, but the 2nd URL displays wrongly here - go to Philosophers, C and find the name Cixous]
http://www.dmoz.org/World/Hrvatski/Ra%e8unala/
http://www.dmoz.org/Society/Philosophy/Philosophers/C/Cixous,_H%e9l%e8ne/

They use the SAME escaped byte in the URL (%e8), but this character is NOT displayed the same!!!

In the first it is rendered as a c with an inverted circumflex accent (a caron), in the second as an e with a grave accent (if DMOZ was working I could copy/paste them here).

On inspecting the pages in more detail, I notice that the first is created with charset ISO-8859-2 and the second with ISO-8859-1.
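To make the difference concrete, here is a minimal Python sketch that decodes the same percent-escaped byte under both charsets. The helper name `decode_segment` is made up for illustration; the path segments are taken from the two URLs above.

```python
from urllib.parse import unquote_to_bytes


def decode_segment(segment: str, charset: str) -> str:
    """Turn a percent-escaped URL segment into text using the given charset.

    decode_segment is a hypothetical helper, not part of any DMOZ tooling.
    """
    raw = unquote_to_bytes(segment)  # "%e8" becomes the single byte 0xE8
    return raw.decode(charset)


if __name__ == "__main__":
    # Same byte 0xE8, two different characters depending on the charset.
    print(decode_segment("Ra%e8unala", "iso-8859-2"))  # Croatian page
    print(decode_segment("Ra%e8unala", "iso-8859-1"))  # what a wrong guess gives
    print(decode_segment("H%e9l%e8ne", "iso-8859-1"))  # Cixous page
```

The byte itself never changes; only the charset chosen for decoding decides whether 0xE8 comes out as č or è.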

It's worth noticing that neither of the mirrors handles this difference correctly (both mirror pages display the word incorrectly as Raèunala):
http://de.dmoz.org/World/Hrvatski/
http://ch.dmoz.org/World/Hrvatski/

OK. Well done DMOZ. Displays nicely for DMOZ. The end user (as long as they don't use a mirror) sees the correct displayed character.

But now I want to know:

a) How on earth, as a user of DMOZ data, do I know which pages use which charset?!! As far as I know, it isn't in the RDF dump. If it were, I guess the mirrors would pick it up correctly.

b) How many charsets are used?

c) Where is this documented?

Thanks.
 

beebware

Member
Joined
Mar 25, 2002
Messages
1,070
The charsets are detailed on the RDF listings and, as you can see from the tag display, "d:charset" is part of the "structure.rdf" file.
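As a rough sketch of how a data user might pick this up, the Python below collects a charset per topic from an RDF fragment. Everything about the fragment is an assumption made for illustration: the sample markup, the namespace URIs, the placement of "d:charset" directly inside each Topic, and the ISO-8859-1 fallback for topics without one. The real "structure.rdf" layout should be checked before relying on any of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real structure.rdf may be laid out differently.
SAMPLE = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/World/Hrvatski">
    <d:Title>Hrvatski</d:Title>
    <d:charset>ISO-8859-2</d:charset>
  </Topic>
  <Topic r:id="Top/Society/Philosophy">
    <d:Title>Philosophy</d:Title>
  </Topic>
</RDF>
"""


def local(name: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tags and attributes."""
    return name.rsplit("}", 1)[-1]


def charsets_by_topic(xml_text: str, default: str = "ISO-8859-1") -> dict:
    """Map each Topic id to its charset, falling back to an assumed default."""
    result = {}
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "Topic":
            continue
        topic_id = next(
            (value for key, value in elem.attrib.items() if local(key) == "id"),
            None,
        )
        charset = default  # assumption: no d:charset means the default applies
        for child in elem:
            if local(child.tag) == "charset" and child.text:
                charset = child.text.strip()
        result[topic_id] = charset
    return result


if __name__ == "__main__":
    for topic, charset in charsets_by_topic(SAMPLE).items():
        print(topic, charset)
```

For the full dump, a streaming parser such as `xml.etree.ElementTree.iterparse` would probably be a better fit than loading the whole file into memory.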

Why aren't the mirrors sending the right charset code? Pass... but then again, the mirrors (IIRC) don't actually use the RDFs; they use another method of mirroring the data (they have to have "exact data copies" of the layout etc., whereas downstream users such as Google are more interested in the actual data than the presentation).
 

totalxsive

Member
Joined
Mar 25, 2002
Messages
2,348
Location
Yorkshire, UK
This is a problem on our list to fix - we're on the long windy road to getting the whole directory into Unicode. A solution will be here sometime in the future but in the meantime I'm afraid you'll have to put up with encoding soup.
 