hughprior
I've been trying to sort out the accented characters (for www.localpin.com) when I discovered, to my surprise (read: absolute horror!), that different pages of DMOZ use different charsets. For example, compare the '%e8' character in:
[sorry, but the 2nd URL displays wrongly here - go to Philosophers, C and find the name Cixous]
http://www.dmoz.org/World/Hrvatski/Ra%e8unala/
http://www.dmoz.org/Society/Philosophy/Philosophers/C/Cixous,_H%e9l%e8ne/
They use the SAME escaped byte in the URL (%e8), but this character is NOT displayed the same!!!
In the first it is rendered as a c with a caron (an inverted circumflex accent), in the second as an e with a grave accent (if DMOZ was working I could copy/paste them here).
On inspecting the pages in more detail, I notice that the first is created with charset ISO-8859-2 and the second with ISO-8859-1.
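To make the difference concrete, here is a minimal sketch in Python 3 of how the single byte 0xE8 decodes under each of the two charsets. The variable names are mine and purely illustrative; nothing here comes from DMOZ itself.

```python
import unicodedata

# The byte behind the "%e8" escape in both URLs.
raw = bytes([0xE8])

# Decode the same byte under each charset found on the two pages.
as_latin1 = raw.decode("iso-8859-1")  # charset of the Philosophers page
as_latin2 = raw.decode("iso-8859-2")  # charset of the Hrvatski page

for label, ch in (("ISO-8859-1", as_latin1), ("ISO-8859-2", as_latin2)):
    # unicodedata.name gives the official Unicode character name.
    print(label, repr(ch), unicodedata.name(ch))
```

Under ISO-8859-1 the byte is LATIN SMALL LETTER E WITH GRAVE; under ISO-8859-2 it is LATIN SMALL LETTER C WITH CARON.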
It's worth noticing that neither of the mirrors handles this difference correctly (both mirror pages display the word incorrectly as Raèunala):
http://de.dmoz.org/World/Hrvatski/
http://ch.dmoz.org/World/Hrvatski/
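What the mirrors appear to be doing can be reproduced by percent-decoding the category name with the wrong charset. This is only a sketch, assuming Python 3 and its standard urllib.parse.unquote; I don't know how the mirrors are actually implemented, and the variable names are hypothetical.

```python
from urllib.parse import unquote

# Category name exactly as it appears in the DMOZ URL.
escaped = "Ra%e8unala"

# Decoding with the charset the Hrvatski page declares gives the Croatian word.
correct = unquote(escaped, encoding="iso-8859-2")

# Assuming ISO-8859-1 for everything gives the wrong letter instead.
wrong = unquote(escaped, encoding="iso-8859-1")

# The Cixous URL, by contrast, decodes as intended under ISO-8859-1.
cixous = unquote("Cixous,_H%e9l%e8ne", encoding="iso-8859-1")

print(correct, wrong, cixous)
```

So without knowing the charset per page, there is no way to tell from the escaped name alone which decoding is the intended one.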
OK. Well done DMOZ. Displays nicely for DMOZ. The end user (as long as they don't use a mirror) sees the correctly displayed character.
But now I want to know:
a) How on earth, as a user of DMOZ data, do I know which pages use which charset?!! As far as I know, it isn't in the RDF dump. If it were, I guess the mirrors would pick it up correctly.
b) How many charsets are used?
c) Where is this documented?
Thanks.