World/Deutsch/ partially broken

giz

Member
Joined
May 26, 2002
Messages
3,112
Over the past few months, but mostly in the last month or so, a project has been running to convert all of the ODP to UTF-8 character encoding from the myriad of different encodings previously in use (ISO-8859, KOI8-R, Shift-JIS, ISO-2022, and so on).

This project is progressing well. All of the /World categories are converted (perhaps one or two small ones were missed), but in some places there is now a duplicate category. This occurs if the category name contains accented or non-Latin characters. The conversion leaves behind the old category, with its old ISO-8859 (or whatever) encoded name and page of sites, alongside the new UTF-8 encoded category name with its page of UTF-8 encoded data. An automatic process is going through and taking the old pages out. Additionally, the relcat and altlang links are being updated to reflect the new category names (well, not new names, but a new encoding of the characters that make up the name), and errors could be found there too while the job is still "in progress".
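
To make the duplicate problem a bit more concrete, here is a rough Python sketch. It is not the ODP conversion tooling, which isn't shown in this thread. The category name and the `find_duplicates` helper are made up for illustration, and ISO-8859-1 is assumed as the legacy encoding. The point is only that the same name produces different bytes in the two encodings, so it can show up as two separate categories.

```python
# Hypothetical category name containing an accented character.
name = "World/Deutsch/Länder"

# The same text yields different byte sequences in the two encodings.
legacy_bytes = name.encode("iso-8859-1")  # "ä" becomes one byte: 0xE4
utf8_bytes = name.encode("utf-8")         # "ä" becomes two bytes: 0xC3 0xA4


def find_duplicates(raw_names, legacy="iso-8859-1"):
    """Group raw byte names that decode to the same text.

    Each name is tried as UTF-8 first, falling back to the assumed
    legacy encoding when the bytes are not valid UTF-8.
    """
    groups = {}
    for raw in raw_names:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode(legacy)
        groups.setdefault(text, []).append(raw)
    # Keep only names that appear under more than one byte encoding.
    return {text: raws for text, raws in groups.items() if len(raws) > 1}


duplicates = find_duplicates([legacy_bytes, utf8_bytes, b"World/Deutsch/Sport"])
```

Names made up only of ASCII characters have identical bytes in both encodings, which is why only categories with accented or non-Latin characters are affected. The UTF-8-first fallback is a heuristic: a few legacy byte sequences happen to be valid UTF-8 as well, so a real cleanup would need more care than this.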

This should all be cleared up over the next few weeks or so. Additionally, be advised that the RDF dump might have some UTF-8 errors in it next week if any site entries got corrupted in the changeover. Things are looking good at the moment, but the job still isn't quite finished.
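
If you process the RDF dump yourself, one way to spot such errors is to check each line for valid UTF-8 before parsing it. The following is a minimal sketch, assuming the dump can be read line by line as raw bytes. The `find_invalid_utf8` helper and the sample lines are made up for illustration and are not taken from the real dump.

```python
def find_invalid_utf8(lines):
    """Return (line_number, reason) pairs for byte lines that are not valid UTF-8."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            bad.append((number, exc.reason))
    return bad


# Made-up sample: the first line is proper UTF-8, the second was left in ISO-8859-1.
sample = [
    "<Title>Caf\u00e9</Title>".encode("utf-8"),
    "<Title>Caf\u00e9</Title>".encode("iso-8859-1"),
]

bad_lines = find_invalid_utf8(sample)
```

For a real file, opening it in binary mode (`open(path, "rb")`) and passing the file object to `find_invalid_utf8` would work the same way, since iterating a binary file yields one bytes line at a time.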

You might find it easier to browse the Google version of the ODP data in some categories (even though that data is a couple of months old now).



<edit>Both of your links worked OK for me at this time. Might have been a temporary glitch, but do bear in mind the changeover of encodings if you have any further problems.</edit>
 

xixtas01

Member
Joined
Jun 16, 2003
Messages
624
Good post, giz.

It looks to me like you may have been looking at a cached copy on the public servers, though. On the edit side, it looks wonky to me. I'll report this in our internal bug-tracking forum.
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
No need to report. All the public pages are being rebuilt right now due to the change in encoding.
 

xixtas01

Member
Joined
Jun 16, 2003
Messages
624
It's hard to keep up with all the permutations of the UTF-8 conversion. But Windharp is right. This is expected behavior and will resolve in time as pages are regenerated. Sorry.

giz suggested using the Google directory. Another, more recent copy of the directory is available at open.thumbshots.org.
 

xixtas01

Member
Joined
Jun 16, 2003
Messages
624
Seems like every time I open my mouth in this thread I stick my foot in it. :) Thumbshots was working when I posted that message. I didn't realize that they were *that* current.

The editor side is now displaying correctly, so I expect the public side will update and display correctly in the next 0-3 days.
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
"Thumbshots was working when I posted that message. I didn't realize that they were *that* current."
To be precise, they are pulling live data (I tested that some time ago). Maybe they cache it, I don't know. They use the RDF dump, but only for generating the thumbshots. Most likely the delay you can notice there is just the caching of our own dmoz.org public servers (remember there are multiple servers responding to those queries in turn, so the results might differ on subsequent queries) ;-)

For the rest you should be right: the public servers cache data for 4-7 days (at least that was what we were told when that system was started), so it should slowly fix itself in the next week. Since most categories were already rebuilt some days ago, it is quite likely that almost no empty categories will be seen in the timescale you mentioned. :)
 