ODP Corruption - Multiple CatID 1

JasonTimmins

Member
Joined
Feb 21, 2009
Messages
18
Hi There,

Over the past month or so, I've noticed that the RDF files I use to drive my site contain some strange data. Specifically, there are many occurrences (59 in this week's version of content.rdf) of CatID 1 (<catid>1</catid> in the XML). This is many more than the handful usually found as the top-level categories in the DMOZ hierarchy. Here's an example...

<Topic r:id="Top/Business/Textiles_and_Nonwovens/Industrial_Yarns_and_Sewing_Threads/Carpet_Yarns">
/Regional/North_America/United_States/Georgia/Localities/A/Athens/Arts_and_Entertainment/Museums

I also see that there are 53 occurrences of CatID records in structure.rdf.
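
For anyone who wants to verify the counts on their own copy, here is a minimal sketch in Python. It assumes each <catid> element sits on a single line in the form quoted above; the function names and the file path in the usage comment are hypothetical, not part of any DMOZ tooling.

```python
import re
from collections import Counter

# Assumes the element appears as <catid>N</catid> on one line, as quoted above.
CATID_RE = re.compile(r"<catid>(\d+)</catid>")

def count_catids(lines):
    """Count how often each <catid> value appears in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for value in CATID_RE.findall(line):
            counts[int(value)] += 1
    return counts

def duplicated_catids(counts):
    """Return {catid: count} for every catid seen more than once."""
    return {catid: n for catid, n in counts.items() if n > 1}

# Usage on a real dump (the path is an assumption):
# with open("structure.rdf.u8", encoding="utf-8", errors="replace") as fh:
#     counts = count_catids(fh)
# print(counts[1], duplicated_catids(counts))
```

Reading line by line keeps memory use flat, which matters because the dumps are large.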

This has really made a mess of my DMOZ-based directory, <URL Removed>. Have a look for yourself... nasty.

Can someone take a look at the files and verify my findings? I've downloaded the files each week for the last six weeks or so and they've all contained this type of corruption. What's going on?

Cheers
Jason.
 

jimnoble

DMOZ Meta
Joined
Mar 26, 2002
Messages
18,915
Location
Southern England
What's going on is a known bug which is being investigated by AOL's systems engineers. I can't give a time scale for its resolution I'm afraid :(.
 

sharonfranz

Member
Joined
Jul 22, 2009
Messages
10
jimnoble said:
What's going on is a known bug which is being investigated by AOL's systems engineers. I can't give a time scale for its resolution I'm afraid :(.

Do you know if this only affects the most recent dump? Does the previous month's dump have this bug too? Thanks.
 

sharonfranz

Member
Joined
Jul 22, 2009
Messages
10
Thanks! I'll try that one out. I'm only using the categories (structure.rdf).

Do you know any good resources on the best approach to working with the data at the database level? I'm using PHP/MySQL. I've already imported the data into the database. I took a peek at the tables. To display the sub-categories for each topic, I'm going by the fatherid. I'm new at this, so I'm just wondering if anyone has a better approach. Thanks.
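
The fatherid lookup described above is a plain parent-pointer (adjacency list) query. Here is a minimal sketch of it, written in Python with sqlite3 standing in for PHP/MySQL. The table name, the catid and title columns, and the demo rows are all assumptions for illustration; only fatherid comes from the imported schema mentioned above.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory table; the rows are made up for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categories (catid INTEGER, fatherid INTEGER, title TEXT)"
    )
    # An index on fatherid keeps the child lookup fast on a full import.
    conn.execute("CREATE INDEX idx_categories_father ON categories (fatherid)")
    conn.executemany(
        "INSERT INTO categories VALUES (?, ?, ?)",
        [
            (1, 0, "Top"),
            (2, 1, "Arts"),
            (3, 1, "Business"),
            (4, 3, "Textiles_and_Nonwovens"),
        ],
    )
    return conn

def subcategories(conn, father_id):
    """Return (catid, title) for the direct children of father_id."""
    rows = conn.execute(
        "SELECT catid, title FROM categories WHERE fatherid = ? ORDER BY title",
        (father_id,),
    )
    return rows.fetchall()
```

The same SELECT should carry over to MySQL more or less unchanged. Note that this approach relies on catid being unique, so duplicated CatID values in a dump will attach children to the wrong parent.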

photofox said:
I believe (and don't quote me on this) that the last available good RDF dump is http://rdf.dmoz.org/rdf/archive/2009-04-07/

The files since that dump are either not available in the archive or problems have been reported with them (i.e. the CatID issue).
 

JasonTimmins

Member
Joined
Feb 21, 2009
Messages
18
Hi There,

The problem seemed to start around April but the current file has this at line 6...

<!-- Generated at 2009-06-01 14:19:20 GMT on core-n01 -->

...which seems to indicate that they are not even bothering to update the RDFs at the moment.
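
If you want to check this on each download without opening the file by hand, here is a small sketch that pulls the timestamp out of that comment. It assumes the comment keeps the exact format shown above and appears within the first few lines; the function name and the max_lines cutoff are my own choices.

```python
import re
from datetime import datetime

# Matches the comment format quoted above, e.g.
# <!-- Generated at 2009-06-01 14:19:20 GMT on core-n01 -->
GENERATED_RE = re.compile(
    r"<!--\s*Generated at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT"
)

def generated_at(lines, max_lines=20):
    """Return the generation time found in the first max_lines lines, or None."""
    for i, line in enumerate(lines):
        if i >= max_lines:
            break
        match = GENERATED_RE.search(line)
        if match:
            return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return None
```

Comparing the returned value between two weekly downloads shows whether the file was actually regenerated.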

As I write, there are still 59 entries for CatID=1 in this week's structure.rdf. <mutter>

Bye for now
Jason.
 

JasonTimmins

Member
Joined
Feb 21, 2009
Messages
18
I don't get what you're trying to say Jim. I assume that that line in the XML is the creation date of the file. If it is, clearly, it's not been created since the 1st of June.

It looks like the data has been broken since April. I'd like to think that an organisation the size of AOL is capable of 'actively pursuing' a 'known bug' in less than 4 months. Jeeez.
 

jimnoble

DMOZ Meta
Joined
Mar 26, 2002
Messages
18,915
Location
Southern England
I'm not an AOL employee but I recognise that, as with many other companies in these difficult times, resources are lower than one might wish and that their efforts have to be prioritised. As I said previously, AOL engineers are working on the issue but we can't predict when it will be resolved.

If you're dissatisfied with the situation, I suggest that you ask for a refund ;).
 

sharonfranz

Member
Joined
Jul 22, 2009
Messages
10
Hehe... I'm surprised AOL is even using resources on DMOZ since it obviously doesn't generate revenue for them.

I'm wondering if DMOZ should go the route of Wikipedia and ask for donations. This seems to work really well for non-profits. Figure out how many people it'll take to run this place, stick a goal (an amount), and ask for donations.

Take a look at refdesk.com. It's way smaller than DMOZ, but he's able to get enough contributions to keep his site running.

Just a suggestion.

jimnoble said:
I'm not an AOL employee but I recognise that, as with many other companies in these difficult times, resources are lower than one might wish and that their efforts have to be prioritised. As I said previously, AOL engineers are working on the issue but we can't predict when it will be resolved.

If you're dissatisfied with the situation, I suggest that you ask for a refund ;).
 

sharonfranz

Member
Joined
Jul 22, 2009
Messages
10
Just imported the data. I'm using Josser to import the data into my database. I'm still getting multiple entries for CatID, which is supposed to be unique, right? Is it just the Josser utility or another corrupt data dump? Anyone else used Josser? Thx.
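
One way to tell whether the duplicates are in the imported data itself is to ask the database directly. Here is a minimal sketch in Python with sqlite3; the table name, column names, and demo rows are assumptions for illustration and not Josser's actual schema.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory table with a deliberately duplicated catid."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE categories (catid INTEGER, topic TEXT)")
    conn.executemany(
        "INSERT INTO categories VALUES (?, ?)",
        [
            (1, "Top"),
            (2, "Top/Arts"),
            (1, "Top/Business/Example_A"),  # made-up row reusing catid 1
            (1, "Top/Regional/Example_B"),  # made-up row reusing catid 1
        ],
    )
    return conn

def duplicate_catids(conn):
    """Return (catid, row_count) for every catid that appears more than once."""
    rows = conn.execute(
        "SELECT catid, COUNT(*) FROM categories "
        "GROUP BY catid HAVING COUNT(*) > 1 ORDER BY catid"
    )
    return rows.fetchall()
```

If the same query on the real table returns rows, grepping the RDF file for those CatID values should show whether the duplicates were already present in the dump or were introduced during the import.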

photofox said:
I believe (and don't quote me on this) that the last available good RDF dump is http://rdf.dmoz.org/rdf/archive/2009-04-07/

The files since that dump are either not available in the archive or problems have been reported with them (i.e. the CatID issue).
 

Elper

Curlie Admin
RZ Admin
Joined
Sep 15, 2004
Messages
2,899
The main rdf issues (including the Duplicate CatID) are afaik fixed... There may still be a glitch with one colon (:) too many, but the rdf (of 19 January 2010) should be usable.
 

scottie

New Member
Joined
Mar 16, 2011
Messages
2
Location
Portland, Oregon
I have Python code (runs in 2.6.5) that crawls through the structure files and identifies a class of mistaken entries. Essentially, there should only be a few entries ('' and 'Top' in the default structure) where "/" + the <d:Title> entry doesn't end the <Topic>'s r:id field. I've gone back into the .rdf.u8 and looked at those 78 entries and they all seem to be mistakes. If anyone is interested, I'd be happy to give them the code to run the check on the generated objects. It takes minutes to run, so it is not _that_ horrible.
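
To give an idea of the check, here is a rough sketch of the rule as described above. It is not the actual script: it is written for Python 3 and assumes each <Topic> opening tag and its <d:Title> sit on their own lines. It does no XML unescaping or title normalization, and the function names are made up.

```python
import re

# Line-oriented patterns; they assume one element per line.
TOPIC_RE = re.compile(r'<Topic r:id="([^"]*)">')
TITLE_RE = re.compile(r"<d:Title>(.*?)</d:Title>")

def topic_title_pairs(lines):
    """Yield (r:id, title) for each Topic that is followed by a d:Title."""
    current = None
    for line in lines:
        match = TOPIC_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = TITLE_RE.search(line)
        if match and current is not None:
            yield current, match.group(1)
            current = None

def mismatched_topics(pairs, allowed=("", "Top")):
    """Return pairs whose r:id does not end with "/" + title."""
    return [
        (topic_id, title)
        for topic_id, title in pairs
        if topic_id not in allowed and not topic_id.endswith("/" + title)
    ]

# Usage on a real dump (the path is an assumption):
# with open("structure.rdf.u8", encoding="utf-8", errors="replace") as fh:
#     for topic_id, title in mismatched_topics(topic_title_pairs(fh)):
#         print(topic_id, "->", title)
```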
 