Dangling links to categories in ODP dump.

IZh · Jan 9, 2006

Hello!

While importing ODP data I found that there are many links (related, symbolic, etc.) to categories which don't exist in structure.rdf.u8. Some links can be resolved by looking up the corresponding category in redirect.rdf.u8, but others are completely broken.

I don't know how, but broken links are not displayed on the site (despite their presence in the ODP data).

Anyway, I think that having such links in the ODP data is not good, because structure.rdf.u8 and content.rdf.u8 alone are not enough to import the data. One has to try to resolve category links through redirect.rdf.u8 and remove the links that can't be resolved. That is not very convenient.
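The resolve-or-remove step described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the category ids from structure.rdf.u8 have already been loaded into a set and the redirects from redirect.rdf.u8 into a dict. The names `resolve_link`, `clean_links`, `topics`, `redirects` and all sample data are hypothetical.

```python
def resolve_link(target, topics, redirects):
    """Return an existing category id for target, or None if it is dangling.

    topics    -- set of category ids defined in structure.rdf.u8 (assumed preloaded)
    redirects -- dict mapping old id -> new id from redirect.rdf.u8 (assumed preloaded)
    """
    seen = set()
    current = target
    while current not in topics:
        # Stop on a redirect loop or when no redirect is known for this id.
        if current in seen or current not in redirects:
            return None
        seen.add(current)
        current = redirects[current]
    return current


def clean_links(links, topics, redirects):
    """Split raw link targets into (resolved ids, unresolved originals)."""
    resolved, unresolved = [], []
    for target in links:
        new_target = resolve_link(target, topics, redirects)
        if new_target is None:
            unresolved.append(target)
        else:
            resolved.append(new_target)
    return resolved, unresolved


# Hypothetical sample data, not taken from a real dump.
topics = {"Top/Arts", "Top/Science/Physics"}
redirects = {
    "Top/Sci/Physics": "Top/Science/Physics",
    "Top/A": "Top/B",  # redirect loop: must not hang
    "Top/B": "Top/A",
}
links = ["Top/Arts", "Top/Sci/Physics", "Top/A", "Top/Missing"]
resolved, unresolved = clean_links(links, topics, redirects)
```

The `seen` set is there so that a chain of redirects pointing back at itself ends as "unresolved" instead of looping forever.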

If anyone is interested, here are lists of dangling links from the recent dump:
Links resolved through redirect.rdf.u8: resolved.txt.u8.gz (43,840 bytes; 5359 category names)
Unresolved links: unresolved.txt.u8.gz (9,882 bytes; 616 category names)

I have not validated netscape-structure.rdf.u8 and netscape-content.rdf.u8 because they contain invalid UTF-8 characters (links in ISO-8859-1 encoding). So it is impossible to import the data with the parser in UTF-8 mode. One needs to import it as if it were ISO-8859-1, and later try to guess, for each link, which encoding it is in. Also, netscape-content.rdf.u8 contains many links like "javascript: window.sidebar.addpanel ('Title', 'http://url/', '')". Such links add an address to the sidebar in Mozilla-based browsers, but they will not work in MSIE, so they are not portable.
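One possible way to do the per-link guessing is to try strict UTF-8 first and fall back to ISO-8859-1 only when that fails. This is a heuristic sketch, not what any ODP tool is known to do. The function name `decode_mixed` and the sample byte strings are hypothetical.

```python
def decode_mixed(raw: bytes) -> str:
    """Decode bytes that are either UTF-8 or ISO-8859-1.

    Valid UTF-8 is preferred. Anything else falls back to ISO-8859-1,
    which maps every byte value to a character and so never fails.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


# The same (hypothetical) name stored in the two encodings.
as_utf8 = decode_mixed(b"Top/World/Fran\xc3\xa7ais")
as_latin1 = decode_mixed(b"Top/World/Fran\xe7ais")
```

The guess can be wrong in one direction: an ISO-8859-1 string whose bytes happen to form valid UTF-8 will be read as UTF-8. For category names that should be rare, but it is worth keeping in mind.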

Also, I used a hack to parse redirect.rdf.u8. The story is the same: some category links are in UTF-8 and some in ISO-8859-1. So I modified the encoding field in the first line of the file, so that the XML parser thinks the encoding is ISO-8859-1 and does not abort with errors about invalid UTF-8 characters. In practice, I don't need most of the redirects (especially all those in ISO-8859-1), only the 5359 needed to resolve links. But it is hard to filter out the unnecessary redirects. And I hope that someday we will not need the redirects file at all to import ODP data.
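For illustration, the declaration hack could look something like this in Python. It is a sketch under assumptions: the namespace URI, the tiny sample document and the helper names (`force_latin1_declaration`, `repair_utf8`, `alias_ids`) are all made up for the example, and a real dump is far too large to load in one piece like this.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI for the r: prefix; use whatever the real file declares.
R_ID = "{http://www.w3.org/TR/RDF/}id"


def force_latin1_declaration(raw: bytes) -> bytes:
    """Rewrite the encoding named in the XML declaration to ISO-8859-1."""
    return re.sub(rb'encoding=(["\'])[^"\']*\1', b'encoding="ISO-8859-1"', raw, count=1)


def repair_utf8(text: str) -> str:
    """Undo the mis-decoding of names that were really UTF-8; leave others alone."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def alias_ids(raw: bytes) -> list:
    """Parse with the patched declaration and return the repaired Alias ids."""
    root = ET.fromstring(force_latin1_declaration(raw))
    return [repair_utf8(alias.get(R_ID)) for alias in root.iter("Alias")]


# Hypothetical two-entry file mixing both encodings.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<RDF xmlns:r="http://www.w3.org/TR/RDF/">'
    b'<Alias r:id="Top/Caf\xe9"/>'      # ISO-8859-1 bytes
    b'<Alias r:id="Top/Caf\xc3\xa9"/>'  # UTF-8 bytes
    b'</RDF>'
)

try:
    ET.fromstring(sample)  # strict UTF-8 parse of the unpatched bytes
    strict_parse_ok = True
except ET.ParseError:
    strict_parse_ok = False

ids = alias_ids(sample)
```

Because ISO-8859-1 accepts every byte, the patched parse cannot fail on encoding grounds. The price is that genuine UTF-8 names come out garbled, which is why `repair_utf8` re-encodes each value and tries UTF-8 again.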

Thanks.

P.S. The first unresolved category link points to "Top/". It is not enough to remove the trailing slash from it. Look at this:

Code:

<Topic r:id="Top/Reference/Education/Colleges_and_Universities/North_America/United_States/Indiana/Bethel_College/Athletics">
<catid>5824241</catid>
...
<symbolic r:resource="/NAIA/B:Top/"/>
</Topic>

<Alias r:id="/NAIA/B:Top/">
<d:Title>/NAIA/B</d:Title>
<Target r:resource="Top/"/>
</Alias>

I don't think that a link named "/NAIA/B" should point to "Top".

arubin · Jan 9, 2006

I don't know where NAIA/B came from, but you've found a real error. Thank you.

bobrat · Jan 9, 2006

From what I recall, from doing some parsing on the RDF last year, the broken links problem has been there for some time. I also had to do a lot to work around it and prevent broken links displaying. At that time, since none of this is documented, I got the feeling that most parsing scripts out there probably do not handle the situation correctly.

IZh · Jan 10, 2006

Why not correct the links that can be corrected through redirect.rdf.u8 and move the remaining links to nonexistent categories to a separate place? Then the data will be consistent.

Also, the links to Top/Bookmarks, Top/Test and Top/UTF8 from the Top category are dangling, because there are no such topics in the ODP data. Everyone must remove them while parsing ODP data. Perhaps these links should be removed from the publicly available dumps.

arubin · Jan 10, 2006

IZh said:
Why not correct the links that can be corrected through redirect.rdf.u8 and move the remaining links to nonexistent categories to a separate place? Then the data will be consistent.

I thought "we" had a semiautomated process, which corrects the redirecting links in the next RDF. I could be wrong, of course.

IZh said:
Also, the links to Top/Bookmarks, Top/Test and Top/UTF8 from the Top category are dangling, because there are no such topics in the ODP data. Everyone must remove them while parsing ODP data. Perhaps these links should be removed from the publicly available dumps.

Isn't Bookmarks in the dump? I thought it was....

Links to UTF8 should be gone from the internal listings, as the UTF8 conversion has been completed. Test/World is accessible to public, and should be in the dump.

IZh · Jan 10, 2006

arubin said:
Isn't Bookmarks in the dump? I thought it was....

No, Top/Bookmarks can be found in redirects only.

arubin said:
Test/World is accessible to public, and should be in the dump.

Should, but is not. In the structure.rdf.u8 downloaded today there are a number of related references (such as <related r:resource="Top/Test/World/Karelian"/>) to some subcategories of Top/Test, but the categories themselves are not in the dump.

motsa · Jan 10, 2006

As far as I know, Top/Bookmarks shouldn't be in the dumps. I wouldn't expect Top/World to be there either since they are categories that are WIP.

IZh · Jan 10, 2006

motsa said:
I wouldn't expect Top/World to be there either since they are categories that are WIP.

Do you mean Top/Test/World?
By the way, as far as I can see, Top/Test/World really is accessible to the public, but Top/Test is not. And some links from Top/Test/World point to Top/Test/World_Test, which also requires a password. So dumping Top/Test/World would be difficult because of links to protected categories.

Anyway, here is a snippet of the Top category definition from the recent ODP dump:

Code:

<Topic r:id="Top">
<catid>1</catid>
<d:Title>Top</d:Title>
<lastUpdate>2005-12-19 11:36:46</lastUpdate>
...
<narrow r:resource="Top/Test"/>
...
<narrow r:resource="Top/Bookmarks"/>
...
<narrow r:resource="Top/UTF8"/>
...
</Topic>

As you can see, there are links to the Top/Test, Top/Bookmarks and Top/UTF8 subcategories, but none of these topics (or their subcategories) can be found in the dump.
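A check like the one described above can be reproduced on a small sample. The sketch below lists every narrow link whose target has no Topic of its own. The namespace URIs, the sample document (including the extra Top/Arts topic) and the function name `dangling_narrow_links` are assumptions for the example. A real dump may also declare a default namespace, in which case the tag names would need qualifying.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the r: prefix in this self-contained sample.
R = "{http://www.w3.org/TR/RDF/}"

sample = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/">
<Topic r:id="Top">
  <catid>1</catid>
  <d:Title>Top</d:Title>
  <narrow r:resource="Top/Arts"/>
  <narrow r:resource="Top/Test"/>
  <narrow r:resource="Top/Bookmarks"/>
  <narrow r:resource="Top/UTF8"/>
</Topic>
<Topic r:id="Top/Arts">
  <d:Title>Arts</d:Title>
</Topic>
</RDF>"""


def dangling_narrow_links(xml_text):
    """Return (source id, target id) for narrow links to undefined topics."""
    root = ET.fromstring(xml_text)
    # Every id that actually has a Topic element in this document.
    defined = {topic.get(R + "id") for topic in root.iter("Topic")}
    dangling = []
    for topic in root.iter("Topic"):
        for link in topic.findall("narrow"):
            target = link.get(R + "resource")
            if target not in defined:
                dangling.append((topic.get(R + "id"), target))
    return dangling


report = dangling_narrow_links(sample)
```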

sfromis · Jan 25, 2006

After each RDF dump, I'm running a program to update the ODP database for blind cross-category links, based on redirects. That fixes a few hundred (or more) links each time. Occasionally, as a result of reorgs, thousands of links have to be updated.

About the link with name /NAIA/B/, pointing to Top/ - that's a database error. I'll report that to staff, as it needs to be fixed. Thanks for reporting this issue.

FTR....

The Netscape dumps are not useful for general consumption. I'd recommend ignoring them.

Bookmarks is indeed not in the dump, as this branch is not part of the real directory. A very, very long time ago, there was a separate RDF dump for this. Any links pointing here are errors.

Test/World is public, as new editors are welcomed in these emerging language categories. As they're not "ready for prime time", they're deliberately not in the RDF dumps. A few directory categories do link to the Test/World languages. Indeed, that's a dead link in the RDF dump, but we've chosen to accept it anyway.

The strange Top links to UTF8, Test and Bookmarks are indeed removal-worthy. I'll remind staff, but I'd recommend ignoring them.

More generally, some dead links appear because the dump is created while categories may be created or deleted. Again, the simplest would be for parsers to ignore unfindable categories.
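For a parser that simply ignores unfindable categories, a two-pass streaming approach is one option: the first pass collects the ids of all topics, the second yields only links whose target is known. The sketch below is an assumption-laden illustration, not a description of any existing ODP parser. The namespace URI, the sample data and the names `topic_ids` and `live_links` are hypothetical, and only the plain narrow and related link kinds are handled.

```python
import io
import xml.etree.ElementTree as ET

R = "{http://www.w3.org/TR/RDF/}"   # assumed namespace URI for the r: prefix
LINK_TAGS = {"narrow", "related"}   # plain link kinds handled in this sketch


def topic_ids(stream):
    """Pass 1: collect the ids of all Topic elements."""
    ids = set()
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "Topic":
            ids.add(elem.get(R + "id"))
            elem.clear()  # drop the element's contents once it has been read
    return ids


def live_links(stream, known):
    """Pass 2: yield (source, kind, target) only for links to known topics."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "Topic":
            continue
        source = elem.get(R + "id")
        for child in elem:
            if child.tag in LINK_TAGS:
                target = child.get(R + "resource")
                if target in known:
                    yield source, child.tag, target
        elem.clear()


# Hypothetical sample: one live link and one link into Top/Test/World.
sample = b"""<RDF xmlns:r="http://www.w3.org/TR/RDF/">
<Topic r:id="Top/World">
  <narrow r:resource="Top/World/Deutsch"/>
  <related r:resource="Top/Test/World/Karelian"/>
</Topic>
<Topic r:id="Top/World/Deutsch"/>
</RDF>"""

known = topic_ids(io.BytesIO(sample))
links = list(live_links(io.BytesIO(sample), known))
```

Reading the file twice costs time, but it means a link can be checked even when its target is defined later in the file than the category that links to it.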

I do agree that an internally consistent RDF dump would be a plus, but I cannot promise anything in this direction.

sfromis · Jan 25, 2006

I think I've spotted what triggered the weird link, and after an update, the category Reference/Education/Colleges_and_Universities/North_America/United_States/Indiana/Bethel_College/Athletics now looks normal. Let's hope that it's also normalized in the next RDF dump.
