Hello!
While importing ODP data, I found many links (related, symbolic, etc.) to categories that do not exist in structure.rdf.u8. Some of them can be resolved by looking up the corresponding category in redirect.rdf.u8, but others are completely broken.
I don't know how, but the broken links are not displayed on the site, despite their presence in the ODP data.
In any case, I think having such links in the ODP data is bad, because structure.rdf.u8 and content.rdf.u8 alone are not enough to import the data. One has to try to resolve category links through redirect.rdf.u8 and remove those that cannot be resolved. That is not very convenient.
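For anyone doing the same import, the resolve-or-remove step can be sketched roughly like this in Python. It is only an illustration: the names resolve_link and split_links are made up, the set of categories and the redirect mapping are assumed to have been loaded already from structure.rdf.u8 and redirect.rdf.u8, and following redirects for several hops is my assumption (I have not checked whether redirects chain in the dump).

```python
def resolve_link(target, categories, redirects, max_hops=10):
    """Return the existing category a link ends up at, or None if it dangles.

    categories: set of category names known from structure.rdf.u8 (assumed loaded)
    redirects:  dict mapping old name -> new name from redirect.rdf.u8 (assumed loaded)
    max_hops:   hypothetical safety limit on redirect chains
    """
    seen = set()
    current = target
    for _ in range(max_hops):
        if current in categories:
            return current
        # Stop on a redirect loop or when no redirect exists for this name.
        if current in seen or current not in redirects:
            return None
        seen.add(current)
        current = redirects[current]
    return None


def split_links(links, categories, redirects):
    """Sort links into valid as-is, resolved via redirects, and unresolved."""
    kept, resolved, unresolved = [], {}, []
    for link in links:
        final = resolve_link(link, categories, redirects)
        if final is None:
            unresolved.append(link)   # candidates for removal
        elif final == link:
            kept.append(link)         # already points to an existing category
        else:
            resolved[link] = final    # rewrite the link to its redirect target
    return kept, resolved, unresolved
```

The unresolved list is what would be dropped from the import, and the resolved mapping is what would be rewritten.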
In case anyone is interested, here are lists of the dangling links from a recent dump:
Links resolved through redirect.rdf.u8: resolved.txt.u8.gz (43,840 bytes; 5359 category names)
Unresolved links: unresolved.txt.u8.gz (9,882 bytes; 616 category names)
I have not validated netscape-structure.rdf.u8 and netscape-content.rdf.u8 because they contain invalid UTF-8 characters (links in ISO-8859-1 encoding), so it is impossible to import them with the parser in UTF-8 mode. One has to import them as if they were ISO-8859-1 and later try to guess the encoding of each link. Also, netscape-content.rdf.u8 contains many links like "javascript: window.sidebar.addpanel ('Title', 'http://url/', '')". Such links add an address to the sidebar in Mozilla-based browsers, but they do not work in MSIE, so they are not portable.
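To illustrate the per-link guessing, here is a minimal Python sketch. It assumes the raw bytes of each link are available, and the function names decode_guess and is_sidebar_link are hypothetical. The guess is a heuristic only: a byte sequence that is valid UTF-8 could in principle still be ISO-8859-1 text, so some links may be decoded wrongly.

```python
def decode_guess(raw: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else fall back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


def is_sidebar_link(url: str) -> bool:
    """Detect the non-portable "javascript:" links (simple prefix check)."""
    return url.strip().lower().startswith("javascript:")
```

An importer could skip every link for which is_sidebar_link returns True and pass the rest through decode_guess.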
I also used a hack to parse redirect.rdf.u8, where the story is the same: some category links are in UTF-8 and some in ISO-8859-1. I changed the encoding field in the first line of the file so that the XML parser thinks the encoding is ISO-8859-1 and does not abort with errors about invalid UTF-8 characters. In practice I don't need most of the redirects (especially those in ISO-8859-1), only the 5359 used to resolve links, but it is hard to filter out the unnecessary ones. I hope that someday we will not need the redirects file at all to import ODP data.
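The hack can be reproduced roughly as follows in Python with the standard xml.etree.ElementTree module. This is not my actual import code, just a sketch under assumptions: the helper names are made up, the XML declaration is assumed to use double quotes around the encoding value, and the whole document is parsed in memory (a real dump is large and would need a streaming parser instead). After parsing as ISO-8859-1, every string that turns out to be valid UTF-8 is converted back.

```python
import re
import xml.etree.ElementTree as ET


def force_latin1_declaration(data: bytes) -> bytes:
    """Rewrite the first encoding="..." so the parser accepts any byte."""
    return re.sub(rb'encoding="[^"]*"', b'encoding="ISO-8859-1"', data, count=1)


def repair(text: str) -> str:
    """Undo the ISO-8859-1 misreading for strings that were really UTF-8."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # genuinely ISO-8859-1 (or not representable); keep as parsed


def parse_mixed(data: bytes) -> ET.Element:
    """Parse XML with mixed UTF-8 / ISO-8859-1 strings and repair them."""
    root = ET.fromstring(force_latin1_declaration(data))
    for elem in root.iter():
        for name, value in list(elem.attrib.items()):
            elem.set(name, repair(value))
        if elem.text:
            elem.text = repair(elem.text)
    return root
```

The repair step has the same ambiguity as any encoding guess, so it should be treated as a best effort rather than a guaranteed fix.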
Thanks.
P.S. The first unresolved category link points to "Top/". Simply removing the trailing slash is not enough, because I don't think a link named "/NAIA/B" should point to "Top". Look at this:
Code:
<Topic r:id="Top/Reference/Education/Colleges_and_Universities/North_America/United_States/Indiana/Bethel_College/Athletics">
  <catid>5824241</catid>
  ...
  <symbolic r:resource="/NAIA/B:Top/"/>
</Topic>
<Alias r:id="/NAIA/B:Top/">
  <d:Title>/NAIA/B</d:Title>
  <Target r:resource="Top/"/>
</Alias>