Some symbolic links don't make sense!

xuqy

Member
Joined
May 3, 2005
Messages
4
I know that symbolic links in the DMOZ category hierarchy are a device to accommodate the multi-label requirements inherent in a hierarchical classification scheme. But when I treat a symbolically linked category as a subcategory of the category that owns the link, I get some puzzling results. For example, here is a category path (">" means narrower; "=> [...]" gives the target of a symbolic link):

Top> Computers> Artificial_Intelligence> People> Papert, Seymour@ => [Top> Computers> History> Pioneers> Papert, Seymour] > Logo@ => [Top> Computers> Programming> Languages> Lisp> Logo]

All this seems fine, but here is the puzzle: the "Logo" category has a symbolic link, "Papert, Seymour@", which points to: Top> Computers> History> Pioneers> Papert, Seymour

So "Top> Computers> History> Pioneers> Papert, Seymour" and "Top> Computers> Programming> Languages> Lisp> Logo" each contain the other!

My first reaction was that I had spotted an exception, so I wrote a simple graph traversal algorithm to detect path cycles and applied it to my database of the DMOZ category structure. My conclusion is that the DMOZ category structure is by no means a DAG (directed acyclic graph); that is, it is not uncommon to come up with path cycles like the above.
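The traversal itself is not shown in this post. As a rough illustration only, a depth-first search over a graph whose edges are both ordinary subcategory links and symbolic links could look like the sketch below. The `find_cycle` name and the dictionary layout are assumptions made for this example, not the actual DMOZ database schema.

```python
from typing import Dict, List, Optional

# Hypothetical layout: category path -> categories it links to
# (ordinary subcategories and symbolic-link targets alike).
Graph = Dict[str, List[str]]


def find_cycle(graph: Graph) -> Optional[List[str]]:
    """Return one cycle as a list of categories (first == last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    for root in graph:
        if color[root] != WHITE:
            continue
        color[root] = GREY
        path = [root]                       # categories on the current DFS path
        stack = [iter(graph.get(root, []))]  # one edge iterator per path entry
        while stack:
            for child in stack[-1]:
                state = color.get(child, WHITE)
                if state == GREY:
                    # Edge back into the current path: a cycle.
                    return path[path.index(child):] + [child]
                if state == WHITE:
                    color[child] = GREY
                    path.append(child)
                    stack.append(iter(graph.get(child, [])))
                    break
            else:
                # All outgoing edges explored; backtrack.
                color[path.pop()] = BLACK
                stack.pop()
    return None


PAPERT = "Top> Computers> History> Pioneers> Papert, Seymour"
LOGO = "Top> Computers> Programming> Languages> Lisp> Logo"
example = {
    "Top> Computers> Artificial_Intelligence> People": [PAPERT],
    PAPERT: [LOGO],
    LOGO: [PAPERT],
}
print(find_cycle(example))
```

On the three categories from the example above, this reports the loop between the Papert and Logo categories. It stops at the first cycle found; listing every cycle in the full hierarchy would need a different approach.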

I am a researcher in automatic text categorization, and DMOZ is a must for me. But when I follow symbolic links as well as ordinary subcategories, I often end up in unexpected, irrelevant categories. Can you imagine that while collecting web pages from DMOZ on the topic "Artificial Intelligence", I reach a category labeled "Microsoft Office"? It seems that some symbolic links are not edited well enough or distinguished from "see also" links well enough. For the time being, I refrain from following symbolic links. That is a pity, because many symbolic links refer to truly relevant content.
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
it is not uncommon to come up with path cycles like the above.
Yes, because in some areas we have a rule that allows those exceptions. In general, when it is an author and his works, or an actor and his appearances, we use bidirectional @links instead of "related categories".

The basic idea in the ODP is that an @link should be used when the target category could be a subcategory of the current one, and the backlink should be done as a RelCat. In some cases, like the one above, there is no clear parent/child relationship, so we have to use either @link/@link or RelCat/RelCat. As you have seen, the decision was made to use @links.

Another remark: the ODP was not made to be understandable by machines; we target a human audience. And most of them should be able to handle the problem. ;-)

[EDIT]

Are you trying to do automated classification using the ODP structure? You might want to consider limiting the tool to "1 @link per path" instead of ignoring @links altogether; it should give you better fits without creating endless loops.
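A minimal sketch of that "1 @link per path" idea, assuming ordinary subcategories and @links are stored in two separate dictionaries (the names `children`, `symlinks` and `reachable` are hypothetical, chosen for this example only):

```python
from collections import deque
from typing import Dict, List, Set


def reachable(start: str,
              children: Dict[str, List[str]],
              symlinks: Dict[str, List[str]],
              max_symlinks: int = 1) -> Set[str]:
    """Categories reachable from start using at most max_symlinks @links
    on any single path; ordinary subcategory edges are always followed."""
    best = {start: 0}  # fewest @links needed to reach each category so far
    queue = deque([start])
    while queue:
        node = queue.popleft()
        used = best[node]
        steps = [(child, used) for child in children.get(node, [])]
        if used < max_symlinks:
            steps += [(target, used + 1) for target in symlinks.get(node, [])]
        for nxt, cost in steps:
            # Revisit a category only if it is reached with fewer @links,
            # so the search terminates even when the links form a cycle.
            if cost < best.get(nxt, max_symlinks + 1):
                best[nxt] = cost
                queue.append(nxt)
    return set(best)


AI = "Top> Computers> Artificial_Intelligence"
PEOPLE = AI + "> People"
PAPERT = "Top> Computers> History> Pioneers> Papert, Seymour"
LOGO = "Top> Computers> Programming> Languages> Lisp> Logo"

children = {AI: [PEOPLE]}
symlinks = {PEOPLE: [PAPERT], PAPERT: [LOGO], LOGO: [PAPERT]}

print(sorted(reachable(AI, children, symlinks, max_symlinks=1)))
```

With a limit of one @link, the walk from Artificial_Intelligence reaches the Papert category but not Logo, which would need a second @link on the same path. Raising `max_symlinks` to 2 brings Logo in, and the Papert/Logo loop still cannot cause endless traversal.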

It seems that some symbolic links are not edited well enough or distinguished from "see also" links well enough.
In the end it boils down to "The ODP is a human-edited directory". We try our best not to make mistakes, but they slip in. There is a project dealing with "clearly wrong" @links, but because of many other projects, only a small amount of work is being done on it at the moment.
 

xuqy

Member
Joined
May 3, 2005
Messages
4
Thank you so much

Whether to follow symbolic links in text categorization has puzzled me for quite some time. I have tried restricting the number of symbolic links per path to a specified threshold, and even validating symbolic links manually, which is tedious and expensive. In the web information retrieval literature, quite a lot of work has been based on DMOZ datasets, but most of it, if not all, ignores symbolic links entirely. In some text categorization scenarios this causes difficulties. Now I appreciate your difficulties as well.
I have no idea what Yahoo's approach to this problem is; maybe it has a stricter rule?
 