Duplicate Sites in RDF dump

beebware

Member
Joined
Mar 25, 2002
Messages
1,070
According to the RDF dump, there are a number of occasions where the same site with the same title and same description appears in the same category twice. Conducting a search on the ODP at http://search.dmoz.org/ shows the "2 sites in one category" problem in the search results, but does not show the problem in the actual categories themselves.

Examples:
Category: http://dmoz.org/Arts/Music/Bands_and_Artists/L
URL: http://www.geocities.com/parttimelife/
Title/Description: Lazaras - Punk/metal band. Biography, news, schedule and mp3 downloads.
Search Result exhibiting the problem: http://search.dmoz.org/cgi-bin/search?search=lazaras

Category: http://dmoz.org/Business/Business_Services/Translation/Multiple_Language/Europe/United_Kingdom/
URL: http://www.tiservicesuk.com
Title/Description: TI Services (UK) Limited - Translation, interpreting and language training by bilingual staff.
Search Result exhibiting the problem: http://search.dmoz.org/cgi-bin/search?search=interpreting+language+training+ti+services+limited

Hope the description and examples help illustrate what I mean. There may be more than those listed above, but once I found it had occurred multiple times, I put a workaround in place for the problem. Maybe suggest to staff a "duplicate site/category" detector system (I actually found these by generating a 32-character hex hash using md5hash, appending the category ID to the end of the hash, and then just scanning the list for duplicates).
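
For anyone wanting to try the same kind of check, here is a minimal sketch of the approach described above, not the actual script used. It assumes each listing has already been pulled out of the RDF dump as a (category ID, URL, title, description) tuple, and that the hash covers the URL, title and description together. The function names and the numeric category IDs in the sample are made up for illustration.

```python
import hashlib
from collections import defaultdict


def listing_key(category_id, url, title, description):
    """Return the md5 hex digest of the listing with the category ID appended."""
    # Assumption: URL, title and description are hashed together.
    payload = "\x00".join((url, title, description)).encode("utf-8")
    return hashlib.md5(payload).hexdigest() + ":" + str(category_id)


def find_duplicates(listings):
    """Group listings by key and keep only the keys seen more than once."""
    seen = defaultdict(list)
    for listing in listings:
        seen[listing_key(*listing)].append(listing)
    return {key: rows for key, rows in seen.items() if len(rows) > 1}


# Hypothetical sample rows; the category IDs 1001 and 2002 are invented.
sample = [
    (1001, "http://www.geocities.com/parttimelife/", "Lazaras",
     "Punk/metal band. Biography, news, schedule and mp3 downloads."),
    (1001, "http://www.geocities.com/parttimelife/", "Lazaras",
     "Punk/metal band. Biography, news, schedule and mp3 downloads."),
    (2002, "http://www.tiservicesuk.com", "TI Services (UK) Limited",
     "Translation, interpreting and language training by bilingual staff."),
]

duplicates = find_duplicates(sample)
for key, rows in duplicates.items():
    print(key, "appears", len(rows), "times")
```

Appending the category ID means the same site listed in two different categories is not flagged; only repeats within one category are.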
 

sfromis

Member
Joined
Mar 25, 2002
Messages
202
Thanks for reporting this. :) I've posted a copy of your findings internally.
 