Standardising categories

While trying to extract all of the UK related information from the RDF dumps, I have found that the categories used vary widely in their naming schemes.

At first, I assumed that everything from the UK would be included under the Regional/United_Kingdom category. How wrong I was.

I then started to find other United_Kingdom categories spread around the directory. Not a problem.

However, having delved more deeply, I found that each of the following is used at some stage to name a UK-related category outside the UK regional category, without any reference to 'UK':

Scotland
Scottish
English
England
Wales
Welsh
Cymraeg
Highland
Celtic
Northern_Ireland

While this isn't a huge problem, it does mean that there is no standard way for people to find 'uk' related information. It may, at some stage in the future, be worth looking at standardising the naming conventions, perhaps by enforcing a 'United_Kingdom' category above any national subcats.

For example, instead of having 'scottish' and 'english' categories under soccer, these might be better placed under an umbrella 'united kingdom' cat, with 'scottish' and 'english' forming subcategories.
 

lissa

Member
Joined
Mar 25, 2002
Messages
918
There are four aspects to the problem you have raised:

1. Specifically for the UK, there is an option box to be checked if the site is related to the UK. The intention of this is to allow extracting those sites, no matter where they are in the directory structure. I do not know how this piece of data appears in the RDF dump. I'm also pretty sure that not every nook and cranny in the directory has been gone through to check the box on appropriate sites since the feature was added. But this is one method that may be expanded in the future to address your concern.

2. One method to try to maintain consistency in naming between the topical and regional portions of the directory is the use of a preferred terms list. This doesn't cover every possible term, but it might be possible to add to it, depending upon how widespread a particular term is.

3. The only specific rule about regional names in the topical portion of the directory is that if the category in question is broken down by physical location, then it must follow the regional structure and naming convention, either listing the countries directly as a subcat or by creating regions as subcat with the countries in them. Any categories in your examples that don't follow this should be renamed.

4. Related category links would be appropriate in some of your examples between the regional and topical portion of the directory.

The challenge (and fun!) of being an editor is finding this type of stuff and straightening it out. Discussing finds like this and coordinating fixes for them makes up the bulk of the conversations in the internal forums. The truly hard part is that it is impossible for any single person to have a full grasp of everything in the directory, especially because it changes every day, so there is always more to clean up!

In answer to your particular concern, if a higher level UK editor doesn't happen to comment in this forum, please consider emailing them directly and perhaps they will be able to improve some of the naming, UK site tags, and related links.
 

To briefly add a few thoughts to what lissa said,

It may help to read the introduction of http://dmoz.org/regionalguidelines.html which tries to explain what the Regional directory tree is about.

Note that the United Kingdom part of the directory normally uses British English, rather than American English, which may account for some naming differences. Aside from that, whilst we try to keep to standard naming conventions, inevitably oddities arise.

Ethnicity and locality are two different things. For example, many sites in http://dmoz.org/Society/Ethnicity/Welsh/ relate to communities outside of the United Kingdom.

Some of the topical examples you mention are specifically related to the United Kingdom as a geographical region. But they may be of interest to people outside the United Kingdom as well. For example, http://dmoz.org/Society/History/By_Time_Period/Seventeenth_Century/Wars_and_Conflicts/English_Civil_War/ could probably be in both Society and United Kingdom. It's in Society, but look in http://dmoz.org/Regional/Europe/United_Kingdom/England/Society_and_Culture/History/ and you will see Civil_War@, so hopefully it can still be found easily from a United Kingdom perspective.
 

gti96

Member
Joined
Feb 28, 2002
Messages
180
<For example, instead of having 'scottish' and 'english' categories under soccer, these might be better placed under an umbrella 'united kingdom' cat, with 'scottish' and 'english' forming subcategories.>

Exactly what categories are you referencing? I couldn't find any Scottish subcats under Soccer. Or are you referring to this cat?
http://dmoz.org/Sports/Soccer/UEFA/
 

Yes, that category might be better off containing 'United Kingdom' with Scotland, England, Wales and Northern Ireland as subcategories. This is just a suggestion for discussion - I am not saying that I know better!

Part of the problem is finding UK sites when the naming conventions aren't consistent. Someone else said that the categories are supposed to be flagged as 'UK'.

However, taking this football example, searches for categories don't turn up much:

http://search.dmoz.org/cgi-bin/search?search=uk+soccer
http://search.dmoz.org/cgi-bin/search?search=uk%20football&utf8=1&locale=en_us&morecat=1

The second search does list England, but none of the others. I would guess that if they were under a 'United Kingdom' umbrella they would have come up under the same searches.
 

timksi,
I think that again, the problem is that the categories aren't flagged as 'uk', so that there are a number of searches that they don't appear for:

http://search.dmoz.org/cgi-bin/search?search=uk+civil+war
http://search.dmoz.org/cgi-bin/search?search=british+civil+war

I think that if all the UK-related categories (including the 'ethnicity' ones) were flagged to appear for UK-related searches, the problem would be solved. It is not a big problem by any means, but might be worth looking at to improve the directory in the future.

All of this came about because I was just trying to extract the UK related information, which is very difficult to fully automate, due to the massive spread of categories and different naming conventions.
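To give an idea of what the matching involves, here is a minimal sketch of a term filter. It assumes the category paths have already been pulled out of the dump as plain strings; the function name and the sample list are made up for illustration and are not part of any ODP tool.

```python
# Terms from the list earlier in this thread, plus United_Kingdom itself.
UK_TERMS = [
    "united_kingdom", "scotland", "scottish", "english", "england",
    "wales", "welsh", "cymraeg", "highland", "celtic", "northern_ireland",
]

def matched_terms(path):
    """Return the UK terms that appear as whole words in any segment of a category path."""
    found = set()
    for segment in path.split("/"):
        # Pad with underscores so a term only matches on word boundaries,
        # e.g. "english" matches "English_Civil_War" but not "Englishman".
        padded = "_" + segment.lower() + "_"
        for term in UK_TERMS:
            if "_" + term + "_" in padded:
                found.add(term)
    return sorted(found)

# Sample paths (hypothetical input list; the paths themselves appear in this thread).
paths = [
    "Regional/Europe/United_Kingdom/England/Society_and_Culture/History",
    "Society/Ethnicity/Welsh",
    "Sports/Soccer/UEFA",
    "World/Cymraeg",
]

# Keep only the paths where at least one term matched.
candidates = [p for p in paths if matched_terms(p)]
```

A match only makes a path a candidate for review. A term such as Welsh can also name an ethnicity category whose sites are not all UK-based, and Sports/Soccer/UEFA is missed entirely because the country names sit below it rather than in its own path.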
 

gti96

Member
Joined
Feb 28, 2002
Messages
180
Ahhh.

http://dmoz.org/Sports/Soccer/UEFA/ is a Sports category. The Sports branch of the directory is considered a topical branch and not a regional branch. That means that the ontology of the categories within it should reflect the Sport, i.e. Soccer, and not (in this example) the UK.

To get into it a bit more, there is no UK soccer. There is Soccer in England and Scotland and all the other countries listed under UEFA, but there is no UK soccer and there is no governing body of UK soccer. That is why there is no UK subcat with the countries below it.

However, in http://dmoz.org/Regional/Europe/United_Kingdom/Recreation_and_Sports/Sports/Football/
which is the Regional branch, I think you will see the breakdown you proposed.
 

I think the "flag" that lissa refers to is a checkbox on the submission form. This is a test for a proposed geographic coding system, and UK is the only region represented. I think they're trying to work out a way for users of the data to pull out specific regions. But that system is a long way from completion.

As it is now, the only way to identify a site by region is by its category, title, or description.
 

In a similar vein to gti96's point, the English Civil War is to the best of my knowledge not commonly referred to as the British Civil War, nor as the UK Civil War.

I think it is important to consider that ODP is primarily a directory, not a search engine. The search tool can be convenient, but is not an absolute guide. For example, a search on http://search.dmoz.org/cgi-bin/search?search=whatever&all=no&cs=&cat=Regional%2FEurope%2FUnited_Kingdom only shows results lying beneath Regional/Europe/United_Kingdom, not everything that might be United Kingdom related. If you want to find as much United Kingdom related content as possible, you need to use the directory as a directory.

If you have the technical knowledge, you might consider examining all of the related links and @links under Regional/Europe/United_Kingdom and bringing those categories and their sub-categories into your dataset. You'll miss some potentially related topics, but I think you'll capture most.
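As a rough sketch of that approach, assume the dump has already been parsed into two plain dictionaries: one mapping each category to its sub-categories, and one mapping each category to the targets of its related links and @links. Both dictionaries, the function names, and the shortened toy paths below are assumptions for illustration only; how you build them from the RDF is up to you.

```python
def subtree(root, children):
    """Return root plus every category reachable from it through sub-categories."""
    seen = set()
    stack = [root]
    while stack:
        cat = stack.pop()
        if cat in seen:
            continue
        seen.add(cat)
        stack.extend(children.get(cat, []))
    return seen

def collect_uk_categories(root, children, links):
    """Regional subtree, plus the subtree of every category it links to."""
    regional = subtree(root, children)
    result = set(regional)
    for cat in regional:
        for target in links.get(cat, []):
            # Bring in the linked category and its sub-categories,
            # but do not follow links any further from there.
            result |= subtree(target, children)
    return result

# Toy data with shortened, partly invented paths.
children = {
    "Regional/Europe/United_Kingdom": ["Regional/Europe/United_Kingdom/England"],
    "Society/History/English_Civil_War": ["Society/History/English_Civil_War/Battles"],
}
links = {
    "Regional/Europe/United_Kingdom/England": ["Society/History/English_Civil_War"],
}

uk_categories = collect_uk_categories("Regional/Europe/United_Kingdom", children, links)
```

Links are followed only one step out of the Regional tree here, on the assumption that chasing them further would quickly drift away from United Kingdom content.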
 

stevesliva

Member
Joined
Mar 28, 2002
Messages
80
With regard to Cymraeg ... World/Cymraeg/ is a category for sites in that language. All non-English language sites are listed in World/[Language].
 