Reorganization and why dmoz human editing fails

JOE3656

Human-only editing has reached its limit.

Many submitters vent frustration at the dmoz, not realizing that editing follows the same algorithmic laws that limit every computer program based on trees. Human editing of the web is a losing proposition, even under the following assumptions, which are highly favorable to dmoz. Organizing or reorganizing just 500,000 sites takes at least millions of node visits, following an n log(n) visitation bound (not exactly right, since the structure is closer to a b*-tree, but close enough). A subcategory with 4000 sites requires at least 12,000 visits to remain somewhat organized (less than a year out of date).

Assume that 5% of the websites become dead links every year. Assume also that 4 minutes per site per year is spent managing or adding each site. 500,000 sites works out to 2 million minutes, and as the average depth of the tree reaches a fourth or fifth level and the tree widens, that 3-4 minute figure gets longer (25,000 hours per year, minimum). Reach a million sites and the demand for time more than doubles. Reach the 4 million sites currently listed and you need roughly 266,000 hours per year just to eliminate and replace 200,000 dead sites. Growth is not even mentioned.
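For what it's worth, a quick back-of-the-envelope script makes the arithmetic explicit (the flat 4-minutes-per-site cost and the branching factor of 10 are my own illustrative assumptions, not dmoz figures):

import math

def yearly_edit_hours(sites, minutes_per_site=4):
    # Hours per year if every listed site gets a flat per-site time cost.
    return sites * minutes_per_site / 60

def tree_visits(sites, branching=10):
    # Rough n*log_b(n) visit count to keep a b-tree-like category tree in order.
    return sites * math.log(sites, branching)

for n in (500_000, 1_000_000, 4_000_000):
    print(f"{n:>9,} sites: ~{yearly_edit_hours(n):,.0f} hours/yr, ~{tree_visits(n):,.0f} node visits")
# 500,000 sites -> ~33,333 hours/yr and ~2.8 million visits; 4,000,000 -> ~266,667 hours/yr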

Will 65,000 editors each put in the 4 hours per year needed just to hold the current level? Not likely. Do they always choose the best and most interesting sites? No; that is not in the design of dmoz.

A category like Recreational Boating (dmoz lists only 4000 sites) can barely keep up with the tens of thousands of useful sites (yes, I am aware of the submission process). For example, there are 1000+ yacht clubs and 1000 marinas in the US, almost all of which have websites of interest, so the category should hold far more than the 4000 sites listed. A promised reorg was bound to fail to be completed on the boating editor's stated schedule, because he would have needed more than 6 weeks of dedicated effort to reorganize the sites by region. Editors have families and commitments.

The dmoz approach of human-only editing is algorithmically naive and needs to be reexamined. The dmoz must use Google-like Bayesian categorization tools as the primary selector, with suggestions and exception handling provided by human intervention. Any other approach is a sure loser, a pyramid scheme. By the way, it's not a lack of understanding about the dmoz that prompts me to write this, so re-explaining the dmoz won't change my mind.

So what is dmoz really doing? Not much. It's not steering users to paid "sponsored" sites the way Google does, but it's not keeping up either.

Dmoz approaches must be realistically based on algorithmic laws and that means the use of automated methods (assuming it wishes to continue with any level of success). Editors must have tools to search the InterNIC databases and to select useful sites.

By success, I mean being the first choice of end users for finding the open and free information they seek without hidden steering to "sponsors".

Thanks
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
Re: Reorganization and why dmoz human editing fail

>> A promised reorg was bound to fail to be completed on the boating editor's stated schedule, because he would have needed more than 6 weeks of dedicated effort to reorganize the sites by region. Editors have families and commitments.

It hasn't failed; it has simply failed to meet your deadline expectations. The process is still ongoing and will complete eventually.

>> The dmoz approach of human-only editing is algorithmically naive and needs to be reexamined. The dmoz must use Google-like Bayesian categorization tools as the primary selector, with suggestions and exception handling provided by human intervention. Any other approach is a sure loser, a pyramid scheme.

We mustn't do anything. And we're not Google or any other search engine for that matter.

>> By the way, it's not a lack of understanding about the dmoz that prompts me to write this, so re-explaining the dmoz won't change my mind.

It's not a lack of understanding of how the ODP works but a severe lack of understanding of how the ODP community works. And, no offence, we're not too bothered if you go through life with that lack of understanding and never change your mind.

>> So what is dmoz really doing? Not much. It's not steering users to paid "sponsored" sites the way Google does, but it's not keeping up either.

Depends on your definition of keeping up. We add thousands of net new sites every day. If you don't feel that the ODP is doing much, why do you feel it is important enough that you come here regularly to gripe? Why not just ignore us?
 

Sunanda

Member
Joined
Jun 15, 2003
Messages
248
Joe, you are writing to the wrong people.

We're humans. We like editing. So far (note the qualification) we've been doing it better than other approaches.

If the time has come for other approaches, then maybe it is time for other people to take up the challenge of creating the most comprehensive algorithmically-derived directory of the Web.

Why not get together with a small band of like-minded volunteers and see what you can come up with?

If it's demonstrably better than the ODP, then all the downstream users (from Google onwards) will swap over to your system in a trice.

But simply claiming Bayesian is better is just a claim. I'd like to see a demonstration, please.
 

spectregunner

Member
Joined
Jan 23, 2003
Messages
8,768
Many submitters vent frustration at the dmoz....

The above mistakenly suggests that ODP exists to please submitters. Nothing could be further from the truth.

Speaking for myself, I'm very busy helping to build a directory that is useful to the people who will use it. The wants, needs, desires and emotions of the submitters are really of little interest to me. If ODP ever completely shut off submissions, I would personally cheer, because my efficiency in building a directory would improve: there would be no spam, no mirrors, no attempts to defraud, no webmasters submitting 200 different websites in 200 different categories that I would have to deal with.

Of course, shutting off submissions is not even being actively considered, but you need to understand that you, personally, are part of the problem, not part of the solution. You appear like clockwork to complain that your site is not listed in a category that is undergoing reorganization. You toss about useless statistics that reveal your cluelessness as to how editing occurs, how long it takes, what is involved in doing a reorg and what motivates editors.

Rather than exercising us, you might be better served by finding ways to get your sites listed in all of those other search engines and directories that you feel are better managed than ODP, and just let us do our thing, quietly adding thousands of sites every single day -- for the love of doing so.
 

pvgool

kEditall/kCatmv
Curlie Meta
Joined
Oct 8, 2002
Messages
10,093
By success, I mean being the first choice of end users for finding the open and free information they seek without hidden steering to "sponsors".
And another big misunderstanding.
DMOZ was not built to be a search engine by itself.
DMOZ offers its content for free to be used by others. End users should search through the sites that are using our content.
 

xixtas01

Member
Joined
Jun 16, 2003
Messages
624
We do not edit with paper and pencil. We do have tools, and sites are automatically checked (yes, gasp!) by computer for quality and maintenance issues.

My recent experience in Houston leads me to believe that one person can check, organize and maintain 4000 sites with considerably less than 6 person-weeks of effort, so however those numbers are derived, our real-world experience tells us that they are incorrect.

I don't see you as part of the problem, myself. The solution you have suggested above seems off-target to me; perhaps I don't understand it.

The dmoz must use Google-like Bayesian categorization tools as the primary selector, with suggestions and exception handling provided by human intervention. Any other approach is a sure loser, a pyramid scheme.

Editors are capable of using (and do use) Google and other "Bayesian" algorithmic routines to check sites. Surely you're not suggesting that we write software that would do what Google does, except be much less prone to being fooled by spam, mirrors and affiliate content, so that the results could be automatically included in the directory. I have the utmost respect for our programming staff, but forgive me for suggesting this is beyond scope.

Having said that, I'd like to echo the sentiments of the other editors and invite you to try to create a directory this way. Perhaps I simply lack imagination.

FWIW, bayesian
 

JOE3656

Spectre: The end users, as stated in the original posting, are those searching for information. They should not be steered to sites that paid more for their submission to show up higher on the list, either.

(No other large, fully open search option exists other than dmoz. For a frightening look at how steering happens, go to searchengines.com and read about the recent mergers of search engine companies.)

Nor am I clueless, nor simply concerned with "listing my site"; it's just that the current approach must fail over time. I attempted to become part of the solution, but at the time the option was (and remains) closed in "the area of my interest, the thing we are best at". Becoming an editor is not a solution to the problem, because I would still end up fixing the problem in a suboptimal manner (in a small area), and the parallelization of multiple human editors doesn't really help. Dmoz is tree based, and that means that its efficiency is bound to searching and maintaining trees. Fixing the dmoz approach is the best solution.

OK, here is the start of a useful solution that the dmoz editors could use: a classifier that accepts a keyword list but is tuned with knowledge of specific dmoz problems.

Since at least one of you asked nicely for a solution, can you set some specific requirements for it? Please recommend additions or deletions here as necessary.

Possible requirement #1 - Quick classification for any tree or subtree.
- Resolving regionalization: the classifier must have regionalization and linguistic information. (Should the bot be able to move a site into a regional category with some, none, or all human intervention?)
Possible requirement #2 - A bot must maintain the state of edits of a tree or subtree. Once a site has been human edited, the bot must be able to skip it for recategorization (unless the site goes dead). (A rough sketch of this requirement follows below.)
- ... Other user requirements? (Read InterNIC files, check aliveness, etc.)
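As a minimal sketch of requirement #2 (the Site fields, the keyword scoring and the bot_pass helper are all hypothetical, invented here for illustration, not anything dmoz actually runs):

from collections import Counter
from dataclasses import dataclass
import math

@dataclass
class Site:
    url: str
    text: str                   # fetched page text
    human_edited: bool = False  # set once an editor has touched the listing
    alive: bool = True          # filled in by a link-checking pass

class KeywordBayesClassifier:
    """Very small Naive-Bayes-style classifier over bag-of-words counts."""
    def __init__(self):
        self.word_counts = {}    # category -> Counter of words
        self.totals = Counter()  # category -> total word count

    def train(self, category, text):
        words = text.lower().split()
        self.word_counts.setdefault(category, Counter()).update(words)
        self.totals[category] += len(words)

    def classify(self, text):
        words = text.lower().split()
        best, best_score = None, float("-inf")
        for cat, counts in self.word_counts.items():
            # sum of smoothed log-scores of each word under the category
            score = sum(math.log((counts[w] + 1) / (self.totals[cat] + len(counts)))
                        for w in words)
            if score > best_score:
                best, best_score = cat, score
        return best

def bot_pass(sites, classifier):
    """Suggest categories only for sites no human has already filed."""
    suggestions = {}
    for site in sites:
        if site.human_edited and site.alive:
            continue            # requirement #2: leave human-edited, live sites alone
        suggestions[site.url] = classifier.classify(site.text)
    return suggestions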

Remember, what this entails is a bot-human approach: take what works best in spider search engines and what works best in human approaches, with humans as the final decision-makers who set the rules. A human may not always directly edit or add sites; he or she uses the bot to manage an area of the dmoz.

So are many dmoz editors in general willing to accept a bot assistant that works like the current engines? It is a change in dmoz philosophy.

JOE
 

thehelper

Member
Joined
Mar 26, 2002
Messages
4,996
I would support a tool that pulled all the links from a site and gave me the option of putting them in my bookmarks or something (it probably already exists), but in no way, shape, form or fashion do I want to automatically ADD anything that I do not review first.
 

pvgool

kEditall/kCatmv
Curlie Meta
Joined
Oct 8, 2002
Messages
10,093
So are many dmoz editors in general willing to accept a bot assistant,
I think we are. I only have to look at all the bots we already have. Most stuff that can be put in a program to help us (like, as you mentioned, reading InterNIC files and checking aliveness) is already available. Some will never be built, as it is almost impossible to put human irrational thinking into program logic.
 

donaldb

Member
Joined
Mar 25, 2002
Messages
5,146
Why are you under the impression that we have no automated tasks? Yes, part of the Boating reorganization has to be done manually, but not all of it is. The reason for the manual task is that the sites that are moving to Regional areas need to be examined one by one. How would a bot help there? I can't imagine any bot that is going to be smart enough to search through a web site and figure out what town the marina is located in. Not all 4000 sites are being moved to Regional. Not all 4000 sites are even being moved. Some parts of the Boating structure will not change one bit. Some will drastically change. Some categories will be moved automatically with the tools that we use to move categories. Some have already been moved. What's the rush here? No one is going to die if it takes a while for us to move some boating web sites around in our directory. When we're finished, we'll be finished and we will have accomplished our goal - no failure :)
 

chris22rv

Re: Reorganization and why dmoz human editing fail

JOE3656
Dmoz is tree based, and that means that its efficiency is bound to searching and maintaining trees.

This is a very good point. While DMOZ does attempt to get around the parent/child structure by allowing @links, it's not enough to do it category by category; sites themselves should belong to multiple categories.

My own experience with classified newspapers and editing DMOZ has shown me that as an editor you have to impose what is, in your opinion, the most significant attribute of a site; e.g. if it's a pet clothing site, does it come under pets or clothing? The same quandary came up with classified newspapers, where you'd have cars for sale and the paper's classification system was Cars/Fords/Escorts/, and then you'd have your actual car ad, which might say it was an estate (station wagon) and green, and cost £5,000. Now, if I wanted a Ford Escort, it'd be fairly easy to find this car, but if I just wanted a green estate car for £5,000 or less, I'd have to read nearly every ad. Once these ads were transferred to the web, it became much easier to find just the ads you wanted. It basically enabled the switch from one dimension to as many dimensions as you require.
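To make that concrete, here is a toy sketch (the ads, fields and prices are invented for illustration): once each ad carries its attributes as data, any attribute can drive the search, not just the make/model drill-down.

# Toy data: each ad is a flat record of attributes rather than a leaf in a fixed tree.
ads = [
    {"make": "Ford", "model": "Escort", "body": "estate", "colour": "green", "price": 4800},
    {"make": "Ford", "model": "Focus", "body": "hatchback", "colour": "blue", "price": 6200},
    {"make": "Vauxhall", "model": "Astra", "body": "estate", "colour": "green", "price": 4500},
]

# The reader who wants "a green estate for £5,000 or less" filters directly,
# without first reading every Ford ad:
matches = [ad for ad in ads
           if ad["body"] == "estate" and ad["colour"] == "green" and ad["price"] <= 5000]
print(matches)  # both the Escort and the Astra turn up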

The biggest problems I see with this approach at DMOZ are the endless arguments about the single best category to list a site in, and the problems with locations. As far as the user is concerned, I don't think they really care whether a site is listed once, twice or 200 times, as long as they find it in the category they expect.

There are advantages to a tree structure, particularly in being ruthless about policing the content of categories, but at DMOZ there are arbitrary rules about what goes into a category, and these aren't always what you'd expect as a user.

For instance, my own site, which compares (amongst other things) the rates and extras provided by all the major UK credit cards and returns the latest rates and conditions for all of them on a single page, would be a very useful addition to the category: http://dmoz.org/Regional/Europe/United_Kingdom/Business_and_Economy/Financial_Services/Personal_Finance/Credit_Cards/

But because someone decided 'these are the rules' at some point in the past, it wasn't allowed to go in there; instead it's at:

http://dmoz.org/Regional/Europe/United_Kingdom/Recreation_and_Sports/Home_and_Garden/Consumer_Information/Price_Comparisons/

This makes no sense to someone using the tree structure at all. Would you look under 'Recreation and Sports/Home and Garden' to find a credit card comparison tool? Now I know from previous discussions that a DMOZ editor reading that will say, "Ah! I see, you're a disgruntled submitter, so I can now dismiss everything else you say." However, I'm also a DMOZ editor, and I would also be a DMOZ user much more often if I could ever find the sites I want. I recently wanted to find a computer shop near me which sold TFT monitors; just try to find that at DMOZ yourself.

I think DMOZ should allow sites to be in as many categories as are relevant, with a long term aim of having the site very loosely connected to value-based categories. Thus my site could be in the 'Price Comparisons' cat, and the 'Credit cards' cat, and the 'financial' cat, and the 'money' cat, and the 'personal finances' cat.

Tree structures are inherently bad: they require constant tweaking, often involving very labour-intensive reorganisations; they don't allow much flexibility; and they are hugely subjective, relying on someone to decide in advance which structure is best and then forcing the available data into that structure.
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
Re: Reorganization and why dmoz human editing fail

>> I think DMOZ should allow sites to be in as many categories as are relevant, with a long term aim of having the site very loosely connected to value-based categories. Thus my site could be in the 'Price Comparisons' cat, and the 'Credit cards' cat, and the 'financial' cat, and the 'money' cat, and the 'personal finances' cat.

Then I think you're looking for Yahoo -- they like to do that. A lot. To us, what you're describing is known as spam and as an editor, I would have hoped you'd have some understanding of that concept by now.
 

lissa

Member
Joined
Mar 25, 2002
Messages
918
You have some interesting ideas, JOE3656, but I think you'd get a better reception if you hadn't started by claiming "dmoz human editing fails" when it has in fact built the best directory currently on the internet. ;)

I agree that the growth of the ODP is not keeping pace with the growth of the internet, even considering only the sites we would want to list and not all the affiliate, spam, and junk sites. However, the ODP is still steadily growing and improving the quality of existing listings.

One thing that I think ODP is currently suffering from is the rapid early growth and individual methods for categorization that flourished all over. Some areas got bogged down with the initial method of categorization - and there are small, medium, and large reorganizations occurring all over to fix those problems. Reorganization takes time, and the majority of the time the hardest and longest part is deciding what the structure should be. Yes, sorting the sites is time-consuming, but that's the easiest part.

For example, there was "the great Regional reorg" about 3-4 years ago. The actual category moves occurred over a short period of time, but all the fine tuning with the new template took more than a year. However, the new structure is much more maintainable, accommodates a lot of growth, and is amenable to tune-ups in smaller areas. A year and a half ago, the entire Business branch underwent a full reorg from the top level down. After the major reorg, each main industry is in the process of a more detailed internal reorg. About 1/3 are done, 1/3 are in process, and 1/3 have yet to be started. Reorgs take a lot of time, and I don't really see that there is much opportunity for tool development to really help speed up the process.

What is really hard to understand is that areas that get reorged end up being so much more efficient to maintain than they were previously. The perceived "time lost" during a reorg is quickly made up, and then surpassed afterwards.

as the average depth of the tree reaches a fourth or fifth level and the tree widens, that 3-4 minute figure gets longer.

You seem to have made the incorrect assumption that as the tree gets deeper, it takes longer to maintain the sites. The time cost is the same, but as you get higher in the tree, there are fewer specific editors able to access the sites (although these tend to be the editors who spend the most time on ODP.)

that 4 minutes per site per year is spent managing or adding each site.

At best, adding a site known to be good takes ~5 minutes (a one-time cost). On average, across the directory, considering the time spent verifying that a site is good or finding it in the first place, ~15 minutes would be more representative.

Maintaining the listed sites takes much less. We have tools to check the status of sites and flag ones needing attention. ~5% of the sites per year needing to be checked is the right order of magnitude, and ~5 minutes each for investigation and fixing/deleting is about right.
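Plugging in those figures (and using the 4 million listed sites mentioned earlier in the thread, which is my assumption here) gives a rough maintenance budget:

listed_sites = 4_000_000           # directory size quoted earlier in the thread
needs_check = listed_sites * 0.05  # ~5% flagged per year
hours = needs_check * 5 / 60       # ~5 minutes each to investigate and fix/delete
print(f"{needs_check:,.0f} sites/yr, ~{hours:,.0f} editor-hours/yr")
# -> 200,000 sites/yr, ~16,667 editor-hours/yr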

What is really hard to estimate is the time that editors spend learning, reading, discussing ideas, planning reorgs, managing unreviewed, deleting spam, improving directory quality, joining new editors, investigating abuse, mentoring, etc. etc.

I like to think about it in terms of editor-hours per year. There are roughly 10,000 active editors (strangely, this number has stayed between 8,000 and 12,000 for the last 3 years). Here's my guesstimate at the time put in (it may be wildly wrong, so take it with a grain of salt):

100 editors x 40 hr/wk x 50 wk = 200,000 hr/yr
400 editors x 20 hr/wk x 50 wk = 400,000 hr/yr
500 editors x 10 hr/wk x 50 wk = 250,000 hr/yr
4,000 editors x 1 hr/wk x 50 wk = 200,000 hr/yr
5,000 editors x 5 hr/yr = 25,000 hr/yr
Total = 1,075,000 editor-hr/yr

I would guess we are growing by 500,000 sites a year, which probably means a similar number has also been edited, moved or deleted. So call it 1,000,000 sites edited per year. That works out to roughly 1 editor-hour per site per year, which includes all the editor activity, not just direct editing of listed sites.

So what does all that mean? Besides that we do a lot of work, I don't know. ;) But it gives a more realistic picture than your numbers.

We aren't yet at the point where maintaining the directory uses all our editor-power, and I foresee being able to maintain the pace of growth for the next year or two. If we want to increase the pace of growth, we need more efficient internal processes and more editors. Both of these are things we constantly work on.

Dmoz approaches must be realistically based on algorithmic laws and that means the use of automated methods

We are always in search of new tools and methods to improve the way we do things. Some are things editors themselves can create, some would require extensive changes to the programming behind the ODP (for example, it's not actually a database; the data is stored in a series of flat files), and some ideas are so different that we aren't talking about a human-edited directory anymore.

I'm not really sure where your ideas fit on that continuum, but hopefully my random discussion will help you formulate some more concrete ideas that we can use.

:cool:
 

chris22rv

Re: Reorganization and why dmoz human editing fail

Motsa
To us, what you're describing is known as spam and as an editor, I would have hoped you'd have some understanding of that concept by now

Slapping the 'spam' label on it is a little over-simplistic: "We don't have to talk about it any more - it's spam." Well, I do want to talk about it.

Appearing more than once in relevant categories is not a spam problem unless you're performing a search on the database - like Google - and getting 20 identical results. DMOZ has chosen not to be built for keyword searching (at least I hope it's a decision, because keyword searching at DMOZ is awful; half the time even the name of the site doesn't turn it up). If, as a user, you're browsing one category, then seeing sites that are relevant is not a problem.

Too many DMOZ editors seem to think their job is to keep sites out. Why shouldn't the site of a shop selling dogs and cats be in both the dogs and the cats categories, instead of us deciding that it HAS to go into the /Pets/ parent category? As a user I'm only going to look in the /Pets/Cats/ cat, and perhaps ignore the /Pets/ cat as too broad.

It's almost an obsession with DMOZ editors to find the 'right' cat for a site, usually with very little regard for the user, who doesn't have access to the behind-the-scenes ontology debates and doesn't necessarily want to read the guidelines of every category.

The tree structure itself is fundamentally a problem; there's only a point to it if you believe that people use it to drill down. My last company in Holland was in a government database listing companies by type. To find our company you first had to go to one of the 10 top-level company types. For us, who were internet database system architects, you had to start by looking under 'Transport', because we were under Transport/Communication/Internet/Developers/.

How likely is it you'd start there?

Even if they came up with a 'better' structure, there would still be problems. The tree structure is based on the principle that there is one single most important classification to put on a site (and that we as editors know what it is).

The fact that each user might have a different idea about that is treated as irrelevant.

You've seen my car classified ad example. It would be very unusual for a site to fit into only one category. If it does, it's often because we've over-classified to the point where there is just one site in a cat. I've come from a classified ad background, where we spent hours debating categorisation, and it takes a while to do the Zen thing and just let it go. On the web, you soon realise, you have the freedom not to be bound by classification - it's just another property of your database item. Just because 90% of people think that one particular property is THE important one, that's no reason to make life difficult for the other 10%. The price of a car is as important to some people as the make, or the mileage, or the colour.

As a user of DMOZ I might welcome a classification of shops which sell organic food, another user might want to see hypermarkets, another shops which are open 24 hours in the London area. Why shouldn't Sainsbury's be in all 3?

There is a great deal of editor lore telling us only to place a site in multiple cats in exceptional circumstances. I suspect that this is partly due to the structure of the ODP database. It's not actually a relational database, and duplicates of sites really are duplicates, making the database unnecessarily large.

Before it was bought up by a big publishing company and made too commercial, there was a really excellent site offering similar editor-checked content in the Netherlands called startpagina - www.startpagina.nl. There were unlimited subdomains, with the equivalent of DMOZ editors in charge of each of them, who had almost complete control of their pages, adding whatever links they wanted, duplicates or not. So the editor of 'canyoning.pagina.nl' could list a site that offered canyoning fun, and the same site might also appear in 'canary-islands.pagina.nl' and in 'adventure-sports.pagina.nl'.

Who cares that the site appears several different times? No-one. It was at one point the most popular site in Holland.
 

flicker

Member
Joined
Aug 22, 2003
Messages
342
Re: Reorganization and why dmoz human editing fail

I too have sometimes wished it were possible to somehow @link individual sites (as opposed to only categories). That way *every* URL we listed could have *only* one actual link to it (with the rest of the @links leading to # anchors on the page it's truly listed on, or something). However, every way I can think of to implement this vague concept completely sucks. Perhaps that's due to my personal tech limitations rather than an inherent flaw with the idea. You should bring it up on the internal Features forum. I'm sure someone there can give you a real idea of the feasibility of what you're asking about.

In the meantime, for the kind of search you're talking about, I recommend searching within the parent category in the Google directory. The ODP may not have the resources for a top-notch search function, but luckily somebody who does will nicely restrict your search to our data for you. ;-) Sure, their copy of our directory is a little bit out of date, but this tactic still often returns a great selection of DMOZ-listed sites from sister, cousin, and auntie categories, all of which contain the term I want.
 

xixtas01

Member
Joined
Jun 16, 2003
Messages
624
Re: Reorganization and why dmoz human editing fail

I can conceive of a directory where every resource is listed only once but is navigable in many different ways. Each listing would carry a set of tags that mark it in various ways: a language tag indicating the languages available; a physical-presence tag indicating where the entity is physically located; a delivery-area tag indicating where the service is designed to be consumed; a meta-topic tag and several subtopic tags; etc. Then your navigation could be generated on the fly, and every listing would be drillable through many different paths.
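A tiny sketch of that idea (the listings, the tag names and the group_by helper are made up for illustration, not ODP code): each resource appears once, and any tag can generate a browsable index on the fly.

from collections import defaultdict

listings = [
    {"url": "example-marina.co.uk", "language": ["en"], "located_in": "UK/Southampton",
     "serves": "UK/South Coast", "topic": "Recreation/Boating/Marinas"},
    {"url": "voilier-exemple.fr", "language": ["fr"], "located_in": "France/Brest",
     "serves": "France", "topic": "Recreation/Boating/Sailing"},
]

def group_by(listings, tag):
    # Build a browsable index keyed on any single tag.
    index = defaultdict(list)
    for item in listings:
        values = item[tag] if isinstance(item[tag], list) else [item[tag]]
        for value in values:
            index[value].append(item["url"])
    return dict(index)

print(group_by(listings, "topic"))       # drill down by subject...
print(group_by(listings, "located_in"))  # ...or by physical location, from the same data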

However, this isn't a description of the ODP. The ODP works differently. The "one best" location seems like a good guideline, given the structure we work with. Anything else seems overly subjective and arbitrary.
 

donaldb

Member
Joined
Mar 25, 2002
Messages
5,146
Re: Reorganization and why dmoz human editing fail

It's almost an obsession with DMOZ editors to find the 'right' cat for a site,...
There is a great deal of editor lore telling us only to place a site in multiple cats in exceptional circumstances.
Probably because it is in the guidelines. That's how we do things. It's not editor lore of any kind :)
 

chris22rv

Re: Reorganization and why dmoz human editing fail

Anything else seems overly subjective and arbitrary.

On the contrary, it's too subjective and arbitrary now - that's my entire point!

If we had multiple dimensions of categorisation - the properties mentioned above are a great start - then no one would have to decide which of those dimensions was THE most important one. THAT decision is both subjective and arbitrary.

You're right in saying that this isn't the ODP, but then isn't this thread about how the ODP needs to change? The system envisaged above is not so fanciful; it's actually the norm in most relational databases. However, the ODP is only a database in the same way that the bottom drawer of my computer desk is a 'document retrieval system': there are documents in it (along with old CD-ROMs, crisp wrappers and a packet of chewing gum) and I could get them out again.
 

chris22rv

Re: Reorganization and why dmoz human editing fail

Probably because it is in the guidelines. That's how we do things.

What, and we can't question the guidelines? Is this a religion or something?

Do you know of the experiment with the monkeys and the banana and the hosepipe? You put 5 monkeys in a cage and hang up a banana; when a monkey gets the banana, you spray all the monkeys with water. You take out a monkey and replace it with a new one. You hang up a banana; the new monkey eventually tries to get it, and the other 4 monkeys beat it up to avoid getting sprayed. You replace each of the monkeys in turn, and each new monkey that tries to get the banana gets beaten up. Eventually you have 5 entirely new monkeys, none of whom was ever soaked, who will all beat up any new monkey that tries to get the banana - "because that's the way we do it around here".

Perhaps the guidelines are limited by the system we have in place, and if we replaced that, we'd have more freedom.
 

Re: Reorganization and why dmoz human editing fail

Do you know of the experiment with the monkeys and the banana and the hosepipe?

samisdad, this is a poor mockup of what's going on here.
 