ODP link checker and site removal

Stern123

Member
Joined
Jul 14, 2008
Messages
56
We have a website that was listed for several years and was dropped in April 2008 (we know, because log files show referrals from dmoz.org until May 2008).

Since our website is an established quality site, we assumed that there was a problem with the ODP link checker accessing our site.

So, looking again at our log files, we saw two failed attempts in April by the pmoz.info link checker to access the site. So voila, that would explain the site removal.

But why was the link checker blocked by our site? Because the IP address of the link checker belongs to a hosting company that also hosts many spammers. We have found fake Googlebots, scrapers, email harvesters and other suspicious activity from those IP ranges.

I can only assume that we're not the only website which has accidentally blocked the ODP link checker due to bad activity in that neighborhood.

Now take Googlebot, for example: any hit from Googlebot can be easily verified. A reverse DNS lookup, plus a forward lookup of the resulting hostname back to the same IP, will confirm that the Googlebot is legitimate and not a fake agent.
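That two-step check is something any webmaster can script in a few lines; here is a rough Python sketch (the domain suffixes, the function name and the example IP are just my illustration, nothing official from Google):

import socket

# Rough sketch: verify that an IP claiming to be Googlebot really is Googlebot.
# Reverse-resolve the IP, check the hostname ends in googlebot.com or google.com,
# then forward-resolve that hostname and make sure it maps back to the same IP.
def looks_like_real_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]                  # reverse DNS lookup
    except OSError:
        return False
    if not host.endswith(('.googlebot.com', '.google.com')):
        return False                                        # hostname is not Google's
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]      # forward DNS lookup
    except OSError:
        return False
    return ip in forward_ips                                # must round-trip to the same IP

print(looks_like_real_googlebot('66.249.66.1'))             # an IP from a range Googlebot uses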

Unfortunately, there is no way to verify the ODP link checker like this.

Webmasters could set up an exception in their security system to allow in anything calling itself "ODP link checker". But if everyone did that, then any bad bot could call itself "ODP link checker" and get a free ticket to every website, scraping email addresses and worse. So this isn't an ideal solution.

So please reconsider the draconian consequence of automatically removing websites just because the link checker bot is accidentally blocked by certain websites.

It may have worked a few years ago, but firewalls and website security measures are becoming stricter these days, and the ODP link checker is not keeping up with the times.

I think that:
1) if the link checker tool cannot access a website, it should only flag the website
2) the listing should remain in ODP on 'probation' until manually reviewed to confirm that the site is gone
3) failing the above, please rewrite your link checker code so that it sends the correct headers and uses an IP address and/or reverse DNS record so that any website can confirm it is legitimate
4) if possible, please fast-track our website to restore the listing (we've emailed the editor and the staff, but no response yet)
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
I think that:
1) if the link checker tool cannot access a website, it should only flag the website
2) the listing should remain in ODP on 'probation' until manually reviewed to confirm that the site is gone
3) failing the above, please rewrite your link checker code so that it sends the correct headers and uses an IP address and/or reverse DNS record so that any website can confirm it is legitimate
4) if possible, please fast-track our website to restore the listing (we've emailed the editor and the staff, but no response yet)

Regarding the checker code, they are designed (afaik) to simulate a regular web user. If the bot can't reach the site, it is likely that some/most regular users might not be able to either. The link checker makes note of a site it gets no response from and rechecks it a week or two later. If it still receives no response then it indeed flags the site, and moves it to unreviewed to be manually checked by an editor. While editors usually give Q&A flags some priority, there is no specific time frame for them to be reviewed.
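To put the recheck behaviour another way, it is roughly this two-strike flow (my own illustration of the behaviour as I understand it, not the tool's actual code):

# Rough illustration of the two-strike flow described above; the dict keys and
# the 'unreviewed' status label are made up, this is not the real tool's code.
def handle_check_result(listing, reachable):
    if reachable:
        listing['strikes'] = 0                 # site answered, clear any earlier note
        return
    listing['strikes'] = listing.get('strikes', 0) + 1
    if listing['strikes'] >= 2:                # second failure, a week or two later
        listing['status'] = 'unreviewed'       # queued for manual review by an editor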
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
Callimachus said:
Regarding the checker code, they are designed (afaik) to simulate a regular web user.

If that's true, I mean, is that really a good thing? Spammers are always trying to disguise themselves as a regular web browser... so who could be blamed if the link checker is accidentally flagged as suspicious?

Good search engine spiders declare themselves honestly and have nothing to hide. What does ODP link checker have to hide? Why does it furtively disguise itself as a browser and/or come from spammy neighborhoods?

Callimachus said:
The link checker flags a site it gets no response from. It rechecks it a week or two later. If it still receives no response then it indeed flags the site, and moves it to unreviewed to be manually checked by an editor.

Does anyone here think that this is the ideal method? I think that the site should remain listed until manually reviewed.

Or to take a different perspective, an expired site that lingers on ODP is a *lesser* evil than a site that is prematurely removed from ODP, don't you agree?

While Q&A flags are usually given priority, there is no specific time frame for such.

Looking at the bigger picture here, false positives and disgruntled webmasters clogging dmoz's inboxes with complaints are just a waste of time. If anything, the ODP is understaffed. The volunteers don't need to spend their precious time on "customer service" issues. And I strongly believe that the current system can be unfair and a burden on volunteer time.

For the sake of the struggle to remain relevant, and for the sake of good time management, I beseech the ODP management to be proactive, reconsider this system, and take the initiative of keeping pace with the modern security concerns of the web.
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
Does anyone here think that this is the ideal method? I think that the site should remain listed until manually reviewed.
The whole point of automated tools like the link checker is to remove bad links from the directory ASAP, as they degrade the user's experience. IMO it's better to have a few false positives that will be re-added to the live directory when they are manually checked than to leave the true positives (i.e. the bad links) listed until someone gets around to checking them.
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
motsa said:
The whole point of automated tools like the link checker is to remove bad links from the directory ASAP, as they degrade the user's experience. IMO it's better to have a few false positives that will be re-added to the live directory when they are manually checked than to leave the true positives (i.e. the bad links) listed until someone gets around to checking them.

Do you have any statistical or anecdotal evidence upon which you base this opinion? What is the probability of a website expiring and "degrading a user's experience"? What is the probability of a false positive degrading a webmaster's experience and degrading your inbox with complaints?

I don't mean to provoke, but since I'm not an editor, I really have no idea. I only know the unfair feeling of being on the other end of this policy.

Do you know what is the average time frame for someone to get "around to checking them"? I think the current system works OK if the getting "around to checking them" is a week or two maximum. After that, I think the false positives become the greater evil.

I also couldn't help but notice that you haven't addressed any other portion of my long posts, which I think deserves at least *some* consideration.
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
Do you have any statistical or anecdotal evidence upon which you base this opinion?
My eight years of experience as an editor? The main ODP link checking tool used to work the way you want this one to, i.e. it would just flag potential problems but leave the sites listed. It was NOT an ideal situation, for editors or for the quality of the directory, which is why the various link checking tools now unreview sites that appear to be problematic. I think it's fairly safe to say that they will not be changed back to the old method of leaving the flagged sites listed.
Do you know what is the average time frame for someone to get "around to checking them"? I think the current system works OK if the getting "around to checking them" is a week or two maximum. After that, I think the false positives become the greater evil.
We'll have to agree to disagree. I'd say (as an editor, not a site owner) that it would be far worse to leave bad links listed for weeks (or however long it takes someone to check them) than to mistakenly temporarily remove a few false positives for the same period of time. Consider that we're looking at the Big Picture, i.e. the overall quality of the public directory, rather than being focused on the status of any specific site.
Looking at the bigger picture here, false positives and disgruntled webmasters clogging dmoz's inboxes with complaints are just a waste of time. If anything, the ODP is understaffed. The volunteers don't need to spend their precious time on "customer service" issues. And I strongly believe that the current system can be unfair and a burden on volunteer time.
Since we get very few complaints about false positives, I'd say it's not too much of a burden; much less of a burden than dealing with bad links that are still listed.
I also couldn't help but notice that you haven't addressed any other portion of my long posts, which I think deserves at least *some* consideration.
Of the four suggestions in your initial post, I addressed the first two. I didn't build the tool and I have no control over how it behaves in relation to suggestion three. And, no, posting here will not give your site preferential treatment. If the editor you've written to or staff choose to do so based on your email, that's fine, but otherwise it will be re-reviewed when an editor gets to it.
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
motsa said:
My eight years of experience as an editor?

Fair enough; unless other editors say otherwise, I'll of course take your word for it. I do look forward to a time when "getting around to it" means days rather than months, instead of the slow bureaucracy (or at least the impression of it) we have now. I'm surprised that so few false positives are reported; perhaps we had the bad luck of being visited by a rarer, spammier breed of link checker.
 

gloria

Curlie Meta
Joined
Mar 25, 2002
Messages
388
Add to that my nine years' experience as an editor. The vast, overwhelming majority of positives are accurate.
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
How about this idea? After a link checker automatically transfers a website from 'listed' to 'unreviewed', the tool automatically *continues* to occasionally hit the site *until* a manual review by an editor. If the tool is later able to access the website (before a manual review), the site is *automatically* reverted back to 'listed'. (If preferred, the site remains flagged but listed until a manual review.)

If a category is well-staffed and the website is reviewed relatively quickly, there is no difference. If the website really did expire, there's still no difference. However, if it's a false positive and the category is understaffed and it takes a long time to get around to it, then this can work. A webmaster might notice the dropped listing and adapt their security measures. Your designers might upgrade the link checker so that it's not getting blocked. The link checker might be rotated to a different IP in a new, clean neighborhood. (As a courtesy to webmasters, this phase of the link checker might be renamed to something like 'ODP Resurrector'.) In all these cases, the link checker can give the site a green light again before editorial intervention.

Is that a fair compromise? A problem resulting from automation is cured by automation, and the damage from the false positive is mitigated.
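To make the idea concrete, here is a rough sketch of the loop I have in mind (the states, field names and interval are purely hypothetical, since I obviously don't know your internal tools):

import time
import urllib.request

# Hypothetical sketch of the suggested 'probation' loop: keep polling a site that
# was automatically unreviewed, and relist it if it becomes reachable again before
# an editor gets to it. The states, field names and interval are all invented.
def probation_recheck(url, listing, interval_days=14):
    while listing['status'] == 'unreviewed':
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                if resp.status < 400:
                    listing['status'] = 'listed'          # reachable again: relist it
                    listing['flag'] = 'relisted-by-bot'   # but keep it flagged for review
                    return
        except Exception:
            pass                                          # still unreachable, keep waiting
        time.sleep(interval_days * 86400)                 # wait before the next attempt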
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
Actually, we have had the same suggestion from editors several times already. Unfortunately, there are a few reasons against this:

1) Our current database and system do not allow this to be done easily. At the moment this would be a rather difficult, server-load-heavy crawling operation. This should improve in the medium term due to backend changes, which is another reason not to do this right now: changes to our tools should be a lot easier in the future.
2) We have chronically lacked development power for quite some time, because the developers are paid by AOL and are not volunteers.
3) A lot of sites that come back are parked, have changed content, have been sold, ... so they are not listable any more. The tool would therefore need either some sort of quick editor confirmation or AI.
 

jimnoble

DMOZ Meta
Joined
Mar 26, 2002
Messages
18,915
Location
Southern England
What often happens is that a site drops off the net for a while, is repurposed and then put back up again with entirely different content - sometimes months later. That's why I'm opposed to automated re-listing without intervening human review.

It would be bad enough if a US realtor ended up listed in a Canadian category, but I'm really not a great fan of the idea that pr0n sites could end up in the main directory or Kids_and_Teens, and I'm sure that you aren't either.
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
windharp said:
1) Our current database and system do not allow this to be done easily. At the moment this would be a rather difficult, server-load-heavy crawling operation. This should improve in the medium term due to backend changes, which is another reason not to do this right now: changes to our tools should be a lot easier in the future.
2) We have chronically lacked development power for quite some time, because the developers are paid by AOL and are not volunteers.
3) A lot of sites that come back are parked, have changed content, have been sold, ... so they are not listable any more. The tool would therefore need either some sort of quick editor confirmation or AI.

I see. I appreciate your time in explaining what's going on behind the scenes. It's really great for public relations. I guess that at least two benefits of a partnership with AOL would be that a) if developers are paid by AOL, they might run the ODP link checkers from AOL IP ranges for more legitimacy and authentication? b) ODP could use some sort of algorithm borrowed from AOL Search to determine if webpages have been parked or changed content? But I guess that's blue-sky thinking if AOL is like a distant stepfather (stepmother?).

A third wild idea is that sites flagged by link checkers could be automatically and temporarily moved to a different directory (a 'probation' directory) until manually reviewed. This doesn't degrade the user's experience because these sites are removed from the main directory. But at least webmasters get a clue of what's going on instead of a mysterious site disappearance. Doesn't change much, but it gives the public the impression of a more transparent process (ala Wikipedia's warning boxes). I don't know if this fits with the backend changes you mentioned.

I guess I'm out of (inappropriate) ideas now. Thanks for your time.
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
Just to clarify -- there's no "partnership with AOL"; AOL owns the Open Directory Project. While the main link checker tool was created and is maintained by AOL staff, other link checker tools (including the one that brought you here originally) are not.
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
motsa said:
Just to clarify -- there's no "partnership with AOL"; AOL owns the Open Directory Project. While the main link checker tool was created and is maintained by AOL staff, other link checker tools (including the one that brought you here originally) are not.

Gotcha. Now this other link checker tool that brought me here... can you rename it to "Newman" (from Seinfeld)?

This way, when Newman hits our website again, I can grimace and say "Hello... Newman" and let it in through the door even though I'm not happy about it. And if Newman bans our site again, I can clench my fist and curse "Newman!"
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
ODP could use some sort of algorithm borrowed from AOL Search to determine if webpages have been parked or changed content?
Actually I think that our own tool developers, especially the ones writing our unofficial link checkers, are able to spot parking pages even better than AOL Search does. No offense meant if someone from AOL reads that; it's just my personal experience with their search :)

But to identify content changes, one would have to store the website's content in advance, _before_ the content change. That needs Google-esque data storage and processing capabilities for the millions of sites listed in the ODP. Google does something like that (see the Google cache, for example). I read some papers they published about it and was impressed. The sheer amount of data they use for _experiments_ makes you wonder how anybody on earth could do that live, but obviously they can.
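In its simplest form the idea is just "store a fingerprint now, compare it later", something like this toy sketch - though real change detection would need far fuzzier comparison than an exact hash, plus storage for millions of pages:

import hashlib
import urllib.request

# Toy sketch only: store a fingerprint of each listed page in advance, then
# compare it on a later visit. Real detection would have to tolerate ads, dates
# and boilerplate changes, and keep data for millions of sites.
def fingerprint(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

stored = {}                                    # url -> fingerprint taken at listing time

def content_changed(url):
    current = fingerprint(url)
    previous = stored.get(url)
    stored[url] = current                      # remember the latest version
    return previous is not None and previous != current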

A third wild idea is that sites flagged by link checkers could be automatically and temporarily moved to a different directory (a 'probation' directory) until manually reviewed.
So you suggest we still show dead sites to the user, telling him: "Hey, we have some dead sites here you might want to look at"? Remember, our primary concern is always our users, the people visiting our site, not the site owners. And I am sorry to say that in my opinion our users would not benefit in the slightest from having a pool of mainly dead sites to wade through.

Anyway: things like that might (!) be possible with the new backend, but they certainly are not with today's software.

(ala Wikipedia's warning boxes).
Are you surprised to hear that I am not a fan of those Wikipedia warning boxes either? But for a different reason: warning tags ("POV", "missing citations") can be added very easily by people with an opposing POV, without even trying to fix the problem. Getting them removed is a lot of work, though, even when they are totally unfounded. I've seen that a lot in contentious topics. But hey, this is an ODP forum, so this is clearly off topic.

[OT warning tag goes here] ;)
 

plantrob

Curlie Admin
Joined
Mar 29, 2004
Messages
153
Hi, this is Neumann :D
The pmoz.info bot that hit your site was mine. While Callimachus' assumption that some editor-run bots simulate regular browsers is undoubtedly true, it is not true for the two highest-impact bots ODP uses at the moment - the official Robozilla bot and my pmoz.info bot - both identify themselves uniquely through their user-agent strings.
Yes, my tools are hosted on a shared server. I cannot control what trouble my fellow hostees get into. That results in a few undue unlistings - but these are a drop in the bucket compared to the listings removed for good reason.
Rob
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
plantrob said:
That results in a few undue unlistings - but these are a drop in the bucket compared to the listings removed for good reason.

Hello... Newman!

The more thoughtful developers of spiders out there are taking the extra step of reassuring webmasters with identifiable IP ranges, reverse DNS checks and all that. While you may not have the resources to take it that far, wouldn't it be better to use a DMOZ or AOL IP at the very least? Sure, pmoz may have a unique user agent, but we also saw a fraudulent Googlebot from the same hosting range, so clearly the user agent (which can so easily be faked) is not a sign of trust whatsoever.

In my opinion, it's actually quite Newman-like to shrug off any culpability for your dubious neighbors and see us as just a drop in the bucket :(
 

Stern123

Member
Joined
Jul 14, 2008
Messages
56
plantrob said:
Hi, this is Neumann :D

I just visited pmoz.info. I don't remember if it has been updated since earlier this year. From a cursory glance, it isn't entirely clear what the function of the bot is. Knowing what I know now, I would be afraid to block the bot. But a newbie to DMOZ could very well assume it's some sort of research/analytical tool and would not predict the dire consequence of blocking the bot. A simple clear-cut warning would help, like "If you block our bot, your site will be banned from dmoz!"

And then I saw the line "does not read robots.txt" and that's already a big "whoa!" The disclaimer "because its visit is so limited" feels like a paltry excuse to me. Reading robots.txt is probably rule #1 for how to write a polite web bot.
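Respecting robots.txt costs a bot almost nothing. A minimal Python sketch of a polite single-URL check (nothing to do with the actual pmoz code, of course; the user-agent name is made up):

from urllib import parse, request, robotparser

# Minimal sketch of a polite single-URL check: fetch robots.txt first and only
# request the listed URL if that user agent is allowed to. Purely illustrative;
# the user-agent string and function are my own invention, not the pmoz bot.
def polite_check(url, user_agent='ExampleLinkChecker/1.0'):
    rp = robotparser.RobotFileParser()
    rp.set_url(parse.urljoin(url, '/robots.txt'))
    rp.read()                                        # one extra, very small request
    if not rp.can_fetch(user_agent, url):
        return None                                  # respect the site's wishes
    req = request.Request(url, headers={'User-Agent': user_agent})
    with request.urlopen(req, timeout=30) as resp:
        return resp.status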

I won't engage in an extended debate about this, certainly not in this forum where there is apparently (to me, anyway) a very defensive attitude from certain editors when faced with constructive criticism from the public (while other editors are much more forthcoming, helpful and tolerant of us 'whiners'). Suffice it to say, I've read about many bots being banned for the simple failure to read (and respect) robots.txt.

So given that and everything else I've expressed on this thread, forgive me for feeling bitterly disappointed that our listing was dropped due to an automated tool from a spammy neighborhood which doesn't even read robots.txt and whose designer won't apologize for any of it and dismisses the collateral damage as a "drop in the bucket".

Newman!!!
 

pvgool

kEditall/kCatmv
Curlie Meta
Joined
Oct 8, 2002
Messages
10,093
There is a good reason not to read robots.txt.
If robots.txt blocked the bot, the website could not be checked for availability. As it wouldn't be available to the bot, the website would be removed from the directory, causing a lot of unwanted and unnecessary removals.
The bot only visits the URL as listed in DMOZ; it does not follow any links on the site, so it won't use a lot of bandwidth.
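In code terms the check is roughly this shape - a single request to the listed URL and nothing else (my own illustration, not the actual tool):

import urllib.request

# Illustration only, not the actual DMOZ tool: one HEAD request to the listed
# URL, no link following, so the bandwidth cost for the site is tiny.
def is_reachable(url):
    req = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status < 400
    except Exception:
        return False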

And yes, the few wrong removals are just a drop in the bucket, and probably only a drop in the ocean. Of all the sites removed by this bot that I have checked, I have rarely seen a wrong removal. If one in 10,000 removed sites is wrongly removed, I would say that this tool is working perfectly. We care about the bucket (the directory), not about single drops (sites). Personally, I prefer a few sites being removed incorrectly over a few sites incorrectly not being removed.
 

Tech Owner

Member
Joined
May 21, 2008
Messages
20
Link Checkers

"Regarding the checker code, they are designed (afaik) to simulate a regular web user.'

No, they do not "simulate a regular web user". Any bot that doesn't originate from a major SE gets a 403 Access Denied (Robozilla on certain IPs will get through). All SE bots must pass an RDNS test before they gain access. Why? The major SEs have stated publicly that if a site owner is in doubt about the origins of any bot, the only way to be sure the bot isn't a rogue is to do an RDNS check.

If a regular visitor were to change the User-Agent of their browser to "Link Checker", the visitor would be denied access. Those who change the UA of the browser must change it back (remove the mask) to be considered a regular/public visitor.

Try entering a public art gallery wearing a mask with a can of spray paint in hand! You may claim a valid reason for your dress code and accessories, but do you actually expect to gain access? (This is a public gallery, and now you don't resemble a regular visitor.)
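In simplified form, the logic is roughly this (a sketch of the policy described above, not my actual firewall rules; the trusted-domain list is just an example):

import socket

# Simplified sketch of the access policy described above: anything presenting
# itself as a bot and not verifiable as a major SE crawler is refused (403);
# claimed SE bots must pass a reverse-DNS plus forward-confirm test.
# The domain suffixes are examples only; this is not my actual firewall config.
TRUSTED_CRAWLER_SUFFIXES = ('.googlebot.com', '.search.msn.com', '.crawl.yahoo.net')

def allow_request(ip, user_agent):
    ua = user_agent.lower()
    if 'bot' not in ua and 'crawler' not in ua and 'spider' not in ua:
        return True                                    # treat as a regular visitor
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse DNS
        if not host.endswith(TRUSTED_CRAWLER_SUFFIXES):
            return False                               # 403: not a major SE crawler
        return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm the hostname
    except OSError:
        return False                                   # 403: RDNS test failed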

"If the bot can't reach the site, it is likely that some/most regular users might not be able to either. "

Regular visitors don't come from 74.208.25.118 = [ perfora.net ]. Regular visitors don't come from 216.15.74.85 with a URL in the UA string. The two IP ranges 74.208.0.0/17 and 74.208.128.0/18 are blocked by my firewall (drop connection). Why? Past and ongoing abuse.

Log Files:

grep 216.15.74.85 /home/web/tech-httpd-access.log

216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Jan/2008:00:29:06 +0000] "GET / HTTP/1.1" 403 897
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Mar/2008:18:00:40 +0000] "GET /node/7119 HTTP/1.1" 403 906
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [19/Apr/2008:02:20:04 +0000] "GET /node/feed HTTP/1.1" 403 906

09/Jan/2008 and 19/Apr/2008 correspond very closely to the dates that the main page and the "feed" page were removed from a category with no editor. The category that does have an editor manually checked "node/7119".

"The link checker makes note of a site it gets no response from and rechecks it a week or two later."

No, it doesn't recheck it "a week or two later". Robozilla visited my site once this year, and 74.208.25.118 has never accessed my site via the main, feed or node/7119 URLs.

It does appear that editors aren't informed as to what is actually happening. Maybe they are told what should be happening instead of the reality. Many site owners and editors are actually trying to achieve the same level of quality from their respective positions, e.g. I don't want to encounter comment spam/garbage/hacking attempts, and an editor doesn't want to open a "queue" and find spam/garbage either.


Regards and have a good day.
 