

Posted

We have a website that was listed for several years and was dropped in April 2008 (we know, because log files show referrals from dmoz.org until May 2008).

 

Since our website is an established quality site, we assumed that there was a problem with the ODP link checker accessing our site.

 

So, looking again at our log files, we saw two failed attempts in April by the pmoz.info link checker to access the site. So voila, that would explain the site removal.

 

But why was the link checker blocked by our site? Because the IP address of the link checker belongs to a hosting company that also hosts many spammers. We have found fake Googlebots, scrapers, email harvesters and other suspicious activity from those IP ranges.

 

I can only assume that we're not the only website which has accidentally blocked the ODP link checker due to bad activity in that neighborhood.

 

Take Googlebot, for example: any hit from Googlebot can be easily verified. A reverse DNS lookup on the requesting IP, followed by a forward DNS lookup on the resulting hostname, will confirm that the Googlebot is legitimate and not a fake agent.
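For illustration, here is a minimal sketch of that two-step check in Python (my own example, not any official Google or ODP tool; the function name is made up):

import socket

# Sketch only: reverse-resolve the caller's IP, check that the hostname belongs
# to Google's crawler domains, then forward-resolve it to confirm the round trip.
def is_genuine_googlebot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward DNS lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must resolve back to the same IP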

 

Unfortunately, there is no way to verify the ODP link checker like this.

 

Webmasters could set up an exception in their security system to allow in anything calling itself "ODP link checker". But if everyone did that, any bad bot could call itself "ODP link checker" and have a free ticket to every website, scraping email addresses and worse. So this isn't an ideal solution.
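To show how cheaply a user-agent string can be forged, here is a hypothetical snippet (not any real bot's code):

import urllib.request

# Hypothetical: any script can claim to be the "ODP link checker" in one line.
req = urllib.request.Request("http://example.com/",
                             headers={"User-Agent": "ODP link checker"})
print(urllib.request.urlopen(req).status)  # the server has no way to tell it apart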

 

So please reconsider the draconian consequence of removing websites automatically just because the link checker bot is accidentally blocked by certain websites.

 

It may have worked a few years ago, but firewalls and website security measures have become stricter, and the ODP link checker has not kept up with the times.

 

I think that:

1) if the link checker tool cannot access a website, it should only flag the website

2) the listing should remain in ODP on 'probation' until manually reviewed to confirm that the site is gone

3) failing the above, please rewrite your link checker code so that it sends the correct headers and uses an IP address and/or reverse DNS record so that any website can confirm it is legitimate

4) if possible, please fast-track our website to restore the listing (we've emailed the editor and the staff, but no response yet)

  • Editall
Posted
I think that:

1) if the link checker tool cannot access a website, it should only flag the website

2) the listing should remain in ODP on 'probation' until manually reviewed to confirm that the site is gone

3) failing the above, please rewrite your link checker code so that it sends the correct headers and uses an IP address and/or reverse DNS record so that any website can confirm it is legitimate

4) if possible, please fast-track our website to restore the listing (we've emailed the editor and the staff, but no response yet)

 

Regarding the checker code, they are designed (afaik) to simulate a regular web user. If the bot can't reach the site, it is likely that some/most regular users might not be able to either. The link checker makes note of a site it gets no response from and rechecks it a week or two later. If it still receives no response then it indeed flags the site, and moves it to unreviewed to be manually checked by an editor. While editors usually give Q&A flags some priority, there is no specific time frame for such being reviewed.

ODP Editor callimachus

Any opinions expressed are my own, and do not represent an official opinion or communication from the ODP.

Private messages asking for submission status or preferential treatment will be ignored.

Posted
Regarding the checker code, they are designed (afaik) to simulate a regular web user.

 

If that's true, is it really a good thing? Spammers are always trying to disguise themselves as regular web browsers... so who can be blamed if the link checker is accidentally flagged as suspicious?

 

Good search engine spiders declare themselves honestly and have nothing to hide. What does ODP link checker have to hide? Why does it furtively disguise itself as a browser and/or come from spammy neighborhoods?

 

The link checker flags a site it gets no response from. It rechecks it a week or two later. If it still receives no response then it indeed flags the site, and moves it to unreviewed to be manually checked by an editor.

 

Does anyone here think that this is the ideal method? I think that the site should remain listed until manually reviewed.

 

Or to take a different perspective, an expired site that lingers on ODP is a *lesser* evil than a site that is prematurely removed from ODP, don't you agree?

 

While Q&A flags are usually given priority, there is no specific time frame for such.

 

Looking at the bigger picture here, false positives and disgruntled webmasters clogging dmoz's inboxes with complaints are just a waste of time. If anything, the ODP is understaffed. The volunteers don't need to spend their precious time on "customer service" issues. And I strongly believe that the current system can be unfair and a burden on volunteer time.

 

For the sake of the struggle to remain relevant, and for the sake of good time management, I beseech the ODP management to be proactive, reconsider this system, and take the initiative to keep pace with the modern security concerns of the web.

Posted
Does anyone here think that this is the ideal method? I think that the site should remain listed until manually reviewed.
The whole point of automated tools like the link checker is to remove bad links from the directory ASAP, as they degrade the user's experience. IMO it's better to have a few false positives that will be readded to the live directory when they are manually checked than to leave the true positives (i.e. the bad links) listed until someone gets around to checking them.
Posted
The whole point of automated tools like the link checker is to remove bad links from the directory ASAP, as they degrade the user's experience. IMO it's better to have a few false positives that will be readded to the live directory when they are manually checked than to leave the true positives (i.e. the bad links) listed until someone gets around to checking them.

 

Do you have any statistical or anecdotal evidence upon which you base this opinion? What is the probability of a website expiring and "degrading a user's experience"? What is the probability of a false positive degrading a webmaster's experience and degrading your inbox with complaints?

 

I don't mean to be provocative, but since I'm not an editor, I really have no idea. I only know the unfair feeling of being on the other end of this policy.

 

Do you know the average time frame for someone to get "around to checking them"? I think the current system works OK if getting "around to checking them" takes a week or two at most. After that, I think the false positives become the greater evil.

 

I also couldn't help but notice that you haven't addressed any other portion of my long posts, which I think deserves at least *some* consideration.

Posted
Do you have any statistical or anecdotal evidence upon which you base this opinion?
My eight years of experience as an editor? The main ODP link checking tool used to work the way you want this one to, i.e. it would just flag potential problems but leave the sites listed. It was NOT an ideal situation, for editors or for the quality of the directory, which is why the various link checking tools now unreview sites that appear to be problematic. I think it's fairly safe to say that they will not be changed back to the old method of leaving the flagged sites listed.

Do you know the average time frame for someone to get "around to checking them"? I think the current system works OK if getting "around to checking them" takes a week or two at most. After that, I think the false positives become the greater evil.
We'll have to agree to disagree. I'd say (as an editor, not a site owner) that it would be far worse to leave bad links listed for weeks (or however long it takes someone to check them) than to mistakenly temporarily remove a few false positives for the same period of time. Consider that we're looking at the Big Picture, i.e. the overall quality of the public directory, rather than being focused on the status of any specific site.

Looking at the bigger picture here, false positives and disgruntled webmasters clogging dmoz's inboxes with complaints are just a waste of time. If anything, the ODP is understaffed. The volunteers don't need to spend their precious time on "customer service" issues. And I strongly believe that the current system can be unfair and a burden on volunteer time.
Since we get very few complaints about false positives, I'd say it's not too much of a burden; much less of a burden than dealing with bad links that are still listed.

I also couldn't help but notice that you haven't addressed any other portion of my long posts, which I think deserves at least *some* consideration.
Of the four suggestions in your initial post, I addressed the first two. I didn't build the tool and I have no control over how it behaves in relation to suggestion three. And, no, posting here will not give your site preferential treatment. If the editor you've written to or staff choose to do so based on your email, that's fine but otherwise it will be rereviewed when an editor gets to it.
Posted
My eight years of experience as an editor?

 

Fair enough; unless other editors say otherwise, I'll of course take your word for it. I do look forward to a time when "getting around to it" means days, not months, rather than slow bureaucracy or the surface impression thereof. I'm surprised that so few false positives are reported; perhaps we had the bad luck of being visited by a rarer, spammier breed of link checker.

  • Meta
Posted
Add to that my nine years' experience as an editor. The vast, overwhelming majority of positives are accurate.
Posted

How about this idea? After a link checker automatically transfers a website from 'listed' to 'unreviewed', the tool automatically *continues* to occasionally hit the site *until* a manual review by an editor. If the tool is later able to access the website (before a manual review), the site is *automatically* reverted back to 'listed'. (If preferred, the site remains flagged but listed until a manual review.)

 

If a category is well staffed and the website is reviewed relatively quickly, there is no difference. If the website really did expire, there's still no difference. However, if it's a false positive and the category is understaffed and it takes a long time to get around to it, then this can work. A webmaster might notice the dropped listing and adapt their security measures. Your designers might upgrade the link checker so that it stops getting blocked. The link checker might be rotated to a different IP in a new, clean neighborhood. (As a courtesy notification to webmasters, this phase of the link checker might be renamed to something like 'ODP Resurrector'.) In all these cases, the link checker can give the site a green light again before editorial intervention.
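In rough pseudo-Python, the loop I'm suggesting would look something like this (every name here is invented purely for illustration):

# Sketch of the proposed 'ODP Resurrector' behaviour; the listing fields and the
# helpers fetch_ok(), relist() and schedule_recheck() are made up.
def recheck(listing):
    if listing.status != "unreviewed":         # only sites dropped by the link checker
        return
    if fetch_ok(listing.url):                  # the site answers again
        relist(listing, flag_for_review=True)  # back to 'listed', still flagged for an editor
    else:
        schedule_recheck(listing, days=14)     # keep trying until an editor reviews it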

 

Is that a fair compromise? A problem resulting from automation is cured by automation, and the damage from the false positive is mitigated.

  • Meta
Posted

Actually, we have had the same suggestion from editors several times already. Unfortunately, there are a few reasons speaking against this:

 

1) Our current database and system do not allow this to be done easily. Currently, this would be a rather difficult and server-load-heavy crawling operation. This should improve in the medium term, due to backend changes, which is another reason not to do this right now; changes to our tools should be a lot easier in the future.

2) We have chronically lacked development power for quite some time, because the developers are paid by AOL and are not volunteers.

3) A lot of pages that come back are parked, have changed content, have been sold, etc. -> they are not listable any more. So the tool would need either some sort of quick editor confirmation or AI.

Curlie Meta/kMeta Editor windharp

 


Posted

What often happens is that a site drops off the net for a while, is repurposed and then put back up again with entirely different content - sometimes months later. That's why I'm opposed to automated re-listing without intervening human review.

 

It would be bad enough that a US realtor ended up being listed in a Canadian category, but I'm really not a great fan of the idea that pr0n sites could end up in the main directory or Kids_and_Teens, and I'm sure that you aren't either.

Posted
1) Our current database and system do not allow this to be done easily. Currently, this would be a rather difficult and server-load-heavy crawling operation. This should improve in the medium term, due to backend changes, which is another reason not to do this right now; changes to our tools should be a lot easier in the future.

2) We have chronically lacked development power for quite some time, because the developers are paid by AOL and are not volunteers.

3) A lot of pages that come back are parked, have changed content, have been sold, etc. -> they are not listable any more. So the tool would need either some sort of quick editor confirmation or AI.

 

I see. I appreciate your taking the time to explain what's going on behind the scenes. It's really great for public relations. I guess that at least two benefits of a partnership with AOL are that a) if the developers are paid by AOL, they might run the ODP link checkers from AOL IP ranges for more legitimacy and authentication, and b) ODP could use some sort of algorithm borrowed from AOL Search to determine if webpages have been parked or changed content? But I guess that's blue-sky thinking if AOL is like a distant stepfather (stepmother?).

 

A third wild idea is that sites flagged by link checkers could be automatically and temporarily moved to a different directory (a 'probation' directory) until manually reviewed. This doesn't degrade the user's experience, because these sites are removed from the main directory. But at least webmasters get a clue of what's going on instead of a mysterious site disappearance. It doesn't change much, but it gives the public the impression of a more transparent process (à la Wikipedia's warning boxes). I don't know if this fits with the backend changes you mentioned.

 

I guess I'm out of (inappropriate) ideas now. Thanks for your time.

Posted
Just to clarify -- there's no "partnership with AOL"; AOL owns the Open Directory Project. While the main link checker tool was created and is maintained by AOL staff, other link checker tools (including the one that brought you here originally) are not.
Posted
Just to clarify -- there's no "partnership with AOL"; AOL owns the Open Directory Project. While the main link checker tool was created and is maintained by AOL staff, other link checker tools (including the one that brought you here originally) are not.

 

Gotcha. Now this other link checker tool that brought me here... can you rename it to "Newman" (from Seinfeld)?

 

This way, when Newman hits our website again, I can grimace and say "Hello... Newman" and let it in through the door even though I'm not happy about it. And if Newman bans our site again, I can clench my fist and curse "Newman!"

  • Meta
Posted
ODP could use some sort of algorithm borrowed from AOL Search to determine if webpages have been parked or changed content?

Actually, I think that our own tool developers, especially the ones writing our unofficial link checkers, are able to spot parking pages even better than AOL Search does. No offense meant if someone from AOL reads this; it's just my personal experience with their search :)

 

But to identify content changes, one would have to store the website's content in advance, _before_ the content change, which requires Google-esque data storage and processing capabilities for the millions of sites listed in the ODP. Google does something like that (see the Google cache, for example). I read some papers they published about it and was impressed. The sheer amount of data they use for _experiments_ makes you wonder how anybody on earth could do that live, but obviously they can.

 

A third wild idea is that sites flagged by link checkers could be automatically and temporarily moved to a different directory (a 'probation' directory) until manually reviewed.

So you suggest we still show dead sites to the user, telling them: "Hey, we have some dead sites here you might want to look at"? Remember, our primary concern is always our users, the people visiting our site, not the site owners. And I am sorry to say that, in my opinion, our users would not benefit in the slightest from having a pool of mainly dead sites to wade through.

 

Anyway: things like that might (!) be possible with the new backend, but they certainly are not with today's software.

 

(à la Wikipedia's warning boxes).

Are you surprised to hear that I am not a fan of those Wikipedia warning boxes either? But for a different reason: warning tags ("POV", "missing citations") can be added very easily by people with an opposing POV, without even trying to fix the problem. Getting them removed is a lot of work, though, even when they are totally unfounded. I've seen that a lot in contentious topics. But hey, this is an ODP forum, so this is clearly off topic.

 

[OT warning tag goes here] ;)

Curlie Meta/kMeta Editor windharp

 


Posted

Hi, this is Neumann :D

The pmoz.info bot that hit your site was mine. While callimachus' assumption that some editor-run bots simulate regular browsers is undoubtedly true, it is not true for the two highest-impact bots ODP uses at the moment - the official Robozilla bot and my pmoz.info bot - both identify themselves uniquely through their user-agent string.

Yes, my tools are hosted on a shared server. I cannot control what trouble my fellow hostees get into. That results in a few undue unlistings - but these are a drop in the bucket compared to the listings removed for good reason.

Rob

Posted
That results in a few undue unlistings - but these are a drop in the bucket compared to the listings removed for good reason.

 

Hello... Newman!

 

The more thoughtful developers of spiders out there are taking the extra step of reassuring webmasters with identifiable IP ranges, reverse DNS checks, and all that. While you may not have the resources to take it that far, wouldn't it be better to use a DMOZ or AOL IP at the very least? Sure, pmoz may have a unique agent, but we also saw a fraudulent Googlebot from the same hosting range, so clearly the user agent (which can so easily be faked) is not a sign of trust whatsoever.

 

In my opinion, it's actually quite Newman-like to shrug off any culpability for your dubious neighbors and see us as just a drop in the bucket :(

Posted
Hi, this is Neumann :D

 

I just visited pmoz.info. I don't remember if it has been updated since earlier this year. From a cursory glance, it isn't entirely clear what the function of the bot is. Knowing what I know now, I would be afraid to block the bot. But a newbie to DMOZ could very well assume it's some sort of research/analytical tool and would not predict the dire consequence of blocking the bot. A simple clear-cut warning would help, like "If you block our bot, your site will be banned from dmoz!"

 

And then I saw the line "does not read robots.txt" and that's already a big "whoa!" The disclaimer "because its visit is so limited" feels like a paltry excuse to me. Reading robots.txt is probably rule #1 for how to write a polite web bot.
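For the record, honouring robots.txt costs one extra (and cacheable) request per host and a few lines of code. Here is a sketch using Python's standard library, purely for illustration (the agent name is hypothetical, and I have no idea what language the real checker is written in):

from urllib import robotparser
from urllib.parse import urlparse

# Sketch only: check robots.txt before fetching a listed URL.
def allowed(url, agent="ODPLinkChecker"):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = robotparser.RobotFileParser(root + "/robots.txt")
    rp.read()                        # one extra request per host, easily cached
    return rp.can_fetch(agent, url)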

 

I won't engage in an extended debate about this, certainly not in this forum, where there is apparently (to me anyway) a very defensive attitude from certain editors when faced with constructive criticism from the public (while other editors are much more forthcoming, helpful, and tolerant of us 'whiners'). Suffice to say, I've read about many bots being banned for the simple failure to read (and respect) robots.txt.

 

So given that and everything else I've expressed on this thread, forgive me for feeling bitterly disappointed that our listing was dropped due to an automated tool from a spammy neighborhood which doesn't even read robots.txt and whose designer won't apologize for any of it and dismisses the collateral damage as a "drop in the bucket".

 

Newman!!!

  • Meta
Posted

There is a good reason to not read robots.txt

If robots.txt blocked the bot, the website could not be checked for availability. As it wouldn't be available to the bot, the website would be removed from the directory, causing a lot of unwanted and unnecessary removals.

The bot only visits the URL as listed in DMOZ; it does not follow any links on the site, so it won't use a lot of bandwidth.

 

And yes, the few wrong removals are just a drop in the bucket, and probably only a drop in the ocean. Of all the sites removed by this bot that I have checked, I have rarely seen a wrong removal. If one in 10,000 removed sites is wrongly removed, I would say that this tool is working perfectly. We care about the bucket (the directory), not about single drops (sites). For me, I would rather have a few sites removed incorrectly than have a few dead sites incorrectly left listed.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.

Posted

Link Checkers

 

"Regarding the checker code, they are designed (afaik) to simulate a regular web user.'

 

No, they do not "simulate a regular web user". Any bot that doesn't originate from a major SE gets a 403 access denied (Robozilla on certain IPs will get through). All SE bots must pass an RDNS test before they gain access. Why? The major SEs have stated publicly that if a site owner is in doubt about the origins of any bot, the only way to be sure the bot isn't a rogue is to do an RDNS check.

 

If a regular visitor were to change the User Agent of their browser to "Link Checker", the visitor would be denied access. Those who change the UA of the browser must change it back (remove the mask) to be considered a regular/public visitor.

 

Try entering a public art gallery wearing a mask, with a can of spray paint in hand! You may claim a valid reason for your dress code and accessories, but do you actually expect to gain access? (This is a public gallery, and now you don't resemble a regular visitor.)

 

"If the bot can't reach the site, it is likely that some/most regular users might not be able to either. "

 

Regular visitors don't come from 74.208.25.118 = [ perfora.net ]. Regular visitors don't come from 216.15.74.85 with a URL in the UA string. These two IP ranges, 74.208.0.0/17 and 74.208.128.0/18, are blocked by my firewall (drop connection). Why? Past and ongoing abuse.
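(For anyone following along, testing a visitor IP against those CIDR blocks takes only a few lines; this is a generic sketch, not my actual firewall rules:)

from ipaddress import ip_address, ip_network

# Generic example of a CIDR membership test, not the real firewall configuration.
BLOCKED = [ip_network("74.208.0.0/17"), ip_network("74.208.128.0/18")]
def is_blocked(ip):
    return any(ip_address(ip) in net for net in BLOCKED)
# is_blocked("74.208.25.118") -> True; the firewall simply drops such connections.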

 

Log Files:

 

grep 216.15.74.85 /home/web/tech-httpd-access.log

 

216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Jan/2008:00:29:06 +0000] "GET / HTTP/1.1" 403 897

216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Mar/2008:18:00:40 +0000] "GET /node/7119 HTTP/1.1" 403 906

216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [19/Apr/2008:02:20:04 +0000] "GET /node/feed HTTP/1.1" 403 906

 

09/Jan/2008 and 19/Apr/2008 correspond very closely to the dates that the main page and the "feed" page were removed from a category with no editor. The category that does have an editor manually checked "node/7119".

 

"The link checker makes note of a site it gets no response from and rechecks it a week or two later."

 

No, it doesn't "recheck it a week or two later." Robozilla visited my site once this year, and 74.208.25.118 has never accessed my site via the main, feed, or node/7119 URLs.

 

It does appear that editors aren't informed as to what is actually happening. Maybe they are being told what *should* be happening instead of the reality. Many site owners and editors are actually trying to achieve the same level of quality from different positions, e.g. I don't want to encounter comment spam/garbage/hacking attempts, and an editor doesn't want to open a "queue" and find spam/garbage either.

 

 

Regards and have a good day.

Posted
For me, I would rather have a few sites removed incorrectly than have a few dead sites incorrectly left listed.

 

You evade the point. You make it sound like it's a choice of either a) a few sites incorrectly removed, or b) a few sites not removed correctly, and that's misleading.

 

The point is that developers can create a well-designed bot that is less likely to get banned. By doing so, you *still* remove all the incorrect sites *and* reduce the false positives.

 

As I've discussed ad nauseam, these measures include (but are not limited to) moving away from spammy IPs, moving to AOL/DMOZ IPs, and reading robots.txt like a polite bot. The chance of the bot actually being blocked by robots.txt is probably minuscule; it's not the big deal you make it out to be. In that very rare case, there could be a manual review, or a 'Plan B' bot (the link checkers that pretend to be browsers) could check up on the sites that pmoz can't visit. I don't know, these are suggestions, but the point is that even a single one of these measures could be enacted without it amounting to some sort of degradation of the directory.

 

But I think you know that, and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

 

I would invite you to come over to a webmaster forum to discuss the lazy design of the pmoz bot and make your arguments, and you'll probably be overwhelmed by a tidal wave of disagreement and forced to retreat to the safe haven of this forum.

  • Meta
Posted

You are, of course, always welcome to take whatever means you think necessary to protect yourself from spam. You're even welcome to describe those means in this forum (at least so far as they relate to the ODP tools). And in return, you're likely to get some information about non-confidential aspects of the ODP tools.

 

Both you and the ODP tool developers are free to use that information however you each wish.

 

It is not, however, reasonable to expect you to justify your failure to cooperate with any particular ODP tool. Nor is it reasonable for you to expect anyone to justify their tool design to you. Both your website and their tools are what they are. And everyone can choose for themselves whether to use them or cooperate with them, or not.

 

There's no basis for an "argument" (here or anywhere else), because my priorities are probably as irrelevant to you as yours are to me. Here, it's just an opportunity to share information.

Posted

I don't have any problem with letting whoever is responsible for "http://pmoz.info/doc/botinfo.htm" (the info page isn't up to date) know how the bot is being detected. Something leads me to believe it is known to the developer, and that the developer also knows the difference between a 301 (moved permanently) and a 403 (access denied). Rob, am I correct (if you are maintaining this link checker)?

 

 

Can anyone let me know where a DMOZ policy states that I must allow bots other than "zilla" to access my site in order to ensure my site won't be removed?

  • Meta
Posted

No, there's no such policy.

 

And there's no policy against ODP'ers using tools that unreview sites "highly likely to be among the dead".

 

Nobody is telling you a site won't ever be listed if it doesn't allow robot visitors. We can't know that. And nobody can tell you a site will be listed so long as it allows robots X, Y, and Z to visit. We can't know that either.

 

This isn't a courtroom, where it's better to let a hundred serial killers go free than to execute one innocent man. This is a question of: what's the best way to allocate ODP resources? Experience has shown it's better to unreview a live site than to leave 19 dead sites listed, and it's better to delete a live site outright than to leave 99 dead sites listed, because the automatic tools free the editors from constantly patrolling existing sites so they can review more unlisted sites (including, of course, the live sites unnecessarily removed).

Posted

Nope, I'm not about to double the number of HTTP requests I make just to read robots.txt for each URL. My bot does not crawl sites, does not cause a high server load, and does not otherwise make a nuisance of itself. I don't see any purpose whatsoever being served by reading robots.txt.

 

As techinfo noted, I use two different IP addresses (not just the perfora.net one) to check bad responses; only because of an overly restrictive access strategy did all three attempts yield an error response. Nearly a thousand listings go bad every day in the directory. In removing those, mistakes are inevitable - and reversible, by category editors. Overall, the system works quite well :)

 

Conversely, you could configure your site so that simple access requests are not blocked, but state-changing requests (e.g., POSTs) are subjected to greater scrutiny.

This topic is now closed to further replies.