Tech Owner Posted July 18, 2008

Thanks for the reply, Rob.

The use of our respective tools is clashing even though they are designed to achieve very similar end results. The tools used on my site are designed to block unknown bots. If the bot originates from a known problem data center, the connection is dropped at the firewall. Most attempts at "defacing" a site, hacking, spam and botnet activity have been tracked to these data centers. A bot originating from an ISP will get a 403 (access denied) page. This isn't the generic Apache 403 response page; the page states the reason for denial of access and includes an email address for those willing to proceed.

Log Files: (hutcheson, I know "His Honor" isn't present and the basics of the analogy actually apply to both sides)

grep 216.15.74.85 /home/web/tech-httpd-access.log
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Jan/2008:00:29:06 +0000] "GET / HTTP/1.1" 403 897
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [09/Mar/2008:18:00:40 +0000] "GET /node/7119 HTTP/1.1" 403 906
216-15-74-85.c3-0.tlg-ubr1.atw-tlg.pa.cable.rcn.com - - [19/Apr/2008:02:20:04 +0000] "GET /node/feed HTTP/1.1" 403 906

The bot in the log files received the "reason for denial of access" page. If a human checked the URLs using a tool that masks their identity, e.g. http://passport.dmoz.org/?pp_cat=pmoz.info&pp_tool=%2Findex.php5, the human would have received the "reason for denial of access" page.

Should an editor verifying a URL "jump through the hoops" I've put in place? (This one's a basic no-brainer.) No! You shouldn't. The tools in place were never designed or used to encumber regular visitors or editors. The tools are being used to free up resources, human and hardware.
(Let's see if we can reduce these drops in the bucket.)

There is open source software available that is almost impossible to detect when used to access a site to "GET" a few URLs, provided it doesn't originate from a known problem data center. It does leave a detectable "footprint" when used for malicious purposes.

Re: "Conversely, you could configure your site so that simple access requests are not blocked, but state-changing requests (e.g., POSTs) are subjected to greater scrutiny."

Rob, most of the tools I have in use are open source and are becoming popular on open source blogs and CMSs. The one trapping the "bot" is actually designed to catch comment spam. I could make changes as per your suggestion, but I'm just "a drop in the bucket" and I'm trying to reduce many drops in the bucket.

Regards
Meta pvgool Posted July 18, 2008

> Nearly a thousand listings go bad every day in the directory. In removing those, mistakes are inevitable - and reversible, by category editors. Overall, the system works quite well

From my experience looking at many of those removed sites, my estimate is that we don't even have one mistake every day and probably not one every week. I have also seen that if a site is marked several times by the tool as unavailable but is available to a human editor, rob takes special measures so that the tool does not mark it anymore.

As only some 1,000 of the roughly 4,500,000 listed sites go bad and are removed per day, it seems to me that most sites don't have those extreme protections which would get them removed wrongly. I draw the same conclusion from the very low number of wrongly removed sites among all the rightly removed sites.

> and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

Yes, for me personally (but I can't speak for any other editor or for the DMOZ as an organisation) I don't care about a few false positives.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
Tech Owner Posted July 18, 2008

Tools In Action

This thread is actually about the tools used by websites and, in this case, dmoz. I'm not claiming my use of tools is the only way they can be utilised. Having said that, I'd like you to see the output of a script that is designed to combat scrapers and site rippers. The IP used in this attempt will be denied access for a period of time. The time period is a random number of hours between 12 and 24. If the IP in question tries to access the site during the denied time period, they will receive a page letting them know when they will be allowed to access the site.

Scraper File:

18/07/2008 12:32:18 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:19 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:20 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:21 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:22 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:23 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:23 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:24 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:25 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:26 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:27 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:28 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:29 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:30 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:31 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:32 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:33 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:33 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:36 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:37 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
18/07/2008 12:32:38 64.140.245.236 Mozilla/4.0 compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.9 build 01863; .NET CLR 2.0.50727 fast scraper
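For readers who want to see how a "fast scraper" rule like the one described in this post might work, here is a minimal Python sketch. It is not the script used on the site. The class name `ScraperGuard`, the threshold of hits per window, and the window length are all assumptions made for illustration; only the random 12 to 24 hour ban comes from the post.

```python
import random
from collections import defaultdict, deque


class ScraperGuard:
    """Hypothetical sketch: ban an IP that requests pages too quickly."""

    def __init__(self, max_hits=10, window_seconds=5, rng=None):
        self.max_hits = max_hits          # assumed threshold, not from the post
        self.window = window_seconds      # assumed window, not from the post
        self.rng = rng or random.Random()
        self.hits = defaultdict(deque)    # ip -> timestamps of recent requests
        self.banned_until = {}            # ip -> time (seconds) the ban ends

    def check(self, ip, now):
        """Return (allowed, banned_until) for a request from `ip` at time `now`."""
        until = self.banned_until.get(ip)
        if until is not None:
            if now < until:
                # Still banned: the caller can show "access allowed again at ..."
                return False, until
            # Ban has expired, start with a clean slate.
            del self.banned_until[ip]
            self.hits[ip].clear()

        recent = self.hits[ip]
        recent.append(now)
        # Forget requests that are older than the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()

        if len(recent) > self.max_hits:
            # Ban for a random whole number of hours between 12 and 24.
            hours = self.rng.randint(12, 24)
            until = now + hours * 3600
            self.banned_until[ip] = until
            return False, until
        return True, None
```

With the assumed defaults, a client like the one in the scraper file (several requests per second) would trip the limit within a few seconds, while a human clicking through pages would not. The returned `banned_until` value is what a "you may return at ..." page would display.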
plantrob Posted July 19, 2008

> Rob most of the tools I have in use are open source and are becoming popular on open source Blogs and CMSs. The one trapping the "bot" is actually designed to catch comment spam.

Interesting... I would think that malicious bots are the ones that try to be sneaky and assume a common browser's user agent - while above-board bots provide a means for tracing/explaining their purpose. But I've never ventured into comment spam practices, so what do I know
Editall Callimachus Posted July 19, 2008

Please note the "AFAIK" in my previous response. It stands for "as far as I know". Only AOL tech and some admin level editors have access to the code for Robozilla, and those editors who write their own tools may or may not have released it to general editorial access.

I have a bot trap on my own website that works well and simply. It will catch ill behaved bots, even those that ignore robots.txt (the reading of which does not guarantee proper behaviour - indeed it can be used for the opposite purpose), and even some ill behaved visitors.

ODP Editor callimachus
Any opinions expressed are my own, and do not represent an official opinion or communication from the ODP. Private messages asking for submission status or preferential treatment will be ignored.
Tech Owner Posted July 20, 2008

Callimachus,

When your response was quoted, "AFAIK" wasn't omitted.

This thread has become very informative for site owners, editors and anyone reading who is willing to try and understand our use of tools. Neither use of our respective tools is perfect, nor is anyone claiming they are. They do, however, free up needed resources for those using them.

"I have a bot trap on my own website ....."

I started with a bot trap in the early stages of the site. However, as the site grew we were picked up by Google News, Slashdotted many times, and picked up by Digg and others. The log files revealed that regular visitors were only using 1/4 of the total bandwidth used each month. 3/4 of the bandwidth was being used by bots, scrapers, proxies, botnets and other undesirables. This was too much for me, and the hosting provider wanted the equivalent of an entry-level dedicated server, or the account was going to be closed.

A server was selected to provide the site with the tools and resources needed to combat what the undesirables had done. The site had been copied and was available on many different data centers with my AdSense code replaced with that of someone else. In my eyes this meant war, nothing short of full scale combat.

Time For Tools (Combat Fatigues Optional)

A script that was designed to stop comment spam, but does have many other capabilities, was selected first. Another script was implemented to combat scrapers and site rippers (scrapers/rippers copy all or part of a page or site). The scripts were and are a very good starting point, but protection/security/combat is an ongoing building process, and security was identified as the next level that needed to be addressed. Mod_Security, an Apache module, was added to the arsenal.

Log Files

The log files from all the scripts and the module were compared. The comparison would surprise those who thought they knew, me included.
90% of the recurring problems were coming from a select few easy-to-identify locations:

- Hosting data centers worldwide (hosting includes website hosts, co-location servers and dedicated servers)
- Non-democratic countries (that isn't a political statement)
- Countries with "shaky" (I can't think of another way to describe this group) environments

The Arsenal Expands

The last addition has been a firewall with a very basic configuration. The server (hardware) was now going to be protected. A decision was made, by me, that the site was for regular/human visitors. The scripts and module mentioned above don't stop server access. Successful attacks are still possible if a server can be reached. IP ranges of the recurring problems were loaded into the firewall. The firewall is configured to "Drop Connection" when an undesirable IP range tries to gain access. Drop Connection, with no server response, is like hitting an unknown black hole in cyberspace: full scale combat.

Oops, The End Result Of Full Scale Combat Includes Collateral Damage

The scripts blocked "zilla" and later the firewall blocked pmoz and anything that originated from a known recurring problem IP range.

Can I change the past? No.

Should an exception be made for a small group of sites that have restrictive/non-human access controls? That isn't my decision to make.

Did I block more than just the "directory" trying to do the best they can?
Yes, take a look:

Mod_Security log file:

Request: http://www.mysite.com 204.146.25.202 - - [24/May/2008:06:12:55 +0000] "HEAD /node/85$
----------------------------------------
HEAD /node/852 HTTP/1.1
Accept: application/octet-stream, application/*, audio/*, image/gif, image/jpeg, image/pjpeg, image$
Accept-Language: en-us
Cache-Control: no-cache
Connection: Keep-Alive
Content-Length: 0
Host: http://www.mysite.com
Pragma: no-cache
Referer: http://www.ibm.com/developerworks/blogs/page/woolf?tag=j2ee
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)
mod_security-action: 403
mod_security-message: Access denied with code 403. Pattern match "!^$" at HEADER("Content-Length") $

HTTP/1.1 403 Forbidden
Keep-Alive: timeout=15, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
--06a71e5c--
_____________________________________________________________________________________

The above is IBM trying to verify if a page linked to from one of their developer's blogs is available.

Conclusion

I learn, listen and go forward knowing that one day I am going to "shoot myself in the foot" again. I hope the non-tech oriented who have chosen to read the above understand it.

Regards
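The firewall step in this post, where the connection is dropped for any address inside a loaded list of problem IP ranges, comes down to a range membership test. The Python sketch below illustrates that idea only; it is not the poster's firewall configuration. The function name `firewall_action` and the two ranges are assumptions (the ranges are reserved documentation addresses, not real data centers).

```python
import ipaddress

# Hypothetical block list. These are reserved documentation ranges,
# used here only as stand-ins for "known problem" data center ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def firewall_action(ip, blocked=BLOCKED_RANGES):
    """Return "drop" if `ip` falls inside any blocked range, else "pass"."""
    addr = ipaddress.ip_address(ip)
    if any(addr in net for net in blocked):
        # "Drop" means no response at all: the client sees a timeout,
        # not a 403 page explaining the denial.
        return "drop"
    return "pass"
```

The sketch also shows why collateral damage is easy to cause: the decision looks only at where a request comes from, so a well-behaved link checker hosted inside a blocked range is dropped exactly like a scraper in the same range.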
Stern123 (Author) Posted April 17, 2009

Hello... Newman!

The more thoughtful developers of spiders out there are taking the extra step of reassuring webmasters with identifiable IP ranges and reverse DNS checks and all that. While you may not have the resources to take it that far, wouldn't it be better to use a DMOZ or AOL IP at the very least? Sure, pmoz may have a unique agent, but we also saw a fraudulent Googlebot from the same hosting range, so clearly, the user agent (which can so easily be faked) is not a sign of trust whatsoever. In my opinion, it's actually quite Newman-like to shrug off any culpability for your dubious neighbors and see us as just a drop in the bucket

Update: Our server continues to receive suspicious activity from pmoz's hosting server -- including a hit by "Morfeus F***ing Scanner" ... No, I'm not being obscene -- that is the exact (albeit censored) name and user agent of a hacking tool that is scanning for PHP vulnerabilities.

To summarize, ODP relies on a scanning tool from a bad "neighborhood" containing the likes of "Morfeus F***ing Scanner" and other potential sources of spamming and hacking activity. BUT the manager of that tool has expressed a total lack of concern, and so webmasters cannot take the precaution of denying visits from that bad neighborhood IF they wish to remain listed in DMOZ.

Please note that no reputable global web-based organization would use a tool like pmoz.info and risk associating itself with shared hosting and Morfeus F***ing Scanners. I once again beseech the powers-that-be to review this thread and eliminate DMOZ's dependency on pmoz.info. There is the alternative -- use only your reliable and professional tool called "Robozilla" which comes from dmoz.aol.com and which can be safely accepted by all websites.
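The "reverse DNS checks" mentioned in this post usually mean a forward-confirmed reverse lookup: resolve the visiting IP to a hostname, check that the hostname belongs to the crawler's domain, then resolve that hostname forward and confirm it maps back to the same IP. A rough Python sketch follows. The function name `verify_crawler` and the injectable `reverse`/`forward` parameters are assumptions added so the logic can be tried without network access; the default lookups use the standard library's `socket.gethostbyaddr` and `socket.gethostbyname_ex`.

```python
import socket


def verify_crawler(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check (illustrative sketch).

    `allowed_suffixes` lists the domains the crawler is expected to come
    from. `reverse` and `forward` are optional lookup functions; they are
    hypothetical hooks for testing, defaulting to real DNS lookups.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse record at all

    # The hostname must be the expected domain or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        # The forward lookup must lead back to the same address, otherwise
        # anyone controlling their own reverse zone could fake the name.
        return ip in forward(host)
    except OSError:
        return False
```

A check like this is what makes a faked user agent useless on its own: a request claiming to be a well-known crawler but arriving from an unrelated hosting range fails at the suffix test.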
Meta tschild Posted April 17, 2009

Does "from pmoz's hosting server" mean "from the same IP that pmoz.info resolves to"?
Meta pvgool Posted April 17, 2009

As Morfeus is a well-known and old scanner that is running from many servers all over the world, you should either block that specific scanner (and not some IP number from which one of the scanners is running at the moment) or, better yet, keep your server up to date and safe from possible attacks.

Blaming others for things they can not control will be of no help.

pmoz is doing a very good job for DMOZ. Accidentally marking a few sites as not accessible is not a problem for us. Much better than letting the real errors exist. But that has been explained to you before.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
Stern123 (Author) Posted April 17, 2009

> Blaming others for things they can not control will be of no help.

I don't comprehend the defensive and defiant attitude. I am not blaming you for your neighbors. I am questioning the choice of setting up shop in a 3rd party spammy neighborhood, instead of better alternatives and suggested improvements. I've written so many posts about this, I don't know how else to clarify myself and stop you from running me around in circles.

> Much better than letting the real errors exist. But that has been explained to you before.

Yes, but I already addressed that when I wrote:

> You evade the point. You make it sound like it's a choice of either a) a few sites incorrectly removed, or b) a few sites not removed correctly, and that's misleading. <SNIP> ...But I think you know that, and what you're really trying to say is that you won't put the extra time to upgrade the bot, because you don't care about the few false positives. Let's be honest.

Which you acknowledged when you wrote:

> Yes, for me personally (but I can't speak for any other editor or for the DMOZ as an organisation) I don't care about a few false positives.

So for the sake of everyone's time and patience, please stop misleading everyone that this is a black-and-white choice of either "accidentally marking a few sites as not accessible" OR "letting the real errors exist". Because that's simply evasive and dishonest.
jimnoble Posted April 17, 2009

There are several possibilities here including:

1. The volunteer author of the excellent pmoz.info tools goes to the trouble of moving them to a better hosting neighbourhood - which might or might not remain that way in perpetuity.
2. You adjust your blocking strategy to permit access from pmoz.info.
3. Nobody does anything.

Please don't assume that the tools' volunteer author has the resources available to make changes in any meaningful time scale. You're the guy motivated to get something done so your most realistic solution is to take option 2.

That isn't a defensive or dishonest attitude, it's logic.
Stern123 (Author) Posted April 17, 2009

> You're the guy motivated to get something done so your most realistic solution is to take option 2.

Agreed, and I have already implemented that. But the point is that it's a bandaid solution, and I'm here to make a reasonable criticism of the situation plus suggestions for improvements.

> That isn't a defensive or dishonest attitude, it's logic.

I consider it defensive/dishonest if the primary responses are 'no, you change' or 'we're blameless, stop blaming us' and 'we don't care' and so forth. More honest is a balanced response (like yours) along the lines of 'Yes, we acknowledge this, and we are interested in moving forward, but it will take time, so in the meantime, please consider this temporary solution'. Unfortunately, many came across to me as overly-defensive, lazy and/or uncaring -- a far cry from progressive thinking.

To expand on your list above, I would wish to suggest one or more of the following:

- Move pmoz to any other shared hosting which genuinely enforces a no-spam policy
- Move pmoz to dmoz.org or dmoz.aol.com (like Robozilla) for authenticity and verifiable reverse DNS checks (like Yahoo, Google, or any other professional search engine/directory)
- Just stop running pmoz BUT continue to use Robozilla to auto-tag websites
- Use pmoz as a secondary authority; only auto-remove sites if Robozilla also "agrees"
- Use pmoz to auto-remove a site only if it gets consecutive 404s (page not found)
- If pmoz gets a 403 (forbidden) or 50x (error) response code, always defer to Robozilla as the higher authority

Some of these changes are more time and resource intensive, others can be relatively quick and easy to implement, but any of the above is worthwhile, unlike the time that it takes to argue in circles over non-sequiturs. To those who protest about protecting the integrity of DMOZ, I can't imagine why any of these improvements would compromise ODP's ability to weed out bad websites.
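The last two suggestions in this list describe a decision rule that is easy to state in code. The Python sketch below is only an illustration of that proposed rule; it does not describe how pmoz or Robozilla actually work. The function name `decide`, the returned labels, and the default of three consecutive 404s are assumptions, since the post says "consecutive" without giving a number.

```python
def decide(status_history, required_404s=3):
    """Suggest an action from a site's recent HTTP status codes.

    `status_history` is a list of status codes, oldest first.
    Returns "remove", "defer" or "keep" (hypothetical labels).
    """
    if not status_history:
        return "keep"

    latest = status_history[-1]
    # 403 and 5xx may mean "the checker is blocked" or "temporary fault",
    # so the proposal is to hand the decision to a higher authority.
    if latest == 403 or 500 <= latest <= 599:
        return "defer"

    # Only an unbroken run of 404s at the end of the history counts.
    tail = status_history[-required_404s:]
    if len(tail) == required_404s and all(code == 404 for code in tail):
        return "remove"
    return "keep"
```

For example, `decide([200, 404, 404, 404])` suggests removal, while a single 403 at the end of the history, as a site with strict bot blocking would return, leads to "defer" instead.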
Meta pvgool Posted April 17, 2009

> Because that's simply evasive and dishonest

No it is not. We have the choice between running tools developed by editors (like the one you see from pmoz) or not running them. We don't have access to the AOL servers so we can not put our tools on those servers. Then the choice for us is easy. Run the tools. Accept that they are not perfect. Each month several thousand sites get removed by these tools. That is good. If they also remove a few sites by accident, so be it.

All sites marked by the tools are marked for an editor to review. If during such a review the editor notices that the tool was wrong, he can mark the site so the tools will know about their mistake. That is all fine for us. And we have no problem with this process. Will the owner of such a site think we have a problem? Possibly yes, as you have shown. But with several million sites listed, several million sites being rejected over the years and millions of sites waiting for review (either being suggested or not), the status of a single site is irrelevant for us.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
Stern123 (Author) Posted April 17, 2009

> > Because that's simply evasive and dishonest
> No it is not.

Yes, it is, because...

> We have the choice between running tools developed by editors (like the one you see from pmoz) or not running them

3rd option: Improve the tools (see above)
4th option: Run the tools from a better host (see above)
5th option: Defer to Robozilla (see above)
etc. etc.

With your black-and-white stasist attitude, we'd only be allowed to use Windows 95, Apple Macintosh, or no computer at all!

I hope that Zhuo Zhang et al. will take a more progressive open-minded approach to DMOZ 2.0!
Meta pvgool Posted April 17, 2009

> 'Yes, we acknowledge this, and we are interested in moving forward, but it will take time, so in the meantime, please consider this temporary solution'

No, I do not acknowledge that what you have mentioned in this thread is a problem for DMOZ. Yes, we are interested in moving forward. It is just that we prefer to move forward on different paths. Everything that will improve the way editors can work and that would improve the directory as a whole has our priority. A small issue of some sites being marked in error is not one of them.

> -Move pmoz to any other shared hosting which genuinely enforces a no-spam policy

How do you know it isn't already hosted on such a server? The Morfeus thing is most probably run from that server because one of the sites on that server got hacked. Did you already contact the hosting provider?

> -Move pmoz to dmoz.org or dmoz.aol.com (like Robozilla) for authenticity and verifiable reverse dns checks (like yahoo, google, or any other professional search engine/directory)

Not possible. We don't have access to those servers.

> -Just stop running pmoz BUT continue to use Robozilla to auto-tag websites

Not realistic. I repeat again: these tools do very useful work for DMOZ and its editors.

> -Use pmoz as a secondary authority; only auto-remove sites if Robozilla also "agrees"

Not possible. We do not control Robozilla. AOL does.

> -Use pmoz to auto-remove site only if it gets consecutive 404s (page not found)
> -If pmoz gets a 403 (forbidden) or 50x (error) response code, always defer to Robozilla as the higher authority

This would imply an interaction between the tools. There isn't one, and we don't have the options and resources to make these changes.

> but any of the above is worthwhile

In your opinion yes. In my opinion no. We have many more worthwhile improvements to make. And as we are very limited in resources we have to decide what to do and what not to do.
The list of improvements is already large. Improvements that will make the life of editors easier or improve the overall quality of the directory will have the highest priority. I can't see that we will ever make low priority improvements as we just don't have the resources.

> I can't imagine why any of these improvements would compromise ODP's ability to weed out bad websites.

Most probably because you don't have insight into the internal workings of DMOZ.

I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
jimnoble Posted April 17, 2009

The OP no longer has a problem and thus no longer needs us to fix anything.

His suggestions may or may not be possible to implement, but it's for sure that AOL won't be spending any money/effort on this topic and that there's no way to force the pmoz.info owner to do so either.

The thread would thus seem to have reached its natural conclusion - an agreement to disagree. Closing.