Dead and otherwise non 200 OK links

tachyon

Member
Joined
Jul 12, 2004
Messages
16
Don't know where else to put this. I trawled the 4+ million odd links in DMOZ over the weekend. About 7.5% return something other than a 200 OK response code. The actual URLs in each error category are available at http://devel3.smartsurf.org/dmoz/ if this is of any use to anyone.

I don't know if there is any form of QA, but pulling the 404 list, matching each category to its editor, and emailing them a note that links foo, bar and baz are 404 might be worthwhile.

BTW are any god level editors interested in a list of DMOZ sites that are not in Top/Adult that have a statistical probability of > 99.5% sensitivity/specificity of currently rendering porn?

Code:
[root@devel3 dmoz]# wc dmoz-*
1       1      35 dmoz-10
1       1      32 dmoz-100
1       1      21 dmoz-201
1       1      26 dmoz-202
6       6     218 dmoz-204
4       4      87 dmoz-205
1       1      18 dmoz-299
859     859   38420 dmoz-401
16      16     520 dmoz-402
7656    7656  257221 dmoz-403
123068  123068 7049452 dmoz-404
1       1      23 dmoz-406
3       3     102 dmoz-407
1       1      25 dmoz-408
327     327   10717 dmoz-410
18602   18602 1091169 dmoz-414
1       1      83 dmoz-415
3       3     103 dmoz-418
4       4     135 dmoz-419
1       1      24 dmoz-420
40      40    1588 dmoz-423
7       7     178 dmoz-449
109927  109929 4064789 dmoz-500
84      84    2932 dmoz-501
103     103    5403 dmoz-502
873     873   38363 dmoz-503
1       1      25 dmoz-5030
27718   27718 1059829 dmoz-504
2       2      48 dmoz-507
247     247   10874 dmoz-508
15891   15891  721798 dmoz-510
1       1      34 dmoz-550
20      20     997 dmoz-999
305471  305473 14355289 total
[root@devel3 dmoz]#
 

andysands

Curlie Meta
Joined
Nov 24, 2003
Messages
698
We do have a list of non 200 sites generated as internal QA, in a checklist form (where editors can check off links they've verified). As you can tell from your exercise - getting up to date is a mammoth task though!

- re: inappropriate porn links

I am sure they would be. Removing inappropriately placed adult content is always a priority.
 

hutcheson

Curlie Meta
Joined
Mar 23, 2002
Messages
19,136
We do have an internal automatic link checker, which is run periodically. But ... you're showing lots more nonrespondents than we typically do.

One thing our process does is recheck the failing sites after about a week. This keeps us from chasing after sites that are temporarily down.
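The recheck step described above can be sketched roughly as follows. This is only an illustration, not the internal link checker: the function names, the seven-day window (taken from "about a week"), and the `still_failing` callback are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumption: "about a week" is modelled as exactly seven days.
RECHECK_AFTER = timedelta(days=7)


def due_for_recheck(first_failures, today):
    """Return URLs whose first recorded failure is at least RECHECK_AFTER old.

    first_failures maps URL -> date of the first failed check.
    """
    return [url for url, failed_on in first_failures.items()
            if today - failed_on >= RECHECK_AFTER]


def confirmed_dead(first_failures, today, still_failing):
    """Return URLs that waited out the recheck window and fail again.

    still_failing is a hypothetical callback (URL -> bool) that performs
    the second check; a real one would issue an HTTP request.
    """
    return [url for url in due_for_recheck(first_failures, today)
            if still_failing(url)]
```

Only links that fail both checks would be reported to an editor, so a site that was down for a day never reaches the list.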

Another thing that seems odd is the number of 500 sites. Is it possible that a lot of sites are bouncing your spider simply because they don't know it?

Finally, your "pure porn filter" idea seems VERY interesting. Yes, I believe we would very much like to see a sample run for it. If it really turns out to be effective, we might grovel _really_ hard to get you to donate it to our editor toolbin.
 

tachyon

Member
Joined
Jul 12, 2004
Messages
16
As you correctly note, not all the 500 errors are 'real', although they are not bounces per se. We use 500 as the return code for a few non-classical 500 events. There were, however, 83,899 straight socket connect failures and 4,263 Internal Server Errors.

The porn engine is happily running in the background at the moment. It is a variant on the Fisher/Robinson inverse chi-square statistical model. I just kicked it off this morning - it will take just over 3 days to complete. Currently it is seeing 1.6% porn across DMOZ, so it is going to end up with 60-70,000 odd URLs at the end. Where/to whom should I post the results when they arrive?
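For readers unfamiliar with the Fisher/Robinson inverse chi-square approach mentioned above, here is a minimal sketch of the textbook form of the combining step, as popularised by Bayesian spam filters. It is not the variant running here, whose details are not given in this thread. The per-token probabilities are assumed to come from a trained model that is not shown, and the clamp value `eps` is an assumption.

```python
import math


def chi2q(x2, df):
    """Upper-tail probability of a chi-square variable; df must be even."""
    assert df > 0 and df % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_robinson(probs, eps=1e-6):
    """Combine per-token probabilities into one score between 0 and 1.

    probs holds, for each token, the estimated probability that a page
    containing it belongs to the target class. Scores near 1 mean
    "almost certainly in the class", near 0 "almost certainly not".
    """
    if not probs:
        return 0.5  # no evidence either way
    clamped = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(clamped)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clamped), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clamped), 2 * n)
    return (1.0 + s - h) / 2.0
```

A page would then be flagged when its score exceeds some cut-off. Very long token lists can underflow `math.exp`, so a production version would work in log space.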
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
Maybe the link should be hidden from public view and he should send a private message to you instead.
 

brmehlman

Member
Joined
Nov 6, 2002
Messages
3,080
Hmmm. Lemme think about that one. We're ok with posting one or a few dozen porn hijack links in the abuse forum, but maybe sixty thousand would be a bit much.

OK, private message to me would probably be better.
 

tachyon

Member
Joined
Jul 12, 2004
Messages
16
I have sent you, hutcheson and brmehlman a few links and some details. I don't have time to fill in a single-URL form a few thousand times, and would feel like I was spamming you if I automated it to send you thousands of URLs. As you can see, the results are interesting and there is certainly attention required.

I have only been on this forum for a day but have noted that you don't like links posted per se, so am wondering about the best way to deal with it. As you can see, there is porn outside Top/Adult, as you might expect with all the BS that goes on with people trying to make a buck.

Is DMOZ like the mythical Medusa? Should I forward info privately to the Meta Admins who have commented here, or is there somewhere/someone else I should contact?
 

brmehlman

Member
Joined
Nov 6, 2002
Messages
3,080
The sample you sent me looks very promising indeed. I don't have time to go into it in any detail right now, I'm on my way to work and I need my job too much to be evaluating porn links from my desk there.

My opinion: The best thing would be to post the list on a web page and send the link privately to a few of us with the understanding that we'd share it with senior editors who have directory-wide access.
 

tachyon

Member
Joined
Jul 12, 2004
Messages
16
Yes, I thought the results were interesting too. The "under construction" sites with a hundred hidden porn links are interesting. I have seen the odd one before but..... I wonder what search engine spider they are optimised for?

I had to shut it down for a while to do some other things. It should get a pretty clear run for the next couple of days. About 20% of the way through at 800K. Looks like it will flag 65K as expected. It should be statistically representative at this stage.
Code:
mysql> select count(*) from dmoz where porn >0.95;
+----------+
| count(*) |
+----------+
|    11913 |
+----------+
1 row in set (1 min 9.54 sec)

mysql> select count(*) from dmoz where flags & 1 = 1;
+----------+
| count(*) |
+----------+
|   812358 |
+----------+
1 row in set (1 min 7.31 sec)

mysql> select count(*) from dmoz;
+----------+
| count(*) |
+----------+
|  4426678 |
+----------+
1 row in set (0.00 sec)

mysql> select 4426678*(11913/812358) as predict;
+----------+
| predict  |
+----------+
| 64915.98 |
+----------+
1 row in set (0.01 sec)

Before you ask, there are no indexes on the porn and flags fields while the insert is happening, so those queries are glacial. Thanks for the heads up on the 500. Just fixed an interesting bug where a couple of domains (with lots of subdomains) don't send any headers if you admit to being User-Agent: Mozilla blah blah. Quite odd really, as they send the usual HTTP headers if the User-Agent is not a valid Mozilla/Netscape/IE agent string.
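The extrapolation in the queries above can be reproduced outside MySQL, along with a rough error bar. This is a sketch only: it assumes the rows checked so far behave like a simple random sample of the whole table, which may not hold if the crawl walks the directory in order. The function name and the 1.96 multiplier (a normal-approximation 95% interval) are choices made for the example.

```python
import math


def extrapolate(flagged, checked, total, z=1.96):
    """Scale the flagged rate seen so far up to the full table.

    Returns (predicted, low, high), where low/high bound the prediction
    using a normal approximation to the binomial proportion.
    """
    rate = flagged / checked
    # Standard error of a proportion, assuming independent random samples.
    se = math.sqrt(rate * (1.0 - rate) / checked)
    return total * rate, total * (rate - z * se), total * (rate + z * se)


# Counts taken from the query output above.
predicted, low, high = extrapolate(flagged=11913, checked=812358, total=4426678)
print(round(predicted), round(low), round(high))
```

The point estimate matches the `predict` column; the interval only reflects sampling noise, not classifier error.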

I will send you a link when it completes in a couple of days time. If you want to see some more interim results let me know. Do you want it .htaccess password secured or just security by obscurity?
 

motsa

Curlie Admin
Joined
Sep 18, 2002
Messages
13,294
What I had meant when I said submit an abuse report was to submit a link to a page that lists the sites in question (i.e. what brmehlman had suggested) rather than emailing said link to someone in particular. Sorry if that was confusing. :) But emailing will work just as well as you know that we're paying attention to this.

I'm looking forward to seeing the full results as a random checking of the samples you sent showed they were definitely porn.

Security by obscurity should be enough, although feel free to use htaccess if you feel more comfortable with that.
 

tachyon

Member
Joined
Jul 12, 2004
Messages
16
The first batch of results is now up. Looking at a random selection, there is a good chunk of *interesting* sites, without too many false positives. I have sent private messages to the admins on this thread as well as the link to abuse, so that should cover all the bases.

Thanks for your suggestions.
 