Crawler

eyecon · Member · Joined: Jan 6, 2005 · Messages: 118
I would think that DMOZ has considerable bandwidth to work with. I'm curious why there are so many redirects and dead links that go undetected. The crawler cannot detect certain types of redirects, such as meta-refresh or JavaScript redirects, but it should identify most HTTP redirects and, certainly, the dead links. For example:

[dch@traif ~]# wget --spider --force-html http://list-business.com/opt-in-list-systems
--23:44:15-- http://list-business.com/opt-in-list-systems
=> `opt-in-list-systems'
Resolving list-business.com... 216.235.79.13
Connecting to list-business.com|216.235.79.13|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://emailuniverse.com/ [following]
--23:44:15-- http://emailuniverse.com/
=> `index.html'
Resolving emailuniverse.com... 216.235.79.13
Connecting to emailuniverse.com|216.235.79.13|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
200 OK
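
For scripted checks, wget's exit status is enough to flag a dead link. A minimal sketch, assuming a reasonably recent wget (which documents exit status 8 for a 4xx/5xx server response); the URL is a placeholder:

wget --spider -q http://example.com/dead-page
status=$?
if [ "$status" -ne 0 ]; then
    # Non-zero covers DNS failures, timeouts, and HTTP errors alike;
    # recent wget uses 8 specifically for server error responses.
    echo "dead or unreachable (wget exit status $status)"
fi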

sfromis · Member · Joined: Mar 25, 2002 · Messages: 202
Many redirects are OK, since we do not want to list the subpage-of-the-day that a base domain happens to redirect to. However, with improved QC efforts you should find the count of outdated URLs going down. :)
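
One way to tell a harmless same-site redirect from a genuine move is to look at the Location header. A minimal sketch, assuming curl is available; example.com is a placeholder:

# -s silences progress output, -I sends a HEAD request; the Location
# header, if present, shows where the listed URL redirects to, so a
# same-domain subpage can be told apart from a move to another domain.
curl -sI http://example.com/ | grep -i '^location:'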

sfromis · Member · Joined: Mar 25, 2002 · Messages: 202
Note that the report also says "the other tools are used in shorter periods or continuously". Link checking is much more than just Robozilla :)

eyecon · Member · Joined: Jan 6, 2005 · Messages: 118
sfromis said:
Note that the report also says "the other tools are used in shorter periods or continuously". Link checking is much more than just Robozilla :)

That MIGHT necessitate some robots.txt changes. What is the user agent string?
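
If the tools honour robots.txt, allowing them takes only one stanza once the token is known. A hypothetical sketch, since the actual user agent string is exactly what is being asked here:

# Hypothetical token; replace Robozilla with whatever is confirmed.
User-agent: Robozilla
Disallow:

An empty Disallow field grants that agent access to the whole site.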