
Posted

I would think that DMOZ has considerable bandwidth to work with. I'm curious why there are so many redirects and dead links that go undetected. The crawler cannot detect certain types, but it should identify most of the redirects and, certainly, the dead links. For example:

 

[dch@traif ~]# wget --spider --force-html
--23:44:15--
           => `opt-in-list-systems'
Resolving list-business.com... 216.235.79.13
Connecting to list-business.com|216.235.79.13|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: [following]
--23:44:15--
           => `index.html'
Resolving emailuniverse.com... 216.235.79.13
Connecting to emailuniverse.com|216.235.79.13|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
200 OK
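
The same kind of check can be scripted over a whole list of URLs. Here is a minimal sketch, assuming a urls.txt file with one URL per line (the file name and the loop are mine, not how the editors' tools actually work):

#!/bin/sh
# Check each URL without downloading it; --max-redirect=0 makes wget stop at
# the first hop, so a 301/302 is reported as a failure instead of being followed.
while read -r url; do
    if wget --spider --max-redirect=0 -q "$url"; then
        echo "OK      $url"
    else
        echo "CHECK   $url"    # redirect, dead link, or other error
    fi
done < urls.txt

Anything flagged CHECK would then need a manual look to see whether it is a harmless redirect or a genuinely dead listing.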

Posted
Many redirects are OK, as we do not want to list the subpage-of-the-day that the base domain redirects to. However, with the improved QC efforts you should find the count of outdated URLs going down. :)
Posted
Note that the report also says "the other tools are used in shorter periods or continuously". Link checking is much more than just Robozilla :)

 

That MIGHT necessitate some robots.txt changes. What is the user agent string?
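
If the checker's user-agent has to be allowed explicitly, the robots.txt change would be small. A minimal sketch, assuming the token is "Robozilla" (substitute whatever user-agent string the link checker actually sends):

# Append a rule allowing the link checker everywhere (empty Disallow = allow all).
cat >> robots.txt <<'EOF'
User-agent: Robozilla
Disallow:
EOF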
