Spidering the DMOZ Database

lucasmd

Member
Joined
Aug 2, 2004
Messages
24
Hi,

I opened a topic today to suggest a partnership with your community.

I have developed a search engine based on the ODP. I am crawling the pages of all the websites registered in your database, and using the category structure as well.

I was thinking of extending my project to a community of crawlers, because my bot can scan up to 150,000 pages per day from a low-cost computer.

But it seems that you have deleted my previous post. My first question is: why? Perhaps you do not want to have such a discussion?

Best regards
Luc Michalski
 

makrhod

Member
Joined
Apr 5, 2004
Messages
1,899
it seems that you have deleted my previous post. My first question is: why?
It was not deleted. The forum software automatically flagged it as needing moderation because of the number of URLs you included, which is against this forum's TOS. It is still awaiting moderation.
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
Hopefully your enthusiastic crawler obeys robots.txt and the related HTML meta protocols.
 

lucasmd

Member
Joined
Aug 2, 2004
Messages
24
Hi,

Yes, of course. I am analyzing robots.txt and meta information (NOINDEX, NOFOLLOW, ...).

Currently, I am working on a function for fetching sitemap.xml files.
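
As an illustration of the checks being described, here is a minimal sketch in Python (the project itself is PHP/MySQL, so this is not the poster's code) that honours NOINDEX/NOFOLLOW meta tags and tries the conventional /sitemap.xml location:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class RobotsMetaParser(HTMLParser):
        """Collect the directives of <meta name="robots"> tags."""
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
                for token in (attrs.get("content") or "").upper().split(","):
                    self.directives.add(token.strip())

    def may_index_and_follow(html_text):
        """Return (may_index, may_follow) for one fetched page."""
        p = RobotsMetaParser()
        p.feed(html_text)
        return ("NOINDEX" not in p.directives,
                "NOFOLLOW" not in p.directives)

    def fetch_sitemap(site_root):
        """Try the conventional /sitemap.xml location; None if absent."""
        try:
            with urlopen(site_root.rstrip("/") + "/sitemap.xml", timeout=10) as resp:
                return resp.read()
        except OSError:
            return None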

I do not know how I should contact the ODP staff to suggest a partnership. Is there an interested administrator? :)

Best Regards
Luc Michalski
 

chaos127

Curlie Admin
Joined
Nov 13, 2003
Messages
1,344
From what you've posted, I'm not sure if you're planning to get a list of the ODP-listed sites by crawling the pages on Curlie. But in case you are, https://curlie.org/robots.txt states:
# Please do not crawl us faster than 1 hit/second.
#
# If you need to examine many dmoz pages, please download the rdf file from
# http://rdf.dmoz.org/ instead of crawling us.
#
User-agent: *
Crawl-Delay: 1
Disallow: /cgi-bin/
Disallow: /editors/
Disallow: /World/.m
Failure to follow those instructions if/when you access Curlie may result in your IP being banned...
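
For illustration, a minimal sketch of obeying those directives with Python's standard urllib.robotparser, which understands both the Disallow and Crawl-Delay rules quoted above (crawl_delay() requires Python 3.6 or later):

    import time
    from urllib.request import urlopen
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://curlie.org/robots.txt")
    rp.read()                                  # fetch and parse the rules above

    def polite_fetch(url, user_agent="mybot"):
        """Fetch url only if robots.txt allows it, pausing per Crawl-Delay."""
        if not rp.can_fetch(user_agent, url):
            return None                        # e.g. /editors/ is disallowed
        delay = rp.crawl_delay(user_agent)     # -> 1 (second) for curlie.org
        if delay:
            time.sleep(delay)
        with urlopen(url, timeout=10) as resp:
            return resp.read()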
 

chaos127

Curlie Admin
Joined
Nov 13, 2003
Messages
1,344
I'm also not sure exactly what you're proposing, how it would benefit anyone in the "community" who might feel like helping you, and specifically how it might benefit the Open Directory Project. If you're interested in some sort of partnership with the ODP, then you'll presumably need to offer something in return. To be honest, though, most outsiders' views of the problems facing the project are quite inaccurate. You'd get a much better view of things if you tried your hand at editing for a few months...
 

lucasmd

Member
Joined
Aug 2, 2004
Messages
24
Hi,

Before I try to explain our potential partnership, you need to know that:
- The first idea of this search engine is/was an educational one.
- How does a search engine with a high volume of indexed pages work?
- How to create ranking criteria for optimizing searching and crawling (server rank, link rank, site rank)?
- How to crawl the web efficiently and quickly with a low-cost computer?
- How to keep the best performance with a MySQL and PHP system?
- How to understand the technical limits of MySQL databases and find a flexible database layout?
- How to reduce/control the energy costs of a potential server farm?
- How to convert DMOZ's content file (RDF) into an SQL dump? (a sketch follows this list)
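
For the last question in that list, here is a minimal sketch, in Python for illustration (the project itself is PHP/MySQL), of streaming the DMOZ content dump (content.rdf.u8) into an SQL dump. It assumes the ExternalPage/Title/Description/topic element layout of the DMOZ RDF and is not the poster's actual converter:

    import sys
    import xml.etree.ElementTree as ET

    def local(tag):
        """Strip the XML namespace, e.g. '{http://purl.org/...}Title' -> 'Title'."""
        return tag.rsplit("}", 1)[-1]

    def esc(text):
        """Escape a value for a single-quoted MySQL string literal."""
        return (text or "").replace("\\", "\\\\").replace("'", "\\'")

    def rdf_to_sql(rdf_path, out):
        out.write("CREATE TABLE sites (url VARCHAR(255), title TEXT,\n"
                  "                    description TEXT, topic VARCHAR(255));\n")
        # iterparse streams the file, so the multi-hundred-megabyte dump
        # never has to fit in memory at once.
        for _, elem in ET.iterparse(rdf_path, events=("end",)):
            if local(elem.tag) == "ExternalPage":
                fields = {local(child.tag): child.text for child in elem}
                out.write("INSERT INTO sites VALUES ('%s','%s','%s','%s');\n" % (
                    esc(elem.get("about")), esc(fields.get("Title")),
                    esc(fields.get("Description")), esc(fields.get("topic"))))
                elem.clear()               # release the processed element

    rdf_to_sql("content.rdf.u8", sys.stdout)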

So the target is not to be the best, but to try to do the same in order to understand the how-to. Well, I have no diploma <url removed>.

After finishing the first beta version of my search engine, I thought of linking an e-commerce portal to the search engine in order to provide more information to users (web results/products).

I started developing the DealGates Network 4 years ago...

About our potential partnership:

Why?
I used most of your category structure and I have converted your list of websites into a MySQL directory...

This technology can be an extension of your directory... Of course, you have Google, but the idea is to try, because it only costs time, and to find some passionate webmasters.

I can create an encrypted bot and share it with a community of webmasters running any LAMP server. A list of 500 or 1,000 websites to crawl would be sent to each bot, and a system would send the dump files to a special FTP account (a sketch follows below).
If we established a community of 10,000 webmasters, we could crawl a billion pages very quickly (at 150,000 pages per day per bot, 10,000 bots would cover roughly 1.5 billion pages a day).
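
A rough sketch of how that batch protocol might look, in Python for illustration; every host name, path, and credential below is an invented placeholder, and the real bot would presumably run as PHP on the webmasters' LAMP servers:

    import ftplib
    from urllib.request import urlopen

    # All host names, paths, and credentials here are invented placeholders.
    COORDINATOR = "http://example.org/batch.txt"     # hands out 500-1,000 sites
    FTP_HOST, FTP_USER, FTP_PASS = "ftp.example.org", "bot", "secret"

    def fetch_batch():
        """Download the newline-separated list of sites assigned to this bot."""
        with urlopen(COORDINATOR, timeout=30) as resp:
            return [s.strip() for s in resp.read().decode().splitlines() if s.strip()]

    def upload_dump(path):
        """Send the finished SQL dump file to the collecting FTP account."""
        with ftplib.FTP(FTP_HOST) as ftp:
            ftp.login(FTP_USER, FTP_PASS)
            with open(path, "rb") as fh:
                ftp.storbinary("STOR " + path, fh)

    for site in fetch_batch():
        # Crawl the site politely here (robots.txt, Crawl-Delay, meta tags)
        # and append the extracted rows to dump.sql.
        pass
    upload_dump("dump.sql")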

Maybe you can ask me some precise questions if it would help you to understand the project better? :eek:

P.S. The main server is a little overloaded and I am doing my best to correct it... It is a personal project.

Best regards
Luc Michalski
 

lucasmd

Member
Joined
Aug 2, 2004
Messages
24
Hi,

I have added some pages explaining this project; you will see the links in the footer of every page of the search engine.

Best Regards
Luc Michalski
 

pvgool

kEditall/kCatmv
Curlie Meta
Joined
Oct 8, 2002
Messages
10,093
lucasmd said:
Hi,

Before I try to explain our potential partnership, you need to know that:
Most of the editors do not have knowledge of any of the points you mentioned. Technical development is done by our owner, AOL. You will have to contact them.

About our potential partnership:

Why?
I used most of your category structure and I have converted your list of websites into a MySQL directory...
Many people have done this. That is why the DMOZ data is available for anybody to use. The only thing DMOZ asks is that you honor our ownership. See http://www.dmoz.org/license.html

Editors will not be able to discuss this license. We not only lack the knowledge, but we are also prohibited by the AOL legal department.
 

lucasmd

Member
Joined
Aug 2, 2004
Messages
24
Hi,

I like such answers, because they are efficient and to the point... :)

Thanks. I saw that somebody from AOL France checked my profile yesterday at viadeo.com, so maybe I can ask him some questions...

Best regards
Luc Michalski
 