How do I get a list of URLs to crawl?

Volitics

Member
Joined
Aug 30, 2004
Messages
2
Hello,

I'm thinking about designing a search engine. I somehow need to get a list of domain names or URLs for the search engine to crawl and add to the index.

Does anyone know how I can get a listing of just the raw URLs in the dmoz database?

Thanks.
 

photofox

Curlie Admin
RZ Admin
Joined
Jun 9, 2010
Messages
2,092
Location
[Right here]
Hi there,

There is actually a specific forum for questions like this; it can be found here. You might want to take a look at some of the past threads there to see if your question has already been answered.

You may also want to take a look at http://rdf.dmoz.org/
These are the rules you must follow: http://dmoz.org/license.html

Again, if you search the forums, you should be able to find much of the information you need in order to set things up.
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
The above is correct. However, there is currently a problem accessing http://dmoz.org/license.html and some of the other related information.

You can download a content RDF and extract URLs from that. Note that you would have to write code to eliminate duplicate entries, and that not all URLs are the "home" pages of sites.
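
For what it's worth, here is a rough Python sketch of that extract-and-deduplicate step. It assumes each listing in the content RDF appears as an element of the form <ExternalPage about="...">, so check that against the file you actually download. The function names and the dump path are made up for illustration. Cutting each URL down to its site root is only one possible way to deal with the non-home-page entries, and it may not be what you want for a crawler.

```python
import gzip
import html
import re
from urllib.parse import urlsplit

# Assumed element shape in the content RDF: <ExternalPage about="http://...">
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]*)"')


def iter_urls(lines):
    """Yield each listed URL once, in first-seen order."""
    seen = set()
    for line in lines:
        for raw in EXTERNAL_PAGE.findall(line):
            url = html.unescape(raw).strip()  # decode entities such as &amp;
            if url and url not in seen:
                seen.add(url)
                yield url


def home_page(url):
    """Reduce a URL to scheme://host/, or return None if it has no host."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc.lower()}/"


def unique_home_pages(lines):
    """Collapse deep links to one entry per site root, in first-seen order."""
    seen = set()
    roots = []
    for url in iter_urls(lines):
        root = home_page(url)
        if root and root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


def urls_from_dump(path):
    """Stream unique URLs from a gzip-compressed dump (path is hypothetical)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from iter_urls(fh)
```

You would call it as something like `for url in urls_from_dump("content.rdf.gz"): print(url)`, with the file name replaced by whatever the downloaded dump is really called. It reads the file line by line rather than loading it whole, since the dump is large, but the set of seen URLs still has to fit in memory.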
 

shilpesh

Member
Joined
Sep 17, 2004
Messages
4
Hello Volitics, I read your message.
You can get help from the dmoz site itself, as it allows you to download the whole database in .gz form. You have to download it and then convert it into a MySQL database using third-party software,
and then you can get the raw listings of the URLs in the dmoz database.

If you get my message, reply to me, as I can give you further information on how the URLs can be extracted.

regards,
shilpesh
 
This site has been archived and is no longer accepting new content.