rdf to search text

fairwinds

Member
Joined
Aug 26, 2005
Messages
10
I was hoping someone on the forum here can explain how RDF data on DMOZ eventually ends up as searchable text on the site. How does DMOZ itself use the RDF data it has harvested to provide full text search?

Many thanks.
David
 

hutcheson

Curlie Meta
Joined
Mar 23, 2002
Messages
19,136
I'm not sure what you're asking. The source for the search engine is, um, open-sourced. (It's in C, as I remember.)

And I wouldn't call it "full text search". Only the category names and keywords, site titles and descriptions, are indexed. That in itself is a fairly large hunk -- the equivalent of about twenty linear feet of books on pulped-rain-forest media.

And, emphatically, ODP search has no concept of priority or rank. Every site is either "in" or "out" of any given search. Sites are displayed in some random order (indeterminate but deterministic, to be more precise) -- probably not what you want in a search engine for end users.
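
To make the "in or out, no rank" behaviour concrete, here is a minimal sketch in Python. It is not the ODP search code (that is described above as C); the listings, field layout, and function names are all made up for illustration. It indexes only category names, titles, and descriptions, treats a query as "every term must appear", and returns hits in a fixed but relevance-free order.

```python
import re
from collections import defaultdict

# Hypothetical listings: (category path, site title, site description).
LISTINGS = [
    ("Top/Recreation/Boating/Sailing", "Fair Winds Sailing Club",
     "Club news and race results."),
    ("Top/Computers/Software/Databases", "RDF Store Notes",
     "Notes on querying triple stores."),
    ("Top/Recreation/Boating/Sailing", "Harbor Race Calendar",
     "Sailing race dates for the season."),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(listings):
    """Map each term to the set of listing ids whose fields contain it."""
    index = defaultdict(set)
    for doc_id, fields in enumerate(listings):
        for field in fields:
            for term in tokenize(field):
                index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of listings containing every query term.

    A listing is simply in or out of the result; there is no score.
    Sorting by id makes the order repeatable, not relevant.
    """
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


if __name__ == "__main__":
    idx = build_index(LISTINGS)
    for doc_id in search(idx, "sailing race"):
        print(LISTINGS[doc_id][1])
```

A ranking engine would attach a score to each hit and sort by it; here the only question asked of a listing is whether it contains all the terms.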
 

giz

Member
Joined
May 26, 2002
Messages
3,112
Think of ODP search as a means to find categories - categories where all the sites you are looking for are actually listed.

Don't think of it as a keyword search to find sites: that is what Google, Yahoo, etc are for.
 

fairwinds

Member
Joined
Aug 26, 2005
Messages
10
Thanks for your replies. Where would I be able to see the source of the search engine component of DMOZ, if it is open source? I am relatively new to RDF and have been interested in how RDF is used after it is gathered. In a triple store it could be queried, but that would be very slow. I am interested specifically in which fields are indexed. I was thinking of putting something together with RDF data, but full text searching is important, which is my reason for exploring DMOZ: it is a tangible example. I was thinking of using Lucene but want to explore real situations such as DMOZ before I start making something.

Regards
David
 

fairwinds

Member
Joined
Aug 26, 2005
Messages
10
Many thanks for the link to the search source. I will start by looking at the code. Another question: is DMOZ itself a site of static pages? What is the process for generating the pages from the RDF for browsing the directory hierarchy?
 

fairwinds

Member
Joined
Aug 26, 2005
Messages
10
Ok, this is what I am interpreting. DMOZ starts with a database backend that is used to collect/edit data, and pages are generated from the data. The data is also serialized as RDF for distribution / consumption by others. Is this correct?

Here are additional questions that I have in relation to data storage and RDF serialization. What is the database backend for DMOZ? Are pages dynamically generated? Is RDF used as an input to DMOZ in any case, or is it used solely as the output format? I was reading about a one week cycle for new RDF data to become available. What is happening in the one week cycle to build the RDF?
Many thanks for your replies as I learn the role and relationship of RDF at DMOZ.

Regards
David
 

giz

Member
Joined
May 26, 2002
Messages
3,112
I don't know the full story, but the editor side has a Berkeley flat-file database with each category as a folder, and many tools to manipulate the category data between unreviewed and reviewed status, move to other categories, etc. All editing actions are also logged, by user, and by category, etc.

The site navigation uses the folder hierarchy with breadcrumb navigation, and @links and RelatedCategory links to bridge to other categories that could be said to be child or related categories.

After an edit, the editor side HTML pages are generated within seconds, but the data can take hours or days to be copied over to the public side. There is a "job queue" that lists pages to be regenerated.

In the background, some spidering process runs for 4 or 5 days gathering the category data into the RDF. It works slowly as it has half a million categories to traverse. The RDF pops out about weekly and is made available on a separate part of the site (along with old archived copies from way back).

The RDF file is also then fed to the search server, and after a couple of days the search database is up to date with whatever was in the RDF, but already the search is a week behind the reality of what the editors are seeing in their categories.
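
For anyone who wants to try the "RDF dump in, search records out" step themselves, here is a rough Python sketch. It is not how the search server actually ingests the file. The element names and namespaces in the sample (`ExternalPage`, `d:Title`, `d:Description`, `topic`) are assumptions based on the general shape of the published content dump and should be checked against a real file. The URL and listing are invented.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespaces; verify against the header of a real dump.
DMOZ_NS = "http://dmoz.org/rdf/"
DC_NS = "http://purl.org/dc/elements/1.0/"

# A tiny made-up sample in the assumed layout of the content dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Recreation/Boating/Sailing">
    <catid>1001</catid>
    <link r:resource="http://sailing.example/"/>
  </Topic>
  <ExternalPage about="http://sailing.example/">
    <d:Title>Fair Winds Sailing Club</d:Title>
    <d:Description>Club news and race results.</d:Description>
    <topic>Top/Recreation/Boating/Sailing</topic>
  </ExternalPage>
</RDF>
"""


def iter_listings(stream):
    """Yield one dict per ExternalPage element, reading incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{DMOZ_NS}}}ExternalPage":
            yield {
                "url": elem.get("about"),
                "title": elem.findtext(f"{{{DC_NS}}}Title", default=""),
                "description": elem.findtext(f"{{{DC_NS}}}Description", default=""),
                "category": elem.findtext(f"{{{DMOZ_NS}}}topic", default=""),
            }
            # Drop the element's children once its fields are copied out.
            elem.clear()


if __name__ == "__main__":
    for row in iter_listings(io.BytesIO(SAMPLE)):
        print(row["category"], "|", row["title"], "|", row["url"])
```

Each yielded record holds the same fields mentioned earlier in the thread as being indexed (category, title, description), so it could be handed to whatever indexer you choose, Lucene included.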
 

fairwinds

Member
Joined
Aug 26, 2005
Messages
10
Many thanks for taking the time to explain this. If there is any other info that may be helpful in filling in the blanks as far as processes go, it would also be appreciated, particularly what is happening with RDF or how it is used specifically in the DMOZ situation.

Regards
David
 
This site has been archived and is no longer accepting new content.