rdf to search text

fairwinds · Aug 26, 2005

I was hoping someone on the forum could explain how the RDF data on DMOZ eventually ends up as searchable text on the site. How does DMOZ itself use the RDF data it has harvested to provide full-text search?

Many thanks.
David

hutcheson · Aug 27, 2005

I'm not sure what you're asking. The source for the search engine is, um, open-sourced. (It's in C, as I remember.)

And I wouldn't call it "full text search". Only the category names and keywords, site titles and descriptions, are indexed. That in itself is a fairly large hunk -- the equivalent of about twenty linear feet of books on pulped-rain-forest media.

And, emphatically, ODP search has no concept of priority or rank. Every site is either "in" or "out" of any given search. Sites are displayed in some random order (indeterminate but deterministic, to be more precise) -- probably not what you want in a search engine for end users.
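To make the "in or out, no rank" behaviour concrete, here is a minimal sketch in Python. It is not the ODP search code (that is described above as C); the records, field names and functions are all hypothetical. It indexes only short fields (title, description, category path) and returns every record containing all query terms, in a fixed but relevance-free order.

```python
import re
from collections import defaultdict

# Hypothetical records: only these short fields are indexed, never page bodies.
RECORDS = [
    {"title": "Sailing Club", "description": "Weekly races and lessons",
     "category": "Top/Sports/Sailing"},
    {"title": "Boat Charter", "description": "Sailing holidays in Greece",
     "category": "Top/Recreation/Boating"},
    {"title": "Chess Lessons", "description": "Online lessons for beginners",
     "category": "Top/Games/Chess"},
]

def tokenize(text):
    """Lowercase the text and split it on anything that is not a word character."""
    return [word for word in re.split(r"\W+", text.lower()) if word]

def build_index(records):
    """Map each token to the set of record ids whose indexed fields contain it."""
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        for field in ("title", "description", "category"):
            for token in tokenize(rec[field]):
                index[token].add(rec_id)
    return index

def search(index, query):
    """Return ids of records matching ALL terms. There is no scoring step."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    # Sorting by id gives a repeatable order that says nothing about relevance.
    return sorted(hits)

INDEX = build_index(RECORDS)
```

Here `search(INDEX, "sailing lessons")` matches only the first record, because it is the only one containing both words; a record is never "more" or "less" of a match.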

giz · Aug 27, 2005

Think of ODP search as a means to find categories - categories where all the sites you are looking for are actually listed.

Don't think of it as a keyword search to find sites: that is what Google, Yahoo, etc. are for.

fairwinds · Aug 27, 2005

Thanks for your replies. If it is open source, where would I be able to see the source of the search engine component of DMOZ? I am relatively new to RDF and have been interested in how RDF is used after it is gathered. In a triple store it could be queried, but that would be very slow. I am specifically interested in which fields are indexed. I was thinking of putting something together with RDF data, but full-text searching is important, which is why I am exploring DMOZ as a tangible example. I was thinking of using Lucene but want to look at real situations such as DMOZ before I start building something.

Regards
David

pvgool · Aug 27, 2005

I think you would best start out at http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/ .

ishtar · Aug 27, 2005

http://dmoz.org/ODPsearch/

giz · Aug 27, 2005

Koo!

I never knew that was there.

fairwinds · Aug 27, 2005

Many thanks for the link to the search source. I will start by looking at the code. Another question: is DMOZ itself a site of static pages? What is the process for generating the pages from the RDF for browsing the directory hierarchy?

bekahm · Aug 27, 2005

It's the other way around. The RDF is generated from the directory.

fairwinds · Aug 27, 2005

Ok, this is what I am interpreting. DMOZ starts with a database backend that is used to collect/edit data, pages are generated from the data. The data is also serialized as RDF for distribution / consumption by others. Is this correct?

Here are additional questions that I have in relation to data storage and RDF serialization. What is the database backend for DMOZ? Are pages dynamically generated? Is RDF used as an input to DMOZ in any case, or is it used solely as the output format? I was reading about a one-week cycle for new RDF data to become available. What is happening in the one-week cycle to build the RDF?
Many thanks for your replies as I learn the role and relationship of RDF at DMOZ.

Regards
David

giz · Aug 27, 2005

I don't know the full story, but the editor side has a Berkeley flat-file database with each category as a folder, and many tools to manipulate the category data between unreviewed and reviewed status, move to other categories, etc. All editing actions are also logged, by user, and by category, etc.

The site navigation uses the folder hierarchy with breadcrumb navigation, and @links and RelatedCategory links to bridge to other categories that could be said to be child or related categories.

After an edit, the editor side HTML pages are generated within seconds, but the data can take hours or days to be copied over to the public side. There is a "job queue" that lists pages to be regenerated.

In the background, some spidering process runs for 4 or 5 days gathering the category data into the RDF. It works slowly as it has half a million categories to traverse. The RDF pops out about weekly and is made available on a separate part of the site (along with old archived copies from way back).
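As a rough picture of that traversal step, here is a small Python sketch that walks a category tree depth first and serializes each listed site as an XML record. Everything in it is an assumption made for illustration: the in-memory `TREE` stands in for the category folders, and the element names are only loosely modelled on the public dump, without its namespaces. It is not how the ODP dump job is actually written.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the editor-side category folders.
TREE = {
    "name": "Top",
    "sites": [],
    "children": [
        {
            "name": "Sports",
            "sites": [
                {"url": "http://example.org/sailing",
                 "title": "Sailing Club",
                 "description": "Weekly races and lessons"},
            ],
            "children": [],
        },
    ],
}

def walk(node, parent_path=""):
    """Yield (category_path, site) pairs, visiting the tree depth first."""
    path = f"{parent_path}/{node['name']}" if parent_path else node["name"]
    for site in node["sites"]:
        yield path, site
    for child in node["children"]:
        yield from walk(child, path)

def to_rdf(tree):
    """Serialize every listed site as one simplified ExternalPage element."""
    root = ET.Element("RDF")
    for category, site in walk(tree):
        page = ET.SubElement(root, "ExternalPage", about=site["url"])
        ET.SubElement(page, "Title").text = site["title"]
        ET.SubElement(page, "Description").text = site["description"]
        ET.SubElement(page, "topic").text = category
    return ET.tostring(root, encoding="unicode")

DUMP = to_rdf(TREE)
```

With half a million categories, a job like this has to touch every folder once per cycle, which fits with the dump appearing weekly rather than continuously.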

The RDF file is also then fed to the search server, and after a couple of days the search database is up to date with whatever was in the RDF, but already the search is a week behind the reality of what the editors are seeing in their categories.
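For anyone wanting to do the same with their own copy of the data, here is a sketch of the "feed the RDF to the search side" step in Python. The `SAMPLE` text is a simplified, hypothetical fragment without namespaces (the real dump uses namespaced elements, so the tag names would need adjusting), and `iter_records` is a made-up helper. It streams the file with `iterparse` so a very large dump never has to sit in memory, and keeps only the fields mentioned earlier in the thread as indexed.

```python
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; the real dump uses XML namespaces.
SAMPLE = """<RDF>
  <ExternalPage about="http://example.org/sailing">
    <Title>Sailing Club</Title>
    <Description>Weekly races and lessons</Description>
    <topic>Top/Sports/Sailing</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/chess">
    <Title>Chess Lessons</Title>
    <Description>Online lessons for beginners</Description>
    <topic>Top/Games/Chess</topic>
  </ExternalPage>
</RDF>"""

def iter_records(stream):
    """Stream records out of the dump one ExternalPage at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ExternalPage":
            yield {
                "url": elem.get("about"),
                "title": elem.findtext("Title", default=""),
                "description": elem.findtext("Description", default=""),
                "category": elem.findtext("topic", default=""),
            }
            # Drop the parsed children so memory use stays flat on a big file.
            elem.clear()

RECORDS = list(iter_records(io.BytesIO(SAMPLE.encode("utf-8"))))
```

Each dictionary could then be handed to whatever indexer is in use (Lucene was mentioned above). The result can only be as fresh as the dump it was built from.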

fairwinds · Aug 28, 2005

Many thanks for taking the time to explain this. If there is any other info that may be helpful in filling in the blanks as far as the processes go, it would also be appreciated, particularly what is happening with the RDF or how it is used specifically in the DMOZ situation.

Regards
David
