fairwinds Posted August 27, 2005 I was hoping someone on the forum here could explain how RDF data on DMOZ eventually ends up as searchable text on the site. How does DMOZ itself use the RDF data it has harvested to provide full-text search? Many thanks. David
Meta hutcheson Posted August 27, 2005 I'm not sure what you're asking. The source for the search engine is, um, open-sourced. (It's in C, as I remember.) And I wouldn't call it "full-text search": only the category names and keywords, site titles and descriptions are indexed. That in itself is a fairly large hunk -- the equivalent of about twenty linear feet of books on pulped-rain-forest media. And, emphatically, ODP search has no concept of priority or rank. Every site is either "in" or "out" of any given search. Sites are displayed in some arbitrary order (indeterminate-looking but deterministic, to be more precise) -- probably not what you want in a search engine for end users.
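The real ODP search code is in C and isn't reproduced here; the snippet below is only a guess at what an "indeterminate but deterministic" result order could look like in practice: sort the matching listings by a stable hash of each URL, so the ordering looks arbitrary to a user but never changes between identical searches. The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative only: order matching listings by a stable hash of the URL.
// The sequence looks random, but the same query always comes back in the
// same order, which fits "indeterminate but deterministic".
public class StableOrder {
    static long stableKey(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    static void order(List<String> matchingUrls) {
        matchingUrls.sort(Comparator.comparingLong(StableOrder::stableKey));
    }
}
```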
giz Posted August 27, 2005 Think of ODP search as a means to find categories - categories where all the sites you are looking for are actually listed. Don't think of it as a keyword search to find sites: that is what Google, Yahoo, etc. are for.
fairwinds (Author) Posted August 27, 2005 Thanks for your replies. Where would I be able to see the source of the DMOZ search engine component, if it is open source? I am relatively new to RDF and have been interested in how RDF is used after it is gathered. In a triple store it could be queried, but that would be very slow. I am specifically interested in which fields are indexed. I was thinking of putting something together with RDF data, and full-text searching is important, which is why I am exploring DMOZ: it is a tangible example. I was thinking of using Lucene, but I want to look at real situations such as DMOZ before I start building something. Regards David
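For context on the Lucene idea: a minimal indexing sketch for ODP data would index only the same handful of metadata fields hutcheson describes (category path, site title, description), not any page text. The field names and the OdpIndexer/addListing helper below are invented, and this uses a current Lucene API rather than what existed in 2005; the values would come from parsing the RDF dump (see the SAX sketch later in the thread).

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class OdpIndexer implements AutoCloseable {
    private final IndexWriter writer;

    public OdpIndexer(String indexDir) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexDir));
        writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
    }

    // One ODP listing = one Lucene document. Only directory metadata is
    // searchable (category path, title, description) -- not the page itself.
    public void addListing(String topic, String url, String title, String description)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("url", url, Field.Store.YES));      // stored, not tokenized
        doc.add(new TextField("category", topic, Field.Store.YES)); // e.g. "Top/Computers/..."
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("description", description, Field.Store.YES));
        writer.addDocument(doc);
    }

    @Override
    public void close() throws Exception {
        writer.close();
    }
}
```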
Meta pvgool Posted August 27, 2005 I think you would do best to start at http://dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/ . I will not answer PMs or emails sent to me. If you have anything to ask, please use the forum.
fairwinds (Author) Posted August 27, 2005 Many thanks for the link to the search source. I will start by looking at the code. Another question: is DMOZ itself a site of static pages? What is the process for generating the pages from the RDF for browsing the directory hierarchy?
bekahm Posted August 27, 2005 It's the other way around. The RDF is generated from the directory.
fairwinds (Author) Posted August 27, 2005 OK, this is how I am interpreting it: DMOZ starts with a database backend that is used to collect and edit data, pages are generated from that data, and the data is also serialized as RDF for distribution / consumption by others. Is this correct? Here are additional questions I have about data storage and RDF serialization. What is the database backend for DMOZ? Are pages dynamically generated? Is RDF used as an input to DMOZ in any case, or is it used solely as the output format? I was reading about a one-week cycle for new RDF data to become available. What is happening in that one-week cycle to build the RDF? Many thanks for your replies as I learn the role and relationship of RDF at DMOZ. Regards David
giz Posted August 27, 2005 I don't know the full story, but the editor side has a Berkeley DB flat-file database with each category as a folder, and many tools to manipulate the category data: move sites between unreviewed and reviewed status, move them to other categories, and so on. All editing actions are also logged, by user and by category. The site navigation uses the folder hierarchy with breadcrumb navigation, plus @links and RelatedCategory links to bridge to other categories that could be considered child or related categories. After an edit, the editor-side HTML pages are regenerated within seconds, but the data can take hours or days to be copied over to the public side; there is a "job queue" that lists pages to be regenerated. In the background, a spider process runs for 4 or 5 days gathering the category data into the RDF. It works slowly, as it has half a million categories to traverse. The RDF pops out roughly weekly and is made available on a separate part of the site (along with old archived copies from way back). The RDF file is then also fed to the search server, and after a couple of days the search database is up to date with whatever was in the RDF -- but by then the search is already a week behind the reality of what the editors are seeing in their categories.
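Since the weekly dump is far too large to load into a triple store comfortably (which matches the concern about triple-store query speed above), consumers typically stream it. Below is a rough sketch of reading listings out of the content dump with a SAX parser. The file name (content.rdf.u8) and element names (ExternalPage, d:Title, d:Description, topic) reflect how the published dumps have generally been structured, but should be checked against an actual download; the class name and the printing are just placeholders for handing each listing to an indexer such as the Lucene sketch above.

```java
import java.io.File;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Streams the ODP content dump and emits one line per listing.
// The "ExternalPage" end tag is where you would hand the listing
// off to whatever index you are building.
public class OdpDumpReader extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String url, title, description, topic;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0);
        if ("ExternalPage".equals(qName)) {
            url = attrs.getValue("about"); // the listed site's URL
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "d:Title":       title = text.toString(); break;
            case "d:Description": description = text.toString(); break;
            case "topic":         topic = text.toString(); break;
            case "ExternalPage":
                System.out.printf("%s | %s | %s%n", topic, title, url);
                break;
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("content.rdf.u8"), new OdpDumpReader());
    }
}
```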
fairwinds (Author) Posted August 28, 2005 Many thanks for taking the time to explain this. If there is any other info that may be helpful in filling in the blanks as far as the processes go, it would also be appreciated; particularly what is happening with the RDF, or how it is used specifically in the DMOZ situation. Regards David