Use of ODP data - optimizing searches etc.

bjorn

Member
Joined
Jun 30, 2004
Messages
10
I just downloaded the ODP RDF files and imported them into my local MySQL database.

The result is 1.7GB less disk space and the whole directory in my own db ;)

Anyhow, how does dmoz.org use this data? (when browsing, searching, etc.)

For example, the SQL query:
SELECT * FROM content_links WHERE topic LIKE '%java%';

resulted in:

4871 rows in set (2 min 27.29 sec)

Of course, this isn't a *real* example, since we would most likely use LIMIT 0,10 or similar to show only the first ten.

But still - does anyone know how the ODP uses this data? Any articles or 'inside info' on the subject?
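For what it's worth, the query above is slow because a LIKE pattern with a leading wildcard ('%java%') cannot use an ordinary index, so every row gets scanned. One common workaround is a separate word-lookup table with an index on the word. Below is a minimal sketch of that idea, using Python's built-in sqlite3 module as an in-memory stand-in for MySQL. The `id` and `url` columns, the `topic_tokens` table, the `tokenize` and `search` helpers, and the sample rows are all made up for illustration; only `content_links` and `topic` come from the query above.

```python
import re
import sqlite3

# In-memory SQLite stands in for the local MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_links (id INTEGER PRIMARY KEY, topic TEXT, url TEXT);
    -- Hypothetical lookup table: one row per (word, link) pair.
    CREATE TABLE topic_tokens (token TEXT, link_id INTEGER);
    CREATE INDEX idx_topic_tokens ON topic_tokens (token);
""")

# Made-up sample rows, not real ODP data.
rows = [
    (1, "Top/Computers/Programming/Languages/Java", "http://example.com/a"),
    (2, "Top/Computers/Programming/Languages/JavaScript", "http://example.com/b"),
    (3, "Top/Regional/Asia/Indonesia/Java", "http://example.com/c"),
    (4, "Top/Arts/Music", "http://example.com/d"),
]
conn.executemany("INSERT INTO content_links VALUES (?, ?, ?)", rows)


def tokenize(topic):
    """Split a topic path into lowercase words on any non-alphanumeric character."""
    return {t for t in re.split(r"[^0-9a-z]+", topic.lower()) if t}


# Build the token table once, up front.
for link_id, topic, _url in rows:
    conn.executemany(
        "INSERT INTO topic_tokens VALUES (?, ?)",
        [(tok, link_id) for tok in tokenize(topic)],
    )


def search(word, limit=10):
    """Return ids of links whose topic contains `word` as a whole token."""
    cur = conn.execute(
        "SELECT DISTINCT link_id FROM topic_tokens WHERE token = ? "
        "ORDER BY link_id LIMIT ?",
        (word.lower(), limit),
    )
    return [r[0] for r in cur]


print(search("java"))
```

Note the trade-off: this matches whole words only, so "java" no longer finds the JavaScript category the way '%java%' would. In MySQL itself, a FULLTEXT index queried with MATCH ... AGAINST is the built-in route to the same kind of word lookup.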
 

giz

Member
Joined
May 26, 2002
Messages
3,112
Regarding the RDF data, the ODP does not use that data - it makes it!

The unreviewed and published site listings actually use a file and folder database structure. Once per week or so, a background process runs which parses the published data, and over a period of several days builds the RDF file. The RDF file is published about weekly, with an occasional gap if there is a technical problem.

That file is used by hundreds of downstream users who can incorporate the data into their own version of the directory (as long as the rules for licensing and attribution are followed). There are a number of innovative uses of the data: the Google directory and the Thumbshots version are just two of them.
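For a downstream user, the first step is usually reading the dump without loading the whole file into memory. Here is a rough sketch of a streaming parse with Python's standard `xml.etree.ElementTree.iterparse`. The element names (`ExternalPage`, `d:Title`, `topic`) and the namespace URIs are assumptions about the content dump's layout and should be checked against the header of the real file; the `iter_links` helper and the sample record are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespaces; verify them against the real dump's root element.
DMOZ = "{http://dmoz.org/rdf/}"
DC = "{http://purl.org/dc/elements/1.0/}"


def iter_links(source):
    """Yield (url, title, topic) for each ExternalPage element in `source`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == DMOZ + "ExternalPage":
            yield (
                elem.get("about"),
                elem.findtext(DC + "Title", default=""),
                elem.findtext(DMOZ + "topic", default=""),
            )
            # Drop the children we just read so they can be garbage-collected.
            elem.clear()


# Made-up record in the assumed format, not real ODP data.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <d:Description>A made-up listing.</d:Description>
    <topic>Top/Computers/Programming/Languages/Java</topic>
  </ExternalPage>
</RDF>
"""

links = list(iter_links(io.BytesIO(SAMPLE)))
print(links)
```

With a real dump, `iter_links(open(path, "rb"))` would feed rows straight into a database insert loop, one listing at a time.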



As of last week the RDF file is fully UTF-8 compliant; over the last few months, each weekly RDF file had contained several corrupted characters. If you look at the RDF archive, you'll also see that there were substantially more errors earlier in 2004 and in late 2003, while the actual conversion to UTF-8 was taking place.
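Anyone importing an older dump can check for such corrupted characters before loading it. A minimal sketch, assuming the dump is read line by line in binary mode (the `invalid_utf8_lines` helper and the sample bytes are made up for illustration):

```python
import io


def invalid_utf8_lines(stream):
    """Return 1-based line numbers whose bytes are not valid UTF-8."""
    bad = []
    for lineno, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(lineno)
    return bad


# Line 1 encodes "é" correctly as UTF-8; line 2 uses a lone Latin-1 byte.
sample = io.BytesIO(b"<d:Title>caf\xc3\xa9</d:Title>\n<d:Title>caf\xe9</d:Title>\n")
print(invalid_utf8_lines(sample))  # prints [2]
```

Splitting on newlines is safe here because the newline byte never appears inside a multi-byte UTF-8 sequence.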
 