
Posted

I just downloaded the ODP RDF dumps and imported them into my local MySQL ..

 

The result is 1.7GB less disk space and the whole directory in my own db ;)
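For anyone wanting to do the same import, here is a minimal parsing sketch. It uses Python's streaming `iterparse` so the multi-gigabyte dump never has to sit in memory; the element names (`ExternalPage`, `d:Title`, `topic`) reflect the shape of the ODP `content.rdf.u8` dump, and the sample snippet here is made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# A tiny made-up sample in the shape of the ODP content dump
# (content.rdf.u8); element names are assumptions based on that format.
SAMPLE = """<RDF xmlns:d="http://purl.org/dc/elements/1.0/">
  <ExternalPage about="http://example.org/java">
    <d:Title>Example Java Site</d:Title>
    <topic>Top/Computers/Programming/Languages/Java</topic>
  </ExternalPage>
</RDF>"""

links = []
# iterparse streams the document element by element, so a real
# multi-GB dump file can be processed with constant memory.
for event, elem in ET.iterparse(io.StringIO(SAMPLE), events=("end",)):
    if elem.tag == "ExternalPage":
        url = elem.get("about")
        title = elem.findtext("{http://purl.org/dc/elements/1.0/}Title")
        topic = elem.findtext("topic")
        links.append((url, title, topic))
        elem.clear()  # free the element we have already processed

print(links[0][2])  # Top/Computers/Programming/Languages/Java
```

On a real dump you would pass the filename to `iterparse` instead of a `StringIO`, and batch the collected tuples into MySQL `INSERT` statements.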

 

Anyhow, how does dmoz.org use this data? (when browsing, searching, etc.)

 

For example, the SQL query:

SELECT * FROM content_links WHERE topic LIKE '%java%';

 

resulted in:

 

4871 rows in set (2 min 27.29 sec)

 

.. of course, this isn't a *real* example, since we would most likely use LIMIT 0,10, etc., to show only the first ten results.
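Worth noting: LIMIT caps how many rows come back, but a leading-wildcard LIKE still forces a full scan of the table, which is why the query above took minutes. A minimal sketch of the effect, using SQLite instead of MySQL and a made-up miniature `content_links` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical miniature stand-in for the imported content_links table.
cur.execute("CREATE TABLE content_links (topic TEXT, url TEXT)")
rows = [("Top/Computers/Programming/Languages/Java",
         "http://example.org/%d" % i) for i in range(50)]
rows.append(("Top/Arts/Music", "http://example.org/music"))
cur.executemany("INSERT INTO content_links VALUES (?, ?)", rows)

# A pattern starting with '%' cannot use a B-tree index on topic,
# so every row is examined; LIMIT only trims what is returned.
cur.execute(
    "SELECT url FROM content_links WHERE topic LIKE '%Java%' LIMIT 10")
first_ten = cur.fetchall()
print(len(first_ten))  # 10
```

For repeated substring searches on a large import, a full-text index (MySQL's FULLTEXT, for example) is the usual fix rather than LIKE.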

 

But still - does anyone know how the ODP uses this data? Any articles or 'inside info' on the subject?

Posted

Regarding the RDF data, the ODP does not use that data - it makes it!

 

The unreviewed and published site listings actually live in a file-and-folder database structure. About once a week, a background process runs that parses the published data and, over a period of several days, builds the RDF file. The RDF file is published roughly weekly, with an occasional gap when there is a technical problem.

 

That file is used by hundreds of downstream users, who can incorporate the data into their own version of the directory (as long as the rules for licensing and attribution are followed). There are a number of innovative uses of the data: the Google directory and the Thumbshots version are just two of them.

 

 

 

As of last week the RDF file is fully UTF-8 compliant; in previous weeks over the last few months, each RDF file had contained several corrupted characters. If you look at the RDF archive, you'll also see that there were substantially more errors in late 2003 and earlier in 2004, while the actual conversion to UTF-8 was taking place.
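For downstream users who want to check a dump themselves, one simple way is to attempt a UTF-8 decode and report where the first bad byte sits; `find_utf8_error` here is just an illustrative helper name, not part of any ODP tooling:

```python
def find_utf8_error(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        return exc.start

print(find_utf8_error("Ångström".encode("utf-8")))  # None
print(find_utf8_error(b"ok so far \xfe oops"))      # 10
```

On a real dump you would read the file in binary mode and feed the bytes through an incremental decoder chunk by chunk, rather than loading gigabytes at once.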
