bjorn
Posted June 30, 2004

I just downloaded the ODP RDF files and imported them into my local MySQL. The result is 1.7GB less disk space and the whole directory in my own db.

Anyhow, how does dmoz.org use this data (when browsing, searching, etc.)? For example, the SQL query:

    SELECT * FROM content_links WHERE topic LIKE '%java%';

resulted in:

    4871 rows in set (2 min 27.29 sec)

Of course, this isn't a *real* example, since we would most likely use LIMIT 0,10 or similar to show only the first ten rows. But still, does anyone know how the ODP uses this data? Any articles or 'inside info' on the subject?
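A note on the timing in the post above: a LIKE pattern that starts with a wildcard ('%java%') generally cannot use an ordinary B-tree index, so the database ends up scanning every row. The sketch below shows two alternatives a downstream user might try: paging the substring search with LIMIT, and browsing a category by its topic prefix with a range condition that an index on topic can serve. It is only an illustration under stated assumptions, not how the ODP itself works: it uses Python's built-in sqlite3 instead of MySQL so it runs anywhere, the `url` column, the index name, and the sample rows are made up, and only the `content_links` table and `topic` column come from the post.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the poster's content_links table."""
    conn = sqlite3.connect(":memory:")
    # 'url' and the index name are hypothetical; only table/topic are from the post.
    conn.execute("CREATE TABLE content_links (topic TEXT, url TEXT)")
    conn.execute("CREATE INDEX idx_content_links_topic ON content_links (topic)")
    rows = [
        ("Top/Computers/Programming/Languages/Java", "http://example.com/a"),
        ("Top/Computers/Programming/Languages/Java/FAQs", "http://example.com/b"),
        ("Top/Computers/Programming/Languages/Python", "http://example.com/c"),
        ("Top/Regional/Asia/Indonesia/Java", "http://example.com/d"),
    ]
    conn.executemany("INSERT INTO content_links VALUES (?, ?)", rows)
    return conn


def search_substring(conn, term, limit=10, offset=0):
    """Substring search, paged with LIMIT/OFFSET.

    The leading wildcard still forces a scan; LIMIT only caps the rows returned.
    """
    pattern = "%" + term + "%"
    cur = conn.execute(
        "SELECT topic, url FROM content_links "
        "WHERE topic LIKE ? ORDER BY topic, url LIMIT ? OFFSET ?",
        (pattern, limit, offset),
    )
    return cur.fetchall()


def browse_prefix(conn, prefix, limit=10):
    """List a category and its subcategories via a range on the topic prefix.

    A range condition like this is the kind of query an index on topic can serve.
    """
    upper = prefix + "\uffff"  # sorts after any topic that starts with prefix
    cur = conn.execute(
        "SELECT topic, url FROM content_links "
        "WHERE topic >= ? AND topic < ? ORDER BY topic, url LIMIT ?",
        (prefix, upper, limit),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(search_substring(db, "java", limit=10))
    print(browse_prefix(db, "Top/Computers/Programming/Languages/Java"))
```

For real keyword search over 1.7GB of data, a full-text index is probably a better fit than LIKE, but that depends on the storage engine and is outside this sketch.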
giz
Posted June 30, 2004

Regarding the RDF data, the ODP does not use that data. It makes it!

The unreviewed and published site listings actually use a file and folder database structure. Once per week or so, a background process runs which parses the published data and, over a period of several days, builds the RDF file. The RDF file is published about weekly, with an occasional gap if there is a technical problem.

That file is used by hundreds of downstream users, who can incorporate the data into their own version of the directory (as long as the rules for licensing and attribution are followed). There are a number of innovative uses of the data; the Google directory and the Thumbshots versions are just two of them.

As of last week the RDF file is fully UTF-8 compliant. Over the last few months, each weekly RDF file contained several corrupted characters. If you look at the RDF archive, you'll also see that there were substantially more errors in late 2003 and early 2004, while the actual conversion to UTF-8 was taking place.
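For downstream users who want to check a downloaded dump for the kind of corrupted characters mentioned above before importing it, here is a minimal sketch. It streams the file in chunks through Python's incremental UTF-8 decoder and reports the byte offset of the first invalid sequence. The function name and the `dump.rdf` path are hypothetical, and this only tests whether the bytes are valid UTF-8; it says nothing about whether the RDF/XML itself is well-formed.

```python
import codecs


def find_invalid_utf8(path, chunk_size=1 << 20):
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    The file is read in chunks, so a multi-gigabyte dump never has to fit
    in memory. The incremental decoder carries incomplete multi-byte
    sequences across chunk boundaries.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    offset = 0  # number of bytes read from the file before the current chunk
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            # Bytes held back from the previous chunk (an unfinished sequence).
            pending = decoder.getstate()[0]
            try:
                # An empty chunk means end of file: flush with final=True so a
                # truncated trailing sequence is reported as an error.
                decoder.decode(chunk, final=not chunk)
            except UnicodeDecodeError as exc:
                # exc.start is relative to pending + chunk.
                return offset - len(pending) + exc.start
            if not chunk:
                return None
            offset += len(chunk)


if __name__ == "__main__":
    bad_at = find_invalid_utf8("dump.rdf")  # hypothetical file name
    if bad_at is None:
        print("file is valid UTF-8")
    else:
        print("first invalid byte sequence at offset", bad_at)
```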