Posted

The RDF Dump (or at least the provided example) seems to adhere to some very old draft specification of RDF/XML. Are there plans to fix it to adhere to the final spec so that it can be read by standard RDF processors?

 

The main problems seem to be a wrong namespace, the use of an unqualified about attribute (instead of rdf:about) and a missing namespace prefix on the root element (which thus ends up in the dmoz namespace rather than the rdf namespace).

 

Cheers,

Reto

 

References

W3C validator result with DMOZ example content: http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Frdf.dmoz.org%2Frdf%2Fcontent.example.txt&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED

 

RDF/XML Syntax specification:

http://www.w3.org/TR/rdf-syntax-grammar/

Posted

People are aware of the fact that the "RDF dump" isn't actually RDF. AFAIK, the reason is that it pre-dates the final RDF specification, and so was indeed based on an earlier draft. It should, however, at least be well-formed XML.

 

The non-RDF-ness is known about, and I think the following two options have both been proposed: (1) Stop calling it an RDF dump, or (2) change the structure/syntax to become valid RDF. Neither would be entirely straightforward, and it has long been the case that our limited technical resources have been used elsewhere, and unfortunately nothing has been done.

 

For more background you might like to read: http://www.rainwaterreptileranch.org/steve/sw/odp/rdflist.html

Posted (edited)

why is it so hard?

 

It isn't obvious to me why fixing the RDF/XML is not straightforward.

 

I found some sed scripts at http://dbpubs.stanford.edu:8091/diglib/ginf/download/dmoz/

 

and adapted them as follows:

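content.sed:

# Qualify bare about attributes and use the correct case for r:ID.
s/ about=/ r:about=/;
s/r:id=/r:ID=/;
# Append a trailing slash to the namespace URI ending in rdf" and
# replace the draft RDF namespace with the final one.
s/rdf"/rdf\/"/;
s/TR\/RDF\//1999\/02\/22-rdf-syntax-ns#/;
# Give the root element the r: prefix so it is in the RDF namespace.
s/<RDF /<r:RDF /;
s/<\/RDF/<\/r:RDF/

# Turn category IDs into r:about URIs built on the &dmoz; entity.
s/r:ID="Top"/r:about="\&dmoz;"/
s/\(r:ID=\"\)\(Top\/\)\(.*\)/r:about="\&dmoz;\3/;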

 

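content.sh:

#!/bin/sh

DATA=./content.rdf.u8.gz
SED=./content.sed

# Write a fresh XML declaration and a DOCTYPE that defines the &dmoz; entity,
# then the dump without its first line (assumed to be its own XML declaration),
# run through the sed script.
(echo '<?xml version="1.0" encoding="UTF-8"?>' && echo && \
echo '<!DOCTYPE r:RDF [<!ENTITY dmoz "http://dmoz.org/">]>' && echo && \
gunzip -c "$DATA" | tail -n +2 | sed -f "$SED") \
> out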

 

I still get some warnings about invalid URIs and strings not being in NFC by the rdf parser, but it's much closer to RDF now.
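
To see what the substitutions do without unpacking the whole dump, here is a minimal sketch that applies a subset of the content.sed rules to a few sample lines. The function name fix_dump_lines and the sample lines are made up for illustration; they are not taken from the real dump.

```shell
#!/bin/sh

# Apply a subset of the content.sed substitutions to lines read from stdin.
# fix_dump_lines is a hypothetical helper name used only in this sketch.
fix_dump_lines() {
  sed -e 's/ about=/ r:about=/' \
      -e 's/r:id=/r:ID=/' \
      -e 's/<RDF /<r:RDF /' \
      -e 's/<\/RDF/<\/r:RDF/' \
      -e 's/r:ID="Top"/r:about="\&dmoz;"/' \
      -e 's/\(r:ID="\)\(Top\/\)\(.*\)/r:about="\&dmoz;\3/'
}

# Sample lines invented for illustration, not copied from the dump.
printf '%s\n' \
  '<Topic r:id="Top/Arts">' \
  '<ExternalPage about="http://example.org/">' \
  '</RDF>' | fix_dump_lines
```

With these three sample lines the output should be <Topic r:about="&dmoz;Arts">, <ExternalPage r:about="http://example.org/"> and </r:RDF>.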

Edited by Elper
Posted
When a format has been out for a long time, fixing things to what they should be is unfortunately likely to break existing parsers. For that reason, my guess is that fixing is not likely to happen until there are other reasons for an overhaul.
  • 2 weeks later...
Posted

Fixing

 

Changing the format to proper RDF/XML would almost certainly break all parsers that are not RDF-aware. RDF-aware parsers probably patch the bugs first, and might not fail if the bugs are no longer there to patch.

 

However a couple of fixes would most likely not break any client:

- contain only valid URIs

- have all strings in Unicode Normalization Form C (NFC)
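
For the URI part, a rough way to spot candidates could look like the sketch below. It is only a heuristic, not a full RFC 3986 check: it flags about attributes whose value contains a character that must not appear unescaped in a URI (space, <, >, {, }, |, backslash, ^ or backtick). The function name find_bad_uris and the sample lines are made up for illustration.

```shell
#!/bin/sh

# Print (with line numbers) every input line whose about="..." value contains
# a character that is not allowed unescaped in a URI.
# Heuristic only; find_bad_uris is a hypothetical helper name.
find_bad_uris() {
  grep -n -E 'about="[^"]*[ <>{}|\^`][^"]*"'
}

# Sample lines invented for illustration: the first URI contains a space.
printf '%s\n' \
  '<ExternalPage about="http://example.org/a b">' \
  '<ExternalPage about="http://example.org/ok">' | find_bad_uris
```

Run against the real dump, it would be something like gunzip -c content.rdf.u8.gz | find_bad_uris.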

 

For the rest of the fixes I wonder if it would be a big deal to offer an RDF/XML version plus the old "pseudo RDF" XML version during a transitional period.

Posted

I have no numbers, but I'd suspect rather few parsers in the wild to be RDF-aware, and some may even (for performance reasons) process it as a text file, looking for simple string matches; that's at least what I do when parsing the data dumps for tool usage.
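
As a minimal sketch of that kind of string matching (the element name ExternalPage and the function name count_external_pages are assumptions for illustration, not my actual tool code), this counts listing entries by looking for a fixed prefix. It also shows why the rewrite would break such tools: once the attribute becomes r:about, the same match finds nothing.

```shell
#!/bin/sh

# Count lines that start an entry, matching the current (unqualified) syntax.
# count_external_pages is a hypothetical helper name.
count_external_pages() {
  grep -c '<ExternalPage about='
}

# Current syntax: the fixed string matches.
printf '%s\n' '<ExternalPage about="http://example.org/">' | count_external_pages

# After a fix to r:about the same match finds nothing
# (grep -c prints 0 and returns non-zero, hence the || true).
printf '%s\n' '<ExternalPage r:about="http://example.org/">' | count_external_pages || true
```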

 

NFC strings would mean introducing non-trivial normalization in the process, which sounds like it could slow down a process in need of being sped up.... Is this a significant issue? Do you see many unnormalized strings?

 

I tried the validator link from your first posting, and got "Your RDF document validated successfully".....?
