Will the RDF dump be fixed to be valid RDF/XML?

reto · Oct 27, 2006

The RDF dump (or at least the provided example) seems to adhere to some very old draft specification of RDF/XML. Are there plans to fix it to adhere to the final spec so that it can be read by standard RDF processors?

The main problems seem to be a wrong namespace, the unqualified use of rdf:about and a missing namespace prefix on the root element (which thus appears in the dmoz-ns rather than rdf).
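To make the last two problems concrete, here is a minimal Python sketch. The fragment is hypothetical: the element names `Topic` and `ExternalPage` and the two namespace URIs are assumptions modelled on the dump, not copied from it. It uses the standard-library XML parser to show which namespace each name ends up in.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the dump header; the element names and
# the namespace URIs are assumptions for illustration only.
SAMPLE = """<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf">
  <Topic r:id="Top"/>
  <ExternalPage about="http://example.org/"/>
</RDF>"""

# Namespace required by the final RDF/XML specification.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

root = ET.fromstring(SAMPLE)
topic, page = root[0], root[1]

# ElementTree spells qualified names as "{namespace}local".
print(root.tag)            # the unprefixed root falls into the default namespace
print(list(topic.attrib))  # r:id is qualified, but with the draft namespace
print(list(page.attrib))   # the unprefixed attribute has no namespace at all
print(root.tag == "{%s}RDF" % RDF_NS)  # False for this fragment
```

An RDF/XML parser looks for `RDF` and `about` in the `RDF_NS` namespace, so none of the three names above is recognised as RDF syntax.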

Cheers,
Reto

References
W3C validator's result with DMOZ example content: http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Frdf.dmoz.org%2Frdf%2Fcontent.example.txt&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED

RDF/XML Syntax specification:
http://www.w3.org/TR/rdf-syntax-grammar/

chaos127 · Oct 27, 2006

People are aware of the fact that the "RDF dump" isn't actually RDF. AFAIK, the reason is that it pre-dates the final RDF specification, and so was indeed based on an earlier draft. It should, however, at least be well-formed XML.

The non-RDF-ness is known about, and I think the following two options have both been proposed: (1) Stop calling it an RDF dump, or (2) change the structure/syntax to become valid RDF. Neither would be entirely straightforward, and it has long been the case that our limited technical resources have been used elsewhere, and unfortunately nothing has been done.

For more background you might like to read: http://www.rainwaterreptileranch.org/steve/sw/odp/rdflist.html

reto · Oct 27, 2006

why is it so hard?

It isn't obvious to me why fixing the RDF/XML is not straightforward.

I found some sed scripts at http://dbpubs.stanford.edu:8091/diglib/ginf/download/dmoz/

and adapted them as follows:
content.sed:
s/ about=/ r:about=/;
s/r:id=/r:ID=/;
s/rdf"/rdf\/"/;
s/TR\/RDF\//1999\/02\/22-rdf-syntax-ns#/;
s/<RDF /<r:RDF /;
s/<\/RDF/<\/r:RDF/

s/r:ID="Top"/r:about="\&dmoz;"/
s/\(r:ID=\"\)\(Top\/\)\(.*\)/r:about="\&dmoz;\3/;

content.sh:
#!/bin/sh

DATA=./content.rdf.u8.gz
SED=./content.sed

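# Write a fresh XML declaration and a DOCTYPE that defines the &dmoz; entity,
# then append the dump minus its first line (assumed to be its own XML declaration).
(echo '<?xml version="1.0" encoding="UTF-8"?>' && echo && \
echo '<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http://dmoz.org/">]>' && echo && \
gunzip -c "$DATA" | tail -n +2) | sed -f "$SED" \
> out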

I still get some warnings about invalid URIs and strings not being in NFC by the rdf parser, but it's much closer to RDF now.

sfromis · Oct 27, 2006

When a format has been out for a long time, fixing things to what they should be is unfortunately likely to break existing parsers. For that reason, my guess is that fixing is not likely to happen until there are other reasons for an overhaul.

reto · Nov 5, 2006

Fixing

Changing the format to real RDF/XML would almost certainly break all parsers that are not RDF-aware. RDF-aware parsers probably patch the bugs before parsing, and might not fail when they find nothing to patch.

However a couple of fixes would most likely not break any client:
- contain only valid URIs
- have all strings in canonical form C
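As a rough Python sketch of what these two fixes could look like: I am assuming here that the invalid URIs are ones containing spaces, non-ASCII characters or other characters that RFC 3986 does not allow unescaped. The helper names `to_nfc` and `escape_uri` are made up for illustration.

```python
import unicodedata
from urllib.parse import quote

# Characters that may appear unescaped in a URI (RFC 3986 reserved and
# unreserved sets), plus "%" so existing percent-escapes are left alone.
URI_SAFE = "%/:?#[]@!$&'()*+,;=-._~"

def to_nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

def escape_uri(uri):
    """Percent-encode every character that is not allowed unescaped in a URI."""
    return quote(uri, safe=URI_SAFE)

print(escape_uri("http://example.org/a b"))    # http://example.org/a%20b
print(escape_uri("http://example.org/a%20b"))  # unchanged
print(to_nfc("Cafe\u0301") == "Caf\u00e9")     # True
```

Both functions only touch attribute values and text content, never the markup around them, which is why a string-matching client should not notice the change.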

For the rest of the fixes I wonder if it would be a big deal to offer an RDF/XML version plus the old "pseudo RDF" XML version during a transitional period.

sfromis · Nov 5, 2006

I have no numbers, but I'd suspect that rather few parsers in the wild are RDF-aware, and some may even (for performance reasons) process the dump as a text file, looking for simple string matches; that's at least what I do when parsing the data dumps for tool usage.
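A minimal Python sketch of that kind of string matching, to show why a syntax fix would break it. The element name `ExternalPage`, the marker string and the helper `extract_url` are assumptions for illustration, not my actual tool code.

```python
# Hypothetical lines in the current syntax and in the corrected syntax.
OLD = '<ExternalPage about="http://example.org/">'
NEW = '<ExternalPage r:about="http://example.org/">'

# A text-based consumer keys on the literal start of the line.
MARKER = '<ExternalPage about="'

def extract_url(line):
    """Return the URL if the line starts with the old-style marker, else None."""
    if not line.startswith(MARKER):
        return None
    end = line.index('"', len(MARKER))
    return line[len(MARKER):end]

print(extract_url(OLD))  # http://example.org/
print(extract_url(NEW))  # None
```

A scanner like this silently finds nothing once `about` becomes `r:about`, which is the kind of breakage to expect from clients that never parse the XML.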

NFC strings would mean introducing non-trivial normalization in the process, which sounds like it could slow down a process in need of being sped up. Is this a significant issue? Do you see many unnormalized strings?
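One way to get a number would be a small Python script along these lines. It is only a sketch: the sample lines and the `d:Title` element are made up, and the ASCII shortcut is a guess at how the cost could be kept low, not a measured result.

```python
import unicodedata

def is_nfc(line):
    """Report whether a line is already in Normalization Form C."""
    # Pure-ASCII text is always in NFC, so the common case skips normalization.
    if line.isascii():
        return True
    return unicodedata.normalize("NFC", line) == line

def count_non_nfc(lines):
    """Count the lines that would change under NFC normalization."""
    return sum(1 for line in lines if not is_nfc(line))

# Hypothetical sample: ASCII, precomposed e-acute, and "e" + combining acute.
sample = [
    "<d:Title>Plain ASCII</d:Title>",
    "<d:Title>Caf\u00e9</d:Title>",
    "<d:Title>Cafe\u0301</d:Title>",
]
print(count_non_nfc(sample))  # 1
```

To run it over the real dump, the lines could come from `gzip.open("content.rdf.u8.gz", "rt", encoding="utf-8")` instead of the sample list.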

I tried the validator link from your first posting, and got "Your RDF document validated successfully".....?
