reto Posted October 27, 2006
The RDF dump (or at least the provided example) seems to adhere to some very old draft specification of RDF/XML. Are there plans to fix it so that it adheres to the final spec and can be read by standard RDF processors? The main problems seem to be a wrong namespace, the unqualified use of rdf:about, and a missing namespace prefix on the root element (which thus appears in the dmoz namespace rather than the rdf namespace).
Cheers, Reto
References
W3C validator's result with DMOZ example content: http : // www . w3 . org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Frdf.dmoz.org%2Frdf%2Fcontent.example.txt&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED
RDF/XML Syntax specification: http : // www . w3 . org/TR/rdf-syntax-grammar/
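The three problems named in the post above can be checked mechanically. The following is only a minimal sketch in Python: the sample fragment is hypothetical (it imitates the problems and is not copied from the dump), the old namespace URI and the default namespace are assumptions inferred from the sed rules quoted later in this thread, and `diagnose` and its problem labels are made-up names for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace required by the final RDF/XML specification.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Attribute names that belong to the RDF namespace.
RDF_ATTRS = {"id", "ID", "about", "resource"}

# Hypothetical fragment imitating the reported problems (an assumption,
# not actual dump content).
OLD_STYLE = """\
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf">
  <Topic r:id="Top/Arts"/>
  <ExternalPage about="http://example.org/"/>
</RDF>"""


def diagnose(xml_text):
    """Return a set of labels for the problems found in xml_text."""
    problems = set()
    root = ET.fromstring(xml_text)
    # ElementTree expands every name to "{namespace}local".
    if root.tag != "{" + RDF_NS + "}RDF":
        problems.add("root-not-rdf:RDF")
    for elem in root.iter():
        for attr in elem.attrib:
            if attr == "about":
                # No namespace at all: the unqualified use of rdf:about.
                problems.add("unqualified-about")
            elif attr.startswith("{"):
                ns, local = attr[1:].split("}", 1)
                if local in RDF_ATTRS and ns != RDF_NS:
                    problems.add("wrong-rdf-namespace")
    return problems


print(sorted(diagnose(OLD_STYLE)))
```

A fragment whose root is in the RDF namespace and whose `about` attributes are qualified with it yields an empty set.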
chaos127 Posted October 27, 2006
People are aware that the "RDF dump" isn't actually RDF. AFAIK, the reason is that it predates the final RDF specification and so was indeed based on an earlier draft. It should, however, at least be well-formed XML. The non-RDF-ness is a known issue, and I think two options have been proposed: (1) stop calling it an RDF dump, or (2) change the structure/syntax so that it becomes valid RDF. Neither would be entirely straightforward; our limited technical resources have long been used elsewhere, and unfortunately nothing has been done. For more background you might like to read: http : // www . rainwaterreptileranch . org/steve/sw/odp/rdflist . html
reto (Author) Posted October 27, 2006 (edited)
Why is it so hard?
It isn't obvious to me why fixing the RDF/XML is not straightforward. I found some sed scripts at http : // dbpubs . stanford . edu:8091/diglib/ginf/download/dmoz/ and adapted them as follows:

content.sed:
s/ about=/ r:about=/
s/r:id=/r:ID=/
s/rdf"/rdf\/"/
s/TR\/RDF\//1999\/02\/22-rdf-syntax-ns#/
s/<RDF /<r:RDF /
s/<\/RDF/<\/r:RDF/
s/r:ID="Top"/r:about="\&dmoz;"/
s/\(r:ID=\"\)\(Top\/\)\(.*\)/r:about="\&dmoz;\3/

content.sh:
#!/bin/sh
DATA=./content.rdf.u8.gz
SED=./content.sed
# Emit a fresh XML declaration and a DOCTYPE defining the &dmoz; entity,
# then the dump minus its own first line (assumed to be its XML declaration).
(echo '<?xml version="1.0" encoding="UTF-8"?>' && echo && \
 echo '<!DOCTYPE rdf:RDF [<!ENTITY dmoz "http : // dmoz . org/">]>' && echo && \
 (gunzip -c "$DATA" | tail -n +2)) | sed -f "$SED" \
 > out

I still get some warnings from the RDF parser about invalid URIs and strings not being in NFC, but it's much closer to RDF now.
Edited January 29, 2019 by Elper
sfromis Posted October 27, 2006
When a format has been out for a long time, fixing things to what they should be is unfortunately likely to break existing parsers. For that reason, my guess is that fixing is not likely to happen until there are other reasons for an overhaul.
reto (Author) Posted November 5, 2006
Fixing
Changing the format to RDF/XML would almost certainly break all non-RDF-aware parsers. RDF-aware parsers presumably work around the bugs first and might not fail if the bugs are no longer there. However, a couple of fixes would most likely not break any client:
- contain only valid URIs
- have all strings in Unicode Normalization Form C (NFC)
For the rest of the fixes, I wonder whether it would be a big deal to offer an RDF/XML version alongside the old "pseudo-RDF" XML version during a transitional period.
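The thread does not say which URIs the parser rejected. Assuming the culprits are characters that may not appear literally in a URI (spaces, non-ASCII letters), a fix of the first kind could look roughly like the Python sketch below. The function name `escape_uri` and the example URL are hypothetical, and this is not a full URI validator.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 3986 reserved set plus "%", so that
# percent-escapes already present in the input are not encoded a second time.
SAFE = ":/?#[]@!$&'()*+,;=%"


def escape_uri(raw):
    """Percent-encode characters that are not allowed literally in a URI.

    Non-ASCII characters are encoded as UTF-8 bytes (the default of
    urllib.parse.quote); letters, digits and "_.-~" are never touched.
    """
    return quote(raw, safe=SAFE)


# Hypothetical example input with a space and a non-ASCII letter.
print(escape_uri("http://example.org/a b/caf\u00e9"))
```

Because already-valid URIs pass through unchanged, a rewrite of this kind would not be visible to clients that treat the dump as plain text, which is the property the post is after.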
sfromis Posted November 5, 2006
I have no numbers, but I'd suspect that rather few parsers in the wild are RDF-aware, and some may even (for performance reasons) process the dump as a text file, looking for simple string matches; that's at least what I do when parsing the data dumps for tool usage. NFC strings would mean introducing non-trivial normalization into the process, which sounds like it could slow down a process in need of being sped up... Is this a significant issue? Do you see many unnormalized strings? I tried the validator link from your first posting and got "Your RDF document validated successfully".....?
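The question of how many strings are unnormalized can be answered by a line-oriented scan of the kind described above. The sketch below is one possible approach in Python, not anything the dump maintainers use: `count_non_nfc` and `scan_dump` are hypothetical names, and the file is assumed to be the gzipped UTF-8 dump. Lines that are pure ASCII are skipped without normalizing, since ASCII text is always in NFC, so the cost is confined to lines that contain non-ASCII characters.

```python
import gzip
import unicodedata


def count_non_nfc(lines):
    """Return (non_nfc_lines, total_lines) for an iterable of text lines."""
    total = 0
    non_nfc = 0
    for line in lines:
        total += 1
        if line.isascii():
            # ASCII-only text is already in NFC; skip the expensive check.
            continue
        if unicodedata.normalize("NFC", line) != line:
            non_nfc += 1
    return non_nfc, total


def scan_dump(path):
    """Scan a gzipped UTF-8 dump line by line without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        return count_non_nfc(fh)
```

Calling `scan_dump("content.rdf.u8.gz")` on the file named in the script earlier in the thread would give the two counts for the full dump.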