Parsing ChefMoz RDF with PHP

l008com2

Member
Joined
Sep 19, 2005
Messages
12
I hope that specifics like this are OK for this forum. I made a relatively simple PHP script for parsing XML files, and I'm trying to use it to parse the ChefMoz data into a MySQL database. But alas, it doesn't work. No matter how many ways I try to clean the data before passing it to the XML parser, I still get "invalid character" and other errors well before I make it more than 10% of the way through the dump. I've tried everything I could think of, and everything I've been able to find on the web, with very little success. I'm hoping someone in here can help me get past this hurdle. I can post my script if you want to see it.
 

informator

kEditall/kCatmv
Curlie Meta
Joined
Aug 19, 2003
Messages
1,697
Location
Sweden
I'm afraid that we're not good at answering specific technical questions regarding the parsing of RDF files. The files are offered, but we don't have any tech support for them.

From my understanding, it's not a trivial everyday exercise to try to use the files through PHP and MySQL. :2cents:
 

windharp

Meta/kMeta
Curlie Meta
Joined
Apr 30, 2002
Messages
9,204
Are you using XML-specific routines? The system Chefmoz uses is older than the RDF standard and the XML standards, and the character encoding used has some errors (which has been a known bug for several years).

I usually parse Chefmoz RDFs manually, without using any ready-made components, and that works pretty well. Unfortunately I have no clue about PHP, so I am sorry I won't be of any help.

Since the structure of the RDF file is - at least as far as I needed it - without errors, I assume your main problem lies in badly encoded data being rejected by MySQL. That should be fixable with an intelligent routine that makes sure every line you read is properly encoded before handing it to MySQL, either removing or encoding everything MySQL would reject.
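A minimal PHP sketch of such a per-line routine, for illustration only. The function name `to_valid_utf8` is hypothetical, and the sketch assumes that any line which is not valid UTF-8 is really ISO-8859-1 (Latin-1); nothing in this thread confirms which encoding the dump's bad data actually uses.

```php
<?php
// Hypothetical helper (not from the thread): return $line as valid UTF-8.
function to_valid_utf8(string $line): string
{
    // An empty pattern with the /u modifier only matches if the subject
    // is well-formed UTF-8, so this doubles as a validity check.
    if (preg_match('//u', $line) === 1) {
        return $line;
    }
    // ASSUMPTION: invalid lines are ISO-8859-1 bytes. Re-encode them.
    return iconv('ISO-8859-1', 'UTF-8', $line);
}
```

Each line read from the dump would pass through `to_valid_utf8()` before being escaped and inserted into the database. Lines that are already valid UTF-8 are returned unchanged.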


We do have a set of Chefmoz related things in http://dmoz.org/Computers/Internet/...ry_Project/Tools_for_Editors/ChefMoz_Editors/ - maybe there even is one or another piece of software linked that you can use, I didn't check.
 

l008com2

Member
Joined
Sep 19, 2005
Messages
12
windharp said:
Are you using XML-specific routines? The system Chefmoz uses is older than the RDF standard and the XML standards, and the character encoding used has some errors (which has been a known bug for several years).

I usually parse Chefmoz RDFs manually, without using any ready-made components, and that works pretty well. Unfortunately I have no clue about PHP, so I am sorry I won't be of any help.

Since the structure of the RDF file is - at least as far as I needed it - without errors, I assume your main problem lies in badly encoded data being rejected by MySQL. That should be fixable with an intelligent routine that makes sure every line you read is properly encoded before handing it to MySQL, either removing or encoding everything MySQL would reject.

We do have a set of Chefmoz related things in http://dmoz.org/Computers/Internet/...ry_Project/Tools_for_Editors/ChefMoz_Editors/ - maybe there even is one or another piece of software linked that you can use, I didn't check.

Yes, I am using the built-in XML parsing functions. But my problems are not with MySQL at all. My problems come from the XML parser itself, which complains about random characters throughout the file.

Here's a link to my code:
http://pastebin.com/933659

As you can see, it's a pretty basic XML parser. I just can't figure out how to get the dump through it. Is there some way I can clean up all these illegal characters? I planned on making a similar parser for the main dmoz RDF dump, since right now I just have a string-matching-based parser that is easily broken. But I need to clean the data first. Even if illegal characters are just deleted, that would be a perfectly acceptable solution for my needs.
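One way the "just delete illegal characters" approach might look in PHP, as a rough sketch rather than a tested fix for the ChefMoz dump. The names `clean_xml_line` and `parse_dump` are hypothetical and not taken from the pastebin script. The sketch assumes the offending characters are either control characters that XML 1.0 forbids or bytes that do not form valid UTF-8, and it simply drops them, which loses accented characters on the affected lines.

```php
<?php
// Hypothetical helper: delete characters the XML parser would reject.
function clean_xml_line(string $line): string
{
    // XML 1.0 allows tab, LF and CR; the other C0 control characters are illegal.
    $line = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $line);
    // If the line is still not well-formed UTF-8, drop every non-ASCII byte.
    if (preg_match('//u', $line) !== 1) {
        $line = preg_replace('/[\x80-\xFF]/', '', $line);
    }
    return $line;
}

// Sketch: feed a dump to PHP's XML parser one cleaned line at a time.
// $onStart receives ($parser, $tagName, $attributes) for each opening tag.
function parse_dump(string $path, callable $onStart): void
{
    $parser = xml_parser_create('UTF-8');
    xml_set_element_handler($parser, $onStart, function ($parser, $name) {});

    // Pass data to the parser and stop with a readable message on failure.
    $feed = function (string $data, bool $final) use ($parser): void {
        if (!xml_parse($parser, $data, $final)) {
            throw new RuntimeException(sprintf(
                'XML error "%s" at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    };

    $fh = fopen($path, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (($line = fgets($fh)) !== false) {
        $feed(clean_xml_line($line), false);
    }
    $feed('', true); // signal end of input
    fclose($fh);
}
```

Reading line by line keeps memory use flat on a large dump, and the line number in the exception message points at the first place the parser still objects.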
 

l008com2

Member
Joined
Sep 19, 2005
Messages
12
Does anyone out there have any advice? I'm having no luck at all trying to parse the Chefmoz feed into a MySQL database. And as I said, my problem isn't with the MySQL side at all; it's with getting the PHP XML parser to not choke on the Chefmoz dump. Is there anywhere I might find more info on this?
 

sdang

Member
Joined
Jan 16, 2008
Messages
6
parsing chefmoz rdf???

Hi l008com2, did you ever find a solution? I've tried 2 or 3 scripts from the list of 'tools' for parsing RDF and am having no luck; I keep getting random errors. The one script that has worked is suck_DMOZ in PHP, but the data is not correct.
 

Rated

Member
Joined
May 7, 2008
Messages
2
sdang said:
Hi l008com2, did you ever find a solution? I've tried 2 or 3 scripts from the list of 'tools' for parsing RDF and am having no luck; I keep getting random errors. The one script that has worked is suck_DMOZ in PHP, but the data is not correct.

There is one I found that worked for dmoz rdf files at...

http://phpodpworld.sourceforge.net/

If you use MySQL it will run into errors, but they can easily be fixed. It uses Perl instead of PHP to extract the RDF files and insert data into the db. I'm also in the process of writing the same script for Chefmoz. I will post it when I finish if you're still looking.
 

hansfn

Curlie Meta
Joined
Aug 4, 2005
Messages
26
If you use MySQL it will run into errors, but they can easily be fixed.
Hm, I haven't gotten any reports about such problems (and I know it has worked for MySQL before), but please tell me so I can fix it. And if you modify phpODPWorld so it also works with the Chefmoz RDF, I would very much like to include the modifications into the next release.

PS! The project isn't dead - I have just been busy with other projects. I do reply to e-mail and read the mailing list.
 

weglobenet

Member
Joined
Oct 7, 2007
Messages
16
l008com2 said:
Does anyone out there have any advice? I'm having no luck at all trying to parse the Chefmoz feed into a MySQL database. And as I said, my problem isn't with the MySQL side at all; it's with getting the PHP XML parser to not choke on the Chefmoz dump. Is there anywhere I might find more info on this?

There is a Chefmoz RDF converted to a MySQL dump at

http://www.we-globe.net/WebLab/Download/DmozRdf2MySQL.html
 

Fluesse09

Member
Joined
Oct 3, 2009
Messages
8
What I am looking to do is this: say I have 5 PHP files on a website, and they all use the same database and mysql_connect string.
Code:
$db = "myDatabase";
$link = mysql_connect("localhost", "UserName", "Password") or die("Could not connect to server");
What I am looking to do is put this in a separate file and just call it, so if I ever need to change databases and/or location I only have to do it in one place.

You cannot just do the standard include:
include("data/myConnection.php");

I just need to pass the 2 variables back to the main page so it will work. Any ideas?
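For what it's worth, variables assigned in an included file become visible in the scope that includes it, so a plain include is usually enough. The sketch below only demonstrates that scoping rule. It writes a stand-in for data/myConnection.php to a temporary file so the example runs without a database, and the `$host` variable is a placeholder added for illustration. Note that `mysql_connect()` was removed in PHP 7.0; on current PHP, a real connection file would use something like `mysqli_connect()` instead.

```php
<?php
// Demo only: create a stand-in for data/myConnection.php in the temp directory.
$configFile = tempnam(sys_get_temp_dir(), 'conn');
file_put_contents($configFile, '<?php
$db   = "myDatabase";
$host = "localhost";
// A real file would also open the connection here and store it in $link.
');

// Variables set in an included file land in the including scope.
require $configFile;

echo "db=$db host=$host\n";

// Clean up the temporary file.
unlink($configFile);
```

If the include happens inside a function, the variables are local to that function, which is a common reason for them to seem "lost" on the main page.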
 