How to use ODP parsing software?

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
OK, I can see that this has been asked before, but I am afraid that I am a bit dense and do not get it. :confused:

I am working on a school project that requires me to put together a search engine. Using the ODP database seems to be a perfect choice, but I am having a great deal of difficulty figuring out how to parse the data into SQL or some other larger database structure.

My budget is about 0 :eek: for this project, so I am using an existing XP computer with PHP, the Apache web server, and MySQL. This seemed like a logical choice of equipment and software.

I have read many different bulletin boards, and it seems that everyone who gets the data to parse is using either ODP Data Parser from http://www.ohardt.com/computer/dev/java/ or ODP2DB from http://rainwaterreptileranch.org/steve/sw/odp/rdflist.html. I would be happy to use either one but cannot figure out how to get either of them to run. :(

The Java download for ODP Data Parser is not a jar file, nor does it appear to be entirely a script, and ODP2DB seems to be called from the command line, but for the life of me I have not been able to get it to do anything at all. I am not sure if I need mod_perl installed or what.

Is there a step-by-step guide that anyone has written on how to parse this data?

If someone would help me through this, perhaps I can write one up.

Any help on this project would be greatly appreciated.

Bob aka Waygrumpy
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
Frankly, using XP as a database/web server is not a good choice IMHO.

As per the rainwater page, you need Perl installed as well as the DBI and XML::Parser modules. You may also need mod_perl in Apache if you are using Perl scripts on the web server.

While I didn't download and look at the Java program, it is possible that it's source code that needs to be compiled to run. This information should be in the documentation.
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
I have not used any of the available software - for my needs I've made my own - but you should be aware that the computing demands of parsing are likely to be high.

I'm not a MySQL expert, but I found it was not up to the task for what I wanted to do. Expect your computer to be sitting there for a long time doing the parsing.
 

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
OK kewl glad I asked.

I do not have a problem if the parsing takes a while, as long as it finishes. I think that I should be able to get fairly quick queries after that. If not, what do you suggest as the database to put this in?

Now, if XP is not the OS to use, what are you recommending? Linux? That is not a worry if that would be preferred. I just need to keep the cost down because this is a school project. You said that there were instructions on a page that said what needed to be used. Do you have a link for that?

At this point in the project I am really open to changes, so if there are reasons to change what I am doing, now is the time to do it.

Thanks for all the help so far.

Bob
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
First you have to parse the data and then you have to stuff the wanted data into a table.

My parsing routine in Perl, which needs tuning and rewriting since it was hacked together this year from something I did last year, runs about 2 hours. That's with nothing else running on the machine; it's very sensitive to other processes such as disk defragmentation and anti-virus software. If those are left running, in a worst-case scenario it has taken as long as seven hours. If I rewrote it and got an up-to-date machine, I would expect a much better time. At the moment, I can actually split the parsing into subtasks, run them on different machines, and merge the results back together.
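A minimal sketch of the parsing half, in Python rather than Perl, and not the routine described above. It assumes the dump is RDF/XML with `ExternalPage` records carrying `d:Title`, `d:Description` and `topic` children; the sample document, the namespace URIs and the `iter_pages` name are illustrative assumptions and should be checked against a real dump. The point is the streaming style: one record is handled at a time instead of loading the whole file.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a content dump. Element names and namespace URIs are
# assumptions modelled on the ODP layout, not copied from a real file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Arts">
    <catid>2</catid>
    <link r:resource="http://example.com/a"/>
  </Topic>
  <ExternalPage about="http://example.com/a">
    <d:Title>Example A</d:Title>
    <d:Description>First example site.</d:Description>
    <topic>Top/Arts</topic>
  </ExternalPage>
  <ExternalPage about="http://example.com/b">
    <d:Title>Example B</d:Title>
    <d:Description>Second example site.</d:Description>
    <topic>Top/Arts</topic>
  </ExternalPage>
</RDF>
"""

ODP = "{http://dmoz.org/rdf/}"
DC = "{http://purl.org/dc/elements/1.0/}"


def iter_pages(stream):
    """Yield (url, title, description, topic) for each ExternalPage record."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == ODP + "ExternalPage":
            yield (
                elem.get("about"),
                elem.findtext(DC + "Title", default=""),
                elem.findtext(DC + "Description", default=""),
                elem.findtext(ODP + "topic", default=""),
            )
            elem.clear()  # free this record's children and text once used


pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page in pages:
    print(page)
```

For a real dump the `io.BytesIO(SAMPLE)` argument would be replaced by an open file, for example one returned by `gzip.open(path, "rb")`.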

I believe one of the problems you run into is that a mass insert of new data (that is, 4 million sites) may take a long time. One way to reduce this is to remove all the indexes on the table and build them again after the data is loaded. Another way is to parse all the data into a rather large text file, and then load from that. Once you have it loaded, performance should not be so bad.
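The "load first, index afterwards" idea can be sketched as follows. This uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not MySQL, and the table layout, the index name `idx_xurls_catid` and the `bulk_load` helper are assumptions for illustration.

```python
import sqlite3


def bulk_load(conn, rows):
    """Insert every row with no index in place, then build the index once."""
    cur = conn.cursor()
    # Drop the index (if a previous load left one) so inserts do not pay for it.
    cur.execute("DROP INDEX IF EXISTS idx_xurls_catid")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS xurls "
        "(catid INTEGER, url TEXT, title TEXT, description TEXT)"
    )
    cur.executemany(
        "INSERT INTO xurls (catid, url, title, description) VALUES (?, ?, ?, ?)",
        rows,
    )
    # Build the index a single time, after all the data is in the table.
    cur.execute("CREATE INDEX idx_xurls_catid ON xurls (catid)")
    conn.commit()


conn = sqlite3.connect(":memory:")
rows = [(i % 10, f"http://example.com/{i}", f"Site {i}", "") for i in range(1000)]
bulk_load(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM xurls").fetchone()[0]
indexes = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'"
).fetchall()
print(count, indexes)
```

For the text-file route in MySQL itself, the usual statement is `LOAD DATA INFILE`, which reads a delimited file straight into a table.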

But you have to think out what you do when you get the next RDF dump: are you going to do the whole reload from scratch, or do a sophisticated match of the new RDF against the previous week's SQL table and do a merge (update/change/delete)?
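A rough sketch of the merge option, assuming each dump has already been reduced to a mapping from URL to its listing fields. The `diff_dumps` name and the sample data are made up for illustration; a real merge over millions of rows would more likely be done inside the database.

```python
def diff_dumps(old, new):
    """Compare two {url: (title, description)} mappings.

    Returns (to_insert, to_update, to_delete) as sorted lists of URLs.
    """
    old_urls, new_urls = set(old), set(new)
    to_insert = sorted(new_urls - old_urls)
    to_delete = sorted(old_urls - new_urls)
    # A URL present in both dumps needs an update only if its fields changed.
    to_update = sorted(u for u in old_urls & new_urls if old[u] != new[u])
    return to_insert, to_update, to_delete


last_week = {
    "http://example.com/a": ("Site A", "Unchanged listing."),
    "http://example.com/b": ("Site B", "Old description."),
    "http://example.com/c": ("Site C", "Removed this week."),
}
this_week = {
    "http://example.com/a": ("Site A", "Unchanged listing."),
    "http://example.com/b": ("Site B", "New description."),
    "http://example.com/d": ("Site D", "Added this week."),
}

to_insert, to_update, to_delete = diff_dumps(last_week, this_week)
print(to_insert, to_update, to_delete)
```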

I suspect the SQL performance might improve with RAID drives, but I can't test that, since my RAID-equipped machines need drives that are now obsolete and have become overly expensive because of the difficulty of finding them.

As I said, I gave up on SQL and I'm not an expert - my comments are from some posts I remember from somewhere.

See http://resource-zone.com/forum/index.php?showtopic=9146 post #8
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
waygrumpy said:
Now, if XP is not the OS to use, what are you recommending? Linux? That is not a worry if that would be preferred. I just need to keep the cost down because this is a school project. You said that there were instructions on a page that said what needed to be used. Do you have a link for that?

My personal preference would be Linux or one of the BSDs, and either MySQL for a bit of extra speed or PostgreSQL for better SQL compatibility. If you do stay with MS products, their server products are much better than using a desktop product for a server.

The information I mentioned can be found at the site links you yourself provided in your first post.
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
Off-Topic for bobrat

bobrat said:
I suspect the SQL performance might improve with RAID drives, but I can't test that, since my RAID-equipped machines need drives that are now obsolete and have become overly expensive because of the difficulty of finding them.

Well, it will certainly improve reliability and help ensure integrity. You might consider migrating to a newer RAID configuration. A colleague just finished setting up a test server using a 3Ware controller and SATA drives. The performance so far has been impressive, and the cost is considerably less than SCSI.
 

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
OK, I took everyone's advice and set up a new computer just to parse this data.

It is running a full install of Fedora now.

I have installed odp2db.

I tried to install the tables into MySQL and found I had to modify them a bit.

Here is what I built:

CREATE TABLE `resources` (
  `catid` int(11) NOT NULL default '0',
  `rtype` varchar(100) NOT NULL default '',
  `resource` blob NOT NULL
) TYPE=MyISAM;

CREATE TABLE `structure` (
  `catid` int(11) NOT NULL default '0',
  `topic` blob NOT NULL,
  `title` blob NOT NULL,
  `description` blob NOT NULL,
  `lastupdate` timestamp(14) NOT NULL,
  PRIMARY KEY (`catid`)
) TYPE=MyISAM;

# Table structure for table `xurls`
CREATE TABLE `xurls` (
  `catid` int(11) default NULL,
  `priority` int(11) default NULL,
  `ages` varchar(32) default NULL,
  `mediadate` varchar(32) default NULL,
  `url` blob,
  `title` blob,
  `description` blob
) TYPE=MyISAM;


I then modified lines 80 and 81 of the script to look something like the following:

my $dbh = DBI->connect("dbi:pg:dbname=test;host=localhost","username","password")
or die "Can't connect to database2: $DBI::errstr\n";

I thought I was finally on track, but when I ran the script I got the following error:

[root@localhost odp2db]# perl content2db.pl
DBI connect('dbname=test;host=localhost','wirthit',...) failed: could not connect to server: at content2db.pl line 80
Can't connect to database2: could not connect to server:


I think I am getting close but am not there yet. Has anyone had this same problem??

Any help would be appreciated.
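
One thing worth checking against the connect line above, offered as a guess rather than a confirmed diagnosis: the data source name selects a PostgreSQL-style driver, while the tables were created in MySQL. In Perl DBI the PostgreSQL driver is DBD::Pg (prefix `dbi:Pg:`) and the MySQL driver is DBD::mysql (prefix `dbi:mysql:`), and driver names are case-sensitive. The Python sketch below only pulls a DSN apart to show which driver it names; `split_dsn` is a hypothetical helper, not part of DBI.

```python
def split_dsn(dsn):
    """Split a Perl DBI data source name into (driver, {attribute: value})."""
    scheme, driver, rest = dsn.split(":", 2)
    if scheme.lower() != "dbi":
        raise ValueError(f"not a DBI data source name: {dsn!r}")
    attrs = {}
    for part in rest.split(";"):
        if part:
            key, _sep, value = part.partition("=")
            attrs[key] = value
    return driver, attrs


driver, attrs = split_dsn("dbi:pg:dbname=test;host=localhost")
print(driver, attrs)
```

If MySQL is the intended target, a DBD::mysql data source normally takes the form `dbi:mysql:database=test;host=localhost`.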
 

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
Does this look like I am on at least the right track??

Has anyone got any suggestions what I might be doing wrong?

Thanks in advance
Bob :confused: :confused: :eek: :confused: :confused:
 

drmike

Member
Joined
Aug 6, 2004
Messages
38
Callimachus said:
While I didn't download and look at the Java program, it is possible that it's source code that needs to be compiled to run. This information should be in the documentation.

For the record, there is no documentation in the downloadable file. I was a bit lost as well as I don't normally deal with Java.

-drmike
 

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
Sorry I have been out of town for a while now.

Dr Mike, did you ever figure out how to get the Java to work?

Does anyone know where I went wrong above?
 

dizzlewizzle

Member
Joined
Sep 21, 2004
Messages
14
Hello,

Follow this guide and you will be on your way to parsing the RDF files quite quickly.

First off, I'm assuming you're using Windows 2000 or higher and a high-speed connection.

Go here http://www.en.wampserver.com/index.php and download the WAMP server.

This comes preconfigured to install Apache, PHP, MySQL, and phpMyAdmin.

The people who built this know what they're doing and it works right out of the box! :)


Next go here https://sourceforge.net/projects/dmoz2mysql/ and download the PHP script.

Now open the config file and add your database username and the name of your database.

For example, mine is user: root and database name: dmoz.

Now set your download speed. It is 50 KB by default, but if you have a good connection like mine you can set it to 500 or whatever.

Next, save and close your config.php and copy all of the files to the www directory on your computer.

Next run http://localhost/create_tables.php

After that, run http://localhost/start_script.php and you're done! :D

Sit back and the script will download the correct files from dmoz.org, unzip them, clean them, and insert the info into the MySQL database that you chose in your config file.

Oh yeah, and one more thing: be patient. Once you run start_script.php, it will hang for about 1 minute, then it will change pages and start downloading and unzipping. Sit back and wait; it takes about 6 hours for it to finish all the data and structure.

This is the easiest and fastest way for beginners to use the RDF files.

Hope this helps. Let me know. :)
 

drmike

Member
Joined
Aug 6, 2004
Messages
38
Yup, took me about 5.5 hours to run the dmoz2mysql thing and get the files into a database. The strange thing, though, was that whatever I searched for using MySQL command-line commands, I never found anything. Not sure what to make of it.

Plus there's no web interface for actually searching afterwards.

The latest version was buggy for me as well. I went back a version, which worked fine.

Oh well. Guess no search engine for me. Plus they won't let me be an editor here either. Guess it's back to Zeal.

-drmike
 

waygrumpy

Member
Joined
Aug 23, 2004
Messages
12
I guess I am cursed. I am trying to run that software and am only getting errors.

Has anyone gotten start_script.php to work?

Well, I will try to debug this PHP file. It looks like it is pretty well written and documented.

If someone has a version that is working, please let me know.
 

dizzlewizzle

Member
Joined
Sep 21, 2004
Messages
14
Hello,

What errors are you getting? Can you copy and paste them here? If so, I can probably help you out. Send me a PM with your email and I will email you the version that I'm using right now.

Also, if anyone is interested, we will be performing RDF extractions into MySQL and MSSQL, along with XML feeds of the Open Directory.

We are currently finishing our new site and will post a link here in about 48 hours so that everyone can get more information. Anyone who would like to reserve a spot or speak further, please PM me.

Our team can do custom PHP programming if you need to import the MySQL, MSSQL, or XML data into an existing script or build one from scratch.
 

dizzlewizzle

Member
Joined
Sep 21, 2004
Messages
14
Waygrumpy, are you trying to use this script locally or on an actual HTTP server? If you're using it locally, make sure that the permissions are set with chmod so that you can execute it, normally 755 locally.

On a real HTTP server (in a datacenter), try setting permissions to 777. I had to work with the script a bit to make it work on an actual server.

Another thing to keep in mind: if you're on shared hosting, make sure you have enough disk space. The file may only be 250 MB, but once you gunzip it, it becomes about one and a half GB, and the script I mentioned above doesn't verify the space needed. Also, if you're on shared hosting, talk to the system admin and make sure you have proper permissions to execute such a script; some hosts don't like people doing this.
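The disk-space check that the script skips could be sketched like this in Python. The `gunzip_with_space_check` helper and its `expected_ratio` of 6 are assumptions, the ratio being a rough guess from the 250 MB to 1.5 GB figures above; the demo at the bottom uses a small throwaway file rather than a real dump.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_with_space_check(src, dst, expected_ratio=6):
    """Decompress src to dst, refusing to start if the disk looks too full.

    expected_ratio is a guess at how much the file grows when unpacked.
    """
    needed = os.path.getsize(src) * expected_ratio
    free = shutil.disk_usage(os.path.dirname(os.path.abspath(dst))).free
    if free < needed:
        raise OSError(f"need about {needed} bytes free, only {free} available")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, not all at once
    return os.path.getsize(dst)


# Demo on a small throwaway file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.rdf.u8.gz")
dst = os.path.join(workdir, "sample.rdf.u8")
payload = b"<RDF></RDF>\n" * 1000
with gzip.open(src, "wb") as f:
    f.write(payload)

size = gunzip_with_space_check(src, dst)
print(size)
```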

If you're on a dedicated server, make sure you're logged in as root; this can sometimes make a big difference to whether or not you can execute this script.

If you try to verify all of the above and still have no luck, PM me and I will try to help you out.
 

Callimachus

Member
Joined
Mar 15, 2004
Messages
704
dizzlewizzle said:
On a real HTTP server (in a datacenter), try setting permissions to 777. I had to work with the script a bit to make it work on an actual server.

If you want to retain any semblance of control or security, I wouldn't recommend setting the files on a publicly accessible server to globally read-write-execute (777). 755 should be sufficient. If it isn't, then you have another problem somewhere.
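A small sketch of how the difference between 777 and 755 can be checked from a script. It assumes a POSIX system, since permission bits behave differently on Windows; `is_world_writable` and `mode_string` are hypothetical helper names, and the file is a temporary one created just for the demo.

```python
import os
import stat
import tempfile


def is_world_writable(path):
    """True if users other than the owner and group may write to the file."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


def mode_string(path):
    """Return the permission bits in octal form, e.g. '0o755'."""
    return oct(stat.S_IMODE(os.stat(path).st_mode))


fd, path = tempfile.mkstemp(suffix=".php")
os.close(fd)

os.chmod(path, 0o777)
before = is_world_writable(path)  # 777 lets anyone write to the file

os.chmod(path, 0o755)
after = is_world_writable(path)   # 755 keeps write access with the owner
mode = mode_string(path)

os.remove(path)
print(before, after, mode)
```

Reading the mode back this way is also one way to confirm that a chmod call actually took effect.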
 

Miller

Member
Joined
Oct 12, 2004
Messages
12
web interface for dmoz2mysql

Hello,

It took me 4 hours to download, parse and import the data into MySQL.
Now what? :) I've found out that it doesn't come with any frontend interface. I wonder what the "Output in colors - how cool is that!" statement in the dmoz2mysql documentation means.

I am trying to figure out how to use the data. I would highly appreciate it if anybody could help me with that simple issue. Is there any ready-to-use web interface for dmoz2mysql? I am not good at PHP and it would take me months to write it myself. I need only a couple of categories from dmoz.
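A minimal sketch of pulling the sites for one category subtree once the data is in a database. The table and column names follow the odp2db-style `structure` and `xurls` tables posted earlier in this thread; dmoz2mysql's real tables may well be named differently. Python's built-in `sqlite3` stands in for MySQL so the example runs on its own, and the rows and the `sites_under` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure (catid INTEGER PRIMARY KEY, topic TEXT, title TEXT);
CREATE TABLE xurls (catid INTEGER, url TEXT, title TEXT, description TEXT);
INSERT INTO structure VALUES (1, 'Top/Arts/Animation', 'Animation');
INSERT INTO structure VALUES (2, 'Top/Arts/Animation/Anime', 'Anime');
INSERT INTO structure VALUES (3, 'Top/Science/Physics', 'Physics');
INSERT INTO xurls VALUES (1, 'http://example.com/toons', 'Toons', 'Cartoons.');
INSERT INTO xurls VALUES (2, 'http://example.com/anime', 'Anime Hub', 'Anime.');
INSERT INTO xurls VALUES (3, 'http://example.com/quarks', 'Quarks', 'Physics.');
""")


def sites_under(conn, topic_prefix):
    """Return (topic, title, url) for every site in a category subtree."""
    return conn.execute(
        "SELECT s.topic, x.title, x.url "
        "FROM structure AS s JOIN xurls AS x ON x.catid = s.catid "
        "WHERE s.topic = ? OR s.topic LIKE ? "
        "ORDER BY s.topic, x.title",
        (topic_prefix, topic_prefix + "/%"),
    ).fetchall()


arts = sites_under(conn, "Top/Arts/Animation")
for row in arts:
    print(row)
```

Note that `_` is a single-character wildcard in `LIKE`, so a topic name containing underscores can match slightly more than intended unless it is escaped.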

Thanks in advance!
 

Bill_Gates

Member
Joined
May 3, 2005
Messages
4
Is the script working?

I run start_script.php and the download status bar gets to the end after a few minutes of downloading. Then it just sits there with a white screen with numbers and no error messages.

Is it working? I am running it locally, so I am not positive my permissions are right. I tried chmod-ing with PHP, but I have no way of checking if the chmod worked. Right?

Anyway, the screen spits out tons of these.
Notice the number it stops on? Is it broken?

[0m.[01;32m..[0m.[01;32m..[0m.[01;32m..[0m.[01;32m..[0m.[01;3
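
Fragments like `[0m` and `[01;32m` look like ANSI colour codes (reset, and bold green) with the leading escape character dropped, which is what a terminal would normally turn into coloured output. That is only a reading of the output above, not something confirmed from the script's source. A short Python sketch of stripping such codes so that only the progress dots remain; the sample string is reconstructed with the escape character put back in.

```python
import re

# Matches "select graphic rendition" sequences such as ESC[0m and ESC[01;32m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_colours(text):
    """Remove ANSI colour codes, leaving the plain text behind."""
    return ANSI_SGR.sub("", text)


raw = "\x1b[0m.\x1b[01;32m..\x1b[0m.\x1b[01;32m.."
clean = strip_colours(raw)
print(clean)
```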

In my folder I have these two files with sizes:

structure.rdf.u8.gz - 493,181
structure.rdf - 60,229
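
An unpacked file that is smaller than its compressed source might mean the download or the unzip stopped early. One way to test that is to read the .gz file through to the end, which fails if the stream is cut short. This is a Python sketch with a made-up `gzip_is_complete` helper, demonstrated on throwaway files rather than a real dump.

```python
import gzip
import os
import tempfile
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream can be read to its end marker."""
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # Truncated or corrupt streams fail while being read.
        return False
    return True


# Demo: one intact archive and one copy cut in half.
workdir = tempfile.mkdtemp()
good = os.path.join(workdir, "good.gz")
cut = os.path.join(workdir, "cut.gz")
with gzip.open(good, "wb") as f:
    f.write(os.urandom(50_000))
with open(good, "rb") as fin, open(cut, "wb") as fout:
    data = fin.read()
    fout.write(data[: len(data) // 2])

good_ok = gzip_is_complete(good)
cut_ok = gzip_is_complete(cut)
print(good_ok, cut_ok)
```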

Nothing is posted in my database tables.

Is there another solution?

Thanks everyone
 

Bill_Gates

Member
Joined
May 3, 2005
Messages
4
Please Help

The script isn't working, and I am not getting any styling when start_script.php begins.
I think I am having permissions problems?
 
This site has been archived and is no longer accepting new content.