Posted

OK, I can see that this has been asked before, but I am afraid I am a bit dense and do not get it.

 

I am working on a school project that requires me to put together a search engine. Using the ODP database seems to be a perfect choice, but I am having a great deal of difficulty figuring out how to parse the data into SQL or some other larger database structure.

 

My budget for this project is about 0, so I am using an existing Windows XP computer with PHP, the Apache web server, and MySQL. This seemed like a logical choice of equipment and software.

 

I have read many different bulletin boards, and it seems that everyone who gets the data parsed is using either ODP Data Parser from http://www.ohardt.com/computer/dev/java/ or ODP2DB from http://rainwaterreptileranch.org/steve/sw/odp/rdflist.html. I would be happy to use either one, but I cannot figure out how to get either of them to run. :(

 

The Java code from ODP Data Parser is not a jar file, nor does it appear to be entirely a script. ODP2DB seems to be run from the command line, but for the life of me I have not been able to get it to do anything at all. I am not sure whether I need mod_perl installed or what.

 

Is there a step-by-step guide that anyone has written on how to parse this data?

 

If someone would help me through this, perhaps I can write up a step-by-step guide.

 

Any help on this project would be greatly appreciated.

 

Bob aka Waygrumpy

  • Editall
Posted

Frankly, using XP as a database/web server is not a good choice IMHO.

 

As per the rainwater page, you need Perl installed, as well as the DBI and XML::Parser modules. You may also need mod_perl in Apache if you are running Perl scripts on the web server.

 

While I didn't download and look at the Java program, it is possible that it's source code that needs to be compiled to run. This information should be in the documentation.

ODP Editor callimachus

Any opinions expressed are my own, and do not represent an official opinion or communication from the ODP.

Private messages asking for submission status or preferential treatment will be ignored.

Posted

I have not used any of the available software (for my needs I've made my own), but you should be aware that the computing demands of parsing are likely to be high.

 

I'm not a MySQL expert, but I found it was not up to the task for what I wanted to do. Expect your computer to be sitting there for a long time doing the parsing.

Posted

OK kewl glad I asked.

 

I do not have a problem with this project if the parse takes a while. I think I should be able to get fairly quick queries after that. If not, what do you suggest as the database to put this in?

 

Now, if XP is not the OS to use, what are you recommending? Linux? That is not a worry if it would be preferred. I just need to keep the cost down because this is a school project. You said that there were instructions on a page that said what needed to be used. Do you have a link for that?

 

At this point in the project I am really open to changes, so if there are reasons to change what I am doing, now is the time to do it.

 

Thanks for all the help so far.

 

Bob

Posted

First you have to parse the data and then you have to stuff the wanted data into a table.
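As a rough illustration of those two steps (parse, then load), here is a minimal Python sketch. It is not one of the tools discussed in this thread. The sample XML is a tiny stand-in modeled on the content dump, and its element names, the `pages` table, and the helper names are all assumptions to check against the real file. SQLite is used only so the sketch runs without a database server.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the content dump; element names are assumed, not verified.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://www.example.com/">
    <d:Title>Example Site</d:Title>
    <d:Description>A sample listing.</d:Description>
    <topic>Top/Arts</topic>
  </ExternalPage>
  <ExternalPage about="http://www.example.org/">
    <d:Title>Another Site</d:Title>
    <d:Description>A second listing.</d:Description>
    <topic>Top/Science</topic>
  </ExternalPage>
</RDF>
"""

ODP = "{http://dmoz.org/rdf/}"
DC = "{http://purl.org/dc/elements/1.0/}"


def iter_pages(stream):
    """Step 1: yield (url, title, description, topic) one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == ODP + "ExternalPage":
            yield (
                elem.get("about"),
                elem.findtext(DC + "Title", default=""),
                elem.findtext(DC + "Description", default=""),
                elem.findtext(ODP + "topic", default=""),
            )
            elem.clear()  # drop the element's contents once they are used


def load(stream, conn):
    """Step 2: insert the wanted fields into a (hypothetical) pages table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)"
    )
    with conn:  # commit once, after all rows are in
        conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", iter_pages(stream))


conn = sqlite3.connect(":memory:")
load(io.BytesIO(SAMPLE), conn)
rows = conn.execute("SELECT url, topic FROM pages ORDER BY url").fetchall()
```

The point of `iterparse` here is that the file is read as a stream, so a very large dump never has to fit in memory at once.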

 

My parsing routine in Perl, which was hacked together this year from something I did last year and needs tuning and rewriting, runs for about 2 hours. That is with nothing else running on the machine; it is very sensitive to other processes such as disk defragmentation and anti-virus software. If those are left running, in the worst case it has taken as long as seven hours. If I rewrote it and got an up-to-date machine, I would expect a much better time. At the moment I can split the parsing into subtasks, run them on different machines, and merge the results back together.

 

I believe one of the problems you run into is that a mass insert of new data (that is, 4 million sites) may take a long time. One way to reduce this is to remove all the indexes on the table and build them again after the data is loaded. Another way is to parse all the data into a rather large text file and then load from that. Once you have it loaded, performance should not be so bad.
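The "drop the indexes, load, rebuild" idea can be sketched as follows. This is only an illustration in Python with SQLite standing in for the real database; the `sites` table, the index name, and the generated rows are made up for the example. In MySQL, the text-file route would typically go through a `LOAD DATA INFILE` statement instead of row-by-row inserts.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load all rows first, then build the index once at the end."""
    # Without the index, each insert does not have to update it.
    conn.execute("DROP INDEX IF EXISTS idx_sites_catid")
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO sites (catid, url, title) VALUES (?, ?, ?)", rows
        )
    # A single index build over the finished table.
    conn.execute("CREATE INDEX idx_sites_catid ON sites (catid)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (catid INTEGER, url TEXT, title TEXT)")
conn.execute("CREATE INDEX idx_sites_catid ON sites (catid)")

# Hypothetical data: 10,000 generated rows spread over 500 categories.
rows = ((i % 500, f"http://example.com/{i}", f"Site {i}") for i in range(10000))
bulk_load(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
```

How much time this saves depends on the database and the hardware, so it is worth timing both ways on a sample before committing to a full load.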

 

But you have to think about what you will do when you get the next RDF dump: are you going to do the whole reload from scratch, or do a sophisticated match of the new RDF against the previous week's SQL table and merge the differences (update/change/delete)?
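For the merge option, the core of the work is a comparison of the old and new listings. A minimal sketch, assuming each dump has already been reduced to a mapping from URL to (title, description); the function name and the sample data are hypothetical:

```python
def diff_dumps(old, new):
    """Compare two {url: (title, description)} mappings.

    Returns (added, changed, removed): rows to insert, rows to update,
    and URLs to delete from last week's table.
    """
    added = {url: new[url] for url in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {
        url: new[url]
        for url in new.keys() & old.keys()
        if new[url] != old[url]
    }
    return added, changed, removed


last_week = {
    "http://a.example.com/": ("Site A", "First site"),
    "http://b.example.com/": ("Site B", "Second site"),
    "http://c.example.com/": ("Site C", "Third site"),
}
this_week = {
    "http://a.example.com/": ("Site A", "First site"),
    "http://b.example.com/": ("Site B", "Second site, new description"),
    "http://d.example.com/": ("Site D", "Fourth site"),
}

added, changed, removed = diff_dumps(last_week, this_week)
```

With around 4 million entries, holding both mappings in memory may not be practical, in which case the same comparison could be done in SQL against a staging table. A full reload is simpler and may well be good enough for a weekly dump.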

 

I suspect the SQL performance might improve with RAID drives, but I can't test that, since my RAID-equipped machines need drives that are now obsolete and have become overly expensive because of the difficulty of finding them.

 

As I said, I gave up on SQL and I'm not an expert - my comments are from some posts I remember from somewhere.

 

See http://resource-zone.com/forum/index.php?showtopic=9146 post #8

  • Editall
Posted
Now, if XP is not the OS to use, what are you recommending? Linux? That is not a worry if it would be preferred. I just need to keep the cost down because this is a school project. You said that there were instructions on a page that said what needed to be used. Do you have a link for that?

 

My personal preference would be Linux or one of the BSDs, with either MySQL for a bit of extra speed or PostgreSQL for better SQL compatibility. If you do stay with MS products, their server products are much better suited than a desktop product for use as a server.

 

The information I mentioned can be found at the site links you yourself provided in your first post.

ODP Editor callimachus


  • Editall
Posted

Off-Topic for bobrat

 

I suspect the SQL performance might improve with RAID drives, but I can't test that, since my RAID-equipped machines need drives that are now obsolete and have become overly expensive because of the difficulty of finding them.

 

Well, it will certainly improve reliability and help ensure integrity. You might consider migrating to a newer RAID configuration. A colleague just finished setting up a test server using a 3Ware controller and SATA drives. The performance so far has been impressive, and the cost is considerably less than SCSI.

ODP Editor callimachus


Posted

OK, I took everyone's advice and set up a new computer just to parse this data.

It is now running a full install of Fedora.

I have installed odp2db.

I tried to create the tables in MySQL and found I had to modify them a bit.

Here is what I built:

 

CREATE TABLE `resources` (
  `catid` int(11) NOT NULL default '0',
  `rtype` varchar(100) NOT NULL default '',
  `resource` blob NOT NULL
) TYPE=MyISAM;

CREATE TABLE `structure` (
  `catid` int(11) NOT NULL default '0',
  `topic` blob NOT NULL,
  `title` blob NOT NULL,
  `description` blob NOT NULL,
  `lastupdate` timestamp(14) NOT NULL,
  PRIMARY KEY (`catid`)
) TYPE=MyISAM;

# Table structure for table `xurls`
CREATE TABLE `xurls` (
  `catid` int(11) default NULL,
  `priority` int(11) default NULL,
  `ages` varchar(32) default NULL,
  `mediadate` varchar(32) default NULL,
  `url` blob,
  `title` blob,
  `description` blob
) TYPE=MyISAM;

 

 

I then modified lines 80 and 81 to look something like the following:

my $dbh = DBI->connect("dbi:Pg:dbname=test;host=localhost","username","password")
    or die "Can't connect to database2: $DBI::errstr\n";

 

I thought I was finally on track, but when I ran the script I got the following error:

 

[root@localhost odp2db]# perl content2db.pl
DBI connect('dbname=test;host=localhost','wirthit',...) failed: could not connect to server: ����8�/ at content2db.pl line 80
Can't connect to database2: could not connect to server: ����8�/

 

 

I think I am getting close but am not there yet. Has anyone had this same problem??

 

Any help would be appreciated.
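One thing that stands out in the connect line above: a data source name starting with `dbi:Pg` asks Perl DBI for the PostgreSQL driver, while the tables above were created in MySQL. If that mismatch is the cause (an assumption; the thread never confirms it), a DBD::mysql data source of the form `dbi:mysql:database=test;host=localhost` would be the thing to try. The small Python helper below is purely illustrative and hypothetical; it only shows which driver each string selects.

```python
def dbi_driver(dsn):
    """Return the driver part of a Perl DBI data source name.

    Handles only the plain "dbi:<driver>:<rest>" form.
    """
    scheme, driver, _rest = dsn.split(":", 2)
    if scheme.lower() != "dbi":
        raise ValueError(f"not a DBI data source name: {dsn!r}")
    return driver


# The string used in the post above selects the PostgreSQL driver.
used = dbi_driver("dbi:Pg:dbname=test;host=localhost")

# A MySQL data source name selects the mysql driver instead.
wanted = dbi_driver("dbi:mysql:database=test;host=localhost")
```

Either way, the matching DBD module (DBD::Pg or DBD::mysql) has to be installed and the database server has to be running and accepting connections on that host.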

Posted
While I didn't download and look at the Java program, it is possible that it's source code that needs to be compiled to run. This information should be in the documentation.

 

For the record, there is no documentation in the downloadable file. I was a bit lost as well as I don't normally deal with Java.

 

-drmike

Posted

Sorry I have been out of town for a while now.

 

Dr Mike, did you ever figure out how to get the Java to work?

 

Does anyone know where I went wrong above?

Posted

Hello,

 

Follow this guide and you will be on your way to parsing the RDF files quite quickly.

First off, I'm assuming you're using Windows 2000 or higher and a high-speed connection.

Go to http://www.en.wampserver.com/index.php and download the WAMP server.

It comes preconfigured to install Apache, PHP, MySQL, and phpMyAdmin.

The people who built this know what they're doing, and it works right out of the box! :)

 

 

Next, go to https://sourceforge.net/projects/dmoz2mysql/ and download the PHP script.

Now open the config file and add your database username and the name of your database.

For example, mine are user: root and database name: dmoz.

Now set your download speed. The default is 50 KB/s, but if you have a good connection like mine you can set it to 500 or whatever.

Next, save and close your config.php and copy all of the files to the www directory on your computer.

Next, run http://localhost/create_tables.php

After that, run http://localhost/start_script.php and you're done! :D

Sit back, and the script will download the correct files from dmoz.org, unzip them, clean them, and insert the data into the MySQL database that you chose in your config file.

Oh yeah, one more thing: be patient. Once you run start_script.php it will hang for about 1 minute, then it will change pages and start downloading and unzipping. Sit back and wait; it takes about 6 hours to finish all the data and structure.

 

This is the easiest and fastest way for beginners to use the RDF files.

Hope this helps. Let me know. :)

Posted

Yup, it took me about 5.5 hours to run dmoz2mysql and get the files into a database. The strange thing, though, was that whatever I searched for using the mysql command-line client, I never found anything. Not sure what to make of it.

 

Plus there's no web interface for actually searching afterwards.

 

The latest version was buggy for me as well. I went back a version, which worked fine.

 

Oh well. Guess no search engine for me. Plus they won't let me be an editor here either. Guess it's back to Zeal.

 

-drmike

Posted

I guess I am cursed. I am trying to run that software and am only getting errors.

Has anyone gotten start_script.php to work?

Well, I will try to debug this PHP file; it looks like it is pretty well written and documented.

If someone has a version that is working, please let me know.

Posted

Hello,

 

What errors are you getting? Can you copy and paste them here? If so, I can probably help you out. Send me a PM with your email address and I will email you the version that I'm using right now.

Also, if anyone is interested, we will be performing RDF extractions into MySQL and MSSQL, along with XML feeds of the Open Directory.

We are currently finishing our new site and will post a link here in about 48 hours so that everyone can get more information. Anyone who would like to reserve a spot or speak further, please PM me.

Our team can do custom PHP programming if you need to import the MySQL, MSSQL, or XML data into an existing script or build one from scratch.

Posted

Waygrumpy, are you trying to use this script locally or on an actual HTTP server? If you're using it locally, make sure the permissions (chmod) are set so that you can execute it, normally 755 locally.

On a real HTTP server (in a datacenter), try setting the permissions to 777. I had to work with the script a bit to make it work on an actual server.

Another thing to keep in mind: if you're on shared hosting, make sure you have enough disk space. The file may only be 250 MB, but once you gunzip it, it becomes about 1.5 GB, and the script I mentioned above doesn't verify the space needed. Also, if you're on shared hosting, talk to the system admin and make sure you have the proper permissions to execute such a script; some hosts don't like people doing this.
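A disk-space check before unpacking is easy to add by hand. The sketch below is a standalone Python illustration, not part of dmoz2mysql; the function name and the `expected_ratio` of 6 are assumptions, the ratio being derived from the sizes quoted above (about 250 MB compressed, about 1.5 GB unpacked).

```python
import gzip
import os
import shutil
import tempfile


def gunzip_if_room(src, dst, expected_ratio=6):
    """Decompress src to dst only if the disk seems to have enough room.

    expected_ratio is a rough guess at how much the file grows when unpacked.
    """
    needed = os.path.getsize(src) * expected_ratio
    free = shutil.disk_usage(os.path.dirname(os.path.abspath(dst))).free
    if free < needed:
        raise OSError(f"need about {needed} bytes, only {free} free")
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # copies in chunks, not all at once
    return os.path.getsize(dst)


# Small self-contained demo with a generated file.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.rdf.u8.gz")
dst = os.path.join(workdir, "sample.rdf.u8")
payload = b"<RDF></RDF>\n" * 1000
with gzip.open(src, "wb") as f:
    f.write(payload)

size = gunzip_if_room(src, dst)
```

Keep in mind that the database itself also needs space on top of the unpacked file, so the real requirement is higher than this check suggests.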

 

If you're on a dedicated server, make sure you're logged in as root; this can sometimes make a big difference in whether or not you can execute this script.

If you verify all of the above and still have no luck, PM me and I will try to help you out.

  • Editall
Posted
On a real HTTP server (in a datacenter), try setting the permissions to 777. I had to work with the script a bit to make it work on an actual server.

 

If you want to retain any semblance of control or security, I wouldn't recommend setting the files on a publicly accessible server to globally read-write-execute (777). 755 should be sufficient. If it isn't, then you have another problem somewhere.
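To make the difference between the two modes concrete, here is a short Python sketch that spells out what the octal numbers mean. The helper names are made up for the example.

```python
import stat


def describe_mode(mode):
    """Render octal permission bits the way `ls -l` shows a regular file."""
    return stat.filemode(stat.S_IFREG | mode)


def world_writable(mode):
    """True if any user on the system may modify the file."""
    return bool(mode & stat.S_IWOTH)


safe = describe_mode(0o755)   # owner may write; everyone else reads and executes
risky = describe_mode(0o777)  # everyone may read, write, and execute
```

The only difference between the two is the write bit for group and others, which is exactly what lets another account on the same machine alter the scripts.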

ODP Editor callimachus


  • 2 weeks later...
Posted

web interface for dmoz2mysql

 

Hello,

 

It took me 4 hours to download, parse, and import the data into MySQL.

Now what? :) I've found out that it doesn't come with any front-end interface. I wonder what the statement "Output in colors - how cool is that!" in the dmoz2mysql documentation means.

I am trying to figure out how to use the data. I would highly appreciate it if anybody could help me with this simple issue. Is there any ready-to-use web interface for dmoz2mysql? I am not good at PHP, and it would take me months to write one myself. I need only a couple of categories from dmoz.
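Pulling out "only a couple of categories" mostly comes down to one query on the topic path. Below is a minimal sketch in Python with SQLite. The `pages` table and its columns are hypothetical and do not reflect the actual dmoz2mysql schema, so the names would need to be adapted to whatever tables the import created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (topic TEXT, url TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("Top/Arts", "http://arts.example.com/", "Arts Site"),
        ("Top/Arts/Music", "http://music.example.com/", "Music Site"),
        ("Top/Arts_and_Crafts", "http://crafts.example.com/", "Crafts Site"),
        ("Top/Science/Physics", "http://physics.example.com/", "Physics Site"),
    ],
)


def pages_in_category(conn, category):
    """Return (topic, url, title) rows for a category and its subcategories."""
    category = category.rstrip("/")
    sql = """
        SELECT topic, url, title FROM pages
        WHERE topic = :cat
           OR substr(topic, 1, length(:cat) + 1) = :cat || '/'
        ORDER BY topic, title
    """
    return conn.execute(sql, {"cat": category}).fetchall()


arts = pages_in_category(conn, "Top/Arts")
```

The comparison uses `substr` rather than `LIKE` because `_` is a wildcard in `LIKE` patterns and topic names may contain underscores. The trailing `/` keeps `Top/Arts_and_Crafts` from matching `Top/Arts`.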

 

Thanks in advance!

  • 6 months later...
Posted

Is the script working?

 

I run start_script.php and the download status bar gets to the end after a few minutes of downloading. Then it just sits there with a white screen with numbers and no error messages.

Is it working? I am running it locally, so I am not positive my permissions are right. I tried chmod-ing with PHP, but I have no way of checking whether the chmod worked. Right?

Anyway, the screen spits out tons of these. Notice the number it stops on? Is it broken?

 

[0m.[01;32m..[0m.[01;32m..[0m.[01;32m..[0m.[01;32m..[0m.[01;3
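The `[0m` and `[01;32m` fragments look like ANSI color codes, which a terminal turns into colored text but a browser prints literally. If that reading is right (it is an assumption), the dots between them are most likely progress markers rather than errors. A minimal Python sketch for stripping such codes from captured output:

```python
import re

# Matches ANSI "select graphic rendition" sequences such as ESC[0m or ESC[01;32m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Remove ANSI color codes, leaving only the visible text."""
    return ANSI_SGR.sub("", text)


raw = "\x1b[01;32mCHECKING FOR UPDATES... \x1b[0m\x1b[01;32m.\x1b[0m"
clean = strip_ansi(raw)
```

The pattern expects the escape character (`\x1b`) in front of each bracket. Output that has been copied from a browser may have lost that character, in which case the pattern would need to be adjusted.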

 

In my folder I have these two files with sizes:

 

structure.rdf.u8.gz - 493,181

structure.rdf - 60,229

 

Nothing is posted in my database tables?
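The file sizes listed above are odd: the extracted file is much smaller than the compressed one, while unpacking normally makes a file larger. One possible explanation (a guess, not something the thread confirms) is that the download or the extraction was cut short. A quick way to test a `.gz` file is to read it to the end and see whether that fails. The sketch below uses hypothetical names and generated data.

```python
import gzip
import io
import zlib


def gzip_is_complete(path_or_file):
    """Return True if the gzip data can be read through to its end marker."""
    try:
        with gzip.open(path_or_file, "rb") as f:
            while f.read(1024 * 1024):  # read in 1 MB chunks and discard
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False


good = gzip.compress(b"<RDF></RDF>\n" * 5000)
truncated = good[: len(good) // 2]  # simulate an interrupted download

ok = gzip_is_complete(io.BytesIO(good))
broken = gzip_is_complete(io.BytesIO(truncated))
```

On a real file the call would simply be `gzip_is_complete("structure.rdf.u8.gz")`. If it reports an incomplete file, downloading again is the first thing to try.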

 

Is there another solution?

 

Thanks everyone

Posted
Since ODP does not support any of these scripts, an editor is unlikely to provide an answer. Someone who happens to use the script might see this post and help you.
  • 5 months later...
Posted

I know this thread is a year old, but I hope someone answers me.

I have used the dmoz2mysql script, but I always get an error when I run start_script.php. The error is this:

Warning: fopen(structure.rdf.u8.gz): failed to open stream: Permission denied

Does this occur on my host server or with dmoz? I use v3. I have already changed the permissions to the maximum (777)!

Thanks.

  • 7 months later...
Posted

Hi dane,

 

If the permissions change didn't work, it sounds like you have an open_basedir restriction on your web server. You will need to add the directory's path to the open_basedir setting in your web server's PHP configuration for fopen to work in the directory you are trying to read from.

 

Cheers,

David

  • 1 year later...
Posted

I am not sure if this thread is still active or not.

After a few modifications to the original scripts, I tried parsing the short examples on the dmoz web site.

It seems structure.example.txt works fine, but when it comes to content.example.txt I get an error.

Am I missing something here? Please help!

------

CHECKING FOR UPDATES...
OK. Ready for an update!
Warning: Could not delete the old file. Maybe it isn't created yet.
Warning: Could not delete the old RDF file. Maybe it isn't created yet.
Downloading file information: DOWNLOADED
The script is downloading http://rdf.dmoz.org/rdf/structure.example.txt
The file is downloaded with 500 KB/s
Size of the file is: 42 KB
The DMOZ data dumps were last updated: Wed, 09 Aug 2006 20:27:27 GMT
Downloading please wait .
STRUCTURE.EXAMPLE.example.txt WAS SUCCESSFULLY DOWNLOADED
Download URL: http://rdf.dmoz.org/rdf/structure.example.txt
Extracting the example file...
STRUCTURE.EXAMPLE.example.txt WAS SUCCESSFULLY EXTRACTED!
Filename: structure.example
Cleaning structure.example
This could take some time!
Status: .
Finished!
Fatal error: Could not do the following query:

--
