trinity Posted December 24, 2009 Hi all, I'm doing a mini-project on automated URL classification. For this, I'd like to obtain about 10,000 URLs from the DMOZ ODP, uniformly distributed across all categories. Is there any way to do this? Please help. Thanks in advance!
RZ Admin photofox Posted December 24, 2009 You would probably need to use a custom script to pull URLs from our RDF: http://rdf.dmoz.org/
trinity Posted December 24, 2009 Hi, thanks. But could you explain in more detail what you mean by a custom script to pull URLs from your RDF (http://rdf.dmoz.org/)? I am new to this area. Thanks in advance, Trinity.
RZ Admin photofox Posted December 24, 2009 I'm not that familiar with using the RDF myself. I do know that we don't offer any scripts directly that would do what you want. You'd probably need to write some sort of script that would scan the RDF and extract 10,000 URLs from various categories. You might find something that would work in http://www.dmoz.org/Computers/Internet/Searching/Directories/Open_Directory_Project/Use_of_ODP_Data/Upload_Tools/ The only problem I see is that you say you'd like around 10,000 URLs uniformly distributed across all categories, but we have over 590,000 categories. With such a small sample, I'm not sure you'll be able to get a uniformly distributed set of URLs from *all* categories. Or maybe I'm not understanding exactly what you are looking for...
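To make the "script that would scan the RDF" idea concrete, here is a minimal Python sketch. It rests on assumptions that are not confirmed anywhere in this thread: that the uncompressed content dump is XML in which each listing is an `ExternalPage` element carrying its URL in an `about` attribute and its category path in a `topic` child (for example `Top/Arts/Music`). The function names are made up for illustration. Because a small sample cannot cover every category, the sketch draws a fixed number of URLs per top-level category using reservoir sampling, so the dump is read once and never held in memory as a whole.

```python
import random
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(name):
    """Drop the '{namespace}' prefix ElementTree puts on tag and attribute names."""
    return name.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (url, topic) for every ExternalPage element in the dump.

    `source` is a file name or a binary file object. The element and
    attribute names are assumptions about the dump format.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        url = next((v for k, v in elem.attrib.items()
                    if local_name(k) == "about"), None)
        topic = next((c.text for c in elem
                      if local_name(c.tag) == "topic"), None)
        if url and topic:
            yield url, topic.strip()
        elem.clear()  # drop the element's children and text once it is handled


def sample_per_category(pages, per_category, seed=0):
    """Reservoir-sample up to `per_category` URLs for each top-level category."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for url, topic in pages:
        parts = topic.split("/")
        # "Top/Arts/Music" -> "Arts"; fall back to the whole string otherwise.
        top = parts[1] if len(parts) > 1 else parts[0]
        seen[top] += 1
        bucket = reservoirs[top]
        if len(bucket) < per_category:
            bucket.append(url)
        else:
            # Keep each URL seen so far with equal probability (Algorithm R).
            j = rng.randrange(seen[top])
            if j < per_category:
                bucket[j] = url
    return dict(reservoirs)
```

Usage would look something like `sample_per_category(iter_pages("content.rdf.u8"), per_category=700)`, where the file name is only a guess at what the uncompressed dump is called. With roughly fifteen top-level categories that would land near 10,000 URLs, but the real category count should be checked against the dump itself.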
trinity Posted December 24, 2009 I want to build an application that automates the classification of URLs into different categories. To begin with, I want just two broad categories: educational sites and non-educational sites. I want about 10,000 URLs to train and test my classifier (so these URLs should be a mix of educational and non-educational sites). Is there any way to do this?
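One way to build such a two-class set, sketched in Python, is to label each URL by its directory category and then draw a balanced sample that is split into training and test portions. The sketch assumes the (URL, category path) pairs have already been extracted from the dump by some other means, and it assumes that listings under `Top/Reference/Education` are a reasonable proxy for "educational". That prefix, the function names, and the 80/20 split are illustrative choices, not anything the directory prescribes.

```python
import random

# Assumption: categories under these paths count as "educational".
EDU_PREFIXES = ("Top/Reference/Education",)


def label(topic):
    """Return 'edu' or 'non-edu' for a category path such as 'Top/Arts/Music'."""
    for prefix in EDU_PREFIXES:
        if topic == prefix or topic.startswith(prefix + "/"):
            return "edu"
    return "non-edu"


def balanced_split(pages, total=10000, test_fraction=0.2, seed=0):
    """Draw an equal number of URLs per label and split them into train/test.

    `pages` is an iterable of (url, topic) pairs. All URLs are held in
    memory, which is simple but not frugal for a multi-million-row dump.
    """
    rng = random.Random(seed)
    by_label = {"edu": [], "non-edu": []}
    for url, topic in pages:
        by_label[label(topic)].append(url)

    # Never ask for more URLs than the smaller class can supply.
    per_label = min(total // 2, len(by_label["edu"]), len(by_label["non-edu"]))
    n_test = int(per_label * test_fraction)

    train, test = [], []
    for name, urls in by_label.items():
        chosen = rng.sample(urls, per_label)
        test += [(u, name) for u in chosen[:n_test]]
        train += [(u, name) for u in chosen[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

Labelling by category path is only as good as the chosen prefixes: educational sites listed elsewhere in the directory would end up in the non-educational class, so the prefix list is worth reviewing by hand before training on the result.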
Inspirovations Posted September 20, 2010 Take a look at http://rdf.dmoz.org/. You'll need to find a way to parse the RDF into your database. I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly, so they work fine. I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme; I just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at the top of each script for that) and modify the connection details in the scripts before you start. <link drop/pseudo sig removed>
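For readers who would rather not set up the Perl scripts, the same "load it into a database, then query it" approach can be sketched with Python's built-in `sqlite3` module. This is a stand-in, not a port of odp2db: the single `pages` table, its columns, and the function names below are invented for illustration and do not match the schema those scripts create. The sketch assumes the (URL, category path) pairs have already been parsed out of the dump.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical one-table schema: one row per listed URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT NOT NULL, topic TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS pages_topic ON pages (topic)")


def load_pages(conn, pages, batch_size=10000):
    """Insert (url, topic) pairs in batches and return how many were stored."""
    insert = "INSERT INTO pages (url, topic) VALUES (?, ?)"
    batch, total = [], 0
    for row in pages:
        batch.append(tuple(row))
        if len(batch) >= batch_size:
            conn.executemany(insert, batch)
            total += len(batch)
            batch.clear()
    if batch:
        conn.executemany(insert, batch)
        total += len(batch)
    conn.commit()
    return total


def random_urls(conn, topic_prefix, n):
    """Return up to n random URLs listed in a category or its subcategories."""
    child_prefix = topic_prefix + "/"
    # substr() is used instead of LIKE because category names contain
    # underscores, which LIKE would treat as wildcards.
    cur = conn.execute(
        "SELECT url FROM pages "
        "WHERE topic = ? OR substr(topic, 1, ?) = ? "
        "ORDER BY RANDOM() LIMIT ?",
        (topic_prefix, len(child_prefix), child_prefix, n),
    )
    return [row[0] for row in cur]
```

Once the table is filled, a call such as `random_urls(conn, "Top/Reference/Education", 5000)` pulls a random slice of one branch of the directory. `ORDER BY RANDOM()` sorts every matching row, so it is convenient rather than fast on a table with millions of rows.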