
collecting sequences of hyperlink labels



Guest sperugin
Posted

Hello,

 

I am interested in collecting sequences of hyperlink labels (the anchor text of each <a href> link) for various sub-branches of ODP. These sequences are essentially the breadcrumbs at the top of each page. The URL also happens to mirror the current sequence.

 

For example, the News sub-branch contains the following selected sequences:

 

News: Breaking News: Business and Economy ...

News: Breaking News: Official Press Releases ...

...

...

 

However, I want to preserve crosslinks from one branch of the directory to another, which the breadcrumb and URL do not capture.

 

I'm interested in collecting thousands of such sequences in selected sub-branches. I know that the ODP dump is available in RDF. My question is: what is the best method to collect such sequences in bulk from the available data (e.g., using XSLT)?

 

Thank You,

Saverio

Posted
You know that there is also an RDF dump available at http://rdf.dmoz.org/ which contains only the category hierarchy information, don't you? It shouldn't be too difficult to parse this one and extract the information you're looking for.
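
As a rough illustration of that parsing approach, here is a minimal Python sketch. It assumes the structure dump consists of `Topic` elements whose `r:id` attribute holds the category path (e.g. `Top/News/Breaking_News`) and whose `symbolic` and `related` children carry crosslinks in an `r:resource` attribute. The namespace URI, element names, and the inline `SAMPLE` data are assumptions made for illustration and should be checked against the real file; `collect_sequences` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# Assumed RDF namespace for the dump; verify against the real file header.
RDF_NS = "http://www.w3.org/TR/RDF/"

# Tiny made-up sample in the assumed layout, used only for demonstration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/News">
    <narrow r:resource="Top/News/Breaking_News"/>
  </Topic>
  <Topic r:id="Top/News/Breaking_News">
    <narrow r:resource="Top/News/Breaking_News/Business_and_Economy"/>
    <symbolic r:resource="Weather:Top/News/Weather"/>
  </Topic>
</RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_sequences(source, branch):
    """Yield (labels, crosslinks) for each Topic at or below `branch`.

    `source` is a filename or binary file object; it is parsed as a stream
    so the whole dump is never held as one fully populated tree.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "Topic":
            continue
        topic_id = elem.get(f"{{{RDF_NS}}}id", "")
        if topic_id == branch or topic_id.startswith(branch + "/"):
            # Drop the leading "Top" and turn underscores back into spaces.
            labels = [p.replace("_", " ") for p in topic_id.split("/")[1:]]
            crosslinks = []
            for child in elem:
                name = local_name(child.tag)
                if name.startswith(("symbolic", "related")):
                    target = child.get(f"{{{RDF_NS}}}resource", "")
                    # Assumed form "Label:Top/..."; keep only the path part.
                    crosslinks.append(target.split(":", 1)[-1])
            yield labels, crosslinks
        elem.clear()  # release the children of processed topics


if __name__ == "__main__":
    for labels, crosslinks in collect_sequences(io.BytesIO(SAMPLE), "Top/News"):
        print(": ".join(labels), "| crosslinks:", crosslinks)
```

To run it on a real dump, pass the path of a locally downloaded, decompressed copy instead of the `io.BytesIO(SAMPLE)` object. Streaming with `iterparse` is chosen here because the dump is large; an XSLT stylesheet could express the same selection but would typically need the whole document in memory.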
