collecting sequences of hyperlink labels

November 6, 2003

Hello,

I am interested in collecting sequences of hyperlink labels (the anchor text of each <a href> link) for various sub-branches of ODP. These sequences are essentially the breadcrumbs at the top of each page. The URL also happens to mirror the current sequence.

For example, the News sub-branch contains the following selected sequences:

News: Breaking News: Business and Economy ...

News: Breaking News: Official Press Releases ...

...

However, I want to preserve crosslinks from one branch of the directory to another, which the breadcrumb and URL do not capture.

I'm interested in collecting thousands of such sequences in selected sub-branches.

I know that the ODP dump is available in RDF. My question is: what is the best method to collect such sequences in bulk from the available data (e.g., using XSLT)?

Thank you,

Saverio

senox · November 8, 2003

You know that there is also an RDF dump available at http://rdf.dmoz.org/ which contains only the category hierarchy information, don't you? It shouldn't be too difficult to parse this one and extract the information you're looking for.
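
As a rough sketch of that parsing step (in Python rather than XSLT), the snippet below streams through a hierarchy dump and collects breadcrumb-style sequences together with their crosslinks. It rests on assumptions that should be checked against the actual file: that each category is a `Topic` element carrying an `id` attribute such as `Top/News/...`, and that subcategories and crosslinks appear as child elements named `narrow`, `symbolic` and `related` (possibly with a numeric suffix) carrying a `resource` attribute. The `SAMPLE` data and the helper names `breadcrumb` and `collect` are invented for illustration only.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample imitating the ASSUMED layout of the hierarchy dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/News/Breaking_News">
    <d:Title>Breaking News</d:Title>
    <narrow r:resource="Top/News/Breaking_News/Business_and_Economy"/>
    <symbolic r:resource="Official Press Releases:Top/News/Media/Press_Releases"/>
    <related r:resource="Top/News/Current_Events"/>
  </Topic>
</RDF>
"""

# Child elements treated as links (assumed tag names).
LINK_TAGS = {"narrow", "symbolic", "related"}


def local(name):
    """Drop a '{namespace}' prefix so the code does not depend on exact URIs."""
    return name.rsplit("}", 1)[-1]


def attr(elem, wanted):
    """Return the attribute whose local name is `wanted`, or None."""
    for key, value in elem.attrib.items():
        if local(key) == wanted:
            return value
    return None


def breadcrumb(path):
    """Turn 'Top/News/Breaking_News' into 'News: Breaking News'."""
    parts = path.split("/")
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return ": ".join(p.replace("_", " ") for p in parts)


def collect(source, branch):
    """Collect (category, link kind, link label, target) tuples under `branch`."""
    results = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "Topic":
            continue
        topic = attr(elem, "id")
        if topic and (topic == branch or topic.startswith(branch + "/")):
            for child in elem:
                # Treat variants such as 'narrow1' or 'symbolic2' like the base tag.
                kind = local(child.tag).rstrip("0123456789")
                target = attr(child, "resource")
                if kind not in LINK_TAGS or target is None:
                    continue
                label = None
                # Assumed form of a symbolic link: 'Label:Top/Path/To/Target'.
                if kind == "symbolic" and ":" in target:
                    label, target = target.split(":", 1)
                results.append((breadcrumb(topic), kind, label, breadcrumb(target)))
        elem.clear()  # drop the processed Topic's children to limit memory use
    return results


if __name__ == "__main__":
    for row in collect(io.BytesIO(SAMPLE), "Top/News"):
        print(row)
```

For the real dump, the `io.BytesIO(SAMPLE)` argument would be replaced by the path of the downloaded file. Streaming with `iterparse` avoids loading the whole hierarchy into memory, and keeping the `symbolic` and `related` entries is what preserves the crosslinks that the breadcrumb and URL alone do not show.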


Guest sperugin
