Posted
I also would be happy with the old format as I have tools written for it. I have a copy of the March 2017 dump of dmoz that I still use, but it is getting stale.
  • 1 month later...
Posted
I second the request for this data to be made available. Quite a lot of scientific research in computing uses "the ODP dataset", and it would be great if we no longer had to take it from the Internet Archive but could instead use an up-to-date version that you maintain. That would be very helpful. The exact format is not important, but leaving it as it has been is not a bad idea.
  • 5 weeks later...
  • RZ Admin
Posted

All I can really do is continue to prod the other people involved in the decision making and technical processes... ;)

We are rather resource-limited, so any ideas for economically creating and then hosting such large files would be welcome; either start a new thread here or send a PM... TIA

elper {moz}:blue_arrow1::curlie:

All opinions expressed are my own, and do not necessarily represent the official point of view of the administration of either this forum or the directory.

Posted

I would be happy to organize in my institute (Radboud University) a mirror of these files for download, if you would need that support.

 

Also, using academictorrents could be an attractive route - we can even combine those two approaches, such that there's some level of guarantee that the dataset would be available.

  • 4 weeks later...
Posted

I'm throwing my hat in the ring to offer assistance.

 

If human resource constraints are an issue, I'm willing to assist as needed.

 

I've been periodically returning here (even to DMOZ, before this) for quite some time, looking for downloads of the RDF data (or other formats, say, JSON-LD).

 

I'm a long-time polyglot developer. What can I do to help?

 

This directory is too valuable to keep "locked up" in HTML. It's a treasure trove of curated data. The RDF data (and linked sites) could serve as labeled training data for any number of ML projects.

  • Meta
Posted

Thanks for your offer of assistance!

 

Without being directly involved, I think the technical people handling this issue have the resources needed. Also, we prefer to use our internal editors when developing (although anyone is welcome to apply to be one).

 

I think there is internal support for the data to be made available again; it's just a matter of priorities now...

Curlie (Dmoz) Meta editor informator
  • RZ Admin
Posted

If you'd like to see how Curlie is from the inside, by all means sign up as editor to a category that interests you. You'll then have access to our internal fora, and maybe add a few interesting sites to the directory at the same time...

(If you had a Dmoz or Curlie editor account that has expired, it can be reactivated.)


  • 5 weeks later...
Posted

The old RDF might be a bit risky to use, as some of the sites may have been deleted and/or re-registered since the last known good version. Link rot was a problem with the old RDF, and Dmoz didn't scale well. The only way of dealing with it effectively is by tracking domain names. That's doable for the gTLDs, but many ccTLDs do not publish their zones.

 

Regards...jmcc

  • Meta
Posted

Yes, any new data dump would be made of fresh data from the current directory.

 

We have automatic tools that audit listings but we also believe that a human touch is of big importance in spotting changed/dead urls. :)

Posted

"We have automatic tools that audit listings but we also believe that a human touch is of big importance in spotting changed/dead urls. :)"

I took a quick look at the last known good version of the RDFs and the URLs. The gTLD domain names were still the main group. From what I remember of Dmoz, it used to use a crawler to check domain names. That's a very time-consuming way of doing it. There are some indicators for changed and dead urls that can be checked more efficiently.

 

Regards...jmcc

  • 3 weeks later...
  • RZ Admin
Posted

The last version of the RDF dates back several years now (and is used on dmoztools dot net), or isn't that what you meant?

"There are some indicators for changed and dead urls that can be checked more efficiently."
We are all ears for increased efficiency :cool:


Posted

Yes. That's the RDF I was using. Basically, I have a database that tracks domain name transactions across the legacy and new gTLDs and some ccTLDs, and periodically builds IP address maps of all the sites for active domain names.
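
For readers wondering what a host-level check of this kind might look like, here is a minimal sketch. It is not the database described above: it only groups listing URLs by hostname, resolves each hostname once, and flags URLs whose host no longer resolves as candidates for human review. The function names (`group_by_host`, `build_ip_map`, `dead_candidates`) are hypothetical, and a host that resolves can of course still have changed owner, so this is one cheap indicator at best.

```python
import socket
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map each hostname to the listing URLs that use it."""
    hosts = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname  # None when the string has no host part
        if host:
            hosts[host].append(url)
    return dict(hosts)


def resolve(host):
    """Return the sorted IP addresses for a host, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(host, None)
    except (OSError, UnicodeError):
        return []
    return sorted({info[4][0] for info in infos})


def build_ip_map(urls, resolver=resolve):
    """Resolve every distinct hostname once and return {host: [ip, ...]}."""
    return {host: resolver(host) for host in group_by_host(urls)}


def dead_candidates(urls, resolver=resolve):
    """List URLs whose hostname no longer resolves (review candidates only)."""
    grouped = group_by_host(urls)
    return [
        url
        for host, host_urls in grouped.items()
        if not resolver(host)
        for url in host_urls
    ]
```

One lookup per hostname rather than one HTTP fetch per URL is the point of the design: many listings share a host, and a DNS query is far cheaper than a crawl. The `resolver` parameter is there so the logic can be exercised without network access.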

 

Regards...jmcc

  • 11 months later...
Posted

The RDF dump site of Curlie is showing: "Curlie's data (RDF) is currently unavailable due to technical issues."

Any information on when it will be available for download?

TIA

  • 2 weeks later...
Posted

To all the editors at Curlie:

I understand these things take time.

Just wondering if four years might be adequate time to make a decision about this minor technical matter.

I posted my original question on this topic in 2018 :)

 

The data dump is of course not necessary, merely helpful. It would take me a few days to write a crawler to crawl Curlie and generate the same data myself. But it would make much more sense for Curlie to do a more formal data dump that would benefit all users as well.

  • 4 months later...
Posted (edited)
Hey there. Just wondering what the status is for the RDF dumps? I actually wrote a DMOZ RDF converter several years ago that'll get the data categorized into CSV files. So, the RDF dumps are all I'm really interested in at the moment. Let me know if there's anything I can do to help. Thanks!
Edited by vectorman
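
For anyone unfamiliar with what such a conversion involves, here is a rough sketch. It is not the poster's converter. It assumes the layout of the old DMOZ content dump as commonly described: `ExternalPage` elements carrying an `about` attribute with the URL, and `Title`, `Description` and `topic` child elements. Elements are matched by local name so the exact namespace URIs do not matter; the function names are hypothetical.

```python
import csv
import xml.etree.ElementTree as ET


def local(name):
    """Strip any '{namespace}' prefix from an element or attribute name."""
    return name.rsplit("}", 1)[-1]


def iter_external_pages(source):
    """Yield (url, title, description, topic) for each ExternalPage element.

    `source` is a filename or a binary file object. The file is streamed
    with iterparse so a large dump is never held fully in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        url = next((v for k, v in elem.attrib.items() if local(k) == "about"), "")
        fields = {local(child.tag): (child.text or "").strip() for child in elem}
        yield (
            url,
            fields.get("Title", ""),
            fields.get("Description", ""),
            fields.get("topic", ""),
        )
        elem.clear()  # drop the children we have already read


def rdf_to_csv(source, out):
    """Write one CSV row per listing to the text stream `out`; return the row count."""
    writer = csv.writer(out)
    writer.writerow(["url", "title", "description", "topic"])
    count = 0
    for row in iter_external_pages(source):
        writer.writerow(row)
        count += 1
    return count
```

A possible call would be `rdf_to_csv("content.rdf.u8", open("listings.csv", "w", newline="", encoding="utf-8"))`, where both file names are placeholders. Splitting the output into one CSV per category, as the post describes, would need an extra grouping step on the `topic` column.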
