cpollett Posted August 28, 2020
I also would be happy with the old format as I have tools written for it. I have a copy of the March 2017 dump of dmoz that I still use, but it is getting stale.
arjenpdevries Posted October 11, 2020
I second the request for this data to be made available; quite a lot of scientific research in computing uses "the ODP dataset", and it would be great if we didn't have to take it from the Internet Archive any more but could instead use an up-to-date version that you maintain. That would be very helpful. The exact format is not important, and leaving it as it has been is not a bad idea.
RZ Admin Elper Posted November 10, 2020
All I can really do is continue to prod the other people involved in the decision-making and technical processes... We are rather resource-limited, so any ideas for economically creating and then hosting such large files would be welcome; either start a new thread here, or send a PM... TIA
arjenpdevries Posted November 10, 2020
I would be happy to organize a mirror of these files for download at my institute (Radboud University), if you need that support. Using academictorrents could also be an attractive route; we could even combine the two approaches, so that there is some level of guarantee that the dataset remains available.
tdfunk Posted December 4, 2020
I'm throwing my hat in the ring to offer assistance. If human resource constraints are an issue, I'm willing to assist as needed. I've been periodically returning here (even to DMOZ, before this) for quite some time, looking for downloads of the RDF data (or other formats, say, JSON-LD). I'm a long-time polyglot developer. What can I do to help? This directory is too valuable to keep "locked up" in HTML. It's a treasure trove of curated data. The RDF data (and linked sites) could serve as labeled training data for any number of ML projects.
Meta informator Posted December 5, 2020
Thanks for your offer of assistance! Without being directly involved, I think the technical people handling this issue have the resources they need. Also, we prefer to use our internal editors when developing (although anyone is welcome to apply to become one). I think there is internal support for making the data available again; it's just a matter of priorities now...
RZ Admin Elper Posted December 6, 2020
If you'd like to see what Curlie is like from the inside, by all means sign up as an editor for a category that interests you. You'll then have access to our internal fora, and maybe you can add a few interesting sites to the directory at the same time... (If you had a Dmoz or Curlie editor account that has expired, it can be reactivated.)
jmcc Posted January 5, 2021
The old RDF might be a bit risky to use, as some of the sites may have been deleted and/or re-registered since the last known good version. Link rot was a problem with the old RDF, and Dmoz didn't scale well. The only way of dealing with it effectively is by tracking domain names. That's doable for the gTLDs, but many ccTLDs do not publish their zones. Regards...jmcc
Meta informator Posted January 6, 2021
Yes, any new data dump would be made of fresh data from the current directory. We have automatic tools that audit listings, but we also believe that a human touch is very important in spotting changed/dead URLs.
jmcc Posted January 6, 2021
> We have automatic tools that audit listings, but we also believe that a human touch is very important in spotting changed/dead URLs.
I took a quick look at the last known good version of the RDFs and the URLs. The gTLD domain names were still the main group. From what I remember of Dmoz, it used a crawler to check domain names. That's a very time-consuming way of doing it. There are some indicators for changed and dead URLs that can be checked more efficiently. Regards...jmcc
RZ Admin Elper Posted January 22, 2021
The last version of the RDF dates back several years now (and is used on dmoztools dot net), or isn't that what you meant?
> There are some indicators for changed and dead URLs that can be checked more efficiently.
We are all ears for increased efficiency.
jmcc Posted January 22, 2021
Yes, that's the RDF I was using. Basically, I have a database that tracks domain name transactions across the legacy and new gTLDs and some ccTLDs, and I periodically build IP address maps of all the sites with active domain names. Regards...jmcc
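For illustration only, here is a minimal sketch of the kind of cheap dead-link indicators being discussed: a DNS lookup first, then an HTTP HEAD request, so that only the URLs failing these quick checks need a slower full crawl or human review. The input file, user-agent string and output format are assumptions made for the example, not part of any Curlie or jmcc tooling.

```python
#!/usr/bin/env python3
"""Cheap liveness indicators for listed URLs: DNS resolution first,
then an HTTP HEAD request. Illustrative sketch only; the input file
name and user-agent are assumptions, not real Curlie tooling."""

import csv
import socket
import urllib.request
from urllib.parse import urlparse

def check_url(url, timeout=10):
    host = urlparse(url).hostname
    if not host:
        return "bad-url", ""
    # Indicator 1: does the hostname still resolve at all?
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return "no-dns", ""
    # Indicator 2: does the server answer a HEAD request for the page?
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "link-audit-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"http-{resp.status}", ip
    except Exception as exc:
        return f"error-{type(exc).__name__}", ip

if __name__ == "__main__":
    # Assumed input: one URL per line, e.g. extracted from an old RDF dump.
    with open("urls.txt") as fh, open("audit.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["url", "status", "ip"])
        for line in fh:
            url = line.strip()
            if url:
                writer.writerow([url, *check_url(url)])
```

A DNS failure is a strong signal that a listing is dead, and a changed IP or an unexpected redirect can flag a possibly re-registered domain; anything that still answers with a 200 would go on to the human review mentioned above.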
CurlieRocks Posted January 17, 2022
Hey, do you have any update? The data would be very useful! Thanks
jmcc Posted January 21, 2022
There doesn't seem to have been any movement on this for about a year. Regards...jmcc
sha Posted January 27, 2022
The RDF dump page on Curlie is showing "Curlie's data (RDF) is currently unavailable due to technical issues." Any information on when it will be available for download? TIA
Meta informator Posted January 28, 2022
There are discussions about how to make our data available again. Unfortunately it takes time, and no timetable is available.
HilbertIdeals5 Posted February 3, 2022
Thought I'd drop in a friendly reminder as well. I want to use the Curlie database for a web search project aimed at an old-computers niche, so I'm anxiously awaiting the return of the RDF dumps!
Jeyendran Balakrishnan (Author) Posted February 16, 2022
To all the editors at Curlie: I understand these things take time. I'm just wondering whether four years might be adequate time to make a decision about this minor technical matter; I posted my original question on this topic in 2018. The data dump is of course not necessary, merely helpful. It would take me a few days to write a crawler to crawl Curlie and generate the same data myself, but it would make much more sense for Curlie to produce a formal data dump that would benefit all users.
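For what it's worth, a hedged sketch of what such a crawl might look like, in Python with requests and BeautifulSoup. The start path, the link-filtering rules and the delay are guesses about Curlie's markup and policies rather than documented structure, and a real crawler should also honour robots.txt.

```python
#!/usr/bin/env python3
"""Tiny breadth-first crawl of a directory-style site. The link-filtering
rules below are assumptions about the markup, not Curlie's documented
structure; inspect the HTML and adjust before relying on the output."""

import time
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://curlie.org"
START = "/Computers/"   # assumed starting category
DELAY = 2.0             # polite delay between requests, in seconds

def crawl(start_path, max_pages=50):
    seen, queue, listings = set(), deque([start_path]), []
    while queue and len(seen) < max_pages:
        path = queue.popleft()
        if path in seen:
            continue
        seen.add(path)
        resp = requests.get(urljoin(BASE, path), timeout=15,
                            headers={"User-Agent": "directory-crawl-sketch"})
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            href = a["href"]
            if href.startswith(start_path):   # assumed: subcategory links share the path prefix
                queue.append(href)
            elif href.startswith("http"):     # assumed: listed sites are absolute external URLs
                listings.append((path, href, a.get_text(strip=True)))
        time.sleep(DELAY)
    return listings

if __name__ == "__main__":
    for category, url, title in crawl(START):
        print(f"{category}\t{url}\t{title}")
```

Even then, a crawl like this only recovers URLs and anchor text; the descriptions and the full category tree are exactly what a formal dump would provide.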
Meta informator Posted February 16, 2022
Better late than never... keep watching this thread.
Jeyendran Balakrishnan (Author) Posted February 18, 2022
I am. But it seems to be a race as to who will outlive the other: me, or this unresolved issue. My bet is on the latter...
thehelper Posted February 22, 2022
RDF Dump! Yes! Sometimes you have to just do it!
vectorman Posted June 27, 2022
Hey there. Just wondering what the status is for the RDF dumps? I actually wrote a DMOZ RDF converter several years ago that gets the data categorized into CSV files, so the RDF dumps are all I'm really interested in at the moment. Let me know if there's anything I can do to help. Thanks!
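In the same spirit, a minimal sketch of that kind of RDF-to-CSV conversion against the old content.rdf.u8 dump. It streams the XML and matches elements by their local names so the exact namespace URIs don't matter; the ExternalPage / Title / Description / topic fields are those of the old DMOZ dumps and are only assumed to be unchanged in any future dump.

```python
#!/usr/bin/env python3
"""Stream the old DMOZ/Curlie content RDF dump and write one CSV row per
listing (topic, url, title, description). Sketch only: field names follow
the old dumps and are assumed unchanged for any future dump."""

import csv
import xml.etree.ElementTree as ET

def local(name):
    """Strip the '{namespace}' prefix ElementTree puts on tag/attribute names."""
    return name.rsplit("}", 1)[-1]

def convert(rdf_path, csv_path):
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["topic", "url", "title", "description"])
        # iterparse keeps memory manageable even on multi-gigabyte dumps
        for _, elem in ET.iterparse(rdf_path, events=("end",)):
            if local(elem.tag) != "ExternalPage":
                continue
            url = next((v for k, v in elem.attrib.items()
                        if local(k) == "about"), "")
            fields = {"topic": "", "Title": "", "Description": ""}
            for child in elem:
                name = local(child.tag)
                if name in fields:
                    fields[name] = (child.text or "").strip()
            writer.writerow([fields["topic"], url,
                             fields["Title"], fields["Description"]])
            elem.clear()  # free the element we just processed

if __name__ == "__main__":
    convert("content.rdf.u8", "listings.csv")
```

The historical dumps occasionally contained malformed characters, so a production converter may need a more forgiving parser or a pre-cleaning pass.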
Meta informator Posted June 27, 2022
We are working on it. It is not uncomplicated, though.