
Relation of ODP to Google, AltaVista, etc



Posted

I have a site (Alamax Consulting) currently listed in a regional directory on dmoz.org as Alamax Consulting Computer Services. When I type a simple phrase like "alamax" into a Google or AltaVista search box, my site doesn't come up.

 

I thought that those search engines utilized ODP/DMOZ directories for their own listings.

 

What is the nature of the relationship between ODP and the popular search engines?

 

Do I need to do something else to get those guys to pick up my listing?

Posted
Your site was listed in October. The data available to external users like Google is a bit older (from the end of September) and could not be updated yet due to technical problems. Staff are working on that, so you, like the rest of us, just have to be patient.
Posted
Thank you. I presume that means that Google, AltaVista and others are all somewhat linked via the ODP.
  • Meta
Posted

Remember that a search engine doesn't have to ask permission either to spider dmoz.org, or to download and parse the RDF. We know what Google is doing, because they publicize it -- good publicity for them, they musta thought. We don't know what AltaVista is doing with ODP data -- if they are doing anything special with it, they must think keeping it secret is a competitive advantage. Either way, their choice.

 

Only DIRECTORIES (e.g. directory.google.com) have to include ODP attribution.
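
For anyone wondering what "download and parse the RDF" might involve in practice, here is a minimal Python sketch. The element names (`ExternalPage`, `Title`, `Description`) and the tiny sample are assumptions made up for illustration; the real dump is a very large gzipped file whose layout may differ in detail, and nothing here reflects how any particular search engine actually does it.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; the real dump's layout is assumed, not confirmed.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<RDF>
  <ExternalPage about="http://www.example.com/">
    <Title>Example Consulting</Title>
    <Description>Computer services.</Description>
  </ExternalPage>
  <ExternalPage about="http://www.example.org/">
    <Title>Another Site</Title>
    <Description>Something else.</Description>
  </ExternalPage>
</RDF>
"""


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}Title' to 'Title'."""
    return tag.rsplit("}", 1)[-1]


def iter_listings(stream):
    """Yield (url, title) pairs one at a time, so a huge dump never sits in memory."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "ExternalPage":
            title = ""
            for child in elem:
                if local_name(child.tag) == "Title":
                    title = child.text or ""
            yield elem.get("about"), title
            elem.clear()  # drop the finished element's children and text


listings = list(iter_listings(io.BytesIO(SAMPLE.encode("utf-8"))))
for url, title in listings:
    print(url, "|", title)
```

For a real file, the same function could be fed `gzip.open(path, "rb")` instead of the in-memory sample. Streaming with `iterparse` is the point of the design: a directory-sized dump is too large to load as a single tree.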

Guest richard123
Posted

I think it may be a bit older than end of September (??) My site was published earlier this year on September 19, 2002.

 

I am hoping for an early xmas present, but I'm not really that hopeful. If they can't fix it in 2 months, what's the chance they can do it in 3? Or 4 even.... Who knows? It may require a major hardware upgrade and that could easily take 6 months or more, I'd imagine.

 

Still... I live in hope :-)

Guest richard123
Posted
I'd rather not say. It's adult oriented. But it appeared the very first time on September 19, 2002 and has been there every day since.
Guest richard123
Posted

Thanks for the link! It looks like the last successful dump was the one before Sep 22, because the one on Sep 22 wasn't complete. Maybe that's when they discovered they had a problem. So... that would have been (??) September 17, 2002. That's the most recent "content.rdf.u8.gz".

 

Another thing is that my site still doesn't show up in "search" after all this time. I suppose that's also on the "to do" list :-)

Posted

Yep, the ODP search engine normally updates around 2 days after the RDF dump has been produced (I believe staff have got the search running off a copy of the dump to try and relieve a bit of pressure on the main server). However, this does mean that when the RDF dump is out of date, search is too.

 

ODP staff members are more than aware of the issue and are working on resolving it as soon as possible (in fact, as I type, there is another attempt to produce the RDF dump going ahead - but it'll be a minimum of 24 hours before we know if the problem has been sorted).

Guest richard123
Posted

This is what I don't get, really... I think the ODP is a great resource, but it's 2 months out of date with no credible signs of being "fixed" anytime soon.

 

Would it not be better to tell people how long it will be before the update happens? I have read in various places that the update is the "highest priority", and that really just gives credence to those criticizing the ODP for being slow to get things done. I mean: if the update, as a "high priority", takes over 2 months (and possibly 3 or 4??), then what hope have we got? I mean, really!

 

(Of course I realise computers can be finicky things, but there are limits as to just how poor time estimates are allowed to be :-) )

  • Meta
Posted

Some facts about ODP you might not know:

 

--> staff programming "team" is one person.

--> RDF dump generation takes about a week if it is performing normally - sometimes even longer if it crashes.

--> A task that takes that long clearly has to be optimized for speed. That means less debug information and so on.

--> DMOZ link database contains almost every kind of foreign character (ever had to handle Latin-script languages and Japanese in one database?), lots of different encodings, and almost any stupid stuff you could imagine.

 

Combine all of these and you will realize that tracking down bugs in RDF generation is a very time-consuming task, especially since the programmer did not write all the software herself and so has to gather knowledge first.
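
To illustrate why mixed encodings make this kind of bug hunt painful, here is a small Python sketch. It is not the ODP's code; the function name and the fallback order are assumptions chosen for the example. It tries UTF-8 first and falls back to Latin-1, and shows that the fallback "succeeds" on Japanese Shift_JIS bytes while silently producing garbage.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (text, name of the encoding that worked)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reached if no listed encoding accepts the bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# The same two words, stored by different (hypothetical) submitters.
as_utf8 = "café".encode("utf-8")
as_latin1 = "café".encode("latin-1")
as_sjis = "日本語".encode("shift_jis")

for raw in (as_utf8, as_latin1, as_sjis):
    text, used = decode_with_fallback(raw)
    print(raw, "->", repr(text), "via", used)
```

The first two inputs come back as the intended word. The third is accepted as Latin-1 without any error, yet the text is wrong, because Latin-1 assigns a character to every possible byte. No exception is raised, so nothing points at the bad record.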

 

We can't tell you how long it will take because we simply do not know when it will be fixed.

 

Every software-related project that grows rapidly, like the ODP, reaches a point where it turns out that the current software has bugs that show only under heavy load and/or under weird circumstances.

Curlie Meta/kMeta Editor windharp

 


Guest richard123
Posted

Thanks for the very informative post. I knew some of the stuff, but not other things (especially about it taking a week to generate an RDF dump).

My impression was that "running an update" took a couple of hours at most *blush*

All the more reason for me to not hold my breath and keep wishing for an update before xmas!

Posted
It's "only" been running for three or four days now. We've got our fingers crossed. Here is an example of the character set nightmare referred to above, although I think it's gotten a bit worse. I've read a lot about transitioning to UTF-8, whatever that is.
  • Meta
Posted

... and now to some very basic information about "Unicode UTF-8" which may sound more familiar to some :-)

 

Unicode is a type of encoding that can handle all (uhhhh... say at least most ;-) ) of those chaotic charsets used around the world, so it would make everything easier for communities like the ODP. If only it weren't a bit more complicated than the simple charsets everybody has used so far. :-)

 

Some further reading:

 

--> Unicode Homepage

--> An example of how it might look (if your browser supports UTF-8): http://www.unicode.org/iuc/iuc10/x-utf8.html

--> Some PDF material about it at http://www.unicode.org/unicode/uni2book/u2.html

Curlie Meta/kMeta Editor windharp

 


  • Meta
Posted

There is a so-called "UTF-8", which I think makes Unicode somehow work with 8-bit bytes. More about this can be found in the links I mentioned above ;-)

 

(And you could check the internal fora for some more information about the ODP and unicode if you like)

Curlie Meta/kMeta Editor windharp

 


  • 1 month later...
Posted
There is UTF-8, which uses 8-bit units (one to four bytes per character), and UTF-16, which uses 16-bit units (one or two per character) and is a closer fit for the so-called "double-byte characters" from East Asia. :-)
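
To make the difference between the two encodings concrete, here is a short Python sketch that counts how many bytes a few characters take in each. The sample characters are my own picks for illustration.

```python
# Characters chosen for illustration: ASCII, accented Latin, Japanese, emoji.
samples = ["A", "é", "日", "😀"]

sizes = {}
for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    # "utf-16-be" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    utf16_len = len(ch.encode("utf-16-be"))
    sizes[ch] = (utf8_len, utf16_len)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} byte(s)  UTF-16: {utf16_len} byte(s)")
```

Plain ASCII stays at one byte in UTF-8, which is why it suits mostly-English data, while the Japanese character takes three bytes in UTF-8 but only two in UTF-16. Neither encoding is fixed-width: characters outside the basic range need four bytes in both.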
  • 4 weeks later...
Guest eqfan7v
Posted

windharp,

Very informative posts.

 

But I still don't understand: ONE week to update a database the size of Dmoz's? When I remember that google processes millions of daily searches, over a much bigger database, and returns *ranked* results in fractions of a second, I think I have reasons to still be surprised, don't you agree? :-/

Posted
But I still don't understand

 

Well, Google was funded by venture capital (I know of one round of $25 Million, I don't know if there were other rounds or not), plus they make oodles of money from search agreements they've signed, AdWords, etc.

 

All that money allows them to buy a fair bit of processing horsepower. I know of 10 or 12 servers accessible to the public, and I don't doubt they have many times that for internal data massaging.

 

ODP has some very nice hardware, but nothing like that.

Posted
Indeed, the entire ODP runs on a mere 4 machines, if my memory serves me correctly. And at least one of those was only added in the past 2 weeks or so.
Posted
And Google doesn't produce their index in seconds. They produce it in at least days, perhaps weeks. We don't know and they aren't telling, but their spider crawls over my own web site about once a month. It's the searches within that index that the user sees, and those are indeed blazingly fast.
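
The split between slow index building and fast searching can be sketched with a toy inverted index in Python. This is a generic textbook technique, not a description of what Google or the ODP actually run; the function names and the three sample documents are made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Slow part: read every document once and record which documents contain each word."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Fast part: look up each query word and intersect the matching document sets."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical documents standing in for crawled pages.
docs = {
    1: "computer consulting services",
    2: "regional directory of computer shops",
    3: "consulting for small business",
}
index = build_index(docs)
print(search(index, "computer consulting"))
```

Building the index touches every word of every document, so its cost grows with the size of the whole collection. A search only touches the entries for the words in the query, which is why it can stay fast even when rebuilding the index takes days.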
Posted
But I still don't understand: ONE week to update a database the size of Dmoz's? When I remember that google processes millions of daily searches, over a much bigger database, and returns *ranked* results in fractions of a second, [...]

 

Google takes a long time to generate a complete update, too. They update their index towards the end of the month, based on data gathered during the "deep crawl" at the beginning of the month. On my server, the "deep crawler" showed up during 3-12 January. The actual update began around 26 January, so it took about two weeks to generate the new index.
