DMOZ "No results" Problems

wallyneal

Member
Joined
Apr 26, 2004
Messages
2
In DMOZ, for entries in the category below, searching either "the entire directory" or the specific category returns "No results found" for a half-dozen sites, even when I copy and paste the web address, and then the DMOZ title, directly into the DMOZ search window.

The category is:
Top: Regional: North America: United States: Arizona: Localities: P: Phoenix: Business and Economy: Real Estate: Residential: Agents

I don't understand how text in the directory that I am looking at on my computer screen, and/or the exact hyperlink behind it, cannot be found.

Please explain what it is that I don't understand here.
 

gimmster

I think I know what you are asking.

The page you are viewing is a static HTML file, not a listing generated dynamically from a database.

Search does not scrape the HTML pages for data; it runs from an index that is generated weekly at best. That index often lags the directory by up to 7 days, and can fall further behind when technical problems interfere with index generation.

The search is currently showing "Search database last updated on: Thu Apr 22 02:05:11 EDT 2004 " at the bottom of the search page. Sites updated within a few days of that date may not be in that index yet (it takes multiple days to compile).
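To make the timing concrete, here is a rough Python sketch of that reasoning. It is not how the ODP search actually works; the function names and the 3-day build window are my own assumptions, chosen only to illustrate "multiple days to compile".

```python
from datetime import datetime, timedelta


def parse_index_stamp(stamp: str) -> datetime:
    """Parse a stamp such as 'Thu Apr 22 02:05:11 EDT 2004'.

    The time zone token is dropped because strptime's %Z does not
    reliably accept abbreviations like EDT on every platform.
    """
    parts = stamp.split()
    without_zone = " ".join(parts[:4] + parts[5:])
    return datetime.strptime(without_zone, "%a %b %d %H:%M:%S %Y")


def may_be_missing(listed_on: datetime, index_stamp: datetime,
                   build_days: int = 3) -> bool:
    """Return True if a listing added on listed_on could be absent from the index.

    build_days is an assumed figure for how long the index takes to compile;
    anything listed after (stamp - build_days) may have been too late.
    """
    return listed_on >= index_stamp - timedelta(days=build_days)


stamp = parse_index_stamp("Thu Apr 22 02:05:11 EDT 2004")
print(may_be_missing(datetime(2004, 4, 21), stamp))  # listed the day before the stamp
print(may_be_missing(datetime(2004, 4, 10), stamp))  # listed well before the stamp
```

A site listed a day before the stamp falls inside the assumed build window, so it may not be searchable yet even though it already shows on the static category page.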

Any help?
<added>
And what xixtas01 said, the last index does not appear to be working properly.
</added>
:tree:
 

giz

Member
Joined
May 26, 2002
Messages
3,112
The indexes are being rebuilt. It was hoped to have a working version as of two days ago, but they still need further work. This could take another week or more to resolve.

I have no idea, but it may be tied up with the recent conversion of the ODP data to UTF-8. There might be an invalid character somewhere in the 1 800 000 000 characters of data, which is causing the, as yet unresolved, problem.
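For anyone curious what hunting for one bad character looks like, this is a minimal Python sketch, assuming the data can be read line by line as raw bytes. It is only an illustration; the function name is made up and it says nothing about the tools staff actually use.

```python
def invalid_utf8_lines(lines):
    """Yield (line_number, byte_offset, bad_bytes) for each line that is not valid UTF-8.

    lines is any iterable of bytes objects, for example a file opened in
    binary mode. Only the first problem on each line is reported.
    """
    for lineno, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            yield lineno, err.start, raw[err.start:err.end]


sample = [
    b"caf\xc3\xa9\n",  # valid UTF-8 ("e" with acute accent as two bytes)
    b"caf\xe9\n",      # the same word in ISO-8859-1: not valid UTF-8
    b"ok\n",
]
for lineno, offset, bad in invalid_utf8_lines(sample):
    print(lineno, offset, bad)
```

Reading line by line keeps memory use small, which matters when the input runs to well over a billion characters.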
 

dinf

Member
Joined
May 3, 2004
Messages
20
Are the indexes still broken?

I hope you don't mind me posting this question in this thread; I'm asking because we have a very similar problem:

Every search term returns a "No results found" message, sometimes followed by some category listings. This does not depend on a particular category or website; it applies to every search we ran (on different computers) on every site (dmoz.org as well as dmoz.de).

So now we can't check whether submitted websites have been reviewed.

Is this caused by the same indexing/database problem?

Thank you for listening.
 

dinf

Member
Joined
May 3, 2004
Messages
20
Err, well... I think that means something like 'yes, dumb guy, the search engine still isn't working, we all know, and we're doing what we can'...?

Thank you for your helpful sig :)
 

bobrat

Member
Joined
Apr 15, 2003
Messages
11,061
No, it means I'm red-faced and embarrassed to tell you that our search engine continues to be dysfunctional.
 

lufiaguy

KEditall/KCatmv
Joined
Feb 7, 2004
Messages
58
I think the search is meant to get updated each time the RDF dump is updated, but don't quote me on that. ;)
 
Joined
May 25, 2004
Messages
16
lufiaguy said:
I think the search is meant to get updated each time the RDF dump is updated, but don't quote me on that. ;)

OK - How often does that happen? It's been a week since you responded, and more than two weeks since my site was added to the ODP. Searching still yields no results.
 

gimmster

It *should* happen weekly, but it appears to be around 5-6 weeks behind at the moment (judging by that category).

Don't stress about it, it's a very minor function as far as we are concerned.

:tree:
 

giz

Member
Joined
May 26, 2002
Messages
3,112
There have been several bugs related to the search for some time, but what really hampered things was that the underlying directory data used a wide variety of character encodings, whilst the RDF file upon which search is built was supposed to be all in UTF-8.

The directory itself used ISO-8859-1 for English Language, and other Western European categories, and a whole variety of different encodings for the rest of the world. When the RDF was built either the data was stored there with a note about its encoding, or it was converted to UTF-8. Some categories didn't have a default encoding defined, and this made the data somewhat flaky for some non-English categories. The RDF file is built about once per week, taking several days to build as a background process, and then it is posted to http://rdf.dmoz.org/rdf/ as 2 files (content and structure).
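As a rough illustration of the conversion step, here is a small Python sketch. It assumes each entry arrives as raw bytes plus the encoding declared for its category, if any; the function name and the choice to raise an error when no encoding is declared are my own, not a description of the real ODP scripts.

```python
from typing import Optional


def to_utf8(raw: bytes, declared_encoding: Optional[str] = None) -> bytes:
    """Return raw re-encoded as UTF-8.

    Data that already decodes as UTF-8 is returned unchanged. Otherwise the
    category's declared encoding is used. With no declared encoding there is
    nothing safe to guess, so the entry is flagged for a human to look at.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass
    if declared_encoding is None:
        raise ValueError("no declared encoding; needs hand editing")
    return raw.decode(declared_encoding).encode("utf-8")


print(to_utf8(b"caf\xe9", "iso-8859-1"))  # ISO-8859-1 input, converted
print(to_utf8(b"caf\xc3\xa9"))            # already UTF-8, left alone
```

The missing-encoding branch is where the data gets flaky: without a default for the category, a script cannot tell which of several legacy encodings the bytes were written in.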

It has taken a good 6 months to convert all of the underlying directory data over to UTF-8, that is: every category path and name, and then every site URL, title and description within. Some could be done using automated scripts, and some had to be done by hand-editing each entry (and we're talking about half a million items here). Some clever editors managed to build some tools which parsed the RDF and then built a screenful of HTML links that took you directly to an edit screen for the error found, with a note of what to look for. Over the last few months the errors have steadily decreased, and the last 4 or so RDF files had just a handful of errors (just 2 in the last one, and 1 in the one before that).

In parallel with that was a project to find out how encoding errors get into the directory data in the first place, and then ways to check the data and remove errors before the data is stored (hence the small number of errors that crept in last week). We think that everything is now converted, and the checking functions are now robust.

The next RDF should have zero errors in it. Having achieved that, it will then be easier for the search to be revisited and its functionality improved, and some bugs to be ironed out; but search is not the highest priority job as far as editors are concerned (same too probably for staff, but I cannot speak for them or their priorities at all). However, whatever happens with search, it is likely to have a timescale of many months, not days or weeks. The ODP consists of a complex set of software tools, and work goes on on all of them.
 