IZh Posted February 28, 2006 Posted February 28, 2006 Hello! It's me again. ;-) Here is my review what you can encounter in <mediadate> tag in content.rdf.u8 ODP dump file. As I understand, the desired content for that tag is date in the format "YYYY.MM.DD". First of all is separators. The desired separator is dot but I have found many dates with '-', '/', ',', '_' and ' ' characters. I grab with sed programm all of contents of <mediadate> tags from dump and use simple PHP function to check if a date is bad: function is_good_date ($d) { if (!ereg ("^([0-9]{4})[-./,_ ]([0-9]{1,2})[-./,_ ]([0-9]{1,2})$", $d, $regs) || $regs[1] > 2006 || $regs[1] < 1000 || !checkdate ($regs[2], $regs[3], $regs[1])) return false; return true; } First, it matches a date against regular expression and later check if a date is valid. Than I have sorted all bad dates and remove duplicates (many equal dates both bad and good can be found in the dump). My task was to import all data correctly to MySQL database. So, by term "bad" I mean dates which cannot be recognized by MySQL (not in "YYYY.MM.DD" where separator characters can be any). And I want to store date in field of DATE type and not as text string. ;-) I list only bad dates below (without external pages itself) because in other case this message will be veeery looong. So, there are completely bad dates which cannot be imported: June, 1998; yyyy.mm.dd; 200.01.28 ; 200.05.01 ; 200.07.05 ; 1997.06.ne; 1999.02.29; 1999.07.99; 2001.00.01; 2001.02.ry; 2001.06.31; 2001.09.31; 2002/03/62; 2003.0.04 ; 19995.09.0; 20001.02.0; 200310.05 Also I have found dates in bad format ("DD.MM.YYYY", "YYYY.DD.MM", "DD.MM.YY") which cannot be imported directly without converting by hands: 27/05/1999; 2000.20.05; 02.05.2001; 04.02.2002; 2002.22.02; 23.03.03 ; 11/04/2003; 25.04.2003; 22/06/03 ; 24.08.2003; 2003.24.12; 2004.27.05; 4.6.2004 ; 2004.19.07; 08-12-04 ; 16/01/2005; 01/11/2005; 19/11/2005 And who knows how many dates stored in dump in "YYYY.DD.MM" format with day number less than 13 that they cannot be distinguished from "YYYY.MM.DD". MySQL allows to store dates with day or day and month set to zero (I call that dates incomplete) to indicate that day or both day and month are not known. MySQL depents on settings will or will not consider as error inserting incomlete dates to tables. So, here is the list ("good" incompletes): 1856.02.00; 1856.06.00; 1894.11.00; 1896.09.00; 1897.03.00; 1898.02.00; 1907.00.00; 1910.12.00; 1911.08.00; 1937.06.00; 1949.06.00; 1949.12.00; 1951.11.00; 1958.09.00; 1959.06.00; 1959.12.00; 1960.01.00; 1960.03.00; 1960.05.00; 1960.08.00; 1960.11.00; 1960.12.00; 1961.01.00; 1961.02.00; 1961.03.00; 1961.04.00; 1961.05.00; 1961.07.00; 1961.09.00; 1961.10.00; 1961.11.00; 1962.02.00; 1962.03.00; 1964.03.00; 1964.04.00; 1964.12.00; 1965.05.00; 1965.09.00; 1965.11.00; 1966.03.00; 1966.07.00; 1967.01.00; 1967.11.00; 1968.06.00; 1968.11.00; 1969.09.00; 1969.12.00; 1970.09.00; 1971.01.00; 1971.04.00; 1971.07.00; 1973.03.00; 1973.06.00; 1974.00.00; 1974.04.00; 1975.01.00; 1976.00.00; 1976.01.00; 1977.03.00; 1977.09.00; 1977.12.00; 1978.02.00; 1978.04.00; 1978.05.00; 1978.09.00; 1980.00.00; 1980.08.00; 1980.10.00; 1981.04.00; 1981.06.00; 1981.07.00; 1981.08.00; 1981.09.00; 1981.12.00; 1982.00.00; 1982.03.00; 1982.06.00; 1982.09.00; 1982.10.00; 1983.00.00; 1983.04.00; 1983.05.00; 1983.06.00; 1983.07.00; 1983.09.00; 1983.11.00; 1984.00.00; 1984.01.00; 1984.03.00; 1984.04.00; 1984.05.00; 1984.07.00; 1984.08.00; 1984.09.00; 1984.10.00; 1984.11.00; 1985.00.00; 1985.01.00; 1985.02.00; 1985.03.00; 1985.04.00; 1985.05.00; 1985.06.00; 1985.07.00; 1985.09.00; 1985.10.00; 1985.11.00; 1985.12.00; 1986.00.00; 1986.01.00; 1986.02.00; 1986.03.00; 1986.04.00; 1986.05.00; 1986.06.00; 1986.07.00; 1986.10.00; 1986.11.00; 1987.00.00; 1987.01.00; 1987.04.00; 1987.05.00; 1987.06.00; 1987.07.00; 1987.08.00; 1987.09.00; 1987.10.00; 1987.11.00; 1987.12.00; 1988.00.00; 1988.01.00; 1988.02.00; 1988.04.00; 1988.05.00; 1988.06.00; 1988.07.00; 1988.08.00; 1988.09.00; 1988.10.00; 1988.11.00; 1988.12.00; 1989.00.00; 1989.01.00; 1989.02.00; 1989.03.00; 1989.05.00; 1989.06.00; 1989.07.00; 1989.08.00; 1989.09.00; 1989.10.00; 1989.11.00; 1989.12.00; 1990.00.00; 1990.01.00; 1990.02.00; 1990.03.00; 1990.04.00; 1990.05.00; 1990.06.00; 1990.09.00; 1990.10.00; 1990.12.00; 1991.00.00; 1991.01.00; 1991.02.00; 1991.03.00; 1991.04.00; 1991.05.00; 1991.06.00; 1991.08.00; 1991.09.00; 1991.10.00; 1991.11.00; 1991.12.00; 1992.0.0 ; 1992.00.00; 1992.01.00; 1992.02.00; 1992.03.00; 1992.06.00; 1992.07.00; 1992.08.00; 1992.09.00; 1992.10.00; 1992.11.00; 1992.12.00; 1993.00.00; 1993.01.00; 1993.02.00; 1993.03.00; 1993.04.00; 1993.05.00; 1993.06.00; 1993.09.00; 1993.10.00; 1993.11.00; 1993.12.00 1994.00.00; 1994.01.00; 1994.03.00; 1994.05.00; 1994.06.00; 1994.07.00; 1994.08.00; 1994.09.00; 1994.10.00; 1994.11.00; 1994.12.00; 1995.00.00; 1995.01.00; 1995.02.00; 1995.03.00; 1995.04.00; 1995.05.00; 1995.06.00; 1995.07.00; 1995.08.00; 1995.09.00; 1995.10.00; 1995.11.00; 1995.12.00; 1996.00.00; 1996.01.00; 1996.02.00; 1996.03.00; 1996.04.00; 1996.05.00; 1996.06.00; 1996.07.00; 1996.08.00; 1996.09.00; 1996.10.00; 1996.11.00; 1996.12.00; 1997.00.00; 1997.01.00; 1997.02.00; 1997.03.00; 1997.04.00; 1997.05.00; 1997.06.00; 1997.07.00; 1997.08.00; 1997.09.00; 1997.10.0 ; 1997.10.00; 1997.11.00; 1997.12.00; 1998.00.00; 1998.01.00; 1998.02.00; 1998.03.00; 1998.04.00; 1998.05.00; 1998.06.00; 1998.07.00; 1998.08.00; 1998.09.00; 1998.10.00; 1998.11.00; 1998.12.00; 1999-09-00; 1999.00.00; 1999.00.04; 1999.01.00; 1999.02.00; 1999.03.00; 1999.04.00; 1999.05.00; 1999.06.00; 1999.07.00; 1999.08.00; 1999.09.00; 1999.10.00; 1999.11.00; 1999.12.00; 2000.00.00; 2000/09/00; 2000.01.00; 2000.02.00; 2000.03.00; 2000.04.00; 2000.05.00; 2000.06.00; 2000.07.00; 2000.08.00; 2000.09.00; 2000.10.00; 2000.11.00; 2000.12.00; 2001.00.00. 2001/07/00. 2001.01.00; 2001.02.00; 2001.03.00; 2001.04.00; 2001.05.00; 2001.06.00; 2001.07.00; 2001.08.00; 2001.09.00; 2001.10.00; 2001.11.00; 2001.12.00; 2002.00.00; 2002/00/00; 2002.01.00; 2002.02.00; 2002.03.00; 2002.04.00; 2002.05.00 2002.06.00; 2002.07.00; 2002.08.00; 2002.09.00; 2002.10.00; 2002.11.00; 2002.12.00; 2003.00.00; 2003.01.00; 2003.02.00; 2003.03.00; 2003.04.00; 2003.05.00; 2003.06.00; 2003.07.00; 2003.08.00; 2003.09.00; 2003.10.00; 2003.11.00; 2003.12.00; 2004.00.00; 2004.01.00; 2004.02.00; 2004.03.00; 2004.04.00; 2004.05.00; 2004.06.00; 2004.07.00; 2004.08.00; 2004.09.00; 2004.10.00; 2004.11.00; 2004.12.00; 2005-09-00; 2005.00.00; 2005.01.00; 2005.02.00; 2005.03.00; 2005.04.00; 2005.05.00; 2005.06.00; 2005.07.00; 2005.08.00; 2005.09.00; 2005.10.00; 2005.11.00; 2005.12.00; 2006.00.00; 2006.02.00 But there are many incomplete dates which cannot be recognized by MySQL because of missing day number or both day and month. Here is the list ("bad" incompletes): 1971-02 ; 1980.06 ; 1983.09. ; 1984.05...; 1987.09 ; 1987.12 ; 1988.01...; 1990.08...; 1991......; 1992.01...; 1992.06 ; 1993.07...; 1994......; 1994.05...; 1995 ; 1995.04 ; 1995..10.3; 1996 ; 1996......; 1995.05 ; 1995.10...; 1996.04 ; 1996.05 ; 1996.10 ; 1996.10...; 1996.11 ; 1997 ; 1997.07. ; 1997.08...; 1997.12 ; 1997.12...; 1998 ; 1998-01 ; 1998. ; 1998......; 1998.01 ; 1998.01..1; 1998.03 ; 1998.04. ; 1998.04...; 1998.05 ; 1998.06 ; 1998.10...; 1998.12 ; 1999 ; 1999. ; 1999.01...; 1999.02 ; 1999.03 ; 1999.04 ; 1999.07 ; 1999.07. ; 1999.07...; 1999.08 ; 1999.08...; 1999.09 ; 1999.10 ; 1999.12 ; 2000-06 ; 2000......; 2000.03 ; 2000.04 ; 2000.05 ; 2000.06...; 2000.08 ; 2000.10 ; 2000.10...; 2000.11 ; 2000.11...; 2000.12 ; 2001 ; 2001- ; 2001......; 2001.01...; 2001.02...; 2001.03 ; 2001.03. ; 2001.04 ; 2001.04...; 2001.04.??; 2001.05...; 2001.07...; 2001.08 ; 2001.08...; 2001.09...; 2001.10...; 2001.11 ; 2001.11...; 2001.12...; 2002......; 2002.01...; 2002.02...; 2002.06. ; 2002.06...; 2002.08 ; 2002.10 ; 2002.11 ; 2002.11...; 2002.12 ; 2002.12...; 2003 ; 2003-03 ; 2003. ; 2003......; 2003.01...; 2003..08.2; 2003.02 ; 2003.04 ; 2003.08 ; 2003.09...; 2004......; 2005 ; 2005.04 Also I have seen many dates in the past (starting from 1795.12.31). I cannot check whether it is incorrect date or the date of some event in the past. I will not list them here. Next, I have found dates in the future. It may be errors or dates of future events (but I don't know whether you store information about events or not). And here is the list (near future with incompletes): 2006.05.00; 2006.07.00; 2006.08.00; 2007.00.00; 2007.01.01; 2007.01.02; 2007.04.11; 2007.05.01; 2007.05.28; 2007.06.15; 2007.06.30; 2007.07.05; 2007.07.07; 2007.07.25; 2007.07.28; 2007.08.31; 2007.09.22; 2007.12.30; 2008.06.01; 2008.08.29; 2008.11.00; 2009.00.00; 2009.08.01; 2010.09.01 Later I have found dates in far future. I'm not sure will I be alive at that times. :-) Do you have a time machine? ;-) So, this is the list (far future with incomplete): 2201.11.30; 2204.01.16; 2991.00.00 And the last, I have found some dates in "YYYYMMDD" format without separators. MySQL can handle them correctly but they are not in common format. So I list them here: 19990121; 20010919; 20011201; 20030225; 20030227; 20050306; 20050325; 20050408; 20050508; 20050604; 20050912; 20051213 The total number of external pages with dates which MySQL cannot understand is 18070. Thanks for your attention. :-) P.S. Maybe you need at least validate new values of the form? ;-)
sfromis Posted March 1, 2006 Posted March 1, 2006 Well, the "good" incompletes are not a problem. Thats how we represent dates. Future dates are also perfectly allowed, like when a site is indeed about a conference to be held on a specific date. Dates far back into the past is also expectable, like when referring old articles, papers or events. Worrying about YYYY.DD.MM is a fake problem, as that is never a valid date format. Of course, listings may have a bad date, but thats a more general QC problem. I'd not be surprised if some of the more unusual formatted dates suddently gets fixed :-)
giz Posted March 3, 2006 Posted March 3, 2006 That is a fantastic piece of research, and some editors that are Quality Control minded may very well use that list for fixing the problems in the coming weeks and months.
sfromis Posted March 3, 2006 Posted March 3, 2006 Well, the listing itself cannot be used for QC work, it simply exhibits the issue. I've made an extract which is being worked on In a couple of weeks, the RDF dump should have fewer anomalous dates.
giz Posted March 3, 2006 Posted March 3, 2006 A note about dates like: 1988.12.00 and 1989.00.00 I believe that those are "allowed" in ODP data and point to a single specific year, or to a specific year and month.
IZh Posted March 3, 2006 Author Posted March 3, 2006 And since we want "good incompletes" to represent dates with unknown day or both month and day, so here is the new version for checking if a date is good (if it helps you): function is_good_date ($d) { if (!ereg ("^([0-9]{4})[-./,_ ]([0-9]{1,2})[-./,_ ]([0-9]{1,2})$", $d, $regs) || $regs[1] < 1000 || $regs[1] > 2099 || ($regs[2] == 0 && $regs[3] != 0)) return false; return checkdate ( ($regs[2] == 0 ? 1 : $regs[2]), $regs[3], ($regs[1] == 0 ? 1 : $regs[1])); } We consider date to be good if it is in format YEAR<sep>MONTH<sep>DAY where YEAR is a four-digit number, MONTH and DAY is one- or two-digit numbers. <sep> - is a separator: one character form a set '-', '.', '/', '_', ' '. Year must be within the range of [1000; 2099]. Day and month can be zero to represent unknown values but there cannot be situation where we know day but doesn't know month. And number of days for a given month is checked by PHP function checkdate ($month, $day, $year).
giz Posted March 3, 2006 Posted March 3, 2006 A further refinement or two: The year can be any 4 digit number, always with leading zeroes. Both the Month and Day values must be exactly 2 digits long. The Month value must not be greater than 12 (and can be zero, but only if Day value is zero). The Day value must not be greater than 31 (and can be zero) and the max value, as you note, is Month dependant. The separator is supposed to be a dot in the ODP internal data, even though the official date standards now point towards the hyphen for this.
IZh Posted March 3, 2006 Author Posted March 3, 2006 This function firstly designed for separate totally bad dates. :-) Of course YYYY.MM.DD is preferable format, but if pages with completely bad dates is several thousands, then I'm not sure your QC can fix even more dates with just ommitted leading zeros. :-) I think that fixing dates should be done in two stages. First is to fix completely bad dates and second reformat all dates to fit in YYYY.MM.DD format (parse it and print in correct format - because of absense of bad dates, it will be possible).
giz Posted March 3, 2006 Posted March 3, 2006 People are already writing tools to find all variants that are incorrect, and making changes to correct the data. When the directory changed from using different character encodings per category, to using UTF-8 in all categories, I recall that the automated conversion failed to convert all entries due a large number that had incorrect character codes. So, many tens of thousands of edits were done using partly automated tools, and the final thousands were done by hand by a small group of editors that wanted to see the problem entirely fixed, directory wide. Within weeks, the number of errors was down from many tens of thousands to just several dozen; and the odd one or two problems re-appeared every few weeks and were hand-corrected, and the source of that error also found and amended. After a few months, there were no errors, and all the sources of errors (tools that didn't check their encoding or made wrong conversions) had all been fixed too. I suspect that a similar thing will happen in correcting the mediadate information directory-wide, and all tools that interact with it, in the coming weeks and months too.
sfromis Posted March 5, 2006 Posted March 5, 2006 There's not much point in discussing a piece of code to check the dates; the program I've written works quite differently anyway FTR, the count of bad dates is about one thousand, and many are already fixed. Thanks for alerting us about this QC issue
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now