SUMMARY: undetected data corruption reading 600Mb files

From: <emanuele.lombardi_at_casaccia.enea.it>
Date: Wed, 31 Jan 2001 14:19:01 +0100 (CET)

Dear friends,
I have good news for all of you (but not for me).
The problems are related to the HSZ80 on which the data is kept. I forgot
to mention the HSZ80 in my original mail (which is included after my signature).

Following Francini's and Roetman's suggestions, I ran the script on a
file system not belonging to the HSZ80, but on the system disk cage.
At last I got good (and stable) results!!!
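In essence, the isolation test just copies a file and compares checksums of the original and the copy on each file system in turn. A minimal Bourne shell sketch (the mount points named in the comments are placeholders, not my real paths):

```shell
#!/bin/sh
# Sketch of the isolation test: copy a file within one directory and
# compare the checksums of original and copy.  Run it once against a
# directory on the suspect controller and once against a directory on
# a locally attached disk; only the bad subsystem should report
# corruption.
check_dir() {
    dir=$1
    cp "$dir/a" "$dir/b"
    # cksum on stdin prints only "checksum size", no filename,
    # so the two strings are directly comparable
    orig=`cksum < "$dir/a"`
    copy=`cksum < "$dir/b"`
    if [ "$orig" = "$copy" ]; then
        echo "$dir OK"
    else
        echo "$dir CORRUPTED: $orig != $copy"
    fi
}
# e.g.:  check_dir /fs_on_hsz80
#        check_dir /fs_on_system_disk
```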

So the problem occurs only for data under the HSZ80. I have 2 raidsets
there and the errors occur on both. This is not good news for me,
since 99% of my data belongs to the HSZ80, and especially because the
controller (not dual) does not seem to report any errors.

I have now opened a call with Compaq regarding the HSZ80.
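As an aside for anyone repeating this test: the cksum log produced by the script (CHECK_nogz.log in my original mail below) can be summarized mechanically. Counting the distinct checksum values seen per file name shows the corruption at a glance; with a healthy disk subsystem every file should show exactly one. A minimal sketch in Bourne shell and awk:

```shell
#!/bin/sh
# Count distinct cksum values per file name in a log whose relevant
# lines look like "checksum size name" (the grep output shown in the
# original mail).  More than one distinct value for the same name
# means a corrupted copy.
count_cksums() {
    awk '{ key = $3 SUBSEP $1
           if (!(key in seen)) { seen[key] = 1; n[$3]++ } }
         END { for (f in n)
                   printf "%s: %d distinct checksum(s)\n", f, n[f] }' "$1" |
    sort
}
# e.g.:  grep 624672000 CHECK_nogz.log | grep -v system > /tmp/cklog
#        count_cksums /tmp/cklog
```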

Special thanks to:
        "O'Brien, Pat" <pobrien_at_mitidata.com>
        "John J. Francini" <francini_at_zk3.dec.com>
        "Joerg Lehners" <Joerg.Lehners_at_Informatik.Uni-Oldenburg.DE>
        "Davis, Alan" <Davis_at_tessco.com>
        John Venier <venier_at_odin.mdacc.tmc.edu>
        "Roetman, Paul" <PRoetman_at_CSXLines.com>
        Hallstein Lohre <hallstein.lohre_at_alphasystem.no>
        Claudio Tantignone <C_Tantignone_at_sondaarg.com.ar>
        "Schau, Brian" <Brian.Schau_at_compaq.com>


-- 
$$$ Emanuele Lombardi
$$$ mail:  AMB-GEM-CLIM ENEA Casaccia
$$$        I-00060 S.M. di Galeria (RM)  ITALY
$$$ mailto:emanuele.lombardi_at_casaccia.enea.it
$$$ tel	+39 06 30483366 fax	+39 06 30483591
$$$
$$$                                |||
$$$                                \|/  ;_;
$$$ What does a process need        |   /"\
$$$ to become a daemon ?            |   \v/
$$$                                 |    | 
$$$ - a fork                        o---/!\---
$$$                                 |   |_|
$$$                                 |  _/ \_
$$$* Contrary to popular belief, UNIX is user friendly.
$$$  It's just very particular about who it makes friends with.
$$$* Computers are not intelligent, but they think they are. 
$$$* True programmers never die, they just branch to an odd address
$$$* THIS TRANSMISSION WAS MADE POSSIBLE BY 100% RECYCLED ELECTRONS
 -----Original Message-----
> From: emanuele.lombardi_at_casaccia.enea.it
> [mailto:emanuele.lombardi_at_casaccia.enea.it]
> Sent: Tuesday, January 30, 2001 4:44 PM
> To: tru64-unix-managers_at_ornl.gov
> Cc: emanuele.lombardi_at_casaccia.enea.it
> Subject: undetected data corruption reading 600Mb files
> 
> 
> Hardware:	ES40 6/500 4CPUS 3Gb RAM (it happened with 4GB as well)
> Firmware:	5.8	
> Software:	T64 Unix 5.1 2nd patch applied
> 		WEBES   V3.1 Build 12 09/28/2000 SP 1 Build 4 1 Dec 2000
> File System:	Advfs version 4 
> Problem:	managing large data files (600Mb), data is changed
> 		without any notice to the user		
> 
> 
> Dear friends,
> 
> This was supposed to be the summary of my mail with the subject
> "gzip & gunzip not always returning original data", but I prefer to
> open a new thread since it proved to be a different (and worse)
> matter.
> 
> The problem is that, when managing large data files (600Mb), data is
> changed without any notice to the user.
> A user of mine discovered the problem gzipping/gunzipping his large
> data file: gunzip sometimes returned strange errors, while at other
> times (not always) the gunzipped data was different from the original
> data.
> 
> At the beginning, soon after the "gzip & gunzip not always returning
> original data" mail, I suspected a memory error detected by CA to be
> the cause of the problem. Unfortunately, the memory cards have been
> replaced, CA doesn't see any hardware problem, and I still have
> strange undetected data corruption (even without gzip/gunzip).
> 
> I have to thank very much our doctor, Tom Blinn, for his very fast and
> useful help. Following his suggestion, I found out that the problem
> was NOT in gzip/gunzip, since I get undetected data corruption even
> with the following few lines of code. In it I repeatedly copy an
> input file (../a) into files b and c, and then I check for differences
> among the 3 files using "diff" and "cksum". Well, it happens that
> those differences sometimes really occur, and there is no noticeable
> warning or error message.
> 
> #!/bin/csh
> unset verbose
> set echo
> echo pwd=`pwd`
> uname -a
> unlimit
> limit
> set n=0
> set echo
> loop:
>     @ n++
>     echo "==================================================== begin loop $n"
>     echo start loop n=$n at `date`
>     ls -ls ../a
>     cksum ../a
>     cp ../a b
>     cksum ../a b
>     cp b c
>     cksum ../a b c
>     ls -ls b c
>     diff ../a b >/dev/null || echo "ERROR 1: FILES ../a and b DIFFER at loop $n"
>     cksum ../a b c
>     diff b c >/dev/null || echo "ERROR 2: FILES b and c DIFFER at loop $n"
>     cksum ../a b c
>     diff c b >/dev/null || echo "ERROR 3: FILES c and b DIFFER at loop $n"
>     cksum ../a b c
>     diff ../a c >/dev/null || echo "ERROR 4: FILES ../a and c DIFFER at loop $n"
>     cksum ../a b c
>     diff c ../a >/dev/null || echo "ERROR 5: FILES c and ../a DIFFER at loop $n"
>     cksum ../a b c
>     diff ../a b >/dev/null || echo "ERROR 6: FILES ../a and b DIFFER at loop $n"
>     cksum ../a b c
>     diff b ../a >/dev/null || echo "ERROR 7: FILES b and ../a DIFFER at loop $n"
>     cksum ../a b c
>     diff b c >/dev/null || echo "ERROR 8: FILES b and c DIFFER at loop $n"
>     cksum ../a b c
>     echo end loop n=$n at `date`
>     echo "==================================================== end   loop $n"
>     goto loop
> 
> I ran the above script using the file ../a, which has the following
> attributes:
> 	ls -ls ../a
> 	610032 -rw-r--r--   1 root     system   624672000 Jan 18 17:34 ../a
> 	cksum ../a
> 	2785050943 624672000 ../a
> 
> 
> While I'm writing, the script is running in the background, and here
> are the results obtained so far:
> 
> loop	ERROR1	ERROR2 	ERROR3 	ERROR4 	ERROR5 	ERROR6 	ERROR7 	ERROR8
> 1	no	no	no	no	no	no	no	no
> 2	no	no	no	no	no	no	no	no	
> 3	no	no 	no	no	no	no	no	no
> 4	YES	YES	YES	no	no	YES	YES	YES
> 5	no	no	YES	no	no	YES	YES	YES
> 6	no	YES	YES	no	no	YES	YES	YES
> 7 	no	no	no	no	no	no	no	no	
> ....
> ....
> 
> Of course, when some ERRORx occurs (that is, when some diffs are
> found), the cksum values of the files are not what is expected
> (2785050943, as for file ../a).
>  
> Now I kill the background job and edit the script, eliminating all
> the "diff" commands. The script now contains only the following
> commands: cp, ls, and cksum.
> 
> The results are ugly! The checksum of a given file often changes
> within the same loop: the sizes are always the same, but the contents
> of the files vary!!
> To prove my words, I submitted the script in the background, placing
> stdout in a log file. Look at the following, which shows the
> resulting cksums (which should all be the same):
> 
> grep 624672000 CHECK_nogz.log | grep -v system | sort -u 
> 
> 1680138362 624672000 b
> 2046682359 624672000 c
> 2095653778 624672000 b
> 218351582 624672000 b
> 2371670479 624672000 c
> 2785050943 624672000 ../a
> 2785050943 624672000 b
> 2785050943 624672000 c
> 2992181696 624672000 b
> 3216358513 624672000 b
> 3442014270 624672000 c
> 
> 
> What else to say ?
> Please, help me!
> 
> Thanks to everybody,
> Emanuele
> 
Received on Wed Jan 31 2001 - 13:21:02 NZDT

This archive was generated by hypermail 2.4.0 : Wed Nov 08 2023 - 11:53:41 NZDT