undetected data corruption reading 600Mb files

From: <emanuele.lombardi_at_casaccia.enea.it>
Date: Tue, 30 Jan 2001 16:44:23 +0100 (CET)

Hardware: ES40 6/500, 4 CPUs, 3 GB RAM (it also happened with 4 GB)
Firmware: 5.8
Software: Tru64 UNIX 5.1, 2nd patch applied
          WEBES V3.1 Build 12 09/28/2000 SP 1 Build 4 1 Dec 2000
File System: AdvFS version 4
Problem: when handling large data files (600 MB), data is changed
         without any notice to the user


Dear friends,

This was supposed to be the summary of my mail with the subject
"gzip & gunzip not always returning original data", but I prefer to
open a new subject since it turned out to be a different (and worse)
matter.

The problem is that, when handling large data files (600 MB), data is
changed without any notice to the user.
A user of mine discovered the problem while gzipping/gunzipping his large
data file: gunzip sometimes returned strange errors, while at other times
(not always) the gunzipped data was different from the original data.

At the beginning, soon after the "gzip & gunzip not always returning
original data" mail, I suspected that a memory error detected by CA was
the cause of the problem. Unfortunately, the memory cards have been
replaced and CA doesn't see any hardware problem, but I still get strange
undetected data corruption (even without gzip/gunzip).

I have to thank our doctor, Tom Blinn, very much for his very fast and
useful help. Following his suggestion I found out that the problem is
NOT in gzip/gunzip, since I get undetected data corruption even with the
following few lines of code. In it I repeatedly copy an input file
(../a) to file b, copy b to c, and then check for differences among the
3 files using "diff" and "cksum". Well, it turns out that differences
sometimes really do occur, and that there is no noticeable warning or
error message.

#!/bin/csh
# Repeatedly copy ../a to b and b to c, then compare the three files.
unset verbose
set echo
echo pwd=`pwd`
uname -a
unlimit
limit
set n=0
loop:
    @ n++
    echo " ==================================================== begin loop $n"
    echo start loop n=$n at `date`
    ls -ls ../a
    cksum ../a
    cp ../a b
    cksum ../a b
    cp b c
    cksum ../a b c
    ls -ls b c
    diff ../a b >/dev/null || echo ERROR 1: FILES ../a and b DIFFER at loop $n
    cksum ../a b c
    diff b c >/dev/null || echo ERROR 2: FILES b and c DIFFER at loop $n
    cksum ../a b c
    diff c b >/dev/null || echo ERROR 3: FILES c and b DIFFER at loop $n
    cksum ../a b c
    diff ../a c >/dev/null || echo ERROR 4: FILES ../a and c DIFFER at loop $n
    cksum ../a b c
    diff c ../a >/dev/null || echo ERROR 5: FILES c and ../a DIFFER at loop $n
    cksum ../a b c
    diff ../a b >/dev/null || echo ERROR 6: FILES ../a and b DIFFER at loop $n
    cksum ../a b c
    diff b ../a >/dev/null || echo ERROR 7: FILES b and ../a DIFFER at loop $n
    cksum ../a b c
    diff b c >/dev/null || echo ERROR 8: FILES b and c DIFFER at loop $n
    cksum ../a b c
    echo end loop n=$n at `date`
    echo " ==================================================== end loop $n"
    goto loop
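
For readers who want to repeat the same copy-and-verify test without csh,
here is a minimal sketch in Python. It is only an illustration and was not
part of the test run on the ES40. The function names (file_digest,
copy_and_verify) are hypothetical, and SHA-256 is used in place of cksum,
so the values will not match the cksum numbers in this mail.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(source, workdir, loops=3):
    """Copy source -> b -> c in each loop and re-check all three files.

    Returns a list of (loop, name, digest) tuples, one for every file whose
    digest differs from the digest of the source taken before the first loop.
    """
    source = Path(source)
    workdir = Path(workdir)
    expected = file_digest(source)  # assumption: this first read is trusted
    mismatches = []
    for n in range(1, loops + 1):
        b = workdir / "b"
        c = workdir / "c"
        shutil.copyfile(source, b)
        shutil.copyfile(b, c)
        for name, path in (("source", source), ("b", b), ("c", c)):
            digest = file_digest(path)
            if digest != expected:
                mismatches.append((n, name, digest))
    return mismatches
```

On a healthy system copy_and_verify("../a", ".") should return an empty
list; any entry in the result corresponds to one of the ERROR lines of the
csh script.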

I am running the above script on the file ../a, which has the following size and checksum:
        ls -ls ../a
        610032 -rw-r--r-- 1 root system 624672000 Jan 18 17:34 ../a
        cksum ../a
        2785050943 624672000 ../a


While I'm writing, the script is running in the background; here are the
results obtained so far:

loop  ERROR1  ERROR2  ERROR3  ERROR4  ERROR5  ERROR6  ERROR7  ERROR8
1     no      no      no      no      no      no      no      no
2     no      no      no      no      no      no      no      no
3     no      no      no      no      no      no      no      no
4     YES     YES     YES     no      no      YES     YES     YES
5     no      no      YES     no      no      YES     YES     YES
6     no      YES     YES     no      no      YES     YES     YES
7     no      no      no      no      no      no      no      no
....
....

Of course, whenever some ERRORx occurs (that is, diff finds a
difference), the cksum values of the files are not what is expected
(2785050943, as for file ../a).
 
Now I kill the background job and edit the script, removing all the
"diff" commands. The script now contains only the following commands:
cp, ls, and cksum.
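
The reduced test comes down to reading the same unchanged file several
times and comparing the checksums. A minimal Python sketch of that idea
follows. It is an illustration only, not what was run here; the function
names are hypothetical and SHA-256 stands in for cksum.

```python
import hashlib


def repeated_digests(path, times=5, chunk_size=1 << 20):
    """Read the same file `times` times and return one digest per read."""
    digests = []
    for _ in range(times):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # iter() with a sentinel stops at the empty bytes object (EOF)
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        digests.append(h.hexdigest())
    return digests


def reads_are_stable(path, times=5):
    """True if every read of the (unmodified) file gave the same digest."""
    return len(set(repeated_digests(path, times))) == 1
```

If reads_are_stable("b") is False while nothing is writing to b, the reads
themselves are returning different data. Note that repeated reads may be
served from the file cache, so a True result does not prove that the whole
I/O path is healthy.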

The results are ugly! The checksum of a given file often changes within
the same loop: the size is always the same, but the contents of the
files vary!
To prove it, I submitted the script in the background, sending stdout to
a log file. Look at the following, which shows the resulting cksums (which
should all be the same):

grep 624672000 CHECK_nogz.log | grep -v system | sort -u

1680138362 624672000 b
2046682359 624672000 c
2095653778 624672000 b
218351582 624672000 b
2371670479 624672000 c
2785050943 624672000 ../a
2785050943 624672000 b
2785050943 624672000 c
2992181696 624672000 b
3216358513 624672000 b
3442014270 624672000 c
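
The list above can also be summarised per file. The following is a minimal
Python sketch, assuming the log lines have the three-field cksum format
"checksum size name"; the function name distinct_checksums is hypothetical.

```python
from collections import defaultdict


def distinct_checksums(lines, size="624672000"):
    """Map each file name to the set of checksums reported for it.

    Only lines of the form "<checksum> <size> <name>" with the expected
    size are used; other lines (for example "ls -ls" output) are skipped.
    """
    seen = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[1] == size and fields[0].isdigit():
            seen[fields[2]].add(fields[0])
    return dict(seen)
```

Applied to the eleven lines above, it reports one checksum for ../a, six
different checksums for b and four for c, where a healthy system would
report exactly one value (2785050943) for each file.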


What else can I say?
Please help me!

Thanks to everybody,
Emanuele


-- 
$$$ Emanuele Lombardi
$$$ mail:  AMB-GEM-CLIM ENEA Casaccia
$$$        I-00060 S.M. di Galeria (RM)  ITALY
$$$ mailto:emanuele.lombardi_at_casaccia.enea.it
$$$ tel	+39 06 30483366 fax	+39 06 30483591
$$$
$$$                                |||
$$$                                \|/  ;_;
$$$ What does a process need        |   /"\
$$$ to become a daemon ?            |   \v/
$$$                                 |    | 
$$$ - a fork                        o---/!\---
$$$                                 |   |_|
$$$                                 |  _/ \_
$$$* Contrary to popular belief, UNIX is user friendly.
$$$  It's just very particular about who it makes friends with.
$$$* Computers are not intelligent, but they think they are. 
$$$* True programmers never die, they just branch to an odd address
$$$* THIS TRANSMISSION WAS MADE POSSIBLE BY 100% RECYCLED ELECTRONS
Received on Tue Jan 30 2001 - 15:45:47 NZDT