SUMMARY: Disk performance problems, DU 4.0B

From: Anne Bennett <anne_at_alcor.concordia.ca>
Date: Wed, 21 May 97 14:20:59 -0400

I wrote concerning performance problems with the I/O subsystem of an
AlphaServer under DU 4.0B. In short, when I transferred about 3GB of
data in about one million files between two RZ29B-VA disks on separate
controllers using "tar cf - . | (cd blah; tar xfp -)", I found the
transfer rate to be abysmal. I got the following suggestions.
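
An aside for the record: the same pipeline can be run with an explicit,
larger blocking factor, which can only help the raw data transfer, not
the per-file metadata overhead discussed below. The "b 64" and the
destination path here are illustrative, not what I actually ran:

    # 64 x 512-byte blocks per read/write on the archive stream
    tar cbf 64 - . | (cd /newdisk; tar xpbf 64 -)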

alan_at_nabeth.cxo.dec.com pointed out:

        Each file created has to have a directory entry written
        for it (synchronously), each newly allocated inode or
        equivalent has to be written synchronously, all new data
        extensions require synchronous writes, etc. This is
        probably a significant contribution to the problem,
        especially if the files are small, since the overhead
        work will dominate the copy. [...]

        As for the switch from Time to Space, I've noticed that
        UFS is very touchy here. I think the logic used to make
        the decision is broken, but I've never been able to make
        proper sense of it.
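
(An aside from me: the TIME/SPACE optimization preference can be
inspected and forced per file system. A sketch, with a hypothetical
raw device name; see dumpfs(8) and tunefs(8), and note that tunefs
should be run on an unmounted file system:)

    # Show the current optimization preference, among other things:
    dumpfs /dev/rrz10c | grep -i optim
    # Force the preference to TIME:
    tunefs -o time /dev/rrz10c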

And after I asked a follow-up question, he added:

        For UFS all file system metadata writes are synchronous.
        I'm not sure what the order is, but I'd guess the inode
        allocation is first, since it allows the file to look
        allocated if unattached. Then the directory entry will
        probably be written. If a file is extended you have the
        added complication of (maybe) updating the inode with new
        block addresses, and when the file spills over into
        indirect blocks, the indirect block itself.
 
        Data writes follow their own rules, but aligned file
        system block size writes are close to synchronous;
        async. but scheduled as soon as the data is in the
        cache. Short and unaligned writes will be purely
        async unless you mount the file system so that all
        writes are synchronous.
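
One can see the metadata effect in isolation by timing the creation of
many small files against a single large file of roughly the same total
size; a rough sketch (paths hypothetical, old Bourne shell syntax):

    # 1000 small files: dominated by the per-file synchronous
    # metadata writes (inode, directory entry)
    time sh -c 'i=0; while [ $i -lt 1000 ]; do
        cp /etc/motd f$i; i=`expr $i + 1`; done'

    # Roughly the same amount of data as one large file: the
    # metadata cost is paid only once
    time dd if=/dev/zero of=bigfile bs=8k count=100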

Knut Hellebø <Knut.Hellebo_at_nho.hydro.com> suggested upgrading the
drive firmware on the RZ29Bs to at least 0016; he has experienced
spinup problems and disks going offline for no obvious reason on
earlier firmware revisions, and has received reports of other such
problems on those revisions.
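
(For anyone wanting to check their own drives: the firmware revision
appears in the SCSI inquiry data, which on Digital UNIX can be read
with scu(8). Something along these lines, with a hypothetical device
name; the exact field label may differ on your system:)

    # The revision field of the inquiry data gives the microcode rev.
    scu -f /dev/rrz10c show inquiry | grep -i revision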


I did end up running some tests on the disks while the system was
down. To make a long story short, the initial transfer (from the
bad disk to the replacement disk) took 15 hours for 2841865 Kbytes in
1048022 files (inodes). A couple of test transfers from the
replacement disk to a borrowed disk (whose firmware revision was,
interestingly, 0016) showed slightly better performance, and no
appreciable difference based on the setting of delay_wbuffers. Note
that the tests lasted only an hour each, and the system did not have a
chance to flip the optimization between TIME and SPACE; during the
first hour of the long transfer, I was getting close to 300 MB/hour.
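
(For the record, the delay_wbuffers toggling was of this general form;
the subsystem name here is my assumption, so check sysconfig(8) and
"sysconfig -s" for where the attribute actually lives on your system:)

    # Query the current vfs attribute values:
    sysconfig -q vfs
    # Set delay_wbuffers at run time (assuming a vfs attribute):
    sysconfig -r vfs delay_wbuffers=1
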
A summary of the stats:

Initial transfer:
   189 MB/hour = 52 KB/sec
   69868 files/hour = 19 files/sec = 51ms/file
   iostat avg/max, reading disk: 144/357 1KB-blocks/sec, 39/70 xfers/sec
   iostat avg/max, writing disk: 520/826 1KB-blocks/sec, 102/130 xfers/sec

Test 1 (no delay_wbuffers):
   280 MB/hour = 77 KB/sec
   104743 files/hour = 29 files/sec = 34 ms/file
   (no iostat data available)

Test 2 (delay_wbuffers):
   280 MB/hour = 77 KB/sec
   104176 files/hour = 28 files/sec = 35 ms/file
   iostat avg/max, reading disk: 83/431 1KB-blocks/sec, 29/67 xfers/sec
   iostat avg/max, writing disk: 468/840 1KB-blocks/sec, 91/111 xfers/sec
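
(For completeness: avg/max figures like the above can be reduced from
periodic iostat samples with something like this; the disk names and
the column number are hypothetical and depend on your iostat layout:)

    iostat rz10 rz11 5 > iostat.log &    # one sample every 5 seconds
    # Average and maximum of, say, column 3 (skipping two header lines):
    awk 'NR > 2 { s += $3; n++; if ($3 > m) m = $3 }
         END { printf "avg %.0f max %.0f\n", s/n, m }' iostat.log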


I did some reading of Unix internals books, and also took note of the
explanations above sent by alan_at_nabeth.cxo.dec.com. Assume (for the
sake of argument) that we can get roughly 1/4 of the rated SCSI bus
throughput; call it 256 KB/sec. At that rate, each second's worth of
the 52 KB/sec we're getting *should* take only 200ms, leaving 800ms
for the 19 files unaccounted for, or 42 ms/file. With
a seek time of 9ms for the RZ29B, that's 4.6 seeks per file. Why so
many? Well, if indeed all metadata updates are synchronous, what
happens is:

   1. Look for the filename in the namei cache (with luck the
      directory is in the cache).
   2. Assign a free inode (with luck we get it from the incore
      free list).
   3. Synchronously write the new file inode (SEEK).
   4. Update and synchronously write the directory data (SEEK).
   5. Repeat the following on average three times per file, since
      file sizes average just over 2.5KB and therefore require
      three 1024-byte frags:
      a. Get a free block (with luck from an incore list).
      b. Update and synchronously (?) write the file inode (SEEK?).
      c. Write the data block asynchronously (but it probably gets
         scheduled immediately).

That seems to be at least two synchronous seeks per file, and probably
at least one more since the data block writes get scheduled pretty
fast, so we can probably count on at least three seeks per file
creation. If we look at "test 1", 77 KB/sec should take about 300ms,
leaving 700ms for 29 files, or 24ms/file, or 2.6 seeks/file. (Of
course, we can ask whether my 1/4 of the SCSI bus estimate is any
good, but these figures should be in the right ballpark nevertheless.)
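
(The arithmetic itself, for anyone who wants to re-check it with bc,
using the initial-transfer numbers:)

    # ms left per second after the data transfer, shared by 19 files:
    echo 'scale=1; (1000 - (52 / 256) * 1000) / 19' | bc    # 42.1
    # at 9ms per seek:
    echo 'scale=1; 42 / 9' | bc                             # 4.6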

I think there's not a lot to be done to improve the situation, except
perhaps to get PrestoServe (an NVRAM write accelerator) to eliminate
the long waits for the synchronous metadata writes.

Thanks to all who responded, both the folks mentioned above, and at
least one more person who shared performance woe stories with me.


Anne.
-- 
Ms. Anne Bennett, Computing Services, Concordia University, Montreal H3G 1M8
anne_at_alcor.concordia.ca                                       (514) 848-7606