I wrote concerning performance problems with the I/O subsystem of an
AlphaServer under DU 4.0B. In short, when I transferred about 3GB of
data in about one million files between two RZ29B-VA disks on separate
controllers using "tar cf - . | (cd blah; tar xfp -)", I found the
transfer rate to be abysmal. I got the following suggestions.
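For concreteness, the copy amounts to the following (with hypothetical
mount points /olddisk and /newdisk standing in for the real ones, and
"time" added to get an overall elapsed figure):

    cd /olddisk
    time tar cf - . | (cd /newdisk; tar xfp -)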
alan_at_nabeth.cxo.dec.com pointed out:
Each file created has to have a directory entry written
for it (synchronously), each newly allocated inode or
equivalent has to be written synchronously, all new data
extensions require synchronous writes, etc. This is
probably a significant contribution to the problem,
especially if the files are small, since the overhead
work will dominate the copy. [...]
As for the switch from Time to Space, I've noticed that
UFS is very touchy here. I think the logic used to make
the decision is broken, but I've never been able to make
proper sense of it.
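On the Time/Space flip-flopping: rather than letting UFS make that
decision, the preference can apparently be pinned with tunefs(8). I
haven't verified this on 4.0B, and the device name below is invented,
but it should look roughly like:

    # see what the superblock currently says, then force "time"
    dumpfs /dev/rrz10c | grep -i optim
    tunefs -o time /dev/rrz10c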
And after I asked a follow-up question, he added:
For UFS all file system metadata writes are synchronous.
I'm not sure what the order is, but I'd guess the inode
allocation is first, since it allows the file to look
allocated if unattached. Then the directory entry will
probably be written. If a file is extended you have the
added complication of (maybe) updating the inode with new
block addresses and, when the file spills over into
indirect blocks, the indirect block.
Data writes follow their own rules, but aligned file
system block size writes are close to synchronous;
async. but scheduled as soon as the data is in the
cache. Short and unaligned writes will be purely
asynchronous unless you mount the file system so that
all writes are synchronous.
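His point about aligned versus short/unaligned writes can be seen with
a couple of throwaway dd runs while watching the target disk with
iostat in another window; the paths and the 8KB block size below are my
assumptions, not anything he specified:

    # whole-block writes should start hitting the disk almost at once;
    # the odd-sized 3KB writes mostly sit in the buffer cache until
    # the next sync.  Watch with e.g. "iostat rz10 5" elsewhere.
    dd if=/dev/zero of=/newdisk/ddtest.aligned bs=8k count=512
    dd if=/dev/zero of=/newdisk/ddtest.odd bs=3k count=512
    rm /newdisk/ddtest.aligned /newdisk/ddtest.odd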
Knut Hellebø <Knut.Hellebo_at_nho.hydro.com> suggested upgrading the
drive firmware on the RZ29Bs to at least 0016; he has experienced
spinup problems and disks going offline for no obvious reason on
earlier firmware revisions, and has received reports of other such
problems on those revisions.
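He didn't say how to check which revision a drive is at; I believe the
SCSI CAM utility scu(8) will show it as part of the inquiry data,
something like this (the device name is made up, and I'm quoting the
syntax from memory):

    # the inquiry data includes vendor, product, and firmware revision
    scu -f /dev/rrz10c show inquiry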
I did end up running some tests on the disks while the system was
down. To make a long story short, the initial transfer (from the
bad disk to the replacement disk) took 15 hours for 2841865 Kbytes in
1048022 files (inodes). A couple of test transfers from the
replacement disk to a borrowed disk (whose firmware revision was,
interestingly, 0016) showed slightly better performance, and no
appreciable difference based on the setting of delay_wbuffers. Note
that the tests lasted only an hour each, and the system did not have a
chance to flip the optimization between TIME and SPACE; during the
first hour of the long transfer, I was getting close to 300 MB/hour.
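For anyone who wants to repeat the delay_wbuffers experiment: as I
understand it (so treat the subsystem name as my assumption), it is a
vfs attribute that can be queried and flipped on the fly with
sysconfig:

    sysconfig -q vfs delay_wbuffers     # show the current setting
    sysconfig -r vfs delay_wbuffers=1   # turn delayed write buffers on
    sysconfig -r vfs delay_wbuffers=0   # and back off again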
A summary of the stats:
Initial transfer:
189 MB/hour = 52 KB/sec
69868 files/hour = 19 files/sec = 51ms/file
iostat avg/max, reading disk: 144/357 1KB-blocks/sec, 39/70 xfers/sec
iostat avg/max, writing disk: 520/826 1KB-blocks/sec, 102/130 xfers/sec
Test 1 (no delay_wbuffers):
280 MB/hour = 77 KB/sec
104743 files/hour = 29 files/sec = 34 ms/file
(no iostat data available)
Test 2 (delay_wbuffers):
280 MB/hour = 77 KB/sec
104176 files/hour = 28 files/sec = 35 ms/file
iostat avg/max, reading disk: 83/431 1KB-blocks/sec, 29/67 xfers/sec
iostat avg/max, writing disk: 468/840 1KB-blocks/sec, 91/111 xfers/sec
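(The derived figures are just arithmetic on the raw totals; for the
initial transfer, for example, a quick awk one-liner reproduces them,
give or take rounding:

    # 2841865 KB and 1048022 files in 15 hours
    echo "2841865 1048022 15" | awk '{ secs = $3 * 3600;
        printf "%.1f KB/sec  %.0f files/hour  %.1f ms/file\n",
            $1/secs, $2/$3, secs * 1000 / $2 }'
)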
I did some reading of Unix internals books, and also took note of the
explanations above sent by alan_at_nabeth.cxo.dec.com. Assume we can get
(for the sake of argument) 1/4 of the rated SCSI bus throughput, or
roughly 256 KB/sec. Then, in each second of elapsed time, the 52 KB we
actually move *should* take only about 200ms of bus time, leaving 800ms
unaccounted for; spread over the 19 files created per second, that's
42 ms/file. With a seek time of 9ms for the RZ29B, that's 4.6 seeks
per file. Why so many? Well, if indeed all metadata updates are
synchronous, what happens is:
- Look for the filename in the namei cache (with luck the directory
  is in the cache).
- Assign a free inode (with luck we get it from the in-core free list).
- Synchronously write the new file inode (SEEK).
- Update and synchronously write the directory data (SEEK).
- Repeat the following, on average three times per file, since file
  sizes average just over 2.5KB and therefore need three 1024-byte
  frags:
  - Get a free block (with luck from an in-core list).
  - Update and synchronously (?) write the file inode (SEEK?).
  - Write the data block asynchronously (but it probably gets
    scheduled immediately).
That seems to be at least two synchronous seeks per file, and probably
at least one more since the data block writes get scheduled pretty
fast, so we can probably count on at least three seeks per file
creation. If we look at "test 1", the 77 KB moved per second should
take about 300ms, leaving 700ms for 29 files, or 24ms/file, or 2.6
seeks/file. (Of
course, we can ask whether my 1/4 of the SCSI bus estimate is any
good, but these figures should be in the right ballpark nevertheless.)
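Spelled out (with the same assumed 256 KB/sec effective rate and 9ms
average seek, and rounding a little differently), the per-file
arithmetic is just:

    # 52 KB/sec and 19 files/sec from the initial transfer; substitute
    # 77 and 29 for "test 1"
    echo "52 19" | awk '{ idle_ms = (1 - $1/256) * 1000;
        printf "%.0f ms/file, %.1f seeks/file\n", idle_ms/$2, idle_ms/$2/9 }'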
I think there's not a lot to be done to improve the situation, except
get PrestoServe to eliminate the long waits for the synchronous
metadata writes.
Thanks to all who responded, both the folks mentioned above, and at
least one more person who shared performance woe stories with me.
Anne.
--
Ms. Anne Bennett, Computing Services, Concordia University, Montreal H3G 1M8
anne_at_alcor.concordia.ca (514) 848-7606
Received on Wed May 21 1997 - 20:43:39 NZST