Does anyone have any kind of performance information on recovering
reasonably large file domains, with a reasonably large number of files,
using NSR?
We recently had a KZPSC trash an AdvFS partition for us (and yes, it does
still cause a kernel panic under 4.0b). The NSR recover literally took
longer than doing the job twice with tar.
During the restoration "festivities" I was fortunate enough to have spare
disks, so we stuck them around the place.
The hosed system was a 2100A 5/300 running 4.0b and patch kit 6, with a
KZPSC (32meg write-back cache) and a RAID 5 array of 6 RZ29B-VW drives for
the file domain in
question. A RAID 0 array of 5 RZ28-VW drives was created on this
controller for the restoration work. The AdvFS fragment problems were in
the "home" fileset. The restoration consisted of using Vdump to dump the
"mail" and "tmp" filsets to the "newhome" domain. Then using Tar to
transfer individual user directories (scripted) one at a time to the new
domain, so that when we hit a problem area (6 out of roughly 6100 userid
-root directories) we could pickup with the next one. After everything was
tarred over, we then used Vdump to transfer things back to the newly
reinitialized and formatted "home_dmn" domain. This entire "heroic
recovery" effort (including all of the AdvFS panics every time we hit a bad
fragment) took roughly 7 hours. Nominally 12 gig and some 250,000 plus files.
df -k
Filesystem      1024-blocks     Used  Available  Capacity  Mounted on
home_dmn#home      16756736  4174764    8511568       33%  /home
home_dmn#mail      16756736  3757788    8511568       31%  /var/spool/mail
home_dmn#tmp       16756736   136112    8511568        2%  /var/spool/mailtm
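For the curious, the over-and-back shuffle amounted to roughly the
following. This is a sketch only - the /newhome mount point, the
failed-user list, and the exact flags are reconstructed from memory,
so treat them as assumptions rather than the script we actually ran:

    #!/bin/sh
    # Dump the intact filesets straight across with vdump/vrestore.
    vdump -0 -f - /var/spool/mail | vrestore -x -f - -D /newhome/mail

    # Copy /home one user directory at a time, so a bad AdvFS fragment
    # takes out only one tar; after the panic/reboot we noted the userid
    # and picked up with the next one.
    cd /home
    for u in *; do
        tar cf - "$u" | ( cd /newhome/home && tar xpf - ) \
            || echo "$u" >> /var/tmp/failed.users
    done

The per-user granularity is the whole point: one tar per userid means
one bad fragment costs you one directory, not the whole copy.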
The NSR server is a 1000A 4/266 running 4.0d and patch kit 1 with NSR 4.4,
also with a KZPSC and a 32meg write-back cache, in this case for 6 RZ28D-VW
drives in a RAID 0+1 array. The jukebox is an MTI Infinity 1530 - a Breece
Hill (BHTi) Quad 7 - with a pair of DLT 4000 drives, connected on a single
KZPAA.
The NSR server routinely backs up some 25+ Alphas with 2-15 gig per
server. There are no significant "single file" databases; everything is
typical time-sharing-sized stuff, i.e. lots of small files. (However,
we do NOT back up /news/spool!) A normal total dump of all 25 systems takes
approximately 7.5 hours.
While the above tar/vdump action was taking place (actually, it had
started before), we began to restore the "home" fileset to the NSR server
machine, using the "relocate" option.
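For reference, the recover session looked more or less like this,
paraphrased from memory - the server name is invented, and the exact
NSR 4.4 prompt text may differ:

    # recover -s nsr-server /home
    recover> add .
    recover> relocate /restore/home
    recover> recover

The "relocate" command just redirects where the recovered files land, so
the restore could run on the NSR server's own disks without touching the
damaged domain.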
This restoration took 3 tapes - 1 full and 2 incrementals - and 8 hours 11
minutes to restore a single 4.1 gig fileset! Meanwhile, the tar/vdump
operation had taken approximately 7 hours to do the same job and more,
twice: 3 filesets totalling roughly 8 gig were copied over and back!
Soooo... does anyone have any ideas about what is happening here?
I don't believe it is the raw speed difference between the 2100A and the
1000A, so the likelihood of it being an I/O bottleneck is low.
It is clearly not a problem with AdvFS file creation - the faster job
created twice as many files and had the RAID 5 array to contend with.
To me it appears to be strictly an NSR throughput issue. Legato is
optimized to write fast to tape. It does this "buffer amalgamation" thing -
it combines records from all running clients (we run with 8 parallel) into
a single buffer and writes that out to tape as one block of data. My guess
is that this is very efficient when writing to tape, but incredibly
inefficient when restoring, since a restore wants only one client's records
out of that interleaved stream.
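As a toy model of that interleaving (pure illustration - nothing here is
NSR internals):

    #!/bin/sh
    # Build a fake multiplexed "tape": records from 8 parallel clients
    # combined round-robin into a single stream, as on the write path.
    for rec in 1 2 3 4 5; do
        for cl in 1 2 3 4 5 6 7 8; do
            echo "client$cl record$rec"
        done
    done > /tmp/tape

    # Restoring one client still means reading the entire stream; with
    # 8-way parallelism, 7/8ths of what comes off the tape is discarded.
    grep "^client3 " /tmp/tape

On the write side every block is useful payload and the drive streams; on
the read side only one record in eight is wanted, so the same drive would
deliver a fraction of its write throughput - which would fit the numbers
above.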
T.T.F.N.
William H. Magill
Senior Systems Administrator
Information Services and Computing (ISC), University of Pennsylvania
Internet: magill_at_isc.upenn.edu  magill_at_acm.org  magill_at_upenn.edu
http://pobox.upenn.edu/~magill/