As you may recall, I had an AlphaServer 4100 with 4 gigs of RAM running
Digital UNIX 4.0D, with six 1 gig swap partitions and one 3 gig swap
partition, all on one SCSI channel:
> rz19b: 0.9 gig
> rz18b: 1 gig
> rz17b: 1 gig
> rz20b: 1 gig
> rz21b: 1 gig
> rz22b: 1 gig
> rz22h: 3 gig
Thanks go out to the following folks for providing much help:
Sudhir Rao <sudhir-rao_at_worldnet.att.net>
Partin.Kevin <KPartin_at_hou.mdc.com>
alan_at_nabeth.cxo.dec.com
Reginald Beardsley <esci_at_shell.fastlane.net>
Grant Schoep <grant_at_storm.com>
"Warren, John H." <JHWARREN_at_ESCOCORP.com>
Gavin Kreuiter <gavink_at_ust.co.za>
georg.tasman_at_db.com
K.McManus_at_greenwich.ac.uk
Jim Belonis <belonis_at_dirac.phys.washington.edu>
Burch Seymour RTPS <bseymour_at_ns.encore.com>
"Dr. Tom Blinn" <tpb_at_doctor.zk3.dec.com>
In summary, most people suggested using 'swapon -s' to actually monitor
swap usage, particularly when the system is heavily loaded. I checked
this out for a day and found that a single large job could take up to 2
gigs of memory, and with 4 CPUs running 4 such jobs, using up to 8 gigs
of memory is entirely possible. I also deduced on my own that the rz22h
3 gig swap partition was probably added after the former system
administrator ran out of swap space at some point.
Many people suggested that systems should have 2-3 times as much swap as
RAM, and many also suggested that on today's large-RAM systems (e.g. our
4 gig system) this rule of thumb might be simple-minded. It appears that
the simple rule of thumb worked in this case. Most people who suggested
2-3 times core also said that empirical results from 'swapon -s' or
something similar should be trusted more than guesstimates.
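To turn those 'swapon -s' snapshots into a single usage number, something
like the awk pipeline below works. The sample output in the here-document
is a stand-in from memory (the exact format on Digital UNIX may differ);
on the real box you would replace the here-document with `/sbin/swapon -s`:

```shell
# Summarize swap usage across all partitions from swapon -s style output.
# The sample output below is an assumption standing in for the real command.
swap_output() {
    cat <<'EOF'
Swap partition /dev/rz19b (default swap):
    Allocated space:       117760 pages (920MB)
    In-use space:           58880 pages ( 50%)
Swap partition /dev/rz18b:
    Allocated space:       131072 pages (1024MB)
    In-use space:           32768 pages ( 25%)
EOF
}

# Sum the allocated and in-use page counts and report overall utilization.
swap_output | awk '
    /Allocated space:/ { alloc += $3 }
    /In-use space:/    { used  += $3 }
    END { printf "total: %d pages, in use: %d pages (%d%%)\n",
                 alloc, used, int(100 * used / alloc) }'
```

Run under load, a one-liner like this is what tells you whether the 2-3x
rule of thumb is actually needed on your workload.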
Also, I should have mentioned that the system is using 'eager' rather
than 'lazy' swap allocation. To turn on 'eager' mode, all that is
required is a symlink at /sbin/swapdefault pointing to one of the swap
partitions (e.g. /dev/rz19b). 'Eager' mode means that swap space is
reserved as memory is allocated, so you need more swap than you would
with 'lazy' mode -- the problem with 'lazy' mode, however, is that the
system will kill idle jobs when it runs out of swap space. I chose to
stick with 'eager' mode since it seemed the more stable choice if you
expect you might run out of swap. Many people pointed out the issues
surrounding lazy vs. eager swapping.
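The setup really is just one symlink. Here's a sketch using a scratch
directory as a stand-in root, since the real thing needs root on the
Tru64 box (there the invocation would simply be
`ln -s /dev/rz19b /sbin/swapdefault`):

```shell
# Stand-in demonstration: create the /sbin/swapdefault symlink under a
# scratch root rather than the live one. On the real system, drop $FAKEROOT.
FAKEROOT=$(mktemp -d)
mkdir -p "$FAKEROOT/sbin" "$FAKEROOT/dev"
: > "$FAKEROOT/dev/rz19b"          # stand-in file for the swap device node
ln -s "$FAKEROOT/dev/rz19b" "$FAKEROOT/sbin/swapdefault"

# The system checks for this link: present => eager, absent => lazy.
if [ -L "$FAKEROOT/sbin/swapdefault" ]; then
    echo "eager swap allocation selected"
fi
```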
Probably the most quotable person who responded was Dr. Tom Blinn
<tpb_at_doctor.zk3.dec.com>:
=========================================================
Only you can determine the optimal swap allocation. Here are some clues:
Digital UNIX *will* use multiple swap partitions, and if more than one is
available on the system and there is free space in more than one, it will
try to balance swap allocation among the available partitions using a
"round robin" allocation algorithm. As long as the partitions are on
different disks (and ideally on different SCSI busses and ideally on
different PCI busses), you will get better performance, because this will
allow more simultaneous I/O activity in most situations. I note you have
two different swap partitions of different sizes on the rz22 disk; this
may defeat the logic of the algorithm if in fact you manage to fill swap
very full, and in any case, as long as both of those partitions are being
actively used, you're going to have the problem of moving the heads
between the partitions (even if they are contiguous) when both are used,
which will happen a lot of the time. If I could recommend just one thing,
I would recommend retiring one of the two rz22 swap partitions, and I
would comment that the algorithm works best when all the swap partitions
are of about the same size, if in fact you sometimes fill up most of the
swap to the point where only one partition still has free space.
There might be some benefit in moving the swap around to spread it across
multiple SCSI busses (on different PCI busses where that is possible), but
in many cases that's not a big win.
Two things provide big performance wins: 1) more physical memory (so that
you don't have to page things in and out a lot), and 2) improving the
locality of reference in applications, so that as much as possible, once
data is present in memory, it will stay in memory, and related data is
likely to be in the same physical page. (Note that memory pages on Alpha
systems are presently 8K in size; some systems have far smaller page
sizes.)
Given the above, I can only remark that the "swapon" utility is your
friend; you can easily write a shell script that will run the swapon
utility from time to time, using the -s option (statistics), and monitor
the actual use of swap on the system. If eager allocation is in
use, and you are seeing 80% or more peak allocation, then you probably do
need all the available swap; if you are using "lazy" allocation, then you
can monitor the actual swap used to get a sense of what's needed. Either
way, you have to monitor the system under load to tell what you need for
your workload. If in fact you don't need the 3 GB swap on rz22, for
example, you can just disable that partition in your /etc/fstab and
reboot, and you'll have eliminated that swap; there is no way to disable a
swap partition on a running system. (But you can add swap on the fly if
you have an unused partition to put it in.)
You can also use vmstat to monitor the paging rates; page outs are
probably less meaningful than pageins. But if you're paging a lot, you
need to try to balance the swap as much as possible to optimize I/O
activity; if you are not paging a lot, but are using eager allocation,
then you might just need to have a lot of available space.
Many people recommend 3 times physical memory as the aggregate size of the
swap, but that is just an old rule of thumb, and you can't really tell
what you need until you monitor the system under load.
Have fun.. And you are welcome to quote me for a summary.
Tom
=============================================================
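A periodic logger along the lines Tom suggests might look like the sketch
below. The `swapon` stub function stands in for the real /sbin/swapon so
the loop structure is clear; on the actual system you would delete the
stub, and the log path is just a hypothetical choice:

```shell
# Periodically log swap statistics so peak usage under load can be reviewed.
LOG=${LOG:-$(mktemp)}        # hypothetical log location; pick your own
INTERVAL=${INTERVAL:-0}      # seconds between samples (e.g. 300 in practice)
SAMPLES=${SAMPLES:-3}        # how many samples to take

# Stub standing in for the real /sbin/swapon -- remove on the Tru64 box.
swapon() { echo "In-use space: 58880 pages ( 50%)"; }

i=0
while [ "$i" -lt "$SAMPLES" ]; do
    { date; swapon -s; } >> "$LOG"   # timestamp each snapshot
    sleep "$INTERVAL"
    i=$((i + 1))
done
echo "logged $SAMPLES samples to $LOG"
```

Reviewing the peaks in that log after a day of real load is what tells
you whether you're near the 80% allocation mark Tom mentions.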
Also, here's a comment from Reginald Beardsley <esci_at_shell.fastlane.net>:
=============================================================
"The installation notes for 4.0E say that multiple swap partitions will be
faster. I've not tested DU, but SunOS would actually run slower if they
weren't all the same speed and size. If you're in a position to do some
testing, write a simple program that makes multiple passes over a really
large array. In your case you may need to run several copies of the
program so that the total vm required exceeds 4 GB.
[...]
It really doesn't take long to test the swapping performance. My recollection
is that I found things flattened out as I added drives. The purpose of
using multiple swaps is to allow the controllers to overlap seeks. At a
certain point the controller becomes the bottleneck instead of the disk arms.
I seem to recall reading something recently indicating that 4 drives per
controller was the limit for disk limited performance."
==============================================================
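The access pattern Beardsley describes is easy to sketch. The real test
should be a C program with an array sized well past physical memory,
several copies running in parallel; as a small-scale stand-in for the
pattern itself, here's an awk version with deliberately tiny token sizes:

```shell
# Make several sequential passes over a large array, the access pattern
# used to exercise swap. On the real system the array would exceed RAM
# and multiple copies would run at once; these sizes are token values.
PASSES=3 ELEMS=100000 awk 'BEGIN {
    n = ENVIRON["ELEMS"]
    for (i = 0; i < n; i++) a[i] = i            # touch every element once
    for (p = 0; p < ENVIRON["PASSES"]; p++) {   # repeated passes force
        sum = 0                                 # pages back in when swapped
        for (i = 0; i < n; i++) sum += a[i]
    }
    printf "passes=%d sum=%.0f\n", ENVIRON["PASSES"], sum
}'
```

Timing runs of this while varying the number of swap partitions is the
kind of test that would settle the one-bus vs. two-bus question.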
So, here's what I finally wound up doing. After reclaiming the other
SCSI bus, I also got another 4 gig drive that was sitting unused in the
rack (bonus there). I moved 2 of the other 4 gig drives to the other
SCSI bus and set things up so that I had 3 1.5 gig swap partitions on one
bus and 3 1.5 gig swap partitions on the other. I also split it so that
the root partition and /usr/local are on one bus, while /usr is on the
other. This leaves me with a lot of 2.5 gig partitions lying around, but
hopefully we'll get the license for the POLYCENTER AdvFS Utilities
package and be able to 'addvol' the partitions together into a
decent-sized 10 gig file domain.
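If that license comes through, gluing the leftovers together would look
something like the sketch below. The partition names are made up, addvol
needs the licensed utilities, and the whole thing is guarded so it runs
harmlessly anywhere other than the Tru64 box:

```shell
# Hypothetical: combine leftover 'c' partitions into one AdvFS file domain.
# Partition names here are examples, not the real layout.
DOMAIN=local_domain
if command -v addvol >/dev/null 2>&1; then
    mkfdmn /dev/rz17c "$DOMAIN"     # create the domain on the first volume
    addvol /dev/rz18c "$DOMAIN"     # then grow it one volume at a time
    addvol /dev/rz20c "$DOMAIN"
else
    echo "AdvFS utilities not present; this only runs on Digital UNIX"
fi
```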
Unfortunately, I don't have the empirical data to tell you whether it is
worth splitting the swap partitions across multiple drives and multiple
buses like this -- but everyone seems to be of the opinion that it is
unlikely to hurt, and that having 9 gigs distributed across multiple
disks and partitions clearly ought to be better than a single 9 gig swap
drive. There is some question, though, as to whether 6 swap partitions
might be overkill (or just diminishing returns). I had to go through and
repartition everything anyway (the root partition initially had only 90
*megs* on it), and it didn't take me any extra time to set up the
multiple swap partitions, so I did it that way.
One thing I should note: in this whole process I learned that tar loses
setuid bits, and therefore shouldn't be used in a 'tar -cvf - | tar -xvf -'
pipe to copy a partition -- instead you need a 'vdump | vrestore' pipe to
copy files between partitions.
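For the record, the copy looks roughly like this. The flags are from
memory, so check vdump(8)/vrestore(8) before trusting them, and the
source/destination paths are just examples; the guard lets the sketch
run harmlessly on a non-Tru64 machine:

```shell
# Copy a filesystem while preserving setuid bits via vdump | vrestore.
SRC=/usr              # filesystem to copy (example)
DST=/mnt/newusr       # mount point of the target partition (example)
if command -v vdump >/dev/null 2>&1; then
    # -0 = level 0 dump (everything), -f - = write/read through the pipe
    vdump -0 -f - "$SRC" | (cd "$DST" && vrestore -x -f -)
else
    echo "vdump/vrestore not present; run this on the Digital UNIX system"
fi
```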
--
Lamont Granquist lamontg_at_raven.genome.washington.edu
Dept. of Molecular Biotechnology (206)616-5735 fax: (206)685-7344
Box 352145 / University of Washington / Seattle, WA 98195
PGP pubkey: finger lamontg_at_raven.genome.washington.edu | pgp -fka
Received on Tue Jan 19 1999 - 20:25:45 NZDT