Hello Managers !
Thanks to Selden E. Ball and Alan_at_nabeth.cxo.dec.com, the problem with the
stuck ES40 (see
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2000/08/msg00243.html)
was solved quickly. There was indeed a broken disk, but the event viewer
in the SMS tool did not show the related events. After issuing 'dia -i disk',
several thousand SCSI errors appeared and the defect was located in a moment.
Since we have a hardware warranty, our field service sent us a new disk early
this morning, and everything is fine again.
Selden pointed out that our swap configuration was as bad as it can be for a
high performance application. I had indeed configured four swap partitions on
two disks as a quick workaround for our staff, who demand 15-20 GB of main
memory for a single job, using any unused partition "as is". Last night I
reconfigured the disks so that each has only one swap partition, and the paging
performance immediately more than doubled! So NEVER put more than one swap
partition on a disk!
Of course one should buy more RAM in such cases, but our budget is exhausted
for this year, so dirty workarounds are needed.
Thanks, Udo.
=============================================
ANSWERS:
Selden
-----------------
Udo,
You wrote
> The logs show the following message several times:
> vmunix: vm_swap I/O error during pageout
You have a disk which is failing.
You must determine which one it is and replace it.
The easiest way is to use the file command:
e.g.
# file /dev/rrz5c
/dev/rrz5c: character special (8/5122) SCSI #0 3391WS disk #40 (SCSI ID #5)
(SCSI LUN #0) errors = 1/0
It shows the error counts as "hard errors/soft errors".
Hard errors were not recoverable: no valid data could be transferred.
Soft errors were recovered: after several retries the data was retrieved.
(We are about to replace the disk described above)
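
To check many disks at once, the "errors = hard/soft" field can be pulled out of the file output with a small script. The following is only a sketch: the sample line is copied from the output above (joined onto one line), and the helper name parse_error_counts is made up for illustration. It does not run the file command itself.

```python
import re

# Sample text copied from the message above (joined onto one line); on a real
# system it would come from running `file` on the raw device.
SAMPLE = (
    "/dev/rrz5c: character special (8/5122) SCSI #0 3391WS disk #40 "
    "(SCSI ID #5) (SCSI LUN #0) errors = 1/0"
)


def parse_error_counts(file_output):
    """Return (hard, soft) error counts, or None if the field is absent."""
    match = re.search(r"errors\s*=\s*(\d+)/(\d+)", file_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


hard, soft = parse_error_counts(SAMPLE)
print(f"hard={hard} soft={soft}")  # hard=1 soft=0
```

Any disk that reports a non-zero hard error count is a candidate for replacement.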
You can also use the command
uerf -R -o full | more
to look at the log entries of the most recent errors
to see what kind of things are not going right.
> Our swapspace is distributed over two disks and four partitions as
Having more than one swap space on a disk is an exercise in futility.
Your system performance will be very, very poor when it tries to swap
on two different partitions on the same disk. The disk heads
will be jumping back and forth between the two partitions
as fast as they can, which is much too slow.
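
A quick way to spot this kind of layout is to group the swap devices by disk and flag any disk that carries more than one swap partition. This is a sketch only: the device names below are hypothetical (the real list is not given in this thread), and the pattern assumes names of the form /dev/rz<unit><partition letter>.

```python
import re
from collections import defaultdict


def swap_partitions_per_disk(devices):
    """Group swap devices by disk, assuming names like /dev/rz8b."""
    by_disk = defaultdict(list)
    for dev in devices:
        match = re.fullmatch(r"(/dev/rz\d+)([a-h])", dev)
        if match is None:
            raise ValueError(f"unrecognized device name: {dev}")
        by_disk[match.group(1)].append(match.group(2))
    return dict(by_disk)


def overloaded_disks(devices):
    """Return the disks that hold more than one swap partition."""
    grouped = swap_partitions_per_disk(devices)
    return sorted(disk for disk, parts in grouped.items() if len(parts) > 1)


# Hypothetical layouts, for illustration only.
four_on_two = ["/dev/rz8b", "/dev/rz8g", "/dev/rz9b", "/dev/rz9h"]
one_per_disk = ["/dev/rz8b", "/dev/rz9b"]

print(overloaded_disks(four_on_two))   # ['/dev/rz8', '/dev/rz9']
print(overloaded_disks(one_per_disk))  # []
```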
-----------------
Alan
-----------------
There isn't much real paging activity (pin - page-in or
pout - page-out) which means your problem could be much,
much worse. Evidently, you have enough physical memory
to actually run the program without much real paging.
While the counts of soft faults and zero page fills
are high, they shouldn't be unexpected. As you said, you
have a 5 GB process. That's on the order of 650,000 pages
that it may want to touch. If the program was coded to
have all of that memory as pre-allocated, but zero filled
data, it will take a lot of page faults and zero fills
to touch all of it.
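
The page count above can be checked with a little arithmetic. This sketch assumes the 8 KB page size used on Alpha systems and reads "5 GB" as 5 * 2^30 bytes; the function name is illustrative.

```python
PAGE_SIZE = 8 * 1024  # assumption: 8 KB pages, as on Alpha


def pages_needed(size_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold size_bytes (rounded up)."""
    return (size_bytes + page_size - 1) // page_size


process_bytes = 5 * 1024**3  # "5 GB" taken as 5 * 2**30 bytes
print(pages_needed(process_bytes))  # 655360
```

Each of those pages costs at least one fault and one zero fill the first time it is touched, which is where the large counters come from.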
I suspect the performance problem is due both to the amount
of work that needs to be done to start this particular
program and to it being done all at once inside the
kernel. If it happened gradually the program would take
longer to start, but would be more friendly.
If you have control over the program, you might want to
look at ways to use less static data and do the memory
initialization as part of the program instead of letting
the kernel tie up the system doing it. If the program
is from a 3rd party, report the behavior to them to see
what they can suggest.
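
One possible reading of this advice is to have the program touch its memory in small batches, with other work in between, rather than all at once. The sketch below only illustrates that idea in Python with an anonymous mapping; the function name and batch size are assumptions, and it says nothing about how the real application allocates its data.

```python
import mmap


def touch_in_batches(buf, page_size=mmap.PAGESIZE, pages_per_batch=64):
    """Write one byte to every page of buf, yielding after each batch.

    Yields the running count of touched pages so the caller can do other
    work (or simply pause) between batches.
    """
    total_pages = (len(buf) + page_size - 1) // page_size
    touched = 0
    while touched < total_pages:
        end = min(touched + pages_per_batch, total_pages)
        for page in range(touched, end):
            buf[page * page_size] = 0  # first write to the page faults it in
        touched = end
        yield touched


# Small anonymous mapping for demonstration; a real job would be far larger.
region = mmap.mmap(-1, 256 * mmap.PAGESIZE)
for done in touch_in_batches(region, pages_per_batch=100):
    print(f"{done} pages initialized")
region.close()
```

The trade-off is the one described above: startup takes longer overall, but the faults are spread out instead of arriving in a single burst.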
You might also want to log a problem report with your
country's Compaq service center. There could be a scheduling
problem or lack of feature to gracefully handle these
sorts of programs. A simple program that demonstrates
the problem would help them a great deal.
--
Dr. Udo Grabowski email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/ Fax: " -6141
Received on Wed Aug 16 2000 - 12:19:37 NZST