Summary: Virtual memory faults

From: Udo Grabowski <udo.grabowski_at_imk.fzk.de>
Date: Wed, 16 Aug 2000 14:18:32 +0200

Hello Managers !

Thanks to Selden E. Ball and Alan_at_nabeth.cxo.dec.com the problem with the
stuck ES40 (see
http://www.ornl.gov/its/archives/mailing-lists/tru64-unix-managers/2000/08/msg00243.html)
could be solved quickly. Indeed there was a broken disk, but the event viewer
in the SMS tool does not show the related events. After issuing a 'dia -i disk'
several thousand SCSI errors appeared and the defect was located in a moment.
Since we have hardware warranty, our field service has sent us a new disk early
this morning, and everything is fine again.

Selden pointed out that our swap configuration is as bad as it can be for a
high-performance application. Indeed, I had configured the four swap
partitions on two disks as a quick workaround for our staff demanding 15-20
GB of main memory for a single job, using any unused partition "as is". Last
night I reconfigured the disks so that each has only one swap partition, and
paging performance immediately more than doubled! So NEVER put more than one
swap partition on a disk! Of course one should buy more RAM in such cases,
but our budget is exhausted for this year, so dirty workarounds are needed.
Thanks, Udo.
=============================================
ANSWERS:
  Selden
-----------------
Udo,
You wrote
> The logs show the following message several times:
> vmunix: vm_swap I/O error during pageout
You have a disk which is failing.
You must determine which one it is and replace it.
The easiest way is to use the file command:
e.g.
# file /dev/rrz5c
/dev/rrz5c: character special (8/5122) SCSI #0 3391WS disk #40 (SCSI ID #5)
(SCSI LUN #0) errors = 1/0
It will show the number of "hard errors/soft errors"
hard errors were not recoverable: no valid data could be transferred.
soft errors were recovered: after several retries something was retrieved.
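
As a minimal sketch (not part of the original reply), the two counters could
be pulled out of that output with a short Python script. The function name
`parse_error_counts` is hypothetical, and the pattern assumes the counters
always appear as `errors = hard/soft`, as in the sample above.

```python
import re

# Sample output copied from the `file /dev/rrz5c` example above,
# joined onto a single line.
SAMPLE = (
    "/dev/rrz5c: character special (8/5122) SCSI #0 3391WS disk #40 "
    "(SCSI ID #5) (SCSI LUN #0) errors = 1/0"
)


def parse_error_counts(line):
    """Return (hard, soft) error counts, or None if no counters are present.

    Hypothetical helper: assumes the 'errors = hard/soft' layout shown above.
    """
    match = re.search(r"errors\s*=\s*(\d+)/(\d+)", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


hard, soft = parse_error_counts(SAMPLE)
print(f"hard={hard} soft={soft}")  # prints "hard=1 soft=0"
```

Any disk reporting a non-zero hard count is a candidate for replacement.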

(We are about to replace the disk described above)

You can also use the command
uerf -R -o full | more
to look at the log entries of the most recent errors
to see what kind of things are not going right.

> Our swapspace is distributed over two disks and four partitions as

Having more than one swap space on a disk is an exercise in futility.
Your system performance will be very, very poor when it tries to swap
on two different partitions on the same disk. The disk heads
will be jumping back and forth between the two partitions
as fast as they can, which is much too slow.
-----------------
  Alan
-----------------
        There isn't much real paging activity (pin - page-in or
        pout - page-out) which means your problem could be much,
        much worse. Evidently, you have enough physical memory
        to actually run the program without much real paging.

        While the counts of soft faults and zero page fills are
        high, they shouldn't be unexpected. As you said, you
        have a 5 GB process. That's on the order of 650,000 pages
        that it may want to touch. If the program was coded to
        have all of that memory as pre-allocated, but zero filled
        data, it will take a lot of page faults and zero fills
        to touch all of it.
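
The page count quoted above can be checked with a little arithmetic. This is
only a sketch: it assumes an 8 KB page size (the usual value on Alpha
systems, but not stated in the reply) and treats 5 GB as 5 * 2**30 bytes.
The helper name `pages_needed` is hypothetical.

```python
PAGE_SIZE = 8 * 1024         # bytes; ASSUMED 8 KB page size
PROCESS_SIZE = 5 * 1024**3   # bytes; the 5 GB process mentioned above


def pages_needed(size_bytes, page_size=PAGE_SIZE):
    """Number of pages needed to cover size_bytes, rounded up."""
    return -(-size_bytes // page_size)


print(pages_needed(PROCESS_SIZE))  # prints 655360
```

Under those assumptions the result matches the "on the order of 650,000
pages" estimate, each of which costs at least one fault on first touch.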

        I suspect the performance problem is due both to the amount
        of work that needs to be done to start this particular
        program and to it being done all at once inside the
        kernel. If it happened gradually the program would take
        longer to start, but would be more friendly.

        If you have control over the program, you might want to
        look at ways to use less static data and do the memory
        initialization as part of the program instead of letting
        the kernel tie up the system doing it. If the program
        is from a 3rd party, report the behavior to them to see
        what they can suggest.

        You might also want to log a problem report with your
        country Compaq Service center. There could be a scheduling
        problem or the lack of a feature to gracefully handle these
        sorts of programs. A simple program that demonstrates
        the problem would help them a great deal.


-- 
Dr. Udo Grabowski                           email: udo.grabowski_at_imk.fzk.de
Institut f. Meteorologie und Klimaforschung II, Forschungszentrum Karlsruhe
Postfach 3640, D-76021 Karlsruhe, Germany           Tel: (+49) 7247 82-6026
http://www.fzk.de/imk/imk2/ame/grabowski/           Fax:         "    -6141
Received on Wed Aug 16 2000 - 12:19:37 NZST
