Managers,
Below is a summary of the responses to my question about our ES40
performance problems Sep 11-14. Since then, things have quieted down,
so we are assuming the problems were somehow due to the disasters
on the east coast. We have also ordered a fourth CPU and more memory.
Our investigation of what happened was hampered because the file
domain which holds the logs filled up, shutting off syslog logging.
During the extreme load periods, even well before the domain filled
up, sendmail log records are missing; it appears syslog was
overwhelmed and lost records.
Another problem we encountered was that qpopper somehow duplicated
messages in active inboxes. This happened to about 1500 spools. I
suspect a race condition in qpopper. I ended up writing a perl script
which went through all the spools in /var/spool/mail, parsed each
spool into messages, and wrote out only one copy of each message. It
compared only exact copies, so it would not delete copies of a message
that was actually delivered more than once, since those have different
received times, etc. The multiples were in powers of two; i.e., a
spool would have a set of messages repeated 2, 4, 8, ... times. This
filled up our disk, increasing the chaos.
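For the curious, the core of that cleanup can be sketched as follows. This is an illustrative Python sketch, not the original perl script; the function name and mbox-splitting details are assumptions. It splits a spool on "From " separator lines, keeps only the first of each exact duplicate, and preserves the order of the survivors:

```python
# Sketch of the spool de-duplication described above (illustrative,
# not the original perl script).
def dedupe_spool(text):
    # Split the mbox-format spool into messages on "From " lines.
    messages, current = [], []
    for line in text.splitlines(True):
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    # Keep only the first occurrence of each exact copy. A message
    # genuinely delivered twice has different Received: headers, so
    # it is NOT an exact copy and is left alone.
    seen, kept = set(), []
    for msg in messages:
        if msg not in seen:
            seen.add(msg)
            kept.append(msg)
    return "".join(kept)
```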
Thanks for all the suggestions; the systems group went through
them with a fine-tooth comb.
I have included in the summary the suggestions which were new to
me (although our systems group may already have known about them).
We had already rebooted, tried moving directories such as pop_tmp,
created new mail queues, checked for mail bombs, checked for
extra-large mailings, temporarily turned off qpopper, temporarily
turned off the web mail, etc., so I haven't included those below.
Several people suggested modifying qpopper to control the number
of processes; for now we are concentrating on getting rid of it,
and switching, probably to Cyrus.
- Jerry Berkman, UC Berkeley
==================================================
Udo.de.boer_at_ubero.nl suggests using "dia" to look for hardware problems
and "ps -O vsize" to see which programs are paging.
crossmd_at_mh.uk.sbphrd.com asks if we have enough memory
in the UBC.
Lindsay.Wakeman_at_bl.uk says:
This is only a guess, but could it be chronic exhaustion of the BMT
structures in your AdVfs filesystems (due to transience of (hundreds
of?) thousands of small files)? I am no expert but the AdVfs info from
your sys_check census looks suspect and mentions a problem with
Pagecounts. Such a problem might also explain the inordinate amount of
time to run ls.
The AdVfs command 'showfile -x' gives you more info about data in the
BMT. If this is the problem I think you would need to take the system
offline to sort it out (which involves possible defrag, recreating the
file partition and restoring data etc. etc.)
This is from the recovery section in Hancock's 'Tru64 UNIX Filesystem
Administration Handbook - a useful tome!
seb_at_lns62.lns.cornell.edu suggests maybe the increased volume due to
the bombings is a factor. He mentions "a multi-megabyte message
containing a powerpoint presentation of images from the NYC disaster"
being circulated plus chain letters and mass mailings.
ckd_at_genome.wi.mit.edu suggests using 'sysconfig -q advfs' and
'sysconfig -q vfs'.
Todd_Minnella_at_intuit.com says:
Have you set up your /var/spool/mail directory to use a one- or
two-level hash? If you have 18,000 files in /var/spool/mail, directory
lookups will take a long time. In addition, you may want to make
either or both /var/spool/mail and /usr/pop_tmp into RAID 0+1 volumes;
if your percentage of writes to reads is greater than about 30-40% (use
collect to monitor your I/O if you don't know the percentage), under
high load conditions your disks may not be able to keep up due to the
added overhead of RAID 5.
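The hashed-spool idea above can be sketched in a few lines of Python. The layout and function name here are assumptions for illustration only; the MDA and qpopper would both need to be configured to use the same scheme:

```python
import os

# Illustrative one-level hash for a large mail spool: spread the
# ~18,000 inboxes across 26 single-letter subdirectories so no single
# directory lookup has to scan thousands of entries.
def spool_path(user, base="/var/spool/mail"):
    # Hash on the first letter of the username, e.g. jerry -> .../j/jerry
    return os.path.join(base, user[0].lower(), user)
```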
oisin_at_sbcm.com says to implement the VM adjustments suggestions in
the 9/13 sys_check output.
bruce.hines_at_eds.com says:
1. Get dcpi (Digital Continuous Profiling Infrastructure) from the
Compaq web site. This will allow you to capture what the processor is
doing at peak times. You will need to run it for a 10-20 second sample
during one of the peak periods. It will give you some idea of which
code is running hot.
2. Use kprofile to analyze the kernel during those peak periods.
Run kprofile for a 10-20 second sample and analyze the results. This
should tell you something about where the kernel is spending its time.
Use the option to break down by CPU to get a better report. Please
read the man page on usage, as it requires the "pfm" device driver to
be installed in the kernel (which it is not by default).
3. Run 'lockinfo -sort=misses sleep 10' during this same peak period.
This will capture the lock analysis. Look for high misses (with a high
miss percentage) and high waitsum seconds.
Please run these one after the other, NOT AT THE SAME TIME, so as not
to skew the results. The outputs should give you a better idea of what
is causing your bottleneck, or at least where else to look.
athome11_at_qwest.net says:
I have seen problems on large email systems where
sominconn/somaxconn were set to the factory defaults. The symptom is
that sendmail grinds to a halt.
Increase them to the maximum for the OS version you have. 32K will
work on all OS versions. If you are at 4.0F with patches, it seems 64K
will work.
Further, the syscheck from your site recommends this too. This should
help.
From the syscheck:
---------------------------------------------------------------------
Tuning Suggestion: Too many ipintrq.ifq_drops packet drops ( 166 )
Packet drops can be avoided by raising ipintrq.ifq_maxlen to 2048. Use
the sysconfigdb command or the Kernel Tuner to add the following to
/etc/sysconfigtab:
inet:
    ipqmaxlen=2048
See System Configuration and Tuning Manual
Tuning Suggestion: The high value for somaxconn_drops indicates that
you may want to increase somaxconn. Use the sysconfigdb command or the
Kernel Tuner to add the following to /etc/sysconfigtab: (the maximum
value is 32767).
socket:
    somaxconn=8192
You must reboot the system in order to use the new value. See Tuning
the Socket Listen Queue Limits in the System Configuration and Tuning
Manual for more information.
Tuning Suggestion: Increase tcbhashsize to at least 2048. There are a
large number of total TCP connections ( 7521 ) in comparison to the
hash size ( 256 ), which indicates that you may want to increase
tcbhashsize to at least 2048. Use the sysconfigdb command or the
Kernel Tuner to add the following to /etc/sysconfigtab:
inet:
    tcbhashsize=2048
This will take effect when sysconfig -r is run, or on a reboot. See
Improving the Lookup Rate for TCP Control Blocks in the System
Configuration and Tuning Manual for more information.
There are no defined netrain interfaces.
B.C.Phillips_at_massey.ac.nz said:
Your system is paging madly! You need more RAM, or at least to use
more of it for running processes and less for disk cache.
From your syscheck
Operational: There is excessive page-in activity on the system ( 10938
). Investigate this condition. See vmstat(1) reference page for
information.
...
and suggested using vmubc.
> -----Original Message-----
> From: tru64-unix-managers-owner_at_ornl.gov
> [mailto:tru64-unix-managers-owner_at_ornl.gov]On Behalf Of Jerome M Berkman
> Sent: Monday, September 17, 2001 2:13 AM
> To: tru64-unix-managers_at_ornl.gov
> Subject: ES40 can't work
>
>
> Managers,
>
> Our ES40 mail server is going ballistic; the load average
> goes into the 300-400 range and nothing gets accomplished!
> It takes over an hour to do "ls -l" in a directory!
> We can't figure it out. Rather than sending a long message
> to this list, I have set up a Web page which describes the
> problem in detail. It is at:
>
> http://socrates.berkeley.edu:7309/perf_problem/
>
> We are desperate; please let me know if you have had any similar
> experience, have any idea what the problem is, or have any
> recommendation on how to approach it.
>
> Thanks.
>
> - Jerry Berkman, UC Berkeley, jerry_at_uclink.berkeley.edu
> 1-510-642-4804
Received on Wed Oct 03 2001 - 05:07:43 NZST