Managers,
Below is a summary of the responses to my question about our ES40
performance problems Sep 11-14. Since then, things have quieted down,
so we are assuming the problems were somehow due to the disasters
on the east coast. We have also ordered a fourth CPU and more memory.
Our investigation of what happened was hampered because the file
domain which holds the logs filled up, shutting off syslog logging.
During the extreme load periods, even well before the domain filled
up, sendmail log records are missing; it appears syslog was
overwhelmed and lost records.
Another problem we encountered was that qpopper somehow duplicated
messages in active inboxes. This happened to about 1500 spools. I
suspect a race condition in qpopper. I ended up writing a perl script
which went through all the spools in /var/spool/mail, parsed each
spool into messages, and wrote out only one copy of each message. It
compared only exact copies, so it would not delete copies of a message
that was actually delivered more than once, since those have different
received times, etc. The multiples were in powers of two; i.e., a
spool would have a set of messages repeated 2, 4, 8, ... times. This
filled up our disk, increasing the chaos.
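For the curious, the core of that cleanup can be sketched as follows. This is an illustrative Python sketch, not the original perl script; the function name and mbox-splitting details are assumptions. It splits a spool on "From " separator lines, keeps only the first of each exact duplicate, and preserves the order of the survivors:

```python
# Sketch of the spool de-duplication described above (illustrative,
# not the original perl script).
def dedupe_spool(text):
    # Split the mbox-format spool into messages on "From " lines.
    messages, current = [], []
    for line in text.splitlines(True):
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    # Keep only the first occurrence of each exact copy. A message
    # genuinely delivered twice has different Received: headers, so
    # it is NOT an exact copy and is left alone.
    seen, kept = set(), []
    for msg in messages:
        if msg not in seen:
            seen.add(msg)
            kept.append(msg)
    return "".join(kept)
```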
Thanks for all the suggestions; the systems group went through
them with a fine-tooth comb.
I have included in the summary the suggestions which were new to
me (although our systems group may already have known about them).
We had already rebooted, tried moving directories such as pop_tmp,
created new mail queues, checked for mail bombs, checked for
extra-large mailings, temporarily turned off qpopper, temporarily
turned off the web mail, etc., so I haven't included those below.
Several people suggested modifying qpopper to control the number
of processes; for now we are concentrating on getting rid of it,
and switching, probably to Cyrus.
- Jerry Berkman, UC Berkeley
==================================================
Udo.de.boer_at_ubero.nl suggests using "dia" to look for hardware problems
and "ps -O vsize" to see which programs are paging.
crossmd_at_mh.uk.sbphrd.com asks if we have enough memory
in the UBC.
Lindsay.Wakeman_at_bl.uk says:
This is only a guess, but could it be chronic exhaustion of the BMT
structures in your AdVfs filesystems (due to transience of (hundreds
of?) thousands of small files)? I am no expert but the AdVfs info from
your sys_check census looks suspect and mentions a problem with
Pagecounts. Such a problem might also explain the inordinate amount of
time to run ls.
The AdVfs command 'showfile -x' gives you more info about data in the
BMT. If this is the problem I think you would need to take the system
offline to sort it out (which involves possible defrag, recreating the
file partition and restoring data etc. etc.)
This is from the recovery section in Hancock's 'Tru64 UNIX Filesystem
Administration Handbook - a useful tome!
seb_at_lns62.lns.cornell.edu suggests maybe the increased volume due to
the bombings is a factor. He mentions "a multi-megabyte message
containing a powerpoint presentation of images from the NYC disaster"
being circulated plus chain letters and mass mailings.
ckd_at_genome.wi.mit.edu suggests using 'sysconfig -q advfs' and
'sysconfig -q vfs'.
Todd_Minnella_at_intuit.com says:
Have you set up your /var/spool/mail directory to use a one- or
two-level hash? If you have 18,000 files in /var/spool/mail, directory
lookups will take a long time. In addition, you may want to make
either or both /var/spool/mail and /usr/pop_tmp into RAID 0+1 volumes;
if your percentage of writes to reads is greater than about 30-40% (use
collect to monitor your I/O if you don't know the percentage), under
high load conditions your disks may not be able to keep up due to the
added overhead of RAID 5.
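The hashed-spool idea above can be sketched in a few lines of Python. The layout and function name here are assumptions for illustration only; the MDA and qpopper would both need to be configured to use the same scheme:

```python
import os

# Illustrative one-level hash for a large mail spool: spread the
# ~18,000 inboxes across 26 single-letter subdirectories so no single
# directory lookup has to scan thousands of entries.
def spool_path(user, base="/var/spool/mail"):
    # Hash on the first letter of the username, e.g. jerry -> .../j/jerry
    return os.path.join(base, user[0].lower(), user)
```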
oisin_at_sbcm.com says to implement the VM adjustments suggestions in
the 9/13 sys_check output.
bruce.hines_at_eds.com says:
1. Get dcpi (Digital Continuous Profiling Infrastructure) from the
Compaq web site. This will allow you to capture what the processor is
doing at peak times. You will need to run it for a 10-20 second sample
during one of the peak periods. It will give you some idea of which
code is running hot.
2. Use kprofile to analyze the kernel during those peak periods.
Run kprofile for a 10-20 second sample and analyze the results. This
should tell you something about where the kernel is spending its time.
Use the option to break down by CPU to get a better report. Please
read the man page on usage, as it requires the "pfm" device driver to
be installed in the kernel (which it is not by default).
3. Run 'lockinfo -sort=misses sleep 10' during this same peak period.
This will capture the lock analysis. Look for high misses (with a high
miss percentage) and high waitsum seconds.
Please run these one after the other, NOT AT THE SAME TIME, so as not
to skew the results. The outputs should give you a better idea of what
is causing your bottleneck, or at least where else to look.
athome11_at_qwest.net says:
I have seen problems on large email systems where
sominconn/somaxconn were set to the factory defaults. The symptom is
that sendmail grinds to a halt.
Increase them to the maximum for the OS version you have. 32K will
work on all OS versions. If you are at 4.0F with patches, it seems 64K
will work.
Further, the syscheck from your site recommends this too. This should
help.
From the syscheck:
---------------------------------------------------------------------
Tuning Suggestion: Too many ipintrq.ifq_drops packet drops ( 166 )
Packet drops can be avoided by raising ipintrq.ifq_maxlen to 2048. Use
the sysconfigdb command or the Kernel Tuner to add the following to
/etc/sysconfigtab:
inet:
    ipqmaxlen=2048
See System Configuration and Tuning Manual
Tuning Suggestion: The high value for somaxconn_drops indicates that
you may want to increase somaxconn. Use the sysconfigdb command or the
Kernel Tuner to add the following to /etc/sysconfigtab: (the maximum
value is 32767).
socket:
    somaxconn=8192
You must reboot the system in order to use the new value. See Tuning
the Socket Listen Queue Limits in the System Configuration and Tuning
Manual for more information.
Tuning Suggestion: Increase tcbhashsize to at least 2048. There are a
large number of total TCP connections ( 7521 ) in comparison to the
hash size ( 256 ), which indicates that you may want to increase
tcbhashsize to at least 2048. Use the sysconfigdb command or the
Kernel Tuner to add the following to /etc/sysconfigtab:
inet:
    tcbhashsize=2048
This will take effect when sysconfig -r is run, or on a reboot. See
Improving the Lookup Rate for TCP Control Blocks in the System
Configuration and Tuning Manual for more information.
There are no defined netrain interfaces.
B.C.Phillips_at_massey.ac.nz said:
Your system is paging madly! You need more RAM, or at least to use
more of it for running processes and less for disk cache.
From your syscheck
Operational: There is excessive page-in activity on the system ( 10938
). Investigate this condition. See vmstat(1) reference page for
information.
...
and suggested using vmubc.
> -----Original Message-----
> From: tru64-unix-managers-owner_at_ornl.gov
> [mailto:tru64-unix-managers-owner_at_ornl.gov]On Behalf Of Jerome M Berkman
> Sent: Monday, September 17, 2001 2:13 AM
> To: tru64-unix-managers_at_ornl.gov
> Subject: ES40 can't work
>
>
> Managers,
>
> Our ES40 mail server is going ballistic; the load average
> goes into the 300-400 range and nothing gets accomplished!
> It takes over an hour to do "ls -l" in a directory!
> We can't figure it out. Rather than sending a long message
> to this list, I have set up a Web page which describes the
> problem in detail. It is at:
>
> http://socrates.berkeley.edu:7309/perf_problem/
>
> We are desperate; please let me know if you have had any similar
> experience, have any idea what the problem is, or have any
> recommendation on how to approach it.
>
> Thanks.
>
> - Jerry Berkman, UC Berkeley, jerry_at_uclink.berkeley.edu
> 1-510-642-4804
Received on Wed Oct 03 2001 - 05:07:43 NZST