Signal handling bug in 5.1A?

From: Chris Adams <cmadams_at_hiwaay.net>
Date: Mon, 11 Mar 2002 11:10:54 -0600

I just upgraded (really a clean install) our primary mail server from
4.0F to 5.1A Friday morning. I rebuilt all the software that we use,
including sendmail 8.12.2 (we use some features that aren't available in
Compaq's build of sendmail).

After a few hours, sendmail stopped accepting new connections. I tried
to restart it, but the master daemon process was stuck in an
uninterruptible sleep (ps showed "U" for the state). I couldn't start a
new sendmail daemon because the old one still held the socket open. So,
we had to reboot the server.

We had to do this several times over the next 24 hours. It ran anywhere
from 20 minutes to 12 hours before locking up. Finally, I managed to
trace the process (using "truss" from the Extended System V
Functionality kit from the Associated Products CD). It is weird: it
looks like sendmail received a signal while trying to unblock signals,
and at that point it locked up. Here are the last few lines from
"truss":

sigaction(SIGALRM, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGCLD, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGHUP, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGINT, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGINT, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGPIPE, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGPIPE, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGTERM, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGUSR1, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigaction(SIGUSR1, 0x000000011FFF98B0, 0x000000011FFF9898) = 0
sigprocmask(SIG_UNBLOCK, 0x20086003, 0x00000000) = 537419779
    Received signal #20, SIGCLD [caught]
      siginfo: SIGCLD CLD_KILLED pid=7682 uid=1 status=0x0001

The last line didn't even get an end-of-line. At that point, both
sendmail and truss were unkillable.

I looked at the sendmail source and figured out exactly where it was
when it hung. The signal handler for SIGCLD is a no-op (sendmail's
"ignore" handler). The next line after the sigprocmask call is an
execve, which never happened.
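
For anyone who wants to try the same sequence on their own system, here
is a rough sketch of it. This is only an illustration in Python, not
sendmail's code: the names noop_handler and unblock_then_exec are made
up, and a Python-level handler goes through the interpreter rather than
being a bare C function, so it may not exercise the kernel in exactly
the same way. It blocks SIGCHLD, lets a child exit so the signal is
pending, installs a do-nothing handler, unblocks the signal, and then
calls exec.

```python
import os
import signal
import sys
import time


def noop_handler(signum, frame):
    """Swallow the signal; a stand-in for a do-nothing handler."""


def unblock_then_exec(argv):
    """In a forked child, unblock a pending SIGCHLD and then exec argv.

    Returns the exit code of the exec'd program as seen by the parent,
    or -1 if the child did not exit normally.
    """
    pid = os.fork()
    if pid == 0:
        try:
            # Hold SIGCHLD back so it is pending when the mask is lifted.
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
            if os.fork() == 0:
                os._exit(0)  # grandchild exits at once, raising SIGCHLD
            while signal.SIGCHLD not in signal.sigpending():
                time.sleep(0.01)
            # Python's counterpart to the sigaction() calls in the trace.
            signal.signal(signal.SIGCHLD, noop_handler)
            # The pending SIGCHLD is delivered as the signal is unblocked.
            signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGCHLD})
            os.execv(argv[0], argv)  # does not return on success
        finally:
            os._exit(127)  # only reached if something above failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    code = unblock_then_exec([sys.executable, "-c", "raise SystemExit(7)"])
    print("exec'd program exited with", code)
```

On a system that handles this correctly, the exec'd program runs and its
exit code comes back to the parent. A hang like the one above would show
up as the call never returning.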

I've reconfigured sendmail to have the daemon and the queue runner in
separate processes (to reduce the likelihood of hitting this), and so
far that has worked (although the system locked up hard this morning; I
don't know what happened there).

Has anyone else seen anything like this? As near as I can tell, this
looks like a Tru64 kernel bug with respect to signal handling. I don't
think there is any kind of hardware problem; this is an ES40 that has
been running 4.0F for over 2 years (we had a CPU fail about a year and a
half ago, but that has been the only problem).

I'm also trying to open a case with Compaq on this, but we just got a
new contract last week, and it is still in processing (the sales rep at
the reseller we worked through is tracking it down now).
-- 
Chris Adams <cmadams_at_hiwaay.net>
Systems and Network Administrator - HiWAAY Internet Services
I don't speak for anybody but myself - that's enough trouble.
Received on Mon Mar 11 2002 - 17:11:14 NZDT