FOLLOWUP: Serious Problems DEFPA + 4.0B from Tim W. Janes on 1997-04-22 (tru64-unix-managers)

From: Tim W. Janes <janes_at_signal.dra.hmg.gb>
Date: Mon, 21 Apr 1997 23:03:47 +0100 (BST)

We are no nearer finding a solution.

The lockups only affect one program, have never affected 250 2/266
Ethernet machines, only rarely affect 500 3/333 Ethernet machines, but
kill any FDDI-connected machine within 36 hours.

Kurt Carlson <sxkac_at_java.sois.alaska.edu> suggested leaving another
process running to check that we were not running out of swap. This
showed that we were not.
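
A minimal sketch of such a watcher process is given below. It is not
the script that was actually run; it assumes a Python 3 interpreter is
available, and the function name log_command_output and its parameters
are hypothetical. Each sample is forced to disk so that the last
reading survives a hard lockup.

```python
import os
import subprocess
import time


def log_command_output(cmd, log_path, interval_s=60, iterations=None):
    """Run cmd every interval_s seconds and append its output to log_path.

    iterations=None means run until the machine (or the process) dies.
    """
    count = 0
    with open(log_path, "a") as log:
        while iterations is None or count < iterations:
            result = subprocess.run(cmd, capture_output=True, text=True)
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"--- {stamp} ---\n{result.stdout}{result.stderr}")
            # Flush and fsync so the sample is on disk before the next
            # interval; a locked-up machine never gets to close the file.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if iterations is None or count < iterations:
                time.sleep(interval_s)
```

It could be started as, for example,
log_command_output(["swapon", "-s"], "/var/tmp/swap.log"), where the
choice of swapon -s as the swap summary command and the log path are
both assumptions.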

DEC have led us through the process of halting the system and forcing
a crash. (This involved moving a jumper on the motherboard.)

Here are a couple of snippets from crash_data:

malloc failed: bucket size=8192, # of failures=2496700
malloc failed: bucket size=8192, # of failures=2496800
malloc failed: bucket size=8192, # of failures=2496900
malloc failed: bucket size=8192, # of failures=2497000
malloc failed: bucket size=8192, # of failures=2497100
malloc failed: bucket size=8192, # of failures=2497200
malloc failed: bucket size=8192, # of failures=2497300

Swap device name                       Size     In Use       Free
-------------------------------- ---------- ---------- ----------
(null)                              524288k    207656k    316632k
                                     65536p     25957p     39579p
-------------------------------- ---------- ---------- ----------
Total swap partitions: 1            524288k    207656k    316632k
                                     65536p     25957p     39579p
_kdbx_swap_end:
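
As a sanity check on the swap figures above, the k (kilobyte) and p
(page) rows can be compared. The short sketch below uses only the
numbers from the crash_data snippet; the helper names are
hypothetical. The rows are consistent with an 8 KB page size, and
about 60% of swap was still free when the crash was forced.

```python
# Figures copied from the crash_data swap summary (k = kilobytes, p = pages).
size_k, in_use_k, free_k = 524288, 207656, 316632
size_p, in_use_p, free_p = 65536, 25957, 39579


def page_size_kb(kilobytes, pages):
    """Kilobytes per page implied by one column of the summary."""
    return kilobytes / pages


def free_fraction(free, total):
    """Fraction of swap that was still free."""
    return free / total


if __name__ == "__main__":
    print(page_size_kb(size_k, size_p))              # 8.0
    print(round(free_fraction(free_k, size_k), 3))   # 0.604
```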

DEC have had the crash dumps for 2 weeks, but so far we have had no
feedback.

We have also occasionally noticed messages such as

Apr 7 22:34:46 saki vmunix: malloc failed: bucket size=8192, # of failures=100
Apr 7 22:34:47 saki vmunix: malloc failed: bucket size=8192, # of failures=200

in kern.log but these do not correspond to any crash.
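
To line these messages up against the times of the lockups, the
kern.log entries could be extracted with something like the sketch
below. It assumes the syslog line layout shown in the two sample lines
above; the function name parse_malloc_failures is hypothetical.

```python
import re

# Matches lines such as:
# Apr 7 22:34:46 saki vmunix: malloc failed: bucket size=8192, # of failures=100
MALLOC_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmunix:\s+"
    r"malloc failed: bucket size=(?P<bucket>\d+), # of failures=(?P<failures>\d+)"
)


def parse_malloc_failures(lines):
    """Return (timestamp, host, bucket_size, failure_count) for each match."""
    records = []
    for line in lines:
        m = MALLOC_RE.match(line)
        if m:
            records.append(
                (m["stamp"], m["host"], int(m["bucket"]), int(m["failures"]))
            )
    return records
```

Called on the lines of kern.log, this returns one tuple per message,
so a jump in the failure count shortly before a lockup would be easy
to spot.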

Tim Janes

Here is my original question:-

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since upgrading to DU4.0B we have experienced serious lock-up problems
on all our FDDI (DEFPA-UA) connected workstations (250-4/266 and 500/333).

One of our main user-written programs, which is extremely CPU intensive,
will lock up the machine after some hours of running.

By lock-up I mean that the screen is blank and cannot be woken up by
keyboard or mouse and the system will not respond to any network
requests including ping. On pressing the reset button the machine
enters the power-up sequence and reboots OK. There is nothing logged
in uerf, /var/adm/messages or /var/adm/syslog.dated/* and no core file
generated.

This program ran OK day in day out on 3.2G and still runs OK on
ethernet connected machines. If the dataset for the program is chosen
so that its memory requirements are small it is still OK on FDDI
machines. It only appears to fail when its virtual memory requirement
gets large (>180 Mbytes?) and it is running on an FDDI machine
(physical memory is 128 Mbytes, swap 512 Mbytes). It is not yet clear
if this is the only program that will fail as it is the only large
CPU/memory program that we have run for any length of time since
upgrading to 4.0B. The program does not fail consistently at the same
point even with the same input data. We have tried recompiling on 4.0B
without any change in behaviour.

I have logged a call with DEC and they are sending me the latest 4.0B
patch kit by post (what is wrong with ftp?), but apparently there was
little in the info to suggest that it will cure this problem.

Any suggestions on how to debug what is happening?

One point I noticed when upgrading firmware is that on the servers
(1000 and 1000a) the 3.9 firmware (downloaded from Digital by ftp)
upgraded the DEFPA firmware to 3.1 but on all workstations the upgrade
procedure left the DEFPA firmware at 2.46. What is the difference
between 2.46 and 3.1? Is it relevant to our problem?

Many thanks for any help.
Received on Tue Apr 22 1997 - 00:15:53 NZST
