memory problem - production machine down

From: <anthony.miller_at_vf.vodafone.co.uk>
Date: Fri, 18 Feb 2000 01:34:38 +0000

All...

Hope you can help. We have been doing some maintenance on one of our
production 8400's.

Have added some additional KZPBA-CB's (connected to an EMC Symmetrix). All
went well. The system booted from genvmunix - lsm starts, all my devices
are visible and mount etc., etc.

Built a new kernel (and config file). Rebooted from the new kernel and the
system crashes mid-boot.

I did see the following displayed during the genvmunix (single user) boot -
but sort of ignored it:
 
 Starting at 0xfffffc000047e9b0
 contig_malloc: failed to allocate memory within addrlimit
 contig_malloc: failed to allocate memory within addrlimit
 contig_malloc: failed to allocate memory within addrlimit

The system came up to single user ok. Started lsm and mounted /usr -
generated a new kernel and booted from it.

Upon booting from the new kernel, everything seems to be proceeding ok
until:
 TLMEM at node 7
 TLMEM at node 6
 TLMEM at node 5
 TLMEM at node 4
 Dual TLEP at node 3
 Dual TLEP at node 2
 Dual TLEP at node 1
 Dual TLEP at node 0
 lvm0: configured.
 lvm1: configured.
 
 trap: invalid memory read access from kernel mode
 
     faulting virtual address: 0x0000027b00000005
     pc of faulting instruction: 0xfffffc000026d618
     ra contents at time of fault: 0xfffffc000026d5d0
     sp contents at time of fault: 0xfffffffe9d8df7e0
 
 panic (cpu 0): kernel memory fault
 
 DUMP: No primary swap, no explicit dumpdev.
           Nowhere to put header, giving up.
 
 halted CPU 0
 
 halt code = 5
 HALT instruction executed
 PC = fffffc00004b8130
 P00>>>init


This was a consistent problem. However, booting multi-user from genvmunix
worked fine. The system came up ok - all applications started etc.

It was by this time 01:00am so we were going to leave it running genvmunix
and diagnose further tomorrow.

Only one problem - this system uses HSM software and the application has
near-line data on a TZ89-based tape silo. The application needs the tape silo
to work. The problem is that genvmunix does not seem to have media changer
support.

The system had been up for some 30 minutes or more. We were just working out
a workaround for this when the system crashed.

trap: invalid memory read access from kernel mode
 
     faulting virtual address: 0x0000043e00000005
     pc of faulting instruction: 0xfffffc000026b5e0
     ra contents at time of fault: 0x0000000000000168
     sp contents at time of fault: 0xfffffffea0c476d0
 
 panic (cpu 1): kernel memory fault
 syncing disks...
 
 LSM attempting to dump to SCSI device unit number rz1
 
 DUMP: 27468083 blocks available for dumping.
 DUMP: 666546 wanted for a partial compressed dump.
 DUMP: Allowing 4843182 of the 4847278 available on 0x800401
 DUMP.prom: dev SCSI 0 3 0 1 100 0 0, block 409600
 DUMP: Header to 0x800401 at 4847278 (0x49f6ae)
 DUMP.prom: dev SCSI 0 3 0 1 100 0 0, block 409600
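
To compare the two panic traces side by side, the faulting addresses can be split into 32-bit halves and checked against the kernel address range. This is only a minimal sketch: the values are copied from the traces above, and `KSEG_BASE` is an assumption taken from the fact that every pc in the traces starts with 0xfffffc00. The helper names are hypothetical.

```python
# Faulting virtual address and pc, copied from the two panic traces.
FAULTS = {
    "new kernel (cpu 0)": (0x0000027B00000005, 0xFFFFFC000026D618),
    "genvmunix (cpu 1)": (0x0000043E00000005, 0xFFFFFC000026B5E0),
}

# Assumption: kernel addresses start here, as the pc values in the traces suggest.
KSEG_BASE = 0xFFFFFC0000000000


def split_words(addr):
    """Return the (high, low) 32-bit halves of a 64-bit value."""
    return (addr >> 32) & 0xFFFFFFFF, addr & 0xFFFFFFFF


def describe(addr, pc):
    """Summarise one trap: address halves and whether each value is a kernel address."""
    high, low = split_words(addr)
    return {
        "high": high,
        "low": low,
        "addr_in_kseg": addr >= KSEG_BASE,
        "pc_in_kseg": pc >= KSEG_BASE,
    }


if __name__ == "__main__":
    for label, (addr, pc) in FAULTS.items():
        info = describe(addr, pc)
        print(f"{label}: high=0x{info['high']:08x} low=0x{info['low']:08x} "
              f"addr_in_kseg={info['addr_in_kseg']} pc_in_kseg={info['pc_in_kseg']}")
```

In both traps the low word is 0x00000005 and the high word is a small number (0x27b and 0x43e), and neither address falls in the kernel range. That pattern may be worth noting alongside the hardware theory, since the two addresses differ by far more than a single bit.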


Looks to me like a hard memory fault of some kind. Any idea how I decide
which memory module may be the faulty one? My config is as follows:

01:03:23 P00>>>show config
01:04:12
01:04:12         Name       Type   Rev    Mnemonic
01:04:12 TLSB
01:04:12 0++     KN7CF-AB   8014   0000   kn7cf-ab0
01:04:12 1++     KN7CF-AB   8014   0000   kn7cf-ab1
01:04:12 2++     KN7CF-AB   8014   0000   kn7cf-ab2
01:04:12 3++     KN7CF-AB   8014   0000   kn7cf-ab3
01:04:12 4+      MS7CC      5000   4000   ms7cc0
01:04:12 5+      MS7CC      5000   4000   ms7cc1
01:04:13 6+      MS7CC      5000   0000   ms7cc2
01:04:13 7+      MS7CC      5000   0000   ms7cc3
01:04:13 8+      KFTHA      2000   0D03   kftha0
01:04:13


01:16:14 P00>>>sho mem
01:17:58 Set  Node  Size      Base Address       Intlv  Position
01:17:59 ---  ----  ----      -------- --------  -----  --------
01:17:59 A    4     4096 Mb   00000000 00000000  8-Way  0
01:17:59 A    5     4096 Mb   00000000 00000000  8-Way  1
01:17:59 B    6     2048 Mb   00000002 00000000  4-Way  0
01:17:59 B    7     2048 Mb   00000002 00000000  4-Way  1
01:17:59 P00>>>
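
As a rough aid for relating a physical address to a memory set, the `sho mem` rows above can be turned into address ranges. This is a sketch under two assumptions that the console output does not state outright: the two hex words in the Base Address column are the high and low halves of one 64-bit address, and the Size column is per module, so a set spans the sum of its modules. The function names are hypothetical.

```python
import re

# Data rows copied from the "sho mem" output, timestamps removed.
SHO_MEM = """\
A 4 4096 Mb 00000000 00000000 8-Way 0
A 5 4096 Mb 00000000 00000000 8-Way 1
B 6 2048 Mb 00000002 00000000 4-Way 0
B 7 2048 Mb 00000002 00000000 4-Way 1
"""

ROW = re.compile(
    r"^(\w)\s+(\d+)\s+(\d+)\s+Mb\s+([0-9A-Fa-f]{8})\s+([0-9A-Fa-f]{8})\s+(\S+)\s+(\d+)$"
)
MB = 1024 * 1024


def parse_sets(text):
    """Group rows by set and return {set: {"base", "size", "nodes"}} in bytes."""
    sets = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue
        name, node, size_mb = m.group(1), int(m.group(2)), int(m.group(3))
        # Assumption: high word first, low word second.
        base = (int(m.group(4), 16) << 32) | int(m.group(5), 16)
        entry = sets.setdefault(name, {"base": base, "size": 0, "nodes": []})
        entry["size"] += size_mb * MB
        entry["nodes"].append(node)
    return sets


def find_set(sets, phys_addr):
    """Return (set name, nodes) containing a physical address, or None."""
    for name, s in sets.items():
        if s["base"] <= phys_addr < s["base"] + s["size"]:
            return name, s["nodes"]
    return None


if __name__ == "__main__":
    for name, s in parse_sets(SHO_MEM).items():
        end = s["base"] + s["size"] - 1
        print(f"set {name}: nodes {s['nodes']} 0x{s['base']:012x}-0x{end:012x}")
```

Under those assumptions set B begins exactly where set A ends (8 GB), which is at least consistent with the listing. Note that this only narrows a physical address to a set: the modules in a set are interleaved, and the faulting addresses in the panic traces are virtual, so they cannot be fed into `find_set` directly.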



Any help would be greatly appreciated.

Best regards - Tony
Received on Fri Feb 18 2000 - 01:35:55 NZDT
