summary: memory problem - production machine down

From: <anthony.miller_at_vf.vodafone.co.uk>
Date: Thu, 02 Mar 2000 11:03:11 +0000

All...

On 18/2 I sent out a posting (original shown at the end of this mail)
regarding some apparent memory errors.

I received very helpful replies from:

michael.pelley_at_northatlantic.nf.ca
Tom Webster [webster_at_ssdpdc.lgb.cal.boeing.com]
North, Walter [wnorth_at_state.mt.us]
Dr. Tom Blinn, 603-884-0646 [tpb_at_doctor.zk3.dec.com]
Blake Roberts [BLAROB_at_HBSI.COM]

Tom Webster's reply gave some very helpful information on TurboLaser
start-ups. I have attached it as it is of general interest to the group.

By a process of elimination, we established that all 4 memory boards (2 x
2 GB & 2 x 4 GB) are good (given the crashes we were seeing, we had
suspected that memory might be the problem). We then suspected a possible
processor problem.

However, what we actually did next was to disable (via the switches on the
EMC array) ALL of the EMC output busses. Bingo - genvmunix booted multi-user
straight off. LSM started, all the devices came up, everything was mounted
etc. We then did a standard 'doconfig' and generated a new kernel config
file, copied the new kernel to /vmunix and did a full reboot. Again the
system came straight up - no problems at all. We now have robot support, so
the HSM part of our application is a happy bunny.

The eventual solution was twofold, however. Disabling the EMC and Walter
North's mail pointed me in the direction of SCSI IDs. I asked the EMC
engineer to confirm that all SCSI IDs on the EMC were set to 6. I
understand some of these may not have been, thus causing SCSI ID clashes.
We checked the ISPnn_HOST_ID console parameter for each of the new ISPs
(KZPBA-CBs) and found a couple set to 6 & one set to 1! We set these to 7,
thus avoiding any ID problems.
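
To illustrate the kind of clash described above, here is a minimal Python
sketch that looks for two devices sharing one SCSI ID on the same bus. The
bus names follow the ispN names in the SHOW CONFIG output below; the
"emc_port" entries and every ID value in the sample data are invented for
illustration and are not taken from our system.

```python
from collections import defaultdict

def find_id_clashes(buses):
    """Return {bus: {scsi_id: [device names]}} for IDs used more than once on a bus."""
    clashes = {}
    for bus, devices in buses.items():
        by_id = defaultdict(list)
        for name, scsi_id in devices.items():
            by_id[scsi_id].append(name)
        dupes = {i: sorted(names) for i, names in by_id.items() if len(names) > 1}
        if dupes:
            clashes[bus] = dupes
    return clashes

# Hypothetical sample data: one adapter left at host ID 6, which collides
# with an array port that is also set to 6 on the same bus.
before = {
    "isp0": {"isp0 (host adapter)": 6, "emc_port_a": 6},
    "isp1": {"isp1 (host adapter)": 7, "emc_port_b": 6},
}

# Same layout after moving the adapter on isp0 to host ID 7.
after = {bus: dict(devs) for bus, devs in before.items()}
after["isp0"]["isp0 (host adapter)"] = 7

print(find_id_clashes(before))
print(find_id_clashes(after))
```

The first call reports the shared ID on isp0; the second reports nothing,
which is the state we were aiming for by setting every host ID to 7.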

This system was running console firmware V5.1. This is pretty old (Jan
'98). DEC technical support confirmed that this firmware version does
support KZPBA-CBs, but our hardware engineers suggested upgrading to a later
version. In SHOW CONFIG output from the >>> prompt, KZPBAs are supposed to
show up as ISP1040B controllers. Ours seem to be showing up as ISP1020s (f/w
single ended):
 C2 PCI connected to kftha0 pci2
   0+ KZPSA 81011 0000 kzpsa10
   3+ QLogic ISP1020 10201077 0005 isp0
   4+ QLogic ISP1020 10201077 0005 isp1
   7+ QLogic ISP1020 10201077 0005 isp2
   8+ QLogic ISP1020 10201077 0005 isp3
   B+ QLogic ISP1020 10201077 0005 isp4

The console parameters ISPnn_SOFT_TERM appeared slightly unusual. At V5.1
of firmware, ours showed up as set to "diff". I thought these should be set
to "on". Having upgraded to V5.3, these appeared OK. We still see the
controllers as 1020s and not 1040s though (Ultra-2, diff).

The system came straight up - no crashes, no funny memory errors etc. All
the EMC storage is available and accessible.


Thanks everybody for your help.

regards - Tony

 



========================Blake Roberts==========================================
I have a similar set up to you with KZPBA-CB's connected to an EMC
Symmetrix. I don't know what version of the OS you are running, but if you
are on 4.0D PK5 or above, there is a SCSI patch (isp.mod) that needs to be
applied to prevent some flaky SCSI behavior. I've never seen it kernel
panic, but I have seen it timeout I/O and some other things.

From the behavior you are describing, it doesn't sound like the SCSI IDs of
some of your drives have changed. One place you might want to verify is the
SCSI ID of your autochanger. If you didn't have the KZPBAs in before the
install, they may have changed bus numbers, which will throw the device
numbers off by a multiple of 8.
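
The "multiple of 8" arithmetic can be sketched in a few lines of Python.
This assumes the usual Digital UNIX 4.x convention for LUN 0 devices, where
the unit number is 8 times the bus number plus the target ID; the bus and
target values used here are made up for the example.

```python
def unit_number(bus, target):
    """Unit number under the assumed convention: 8 * bus + target (LUN 0 only)."""
    if not 0 <= target <= 7:
        raise ValueError("target ID must be between 0 and 7")
    if bus < 0:
        raise ValueError("bus number must not be negative")
    return 8 * bus + target

def rz_name(bus, target):
    """Disk device name for a given bus and target, e.g. rz19."""
    return f"rz{unit_number(bus, target)}"

# Hypothetical case: a disk at target 3 whose bus is renumbered from 2 to 5
# after extra adapters are installed.
old_name = rz_name(2, 3)
new_name = rz_name(5, 3)
shift = unit_number(5, 3) - unit_number(2, 3)

print(old_name, "->", new_name, "(shift of", shift, ")")
```

The target ID has not changed, yet the device name has, and the shift is
3 buses times 8.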

The only other thing I can think of is checking the CLC version on your
system. That has caused a kernel panic on one of my 8400's before.
Upgrading to a newer version solved that.



=============================Dr. Tom===========================================
Tony, nowhere in your message do you indicate what kernel version you are
using or, since you imply you are using the SCSI/CAM layered components to
get media changer support, what version of that you are using and whether
you are CERTAIN it's the latest version and compatible with your kernel.

In your /usr/sys/conf directory, in addition to your <SYSTEMNAME> config
file, there may be a <SYSTEMNAME>.list file (as well as a .product.list
file). The .product.list file is the "master list" of layered products
that have components that will be built into the kernel. The copy for
each config file is based on the .product.list at the time the config
file was first generated (at least, that's how doconfig used to work, I
don't think it has changed). Compare the .product.list file to the one
for your configuration. If they differ, you might replace the one that
is specific to your configuration with a copy of the .product.list file.
Also, you might try replacing the one specific to your configuration with
an empty file, or comment out all of the kit identifier lines in the one
specific to your configuration, and rebuild your kernel. I suspect that
if you do the latter (use an empty file or one with the lines commented
out), you'll build a kernel that will reboot successfully. But it won't
have media changer support, either.
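
The comparison suggested above could be done with any diff tool; as a rough
sketch, here it is in Python using the standard difflib module. The entries
in the sample lists are placeholders only and are NOT the real syntax of
these files; "SYSTEMNAME.list" stands in for the per-configuration file.

```python
import difflib

def list_differences(master_lines, config_lines):
    """Return unified-diff lines between the master list and a per-config list."""
    return list(difflib.unified_diff(
        master_lines, config_lines,
        fromfile=".product.list", tofile="SYSTEMNAME.list", lineterm=""))

# Placeholder entries, invented for illustration.
master = ["kit-entry-one", "kit-entry-two"]
config = ["kit-entry-one"]

for line in list_differences(master, config):
    print(line)
```

An empty result means the two lists match. For real files, each list would
be read with something like open(path).read().splitlines().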

In any case, you've apparently got something messed up pretty badly in
your configuration, it's even possible you've got an invalid or simply
unsupported configuration, but there's not enough information in your
post to tell for sure. I'm going to pass a copy of your message along
to a colleague who might have some clues (or might not), but I suspect
you've got something out of the bounds of what's supposed to work. I can
imagine that you've got too many KZPBA-CB adapters for an 8400, just as
one possibility (since you say you added more).


=========================Walter North=========================================
You might check the internal SCSI IDs of the new KZPBAs.
I got some a month or so ago that were set to 6 instead of 7
and had a problem when attached to an IBM ESS (EMC wannabe), which
looks to 7 to transmit card info. The 8400 I was hooked to
lost sight of its internal disks, and a reboot produced similar
messages.

Hope this helps.

A "sho dev" will show the internal card settings out on the far end,
if I remember correctly.

According to DEC you can then use the set command at the console
prompt to change them, which I am going to do tomorrow when we
get downtime. Until then I have had to set the IBM ESS to
6 for these cards.




=========================From: Tom Webster==============================================
1. Bad hardware. Is the fault light on the front panel lit? If it is,
   then the hardware has failed the power-on self-test. The 8400s
   print a diagnostic listing of which boards they have when they init.
   I learned the hard way when we were having processor problems that
   it is usually a good idea to have a copy of a healthy one around
   (I know, now I tell you).

   This is an old listing from one of our machines that I happened to
   have on-line (lovingly retyped as an example for our restart
   procedures; most of it is straight from the manual):

   F   E   D   C   B   A   9   8   7   6   5   4   3   2   1   0   NODE#
                               A   M   .   .   .   .   P   P   P   TYP
                               o   +   .   .   .   .   ++  ++  ++  ST1
                               .   .   .   .   .   .   EE  EE  EB  BPD
                               o   +   .   .   .   .   ++  ++  ++  ST2
                               .   .   .   .   .   .   EE  EE  EB  BPD
                               +   +   .   .   .   .   ++  ++  ++  ST3
                               .   .   .   .   .   .   EE  EE  EB  BPD
                                   .   +   +   +   .   +   +   +   C0 PCI +
                   +   .   .   .   +   .   +   +   .   .   .   +   C1 PCI +
                                   .   .   .   .   .   .   .   .   EISA +
                               .   A0  .   .   .   .   .   .   .   ILV
                               .   2GB .   .   .   .   .   .   .   2GB

   The TYP field lists the type of module:

   A = Adapter
   M = Memory
   P = Processor

   The ST1-ST3 lines show the results of the self-tests. The
   results are coded as follows:

   + = pass
   - = fail
   o = N/A

   The BPD lines indicate boot processor designation. The results
   on this line indicate:

   B = boot processor
   E = eligible to boot
   D = ineligible to boot
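
   As a rough illustration of reading the TYP and ST lines with the codes
   above, here is a small Python sketch. It assumes the rightmost column of
   each row is node 0 and that "." marks an empty slot (the legend above
   does not say so explicitly); the function and variable names are made up
   for the example.

```python
TYPE_CODES = {"A": "adapter", "M": "memory", "P": "processor"}
RESULT_CODES = {"+": "pass", "-": "fail", "o": "N/A"}

def decode_self_test(typ_row, st_row):
    """Map node number -> (module type, list of per-unit self-test results)."""
    types = typ_row.split()[:-1]    # drop the trailing "TYP" label
    results = st_row.split()[:-1]   # drop the trailing "ST1"/"ST2"/"ST3" label
    if len(types) != len(results):
        raise ValueError("TYP and ST rows have different column counts")
    report = {}
    # Assumption: the rightmost column is node 0, so walk both rows backwards.
    for node, (t, r) in enumerate(zip(reversed(types), reversed(results))):
        if t == ".":                # assumed to mean an empty slot
            continue
        decoded = [RESULT_CODES.get(ch, "unknown") for ch in r]
        report[node] = (TYPE_CODES.get(t, "unknown"), decoded)
    return report

def failed_nodes(report):
    """Nodes where any unit reported a failure."""
    return sorted(n for n, (_, res) in report.items() if "fail" in res)

typ_row = "A M . . . . P P P TYP"
st1_row = "o + . . . . ++ ++ ++ ST1"

report = decode_self_test(typ_row, st1_row)
for node in sorted(report):
    print(node, report[node])
print("failed:", failed_nodes(report))
```

   A processor module shows two results because each one carries two CPUs.
   With the healthy listing above, no node is reported as failed.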




==========================original posting=======================================
Hope you can help. We have been doing some maintenance on one of our
production 8400s.

Have added some additional KZPBA-CBs (connected to an EMC Symmetrix). All
went well. The system booted from genvmunix - lsm starts, all my devices
are visible and mount etc., etc.

Built a new kernel (and config file). Rebooted from new kernel and system
crashes mid boot.

I did see the following displayed during the genvmunix (single user) boot -
but sort of ignored it:
 
Starting at 0xfffffc000047e9b0
  contig_malloc: failed to allocate memory within addrlimit
 contig_malloc: failed to allocate memory within addrlimit
 contig_malloc: failed to allocate memory within addrlimit

The system came up to single user ok. Started lsm and mounted /usr -
generated a new kernel and booted from it.

Upon booting from the new kernel, everything seems to be proceeding ok
until:
TLMEM at node 7
 TLMEM at node 6
 TLMEM at node 5
 TLMEM at node 4
 Dual TLEP at node 3
 Dual TLEP at node 2
 Dual TLEP at node 1
 Dual TLEP at node 0
 lvm0: configured.
 lvm1: configured.
 
 trap: invalid memory read access from kernel mode
 
     faulting virtual address: 0x0000027b00000005
     pc of faulting instruction: 0xfffffc000026d618
     ra contents at time of fault: 0xfffffc000026d5d0
     sp contents at time of fault: 0xfffffffe9d8df7e0
 
 panic (cpu 0): kernel memory fault
 
 DUMP: No primary swap, no explicit dumpdev.
           Nowhere to put header, giving up.
 
 halted CPU 0
 
 halt code = 5
 HALT instruction executed
 PC = fffffc00004b8130
 P00>>>init


This was a consistent problem. However booting multi-user from genvmunix
worked fine. System came up ok - all applications started etc.

It was by this time 01:00am so we were going to leave it running genvmunix
and diagnose further tomorrow.

Only one problem - this system uses HSM software and the application has
near-line data on a TZ89-based tape silo. The application needs the tape
silo to work. The problem is that genvmunix does not seem to have media
changer support.

The system had been up for some 30 minutes or more. We were just wondering
about a workaround for this when the system crashed.

trap: invalid memory read access from kernel mode
 
     faulting virtual address: 0x0000043e00000005
     pc of faulting instruction: 0xfffffc000026b5e0
     ra contents at time of fault: 0x0000000000000168
     sp contents at time of fault: 0xfffffffea0c476d0
 
 panic (cpu 1): kernel memory fault
 syncing disks...
 
 LSM attempting to dump to SCSI device unit number rz1
 
 DUMP: 27468083 blocks available for dumping.
 DUMP: 666546 wanted for a partial compressed dump.
 DUMP: Allowing 4843182 of the 4847278 available on 0x800401
 DUMP.prom: dev SCSI 0 3 0 1 100 0 0, block 409600
 DUMP: Header to 0x800401 at 4847278 (0x49f6ae)
 DUMP.prom: dev SCSI 0 3 0 1 100 0 0, block 409600


Looks to me like a hard memory fault of some kind. Any idea how I decide
which memory module may be the faulty one? My config is as follows:

01:03:23 P00>>>show config
01:04:12
01:04:12 Name Type Rev Mnemonic
01:04:12 TLSB
01:04:12 0++ KN7CF-AB 8014 0000 kn7cf-ab0
01:04:12 1++ KN7CF-AB 8014 0000 kn7cf-ab1
01:04:12 2++ KN7CF-AB 8014 0000 kn7cf-ab2
01:04:12 3++ KN7CF-AB 8014 0000 kn7cf-ab3
01:04:12 4+ MS7CC 5000 4000 ms7cc0
01:04:12 5+ MS7CC 5000 4000 ms7cc1
01:04:13 6+ MS7CC 5000 0000 ms7cc2
01:04:13 7+ MS7CC 5000 0000 ms7cc3
01:04:13 8+ KFTHA 2000 0D03 kftha0
01:04:13


01:16:14 P00>>>sho mem
01:17:58 Set   Node   Size      Base Address        Intlv   Position
01:17:59 ---   ----   ----      -------- --------   -----   --------
01:17:59  A     4     4096 Mb   00000000 00000000   8-Way   0
01:17:59  A     5     4096 Mb   00000000 00000000   8-Way   1
01:17:59  B     6     2048 Mb   00000002 00000000   4-Way   0
01:17:59  B     7     2048 Mb   00000002 00000000   4-Way   1
01:17:59 P00>>>



Any help would be greatly appreciated.

Best regards - Tony
Received on Thu Mar 02 2000 - 11:04:23 NZDT
