Multiple Problems with V5.1 Cluster.

From: ^IT_MPG_UNIX_ADMIN <^IT__MPG__UNIX__ADMIN_at_ccmail.allegiance.net>
Date: Thu, 19 Apr 2001 16:19:58 -0500

     We seem to be having multiple problems with our V5.1 Cluster. This
     cluster currently has 4 systems in it. It is running Patch Kit 2. The
     Storage solution is HSG80's.
     
     Problem No 1: When the system is at the hardware prompt and we issue a
     boot command, it keeps getting these messages. After the kernel has
     been loaded they stop but periodically, some of our disks go offline.
     Just wondering if anyone has experienced anything like this, that
     would save us time. This just recently started. Was fine until last
     week.
     
     -----BEGINNING OF MESSAGES------
     
     CPU 15 speed is 731 MHz
     create powerup
     initializing GCT/FRU at 214000
     initializing pka dqa eia eib eic eid eie eif eig eih eii pga pgb pgc
     pga0.0.0.7.1 - probe timeout retry
     .pgd pgb0.0.0.1.2 - probe timeout retry
     .pga0.0.0.7.1 - probe timeout failure
     .
      pge pgc0.0.0.7.9 - probe timeout retry
     .pgb0.0.0.1.2 - probe timeout failure
     .pga0.0.0.7.1 - probe timeout retry
     .pgf pgd0.0.0.1.10 - probe timeout retry
     .pgg pga0.0.0.7.1 - probe timeout failure
     .pge0.0.0.2.16 - probe timeout retry
     .pgd0.0.0.1.10 - probe timeout failure
     .pgc0.0.0.7.9 - probe timeout failure
     .pgb0.0.0.1.2 - probe timeout retry
     .pgh pgf0.0.0.4.17 - probe timeout retry
     AlphaServer Console V5.9-111, built on Jan 9 2001 at 15:21:06
     pga0.0.0.7.1 - probe timeout retry
     .
     CPU 0 booting
     
     (boot dga900.1001.0.7.1 -flags A)
     pgg0.0.0.2.24 - probe timeout retry
     .pge0.0.0.2.16 - probe timeout retry
     .pgd0.0.0.1.10 - probe timeout failure
     .pgb0.0.0.1.2 - probe timeout retry
     .pga0.0.0.7.1 - probe timeout failure
     .pgh0.0.0.4.25 - probe timeout failure
     .block 0 of dga900.1001.0.7.1 is a valid boot block
     reading 18 blocks from dga900.1001.0.7.1
     bootstrap code read in
     base = 5f4000, image_start = 0, image_bytes = 2400
     initializing HWRPB at 2000
     initializing page table at fffe0pgf0.0.0.4.17 - probe timeout retry
     
     ------END OF MESSAGES-----
     
     Problem No 2: Also we keep taking these malloc_internal panics. Now
     these just started a few days ago. We had this cluster running for a
     while without any problems. Now all of a sudden this starts. We
     couldn't even complete a run of sys_check -escalate before the system
     paniced again. We have opened a call with Support but just wanted to
     check and see if anyone had seen anything like this.
     
     -------BEGINNING OF MESSAGES-------
     panic (cpu 1): malloc_internal
     syncing disks...
     DUMP: 123446380 blocks available for dumping.
     DUMP: 1141122 wanted for a partial compressed dump.
     DUMP: Allow ng 16773118 of the 16777214 available on 0x13002d3
     device string for dump = SCSI3 1 7 0 1 0 0 0 _at_wwid1.
     DUMP.prom: dev SCSI3 1 7 0 1 0 0 0 _at_wwid1, block 262144
     
     ------END OF MESSAGES----------
     
     I realise that this is not too much information but I am merely
     looking for some advice if anyone has already seen these problems and
     they have managed to get past them.
     
     Everything was working fine until the beginning of this week and now
     suddenly everything is falling apart. We tracked back and we really
     haven't made any changes to this environment. Any and all help will be
     greatly appreciated!!
     
     Tks & Rgds,
     
     Alay Shah
     Allegiance Healthcare Corporation
     1400 Waukegan Rd.
     McGaw Park, IL 60085
     Ph - 847-578-2584
     Fax - 847-578-5586
     Email - unixadmin_at_allegiance.net
             shahal_at_allegiance.net
Received on Thu Apr 19 2001 - 21:23:05 NZST

This archive was generated by hypermail 2.4.0 : Wed Nov 08 2023 - 11:53:42 NZDT