System panic from Jim Fitzmaurice on 2002-09-13 (tru64-unix-managers)

From: Jim Fitzmaurice <jpfitz_at_fnal.gov>
Date: Thu, 12 Sep 2002 08:48:36 -0500

This is a 4100 running Tru64 v5.1 (PK-5), part of a 3-member cluster
running TruCluster v5.1. The multiple security patch,
T64V51B19-C0136901-15143-ES-20020817, was rolled in early yesterday morning
without significant problems. (I always have a minor problem switching
because clu_upgrade is not Kerberos friendly, and we run Kerberos.) This
morning the system experienced the following error/panic and rebooted:

Sep 12 07:57:30 d0ola vmunix: rmerror_int: failover: mchan0 error_type = 0xe0000004 error_count = 0x1 time = 0x479183d808cb4
Sep 12 07:57:30 d0ola vmunix: mcerr = 0x12020008 lcsr = 0xc07b mcport = 0x16440000
Sep 12 07:57:30 d0ola vmunix: rm_crash_node_mask: caller = 0xfffffc00006e14d0, nodes_to_crash = 0x10, time = 0x479183d808cb4
Sep 12 07:57:30 d0ola vmunix: panic (cpu 0): rm_lock_global_error: no good rail or can't get locks
Sep 12 07:57:30 d0ola vmunix: rmerror_int: dismissed because of panic
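
For anyone comparing these values against their own logs, here is a rough
Python sketch that pulls the "name = 0x..." pairs out of the messages above
and lists which bits are set in nodes_to_crash. The helper names
(parse_fields, set_bits) are made up for illustration, and how the bit
positions map to cluster member IDs is an assumption I have not confirmed.

```python
import re

# The three field-bearing messages from the panic, one record per line.
LOG = """\
Sep 12 07:57:30 d0ola vmunix: rmerror_int: failover: mchan0 error_type = 0xe0000004 error_count = 0x1 time = 0x479183d808cb4
Sep 12 07:57:30 d0ola vmunix: mcerr = 0x12020008 lcsr = 0xc07b mcport = 0x16440000
Sep 12 07:57:30 d0ola vmunix: rm_crash_node_mask: caller = 0xfffffc00006e14d0, nodes_to_crash = 0x10, time = 0x479183d808cb4
"""

# Matches pairs such as "mcerr = 0x12020008".
FIELD = re.compile(r"(\w+)\s*=\s*(0x[0-9a-fA-F]+)")


def parse_fields(text):
    """Return {name: int} for every 'name = 0x...' pair.

    If a name repeats (here 'time'), the last value seen wins.
    """
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def set_bits(mask):
    """Return the positions (0 = least significant) of the bits set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


fields = parse_fields(LOG)
for name, value in fields.items():
    print(f"{name:15s} {value:#x}")

# Assumption: nodes_to_crash is a bitmask; what each bit means is not
# stated anywhere in the log itself.
print("bits set in nodes_to_crash:", set_bits(fields["nodes_to_crash"]))
```

This only decodes the numbers; it says nothing about what the Memory
Channel error registers (mcerr, lcsr, mcport) actually mean.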

The strange thing about this is that the cluster is the NFS/NIS server for
our network, and at the exact moment this system panicked, two Linux-based
NFS/NIS clients locked up. They had to be hard-booted to get them back up;
one initially had problems mounting NFS drives, and the other came up with
its time skewed.

I haven't seen this error before. Has anyone else? And how could it
affect clients of a 3-member cluster where two of the members are just fine?

James Fitzmaurice
D0 Online Systems Manager
Fermi National Accelerator Laboratory
(630) 840-4011
jpfitz_at_fnal.gov

UNIX is very user friendly. It's just very particular about who it makes
friends with.
Received on Thu Sep 12 2002 - 13:48:50 NZST
