SUMMARY: Can anyone help me ?!? PANIC ?!?

From: Marcelo Fiuza <marcelo.fiuza_at_intelig.net.br>
Date: Tue, 09 Jan 2001 10:01:59 -0300

Thanks to Joe Fletcher, Mark Myszkowski, Jim Lola, Nikola Milutinovic, Whitney
Latta, Dr. Thomas Blinn <Thomas.Blinn_at_Compaq.com> and
alan <alan_at_nabeth.cxo.dec.com>.
 
 
-----Original Message-----

Hi Friends,

        I had a problem that restarted my Alpha/Tru64 4.0F system. This is the
output from the UERF command. Does anyone know what happened? Can anyone
help me? I can send more info, I just don't know what info ...


Thanks for the help ...



********************************* ENTRY 115. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 1433.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Jan 8 13:07:22 2001
OCCURRED ON SYSTEM bep4sp
SYSTEM ID x00080022
SYSTYPE x00000000
PROCESSOR COUNT 4.
PROCESSOR WHO LOGGED x00000001
MESSAGE panic (cpu 1): simple_lock: time limit exceeded

********************************* ENTRY 116. *********************************

----- EVENT INFORMATION -----

EVENT CLASS ERROR EVENT
OS EVENT TYPE 110. MACHINE STATE
SEQUENCE NUMBER 0.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Jan 8 13:13:23 2001
OCCURRED ON SYSTEM bep4sp
SYSTEM ID x00080022
SYSTYPE x00000000
SYSTEM STATE x0003 CONFIGURATION

********************************* ENTRY 117. *********************************

----- EVENT INFORMATION -----

EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
.
.
.

 
----------------------------------------------------------------------------
Hi,

Had the same thing myself the other day. Not sure of the cause, but maybe
you can help me narrow it down.

Is your machine a multi processor box? YES

Does it have any NFS2 mounted file systems (eg stuff served from linux)? NO

Were you running a compute-intensive parallel task when it crashed? YES

Joe

 
----------------------------------------------------------------------------
Are you running with the latest version of the patches for 4.0F? If not, I
would download them and then search for the string "simple_lock" in the
descriptions/README for the patches. If you are running with the latest
patch kit, then you should contact Compaq Support.
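
As a rough sketch of the search suggested above, the following Python walks a
directory of unpacked patch kit documentation and reports every line that
mentions "simple_lock". The helper name find_mentions and the directory you
pass to it are assumptions for illustration; the actual layout of the patch
kit is not described in this thread.

```python
import os

def find_mentions(root, needle="simple_lock"):
    """Return (path, line_number, line) for each line under root containing needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable file: skip it rather than abort the whole search.
                continue
    return hits
```

Calling find_mentions() on the directory where you saved the README files
(a hypothetical location of your choosing) returns a list you can print or
scan by eye before deciding which patch to read in full.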

The patches can be located at :


http://ftp1.support.compaq.com/public/unix/v4.0f/


Good luck.


Jim

 
----------------------------------------------------------------------------

Have you patched your system? This kind of bug has been mentioned in the
patch information.

Nix.

 
----------------------------------------------------------------------------
Good Day,

The panic "simple_lock: time limit exceeded" indicates that one of the CPUs
tried to acquire a spin lock that was being held by another CPU. The one
that is trying to acquire the spin lock will wait 15 seconds before it
determines there is a fatal problem and calls panic() to crash the
system.
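
To illustrate the behaviour described above, here is a toy user-space sketch
in Python. It is not the Tru64 kernel code. One thread holds a lock while
another spins trying to take it and gives up once a time limit passes. The
class name, the exception, and the short limits used when trying it out are
assumptions; the 15-second default only mirrors the figure quoted above.

```python
import threading
import time

class LockTimeLimitExceeded(RuntimeError):
    """Stands in for the kernel's panic() in this toy model."""

class ToySimpleLock:
    def __init__(self, time_limit=15.0):
        self._lock = threading.Lock()
        self.time_limit = time_limit  # seconds to spin before giving up

    def acquire(self):
        deadline = time.monotonic() + self.time_limit
        # Spin: retry a non-blocking acquire until it succeeds or time runs out.
        while not self._lock.acquire(blocking=False):
            if time.monotonic() >= deadline:
                raise LockTimeLimitExceeded("simple_lock: time limit exceeded")

    def release(self):
        self._lock.release()

def demo(hold_seconds, time_limit):
    """Return True if a second thread gets the lock, False if it times out."""
    lock = ToySimpleLock(time_limit)
    lock.acquire()                      # "CPU 0" takes the lock
    outcome = []

    def contender():                    # "CPU 1" spins waiting for it
        try:
            lock.acquire()
            outcome.append(True)
            lock.release()
        except LockTimeLimitExceeded:
            outcome.append(False)

    worker = threading.Thread(target=contender)
    worker.start()
    time.sleep(hold_seconds)            # holder keeps the lock this long
    lock.release()
    worker.join()
    return outcome[0]
```

With a short hold and a generous limit the second thread gets the lock; when
the holder keeps the lock longer than the limit, the waiter gives up, which is
the situation the panic message reports.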

Typically, this requires an in-depth analysis of the crash files to
isolate the problem; however, you may get an idea as to what area of
code was involved by looking at the crash-data file in the /var/adm/crash
directory. In this directory you will find three files that together
form the set of "crash dump files". The three files in each set have the
same number appended to them. In the crash-data file, you will find a
stack trace that will reveal what function calls were being made that
resulted in this crash. If you edit that file, search for the string

"tset machine_slot[paniccpu].cpu_panic_thread:"

Following that, you will see the stack trace. If you post that trace, it
may reveal a known issue. If it is a readily identifiable issue, we can
direct you; if not, a crash analysis must be done by Compaq Unix
Support.
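
The lookup described above can be sketched in Python as follows. The function
returns the lines that follow the marker string. The helper names, the line
cap, and the rule that a blank line ends the trace are all assumptions, since
the exact layout of the crash-data file is not shown in this thread.

```python
MARKER = "tset machine_slot[paniccpu].cpu_panic_thread"

def lines_after_marker(text, marker=MARKER, max_lines=40):
    """Return up to max_lines lines that follow the first line containing marker."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            trace = []
            for candidate in lines[index + 1:]:
                if not candidate.strip():
                    if trace:
                        break      # assumption: a blank line ends the trace
                    continue       # skip blank lines directly after the marker
                trace.append(candidate)
                if len(trace) >= max_lines:
                    break
            return trace
    return []  # marker not found

def read_trace(path):
    """Read a crash-data file and return the lines after the marker."""
    with open(path, "r", errors="replace") as handle:
        return lines_after_marker(handle.read())
```

You would call read_trace() with the path of the crash-data file from the set
you are interested in (the number appended to the file name depends on your
system) and post the returned lines.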

I hope this is helpful...

Regards,

Whitney Latta

 
----------------------------------------------------------------------------
If you really want to know what made the system panic, you need to
get into the /var/adm/crash directory and look at the crash-data
file that was created during the reboot.

All that shows up in your "UERF" output is that you had a simple
lock timeout failure. The output you posted and the rest of your
problem description don't even indicate what system model you
have or what version of software you are running, so no one can
say much more with any authority.

For what it's worth, a simple lock timeout is USUALLY a software
problem, but it can be caused by misbehaving hardware. There are
lots of data structures inside the kernel that are accessible from
different CPUs in a multi-processor system, so there are locking
mechanisms to coordinate access. If one of the CPUs can't get the
simple lock (a "spin lock") for a particular data structure in a
reasonable amount of time (it should never take really long), then
it panics the system. That's what happened here. But you can't
tell WHY it happened -- that is, where the kernel was running when
the problem occurred -- because that information isn't recorded in
your binary error log file (so UERF can't report it).

Tom

 
----------------------------------------------------------------------------

Panics are software failures. This one is particular to
SMP lock handling. Check the Services web site for the
patch kits for your particular version and see if they
have a patch for the symptom. If not, call your country
support center and report the problem. If you don't have
a software support contract, they may charge per-call to
handle it.

 
Received on Tue Jan 09 2001 - 12:04:39 NZDT
