Thanks to Joe Fletcher, Mark Myszkowski, Jim Lola, Nikola Milutinovic, Whitney
Latta, Dr. Thomas Blinn (Thomas.Blinn_at_Compaq.com) and alan
(alan_at_nabeth.cxo.dec.com).
-----Original Message-----
Hi Friends,
I had a problem that restarted my Alpha/Tru64 4.0F. This is the
output from the UERF command. Does anyone know what happened? Can anyone
help me? I can send more info, I just don't know what info ...
Thanks for the help ...
********************************* ENTRY 115. *********************************
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 302. PANIC
SEQUENCE NUMBER 1433.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Jan 8 13:07:22 2001
OCCURRED ON SYSTEM bep4sp
SYSTEM ID x00080022
SYSTYPE x00000000
PROCESSOR COUNT 4.
PROCESSOR WHO LOGGED x00000001
MESSAGE panic (cpu 1): simple_lock: time limit exceeded
********************************* ENTRY 116. *********************************
----- EVENT INFORMATION -----
EVENT CLASS ERROR EVENT
OS EVENT TYPE 110. MACHINE STATE
SEQUENCE NUMBER 0.
OPERATING SYSTEM DEC OSF/1
OCCURRED/LOGGED ON Mon Jan 8 13:13:23 2001
OCCURRED ON SYSTEM bep4sp
SYSTEM ID x00080022
SYSTYPE x00000000
SYSTEM STATE x0003 CONFIGURATION
********************************* ENTRY 117. *********************************
----- EVENT INFORMATION -----
EVENT CLASS OPERATIONAL EVENT
OS EVENT TYPE 300. SYSTEM STARTUP
.
.
.
----------------------------------------------------------------------------
----------------------------------------------------
Hi,
Had the same thing myself the other day. Not sure of the cause but maybe
you can help me narrow it down. Is your machine a multi processor box? YES
Does it have any NFS2 mounted file systems (eg stuff served from linux)? NO
Were you running a compute-intensive parallel task when it crashed? YES
Joe
----------------------------------------------------------------------------
----------------------------------------------------
Are you running with the latest version of the patches for 4.0F? If not, I
would download them and then search for the string "simple_lock" in the
descriptions/README for the patches. If you are running with the latest patch
kit, then you should contact Compaq Support.
The patches can be found at:
http://ftp1.support.compaq.com/public/unix/v4.0f/
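The search suggested above can be done with grep, or with something like the following rough Python sketch. It assumes the patch kit has already been downloaded and unpacked into a local directory on a machine that has Python; the function name find_mentions and the directory argument are made up for illustration and are not part of any Compaq tool.

```python
import sys
from pathlib import Path


def find_mentions(root, needle="simple_lock"):
    """Yield (path, line_number, line) for every line under root containing needle.

    root is assumed to be a directory holding the unpacked patch kit
    descriptions/README files (hypothetical layout).
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps the scan going over non-text files
            text = path.read_text(errors="replace")
        except OSError:
            continue
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield path, number, line.strip()


if __name__ == "__main__":
    # Only run when a directory is given on the command line.
    if len(sys.argv) > 1:
        for path, number, line in find_mentions(sys.argv[1]):
            print(f"{path}:{number}: {line}")
```

Run it with the directory holding the unpacked kit as the only argument. Every hit is printed with its file name and line number, so the matching patch description can be read in context.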
Good luck.
Jim
----------------------------------------------------------------------------
----------------------------------------------------
Have you patched your system? This kind of bug has been seen in the patch
information.
Nix.
----------------------------------------------------------------------------
----------------------------------------------------
Good Day,
The panic "simple_lock: time limit exceeded" indicates one of the cpus
tried to acquire a spin lock that was being held by another cpu. The one
that is trying to acquire the spinlock will wait 15 seconds before it
determines there is a fatal problem and calls panic() to crash the
system.
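
To illustrate the idea (this is only a user-level sketch in Python, not how the Tru64 kernel implements simple locks), the waiter below spins on a lock and gives up once a time limit passes. The names spin_acquire and SimpleLockTimeout are invented for the example; the 15 second default is the figure mentioned above.

```python
import threading
import time


class SimpleLockTimeout(Exception):
    """Stands in for the point where a kernel would call panic()."""


def spin_acquire(lock, limit_seconds=15.0):
    """Busy-wait for lock; raise SimpleLockTimeout once limit_seconds have passed."""
    deadline = time.monotonic() + limit_seconds
    # Non-blocking attempts in a tight loop: the waiter never sleeps, it spins.
    while not lock.acquire(blocking=False):
        if time.monotonic() >= deadline:
            raise SimpleLockTimeout("simple_lock: time limit exceeded")


if __name__ == "__main__":
    lock = threading.Lock()
    lock.acquire()  # pretend another CPU holds the lock and never releases it
    try:
        spin_acquire(lock, limit_seconds=0.2)  # short limit so the demo ends quickly
    except SimpleLockTimeout as exc:
        print("panic:", exc)
```

The waiter cannot tell why the holder never let go; all it knows is that the limit passed. That is why the panic message alone does not identify the cause.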
Typically, this requires an in-depth analysis of the crash files to
isolate the problem; however, you may get an idea as to what area of
code was involved by looking at the crash-data file in /var/adm/crash
directory. In this directory you will find three files that together
form the set of "crash dump files". The three files in each set have the
same number appended to them. In the crash-data file, you will find a
stack trace that will reveal what function calls were being made that
resulted in this crash. If you edit that file, search for the string
"tset machine_slot[paniccpu].cpu_panic_thread:"
Following that, you will see the stack trace. If you post that trace, it
may reveal a known issue. If it is a readily identifiable issue, we can
direct you... if not, a crash analysis must be done by Compaq Unix
Support.
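
If the crash-data file has been copied to a machine with Python, pulling out the lines after that marker could look roughly like the sketch below. The function name lines_after_marker and the 40-line cutoff are assumptions made for illustration, since where the trace ends in the file is not specified here.

```python
import sys

# Marker string to search for, as described above.
MARKER = "tset machine_slot[paniccpu].cpu_panic_thread"


def lines_after_marker(lines, marker=MARKER, count=40):
    """Return up to count lines that follow the first line containing marker.

    count=40 is an arbitrary cutoff (assumption); adjust it to cover the trace.
    """
    lines = list(lines)
    for index, line in enumerate(lines):
        if marker in line:
            following = lines[index + 1 : index + 1 + count]
            return [entry.rstrip("\n") for entry in following]
    return []  # marker not found


if __name__ == "__main__":
    # Only run when a crash-data file is given on the command line.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for line in lines_after_marker(handle):
                print(line)
```

Pass the path of the crash-data file from /var/adm/crash as the only argument; the printed lines are the part worth posting.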
I hope this is helpful...
Regards,
Whitney Latta
----------------------------------------------------------------------------
----------------------------------------------------
If you really want to know what made the system panic, you need to
get into the /var/adm/crash directory and look at the crash-data
file that was created during the reboot.
All that shows up in your "UERF" output is that you had a simple
lock timeout failure. The output you posted and the rest of your
problem description don't even indicate what system model you
have or what version of software you are running, so no one can
say much more with any authority.
For what it's worth, a simple lock timeout is USUALLY a software
problem, but it can be caused by mis-behaving hardware. There are
lots of data structures inside the kernel that are accessible from
different CPUs in a multi-processor system, so there are locking
mechanisms to coordinate access. If one of the CPUs can't get the
simple lock (a "spin lock") for a particular data structure in a
reasonable amount of time (it should never take really long), then
it panics the system. That's what happened here. But you can't
tell WHY it happened -- that is, where the kernel was running when
the problem occurred -- because that information isn't recorded in
your binary error log file (so UERF can't report it).
Tom
----------------------------------------------------------------------------
----------------------------------------------------
Panics are software failures. This one is particular to
SMP lock handling. Check the Services web site for the
patch kits for your particular version and see if they
have a patch for the symptom. If not, call your country
support center and report the problem. If you don't have
a software support contract, they may charge per-call to
handle it.
Received on Tue Jan 09 2001 - 12:04:39 NZDT