SUMMARY: tb_shoot ack timeout

From: Peyton Bland <bland_at_umich.edu>
Date: Tue, 09 Jan 2001 17:31:13 -0500

Hi,

The original question...
==========
We are running 4.0F on a DS-20 and recently crashed with a "tb_shoot ack
timeout" panic from CPU 0. The archives contained a similar question from
June 1999, but I didn't find much info that helped me diagnose this.
Another posting I found was in German (not my native language!) which
mentioned lockprim.o (and simple_lock) and ambiguity over whether it was a
software or hardware problem.
==========

Almost everyone suggested patches; a few also suggested swapping memory or CPUs.

My thanks to those who replied...
-----
"Nick Leonard" <nickl_at_poole-tr.swest.nhs.uk> (the user whose posting
appears in the June 1999 archives mentioned above)...
We had a 2100 running 3.2c and the problem was initially diagnosed as a
hardware error (replaced all CPUs and RAM to no effect). The system
crashed anywhere from weekly to monthly with this error. In the end it was
diagnosed as a software error and a Unix patch was applied and that was the
end of it. We are now on 4.0d with the same machine and have never had a
problem with it again.
I advise Compaq support and a patch; it worked for us.
-----
A co-worker translated the German posting mentioned above...
Four crashes with "tb_shoot ack timeout" and "kernel memory fault". After
Digital could not be dissuaded from the view that this was a hardware and
not a software problem, a patch was issued (lockprim.o). However, this only
changed the error message at crash time ("simple_lock: time limit
exceeded"). Only after a second patch (again lockprim.o) also proved
ineffective did Digital exchange one CPU on suspicion and rotate the
remaining ones cyclically. This caused several "CPU Exceptions" and
another crash, this time unambiguously caused by CPU #1. After this CPU
was exchanged, the problem was finally resolved.
-----
Joe Fletcher <joe_at_meng.ucl.ac.uk>
I recently had an ES40 crash with a simple_lock timeout panic. Looking
through the readmes in the patch kits there are references to this but they
seem to relate primarily to TruCluster (which I don't run). In my case the
crash appeared to be due to a combination of running a compute-intensive
parallel task which used all 4 CPUs and a Linux-based NFS server acting up
at the same time. The Linux box served two filesystems to the ES40 but it
isn't that reliable. It dropped the link and when it came back the box
crashed. As I say I was hammering the ES40 at the time so it may have been
coincidence but I'm inclined to think there's a connection. BTW the ES40
is running 4.0F PK4, 4x 500MHz, 4GB RAM, DEGPA NIC.
-----
Judith Reed <jreed_at_appliedtheory.com>
We had this error - here is what I recorded at the time:
                 panic (cpu 0):
                 tb_shoot ack
                 timeout
                 firmware update to v5.8 was recommended but
                 it was noted that 99% of time hardware related?
                 firmware update didn't fix it, cpu1 was
                 replaced - this didn't fix it,
                 memory was replaced, problem went away.
This was on a GS140 running 4.0F with 4 CPUs, 8 GB memory.
-----
John P Speno <speno_at_isc.upenn.edu>
Did you check the readmes in the latest 4.0F patch kits? Grep for your
panic string in the docs/txt directory.
-----
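For anyone without grep to hand, John's suggestion can be sketched as a
short script. This is only an illustration: the function name
find_panic_mentions and the directory you pass in are assumptions, and the
real location of the readme text files depends on where the patch kit was
unpacked.

```python
import os


def find_panic_mentions(root, needle):
    """Return (path, line_number, line) for each line under root containing needle.

    The match is case-insensitive. 'root' is whatever directory holds the
    patch kit readme text files (hypothetical; adjust to your own layout).
    """
    hits = []
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps odd bytes in old readmes from aborting the scan
                with open(path, errors="replace") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if needle in line.lower():
                            hits.append((path, lineno, line.rstrip()))
            except OSError:
                # Unreadable file: skip it and keep searching
                continue
    return hits
```

Calling find_panic_mentions(readme_dir, "tb_shoot") and then repeating the
search with "simple_lock" covers both panic strings mentioned in the replies.
-----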
Richard Tame <richard.tame_at_compaq.com>
I have a call open with Compaq Australia with the same symptoms. I also had
a simple_lock crash too. My DS20 is running V4.0F Patch 2. The advice I was
given was get on Patch 4 and see if it happens again. The software
specialist I spoke to has a document which detailed both of these crashes
explicitly in one of the patches.
-----
And a special thanks to Tom Blinn for his very thorough and educational
answer (Dr. Thomas.Blinn_at_Compaq.com <tpb_at_doctor.zk3.dec.com>)

You have a dual CPU system. If one CPU (say CPU 0) is running kernel
code that modifies the memory mapping in a way that invalidates the
translation lookaside buffer ("TLB" or just "TB") on the other CPU, it
has to post a request (a hardware interrupt, really) to invalidate the
TLB on the other CPU, and the other CPU has to acknowledge that it has
done so. This is really done, I believe, through the PALcode routines
that are embedded in the SRM console; that is, the kernel calls into
the PAL and the PAL does the function and then returns to the kernel.
If the request to the other CPU(s) isn't serviced in a timely manner,
you get a "tb_shoot ack timeout" condition, and since this shouldn't
ever happen, the kernel code quite literally panics (and shuts down
the system), because there is no way to keep operating reliably.

Yes, it can be load related. I believe (but would have to investigate to
be sure) that the other CPU(s) would have to be running uninterruptibly
(e.g., executing in the PAL) to make this happen, so it is likely to
be really rare.

It's possible that it's a software bug (look for patch kits that have
a fix for this) or a PAL problem (update to current firmware). If I
had a DS20 (I've got a DS20E, but it's a different beast) I'd want to
get to V4.0G, which I would expect to be more stable than V4.0F, but
that's just my preference.
-----
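The request/acknowledge handshake Tom describes can be illustrated with a
toy model. This is a sketch using ordinary threads, not Tru64 kernel or
PALcode; the names Cpu, tb_shoot and demo, the toy TLB entry, and the
timeout values are all invented for illustration. A "CPU" that never
acknowledges stands in for one that is running uninterruptibly.

```python
import threading


class Cpu:
    """Toy CPU with a private TLB and a request/ack mailbox (hypothetical model)."""

    def __init__(self, cpu_id, responsive=True):
        self.cpu_id = cpu_id
        self.responsive = responsive
        self.tlb = {0x1000: 0x8000}        # one made-up virtual -> physical entry
        self.request = threading.Event()   # stands in for the inter-CPU interrupt
        self.ack = threading.Event()       # set once the TLB has been invalidated
        self.stop = threading.Event()

    def run(self):
        while not self.stop.is_set():
            if self.request.wait(timeout=0.01):
                self.request.clear()
                if self.responsive:
                    self.tlb.clear()       # invalidate stale translations
                    self.ack.set()         # tell the requester we are done
                # An unresponsive CPU never acknowledges the request.


def tb_shoot(targets, timeout):
    """Ask every target CPU to flush its TLB; fail if any ack is late."""
    for cpu in targets:
        cpu.ack.clear()
        cpu.request.set()
    for cpu in targets:
        if not cpu.ack.wait(timeout):
            raise RuntimeError("tb_shoot ack timeout (cpu %d)" % cpu.cpu_id)


def demo(responsive):
    """Run one shootdown against CPU 1 and report the outcome as a string."""
    cpu1 = Cpu(1, responsive)
    worker = threading.Thread(target=cpu1.run, daemon=True)
    worker.start()
    try:
        tb_shoot([cpu1], timeout=0.2)
        return "ok"
    except RuntimeError as err:
        return str(err)
    finally:
        cpu1.stop.set()
        worker.join()
```

In this model demo(True) returns "ok", while demo(False) returns the
timeout message. A real kernel panics at that point instead of raising an
exception, since it cannot know whether the other CPU still holds stale
translations.
-----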

Thanks again to all who responded.

Peyton Bland

University of Michigan, Radiology
voice: 734-647-0849
FAX: 734-764-8541
e-mail: bland_at_umich.edu
URL: http://www.med.umich.edu/dipl/
Received on Tue Jan 09 2001 - 22:32:37 NZDT
