Thanks a lot to everyone who replied so quickly, especially
Dr. Tom Blinn and Alan (I'm sorry, but I don't know his last name!).
Following their tips, I halted my box again and issued "show config",
and there was indeed some strange information about the processors:
...
Processors
CPU 0 Alpha 21264-? 500MHz 4MB cache
CPU 1 Alpha 21264-? 500MHz 4MB cache
...
Since I had no idea how to fix that information, I updated the
firmware to 5.7 and after that "show config" returned:
...
Processors
CPU 0 Alpha EV6 pass 2.5 500MHz 4MB cache
CPU 1 Alpha EV6 pass 2.5 500MHz 4MB cache
...
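For anyone who wants to check saved console output for the same symptom,
here is a minimal sketch in Python. It assumes the CPU lines look exactly
like the ones quoted above; the function name and the regular expression
are my own illustration, not part of any Compaq tool.

```python
import re

# Hypothetical pattern for lines such as
# "CPU 0 Alpha 21264-? 500MHz 4MB cache" as quoted in this message.
CPU_LINE = re.compile(r"^\s*CPU\s+(\d+)\s+(.+?)\s+(\d+)\s*MHz", re.IGNORECASE)


def unidentified_cpus(show_config_text):
    """Return the CPU numbers whose model string contains a '?'."""
    flagged = []
    for line in show_config_text.splitlines():
        match = CPU_LINE.match(line)
        # A '?' in the model string means the firmware could not name the chip.
        if match and "?" in match.group(2):
            flagged.append(int(match.group(1)))
    return flagged


before = """Processors
CPU 0 Alpha 21264-? 500MHz 4MB cache
CPU 1 Alpha 21264-? 500MHz 4MB cache"""

after = """Processors
CPU 0 Alpha EV6 pass 2.5 500MHz 4MB cache
CPU 1 Alpha EV6 pass 2.5 500MHz 4MB cache"""

print(unidentified_cpus(before))  # [0, 1]
print(unidentified_cpus(after))   # []
```

With the old firmware both CPUs are flagged; after the update neither is.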
Thanks again!
---------------------------------------------------------------
From Alan (alan_at_nabeth.xxxxxxx)
-------------------------------
I'm not sure exactly, but it seems to be saying that there
was some sort of hardware machine check, which would have
been correctable had you not been using a pass 1 EV6 CPU
chip. EV6 is the current design generation for Alpha
chips.
It may be that you have a prototype system which has
pass 1 chips, or that all CPUs in the speed class that
you have are pass 1 and this is a general restriction. Or,
it could be that you have a system where errors such as
this are correctable, but the operating system has mis-
identified the CPU generation.
I'd suggest logging a service call, since the root problem
is some sort of machine check. If you have a version of
DECevent or Compaq Analyze that supports this system, you
might want to run it and see what errors were generated
before the kernel decided that it couldn't handle the
error.
---------------------------------------------------------------
From Dr. Tom Blinn
------------------
The "kn600_softerr_intr()" message means that you were running in the
platform-specific code that deals with ES40 and similar systems (that
are called "KN600" system types), and you got an interrupt from the
system hardware that said there was a "soft" or correctable error, but
the platform code believes you have a "pass 1" EV6 processor in your
system, and it reports that it doesn't know how to correct an error of
this type for the EV6 pass 1 chips.
There are two possibilities here:
1) You really do have EV6 pass 1 CPU chips in your system, and what
you are seeing reported is a correct message.
2) You have newer chips in your system, but the system software isn't
correctly identifying your CPU types and is making the wrong choice.
There is a third possibility, which is that you have newer chips in
your system, and they are correctly reporting a soft error, but the
system software was never updated to handle such errors correctly,
and it's just a bug in V4.0F. However, I've taken a quick look at
the code, and it looks like it does check specifically for pass 1
vs. later CPU chips, so I bet you really do have a system with the
oldest pass 1 chips (did you get an early ship machine?). For what
it's worth, the comments say "panic system on EV6 pass 1, It does not
correct single bit errors" -- which I believe are single bit memory
errors, which should be corrected by the ECC logic. In any case,
you do not want to have single bit errors go uncorrected; you need
to get your hardware fixed.
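
The decision described above can be modelled roughly as follows. This is
only an illustrative sketch in Python, not the Tru64 kernel source: the
function name, the integer "pass" argument and the returned strings are
all assumptions made for the example. The only behaviour taken from the
message is that a soft error on a pass 1 chip leads to a panic, while
later chips are handled differently.

```python
def soft_error_action(cpu_pass):
    """Hypothetical model of the pass 1 check described in the message.

    cpu_pass is the EV6 revision as the software sees it (1 for pass 1,
    2.5 for "pass 2.5", and so on). The return values are labels invented
    for this sketch, not real kernel messages.
    """
    if cpu_pass <= 1:
        # Quoted comment: "panic system on EV6 pass 1, It does not
        # correct single bit errors".
        return "panic"
    # Assumption: on later passes the error is corrected and logged.
    return "correct-and-log"


# If the software believes the chip is pass 1, a soft error is fatal.
print(soft_error_action(1))    # panic
# Once the chip is identified as pass 2.5, the same error is survivable.
print(soft_error_action(2.5))  # correct-and-log
```

The point of the sketch is that the outcome depends on what the software
believes the CPU revision to be, so a misidentified chip can turn a
correctable error into a panic.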
Clearly, something your Oracle processes are doing triggers this, but
that doesn't mean there's anything wrong with Oracle.
I believe that if you get into the console firmware and issue a "show
config" command, the output might show you whether you have EV6 pass 1
chips or later chips. If all your CPU chips are newer than pass 1, we
have a serious bug in our product.
In any case, if you have a support contract for your system with our
Compaq Services, PLACE A SUPPORT CALL.
---------------------------------------------------------------
Best Regards
-----
Andre Pinho
Analista de Suporte - Maxitel S.A. / DTI-TIPD
71-9149-9157 / 71-340-3255 / Fax: 71-340-3205
Received on Tue Oct 24 2000 - 22:46:11 NZDT