SUMMARY: machine check while in PAL mode

From: Uwe Siodlaczek <siodlack_at_pit.physik.uni-tuebingen.de>
Date: Fri, 09 Apr 1999 16:03:12 +0200 (MET DST)

Hi managers

Thank you to all who replied:
  Donn Aiken <daiken_at_regents.edu>
  Whitney Latta <latta_at_decatl.alf.dec.com>
  Heater, Gene <Gene.Heater_at_echostar.com>
  Dr. Tom Blinn <tpb_at_doctor.zk3.dec.com>

Donn Aiken, who had a very similar problem, gave the following advice:

> 1. Went to the latest public release of the firmware (V5.3 for our 4100).
> 2. Updated the Alphabios to the same version (ours was out of sync).
> 3. Updated the kzpsa firmware to the A11 version.
> 4. Replaced the SCSI cable from our kzpsa controller to our raid array with
>    a better, shorter one (but not too short!).
> 5. Updated the PALcode to the same version as was on the public release
>    (which was 1.21-26)

Having done most of these things, I can now rule out the trivial causes.
Finally, I called Compaq H/W support.

Below is the most detailed answer, from Dr. Tom Blinn:

> Hardware. The PALcode (Privileged Architecture Library) is what runs to deal
> with the really low level hardware events, such as interrupts, arithmetic
> traps, and so forth (including memory management page faults). In each case
> it either processes the interrupt directly (e.g., it might handle a single
> bit memory error by ignoring it if it's been told to do so, as long as the
> error was corrected by the ECC code), or it reports the event to the kernel,
> through a transfer of control through a well-defined interface.
> When the PALcode sees a machine check (hardware fault), it's supposed to put
> a log frame (event logging) into memory and transfer control to the kernel,
> and then the kernel logs the event in the error log and either panics the
> system (if it's a fatal error) or returns control through the PALcode so that
> things keep running. Either way, the PALcode is involved for hardware faults,
> such as a "machine check" (which is a class of hardware fault).
> In this case, a machine check occurred WHILE THE PALcode WAS ALREADY RUNNING.
> For instance, if while an interrupt is being serviced, a machine check occurs
> and you're still running in the PALcode, you'd see this error.
>
> Since the frequency is increasing, that would suggest that whatever it is in
> the hardware that's failing is starting to fail more often. Get the hardware
> repaired and this problem should go away.
>
> Tom
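
As a rough illustration of the control flow Tom describes, here is a toy
model in Python. It is only a sketch, not real PALcode or kernel code: the
names Mode and handle_machine_check, and the outcome strings, are made up
for this example. The only things taken from this thread are the two normal
outcomes (log and panic, or log and resume) and the console halt message
seen when the fault arrives while the PALcode is already running.

```python
from enum import Enum


class Mode(Enum):
    """Where the CPU was executing when the machine check arrived (hypothetical names)."""
    KERNEL_OR_USER = "kernel/user"
    PAL = "PAL"


def handle_machine_check(mode: Mode, fatal: bool) -> str:
    """Return a description of what happens, following the explanation above."""
    if mode is Mode.PAL:
        # The fault hit while the PALcode itself was running (for example,
        # while it was servicing an interrupt). This is the case from the
        # original post: the CPU halts with this console message.
        return "halt: machine check while in PAL mode"

    # Normal case: the PALcode puts a log frame into memory and transfers
    # control to the kernel, which records the event in the error log.
    if fatal:
        return "kernel logs event, then panics"
    # Non-fatal: control returns through the PALcode and things keep running.
    return "kernel logs event, then resumes via PALcode"


if __name__ == "__main__":
    for mode in Mode:
        for fatal in (False, True):
            print(mode.value, "fatal" if fatal else "non-fatal", "->",
                  handle_machine_check(mode, fatal))
```

In this model the fatal flag makes no difference once the machine is already
in PAL mode, which matches the symptom reported: a halt rather than a kernel
panic or an error log entry.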

Uwe Siodlaczek

---------------------------------------------------------------------------
Original post:

Hello,

We have an AlphaStation 255/233 running Digital UNIX V4.0B (Rev. 564) with
Patch Kit-0008 (Firmware revision: 6.9, PALcode: OSF version 1.46), which
halts with

 Halted CPU
 halt code = 7
 machine check while in PAL mode
 PC=19370.

The frequency with which this happens seems to be increasing.
Hardware/Software Error?
Any hints? Thanks in advance.
Received on Fri Apr 09 1999 - 14:06:21 NZST
