SUMMARY: Problems with Alpha255 after upgrading to DU4.0

From: Devdas V <das_at_rri.ernet.in>
Date: Mon, 16 Jun 1997 20:38:16 +0530 (IST)

        Dear OSF-Managers,

                Our problem is still unsolved. I received a few replies
        suggesting that I should try Open3D. I couldn't try that because
        we don't have a license for that. Dr Tom Blinn gave a lot of
        suggestions, which were the most useful. I tried upgrading from
        DU4.0 to DU4.0B but the problems still persist. The machine works
        fine without CDE/XDM. We are now convinced that we have a serious
        problem with either the motherboard or the graphics card and that
        this shows up only with DU4.0X. The local Digital support has
        agreed to replace either the graphics card or the motherboard.
        We are waiting for the replacement.

        Thanks to:
        
        Mike Epstein <mjepst_at_hcs.harvard.edu>
        Ib Koersner <koersner_at_tsl.uu.se>
        Kurt Carlson <sxkac_at_java.sois.alaska.edu>
        Ernie Bisson <bisson_at_aesir.mit.edu>
        Steve McFadden <smcfa.hcia.com>
        Kjell Andresen <kjell.andresen_at_usit.uio.no>
        Jie Gao <jgao.csu.edu.au>
        and
        Dr Tom Blinn <tpb_at_zk3.dec.com>

Here's my query:
>
> Dear OSF Managers,
> I have landed myself in serious trouble after
> upgrading our AlphaStation 255 4/232 to DU4.0 from DU3.2D-1. The machine
> originally had firmware 6.0-943. We followed the installation guide exactly
> and the installation went through without any hitch. After the installation the
> machine has been rebooting very often - sometimes even with no users logged in.
> The machine was functioning normally with DU3.2D-1.
> The error logger logs the following error:
>
> 'panic (cpu0): Machine check - _Hardware error'
> Sometimes it produces a crash dump which I have attached.
>
> Suspecting the firmware, I downloaded the latest firmware (v6.4)
> and updated the firmware. After updating the firmware the reboots still take
> place but with a minor difference - nothing is ever logged in the error
> logger and no crash dumps are produced. But we get a one line message on
> the Console - 'Machine check in pal mode'. After this the machine reboots.
> I have also reinstalled the OS again but the problem persists.
>
> We also have one more problem - Whenever a user tries
> to change the desktop colors in CDE, the CPU panics and the machine hangs.
> I tried going back to DU3.2B but I end up with other problems (the network
> adapter is not found). So I have gone back to DU4.0. The local Digital
> people are also at a loss to explain what is happening. They are
> suggesting that I upgrade to DU4.0B instead. Looks like I have no
> option other than upgrading to DU4.0B.
>
> We also have a number of DEC3000s and Alpha500 which are
> also running DU4.0 without any such serious problems. We have one more
> Alpha255 running DU4.0 which also exhibits the same phenomenon but less
> frequently. This machine has the original firmware V6.0-943. I have also
> noticed one more thing - The machine doesn't reboot if I stop CDE/XDM but
> it is not of much use as the Console can't be used.
>
> Meanwhile could someone please help me out in figuring
> out what the problem is ? I am attaching one of the crashdc outputs.
>
> Thanks in advance,
>
> Devdas
> System Administrator
> Raman Research Institute,
> Bangalore
> India
> email: das_at_rri.ernet.in
>

The suggestions:

Dr Tom Blinn wrote:


> I am very sorry for sending a mime encoded attachment. I
> am appending the crash dump in text format. Please have a look at it. I
> find it difficult to believe that I have broken hardware as the machine
> was functioning normally with DU3.2D-1. Others have suggested that I
> should install Open3D server as a solution. Meanwhile the local DEC guys
> have given me DU4.0B which I am installing now.
>
> Regards,
> Devdas
> das_at_rri.ernet.in

The fact that a machine was "functioning normally" doesn't mean that it did
not break. It could have been broken all along, but the older software did
not exercise the broken part, and the new software does. Then it dies.

> Here's the crashdc output.

Your system reports this graphics option:

> tga0 at pci0 slot 13
> tga0: depth 8, map size 2MB, 1280x1024
> tga0: ZLXp2-E, Revision: 34

You should NOT need Open3D to support an 8 plane ZLXp2-E. The option is
fully supported by the base operating system. But the details of how the
support works have changed between V3.2D and V4.0.

> Alpha PC machine check type 0x660.
> Machine check abort

This is a typical symptom of a hardware (or PALcode) malfunction.

> panic (cpu 0): Machine check - Hardware error

That message means the hardware malfunctioned. That's why it's called a
"Machine check" and labelled a "Hardware error". The current PID was 1227
which is Xdec; it's possible that a bug in the X server is exercising some
problem in your system's hardware that wasn't exercised before, and that is
making the system panic. If that's the case, then V4.0A or V4.0B might not
fail in the same way. The last couple of routines called just before the
system died with a hardware error were ws_set_dpms_on() which turns on the
power management for the graphics subsystem, and tga2_set_get_power_level()
which does the actual device specific power management for your ZLXp2-E.

This may well be new functionality in V4.0 that wasn't in V3.2D, and if your
system was always broken for this function but the function was never done
in the older software (I'd have to dig out the V4.0 new features list to be
sure that we implemented power management for the first time in V4.0, but
that is my recollection), then you'd first see the problem in V4.0 when you
upgraded.
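
For anyone who wants to pull the same clues out of a saved crashdc report, here is a minimal sketch. It assumes the report has been saved as a plain text file; the function name crash_clues and the pattern list are made up for illustration and are not part of Digital UNIX. It only prints the lines that mention the panic, the machine check, or the graphics (tga/ws_set_dpms) routines, so you still have to read the surrounding context yourself.

```shell
#!/bin/sh
# crash_clues FILE
# Print, with line numbers, the lines of a saved crashdc text report that
# mention the panic, the machine check, or the graphics routines.
# Hypothetical helper: the name and the pattern list are assumptions.
crash_clues() {
    report=$1
    if [ ! -r "$report" ]; then
        echo "crash_clues: cannot read $report" >&2
        return 1
    fi
    # -n: show line numbers, -i: ignore case, -E: extended regex alternation
    grep -n -i -E 'panic|machine check|tga|ws_set_dpms' "$report"
}
```

Call it with the path to wherever your report was saved, for example crash_clues ./crashdc-output.txt (that file name is only an example).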

Here are a couple of things you can try to pin this down:

1) Bring the system up to single user mode. Then, at the single user
prompt, bring the system up to run level 2 (# init 2). This won't run
the X server. If the system comes up to run level 2 and you can log in
on the console as root, that suggests there may be a problem with using
the graphics hardware through the X server.

2) Connect a serial terminal to the COM1 port. Shut down the system and
halt it. On the serial terminal (set to 9600 baud, 8 data bits, no parity,
1 stop bit) type a return. If you see a ">>>" prompt, type in the "set
console serial" command followed by "init". Now bring the system up to the
single user mode (boot -fl s) and run bcheckrc, then cd to the /sbin/rc3.d
directory and find the xlogin startup script. Change the name to begin with
"no-" (so you can easily change it back later). Now the system won't start
the X server. Type a control-D to bring the system up to full multi-user
mode and verify that the network starts, etc.

3) If both of the above work fine, then try this logged in as root on the
serial line:

        /sbin/init.d/xlogin start

and see if the system panics. If it does, it's a problem with starting the
X server.

4) Re-seat the TGA card. If you've got a different graphics card that you
can use for testing, try a different graphics card. Try moving the TGA card
to a different PCI slot (then boot with the genvmunix kernel, because you'd
need to rebuild your target kernel to find the card).
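
The rename in step 2 can be sketched as a pair of small shell functions. This is only an illustration: the function names, the "*xlogin*" match pattern, and the optional directory argument are assumptions, and it expects a POSIX-style shell. It renames whatever xlogin startup script it finds in the directory so the name begins with "no-", and can undo the rename later.

```shell
#!/bin/sh
# disable_xlogin [DIR] - rename the xlogin startup script in DIR so that its
#                        name begins with "no-" and it is skipped at boot.
# enable_xlogin  [DIR] - undo that rename.
# DIR defaults to /sbin/rc3.d, the directory named in step 2. The function
# names and the "*xlogin*" pattern are assumptions for illustration.
disable_xlogin() {
    dir=${1:-/sbin/rc3.d}
    for f in "$dir"/*xlogin*; do
        [ -f "$f" ] || continue          # glob matched nothing
        name=`basename "$f"`
        case $name in
            no-*) ;;                     # already disabled, leave it alone
            *)    mv "$f" "$dir/no-$name" ;;
        esac
    done
}

enable_xlogin() {
    dir=${1:-/sbin/rc3.d}
    for f in "$dir"/no-*xlogin*; do
        [ -f "$f" ] || continue          # nothing to restore
        name=`basename "$f" | sed 's/^no-//'`
        mv "$f" "$dir/$name"
    done
}
```

Run disable_xlogin before the control-D in step 2, and enable_xlogin once testing is finished. Passing a scratch directory as the argument lets you try the functions without touching the real startup directory.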

Tom
 
 Dr. Thomas P. Blinn, UNIX Software Group, Digital Equipment Corporation
  110 Spit Brook Road, MS ZKO3-2/U20 Nashua, New Hampshire 03062-2698
   Technology Partnership Engineering Phone: (603) 881-0646
    Internet: tpb_at_zk3.dec.com Digital's Easynet: alpha::tpb
     ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net

------------
Jie Gao wrote:

There was one bad inlay image in the set of background colours which
caused the freeze, according to experience here. A subsequent change of
the set stopped the phenomenon. People wonder why one single background
colour could disable the whole machine.

Jie

------------
Kjell Andresen wrote:


Hello!

We've had similar problems 'panic (cpu0): Machine check - _Hardware error
kernel memory fault' with one of our machines and got a special patch from
Digital.
We still experience problems with the machine though and with the same
message, but according to Digital the source is not the same.

So I advise you to report it to Digital.

Sorry for the delay in my reply.

The colour-problem in v4.0x is fixed by installing the Open3D-kit!!
We have only experienced this trouble with the 255s.

-------------

Kurt Carlson wrote:


> I have landed myself in serious trouble after
>upgrading our AlphaStation 255 4/232 to DU4.0 from DU3.2D-1.

I would strongly recommend v4.0b over v4.0. Also, I'd recommend
getting the latest patch kit (-003). Mostly what you get from
these are bug fixes, including fixes for many panics.

> 'panic (cpu0): Machine check - _Hardware error'
>Sometimes it produces a crash dump which I have attached.

This is almost always the result of a hardware error.
Did anything show in the error log?

Upgrading the firmware, as you did, might help.

> Suspecting the firmware, I downloaded the latest firmware (v6.4)
>and updated the firmware. After updating the firmware the reboots still take
>place but with a minor difference - nothing is ever logged in the error
>logger and no crash dumps are produced. But we get a one line message on
>the Console - 'Machine check in pal mode'. After this the machine reboots.
>I have also reinstalled the OS again but the problem persists.

This again points to a hardware problem.

> We also have one more problem - Whenever a user tries
>to change the desktop colors in CDE, the CPU panics and the machine hangs.
>I tried going back to DU3.2B but I end up with other problems (the network
>adapter is not found). So I have gone back to DU4.0. The local Digital
>people are also at a loss to explain what is happening. They are
>suggesting that I upgrade to DU4.0B instead. Looks like I have no
>option other than upgrading to DU4.0B.

v4.0 had lots of bugs; v4.0b was the first release of v4 which was
relatively stable. As I stated above, also get the 003 patch kit.
You might as well get to a level where you have all known fixes applied.
kurt


Subject: Re: Problems with Alpha255 after upgrading to DU4.0

I forgot to mention in my previous message that I've seen one case where
a machine check was software induced: the "ping of death".
The general cause is an NT system issuing a large ping, and
it affected almost all varieties of UNIX (not just Digital).

This is fixed in v4.0b, and a patch is available for every other release
of Digital UNIX. If you are being hit by this, you can see
back-to-back panics as the offending ping can be lurking
until your system is back up. If you have not applied a
patch for this, do so. You would need to reapply it after
upgrading to v4.0.
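
As background on why such a ping counts as "large" (this arithmetic is not from the original exchange): an IPv4 datagram can be at most 65535 bytes, and an ICMP echo request adds an 8-byte ICMP header and a 20-byte IP header to the data. The sketch below assumes an IP header without options, and the function names are invented for illustration.

```shell
#!/bin/sh
# Size of the IPv4 datagram produced by an ICMP echo request carrying
# PAYLOAD data bytes: payload + 8-byte ICMP header + 20-byte IP header.
# Assumes no IP options. Function names are illustrative only.
IP_MAX_DATAGRAM=65535

echo_datagram_size() {
    echo $(( $1 + 8 + 20 ))
}

# Exit status 0 when the reassembled datagram would exceed the IPv4
# maximum, that is, when the ping is of the oversized kind.
is_oversized_ping() {
    [ "`echo_datagram_size "$1"`" -gt "$IP_MAX_DATAGRAM" ]
}
```

For example, is_oversized_ping 65510 succeeds, because 65510 + 28 = 65538 bytes, while is_oversized_ping 65507 fails, because that payload gives exactly 65535 bytes.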

  kurt

-------------------

Ib Koersner wrote:

> This looks like the same problem we had. It seems to be solved
> by installing open3d. The advice came from DEC.

> Ib Koersner
> The Svedberg Lab
> Uppsala
> Sweden


-------------------
Ernie Bisson wrote:



>Dear OSF Managers,
> I have landed myself in serious trouble after
>upgrading our AlphaStation 255 4/232 to DU4.0 from DU3.2D-1. The machine
>originally had firmware 6.0-943. We followed the installation guide exactly
>and the installation went through without any hitch. After the installation the
>machine has been rebooting very often - sometimes even with no users logged in.
>The machine was functioning normally with DU3.2D-1.
>The error logger logs the following error:
>
> 'panic (cpu0): Machine check - _Hardware error'
>Sometimes it produces a crash dump which I have attached.

I recently converted an AlphaStation 255 from VMS to DU4.0A.
It too was crashing periodically and had strange CDE problems.
The solution was to install Open3D.

Ernie

---------------
Steve McFadden wrote:

Steve McFadden
06/09/97 04:35 PM
Your solution might lie in the Open3D drivers for the video card in the
AlphaStation 255. Try getting the version corresponding to the OS. I
believe Open3D version 4.0 works with DU4.0, and version 4.2 works with
DU4.0B.
- SDM
Received on Mon Jun 16 1997 - 17:35:05 NZST
