SUMMARY: Bad memory board on AS 255

From: Rob McCauley <robmccau_at_RadOnc.Duke.EDU>
Date: Fri, 27 Mar 1998 10:58:48 -0500 (EST)

With apologies for the delay:

Thanks to Dr. Tom Blinn, who had the only reply to my question.

In brief, Dr. Blinn suggests that I either contact Digital and have them
fix the problem, or contact the vendor and have them do the same. I've
taken the latter option, and have sitting in front of me 4 new boards
which I will use to completely replace the originals. The full text of
his reply is included below.

I'd still prefer to be able to make better sense out of the error logs,
but no immediate solution presents itself.

I did find an interesting book on the Alpha AXP architecture (Alpha AXP
Architecture Reference Manual, Digital Press, ISBN 1-55558-145-5) which
helps quite a bit in getting a hardware-level understanding of what's
going on.

My original question, without the error logs, follows, along with Dr.
Blinn's reply.

Thanks again,

Rob McCauley


>> Managers,
>>
>> I have an AlphaStation 255 running DU4.0D with 256M of Dataram memory in
>> Bank0. One or more of the boards seems to be bad. My question is:
>> which one?
>>
>> I checked the archives and found the firmware memory test post, but it
>> didn't help. The command memory, in particular, fails saying that echo
>> and memtest aren't found. I upgraded the firmware, but this didn't help.
>>
>> I've also moved the memory from another 255 which is identical aside
>> from running an earlier version of the OS. The problem follows the
>> memory, and unfortunately I don't have another set of boards to swap one
>> at a time to trace the problem.
>>
>> I called Digital and they were able to run a script (which they weren't
>> willing to part with--of course I asked :>) which gave me the jumper
>> location. I replaced the board, and all was apparently well for a
>> while. It now seems that things are better, but not fixed, as if two
>> boards were bad and the worst has been replaced. These errors do cause
>> panics, so it's important that I get this fixed, and soon. Naturally
>> these crashes come during the big memory intensive jobs that you least
>> want to crash.
>>
>> The output of dia -o full is below for two of these events. The last is
>> the most recent. I'm sure the answer is there, and while I can make a
>> guess, I need a definitive answer, and better yet, directions on how to
>> derive it myself. I want to shuffle the boards once I know which is bad
>> and verify that the error moves with a single board.
>>
>> I'd also like to find in-depth information on Alpha hardware, whether in
>> digital or paper form. I searched Digital's web site and found a lot of
>> promotional materials, but nothing on the "nuts and bolts" level.
>> Recommendations welcome. :)
>>
>> Many thanks in advance, and I will post a summary.
>>
>> Rob McCauley, the .sig-less

Dr. Tom Blinn writes:

> If you have your system under a service contract, then perhaps you
> should be asking our services organization to repair it. If they tell
> you that it's not covered because of the Dataram memory, then call
> Dataram or the vendor who sold you the memory and ask them to repair
> it, or to loan you memory to use to "module swap" until it's fixed.

> The exact functionality of the console firmware is, alas, subject to change
> almost at the whim of the firmware groups. The systems have a limited
> amount of PROM for holding the firmware image; as new options or functions
> have to be added to the firmware (e.g., new things the operating systems
> expect the firmware to do that need more functional code), then things may
> get removed to free up space so the firmware will still fit in the PROM. I
> have no position on the benefits of this to you, the customer; you can draw
> your own conclusions.

> If you plan to do your own hardware maintenance, you should contact our
> services organization about available self-maintenance documentation and
> tools for your system. I don't know whether anything is available, but it
> is unlikely to be free. Things like the script that decodes the memory
> fault information to the FRU level may be viewed as an added value
> tool, and not be generally available. That is, it's not part of the
> basic product.

> You might also inquire, through your sales contact, about technical manuals
> for your system. There is often additional information available that may
> not be "free", e.g., posted on a Web site. There may be a charge to
> get the information you need, but I'm sure there is a manual somewhere
> that tells you (or anyone who has it) how to decode the bit level data
> -- all that the dia tool is doing in the output you posted is bit to
> text interpretation, but it doesn't necessarily tell you everything you
> want to know.

> Tom
>
> Dr. Thomas P. Blinn, UNIX Software Group, Digital Equipment Corporation
> 110 Spit Brook Road, MS ZKO3-2/U20 Nashua, New Hampshire 03062-2698
> Technology Partnership Engineering Phone: (603) 884-0646
> Internet: tpb_at_zk3.dec.com Digital's Easynet: alpha::tpb
> ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net
>
> Worry kills more people than work because more people worry than work.
>
> Keep your stick on the ice. -- Steve Smith ("Red Green")
>
> My favorite palindrome is: Satan, oscillate my metallic sonatas.
> -- Phil Agre, pagre_at_ucsd.edu
>
> Yesterday it worked / Today it is not working / UNIX is like that
> -- apologies to Margaret Segall
>
> Opinions expressed herein are my own, and do not necessarily represent
> those of my employer or anyone else, living or dead, real or imagined.



 
Received on Fri Mar 27 1998 - 17:00:07 NZST
