Thanks to all who replied.
The problem reappeared, so the faulty card was replaced.
Reg/
-----
As with any complex hardware, it is possible to have a transient or
intermittent problem be detected and reported and then corrected in
some way. You have to assume the error was real, but it may have
been a one time event. The "emx" driver makes heroic efforts to
keep the hardware working, and it probably reset the adapter to a
good state (this is sometimes possible and sometimes not; it
all depends on the nature of the fault) and kept going. But keep
an eye on the logs and if you get another instance, you probably
should get the hardware repaired or replaced; if you have a support
contract, file a formal support call.
Tom
Dr. Thomas P. Blinn + Tru64 UNIX Software + Hewlett-Packard Company
Internet: tpb_at_zk3.dec.com, thomas.blinn_at_compaq.com, thomas.blinn_at_hp.com
110 Spit Brook Road, MS ZKO3-2/W17 Nashua, New Hampshire 03062-2698
Alpha Hardware Platforms and I/O Phone: (603) 884-0646
ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net
-----
EMX Adapter Hardware Errors
Prior to version V2.00 of the emx driver, the system would panic if
the adapter reported a h/w error.
Starting with V2.00 of the emx driver, adapter reset functionality was
added. Instead of panicking, the driver resets the adapter in an attempt
to recover it, since most reported h/w errors are transient events.
If the adapter hangs in reset or fails to complete the reset correctly,
it is marked dead and removed from the running configuration. The
adapter will return to use at the next boot.
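To make the flow above easier to follow, here is a small Python model of it. This is only a sketch of the behaviour as described in this message, not the emx driver's code; the names handle_hw_error, AdapterState and reset_ok are made up for illustration.

```python
from enum import Enum


class AdapterState(Enum):
    RUNNING = "running"  # adapter recovered and back in service
    DEAD = "dead"        # removed from the running configuration until next boot


def handle_hw_error(driver_version, reset_ok):
    """Model the reaction to a reported adapter hardware error.

    driver_version: (major, minor) tuple, e.g. (2, 0) for V2.00.
    reset_ok: hypothetical flag, True if the adapter reset completed correctly.
    """
    if driver_version < (2, 0):
        # Before V2.00 a reported hardware error panicked the system.
        raise SystemError("panic: emx adapter hardware error")
    if reset_ok:
        # The reset recovered the adapter; outstanding I/O is retried.
        return AdapterState.RUNNING
    # The reset hung or did not complete: the adapter is marked dead.
    return AdapterState.DEAD
```

For example, handle_hw_error((2, 0), reset_ok=False) returns AdapterState.DEAD, which corresponds to the adapter staying out of service until the next boot.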
So with newer versions of the driver, hardware parity errors are mostly
a nuisance, since they force I/O retries after the adapter is reset.
Cases have been seen, however, where the adapter becomes wedged and
hangs, causing a system hang. More than a few parity or other errors
reported over a few days to a few weeks typically indicates a board
going bad.
The h/w errors which will cause an adapter reset include:
HW ERR:EBUS Parity Error
HW ERR:BBUS Parity Error
HW ERR:Host Bus(PCI) Error
HW ERR:Sequence Manager Fatal Error
HW ERR:BIU Fatal Error
HW ERR:ENDEC Fatal Error
HW ERR:Context SRAM Fatal Error
HW ERR:Buffer SRAM Fatal Error
The most common error is the BBUS Parity Error.
BBUS Parity Error
The adapter sets this error to indicate a parity error on its
internal BBus.
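If you want to keep an eye on these messages, something like the following Python sketch can tally them. It assumes only that the error strings listed above appear verbatim somewhere in each log line; the rest of the line format, and which log file you feed it, are assumptions you will need to check on your own system. The function name count_reset_errors is made up for illustration.

```python
from collections import Counter

# Error strings copied from the list above.
RESET_ERRORS = [
    "HW ERR:EBUS Parity Error",
    "HW ERR:BBUS Parity Error",
    "HW ERR:Host Bus(PCI) Error",
    "HW ERR:Sequence Manager Fatal Error",
    "HW ERR:BIU Fatal Error",
    "HW ERR:ENDEC Fatal Error",
    "HW ERR:Context SRAM Fatal Error",
    "HW ERR:Buffer SRAM Fatal Error",
]


def count_reset_errors(lines):
    """Count how often each reset-causing error string appears in the lines."""
    counts = Counter()
    for line in lines:
        for err in RESET_ERRORS:
            if err in line:
                counts[err] += 1
                break  # one error string per line is enough
    return counts
```

A typical use would be to open the system log file, pass the file object to count_reset_errors, and print the resulting counts, for example with counts.most_common().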
Robert Mclean
HPTC Support
HP Services Americas
Office 352-726-9087
Pager 352-268-0030
E-mail robert.mclean2_at_hp.com
-----
Call your service vendor and see if a single parity error
on the particular model HBA is grounds to have it replaced.
Systems, Storage subsystems and even I/O adapters are
designed to tolerate certain types of errors. If an
error is detected there are often recovery procedures
that the hardware uses to allow it to continue running.
A small number of correctable errors during the lifetime
of a device are to be expected.
If the problem is not correctable or the frequency of
correctable errors is too high, then the part should
be replaced.
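The replacement rule above can be sketched as a sliding-window check in Python. The thresholds here (more than 3 errors within any 14 days) are assumptions picked purely for illustration, not a vendor figure; your service vendor decides what "too high" means for a given HBA. The function name should_replace is also made up.

```python
from datetime import datetime, timedelta


def should_replace(error_times, correctable=True,
                   max_errors=3, window=timedelta(days=14)):
    """Return True if the part looks like a candidate for replacement.

    error_times: datetimes at which errors were logged.
    correctable: False if any error could not be recovered from.
    max_errors, window: illustrative thresholds (assumptions, see above).
    """
    if not correctable:
        # An uncorrectable problem is grounds for replacement on its own.
        return True
    times = sorted(error_times)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans at most `window`.
        while t - times[start] > window:
            start += 1
        if end - start + 1 > max_errors:
            return True
    return False
```

With these example thresholds, four errors logged within nine days would return True, while three errors spread a month apart would return False.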
Particular to this problem, if the system didn't crash,
the domain didn't panic, and the data in transit at the
time wasn't corrupted, then the hardware and software
error recovery worked as expected. If errors such as
this continue, then there is a risk that a more serious
non-correctable problem will eventually occur. If this
error is uncommon, then you can probably trust that the
error recovery will handle it, if it happens again.
The system error log may have more information about the
error. DECevent may still work on V5.1B, but uerf(8)
isn't likely to. Compaq Analyze may have bit-to-text
translation for the error, if something made it into the
binary error log.
---
ESC:wq
--
Régis Carlier, APX Computer, 31 rue Denis Papin,
Parc Club des Prés, 59650 Villeneuve d'Ascq
Tel: +33320190018 , Fax: +33320190010 ,
Gsm: +33686943971 , Mail: Regis.Carlier_at_apx.fr
Received on Thu Jul 17 2003 - 12:53:48 NZST