SUMMARY: Problem on ES40 from Peter.Stern_at_weizmann.ac.il on 2007-02-07 (tru64-unix-managers)

From: <Peter.Stern_at_weizmann.ac.il>
Date: Tue, 06 Feb 2007 17:30:29 +0200 (IST)

I wish to thank the many people who tried to help:

Thierry Faidherbe
Rudolf Gabler
Joe Fletcher
Benjamin C. Ingwer
Guy Noce
Paul Maglinger
John Lanier
David Gutierrez
Richard Loken
Martin Roende
Fernando Carnero

Several people suggested reseating the memory modules.

John Lanier suggested using WEBES (Compaq Analyze on my machine) to
translate the binary error log and sent some tips on how to do that.
This identified the relevant DIMM.

I just took all of them out and put them back in and haven't gotten any
errors for two days. I never believed that this would make any
difference, but there you go. I hope the errors don't return. If they
do, we'll just make sure that we have identified the correct DIMM by
removing that set for a while and finally replacing it.
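
To see whether the corrected-error warnings come back after reseating,
the messages file can be summarized per CPU. The following is only a
minimal sketch, not something used on the machine itself: the function
name is hypothetical, it assumes each warning sits on a single line in
the log (the wrapping in the quoted message below comes from the mail
client), and it assumes Python 3 is available, which on a Tru64 v4.0f
system most likely means running it on a copy of the log on another
host.

```python
import re
from collections import Counter

# Pattern for the kernel warning quoted in the original report, e.g.
# "... vmunix: WARNING: too many Processor corrected errors detected on cpu 1. ..."
WARNING_RE = re.compile(
    r"WARNING: too many (Processor|System) corrected errors detected on cpu (\d+)"
)


def count_corrected_error_warnings(lines):
    """Count warnings per (error kind, cpu number) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            kind, cpu = match.group(1), int(match.group(2))
            counts[(kind, cpu)] += 1
    return counts
```

Passing an open file object for a copy of /var/adm/messages to
count_corrected_error_warnings() gives the totals; any nonzero count
dated after the reseating would mean the errors have returned.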

Regards,
Peter

>
> Forwarded message:
> > From peter Thu Feb 1 10:33:57 2007
> > Subject: Problem on ES40
> > To: tru64-unix-managers_at_ornl.gov
> > Date: Thu, 1 Feb 2007 10:33:57 +0200 (IST)
> > From: Peter.Stern_at_weizmann.ac.il
> > Reply-to: Peter.Stern_at_weizmann.ac.il
> > X-Mailer: ELM [version 2.5 PL3]
> > Content-Length: 6149
> >
> > We have an old ES40 (Tru64 v4.0f) which has been generally working
> > fine. About four months ago, it rebooted after recording the
> > following messages every few minutes in /var/adm/messages:
> > Sep 29 15:20:16 chemphys vmunix: trap: invalid memory read access from kernel mode
> > Sep 29 15:20:16 chemphys vmunix:
> > Sep 29 15:20:17 chemphys vmunix: faulting virtual address: 0xffffffff813bc000
> > Sep 29 15:20:17 chemphys vmunix: pc of faulting instruction: 0xfffffc0000268884
> > Sep 29 15:20:17 chemphys vmunix: ra contents at time of fault: 0xfffffc0000268870
> > Sep 29 15:20:17 chemphys vmunix: sp contents at time of fault: 0xffffffffbc953850
> > Sep 29 15:20:17 chemphys vmunix:
> > Sep 29 15:20:17 chemphys vmunix: panic (cpu 2): kernel memory fault
> > Sep 29 15:20:17 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
> > Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Sep 29 15:20:17 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
> > Sep 29 15:20:17 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Sep 29 15:20:17 chemphys vmunix: Alpha boot: available memory from 0x2e1a000 to 0x7fffc000
> >
> > But then, about two weeks ago it started giving the following errors
> > (which I did not notice at the time):
> > Jan 19 02:12:23 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 1. Reporting suspended.
> > Jan 19 02:12:29 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
> > Jan 19 02:12:38 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 0. Reporting suspended.
> > Jan 19 02:13:46 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.
> >
> > ...
> >
> > Jan 30 14:08:52 chemphys vmunix: WARNING: too many System corrected errors detected on cpu 0. Reporting suspended.
> > Jan 30 15:11:54 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 3. Reporting suspended.
> > Jan 30 15:34:15 chemphys vmunix: WARNING: too many Processor corrected errors detected on cpu 2. Reporting suspended.
> > Jan 30 18:28:34 chemphys vmunix: WARNING: too many System corrected errors detected on cpu 0. Reporting suspended.
> >
> > until after 11+ days it crashed and rebooted:
> > Jan 30 18:47:08 chemphys vmunix: Machine Check Processor Fatal Abort
> > Jan 30 18:47:08 chemphys vmunix: Machine check code = 0x1000000a0
> > Jan 30 18:47:09 chemphys vmunix: Ibox Status = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Dcache Status = 0000000000000008
> > Jan 30 18:47:09 chemphys vmunix: Cbox Address = 0000000029ab2bc0
> > Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 1 = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Fill Syndrome 0 = 000000000000006b
> > Jan 30 18:47:09 chemphys vmunix: Cbox Status = 000000000000000b
> > Jan 30 18:47:09 chemphys vmunix: EV6 captured status of Bcache mode = 0000000000000002
> > Jan 30 18:47:09 chemphys vmunix: EV6 Exception Address = 00000000121e9b00
> > Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000008
> > Jan 30 18:47:09 chemphys vmunix: EV6 Interrupt Summary Register = 0000000080000000
> > Jan 30 18:47:09 chemphys vmunix: EV6 TBmiss or Fault status = 0000000000000280
> > Jan 30 18:47:09 chemphys vmunix: EV6 PAL Base Address = 0000000000018000
> > Jan 30 18:47:09 chemphys vmunix: EV6 Ibox control = fffffffc06304396
> > Jan 30 18:47:09 chemphys vmunix: EV6 Ibox Process_context = 0000410000000004
> > Jan 30 18:47:09 chemphys vmunix: O/S Summary flag = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Base Address (phys) = 00000801a0000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Device Raw Interrupt Request = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: DRIR Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Device Interrupt Mask = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Cchip Miscellaneous Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Misc Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: Cchip Revision: 00
> > Jan 30 18:47:09 chemphys vmunix: ID of CPU performing read: 00
> > Jan 30 18:47:09 chemphys vmunix: Pchip 0 Base Address (phys) = 0000080180000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip 0 Error Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
> > Jan 30 18:47:09 chemphys vmunix: Pchip 1 Base Address (phys) = 0000080380000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip 1 Error Register = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: Pchip Error Register Decode:
> > Jan 30 18:47:09 chemphys vmunix: PCI Xaction Start Address = 0000000000000000
> > Jan 30 18:47:09 chemphys vmunix: PCI Command: Interrupt Acknowledge
> > Jan 30 18:47:10 chemphys vmunix: CPU 3 is prevented from being rebooted.
> > Jan 30 18:47:10 chemphys vmunix: The system must be reset or power cycled to clear this state.
> > Jan 30 18:47:10 chemphys vmunix: panic (cpu 3): Processor Machine Check
> > Jan 30 18:47:10 chemphys vmunix: syncing disks... device string for dump = SCSI 1 2 0 0 0 0 0.
> > Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Jan 30 18:47:10 chemphys vmunix: device string for dump = SCSI 1 2 0 0 0 0 0.
> > Jan 30 18:47:10 chemphys vmunix: DUMP.prom: dev SCSI 1 2 0 0 0 0 0, block 524288
> > Jan 30 18:47:10 chemphys vmunix: Alpha boot: available memory from 0x2e1a000 to 0x7fffc000
> >
> > I power cycled and rebooted, but it gave the same "too many Processor
> > corrected errors" message a few times over a period of about four hours
> > and again rebooted. The errors continue.
> >
> > Any idea what the specific problem is?
> >
> > Regards,
> > Peter
> >
> > Peter Stern
> > Chemical Physics Department
> > Weizmann Institute of Science
> > 76100 Rehovot, ISRAEL
> >
> > email: Peter.Stern_at_weizmann.ac.il
> > phone: 972-8-9342096
> > fax: 972-8-9344123
> >
>
>
Received on Tue Feb 06 2007 - 15:32:41 NZDT
