Greetings, list
We have a relatively new DS20E here (arrived in September) that we use purely
as a research number cruncher. It crashed suddenly yesterday while running at
a load of about 4 and, according to the user, consuming roughly half of our
6GB of swap. The box has twin 666MHz EV6 Alphas and 2.25GB of memory, and runs
Tru64 5.1. Typically only a single user has been running large Fortran codes
on this machine.
When going through the crash-data.0 dump, I'm seeing many instances of
invalid symbols:
(unallocated - symbol optimized away)
The top of crash-data issues the age-old -g3 debug compile warning.
Here's the top of the file, to summarize the environment:
------
#
# Crash Data Collection (Version 1.4)
#
_crash_data_collection_time: Thu Nov 15 08:51:14 NST 2001
_current_directory: /
_crash_kernel: /var/adm/crash/vmunix.0
_crash_core: /var/adm/crash/vmzcore.0
_crash_arch: alpha
_crash_os: Compaq Tru64 UNIX
_host_version: V5.1 (Rev. 732)
Compaq Tru64 UNIX V5.1 (Rev. 732); Tue Aug 7 04:26:47 GMT 2001
@(#)msb V2.4E 99/04/30 BL3-11(Rev. 41)
_crash_version: V5.1 (Rev. 732)
Compaq Tru64 UNIX V5.1 (Rev. 732); Tue Aug 7 04:26:47 GMT 2001
@(#)msb V2.4E 99/04/30 BL3-11(Rev. 41)
warning: Files compiled -g3: parameter values probably wrong
_crashtime: struct {
tv_sec = 1005826703
tv_usec = 122470
}
_boottime: struct {
tv_sec = 1001172429
tv_usec = 699618
}
_config: struct {
sysname = "OSF1"
nodename = "jetsam.physics.mun.ca"
release = "V5.1"
version = "732"
machine = "alpha"
}
_cpu: 57
_system_string: 0xffffffffffd44b30 = "COMPAQ AlphaServer DS20E 666 MHz"
_ncpus: 2
_avail_cpus: 2
_partial_dump: 1
_physmem(MBytes): 2303
_panic_string: 0xfffffc00008fb930 = "Processor Machine Check"
_paniccpu: 0
_panic_thread: 0xfffffc00048b8380
_preserved_message_buffer_begin:
[...]
---------
Here's an excerpt from a bit later on showing where the machine stopped.
The processes "bndtest" are user Fortran code:
---------
[...]
05208 dtterm
05209 sh
408271 bndtest7.x
410519 bndtest8.x
419945 bndtest10.x
29602 sh
29603 dtpad
227952 rpc.lockd
227956 mountd
227958 nfsd
227961 nfsiod
227964 rpc.statd
42999 lpd
371340 csh
371345 rlogind
371346 csh
257654 dtexec
257655 dtscreen
391395 bndtest3.x
_kernel_process_status_end:
_current_pid: 408271
_current_tid: 0xfffffc00048b8380
_proc_thread_list_begin:
thread 0xfffffc00048b8380 stopped at [boot:2774 ,0xfffffc00005e1778]
Source not available
_proc_thread_list_end:
_dump_begin:
> 0 boot(reason = (unallocated - symbol optimized away), howto = (unallocated - symbol optimized away)) ["../../../../src/kernel/arch/alpha/machdep.c":2774, 0xfffffc00005e1778]
mp = (unallocated - symbol optimized away)
nmp = 0xfffffc00009081a0
sp = (unallocated - symbol optimized away)
vp = (unallocated - symbol optimized away)
error = (unallocated - symbol optimized away)
rs = 0
mycpu = 0
rpb_ptr = 0x1
rpb_cpu = 0xffffffffffd44180
[...]
---------------
Am I reading this correctly: PID 408271 (the user job bndtest7.x) is what
CPU 0 was running when it stopped, while the "unallocated symbol" messages
relate to kernel code rather than to the user's program?
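
That reading can be checked mechanically. The following is a minimal sketch in Python, not a Tru64 tool: it assumes the process listing really is the two-column "pid command" layout seen in the excerpt, and the CRASH_DATA string and parse_crash_data helper are hypothetical names introduced for illustration. It looks up _current_pid in the process list and compares _panic_thread with _current_tid.

```python
import re

# Lines copied from the crash-data excerpts above. The two-column
# "pid command" layout is an assumption based on that excerpt only.
CRASH_DATA = """\
_panic_thread: 0xfffffc00048b8380
408271 bndtest7.x
410519 bndtest8.x
419945 bndtest10.x
391395 bndtest3.x
_current_pid: 408271
_current_tid: 0xfffffc00048b8380
"""

def parse_crash_data(text):
    """Return (fields, procs): scalar "_key: value" fields and a pid -> command map."""
    fields, procs = {}, {}
    for line in text.splitlines():
        line = line.strip()
        # Scalar fields such as "_current_pid: 408271".
        m = re.match(r"^(_\w+):\s*(\S+)", line)
        if m:
            fields[m.group(1)] = m.group(2)
            continue
        # Process-list rows such as "408271 bndtest7.x".
        m = re.match(r"^(\d+)\s+(\S+)$", line)
        if m:
            procs[int(m.group(1))] = m.group(2)
    return fields, procs

fields, procs = parse_crash_data(CRASH_DATA)
current_pid = int(fields["_current_pid"])
current_cmd = procs.get(current_pid)
same_thread = fields["_panic_thread"] == fields["_current_tid"]

print(current_pid, current_cmd, same_thread)
```

For the values in this dump, the current PID resolves to bndtest7.x and the panic thread address equals the current thread address. That only shows which process was on the CPU when the panic was raised; it does not by itself show that the process caused the machine check.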
When pulling up binary error logs, UERF reports a "100. CPU EXCEPTION" at
Thu Nov 15 08:48:21 2001, and two seconds later, a panic, "302. PANIC".
The system has been running fine since. These are the only two entries
(#'s 16 and 17 in the machine's lifetime) other than regular operational
events.
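
The UERF times can be cross-checked against the _crashtime and _boottime values in the crash-data header. The sketch below, in Python, assumes that tv_sec is seconds since the Unix epoch and that NST means Newfoundland Standard Time (UTC-3:30); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Values copied from _crashtime and _boottime in the crash-data header.
CRASH_TV_SEC = 1005826703
BOOT_TV_SEC = 1001172429

# Assumption: NST is Newfoundland Standard Time, UTC-3:30.
NST = timezone(timedelta(hours=-3, minutes=-30), "NST")

# Crash time in local time, and uptime between boot and crash.
crash_local = datetime.fromtimestamp(CRASH_TV_SEC, tz=NST)
uptime = timedelta(seconds=CRASH_TV_SEC - BOOT_TV_SEC)

# Time of the UERF "100. CPU EXCEPTION" entry, as quoted in the text.
cpu_exception = datetime(2001, 11, 15, 8, 48, 21, tzinfo=NST)

print("crash time:", crash_local.strftime("%H:%M:%S %Z"))
print("seconds after CPU EXCEPTION:", (crash_local - cpu_exception).total_seconds())
print("uptime at crash:", uptime)
```

Under those assumptions _crashtime works out to 08:48:23 NST, two seconds after the CPU EXCEPTION entry and so in line with the PANIC entry, and the machine had been up for a little under 54 days when it went down.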
Do I have cause for concern here, or has the user simply over-optimized
his code? I'm obviously not a crash dump expert. Thanks for any
direction.
Chris
======================================================================
Christopher C Stevenson, C4063 office: (709) 737-2624
Dept. of Physics & Physical Oceanography fax: (709) 737-8739
Memorial University of Newfoundland
St. John's, Newfoundland, CANADA A1B 3X7
URL: http://www.physics.mun.ca/~csteven
======================================================================
"We are all in the gutter, but some of us are looking at the stars."
-- Oscar Wilde
Received on Fri Nov 16 2001 - 16:35:05 NZDT