Hi Managers:
I had two responses to this query.
----------------------------------------------------------------------------
FROM:
Padraig Houlahan Computer Services - IS
Oregon State University
>From houlahap_at_ucs.orst.edu Wed Oct 18 21:15:00 1995
We're having similar problems and it's driving us nuts. Digital
is totally lost and struggling to come up with a solution.
We notice the deep-pause kind of problem when large (multi-GB)
files are being written out, and this is usually associated
with a program called Gaussian that models chemical structures.
We have tried modifying the UBC (Unified Buffer Cache) with no success.
We are now going to try to shift all the scratch files off the AdvFS
that they're currently written to (DEC told us that the dumps reveal the
UBC is overwhelmed, and the problem didn't happen prior to adding a third
CPU and using an AdvFS for tmp space).
----------------------------------------------------------------------------
FROM:
Paul E. Rockwell Northeast Region SBU Technical Support
Digital Equipment Corporation
>From rockwell_at_rch.dec.com Thu Oct 19 08:52:09 1995
This is not a good situation. You have some kind of underlying problem
with either that SCSI bus or a drive on the bus. Also, be very careful
about the length of cable on that external bus (and the amount of cable
in the TZ87 drive). You do not have a lot of length to play with if you're
using that external bus configuration - check your installation documentation
for the 2100 for more information.
Did you check your system error log (with uerf) to see if there were
any errors coming out?
I would suggest upgrading NetWorker to 3.1A (available on the Layered
Product CD) for better performance and other fixes. It will come up as
"SingleServer" if you're using an older version of SingleServer. FYI - any
version of NetWorker beyond 3.0A will come up as NetWorker SingleServer if
you don't have the DECNSR-NET-SVR license loaded.
Digital UNIX 3.2A (3.1 + patches from the 3.2 Complementary Products disk)
should have the kernel fixes in it for both AdvFS and a problem with
DLT tape drives. There are additional patches for NetWorker 3.0 that deal
with the DLT tape drives that are included in NSR 3.1A.
----------------------------------------------------------------------------
Thank you both.
The SCSI CAM errors turned out to be a real problem with a 9 GB disk
we had attached: it has a bad sector that was probably marginal before.
I followed Paul E. Rockwell's advice and upgraded NetWorker. I resumed
the backup schedule (NetWorker noted that the system had crashed during
the last backup), and it completed the level 1s on the early morning
schedule safely. I did not try to recreate the scenario that caused the
crash.
Here is the original query:
>
> Today our AlphaServer hung, failing to service any NFS requests. It
> still showed a minimal sign of life. The pointer moved on the local X
> server (though it couldn't do anything). I did not have time to look
> for much else.
>
> This machine does not have any "user" accounts. The only local activity
> that was going on was the completion of a NetWorker (single save and
> restore) backup that had been delayed a few days due to lack of media.
> I inserted used media, mounted and labeled it, and NetWorker proceeded
> to catch up (using interleaved session mode) with the backlog. I watched
> the NetWorker display for a minute or two; it indicated a throughput of
> about 700 KB/sec (DLT!). Not more than a minute after that the whole
> plant noticed "NFS server xxxx not responding"! I returned to the
> console and looked: the disk drive LEDs were quiet, the NetWorker
> backup window indicated that EOF had been written, but I could do
> nothing. Pressing the halt button was the only way to recover. I have
> an AlphaServer 2100 4/200 with one CPU running DEC OSF/1 V3.2A (Rev.
> 17) Firmware revision: 3.9 and PALcode: OSF version 1.35. It has 128 MB
> of memory and a SWXCR RAID controller. The / and /usr filesystems are
> AdvFS and RAID level 0 (2 - 1GB DEC disks) and 128 MB swap. It is a
> NFS server for about 60 workstations and 60 X Terminals.
>
> The NetWorker 3.0A product is the single save/restore version that came
> with 3.2. On a number of occasions during a reboot the CAM subsystem
> reports an unreadable block on one of the disk drives; unfortunately,
> this is inconsistent! In the past, after cycling power and rebooting a
> few times we have been able to get past this, get the disk back on line
> and fsck it.
>
>
--
---------------------------------------------------------------------------
Ted Asocks                 tra_at_ucolick.org
Systems Administrator      VOICE: (408)459-4020
UCO/Lick Observatory       FAX:   (408)454-9863
Received on Sat Oct 21 1995 - 22:55:19 NZDT