Folks,
I have a problem I would like your input on, but
I need to give a little background info first.
I am running a VMS 5.5-2 cluster made up of two
7620's, one 6620, and an NI-clustered 4000-90 hung off the
6620. The 7620's share a mixed HSC70/HSJ40 disk farm of
shadowed devices (the DUS shadow sets are predominantly RA70,
RA72 and RA90 drives, and the DSA shadow sets are all
RZ28s), and the 6620/4000-90 uses only DSA shadow
sets made up of RZ28s, with the HSJ controllers having
read cache installed and write cache enabled (I'm waiting
on HSJ firmware version 2.7 to use writeback).
Furthermore, this cluster is part of an extended Ethernet
network reaching a site approximately 120 miles away, connected
via a Vitalink TransLan III bridge over a full T-1,
with extremely active DECnet task-to-task links running
across it.
Through analysis conducted by ourselves, DEC, and
Pioneer Standard Electronics (our local sales support/OEM/
the person you have us deal with nowadays), we know we
have a backbone Ethernet congestion problem.
PROBLEM DEFINITION/STATEMENT/QUESTION:
Prior to this weekend, the receiving node of the
DECnet task-to-task links was an 8550. This has since
been upgraded to a 6420. Since the upgrade, a couple of
times per day the 6620 claims
that all of the DSA shadow sets have changed state and
gone into mount verify, which completes in under 1
minute. Immediately before this happens, an application
on the 4000-90 receives an RMS-F-WER error while writing
to a directory on one of the shadow sets.
My supposition is that the 4000-90 is hitting
a network timeout while attempting to write to a DSA device
because of network congestion, and is notifying its
serving node, the 6620. In turn, the 6620 is questioning
the state of all known HSJ shadow sets and checking their
validity (hence the mount verify), which
returns almost immediately because no problem exists.
This supposition is also based on the fact that the two
7620's also read from and write to these same devices, but report
no errors.
Does that make sense to you?
Thanks for the 'wizardry',