I'd like to get some opinions on what I'm planning to do to a herd of Tru64
V4.0F/Trucluster 1.6 servers we have here. One constraint is that downtime
is very hard to come by on these systems, and it's absolutely critical that
these systems run. If they fail, machinery grinds to a halt, managers get in
a bad mood, and things generally go downhill.
The machines under discussion are:
Two ES40 clusters, consisting (each) of two ES40s, a fiber channel switch,
and an HSG80. Let's call one cluster DB1 and the other cluster DB2.
Two DS20E clusters, consisting (each) of two DS20Es and an RA3000. Let's
call one cluster AP1 and the other AP2.
What I'm trying to accomplish is to connect all these machines into a single
SAN so they can access a tape library located at the other end of our
campus. We already have connectivity across the campus, and there's another
fiber channel switch sitting with the machines we're discussing ready to
take connections.
The DB1 and DB2 clusters seem straightforward enough: connect their switches
to the switch that connects back to the tape library. I plan to configure
the HSG80s so they serve the appropriate disks only to their respective
ES40s, so there won't be any confusion. We've noted that the fiber channel
switches in these clusters constitute a single point of failure, so I'm
planning to also cross connect them, so that, in each cluster, one ES40
connects to one switch and the other connects to the other switch. Same
thing for the redundant controllers in the HSG80s: one to one switch and one
to the other. The people that use and oversee these machines are very wary
of any changes, and if I can actually improve reliability with my changes,
it'll be a bit easier to sell.
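
To sanity-check the cross-connect idea, here is a minimal sketch in Python that treats one cluster as a graph and asks which switch, if lost, leaves no host able to reach any HSG80 controller. The node names (es40_a, sw_db1, hsg_ctrl_a and so on) are made up for illustration, and the model ignores zoning, controller failover modes, and the link to the tape switch, so treat it as a thinking aid rather than a statement about how the real fabric behaves.

```python
from collections import deque


def reachable(links, start, failed):
    """Return the nodes reachable from start when the failed node is removed."""
    if start == failed:
        return set()
    adjacency = {}
    for a, b in links:
        if failed in (a, b):
            continue  # any link touching the failed node is gone
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def single_points_of_failure(links, hosts, controllers, candidates):
    """List candidates whose loss leaves no host able to reach any controller."""
    targets = set(controllers)
    spofs = []
    for candidate in candidates:
        survives = any(reachable(links, h, candidate) & targets for h in hosts)
        if not survives:
            spofs.append(candidate)
    return spofs


HOSTS = ["es40_a", "es40_b"]
CONTROLLERS = ["hsg_ctrl_a", "hsg_ctrl_b"]

# Today: everything in the cluster hangs off one switch (hypothetical names).
CURRENT = [
    ("es40_a", "sw_db1"), ("es40_b", "sw_db1"),
    ("hsg_ctrl_a", "sw_db1"), ("hsg_ctrl_b", "sw_db1"),
]

# Planned: one host and one controller on each of the two switches.
PLANNED = [
    ("es40_a", "sw_db1"), ("es40_b", "sw_db2"),
    ("hsg_ctrl_a", "sw_db1"), ("hsg_ctrl_b", "sw_db2"),
]

print(single_points_of_failure(CURRENT, HOSTS, CONTROLLERS, ["sw_db1"]))
# ['sw_db1']
print(single_points_of_failure(PLANNED, HOSTS, CONTROLLERS, ["sw_db1", "sw_db2"]))
# []
```

In this toy model the planned layout survives the loss of either switch, but only because the surviving host still has a controller on its own switch; whether the cluster software and the HSG80 actually fail over that way is exactly what I'd want confirmed.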
The AP1 and AP2 clusters are a bit trickier. I have no free slot to add a
fiber channel card. My plan for these is to migrate the shared storage to
the HSG80 controllers, again configuring the disks so they're only presented
to the proper hosts. To minimize downtime and provide fallback options, what
I plan to do is take one DS20E down and leave the other one running. I'll
remove the SCSI card for the RA3000 and install the fiber channel card. On
the HSG80, I'll use some spare disks to create the necessary storage and
restore everything there using the "down" DS20E. When all that looks ok,
we'll take down the running DS20E and bring everything up on the modified
DS20E. Once things are stable and we're convinced it's working, I'll modify
the other DS20E and take the disks out of the RA3000 to replace the spares I
used on the HSG80.
I plan to do the same kind of cross-connect between switches with the DS20E
clusters that I'm doing with the ES40 clusters: connect one cluster member
to one switch and the other to the other switch. Again, this should improve
reliability and make this whole scheme easier to sell. The AP clusters also
depend on databases on the DB clusters, so I plan to locate their shared
storage on the same HSG80 as the databases they depend on, so if an HSG80
goes up in smoke, we don't lose the entire operation; in other words, if
DB1's HSG80 goes down, AP1 won't be able to operate anyway, so why not put
AP1's storage on DB1's HSG80.
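
The placement reasoning can be written down as a small Python sketch that lists which services stop when one HSG80 is lost. The array labels "HSG80-1" and "HSG80-2" are invented for the example, and the AP2-depends-on-DB2 pairing is my assumption by symmetry with AP1 and DB1; the "crossed" layout is only there as a comparison.

```python
def services_down(failed_array, storage_on, depends_on):
    """Return the sorted services that stop when failed_array is lost."""
    # Services whose own storage lives on the failed array go down first.
    down = {svc for svc, array in storage_on.items() if array == failed_array}
    # Then anything depending on a downed service goes down too.
    changed = True
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            if svc not in down and any(dep in down for dep in deps):
                down.add(svc)
                changed = True
    return sorted(down)


# Assumed dependencies: each AP cluster needs the matching DB cluster.
DEPENDS_ON = {"AP1": ["DB1"], "AP2": ["DB2"]}

# Planned: AP storage sits on the same HSG80 as the database it needs.
COLOCATED = {"DB1": "HSG80-1", "AP1": "HSG80-1",
             "DB2": "HSG80-2", "AP2": "HSG80-2"}

# Comparison only: AP storage on the other HSG80.
CROSSED = {"DB1": "HSG80-1", "AP1": "HSG80-2",
           "DB2": "HSG80-2", "AP2": "HSG80-1"}

print(services_down("HSG80-1", COLOCATED, DEPENDS_ON))
# ['AP1', 'DB1']
print(services_down("HSG80-1", CROSSED, DEPENDS_ON))
# ['AP1', 'AP2', 'DB1']
```

Under these assumptions, co-locating keeps one DB/AP pair running when an HSG80 fails, while the crossed layout takes out both AP clusters.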
One thing I'm concerned about is device names. Again, the powers that be are
very wary of any changes, and if I can make sure that the device names don't
change, things will be a good bit easier.
Does this sound workable? Am I about to precipitate some sort of disaster?
Thanks in advance for your thoughts, and thanks for reading all the way to
the bottom of this.
---
Bluejay Adametz, CFII, A&P	bluejay_at_fujigreenwood.com
Fuji Photo Film, Inc.		+1 864 223 2888 x1369
Greenwood, SC, USA
---
If you don't like the answer, you shouldn't have asked the question. 
Received on Fri Jul 13 2001 - 12:36:48 NZST