SUMMARY: Netpartitioning from Charlotte.Barsby_at_vf.vodafone.co.uk on 2000-04-18 (tru64-unix-managers)

From: <Charlotte.Barsby_at_vf.vodafone.co.uk>
Date: Tue, 18 Apr 2000 10:45:00 +0100

Dear Managers, I only got a couple of answers to this problem, and it was
eventually solved by Steve Austin of Compaq support. The original problem is
at the bottom.

Thank you for the answers I received from Anthony Miller & Lucien Hercaud
who both quickly responded suggesting we check out the network timings and
retries.
----------------------------------------------------------------------------
1) From Anthony Miller

For multi-site fibre ASE clusters we also do the following. Don't know if
this might help you out.

Tony

1.1 TruCluster Tuning Details
=============================
Due to the site separation of the two cluster members, certain cluster
specific network timers have been tuned to account for potential latency in
network response. The current values can be obtained via the asemgr utility
('a', 'C').
In order to override the defaults set in the ASE database the file
/etc/hsm.conf has been created and populated with the following values in
tenths of seconds:
HSM_PNET_MR=8 # Primary network maximum retries
HSM_ANET_MR=8 # Backup network maximum retries

This was not applicable, as our cluster is not a multi-site fibre one, and I
don't think the problem was to do with network timings and retries: with just
the heartbeat and main networks configured, everything worked fine. It seemed
a more fundamental problem, with the additional interfaces interfering somehow.
----------------------------------------------------------------------------
2) From Lucien Hercaud

Did you try increasing the network ping timers?
Run asemgr -d -C to extract them and modify them if necessary.

Lucien HERCAUD
UNIX Systems Consultant on assignment at Bouygues Télécom
DSI / IPR / IRS / INF

Again, I checked the values and they seemed all right, and the above point
still held: ASE worked fine when only two of the interfaces were configured.
----------------------------------------------------------------------------
3) The actual fix, from Steve Austin at Compaq,

Routing tables
Destination      Gateway          Flags   Refs    Use  Interface
Netmasks:
Inet             0.0.0.0
Inet             255.255.240.0
Inet             255.255.255.0
Inet             255.255.255.252

Route Tree for Protocol Family 2:
default          10.33.48.10      UGS        2   1074  tu1
10.33.48/24      10.33.49.23      U          0      5  tu1
10.33.48/20      10.33.49.23      U          1     22  tu1
10.33.49.22      10.33.49.22      UGHS       4    244  tu1
10.33.49.23      10.33.49.23      UGHS      11    834  tu1
10.245.1.64      10.245.1.65      UH         0      0  tu2
10.245.32/20     10.245.40.35     U          0      0  tu0
10.245.40/24     10.245.40.35     U          1    156  tu0
127.0.0.1        127.0.0.1        UH         7   1047  lo0

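These are the routing tables and netmasks as reported by "netstat -rn" for
dectest2 (the local gateway addresses 10.33.49.23, 10.245.40.35 and
10.245.1.65 are those of dectest2 in the original message below).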
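
As a reading aid for the routing table above, here is a minimal Python sketch
of longest-prefix matching, the usual rule by which the most specific route
wins. It is not how ASE or the Tru64 kernel is implemented. The route list is
transcribed by hand from the table, "default" is written as 0.0.0.0/0, and
entries carrying the H flag are assumed to be /32 host routes; the function
name lookup is made up for this example.

```python
import ipaddress

# Routes transcribed from the netstat -rn output above.
# Assumption: H-flagged entries are host routes (/32), "default" is 0.0.0.0/0.
ROUTES = [
    ("0.0.0.0/0", "tu1"),
    ("10.33.48.0/24", "tu1"),
    ("10.33.48.0/20", "tu1"),
    ("10.33.49.22/32", "tu1"),
    ("10.33.49.23/32", "tu1"),
    ("10.245.1.64/32", "tu2"),
    ("10.245.32.0/20", "tu0"),
    ("10.245.40.0/24", "tu0"),
    ("127.0.0.1/32", "lo0"),
]


def lookup(dest):
    """Return (network, interface) of the most specific route covering dest."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for net_text, iface in ROUTES:
        net = ipaddress.ip_network(net_text)
        if addr in net:
            matches.append((net, iface))
    # The default route always matches, so matches is never empty.
    net, iface = max(matches, key=lambda m: m[0].prefixlen)
    return str(net), iface


if __name__ == "__main__":
    # Heartbeat address of the other member (hdectest1 in the original message).
    print(lookup("10.245.40.34"))
```

Under these assumptions the heartbeat peer 10.245.40.34 matches both
10.245.32/20 and 10.245.40/24 on tu0, and the /24 entry is selected.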

There is an apparent problem with this configuration because ASE reports a
network split and can no longer run its heartbeat connection to the other
cluster member out of interface tu0.

The system has three defined netmasks, 255.255.240.0 reported as /20 in the
routing table, 255.255.255.0 reported as /24 in the routing table and
255.255.255.252 which should be reported as /30 in the routing table but is
not.
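
The correspondence between the three dotted netmasks and the /20, /24 and /30
notation can be checked with a short Python sketch using the standard
ipaddress module. The helper name mask_to_prefix is made up for this example.

```python
import ipaddress


def mask_to_prefix(mask):
    """Convert a dotted-quad netmask into its prefix length."""
    return ipaddress.ip_network(f"0.0.0.0/{mask}").prefixlen


if __name__ == "__main__":
    for mask in ("255.255.240.0", "255.255.255.0", "255.255.255.252"):
        print(f"{mask} -> /{mask_to_prefix(mask)}")
```

This prints /20, /24 and /30 for the three masks, in that order.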

As far as I can see, the software has been confused because addresses in the
subnetwork 10.245.40.xx are split over the two interfaces tu0 and tu2 when
using the 255.255.255.252 mask and the problem is most easily resolved by
doing two things:

1. Use a single netmask (probably 255.255.240.0, which is what's needed
for the main site network on tu1) for all interfaces. This simplifies the
routing tree and makes it easier to check the configuration.

2. Ensure that all 4 interfaces are configured into different subnets
using the single subnet mask. If using the 255.255.240.0 subnet mask, the
existing addresses for tu0 and tu1 will be OK, but tu2 and tu3 would need new
addresses. Here's a suggestion:

        Interface   Address        Subnet Mask
        tu0         10.33.49.23    255.255.240.0
        tu1         10.245.32.35   255.255.240.0
        tu2         10.245.48.65   255.255.240.0
        tu3         10.245.64.69   255.255.240.0
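
A quick way to confirm that a plan like the one suggested above really puts
every interface into its own subnet is sketched below in Python. The addresses
and mask are copied from the suggestion; the names PLAN, MASK and subnets are
made up for this example.

```python
import ipaddress

# Addresses copied from the suggested plan above; one shared netmask.
PLAN = {
    "tu0": "10.33.49.23",
    "tu1": "10.245.32.35",
    "tu2": "10.245.48.65",
    "tu3": "10.245.64.69",
}
MASK = "255.255.240.0"


def subnets(plan, mask):
    """Map each interface to the subnet its address falls into."""
    return {
        iface: ipaddress.ip_interface(f"{addr}/{mask}").network
        for iface, addr in plan.items()
    }


if __name__ == "__main__":
    nets = subnets(PLAN, MASK)
    for iface, net in nets.items():
        print(iface, net)
    # True when no two interfaces share a subnet.
    print("all distinct:", len(set(nets.values())) == len(nets))
```

With the 255.255.240.0 mask the four addresses fall into 10.33.48.0/20,
10.245.32.0/20, 10.245.48.0/20 and 10.245.64.0/20, so no two interfaces share
a subnet.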

However, if you really want to keep the network addresses as they are,
change the subnet masks on tu2 and tu3 to 255.255.255.0 and use explicit
routes added to the /etc/routes file to route to the two encryption modules:

        -host 10.245.1.66 -interface 10.245.1.65
        -host 10.245.1.70 -interface 10.245.1.69
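
One likely reason the explicit host routes are needed with this alternative
can be seen from the subnet arithmetic. The Python sketch below uses the
dectest2 addresses for tu2 and tu3 from the original message; the helper name
network_of is made up for this example.

```python
import ipaddress


def network_of(addr, mask):
    """Return the subnet an interface address belongs to under a given mask."""
    return ipaddress.ip_interface(f"{addr}/{mask}").network


# tu2 and tu3 addresses on dectest2, from the original message.
TU2, TU3 = "10.245.1.65", "10.245.1.69"

if __name__ == "__main__":
    for mask in ("255.255.255.252", "255.255.255.0"):
        print(mask, network_of(TU2, mask), network_of(TU3, mask))
```

With 255.255.255.252 the two interfaces sit in separate subnets
(10.245.1.64/30 and 10.245.1.68/30), whereas with 255.255.255.0 both land in
10.245.1.0/24, so the subnet route alone no longer says which interface
reaches which encryption module.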

These should work OK, but a reboot might be needed.

Steve Austin, 17 April 2000.

We went with option 3 which worked fine. :-)
----------------------------------------------------------------------------
> -----Original Message-----
> From: Barsby Charlotte, IT VF Ltd
> Sent: 12 April 2000 10:32
> To: 'tru64-unix-managers_at_ornl.gov'
> Subject: Netpartitioning
>
> Dear Managers,
>
> We have a very strange scenario at the moment with consistent
> Netpartitioning occurring, on a cluster of 2 ES40s running
> 4.0f PK3, and ASE 1.6.
>
> Each ES40 has 4 network cards, 2 combi cards which are
> configured to be the main network interface, and heartbeat
> LAN, and 2 standard 10/100BaseT Ethernet cards which will be
> configured as point to point links to an external black box
> being used for encryption.
>
> When only the main network cards are configured, with ASE
> having the heartbeat LAN as its primary network, and the main
> network connection as its backup (the backup is monitored),
> everything runs very smoothly. In fact any set-up with these
> networks and ASE behaves impeccably.
>
> However, as soon as I configure a third interface on either
> side, ASE net partitions, which means a failure of both the
> ASE primary and backup networks. In this state though, it is
> still possible to ping the heartbeat LAN and the main network
> from both sides of the cluster, i.e. from dectest2 I can still
> ping dectest1 and hdectest1, and from dectest1 I can ping
> dectest2 and hdectest2.
>
> The addressing on the interfaces is as follows:
>
> Dectest1
> interface : address      : subnetmask      : name      : usage
> tu0       : 10.245.40.34 : 255.255.240.0   : hdectest1 : ASE primary network, heartbeat LAN.
> tu1       : 10.33.49.22  : 255.255.240.0   : dectest1  : ASE backup network, monitored. Main network.
> tu2       : 10.245.1.57  : 255.255.255.252 : dt1ra     : Point to point connection to black box 10.245.1.58
> tu3       : 10.245.1.61  : 255.255.255.252 : dt1rb     : Point to point connection to black box 10.245.1.62
>
> Dectest2
> interface : address      : subnetmask      : name      : usage
> tu0       : 10.245.40.35 : 255.255.240.0   : hdectest2 : ASE primary network, heartbeat LAN.
> tu1       : 10.33.49.23  : 255.255.240.0   : dectest2  : ASE backup network, monitored. Main network.
> tu2       : 10.245.1.65  : 255.255.255.252 : dt2ra     : Point to point connection to black box 10.245.1.66
> tu3       : 10.245.1.69  : 255.255.255.252 : dt2rb     : Point to point connection to black box 10.245.1.70
>
> Like I said, as soon as I configure tu2 and/or tu3 on either
> side of the cluster, and restart the network, or reboot then
> ASE Netpartitions. As soon as I delete these cards, then ASE
> starts behaving again, (I don't have to restart the network).
>
> Compaq have been looking at this, and so far to no avail.
> Although they have said that there shouldn't be anything
> wrong with this configuration.....
>
> I hope you can help,
>
> Yours,
>
> Charlotte Barsby
>
> Technical Projects
> Vodafone LTD
>
>
Received on Tue Apr 18 2000 - 09:46:30 NZST
