I'm stumped by an apparent limit in the ability of the Tru64 UNIX
kernel (v5.1A pk6) to handle the MAC addresses of close to 1000 NFS
clients.
We expanded our Linux cluster to 900+ nodes, and suddenly the
Tru64 UNIX NFS file server randomly loses network communication
with many (or most) of the new nodes. A "ping" doesn't work from
either end of the server-client connection. Communication between
Linux servers and nodes works perfectly, however, so we do not
believe there to be a problem with the network setup.
I believe that what happens is "ARP cache thrashing": the Tru64 kernel
apparently can't cope with close to 1000 MAC addresses simultaneously
because a fixed-size ARP cache fills up, and the kernel starts
deleting MAC addresses from the ARP cache randomly. See "man 7 arp"
on a Linux box for a description of the cache. On the Linux boxes we
solve the
ARP cache problem by loading a static cache from the /etc/ethers file,
but on Tru64 UNIX this causes a dead-sure communications failure :-(
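
Before loading a static cache from a file in /etc/ethers format, it can
help to check that the file is well formed. The following is a minimal
sketch, not part of our actual setup: the function names parse_ethers
and find_duplicates are made up for illustration, and the format
assumed is one "MAC-address hostname" pair per line with optional "#"
comments.

```python
import re
from collections import Counter

# Six hex groups separated by colons, e.g. 00:11:22:33:44:55
MAC_RE = re.compile(r"^[0-9A-Fa-f]{1,2}(:[0-9A-Fa-f]{1,2}){5}$")


def parse_ethers(text):
    """Parse ethers-style text into a list of (mac, host) tuples.

    Hypothetical helper: raises ValueError on any malformed line.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2 or not MAC_RE.match(fields[0]):
            raise ValueError(f"malformed ethers line: {raw!r}")
        entries.append((fields[0].lower(), fields[1]))
    return entries


def find_duplicates(entries):
    """Return (duplicate MACs, duplicate hosts), each as a sorted list."""
    mac_counts = Counter(mac for mac, _ in entries)
    host_counts = Counter(host for _, host in entries)
    dup_macs = sorted(m for m, n in mac_counts.items() if n > 1)
    dup_hosts = sorted(h for h, n in host_counts.items() if n > 1)
    return dup_macs, dup_hosts


if __name__ == "__main__":
    with open("/etc/ethers") as f:
        entries = parse_ethers(f.read())
    print(len(entries), "entries; duplicates:", find_duplicates(entries))
```

Running the script prints the number of entries and any MAC address or
hostname that appears more than once, which is worth knowing before a
static table of roughly 900 entries is loaded.
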
Browsing the Tru64 UNIX manuals and the "dxkerneltuner" tool, I
haven't been able to find any kernel parameter that would increase
the maximum size of the ARP cache. Can anyone help?
Note: The 900 nodes are divided about equally between two Gigabit
interfaces on the Tru64 UNIX server.
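
The thrashing hypothesis can be illustrated with a toy model, using the
client counts from this post (900 in total, about 450 per interface).
This is only a sketch under stated assumptions: the cache capacity of
512 slots is an arbitrary, hypothetical number (the real Tru64 limit is
unknown to me), clients are assumed to talk to the server uniformly at
random, and eviction is assumed to be random, as guessed above.

```python
import random


def miss_rate(num_clients, cache_slots, lookups=20000, seed=0):
    """Fraction of lookups that miss a fixed-size cache with random eviction."""
    rng = random.Random(seed)
    slots = []       # cache contents, indexable so a random victim can be picked
    cached = set()   # same contents, for fast membership tests
    misses = 0
    for _ in range(lookups):
        client = rng.randrange(num_clients)  # uniform traffic (assumption)
        if client in cached:
            continue
        misses += 1
        if len(slots) < cache_slots:
            slots.append(client)             # cache not yet full
        else:
            victim_index = rng.randrange(cache_slots)
            cached.discard(slots[victim_index])  # evict a random entry
            slots[victim_index] = client
        cached.add(client)
    return misses / lookups


if __name__ == "__main__":
    HYPOTHETICAL_SLOTS = 512  # arbitrary; not a known Tru64 value
    for clients in (450, 900):
        print(clients, "clients:", miss_rate(clients, HYPOTHETICAL_SLOTS))
```

In this model, when the clients fit in the cache only the initial cold
misses occur, whereas with more clients than slots a large share of
lookups miss and each miss would cost a fresh ARP exchange.
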
Ole Holm Nielsen
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
Received on Fri Sep 17 2004 - 18:50:49 NZST