Hi,
in my original message I wrote:
> we have two AlphaServers 4000 on V4.0F which besides 100 Mbps Ethernet
> also have DEGPA-SA Gbps Ethernet cards (driver rev. 1.0.12, firmware rev.
> 11.3.2). We use them to send data over TCP to Linux nodes with 100 Mbps
> Ethernet interfaces. Usually this works fine, but occasionally (~< 1 %)
> when the TCP connection has been successfully established (verified on
> both ends) the first write() of 256 kB will return a short count, which
> is always 136 kB. The other side, however, did not receive anything.
We have also seen a few other "magic numbers", always multiples of 8 kB.
> When the program on the AlphaServer then tries to write the remaining
> 120 kB, it gets an EPIPE (Broken pipe). The other side, however, did
> not close the socket and just sits in a recv() waiting for data.
> When the program on the AlphaServer exits, OSF1 does not send a FIN
> packet to the other side, presumably because it thinks the other side
> already broke the connection.
>
> After scrutinizing for many days the program code on both ends we think
> there must be a bug in some piece of the OS, most likely in the Gbit
> driver code.
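
For anyone trying to reproduce this: the sequence quoted above (a short count
from the first write(), then EPIPE on the retry) is easiest to pin down when
the sender records exactly how many bytes the kernel accepted before the
error. Below is a minimal sketch of such a write loop. It is not our
production code; the names write_all() and ignore_sigpipe() are made up for
this illustration. Note that SIGPIPE has to be ignored or handled, otherwise
the process is killed by the signal before write() can return EPIPE.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGPIPE so that a write() on a connection
 * the kernel considers broken fails with EPIPE instead of killing us. */
void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* Hypothetical helper: write len bytes to fd, retrying after short counts.
 * Returns 0 on success, -1 on error (errno is set). In both cases *sent
 * holds the number of bytes the kernel accepted, so a failure such as
 * "136 kB accepted, then EPIPE" can be logged precisely. */
int write_all(int fd, const void *buf, size_t len, size_t *sent)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL && sent != NULL);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            *sent = done;
            return -1;             /* e.g. EPIPE on the second attempt */
        }
        if (n == 0) {              /* no progress: avoid spinning forever */
            errno = EIO;
            *sent = done;
            return -1;
        }
        done += (size_t)n;         /* short count: loop for the remainder */
    }
    *sent = done;
    return 0;
}
```

A loop like this does not cure the problem described here; it only makes the
"accepted N bytes, then EPIPE" case visible in the sender's log.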
In the meantime we have installed rev. 2.0.1 of the Gbit driver on our new
GS-80 running V5.1. When we repeated our tests we got a lower error rate,
more like ~< 0.1 %, but definitely non-zero. Then we realized that the TCP
send space (window) used by our Tru64 machines was rather large: 128 kB.
That value is surely too large for the router that sits between our Gbit
ports and the 100 Mbps Linux nodes: it has only a 56 kB buffer for each of
its 100 Mbps outputs. Our transfers were presumably overflowing those
buffers quite often, causing packets to be dropped and reducing throughput.
When we set the window to 32 kB and repeated our tests, the error rate
dropped significantly to about 0.01 % on the GS-80; it also dropped quite
dramatically on the AlphaServers 4000, which still use rev. 1.0.12 of the
Gbit driver on V4.0F.
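
For reference, one way an application can limit the send window for a single
connection is the standard SO_SNDBUF socket option, since the send buffer
bounds how much unacknowledged data TCP can have outstanding. The sketch
below is only an illustration of that approach, not a record of the exact
change we made, and the helper name set_send_buffer() is made up. With 32 kB
the data in flight stays below the 56 kB router buffer, whereas 128 kB is
more than twice that size.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: request a send buffer of `bytes` on socket `sock`
 * and read back the size the kernel actually uses. Returns that size, or
 * -1 on error. Kernels may round or adjust the requested value, which is
 * why the result is queried with getsockopt() instead of assumed. */
int set_send_buffer(int sock, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof granted;

    assert(bytes > 0);
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof bytes) < 0)
        return -1;
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

Typical use would be set_send_buffer(sock, 32 * 1024) right after socket()
and before connect(), followed by logging the returned value.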
Our conclusion: the Gbit driver does not handle dropped packets correctly in
all cases - somehow it makes Tru64 jump to the conclusion that the remote end
has closed the connection.
Fortunately we expect that our applications can live with the low error rate
observed, but it would be better if this problem were completely solved.
Our thanks for providing helpful suggestions go to the following people:
Jeffrey Mogul <mogul_at_pa.dec.com>
Guy.Loucks_at_det.nsw.edu.au
Klas.Erlandsson_at_europolitan.se
Regards,
Maarten (Fermilab Computing Division)
Received on Wed Feb 28 2001 - 00:21:25 NZDT