SUMMARY: transmit FIFO underflow: how to prevent occurrence from Charles Vachon on 2000-06-27 (tru64-unix-managers)

From: Charles Vachon <cvachon2_at_mrn.gouv.qc.ca>
Date: Mon, 26 Jun 2000 11:02:49 -0400

[***Original message***]
>Hello Tru64 managers,
>
>The issue of "transmit FIFO underflow" has been discussed previously
>in the mailing list, and the consensus was that these messages are a
>normal occurrence, appearing as the Ethernet adapter adjusts itself to
>the volume of network traffic.
>
>However, we observed (at least on one of our servers, a 2-processor
>AS2100 5/250 running TU4.0F+pk3) that at the very moment this
>message appears in the log, we experience a short but severe slowdown
>in network performance: all client connections to the server seem to
>freeze for several seconds, and certain applications remotely displayed
>on PCs running Exceed simply die, presumably for lack of network service.
>
>The solution I would like to implement to correct this problem is to
>preset the FIFO size to a suitable value, 512 or 1024 bytes, instead of
>the (seemingly) default of 128 bytes. This way, I hope to prevent
>the slowdown altogether.
>
>Does anyone know where this could be set? I've looked a bit in the
>configurable attributes of /etc/sysconfigtab and also in the *.h and
>*.c files present in /usr/sys and /usr/include, but so far I've found
>nothing relevant.
>
>BTW I will also continue investigating the network side of the story,
>for this could also be caused by a switched port disallowing any
>traffic for some seconds after a link-down/link-up transition (e.g. if
>such a transition occurs when the FIFO size is increased, this could be
>the source of the problem).
>
>Thanks in advance/will summarize

[***Original message ends***]

Thanks to:

Oisin McGuinness <oisin_at_sbcm.com>
George Gallen <ggallen_at_slackinc.com>
Cyndi Smith <cyn_at_odin.mdacc.tmc.edu>
"Dr. Tom Blinn, 603-884-0646" <tpb_at_doctor.zk3.dec.com>
"Hoai TRAN" <HTran_at_freightcorp.com.au>
"Thomas, Phil" <Phil.Thomas_at_compaq.com>
Claude Scarpelli <claude_at_genoscope.cns.fr>

who shared their experiences on this matter. Although no one came up
with a way to increase the initial size of the FIFO buffer, which would
circumvent the problem, there were many suggestions on how to alleviate
it:

-plug the server's NIC into a dedicated Ethernet switch port
-run the link at 100 megabit/s, full-duplex
-turn off autonegotiation both at the switch itself and at the server's
 NIC

Phil Thomas provided the following:

PT>I can't comment much on the FIFO underflow error, but you may want to
PT>investigate the cause of the slowdown more, because the symptoms you
PT>describe exactly match a problem with sync that was addressed in pk2
PT>or pk3 for 4.0f. If your freezes are happening every 30 seconds
PT>exactly, then this is almost certainly the case. Check the notes on
PT>"smoothsync" in the pk3 release notes:
PT>http://ftp1.support.compaq.com/public/dunix/v4.0f/duv40fas0003-20000225.README.
PT>
PT>WARNING: ALTERING THE BEHAVIOUR OF SYNC MAY CAUSE LOSS OF DATA! If
PT>you're on a non-production system, you can confirm the cause by
PT>temporarily disabling /sbin/update (man update; man sync) and checking
PT>for freezes. If you're on a production system, or are unsure about
PT>update and sync -- take care! -- contact Compaq support for assistance!
PT>
PT>I've seen the 30 second sync affect a DU system to the point where it
PT>can't even respond to pings for a few seconds, even though the NIC is
PT>not saturated.  I've also seen X emulators affected by this comms
PT>"brownout" and clients time out, though this symptom would normally
PT>indicate a different sort of network bandwidth problem.  It's possible
PT>
PT>From my reading of the FIFO underflow, the sync freeze problem may
PT>actually be causing these messages indirectly (because the kernel is
PT>unable to complete transferring a packet to the tulip NIC), and
PT>adjusting the threshold for early transmission is the normal operation
PT>of the driver.  I think this message is another symptom of the real
PT>problem...
PT>
PT>On the other hand, if your freezes aren't 30 seconds apart, and you
PT>have definite signs of saturated network, then I've probably led you
PT>astray, so continue with your current line of investigation!  Note that
PT>the FIFO underflow message is more likely triggered by a busy system
PT>(run queue? free mem? paging out?) rather than a busy network.
PT>
PT>have fun!
PT>Phil T
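
A rough way to test Phil's "every 30 seconds exactly" criterion is to
compare the times of successive freezes (or of the underflow messages in
the log). The Python sketch below is only an illustration: the function
names, the HH:MM:SS input format, the 2-second tolerance and the sample
times are my own assumptions, not something taken from the replies.

```python
from datetime import datetime

def intervals_seconds(times, fmt="%H:%M:%S"):
    """Gaps in seconds between successive timestamps from the same day."""
    parsed = [datetime.strptime(t, fmt) for t in times]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

def looks_periodic(gaps, period=30.0, tolerance=2.0):
    """True if every gap lies within `tolerance` of a whole multiple of `period`."""
    if not gaps:
        return False
    for gap in gaps:
        multiple = round(gap / period)
        if multiple < 1 or abs(gap - multiple * period) > tolerance:
            return False
    return True

# Hypothetical freeze times noted by hand (assumed values, not real log data)
gaps = intervals_seconds(["11:02:19", "11:02:49", "11:03:50"])
print(gaps, looks_periodic(gaps))
```

If the gaps cluster on multiples of 30 seconds, the sync problem is the
likelier suspect; irregular gaps point back toward the network side.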

Dr Blinn's reply points to the tu driver itself:

TPB>As my esteemed colleague who maintains the "tulip" ("tu") driver
TPB>noted to me:
TPB>
TPB>I found the following in the v4.0D patch documentation.
TPB>
TPB>Of course, for different versions, the numbers change, but
TPB>the patch remains the same.
TPB>
TPB>--------------------------------------------------
TPB>
TPB>
TPB>NEW PATCHID: 681.00
TPB>PATCH ID: OSF425-651
TPB>REQUIRED PATCHES: NONE
TPB>CONDITIONALLY REQUIRED PATCHES: NONE
TPB>SUPERSEDED PATCHES: OSF425-388-2 (297.02), OSF425-562 (597.00)
TPB>SPECIAL INSTRUCTIONS: NONE
TPB>FULL DESCRIPTION:
TPB>PROBLEM:  (QAR 55766 QAR 60909)    (Patch ID: OSF425-388)
TPB>                =*=*=
TPB>This patch fixes the following problems that may occur on some DE500
TPB>adapters:
TPB>
TPB>o The hardware setup operation may interrupt a pending ARP packet
TPB>  transmission.
TPB>
TPB>o If the cable to the adapter is not connected, the hardware setup
TPB>  operation will not execute.
TPB>
TPB>PROBLEM:  (CLD ALC-08171)    (Patch ID: OSF425-562)
TPB>                =*=*=
TPB>
TPB>When using a DE504-BA in an AS800 with a second SCSI controller on
TPB>the shared PCI bus, the DE504 experiences DMA latency and increases
TPB>its buffer size from 128 bytes to 1024 bytes in four steps during
TPB>heavy load and finally goes into a store/forward mode. When this
TPB>happens the device does a reset and stops working for approximately
TPB>2 seconds. During this time all incoming datagrams and messages are
TPB>lost.
TPB>
TPB>This patch adds a kernel global variable to the tu driver that
TPB>specifies whether store/forward mode should be permanently enabled
TPB>when the tu driver starts. To enable this mode, patch the variable
TPB>tu_sf_mode using dbx:
TPB>
TPB># dbx -k /vmunix
TPB>
TPB>(dbx) patch tu_sf_mode=1
TPB>1
TPB>(dbx) quit
TPB>#
TPB>
TPB>And the following DOES require an actual patch for V4.0D; I believe
TPB>the fix is in the later releases.
TPB>
TPB>PROBLEM:  (QAR 65058, QAR 65259)    (Patch ID: OSF425-651)
TPB>                =*=*=
TPB>The patch fixes a problem in the tulip driver. The tulip driver
TPB>needs to support DC21143-xD Errata V4.0 for Ethernet connections.
TPB>This chip is currently being used on Compaq Professional Workstation
TPB>XP1000 (as well as several others in the near future).
TPB>
TPB>
TPB>FILES:
TPB>./sys/BINARY/tu.mod
TPB>        CHECKSUM:       61916 43
TPB>        SUBSET: OSFHWBIN425
TPB>        ./kernel/io/dec/netif/if_tu.c,v   RCS ID: 1.1.145.4
TPB>SUPPORT NOTES: NONE
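
To make the behaviour described in the patch notes easier to follow, here
is a small Python model of it. It is a sketch under assumptions: the
notes give only the endpoints (128 and 1024 bytes), so the intermediate
thresholds of 256 and 512 bytes are my guess, and all names are
hypothetical. The force_store_forward flag stands in for patching
tu_sf_mode to 1; it is not the driver's actual code.

```python
# Assumed threshold ladder in bytes; only 128 and 1024 come from the patch notes.
THRESHOLDS = [128, 256, 512, 1024]

def next_state(threshold, store_forward):
    """State after one transmit FIFO underflow: (threshold, store_forward)."""
    if store_forward:
        # Already in store/forward mode: nothing left to adjust.
        return threshold, True
    idx = THRESHOLDS.index(threshold)
    if idx + 1 < len(THRESHOLDS):
        # Raise the transmit threshold to the next step.
        return THRESHOLDS[idx + 1], False
    # Ladder exhausted: fall back to store/forward mode.
    return threshold, True

def simulate(underflows, force_store_forward=False):
    """List of states visited, starting from the 128-byte default."""
    state = (THRESHOLDS[0], force_store_forward)
    history = [state]
    for _ in range(underflows):
        state = next_state(*state)
        history.append(state)
    return history

print(simulate(4))                             # adaptive behaviour
print(simulate(4, force_store_forward=True))   # modelled tu_sf_mode=1
```

In this model, starting in store/forward mode means no further
adjustments take place, which is the point of the tu_sf_mode setting.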

On my side, I was able to perform a few checks on the network side during
the weekend, and I finally found an acceptable solution (for us at
least):

In a word, the source of the observed network slowdown is that the switch
ports connecting our Unix servers were incorrectly configured to
participate in Spanning Tree Protocol (STP) exchanges.

In more detail:

I was able to observe that the DE500 briefly takes the link down upon
enlarging its FIFO buffer. This brief link transition was seen by the
Ethernet switch port (with STP enabled) as a network topology change,
which caused the port to enter the blocking state for the duration
specified by the "bridge forward delay timer", which is 15 seconds by
default with our network hardware. So, every time the FIFO size was
adjusted, the server was cut off from the rest of the network for this
interval. Since our servers do not themselves act as bridging devices in
the network, there is no need to have STP enabled on the switch ports
they are connected to. Disabling STP on the server ports effectively
prevents a brief link transition from triggering the 15-second wait.

Take note that the interruption in network traffic caused by the DE500
itself still exists, but it is down to below 1 second in duration, which
is acceptable for our type of network traffic (having no real-time data
carried on the network).
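
The difference can be put in rough numbers with a few lines of Python.
This is only back-of-the-envelope arithmetic: the function name is made
up, and the 1-second figure is used as an upper bound for the DE500's own
link interruption, on the assumption that the switch's forward delay
simply adds to it when STP is enabled.

```python
def outage_seconds(link_flap_s, forward_delay_s, stp_enabled):
    """Approximate time the server is unreachable after one FIFO adjustment."""
    # With STP enabled, the port blocks for the forward delay after the flap.
    return link_flap_s + (forward_delay_s if stp_enabled else 0.0)

# Values from our site: 15 s forward delay, at most about 1 s link flap.
print(outage_seconds(1.0, 15.0, stp_enabled=True))
print(outage_seconds(1.0, 15.0, stp_enabled=False))
```

A site with a different forward delay setting on its switches would
substitute its own value.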

Thanks again to all who replied.

===============================================
Charles Vachon tel: (418) 627-6355 x2760
  email: cvachon2_at_mrn.gouv.qc.ca
  Administrateur de système
  FRCQ/Ministère des Ressources
  Naturelles du Québec
===============================================

Received on Mon Jun 26 2000 - 15:04:56 NZST
