SUMMARY: bad batch of tulips? (and more Q's) from Jim Wright on 1995-12-22 (tru64-unix-managers)

From: Jim Wright <jwright_at_phy.ucsf.edu>
Date: Thu, 21 Dec 1995 19:47:24 -0800 (PST)

While I consider this issue far from closed, I have enough information
and enough people have shown interest to warrant a summary. I still
have some questions, and I need to get more info from DEC. Please write
if you can help. I'll post another summary if I find out anything else.

The essence of my question was, "why do tulip ethernet interfaces
appear to have so many more collisions than lance interfaces?"

The answer is that tulip and lance report collisions in different ways.
For each packet transmitted, a lance interface reports one of three
outcomes: no collisions, exactly one, or more than one. The tulip reports
the exact number of collisions for each packet transmitted. (Up to 15
collisions may occur before the packet is aborted.) Thus "netstat -i"
reports different statistics for tulip and lance.

There have been reports of problems with kernel code in 2.0 and 3.0
for the tulip drivers. Patches have been made available. No one had
information on whether 3.2 needed patches or if the patches exist.

What I _should_have_ asked is "does throughput differ between tulip
and lance, and why?" To this question, I can say yes, they differ, but
I have not yet had any explanation as to why. In my tests the tulip is
inferior to the lance.

Below is a brief test showing throughput results at my site, followed
by the original question, all answers I have received so far, and
a couple of pertinent articles from this list's archives. There is
some good info below, so it is worth reading.

Thanks everyone! Keep the info coming.

Jim Wright                 Keck Center for Integrative Neuroscience
jwright_at_phy.ucsf.edu     Box 0444, Room HSE-802
voice 415-502-4874         513 Parnassus Ave
fax   415-502-4848         UCSF, San Francisco, CA 94143-0444

===========================================================================
TULIP throughput versus LANCE
===========================================================================

You can get a program to test TCP throughput called ttcp from
    ftp://sgi.com/sgi/src/ttcp/
It opens a socket between two machines and pumps 16 megabytes through.
From this I derived the following table showing the throughput between
various machines. All machines are on the same subnet, all machines
connect to the same non-switching, Cabletron 10bT hub. Means do not
include test-to-self. Standard disclaimers for benchmarks apply.

keck       AlphaStation 600 5/233, tulip
phy        DEC 3000/500, lance
amadeus    DEC 3000/700, lance
basie      AlphaStation 200 4/233, tulip
miles      AlphaStation 200 4/233, tulip
satchmo    AlphaStation 600 5/233, tulip

      \sink      keck      phy  amadeus    basie    miles  satchmo          MEAN
source\      ======== ======== ======== ======== ======== ========       ========

keck ====       18019     1083     1090      793      777      836 ~~~~      915
phy =====         974     4987     1087      898      921      959 ~~~~      967
amadeus =         953     1080    12351      872      908      959 ~~~~      954
basie ===         809     1093     1079     9832      809      798 ~~~~      917
miles ===         807     1090     1098      761     9504      822 ~~~~      915
satchmo =         815     1096     1095      812      818    16288 ~~~~      927
             ~~~~~~~~ ~~~~~~~~ ~~~~~~~~ ~~~~~~~~ ~~~~~~~~ ~~~~~~~~       ~~~~~~~~
MEAN ====         871     1088     1089      827      846      874 ~~~~      931

I draw three conclusions from this test:

    tulip is significantly worse at reads than the lance

    tulip is worse at writes than the lance

    the test-to-self gives a rough indication of the power of the machine
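
The per-host means in the table can be recomputed from the raw figures.
What follows is a minimal Python sketch, not part of the original test.
The data layout and the names HOSTS, TABLE, source_mean and sink_mean are
illustrative assumptions; the numbers are copied from the table above.

```python
# Throughput figures copied from the table (rows = source, columns = sink).
HOSTS = ["keck", "phy", "amadeus", "basie", "miles", "satchmo"]
TABLE = {
    "keck":    [18019, 1083, 1090, 793, 777, 836],
    "phy":     [974, 4987, 1087, 898, 921, 959],
    "amadeus": [953, 1080, 12351, 872, 908, 959],
    "basie":   [809, 1093, 1079, 9832, 809, 798],
    "miles":   [807, 1090, 1098, 761, 9504, 822],
    "satchmo": [815, 1096, 1095, 812, 818, 16288],
}


def source_mean(host):
    """Mean throughput with `host` sending, skipping the test-to-self."""
    own = HOSTS.index(host)
    values = [v for col, v in enumerate(TABLE[host]) if col != own]
    return sum(values) / len(values)


def sink_mean(host):
    """Mean throughput with `host` receiving, skipping the test-to-self."""
    own = HOSTS.index(host)
    values = [TABLE[src][own] for src in HOSTS if src != host]
    return sum(values) / len(values)


for host in HOSTS:
    print(f"{host:8s} source {source_mean(host):7.1f}   sink {sink_mean(host):7.1f}")
```

Truncating each result to an integer reproduces the per-host MEAN row and
column shown in the table.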

===========================================================================
ORIGINAL QUESTION
===========================================================================

I continue to try and find out why the performance of all our Alphas went
to hell when I "upgraded" from 2.0 to 3.2. Today's suspect -- the
tulip ethernet interface.

All the tulip-based machines (about 10) are consistently worse than
the lance-based machines (about 20) with regard to collisions. Most
of the tulip cards show about 10% to 25% collision rates. Most of
the lance cards show 1% to 8%. One machine is particularly bad.
The two previous times I checked, it had 90% and 60% collision rates.
I rebooted the machine this morning. Now it shows

Name  Mtu   Network  Address              Ipkts Ierrs   Opkts Oerrs   Coll
tu0   1500  <Link>   08:00:2b:e4:bb:d8  1273834     0  348590     2 544050

for a whopping 156% collision rate! (collisions/output_pkts) I've
gone by the rule that anything above 10% is unacceptable. In my
experience with HP and Sun machines, I rarely see anything above 2%.
The Alphas seem to average about 4% (ignoring outliers). None of
this makes me feel good.
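
The collision rate quoted above is simply Coll divided by Opkts. As a
minimal Python sketch (the function name collision_rate is an
illustrative assumption, and the parsing assumes the last five fields of
a "netstat -i" line are Ipkts, Ierrs, Opkts, Oerrs and Coll, as in the
output shown above):

```python
def collision_rate(netstat_line):
    """Return collisions as a percentage of output packets."""
    fields = netstat_line.split()
    # Assumed column order: ... Ipkts Ierrs Opkts Oerrs Coll
    ipkts, ierrs, opkts, oerrs, coll = (int(f) for f in fields[-5:])
    return 100.0 * coll / opkts


line = "tu0 1500 <Link> 08:00:2b:e4:bb:d8 1273834 0 348590 2 544050"
print(f"{collision_rate(line):.0f}%")  # prints 156%
```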

The boot message shows

Dec 20 09:45:26 basie vmunix: tu0: DECchip 21040-AA: Revision: 2.3
Dec 20 09:45:26 basie vmunix: tu0 at pci0
Dec 20 09:45:26 basie vmunix: tu0: DEC TULIP Ethernet Interface, hardware address: 08-00-2B-E4-BB-D8
Dec 20 09:45:26 basie vmunix: tu0: selecting 10BaseT port

There is a strong correlation on our tulip boxes showing that older
machines have worse collision rates. The above machine was one of
our first alphastation purchases.

All machines are running 3.2B, the upgrade to 3.2C is in progress.
All wiring is 10baseT using commercially made cables (AMP). The
infrastructure wiring is all Cat-5. All systems go to Cabletron hubs.

Is there something fundamentally wrong with tulip boards? HELP!

===========================================================================
REPLIES
===========================================================================
From: C.Morgan_at_soc.staffs.ac.uk (Craig Morgan)

Beware!!

I've been wrong-footed by Tulip reports in the past ... DEC apparently went
to town on writing the Tulip driver and consequently it gives much more
accurate (pedantic in my experience!) reporting.

Basically I run two ethernet cards (one tulip) in the same AlphaServer both
currently on the same backbone, the Tulip generally reports much more
verbosely what is going on. I've basically become sceptical about the
reporting and tend to average out the two sets of results ... there was a
pretty good discussion about this on the list a few months ago.

--
Craig
                            ,,,   Wot, NO mountains!
 ======================oOO=(o o)=OOo===================================
  Craig Morgan              (_)      Lecturer, CS Group
  School of Computing                Email: C.Morgan_at_soc.staffs.ac.uk
  Staffordshire University           Phone: +44 (0)1785 353466
  Beaconside                         Fax:   +44 (0)1785 353497
  Stafford, UK  ST18 0DG             Pager: +44 (0)839 453754
  "It's the downhill thrills, that make the uphill slog worthwhile..."
 ======================================================================
===========================================================================
From: Hellebo Knut <Knut.Hellebo_at_nho.hydro.com>
Regards,
At least for 3.0 I know there are patches for the tulip drivers. Maybe they
didn't make it in time for 3.2 and you still have to install these ??
Contact DEC for info.
-- 
      ******************************************************************
      *         Knut Helleboe                    | DAMN GOOD COFFEE !! *
      *         Norsk Hydro a.s                  | (and hot too)       *
      * Phone: +47 55 996870, Fax: +47 55 996342 |                     *
      * Pager: +47 96 500718                     |                     *
      * E-mail: Knut.Hellebo_at_nho.hydro.com       | Dale Cooper, FBI    *
      ******************************************************************
===========================================================================
From: Martyn Johnson <Martyn.Johnson_at_cl.cam.ac.uk>
I think I've read somewhere that different ethernet controller chips REPORT
collisions differently.  For example, if a packet collides 3 times and then
goes on the fourth attempt, some chips report that as 1 collision (because one
packet collided) whereas others report it as 3 collisions (because that's what
happened on the wire).

Fundamentally, whether a transmission collides or not is going to depend on
what is on the wire rather than the particular controller chip. Apart from
pathological timing effects, the performance of a particular chip or board is
unlikely to have any effect, except in so far as a high-performance interface
will load the network more and hence increase the general collision rate.

My guess is that the general difference you are seeing between lance-based and
tulip-based interfaces is an artefact.  I suspect that there is some hardware
problem with the machine that is absurdly bad - either the machine itself is
faulty or there is some problem with its connection.

I only have one tulip-based machine, and its ethernet performance seems fine
to me (about 7.1 to 7.6 Mbit throughput using TCP with the machine in normal
service). It is running 3.2A. It is not meaningful for me to compare collision
rates because we are using switched ethernet, so traffic levels on different
segments vary anyway.

I suggest that you pay less attention to collision rate and start measuring
throughput with something like ttcp. Throughput is, after all, what actually
matters.
-- 
Martyn Johnson      maj_at_cl.cam.ac.uk
University of Cambridge Computer Lab
Cambridge UK
===========================================================================
From: Dave Cherkus <cherkus_at_UniMaster.COM>
You can't directly compare lance and tulip reports this way.  Here's
something I wrote a while ago on this topic:
    Newsgroups: comp.unix.osf.osf1
    Subject: Re: V3.0 E-Net Collisions with ftp
    Organization: UniMaster, Inc.
    Date: Wed, 4 Jan 1995 02:29:32 GMT
    You are making a reasonable yet inaccurate assumption that the counters
    are maintained the same way on both machines, but they are not because
    the interfaces use two different chips and the chips used in the tu0
    interface are more accurate than the ones used in the 2000/300 (ln0?)
    interface.

    The AMD LANCE ethernet chip, used in the 2000/300 and also used for
    many years in DEC and many other vendors' equipment, tells the kernel
    one of the following things happened after a frame is transmitted:

      - no collisions occurred
      - exactly one collision occurred
      - two or more collisions occurred

    The Ethernet standard says that up to 15 collisions can occur before
    the transmission is aborted,  so the LANCE does not communicate the
    full story back to the kernel.

    The kernel increments the netstat collision counter once when exactly
    one collision occurred, and by two when two or more collisions
    occurred.  This is inaccurate, but it's the best the kernel could do.
    It's not just inaccurate, it's always optimistic.  This is why you
    think you are getting 'excessive' collisions - you've been lied to
    by the AMD LANCE in the past.

    The older DEC SGEC chip (ne0) and the newer DEC TGEC chip (te0, tu0)
    can tell the kernel exactly how many collisions occurred, and this is
    what netstat reports.  The AMD LANCE used in TurboChannel and ISA
    systems is fading into the sunset...

    You can identify which chip is being used by the message that appears
    at boot time, or by the interface name (ln0 is AMD LANCE, most of the
    others are tu0).

    If you feel more comfortable with the 'classic' statistic, you can run
    the command

        # netstat -I tu0 -is

    and look for 'single collision' and 'multiple collision', then add the
    'single collision' count to two times the 'multiple collision' count to
    get the 'classic' statistic.
--
Dave Cherkus ----- UniMaster, Inc. ----- Contract Software Development
Specialties: UNIX TCP/IP X OSF/1 AlphaAXP AIX RS/6000 Performance ISDN 
Email: cherkus_at_UniMaster.COM  Tel: (603) 888-8308  Fax: (603) 888-8308
if (cpu.type == PENTIUM && cpu.step < 8)   { panic("Intel Inside!"); }
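
The counting rules described in the article above can be sketched in a
few lines of Python. This is an illustration only, not driver code: the
function names and the sample list of per-frame collision counts are
made up, and it assumes the 'single collision' and 'multiple collision'
counters each count frames, as the article implies.

```python
def lance_style_count(per_frame_collisions):
    """Counter as kept for a LANCE: +1 for one collision, +2 for two or more."""
    total = 0
    for n in per_frame_collisions:
        if n == 1:
            total += 1
        elif n >= 2:
            total += 2
    return total


def exact_count(per_frame_collisions):
    """Counter as kept for a chip that reports every collision."""
    return sum(per_frame_collisions)


def classic_statistic(single, multiple):
    """'Classic' figure rebuilt from the single and multiple collision counters."""
    return single + 2 * multiple


# Hypothetical collisions suffered by six transmitted frames.
frames = [0, 1, 3, 0, 15, 2]
single = sum(1 for n in frames if n == 1)
multiple = sum(1 for n in frames if n >= 2)

print("exact count:      ", exact_count(frames))
print("lance-style count:", lance_style_count(frames))
print("classic statistic:", classic_statistic(single, multiple))
```

For the same traffic the lance-style figure can never exceed the exact
one, which is the sense in which the article calls it optimistic.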
===========================================================================
From: Mike Iglesias <iglesias_at_draco.acs.uci.edu>
See the message included below for an answer to your question.  I got
it from the WAIS search feature of the 
  http://www-archive.stanford.edu/lists/alpha-osf-managers/hyper/
archive.
Mike
    [S] Tulip Ethernet Controller Collision Rate
    Bivins, Jeff (BIVINS_at_nebeng.otis.utc.com)
    Sat, 30 Sep 1995 10:36:32 -0600 (CST)
    My Original question is:
    > Hello all,
    > I have 35 AlphaStation 250 4/266 workstations and 2 AlphaServer 2100 4/233
    > servers. All of these machines have a DEC TULIP PCI ethernet card. When
    > using the 'monitor' tool I see on the average 30-40 percent of collision
    > on a high throughput transfer.
    > When I send a large file from one of these machines to a DECsystem 5900,
    > the high collision rate only exists on the Alpha side and not the
    > DECsystem side.
    > Is this a tuning issue ?
    Nope. It's normal.
    > How can I resolve this issue ?
    Thanks to those who responded
    Matt Thomas
    Dave Cherkus
    J. Dean Brock
    Dave Golden
    The consensus is that the TULIP controller reveals accurate statistics on
    collisions, where the LANCE controller does not.
    I will look at this problem from a network perspective.
    Thanks,
    Jeff
===========================================================================
From: David Lucas <dlucas_at_worldbank.org>
Jim -

We noticed the same problem with our 2 2100s in a DECsafe ASE
environment.  One of our Digital support people dug around in the
internal archives and found a paper entitled, "The Ethernet Capture
Effect: Analysis and Solution", K.K. Ramakrishnan and Henry Yang, (rama,
yang_at_erlang.enet.dec.com).

In a nutshell, the abstract describes the effect as a situation "where a
station transmits consecutive packets exclusively for a prolonged period
despite other stations contending for access."  Essentially, the Tulip
interfaces, when transmitting, take over the wire never giving other
systems a chance to send their packets.  The solution is a proposed
algorithm, Capture Avoidance Binary Exponential Backoff, that includes
"an enhanced backoff algorithm for collision resolution in the special
case when a station attempts to capture the channel subsequent to an
uninterrupted consecutive transmit."

Of course, none of this offers much practical advice on how to fix the
immediate problem.  In our case, we believed our Alphas were having a
negative effect on our overall network, and simply bridged them onto
their own segment.  It hasn't much improved the performance for those 2
systems, but at least our network guys can't point the finger at us when
they do have problems.  :)

The paper is 31 pages long, and I don't have an electronic copy.  What I
can try and do is scan it and mail it to you.  (I have no way of making
a document available for anonymous ftp.)  It may take a day or so, as
it's a bit hectic today.

Hope this is of some help to you.

d.
=======================================================================
David Lucas				E-mail:	dlucas_at_worldbank.org
The World Bank				Phone:	202.458.5214
	Practice random, senseless acts.
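
The capture effect described in the reply above can be illustrated with
the standard truncated binary exponential backoff rule: after its n-th
collision on a frame, a station waits a random number of slot times
chosen from 0 to 2^min(n,10) - 1. The Python sketch below is a
simplified two-station model, not the analysis from the paper, and it
does not implement the proposed Capture Avoidance algorithm; the function
names are illustrative assumptions. Station A has just sent a frame, so
its next frame has collided only once, while station B has already
collided several times on the frame it is still trying to send.

```python
from fractions import Fraction

BACKOFF_LIMIT = 10  # the backoff exponent stops growing after 10 collisions


def window(collisions):
    """Number of backoff slots to choose from after `collisions` collisions."""
    return 2 ** min(collisions, BACKOFF_LIMIT)


def contest(a_collisions, b_collisions):
    """Return (P(A goes first), P(tie, collide again), P(B goes first))."""
    wa, wb = window(a_collisions), window(b_collisions)
    a_first = tie = b_first = 0
    # Enumerate every equally likely pair of backoff choices.
    for ra in range(wa):
        for rb in range(wb):
            if ra < rb:
                a_first += 1
            elif ra == rb:
                tie += 1
            else:
                b_first += 1
    total = wa * wb
    return Fraction(a_first, total), Fraction(tie, total), Fraction(b_first, total)


for b_collisions in (1, 2, 3, 4):
    a_first, tie, b_first = contest(1, b_collisions)
    print(f"B after {b_collisions} collisions: "
          f"A first {a_first}, tie {tie}, B first {b_first}")
```

In this model each contest that B loses widens B's window while A starts
its next frame from a small one, so A becomes steadily more likely to
win again.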
===========================================================================
From: Selden E Ball Jr <SEB_at_LNS62.LNS.CORNELL.EDU>
Jim,
I just took a quick look at the e'net interfaces on our Alphas.
We have old and new "tulip" systems as well as lots of 3000 series systems.
As best I can tell, the collision rates of both types are consistent
with the traffic on the ethernet segments to which they are connected.
Have you compared the collision rates of all of the systems
which are plugged into the same hub? I'd expect the ratio
of Opkts/Coll to be about the same there.
Selden
===========================================================================
From: "Jonathan B. Craig" <jcraig_at_i2k.net>
I don't know but I have been testing DEC NSR and have found that network
backups on my (very early model) DEC 2100 w/ Tulip cards have an 
incredible amount of collisions (50% normal).  If you get a suitable
response let me know!
-- 
Jonathan B. Craig                                      jcraig_at_gfoods.com
Gordon Food Service
===========================================================================
From: nick_at_alldata.com (Frank "Nick" Riley)
	I was reading through the archive a month or so ago, and I recall
    reading a bunch of messages regarding a bug in the TULIP driver in
    DU 3.? that required a patch. The symptom was intermittent "voids" in
    the interface where absolutely no traffic passed. Look through the
    archive at http://www.ornl.gov/cts/archives/mailing-lists/ and search
    for "TULIP".
===========================================================================
From: ccult1!bommel!dehartog_at_relay.nl.net
Hello Jim,
You may want to ask your friendly Digital support people for
the patch: OSF350-070 (it's mandatory!).
Good luck!
===========================================================================
From: em_at_icess.ucsb.edu (Ed Mehlschau)
We received a tulip interface in a new AlphaStation that yielded very
poor performance until it was configured to run half-duplex instead of
full-duplex.  Apparently DEC ships them in the full dux configuration.
I have been told that the config is changed from the boot PROM, but I
don't know the exact incantation offhand, sorry.
-- Ed
===========================================================================
From: anthony baxter <anthony.baxter_at_aaii.oz.au>
Just as a data point, I just checked our 4/233's and they all show
similar numbers (anything from 20% to 30%). These are 3.2A systems (they
go to 3.2C next week), and they show the same boot info for the tulip
card as your systems. They're plugged into a switching hub, so there is
no way in hell they should be seeing that level of errors. 
tu0: DECchip 21040-AA: Revision: 2.3 
tu0: DEC TULIP Ethernet Interface, hardware address: 08-00-2B-E4-56-EF
I'd be very interested in anything you find out - I'm hoping it's just
a bug in the reporting code, but in any case it would be good to have it
fixed...
Anthony
===========================================================================
And a couple things I found in the A-O-M archives.
===========================================================================
Subject: (belated) SUMMARY: ethernet constipation on 2100 A500MP
X-Url: http://www.ornl.gov/its/archives/mailing-lists/alpha-osf-managers/1995/02/msg00346.html
Back in (I think) October I posted a description of a problem with the
Sable's ethernet interface.  (Periodically, and for no apparent reason,
inbound packets would get stuck.  As soon as the system sent a packet
to some other machine, the inbound clog would clear.)
Through a combination of absentmindedness and overwork, I never did get
around to posting a summary.  So better late than never, here it is ...
I got some really helpful replies from a couple of DEC folks (who shall
remain nameless to keep them from getting swamped with unsolicited mail).
The first reply I got said
|  [...] I believe you're seeing a bug in the Tulip driver. One
| that was recently discovered, and that too quite by accident.
| (A line of code was deleted and did not get reinstated.)
| It has to do with the driver failing to reset a timer when the
| transmit ring transitions to an inactive state (0 entries pending).
| Each time a transmit packet is given to the device, a timer is
| reset to go off after 5 seconds. This timer therefore never goes
| off if the device is kept busy. If, however, a new transmit does
| not come in within 5 seconds of the last one, then the timer
| goes off and the interface is reset. I believe this reset is what
| causes things to get hung-up.
The bug apparently first appeared in V2.0b, but was discovered too late
for a fix to make it into V3.0.  Anyway, the helpful DEC person sent me
a patched version of the TULIP driver, and the problems disappeared.  He
also mentioned that he had arranged for the patches to be made available
through Digital's Customer Support Center (for folks covered by a support
contract, of course).  The relevant patch numbers are
        OSFV20-065      (for OSF/1 V2.0b)
and
        OSFV30-40       (for OSF/1 V3.0)
Mark Bartelt                                                416/978-5619
Canadian Institute for                             mark_at_cita.toronto.edu
Theoretical Astrophysics                           mark_at_cita.utoronto.ca
"Clothes not busy being worn are busy drying."  -  Dylan, on laundry day
          [ singing "It's all right, ma (I'm only bleaching)" ]
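
The timer behaviour described in the quoted DEC reply above can be
modelled in a few lines of Python. This is a toy model, not the driver
source: the class and method names are hypothetical, and it reads
"failing to reset a timer" as "failing to cancel the timer when the
transmit ring becomes empty", which is an assumption. The 5-second
timeout is the figure given in the reply.

```python
WATCHDOG_SECONDS = 5.0  # timeout quoted in the reply above


class TxWatchdog:
    """Toy model of a transmit watchdog timer (hypothetical names)."""

    def __init__(self, cancel_when_idle):
        self.cancel_when_idle = cancel_when_idle  # False models the bug
        self.pending = 0      # entries outstanding on the transmit ring
        self.deadline = None  # time at which the watchdog fires, if armed
        self.resets = 0       # number of interface resets triggered

    def transmit(self, now):
        """A packet is handed to the device: re-arm the watchdog."""
        self.pending += 1
        self.deadline = now + WATCHDOG_SECONDS

    def transmit_done(self):
        """The device finished a packet; disarm when the ring goes idle."""
        self.pending -= 1
        if self.pending == 0 and self.cancel_when_idle:
            self.deadline = None

    def tick(self, now):
        """Periodic check: reset the interface if the watchdog has expired."""
        if self.deadline is not None and now >= self.deadline:
            self.resets += 1
            self.deadline = None


def resets_after_idle(cancel_when_idle, idle_seconds=6.0):
    """Send one packet, go idle for `idle_seconds`, and count resets."""
    dog = TxWatchdog(cancel_when_idle)
    dog.transmit(now=0.0)
    dog.transmit_done()
    dog.tick(now=idle_seconds)
    return dog.resets


print("resets without cancel-on-idle:", resets_after_idle(False))
print("resets with cancel-on-idle:   ", resets_after_idle(True))
```

In the model, the variant that never disarms the timer resets an idle
but healthy interface, which matches the symptom described in the reply.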
===========================================================================
Subject: SUMMARY: tu0: packet dropped: no mbuf (again).
X-Url: http://www.ornl.gov/its/archives/mailing-lists/alpha-osf-managers/1995/07/msg00190.html
Thanks for the reply.  DEC was very quick in getting back to me, and I was able
to ftp the patch, install it and rebuild the kernel within an hour of my call
to DEC.  I am including the response I received from Matt Thomas describing the
patch.
thanks again,
dan cambron
ORIGINAL:
---------------------
>I included a previous summary for reference.  I am at V3.2a on a 2100 using
>AdvFS and I'm still getting crashes and the message "tu0: packet dropped: no
>mbuf".  The move to v3.2a doesn't seem to be working. Any thing else I should
>do.  Is there a patch to v3.2a?  I also have a call in to DEC.
>thanks
>dan
REPLIES:
-----------------------------
There is a patch.
/usr/sys/BINARY/if_tu.o                 (USG-01533)
CHECKSUM: 33316     54
/usr/sys/data/if_tu_data.c
CHECKSUM: 13750      7
----------------------
Patch ID:  OSF320-044, OSF320-059
The Tulip (DECchip 21040) driver does not support software selection of
the 10Base2 (Thinwire) and 10Base5 (Thickwire) ports. As per the Tulip
specification, this selection is expected to be carried out in hardware,
and is done so on the DE425 and DE435 modules produced by Digital.
In the absence of a jumper solution or auto-sensing hardware, software can
also select between the 10Base2 and 10Base5 ports if the hardware
implementation utilizes a certain (undocumented) feature of the chip.
In particular, the 3-port PCI Ethernet card made by Standard Microsystems
Corporation (SMC) makes use of this feature, and the driver as shipped
today (since V2.0B) cannot select between the two AUI ports on this module.
This patch contains an enhanced media-sensing algorithm to allow software
selection of the 10Base2 and 10Base5 ports. This improved algorithm will
also provide better diagnostics on boards that use a jumper (such as the DE425
and DE435). For example, the driver will now warn the user if the jumper
position was set for Thinwire but no cable was connected to that port.
The driver will now display the following message:
  tu0: auto sensing: selected BNC (10Base2) port: no carrier
This patch also contains a fix for a problem where the driver will print out
 'packet dropped: no mbuf'  messages to the console repeatedly.  While this
happens, the system becomes unusable for all other activity and is effectively
hung from a user's point-of-view.
A kernel rebuild is required.
===========================================================================

Received on Fri Dec 22 1995 - 05:08:18 NZDT
