Followup #5: transfer times 10x difference from Dan Kirkpatrick on 1999-10-27 (tru64-unix-managers)

From: Dan Kirkpatrick <dkirk_at_suhep.phy.syr.edu>
Date: Tue, 26 Oct 1999 17:05:08 -0400

recap:
ftp from 100mbpsDEC to 100mbpsDEC remote:/dev/null is 4500kbytes/sec (ok)
ftp from 100mbpsDEC to 100mbpsLINUX remote:/tmp/FILE is 4500kbytes/sec (ok)
ftp from 10mbpsDEC to 10mbpsDEC remote:/dev/null is 1100kbytes/sec (ok)
ftp from 10mbpsDEC to 10mbpsDEC remote:/tmp/FILE is 1100kbytes/sec (ok)
ftp from 100mbpsDEC to 100mbpsDEC remote:/tmp/FILE is 33kbytes/sec (!?!?!?!)
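
To put the recap in perspective, each rate can be expressed as a fraction of
the nominal line rate. The Python sketch below is only a rough check: it
assumes "kbytes" means 1000 bytes, takes the link speeds as exactly 10 and
100 Mbit/s, and ignores protocol overhead. The helper name `utilization` is
made up for this illustration.

```python
# Rough link-utilization check for the recap figures.
# Assumptions (not stated in the post): 1 kbyte = 1000 bytes, nominal link
# rates of exactly 10 and 100 Mbit/s, and no allowance for protocol overhead.

def utilization(kbytes_per_sec, link_mbps):
    """Return throughput as a fraction of the nominal link rate."""
    link_kbytes_per_sec = link_mbps * 1_000_000 / 8 / 1000
    return kbytes_per_sec / link_kbytes_per_sec

# (label, observed kbytes/sec, link speed in Mbit/s), taken from the recap
recap = [
    ("100 DEC -> 100 DEC /dev/null", 4500, 100),
    ("100 DEC -> 100 LINUX /tmp/FILE", 4500, 100),
    ("10 DEC -> 10 DEC /dev/null", 1100, 10),
    ("10 DEC -> 10 DEC /tmp/FILE", 1100, 10),
    ("100 DEC -> 100 DEC /tmp/FILE", 33, 100),
]

for label, rate, link in recap:
    share = utilization(rate, link)
    print(f"{label:32s} {rate:5d} KB/s  {share:7.2%} of line rate")
```

Under those assumptions the working cases come out at roughly 36% and 88% of
line rate, while the problem case is well under 1%.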

Argh... someone suggested that they'd had a similar problem, upgraded their
firmware, and the problem went away. So I tried that... no go, but then
again, they were running 4.0D, and we were running 4.0B. I just upgraded to
4.0D with the latest jumbo patch and crossed my fingers again... no go.

Something is really screwy here... can't get 100mbps between 2 DEC Alphas
running 4.0D with the latest firmware (5.4), latest patches, recompiled
kernel, newer NICs (DE500-BA). I've forced 100mbps full duplex at the console
(ewa0_mode FastFD), in /etc/rc.config, and on the switch... which seems
consistent since I can get the throughput in some instances.

If you really want to read further... the analysis is long... but I'm about
to give up and put them at 10mbps, which by the way, does work as expected.

Here's the last 4 followups:
>See below for the checklist of my troubleshooting...
>This is a tough one!
>
>Ok... I've installed the new DE500-BA NIC (with 21143 chip, revision 3.0)
>cards and still no improvement.
>
>I'm not sure if it's the new cards or another issue, but a transfer to
>/dev/null OR a ram drive of a 25mb file usually takes 4-6 seconds
>(4400KB/s-5600KB/s). When transferring to a disk, it takes 770seconds
>(33KB/s), and NOW it takes 3300sec (7.5KB/s). On a quiet system I got it to
>transfer at the usual problem speed of 770sec (33KB/s). One may suggest a
>scsi bus or disk problem, but when doing anything locally on the system, the
>scsi bus & disks operate at expected wide-scsi speeds, and a transfer from a
>10mbps machine results in fair timing of a 10mbps bottleneck at 22sec
>(1100KB/s).
>
>Is there documentation somewhere on the firmware level settings on these
>Aspen/Durango motherboards? I notice some of the settings at console
>firmware level were unset, such as ewa0_def_subnetmask 0.0.0.0
>ewa0_def_inetaddr 0.0.0.0
>os_type NT
>etc...
>I tried setting the subnetmask & inetaddr which didn't help.
>Should I be checking something else?
>
>I also tried doing transfers on a quiet system, no logins, no nfs mounts,
>direct crossover cable between the two with the following speed settings:
>ewa0_mode=Fast
>rc.config file with IFCONFIG_0="128.230.57.16 netmask 255.255.255.0 speed 100"
>I even forced the speed/duplex with "lan_config -a0 -mutp -s100 -x0"
>
>I'm not used to interpreting this level of network usage, so I'm shooting in
>the dark, but...
>
>spray from 10mbps machine to 100mbps machine:
>ahep4.phy.syr.edu> spray hepsu03
>sending 1162 packets of lnth 86 to hepsu03 ...
> in 0.4 seconds elapsed time,
> no packets dropped by hepsu03
> 3012 packets/sec, 253.0K bytes/sec
>
>spray from 100mbps machine to 10mbps machine:
>hepsu03.phy.syr.edu> spray ahep4
>sending 1162 packets of lnth 86 to ahep4 ...
> in 10.0 seconds elapsed time,
> 623 packets (53.61%) dropped by ahep4 <--reasonable with only receive at 10mbps?
>Sent: 115 packets/sec, 9.7K bytes/sec
>Rcvd: 53 packets/sec, 4.5K bytes/sec
>
>spray from 100mbps machine to 100mbps machine:
>hepsu03.phy.syr.edu> spray hepsu02
>sending 1162 packets of lnth 86 to hepsu02 ...
> in 0.1 seconds elapsed time,
> 318 packets (27.37%) dropped by hepsu02 <--reasonable?
>Sent: 20515 packets/sec, 1723.0K bytes/sec
>Rcvd: 14900 packets/sec, 1251.4K bytes/sec
>
>but a spray of larger packets results in better performance
>root_at_hepsu03:> spray -l 1024 -c 1000 hepsu02
>sending 1000 packets of lnth 1026 to hepsu02 ...spray: send error RPC: Unable to send
>
> in 0.2 seconds elapsed time,
> 1 packets (0.10%) dropped by hepsu02
>Sent: 4530 packets/sec, 4539.8K bytes/sec
>Rcvd: 4526 packets/sec, 4535.3K bytes/sec
>
>Should this suggest something?
>
>Perhaps a firmware problem? Here's the versions we have... The motherboard
>is labeled as a Durango II, I'm not sure who to inquire about firmware,
>Aspen or Dec.
>May 20 16:56:22 hepsu03 vmunix: Digital AlphaPC 164LX 533 MHz system
>May 20 16:56:22 hepsu03 vmunix: Firmware revision: 4.9
>May 20 16:56:22 hepsu03 vmunix: Digital UNIX PALcode version 1.22
>
>After analyzing tcpdump output, it showed fragmenting...
>>Also, fragments received is indicative of some network problem
>>that is causing fragmentation. Again, a rogue node, a connector
>>that is shorting out...
>>
>>I would look at the network with a sniffer or tcpdump to
>>see if there is some network problem.
>>
>>Also the speeds between the card and the switch.
>
>Although a connector or duplicate IP is plausible, I've checked and can see
>no duplicate IP's. I've checked and double checked the speed of the cards,
>and they transfer to a ramdrive or /dev/null at expected speeds with little
>to no errors/collisions, so I would think that would rule that out at this
>point.
>The problem persists even when isolated to 2 machines with a direct
>crossover cable.
>
>when doing a "rcp 25mbfile hepsu02:/usr/_test/."
>SOURCE:
>hepsu03.phy.syr.edu> netstat
>Active Internet connections
>Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
>tcp        0      0  hepsu03.telnet     hepsu02.1031       ESTABLISHED
>tcp        0  39956  hepsu03.1023       hepsu02.shell      ESTABLISHED
>tcp        0      0  hepsu03.1254       hepsu02.telnet     ESTABLISHED
>tcp        0      0  hepsu03.telnet     suhep.1575         ESTABLISHED
>tcp        0      0  hepsu03.telnet     suhep.1486         ESTABLISHED
>
>DESTINATION:
>root_at_hepsu02:> netstat
>Active Internet connections
>Proto Recv-Q Send-Q  Local Address      Foreign Address    (state)
>tcp        0      0  hepsu02.shell      hepsu03.1023       ESTABLISHED
>tcp        0      0  hepsu02.telnet     hepsu03.1254       ESTABLISHED
>
>
>I'll try to remember my checklist here...
>---------------------------------------------------------
>1. force NIC speed settings at console level, /etc/rc.config, and lan_config
>(done)
>
>2. install latest jumbo patch for DEC unix 4.0B (done)
>
>3. unreachable host? no, single pings and ping floods both respond reasonably
>
>4. routing problems? no, traceroute shows no additional hops, just a direct
>connection
>
>5. gateway problems? no, traceroute shows it doesn't go through the gateway,
>additionally, there are 0 bad checksums with nfsstat -s
>
>6. system cpu load problem? no, same timings on 100% load and 0% load. But
>a spray command shows consistent 20-30% packets dropped, even at 0% load.
>
>7. nfs load problem? no, disabled all nfsmounts with same timing problems.
>
>8. cabling problem? no, since transfer to ramdrive or /dev/null shows
>better-than-10mbps performance, and transfer to a file to/from a 10mbps
>machine shows 1100KB/s performance
>
>9. network congestion? no, tried a direct isolated connection with a
>crossover cable, very little to zero network collisions, also see #13
>regarding netstat -i output
>
>10. scsi bus/disk problem? no, local access is reasonable, and transfer
>to/from 10mbps machine transfers fine.
>
>11. NIC card problem? no, tried several different systems and replaced 2
>cards with newer revision with same timing problems
>
>12. what do I use for timing? both the built in timing in ftp, and the time
>command when using a copy on nfs mounted disks and when using rcp to
>transfer files
>
>13. data corruption on network? tried troubleshooting in O'Reilly's "System
>Performance Tuning" pp176-206, netstat -i shows < 0.020% of Ierrs & Oerrs
>compared to Ipkts & Opkts.
>
>14. network integrity data from nfs? tried troubleshooting in O'Reilly's
>"System Performance Tuning" pp176-206, nfsstat -c shows less than 0.01%
>retrans of client udp calls.
>
>15. timeouts? nfs timeouts are 0 for these tests with nfsstat -c
>
>16. nfs workload and kernel table size? I've ruled out nfs problems by
>trying other tests, but have increased maxusers to 128 instead of the
>default of 32 in /etc/sysconfigtab with no improvement.
>
>17. sys_check? I've installed sys_check and ran it during a 25mb file
>transfer. Results before any changes are at
>http://www.phy.syr.edu/~dkirk/hepsu03.html results after changes to
>/etc/sysconfigtab are at http://www.phy.syr.edu/~dkirk/hepsu03b.html
>
>18. STREAMS table? netstat -m shows each setting near the peak usage for
>each one. I'm not sure if this is any issue since it says it's particularly
>important for System V.2, V.3, and xenix which have fixed stream parameters.
> V.4 dynamically controls parameters.
>
>That's the end of what I can remember, and what's listed in O'Reilly's
>"System Performance Tuning" Network Performance section.
>
>
>Perhaps someone has some further suggestions of something to look at or
>something to run, such as kernel configuration, /etc/rc.config file,
>/etc/sysconfigtab, etc... no one has yet mentioned any settings in the
>/usr/sys/conf/<SYSNAME> kernel config file or much in the /etc/rc.config
>file.
>
>Any advice at this point would be great...
>Dan
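
The drop percentages that spray reports in the followup above can be
re-derived from the packet counts. The short Python sketch below does only
that arithmetic; the helper name `drop_percent` and the run labels are
illustrative, and it says nothing about why the packets were dropped.

```python
# Recompute spray's drop percentage from the packet counts quoted above.
# The labels are informal descriptions of the three runs that reported drops.

def drop_percent(sent, dropped):
    """Percentage of sent packets that the receiver reported as dropped."""
    return 100.0 * dropped / sent

# (run, packets sent, packets dropped)
runs = [
    ("hepsu03 -> ahep4 (100 to 10)", 1162, 623),
    ("hepsu03 -> hepsu02 (100 to 100)", 1162, 318),
    ("hepsu03 -> hepsu02, -l 1024 -c 1000", 1000, 1),
]

for label, sent, dropped in runs:
    print(f"{label:38s} {drop_percent(sent, dropped):6.2f}% dropped")
```

The results agree with the 53.61%, 27.37% and 0.10% figures in the spray
output.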

>>Ok... quick recap...
>>ftp from 100mbps to 100mbps remote:/dev/null is 4500kbytes/sec
>>ftp from 100mbps to 100mbps remote:/tmp/FILE is 33kbytes/sec
>>if I do the same from a 10mbps machine to the same 100mbps machine, it goes
>>1100kbytes/sec whether to a file or /dev/null on that same 100mbps machine.
>>
>>If you look at the timing to /dev/null, it shows the network is fine; I've
>>checked duplex settings, cables, etc.
>>It's not related to disk speeds since they are wide devices and the
>>consistent 1100kbytes/sec xfer is to slower disks.
>>It's not load on the machine since I get the same timing for 100% load and
>>0% load.
>>I also tried adding speed 100 or speed 200 to /etc/rc.config and matching
>>these duplex settings with the switch and verified all sides are forced to
>>the same settings. I don't think it's hardware since it's the same
>>between 3 machines and I've tried a direct crossover cable (eliminating
>>the switch as a possibility) with the same results.
>>Packetfiltering (tcpdump) shows 1-2 sec lags between streams of packets
>>when ftping to a remote file, but no lags when ftping to remote /dev/null...
>>I think I've ruled out the disks since a transfer from a 10mbps machine
>>transfers at the rate it should to and from these disks, and local copies are
>>acceptable. I've also tried different disks...
>>
>>Hmmm.... ?
>>It's so far boggled me, this group, and dec support for over 2 weeks.
>>Perhaps there's a way to make a ram drive to try a write to something other
>>than /dev/null and disks on this scsi bus.
>>Any other ideas!?
>>
>>Just in case it may spark something...
>>Here's an excerpt from /var/adm/messages (2 scsi cards on this one):
>>
>>Apr 30 13:58:43 hepsu02 vmunix: Alpha boot: available memory from 0xae8000
>>to 0xfffe000
>>Apr 30 13:58:43 hepsu02 vmunix: Digital UNIX V4.0B (Rev. 564); Fri Apr 30
>>13:29:46 EDT 1999
>>Apr 30 13:58:43 hepsu02 vmunix: physical memory = 256.00 megabytes.
>>Apr 30 13:58:43 hepsu02 vmunix: available memory = 245.21 megabytes.
>>Apr 30 13:58:43 hepsu02 vmunix: using 975 buffers containing 7.61 megabytes
>>of memory
>>Apr 30 13:58:43 hepsu02 vmunix: Digital AlphaPC 164LX 533 MHz system
>>Apr 30 13:58:43 hepsu02 vmunix: Firmware revision: 4.9
>>Apr 30 13:58:44 hepsu02 vmunix: Digital UNIX PALcode version 1.22
>>Apr 30 13:58:44 hepsu02 vmunix: Module 1095:646 not in pci option table,
>>can't configure it
>>Apr 30 13:58:44 hepsu02 vmunix: pci0 at nexus
>>Apr 30 13:58:44 hepsu02 vmunix: itpsa0 at pci0 slot 5
>>Apr 30 13:58:44 hepsu02 vmunix: ITPSA VERSION V1.1.25 1998/03/26
>>Apr 30 13:58:44 hepsu02 vmunix: IntraServer ROM Version V1.0 c1998
>>Apr 30 13:58:44 hepsu02 vmunix: scsi0 at itpsa0 slot 0
>>Apr 30 13:58:44 hepsu02 vmunix: rz0 at scsi0 target 0 lun 0 (LID=0)
>>(SEAGATE ST34555W 0930) (Wide16)
>>Apr 30 13:58:44 hepsu02 vmunix: rz1 at scsi0 target 1 lun 0 (LID=1)
>>(SEAGATE ST423451W 0013) (Wide16)
>>Apr 30 13:58:44 hepsu02 vmunix: rz2 at scsi0 target 2 lun 0 (LID=2)
>>(SEAGATE ST423451W 0013) (Wide16)
>>Apr 30 13:58:44 hepsu02 vmunix: tz4 at scsi0 target 4 lun 0 (LID=3)
>>(Quantum DLT4000 CD50)
>>Apr 30 13:58:44 hepsu02 vmunix: trio0 at pci0 slot 6
>>Apr 30 13:58:44 hepsu02 vmunix: trio0: S3 Trio64V+ (SVGA) Plug-N-Play, 2.0 Mb
>>Apr 30 13:58:44 hepsu02 vmunix: tu0: DECchip 21140: Revision: 2.0
>>Apr 30 13:58:44 hepsu02 vmunix: tu0: auto negotiation capable device
>>Apr 30 13:58:44 hepsu02 vmunix: tu0 at pci0 slot 7
>>Apr 30 13:58:44 hepsu02 vmunix: tu0: DEC TULIP (10/100) Ethernet Interface,
>>hardware address: 00-00-F8-06-87-E0
>>Apr 30 13:58:45 hepsu02 vmunix: tu0: auto negotiation off: selecting
>>100BaseTX (UTP) port: half duplex
>>Apr 30 13:58:45 hepsu02 vmunix: isa0 at pci0
>>Apr 30 13:58:45 hepsu02 vmunix: gpc0 at isa0
>>Apr 30 13:58:45 hepsu02 vmunix: ace0 at isa0
>>Apr 30 13:58:45 hepsu02 vmunix: ace1 at isa0
>>Apr 30 13:58:45 hepsu02 vmunix: lp0 at isa0
>>Apr 30 13:58:45 hepsu02 vmunix: fdi0 at isa0
>>Apr 30 13:58:45 hepsu02 vmunix: fd0 at fdi0 unit 0
>>Apr 30 13:58:45 hepsu02 vmunix: itpsa1 at pci0 slot 9
>>Apr 30 13:58:45 hepsu02 vmunix: ITPSA VERSION V1.1.25 1998/03/26
>>Apr 30 13:58:45 hepsu02 vmunix: IntraServer ROM Version V1.0 c1998
>>Apr 30 13:58:45 hepsu02 vmunix: scsi1 at itpsa1 slot 0
>>Apr 30 13:58:45 hepsu02 vmunix: rz10 at scsi1 target 2 lun 0 (LID=4)
>>(SEAGATE ST118273LW 5766) (Wide16)
>>Apr 30 13:58:45 hepsu02 vmunix: rz11 at scsi1 target 3 lun 0 (LID=5)
>>(SEAGATE ST118273LW 5766) (Wide16)
>>Apr 30 13:58:45 hepsu02 vmunix: rz12 at scsi1 target 4 lun 0 (LID=6)
>>(SEAGATE ST118273LW 5766) (Wide16)
>>Apr 30 13:58:45 hepsu02 vmunix: rz13 at scsi1 target 5 lun 0 (LID=7)
>>(SEAGATE ST118273LW 5766) (Wide16)
>>Apr 30 13:58:45 hepsu02 vmunix: lvm0: configured.
>>Apr 30 13:58:46 hepsu02 vmunix: lvm1: configured.
>>Apr 30 13:58:46 hepsu02 vmunix: kernel console: ace0
>>Apr 30 13:58:46 hepsu02 vmunix: dli: configured
>>Apr 30 13:59:03 hepsu02 vmunix: SuperLAT. Copyright 1994 Meridian
>>Technology Corp. All rights reserved.
>>Apr 30 13:59:15 hepsu02 vmunix: pcxal_init_keyboard: keyboard init unsuccessful
>>Apr 30 13:59:16 hepsu02 vmunix: tu0: transmit FIFO underflow: threshold
>>raised to: 256 bytes
>>Apr 30 13:59:16 hepsu02 vmunix: tu0: transmit FIFO underflow: threshold
>>raised to: 512 bytes
>>Apr 30 14:02:04 hepsu02 vmunix: tu0: transmit FIFO underflow: threshold
>>raised to: 1024 bytes
>>Apr 30 14:19:33 hepsu02 vmunix: tu0: transmit FIFO underflow: using
>>store-forward:
>>
>>
>>And here's the rest of the thread...
>>
>>>I did have to disable autonegotiation, but that didn't resolve the problem
>>>either. It did eliminate the "late collisions" and "FDS errors" which are
>>>symptomatic of duplex mismatch.
>>>
>>>Turns out it must not be the switch. I've connected two machines directly
>>>with a crossover cable and it still has the same long delay (~1mbps, vs.
>>>10mbps or 100mbps).
>>>They are TULIP (10/100) Ethernet Interface (on a DEC Alpha 566, running
>>>Digital Unix 4.0b with jumbo patch of 7/1/98).
>>>
>>>Does Dunix need to be told it's 100mbps? The ewa0_mode is set at boot level
>>>to force 100mbps half or full duplex. I've tried both and both are the
>>>same. All I can think of next is to bump them down to 10mbps and see what
>>>happens, but in the end we need 100mbps...
>>>I've tried both nfs copies and ftp of non nfs partitions.
>>>Seems fine though with the same nfs mounted and ftp from one of these
>>>machines to and from a 10mbps machine.
>>>
>>>Any thing else to recommend?
>>>
>>>
>>>>The predominant suggestion was to check/force the duplex on the machine and
>>>>on the switch to either full or half duplex. I tried it both ways,
>>>>disabling the port, setting to half or full on both sides, then reenabling
>>>>the port.
>>>>All I have in /etc/rc.config is IFCONFIG_0="<machineip> netmask 255.255.255.0"
>>>>so I tried changing the duplex at boot level
>>>>
>>>>I tried all combinations... at boot level, the machines had ewa0_mode FastFD
>>>>and the switch was set at auto-negotiate which selected half duplex. Ok...
>>>>so just change it right?
>>>>I tried explicitly telling the switch to use full duplex... same problem
>>>>I tried explicitly telling the machine and the switch to do half duplex...
>>>>same problem
>>>>And disabled ports before change, and enabled after change. Even tried a
>>>>powerdown.
>>>>I don't expect it's the cables (cat 5, 1meter) since they transfer 5-10x
>>>>faster to 10mbps machines.
>>>>
>>>>I realize cpu/disk speed may result in a greater bottleneck than the
>>>>switch/network, but it should at least approach the performance of the same
>>>>file from a 10mbps machine. I've tried it with both nfs copies, and with
>>>>ftp (eliminating the cause of an nfs problem?).
>>>>netstat -i shows virtually no Ierrs Oerrs or Coll. "monitor" also didn't
>>>>show anything strange.
>>>>
>>>>Still scratching my head...
>>>>
>>>>Here's the original thread...
>>>>
>>>>>Ok... we have a bunch of servers... 3 of which are 100mbps and 6 are
>>>>>10mbps.
>>>>>Here's the times of nfs copy/ftp of a ~3.5mb file
>>>>>
>>>>>10mbps machine  --> 10mbps machine      4.3 sec
>>>>>10mbps machine  --> 100mbps machine     7.6 sec
>>>>>100mbps machine --> 10mbps machine      5.6 sec
>>>>>100mbps machine --> 100mbps machine   142.5 sec ----WHY?!
>>>>>
>>>>>I think I've ruled out nfs problems since all machines are using automount
>>>>>and all are mounted using 2mb read/write cache (I've tried disabling cache
>>>>>too).
>>>>>They are all on the same subnet and same Cisco 2916 10/100 switch, and
>>>>>settings on the switch are pretty much the defaults it has out of the box.
>>>>>The 100mbps machines are set at eprom to do 100mbps, and the switch
>>>>>autoconfigures as 100mbps, half duplex
>>>>>
>>>>>I don't think the gateway is the problem since they are all on the same
>>>>>subnet but how does a machine determine which of 2 gateways to use? All
>>>>>machines are running /usr/sbin/gated and most or all don't have a
>>>>>/etc/gated.conf or /etc/gateways file
>>>>>
>>>>>one 100mbps machine's ifconfig tu0:
>>>>>tu0: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST,SIMPLEX>
>>>>> inet 128.230.57.16 netmask ffffff00 broadcast 128.230.57.255 ipmtu 1500
>>>>>
>>>>>one 10mbps machine's ifconfig ln0:
>>>>>ln0: flags=c63<UP,BROADCAST,NOTRAILERS,RUNNING,MULTICAST,SIMPLEX>
>>>>> inet 128.230.57.3 netmask ffffff00 broadcast 128.230.57.255 ipmtu 1500
>>>>>
>>>>>Any other ideas of why the 10x difference or something I may be missing?
>>>>>
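
The timings in the original thread above can be converted to approximate
throughput for comparison with the later ftp figures. The Python sketch below
assumes the "~3.5mb" file is exactly 3.5 x 1024 kbytes, which the post does
not state, so the results are rough. The helper name `throughput_kb_s` is
hypothetical.

```python
# Convert the copy times from the original thread into approximate throughput.
# Assumption: the "~3.5mb" test file is treated as exactly 3.5 * 1024 kbytes.

def throughput_kb_s(size_mb, seconds):
    """Approximate throughput in kbytes/sec for a file of size_mb megabytes."""
    return size_mb * 1024 / seconds

FILE_MB = 3.5

# (direction, elapsed seconds) as listed in the original thread
timings = [
    ("10mbps  --> 10mbps", 4.3),
    ("10mbps  --> 100mbps", 7.6),
    ("100mbps --> 10mbps", 5.6),
    ("100mbps --> 100mbps", 142.5),
]

for label, seconds in timings:
    print(f"{label:20s} {seconds:6.1f} sec  ~{throughput_kb_s(FILE_MB, seconds):6.1f} KB/s")

# Ratio between the slowest and fastest copy of the same file.
slowdown = max(s for _, s in timings) / min(s for _, s in timings)
print(f"slowest copy takes about {slowdown:.0f}x as long as the fastest")
```

On these assumptions the 100mbps-to-100mbps copy works out to roughly 25
KB/s, the same order of magnitude as the 33 KB/s seen later with ftp.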

--------------------------------------------------------------------------
Dan Kirkpatrick dkirk_at_phy.syr.edu
Computer Systems Manager
Department of Physics
Syracuse University, Syracuse, NY
http://www.phy.syr.edu/~dkirk Fax: (315) 443-9103
--------------------------------------------------------------------------
Received on Tue Oct 26 1999 - 21:22:21 NZDT
