See below for the checklist of my troubleshooting...
This is a tough one!
Ok... I've installed the new DEC500-BA NIC (with 21143 chip, revision 3.0)
cards and still no improvement.
I'm not sure if it's the new cards or another issue, but a transfer of a
25 MB file to /dev/null OR a RAM drive usually takes 4-6 seconds
(4400KB/s-5600KB/s). When transferring to a disk, it takes 770 seconds
(33KB/s), and NOW it takes 3300 seconds (7.5KB/s). On a quiet system I got it
back to the usual problem speed of 770 seconds (33KB/s). One may suggest
a SCSI bus or disk problem, but when doing anything locally on the system,
the SCSI bus & disks operate at expected wide-SCSI speeds, and a transfer
from a 10 Mbps machine takes 22 seconds (1100KB/s), which is about what a
10 Mbps bottleneck should give.
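For anyone who wants to check the rates above, here is a rough sketch of the arithmetic. It assumes the file is exactly 25 MB and that 1 MB = 1024 KB; the function name is only for illustration. The figures quoted above are rounded, so expect approximate agreement only.

```python
def throughput_kb_per_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in KB/s, assuming 1 MB = 1024 KB."""
    return size_mb * 1024 / seconds


# Elapsed times (seconds) reported above for the same 25 MB file.
timings = [
    ("to /dev/null or RAM drive (best)", 4),
    ("to /dev/null or RAM drive (worst)", 6),
    ("to disk (usual problem speed)", 770),
    ("to disk (current)", 3300),
    ("from the 10 Mbps machine", 22),
]

for label, secs in timings:
    print(f"{label:36s} {throughput_kb_per_s(25, secs):8.1f} KB/s")
```

The point of the comparison is the ratio: the disk-bound transfer is more than a hundred times slower than the same transfer to /dev/null.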
Is there documentation somewhere on the firmware level settings on these
Aspen/Durango motherboards? I notice some of the settings at console
firmware level were unset, such as:
ewa0_def_subnetmask 0.0.0.0
ewa0_def_inetaddr 0.0.0.0
os_type NT
etc...
I tried setting the subnetmask & inetaddr which didn't help.
Should I be checking something else?
I also tried doing transfers on a quiet system, no logins, no nfs mounts,
direct crossover cable between the two with the following speed settings:
ewa0_mode=Fast
rc.config file with IFCONFIG_0="128.230.57.16 netmask 255.255.255.0 speed 100"
I even forced the speed/duplex with "lan_config -a0 -mutp -s100 -x0"
I'm not used to interpreting this level of network usage, so I'm shooting in
the dark, but...
spray from 10mbps machine to 100mbps machine:
ahep4.phy.syr.edu> spray hepsu03
sending 1162 packets of lnth 86 to hepsu03 ...
in 0.4 seconds elapsed time,
no packets dropped by hepsu03
3012 packets/sec, 253.0K bytes/sec
spray from 100mbps machine to 10mbps machine:
hepsu03.phy.syr.edu> spray ahep4
sending 1162 packets of lnth 86 to ahep4 ...
in 10.0 seconds elapsed time,
623 packets (53.61%) dropped by ahep4 <--reasonable with only receive at 10mbps?
Sent: 115 packets/sec, 9.7K bytes/sec
Rcvd: 53 packets/sec, 4.5K bytes/sec
spray from 100mbps machine to 100mbps machine:
hepsu03.phy.syr.edu> spray hepsu02
sending 1162 packets of lnth 86 to hepsu02 ...
in 0.1 seconds elapsed time,
318 packets (27.37%) dropped by hepsu02 <--reasonable?
Sent: 20515 packets/sec, 1723.0K bytes/sec
Rcvd: 14900 packets/sec, 1251.4K bytes/sec
but a spray of larger packets results in better performance
root_at_hepsu03:> spray -l 1024 -c 1000 hepsu02
sending 1000 packets of lnth 1026 to hepsu02 ...
spray: send error RPC: Unable to send
in 0.2 seconds elapsed time,
1 packets (0.10%) dropped by hepsu02
Sent: 4530 packets/sec, 4539.8K bytes/sec
Rcvd: 4526 packets/sec, 4535.3K bytes/sec
Should this suggest something?
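As a sanity check on the spray figures, the drop percentages can be reproduced from the packet counts in the output above. This is only a small sketch; the function name and the run labels are mine.

```python
def drop_percent(sent: int, dropped: int) -> float:
    """Percentage of sprayed packets the receiver reported as dropped."""
    return 100.0 * dropped / sent


# (packets sent, packets dropped) taken from the spray runs above.
runs = {
    "hepsu03 -> ahep4 (100 to 10 Mbps)": (1162, 623),
    "hepsu03 -> hepsu02 (100 to 100 Mbps)": (1162, 318),
    "hepsu03 -> hepsu02, -l 1024": (1000, 1),
}

for name, (sent, dropped) in runs.items():
    print(f"{name}: {drop_percent(sent, dropped):.2f}% dropped")
```

With the same pair of 100 Mbps machines, the loss goes from roughly a quarter of the packets at the default length to almost none with the larger packets.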
Perhaps a firmware problem? Here's the versions we have... The motherboard
is labeled as a Durango II, I'm not sure who to inquire about firmware,
Aspen or Dec.
May 20 16:56:22 hepsu03 vmunix: Digital AlphaPC 164LX 533 MHz system
May 20 16:56:22 hepsu03 vmunix: Firmware revision: 4.9
May 20 16:56:22 hepsu03 vmunix: Digital UNIX PALcode version 1.22
Analyzing the tcpdump output showed fragmenting...
>Also, fragments received is indicative of some network problem
>that is causing fragmentation. Again, a rogue node, a connector
>that is shorting out...
>
>I would look at the network with a sniffer or tcpdump to
>see if there is some network problem.
>
>Also the speeds between the card and the switch.
Although a bad connector or duplicate IP is plausible, I've checked and can see
no duplicate IPs. I've checked and double checked the speed of the cards,
and they transfer to a RAM drive or /dev/null at expected speeds with little
to no errors/collisions, so I would think that rules those out at this
point.
The problem persists even when isolated to 2 machines with a direct
crossover cable.
When doing an "rcp 25mbfile hepsu02:/usr/_test/.":
SOURCE:
hepsu03.phy.syr.edu> netstat
Active Internet connections
Proto Recv-Q Send-Q  Local Address    Foreign Address   (state)
tcp        0      0  hepsu03.telnet   hepsu02.1031      ESTABLISHED
tcp        0  39956  hepsu03.1023     hepsu02.shell     ESTABLISHED
tcp        0      0  hepsu03.1254     hepsu02.telnet    ESTABLISHED
tcp        0      0  hepsu03.telnet   suhep.1575        ESTABLISHED
tcp        0      0  hepsu03.telnet   suhep.1486        ESTABLISHED
DESTINATION:
root_at_hepsu02:> netstat
Active Internet connections
Proto Recv-Q Send-Q  Local Address    Foreign Address   (state)
tcp        0      0  hepsu02.shell    hepsu03.1023      ESTABLISHED
tcp        0      0  hepsu02.telnet   hepsu03.1254      ESTABLISHED
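In the netstat output above, the only connection with anything queued is the rcp data connection on the source side (Send-Q 39956 on hepsu03.1023 to hepsu02.shell). Send-Q counts bytes the remote end has not yet acknowledged. Below is a minimal sketch for picking such lines out of saved netstat output; the helper name and the sample text are illustrative, and it assumes the six-column tcp layout shown above.

```python
SAMPLE = """\
tcp 0 0 hepsu03.telnet hepsu02.1031 ESTABLISHED
tcp 0 39956 hepsu03.1023 hepsu02.shell ESTABLISHED
tcp 0 0 hepsu03.1254 hepsu02.telnet ESTABLISHED
"""


def backlogged(netstat_text: str, min_bytes: int = 1):
    """Return (local, foreign, recv_q, send_q) for tcp lines with queued data."""
    hits = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a six-column tcp line.
        if len(fields) != 6 or fields[0] != "tcp":
            continue
        recv_q, send_q = int(fields[1]), int(fields[2])
        if recv_q >= min_bytes or send_q >= min_bytes:
            hits.append((fields[3], fields[4], recv_q, send_q))
    return hits


for local, foreign, recv_q, send_q in backlogged(SAMPLE):
    print(f"{local} -> {foreign}: Recv-Q={recv_q} Send-Q={send_q}")
```

Running netstat repeatedly during a transfer and watching whether Send-Q stays high would show whether the sender is mostly waiting on the receiver.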
I'll try to remember my checklist here...
---------------------------------------------------------
1. force NIC speed settings at console level, /etc/rc.config, and
lan_config (done)
2. install latest jumbo patch for DEC unix 4.0B (done)
3. unreachable host? no, single pings and ping floods both respond reasonably
4. routing problems? no, traceroute shows no additional hops, just a direct
connection
5. gateway problems? no, traceroute shows it doesn't go through the
gateway, additionally, there are 0 bad checksums with nfsstat -s
6. system cpu load problem? no, same timings on 100% load and 0% load. But
a spray command shows consistent 20-30% packets dropped, even at 0% load.
7. nfs load problem? no, disabled all nfsmounts with same timing problems.
8. cabling problem? no, since transfer to ramdrive or /dev/null shows
>10mbps performance, and transfer to file to/from a 10mbps machine shows
1100KB/s performance
9. network congestion? no, tried a direct isolated connection with a
crossover cable, very little to zero network collisions, also see #13
regarding netstat -i output
10. scsi bus/disk problem? no, local access is reasonable, and transfer
to/from 10mbps machine transfers fine.
11. NIC card problem? no, tried several different systems and replaced 2
cards with newer revision with same timing problems
12. what do I use for timing? both the built in timing in ftp, and the
time command when using a copy on nfs mounted disks and when using rcp to
transfer files
13. data corruption on network? tried troubleshooting in O'Reilly's
"System Performance Tuning" pp176-206, netstat -i shows < 0.020% of Ierrs &
Oerrs compared to Ipkts & Opkts.
14. network integrity data from nfs? tried troubleshooting in O'Reilly's
"System Performance Tuning" pp176-206, nfsstat -c shows less than 0.01%
retrans of client udp calls.
15. timeouts? nfs timeouts are 0 for these tests with nfsstat -c
16. nfs workload and kernel table size? I've ruled out nfs problems by
trying other tests, but have increased maxusers to 128 instead of the
default of 32 in /etc/sysconfigtab with no improvement.
17. sys_check? I've installed sys_check and ran it during a 25mb file
transfer. Results before any changes are at
http://www.phy.syr.edu/~dkirk/hepsu03.html
and results after changes to /etc/sysconfigtab are at
http://www.phy.syr.edu/~dkirk/hepsu03b.html
18. STREAMS table? netstat -m shows each setting near the peak usage for
each one. I'm not sure if this is an issue, since the book says it's
particularly important for System V.2, V.3, and Xenix, which have fixed
stream parameters. V.4 controls these parameters dynamically.
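The percentages in items 13 and 14 are simple ratios of error counters to packet (or call) counters. Here is a sketch of that calculation. The counter values below are made up purely for illustration and are NOT taken from these machines; the function name is mine as well.

```python
def error_percent(errors: int, total: int) -> float:
    """Errors as a percentage of total packets or calls; 0.0 if nothing was counted."""
    if total == 0:
        return 0.0
    return 100.0 * errors / total


# Hypothetical counters in the style of netstat -i (not real data).
ipkts, ierrs = 1_500_000, 120
opkts, oerrs = 1_200_000, 30

print(f"input errors:  {error_percent(ierrs, ipkts):.4f}%")
print(f"output errors: {error_percent(oerrs, opkts):.4f}%")
```

The same ratio applies to the nfsstat -c retrans count against the total client calls.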
That's the end of what I can remember, and what's listed in O'Reilly's
"System Performance Tuning" Network Performance section.
Perhaps someone has some further suggestions of something to look at or
something to run, such as kernel configuration, /etc/rc.config file,
/etc/sysconfigtab, etc... No one has yet mentioned any settings in the
/usr/sys/conf/<SYSNAME> kernel config file, or much in the /etc/rc.config
file.
Any advice at this point would be great...
Dan
--------------------------------------------------------------------------
Dan Kirkpatrick dkirk_at_phy.syr.edu
Computer Systems Manager
Department of Physics
Syracuse University, Syracuse, NY
http://www.phy.syr.edu/~dkirk Fax: (315) 443-9103
--------------------------------------------------------------------------
Received on Fri May 21 1999 - 18:01:56 NZST