I have an AlphaServer 2100 4/200 that I'm using as my news server. Two
months ago I upgraded the O/S from DU 2.0.B to 4.0.B, upgraded the
firmware as necessary for the new O/S version, upgraded the netnews
software to the latest INN and innfeed, and started using AdvFS on all
partitions except for / and the article spools.

At that time, the system started logging errors of the types
"psiop_hardintr", "ss_perform_timeout, timeout on disconnected
request", and a few others. A service call produced reassurances that
the errors were spurious and everything was fine. Yeah, right. Also
around that time, the news system performance became, well, odd.

Today one of the disks started giving I/O errors, and we got a
replacement. After a reboot, I was able to use the old (bad) disk
again, so I am now transferring its data to the new disk with tar.
Both disks are model RZ29B, with one huge partition containing a UFS
filesystem made with "newfs -c 8 -i 2048 rz11a" (for extra inodes --
these disks are part of the netnews article spool), and they are on
separate controllers. At the moment, not much else is happening on
that machine except for my file transfer, and an iostat watching over
it. So what's the problem?

Well, (1) the data is transferring extremely slowly, at a rate of about
300 MB/hour(!!!), (2) iostat is showing 3 or 4 times more activity for the
new disk than for the bad disk (average bps 520 vs 144, tps 102 vs 39,
running presumably close to flat-out), and (3) according to the system
log, the new disk switched from TIME to SPACE optimization while only
around 20% full!
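
To put the figures in (1) and (2) on a common footing, they can be converted and compared. The following is a minimal Python sketch, not part of the original measurements. It assumes 1 MB = 1024 KB and, as a labeled assumption, that the iostat "bps" column is in kilobytes per second; the function and variable names are illustrative only.

```python
# Figures quoted in the message above.
TRANSFER_MB_PER_HOUR = 300
NEW_DISK = {"bps": 520, "tps": 102}
BAD_DISK = {"bps": 144, "tps": 39}


def mb_per_hour_to_kb_per_sec(mb_per_hour):
    """Convert MB/hour to KB/s, taking 1 MB = 1024 KB."""
    return mb_per_hour * 1024 / 3600


def ratio(a, b):
    """Return a / b as a float."""
    return a / b


# Observed tar throughput expressed per second.
print("tar throughput, KB/s:", round(mb_per_hour_to_kb_per_sec(TRANSFER_MB_PER_HOUR), 1))

# How much busier the new disk looks than the bad one.
print("bps ratio, new/bad:", round(ratio(NEW_DISK["bps"], BAD_DISK["bps"]), 2))
print("tps ratio, new/bad:", round(ratio(NEW_DISK["tps"], BAD_DISK["tps"]), 2))

# ASSUMPTION: if "bps" is KB/s, bps / tps is the average KB moved per transfer.
print("new disk, per transfer:", round(ratio(NEW_DISK["bps"], NEW_DISK["tps"]), 1))
print("bad disk, per transfer:", round(ratio(BAD_DISK["bps"], BAD_DISK["tps"]), 1))
```

If the unit assumption holds, the first value can be compared directly with the two "bps" readings to see whether the tar rate and the iostat numbers are in the same range.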

If (1) is somehow a result of a hardware problem on the "bad" disk (for
which no errors have been logged since I rebooted), then perhaps
everything will go back to normal once I get the bad disk off the
system. I intend to test this with another transfer once this one
finishes some time tonight, because debugging performance on a running
news system is a nightmare. But if the performance problem
doesn't go away, then I have to figure out what to do, so suggestions
of what to test for tonight while the system is still down would be
appreciated.

I can't imagine what could be causing (2), and (3) is just plain
ridiculous, and worries me greatly. Did DEC somehow trash the UFS
performance between 2.0.B and 4.0.B? Does any of this resemble a
known problem?

In case it's relevant, I take performance samples every 10 minutes with
vmstat and iostat, and graph the monthly averages. After the upgrade,
there was a very sharp increase in CPU "system" time, and a smaller but
noticeable increase in CPU "user" time. I thought that *might* be
caused by new INN features, but am beginning to doubt that. Also,
mem-free dropped to zero, but page-outs remain fairly steady at about
4/second, which leads me to think that the memory management model has
changed.
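
The sampling-and-averaging step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the script actually used on this system. It assumes each sample has already been parsed from vmstat or iostat output into a (timestamp, value) pair; the function name is hypothetical, and the sample values are invented placeholders, not measurements from this machine.

```python
from collections import defaultdict
from datetime import datetime


def monthly_averages(samples):
    """Average (datetime, value) samples per calendar month.

    Returns a dict mapping (year, month) to the mean of that month's values.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (year, month) -> [sum, count]
    for when, value in samples:
        bucket = totals[(when.year, when.month)]
        bucket[0] += value
        bucket[1] += 1
    return {month: total / count for month, (total, count) in sorted(totals.items())}


# PLACEHOLDER data: e.g. CPU "system" percentage sampled every 10 minutes.
example_samples = [
    (datetime(1997, 2, 1, 0, 0), 10.0),
    (datetime(1997, 2, 1, 0, 10), 14.0),
    (datetime(1997, 4, 1, 0, 0), 30.0),
    (datetime(1997, 4, 1, 0, 10), 34.0),
]

for (year, month), mean in monthly_averages(example_samples).items():
    print(f"{year}-{month:02d}: {mean:.1f}")
```

Monthly means smooth out short spikes, so a step change that coincides with the upgrade month stands out, at the cost of hiding when within the month it began.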

Ideas, anyone?

I'd appreciate being cc'd on any replies.

Anne.
--
Ms. Anne Bennett, Computing Services, Concordia University, Montreal H3G 1M8
anne_at_alcor.concordia.ca (514) 848-7606
Received on Fri May 16 1997 - 21:23:30 NZST