SUMMARY: High system load - can't find anything running

From: Robert Carsey <rcarsey@monmouth.edu>
Date: Tue, 08 Aug 2000 09:54:39 -0400

Well, I rebooted the system, and the problem hasn't recurred yet. It is
possible that some process was forking tons of children that then died
without being reaped -- hence the "defunct" (zombie) processes in my
process table.

Allan pointed this out to me and told me to try killing the PPID (i.e. the
parent) of those processes; unfortunately, I had already rebooted :( I
would have liked to try that, though.
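
For the archives, the check would have been something like this (the third
column of ps -ef output is the PPID, i.e. the parent that should be reaping
the zombie):

   # ps -ef | grep defunct        (zombies show up with <defunct> as the command)
   # ps -ef | grep <that-ppid>    (see what the parent actually is before touching it)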

Alan (Compaq) said to check whether I was using NFS v2 -- I wasn't. NFS v2
is a beast; use v3 whenever humanly possible.
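
For reference, the quickest way I know to check which NFS version the
mounts are actually using is nfsstat (exact options vary a bit by OS, so
check the man page):

   # nfsstat -c        (client-side counters; whichever version's call
                        counts are climbing is the one in use)

Most NFS client implementations also let you force the protocol with a
vers=3-style mount option in /etc/fstab.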

All responses follow. Thanks to J. Joseph, A. Simeone, D. Klienhesselink,
Alan, O. Orgeron, and N. Milutinovic.
----------------------------------------------------------
We have noticed this on our DEC systems too (4100, 4.0F). A reboot will
take care of it, but we are not sure what caused it or how to prevent it
from happening again.

-- JJ
------------------------------------------------------------
If you have your filesystems mounted so that root@localhost can't get to
the NFS stuff (for security reasons or whatever), and root does a vi on a
file on the NFS mount, the vi hangs because the permissions aren't there --
but it will add a load of 1 to the host, even though the process won't
actually use any CPU.

It's weird. The only way I've found to correct the high load is to reboot
the box. But remember, system "load" (uptime, w, top, et al.) is a really
bad indicator of how busy the system actually is; it's an 80,000-foot view
at best. Fire up sar (via the sys user's crontab) and let it run for a
while, and you'll get a much better breakdown of what your host is doing.

-dave
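
For what it's worth, sar is also useful run interactively, without waiting
on the crontab entries; something along these lines (option letters from
memory, so check the sar man page):

   # sar -u 5 12       (CPU breakdown -- user/system/wait/idle -- every 5
                        seconds, 12 samples)
   # sar -q 5 12       (run-queue length and occupancy)

On Tru64, the collect utility, if it's installed, gives a similar rolling
picture.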
----------------------------------------------------------------
My take is you have 16 zombie processes. Do a ps -ef | grep defunct.
If you have defunct processes, get the PPID of each one and do a ps -ef
on that PPID until you reach the actual parent process. If the process
really is defunct, try to kill the parent with a kill, or a kill -9, of
that process (the PPID of the defunct process). Zombie or looping
processes usually eat up CPU cycles. Be careful about killing them,
though -- you don't want to kill a legitimate process. I usually run
into this with the SAS application quite a bit.

If you can't kill the defunct processes, the only way to get rid of
them is to reboot.

Allan
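
To spell that out, the sequence is roughly (placeholders, obviously, and
make sure the parent isn't something you still need):

   # ps -ef | grep <ppid-of-defunct>    (identify the parent process)
   # kill <ppid-of-defunct>             (polite SIGTERM first)
   # kill -9 <ppid-of-defunct>          (only if it ignores the TERM)

Once the parent is gone, init inherits the zombies and reaps them.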
------------------------------------------------------------
All those smbd processes are probably doing disk I/O, and that's what's
causing the high load average. I forget exactly how the load average is
defined or calculated, but processes that do extensive disk access drive
it up. Someone else will probably have the exact definition.
--------------------------------------------------------------
What version of NFS are the clients using? NFS V2 requires
that all writes be synchronous on the server. A moderate
write load could have lots of waiting clients which might
count against the load. NFS V3 is a bit more friendly.
--------------------------------------------------------------
I would check out the I/O and network stats using
netstat and iostat. Then, if nothing comes up there, I
would look at all of the running processes with ps.
You could have an application that is forking child
processes which each take only a tiny percentage of
the CPU, yet add up. Other than that, I would suggest
stopping any non-essential daemons or software to see
if that helps. You could also check whether you are
running out of tmp space or some other resource.
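
Concretely, a cheap first pass along those lines (output formats differ a
bit between systems):

   # iostat 5              (per-disk activity; look for one disk doing all
                            the work)
   # netstat -i            (packet and error counts per interface)
   # netstat -s            (per-protocol statistics, retransmits, etc.)
   # df -k /tmp            (make sure tmp space isn't exhausted)
   # ps -ef | wc -l        (quick sanity check on the total process count)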

-------------------------------------------------------------
If I understand correctly, the load average is roughly the average number
of processes that are runnable -- running or waiting for a CPU. Processes
that are sleeping or waiting on something else are not counted. So it has
to be something CPU-intensive, although our machines would also run into
high loads with Oracle7 queries because they were swapping a lot.
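
A quick way to watch both halves of that -- runnable processes and paging
-- is vmstat (column names vary slightly between systems):

   # vmstat 5          (the leading proc columns show the run queue and
                        waiting processes; the page-out column shows
                        whether the box is actually swapping)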

------------------------------------------------------------------

> Hello, this seems really odd on my v5.1 machine. I can't seem to find
> anything running that would give me a system load of 14+. The only thing
> "interesting" about this machine is that it has 4 filesystems while are
> served from NFS and I'm running LSM. Can anyone give me some pointers as
to
> where I should look for the source of this load?
>
>
> load averages: 14.75, 14.50, 14.69                              21:19:04
> 144 processes: 4 running, 54 sleeping, 70 idle, 16 zombie
> CPU states: 3.9% user, 0.0% nice, 96.0% system, 0.0% idle
> Memory: Real: 230M/493M act/tot Virtual: 33M/7825M use/tot Free: 121M
>
> PID USERNAME PRI NICE SIZE RES STATE TIME CPU COMMAND
> 28328 root 42 0 8352K 1302K sleep 0:00 4.40% smbd
> 14526 root 49 0 6704K 434K run 0:00 2.80% smbd
> 26145 root 46 0 8440K 1327K sleep 0:01 2.70% smbd
> 28517 root 42 0 5448K 712K sleep 0:00 1.20% login
> 26853 root 42 0 8672K 1531K sleep 0:04 1.00% smbd
> 27226 root 44 0 8616K 1376K run 0:00 0.90% smbd
> 27276 root 45 0 8464K 1433K sleep 0:00 0.70% smbd
> 27325 root 42 0 8520K 1359K sleep 0:00 0.60% smbd
> 28450 root 46 0 4272K 1777K run 0:01 0.40% top
> 23382 root 44 0 1944K 327K sleep 0:00 0.40% telnetd
> 18731 root 44 0 4744K 933K sleep 0:06 0.30% radiusd
> 22164 root 42 0 8776K 1605K sleep 0:09 0.10% smbd
> 27335 root 42 0 8528K 1376K sleep 0:01 0.10% smbd
> 14170 root 54 10 8128K 516K run 0:00 0.00% dtscreen
> 803 root 44 0 9760K 3391K sleep 6:17 0.00% Xdec
> ------------------------------------------------------------------------
> # ps -Af -p0 -m
> UID PID PPID C STIME TTY TIME CMD
> root 0 0 0.5 Sep 30 ?? 02:00:53 [kernel idle]
> 0.0 0:00.00
> 0.0 13:29.78
> 0.0 0:11.67
> 0.0 0:16.56
> 0.0 0:00.00
> 0.0 0:00.04
> 0.0 0:00.00
> 0.0 0:00.45
> 0.0 0:00.05
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.24
> 0.0 0:00.06
> 0.0 0:00.00
> 0.0 0:00.39
> 0.0 0:00.00
> 0.0 0:00.35
> 0.0 0:00.00
> 0.0 0:00.18
> 0.0 0:00.06
> 0.0 0:00.03
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.62
> 0.0 0:11.87
> 0.0 8:35.46
> 0.0 0:00.00
> 0.0 4:14.77
> 0.0 0:00.05
> 0.0 0:00.00
> 0.2 27:52.07
> 0.2 27:23.66
> 0.1 27:38.22
> 0.0 0:15.24
> 0.0 0:07.25
> 0.0 0:00.57
> 0.0 0:00.00
> 0.0 0:00.05
> 0.0 0:00.00
> 0.0 0:01.29
> 0.0 0:00.00
> 0.0 0:00.20
> 0.0 0:00.00
> 0.0 0:34.74
> 0.0 0:00.00
> 0.0 0:00.18
> 0.0 0:00.00
> 0.0 0:00.94
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.03
> 0.0 3:59.67
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:45.97
> 0.0 0:46.52
> 0.0 0:47.66
> 0.0 0:46.66
> 0.0 0:48.12
> 0.0 0:46.62
> 0.0 0:44.99
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> 0.0 0:00.00
> #
>
>