Summary and further Question: problem with DQS under OSF/1

From: Wu Wei 314-935-4746 <wwu_at_thrym.wustl.edu>
Date: Wed, 29 Nov 1995 10:36:32 -0600 (CST)

Hi everyone,

   I posted the following question yesterday:

>DQS can not submit jobs to any host other than qmaster. "qstat -f" gives:
>
> ***************************************************************
> odin1 batch 0/1 0.98 eru
> ***************************************************************
> polar1 batch 0/1 1.03 er
> ***************************************************************
> thrym1 batch 0/1 1.99 aer
>
   Many thanks to these folks for their response and help:
        Tim W. Janes <janes_at_signal.dra.hmg.gb>
        Mike Iglesias <iglesias_at_draco.acs.uci.edu>
        Neil Lincoln <nrl_at_SSESCO.com>

1) Status "u" of the odin1 queue means the host is unavailable. A check
showed that the dqs_execd daemon had died on that host. It was restarted.
2) Status "a" of the thrym1 queue means the host is overloaded. It turns out
that the "load_alarm" number needs to be divided by 100 in order to
compare it with the system load average. I did "qconf -dq thrym1" and
re-added the thrym1 queue with a higher load_alarm level.
3) I also changed the owner of dqs_execd3 to root.daemon, as pointed out by
Tim Janes.
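
The load_alarm comparison described in point 2 can be sketched as below.
This is only an illustration, not DQS code: the function name and the
sample load_alarm values (150 and 300) are hypothetical, and whether DQS
raises the alarm on "greater than" or "greater than or equal" is an
assumption. The load value 1.99 is the one shown for thrym1 above.

```python
def is_overloaded(load_avg, load_alarm):
    """Return True if load_avg exceeds the alarm threshold.

    load_alarm is the integer from the queue configuration; per the
    summary above it is divided by 100 before being compared with the
    system load average. The strict '>' is an assumption.
    """
    threshold = load_alarm / 100.0
    return load_avg > threshold


if __name__ == "__main__":
    # Hypothetical thresholds; 1.99 is the thrym1 load from "qstat -f".
    for alarm in (150, 300):
        print(alarm, is_overloaded(1.99, alarm))
```

With a load of 1.99, a load_alarm of 150 (threshold 1.50) would flag the
queue, while 300 (threshold 3.00) would not, which matches the effect of
raising the level.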

  Now all three queues report a status of "er" (enabled and running).
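
For reference, the status letters discussed in this thread can be decoded
with a small script like the one below. It is a rough sketch only: the
assumption that the last field of each "qstat -f" queue line is the status
and the one before it is the load is taken from the sample output above,
the per-letter meaning of "e" and "r" is inferred from "er" meaning
"enabled and running", and the function name is made up for illustration.

```python
# Meanings as reported in this thread; other DQS status letters
# are not covered here.
STATUS_FLAGS = {
    "e": "enabled",
    "r": "running",
    "u": "host unavailable",
    "a": "load alarm (host overloaded)",
}


def decode_queue_line(line):
    """Split one queue line into (queue name, load, list of flag meanings)."""
    fields = line.split()
    name = fields[0]
    load = float(fields[-2])
    status = fields[-1]
    meanings = [STATUS_FLAGS.get(ch, "unknown flag " + repr(ch)) for ch in status]
    return name, load, meanings


if __name__ == "__main__":
    sample = [
        "odin1 batch 0/1 0.98 eru",
        "polar1 batch 0/1 1.03 er",
        "thrym1 batch 0/1 1.99 aer",
    ]
    for line in sample:
        print(decode_queue_line(line))
```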

  Here is the new problem: If a job is grabbed by the qmaster host
(queue polar1), it runs just fine. If either of the other two queues (on
different hosts) grabs a job, "qstat -f" shows it running for a
couple of seconds, and then the job disappears. No stderr or stdout file
is generated. Any clue?

  Many thanks. I will summarize.

Wu Wei
Supervisor, Scientific Computing
Physics, Washington University
St. Louis, MO, USA
Received on Wed Nov 29 1995 - 18:09:23 NZDT
