Hi everyone,
I posted the following question yesterday:
>DQS can not submit jobs to any host other than qmaster. "qstat -f" gives:
>
> ***************************************************************
> odin1 batch 0/1 0.98 eru
> ***************************************************************
> polar1 batch 0/1 1.03 er
> ***************************************************************
> thrym1 batch 0/1 1.99 aer
>
Many thanks to these folks for their responses and help:
Tim W. Janes <janes_at_signal.dra.hmg.gb>
Mike Iglesias <iglesias_at_draco.acs.uci.edu>
Neil Lincoln <nrl_at_SSESCO.com>
1) Status "u" of the odin1 queue means the host is unavailable. A check
showed that the dqs_execd daemon had died on that host. It was restarted.
2) Status "a" of the thrym1 queue means the host is overloaded. It turns
out that the "load_alarm" number needs to be divided by 100 before it is
compared with the system load average. I did "qconf -dq thrym1" and then
added the thrym1 queue again with a higher load_alarm level.
3) I also changed the owner of dqs_execd3 to root.daemon, as pointed out
by Tim Janes.
Now all three queues report a status of "er" (enabled and running).
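For anyone else reading the status column, here is a rough Python sketch of
the two points above: what the status letters mean, and how load_alarm
relates to the load average. The letter meanings are only the ones that came
up in this thread. The function names, the ">=" comparison, and the
load_alarm values 150 and 300 are my own assumptions for illustration; none
of this is taken from the DQS source.

```python
# Status letters as explained in this thread only; DQS may define others.
STATUS_FLAGS = {
    "e": "enabled",
    "r": "running",
    "u": "host unavailable (e.g. dqs_execd has died)",
    "a": "load alarm (host overloaded)",
}


def decode_status(status):
    """Translate a qstat -f status string such as 'aer' into readable words."""
    return [STATUS_FLAGS.get(ch, "unknown flag %r" % ch) for ch in status]


def load_alarm_tripped(load_average, load_alarm):
    """Compare a load average with load_alarm, which is scaled by 100.

    Assumption: the alarm trips when the load reaches load_alarm / 100.
    The exact comparison DQS uses (> or >=) is not confirmed here.
    """
    return load_average >= load_alarm / 100.0


if __name__ == "__main__":
    # thrym1 showed a load of 1.99 and status "aer" in the listing above.
    print(decode_status("aer"))
    # Hypothetical load_alarm values: 150 would trip at 1.99, 300 would not.
    print(load_alarm_tripped(1.99, 150))
    print(load_alarm_tripped(1.99, 300))
```

With a load of 1.99, a load_alarm of 150 (that is, 1.50) would flag the
queue, while 300 (3.00) would leave it in state "er".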
Here is the new problem: if a job is grabbed by the qmaster host
(queue polar1), it runs just fine. If either of the other two queues (on
different hosts) grabs a job, "qstat -f" shows it running for a
couple of seconds, and then the job disappears. No err file or stdout file
is generated. Any clue?
Many thanks. I will summarize.
Wu Wei
Supervisor, Scientific Computing
Physics, Washington University
St. Louis, MO, USA
Received on Wed Nov 29 1995 - 18:09:23 NZDT