The only answer I received to my request below:
Did you try the mailing list?
http://gridengine.sunsource.net/project/gridengine/maillist.html
-Ron Chen
I certainly did.
There were some hints about the error message I described, but none of them
solved my problem.
So I experimented and found something I would call a
"workaround" that I can live with.
- I used the single Tru64 UNIX AS "gridsrv" as qmaster
- I started install_execd on the cluster member "server1...."
=> no sge_execd was running
- so on "server1...." I ran:
#.../rcsge -migrate
=> same error message: .......commd - qmaster not enrolled at commd-
- then, the other way round, I went to "gridsrv" and ran there:
#.../rcsge -migrate
=> the qmaster was started successfully on the single AS
and, surprisingly, sge_execd was now running on "server1....",
so I could use it as an execution host.
This met my needs, so I stopped investigating the problem further.
Still, it would be interesting to know what caused it.
Martin
-------------------------------------------------------------------------------
Original request:
Hi managers,
after a system crash and the successful restore of a two-node ES40 / HSG80
cluster running Tru64 V5.1a PK6, all services restarted successfully except for
"SGE 5.3-gridware".
It failed with the error message:
-unable to contact qmaster via "server1.mpch-mainz.mpg.de" commd - qmaster not
enrolled at commd-
where "server1.mpch-mainz.mpg.de" is one of the cluster nodes, used as "qmaster".
-> no "sge_qmaster" was started
-> no "sge_execd" was started
Using a single AlphaServer (not in the cluster environment) as "qmaster", I
succeeded: all daemons were running.
I then used "server1.mpch-mainz.mpg.de" as an execution host.
Running "install_execd" on it completed without error, but afterwards
only "sge_commd" was running, not "sge_execd" (which does run on the other
execution hosts that are not in the cluster).
On the second cluster member, "server2.mpch-mainz.mpg.de", I got the same
result as on "server1".
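For anyone repeating this comparison across hosts, a minimal sketch of the check I was doing by hand: it takes the text output of a process listing and reports which of the four daemons named in this report are present. The function names and the use of Python are my own illustration, not part of SGE, and it assumes the host accepts the POSIX form "ps -e -o comm".

```python
import subprocess

# Daemon names as they appear in this report.
SGE_DAEMONS = ("sge_commd", "sge_qmaster", "sge_schedd", "sge_execd")


def running_daemons(ps_output, names=SGE_DAEMONS):
    """Return {name: bool}: True if the name appears as a command basename."""
    seen = set()
    for line in ps_output.splitlines():
        command = line.strip()
        if command:
            # Some systems print the full path; keep only the basename.
            seen.add(command.rsplit("/", 1)[-1])
    return {name: name in seen for name in names}


def current_ps_output():
    """Hypothetical helper: list command names of all processes.

    Assumption: `ps -e -o comm` works on this host (POSIX syntax).
    """
    result = subprocess.run(
        ["ps", "-e", "-o", "comm"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Usage on the host being checked:
#   for name, up in running_daemons(current_ps_output()).items():
#       print(name, "running" if up else "NOT running")
```

Run on "server1" in the state described above, this would be expected to show only sge_commd as running.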
Trying a brand-new installation of the "SGE 5.3" software with
#/soft/gridware/sge/inst_sge
I finally ended up with this error message:
Grid Engine qmaster and scheduler startup
-----------------------------------------
Starting qmaster and scheduler daemon. Please wait ...
starting sge_qmaster
starting program: /soft/gridware/sge/bin/tru64/sge_commd
using service "sge_commd"
bound to port 536
Reading in complexes:
Complex "host".
Complex "queue".
Reading in execution hosts.
Reading in administrative hosts.
Reading in parallel environments:
PE "make".
Reading in scheduler configuration
starting sge_schedd
error: getting configuration: unable to contact qmaster via "" commd - qmaster
not enrolled at commd
error: can't get configuration from qmaster -- backgrounding
-> "sge_commd" and "sge_schedd" were started, but
"sge_qmaster" and "sge_execd" were missing
So I concluded that something went missing on the cluster during the system
restore (possibly a "socket" or something similar).
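One way to narrow down the "socket" suspicion is to test whether the commd port accepts a connection at all. The sketch below is only an illustration under assumptions: it assumes sge_commd listens on TCP, and it uses port 536 because that is the port shown in the installer output ("bound to port 536"). The function name is hypothetical.

```python
import socket


def commd_reachable(host, port=536, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout`.

    Assumption: commd listens on TCP; 536 is the port from the log above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Usage:
#   print(commd_reachable("server1.mpch-mainz.mpg.de"))
```

A successful connection only shows that something is listening on that port. It does not show that the qmaster has enrolled at commd, which is what the error message complains about.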
Does anybody have an idea why "sge_qmaster" and "sge_execd" were not started
on the cluster nodes, but do run on the single AlphaServer?
Right now (after a week of working on it) I am out of ideas.
Any help would be appreciated.
Thanks in advance,
Martin Körfer
--
Dr.Martin Körfer
Max-Planck-Institut für Chemie
Elektronik
J.J.Becherweg 27
55128 Mainz
Tel.: -49-6131-305488
Fax: -49-6131-305318
Received on Mon Mar 07 2005 - 10:17:15 NZDT