select/cray: fix failure to set nodes with NULL coordinates down
This case was observed after taking a blade out of a CLE 2.x system. ALPS does not list the removed nodes, but they still appear in the XTAdmin.processor table, with NULL coordinates. Hence, set a node down if at least one of its coordinates is NULL.

Also add a check comparing how many of the nodes in slurm.conf are visible to ALPS (the absence of this test masked the bug), always list DOWN nodes at startup, and clarify that not failing due to ALPS errors during the initial SLURM configuration is not an option.

On the system that is missing a blade, the log information now is:

  [2011-05-16T16:09:54] error: ALPS sees only 12/16 slurm.conf nodes
  [2011-05-16T16:09:54] Recovered state of 16 nodes
  [2011-05-16T16:09:54] Recovered state of 2 front_end nodes
  [2011-05-16T16:09:54] Recovered information about 0 jobs
  [2011-05-16T16:09:54] error: nid00028: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00029: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00030: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00031: unknown coordinates - hardware failure?

Patch from Gerrit Renker, CSCS.