Commit f09febfe authored by Morris Jette

select/cray: fix failure to set nodes with NULL coordinates down

This case was observed after taking a blade out of a CLE 2.x system. ALPS does not
list the removed nodes, but they still appear in the XTAdmin.processor table, with
NULL coordinates. Hence set a node DOWN if at least one of its coordinates is NULL.
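
The rule above can be illustrated with a minimal sketch. It is not the code from
the patch: the row layout, the enum, and the function name coords_unknown() are
hypothetical, and the only assumption carried over is that an SQL NULL shows up
as a NULL pointer, as it does in rows returned by the MySQL C API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row layout (not taken from the patch): one string per
 * coordinate column, with a NULL pointer standing in for an SQL NULL. */
enum { COORD_X, COORD_Y, COORD_Z, COORD_COUNT };

/* Return true if at least one coordinate is missing, i.e. the node
 * should be set DOWN rather than trusted with partial coordinates. */
bool coords_unknown(const char *const coords[COORD_COUNT])
{
    for (size_t i = 0; i < COORD_COUNT; i++) {
        if (coords[i] == NULL)
            return true;
    }
    return false;
}
```

Checking every column, rather than only the first, matches the "at least one
coordinate" wording: a row with a single NULL is treated like a row with three.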

Also add a check that counts how many of the nodes in slurm.conf are visible to
ALPS (the absence of this test masked the bug), always list DOWN nodes at startup,
and clarify that ALPS errors during the initial SLURM configuration must be
treated as fatal.
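
As a rough sketch of the visibility check, assuming both the slurm.conf nodes and
the ALPS inventory are available as arrays of node names: count_visible() and
report_visibility() are hypothetical names, not functions from the patch, and
only the message text is taken from the log shown in this commit.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Count how many configured node names also appear in the ALPS list.
 * Both arrays are hypothetical inputs used for illustration only. */
size_t count_visible(const char *const conf[], size_t n_conf,
                     const char *const alps[], size_t n_alps)
{
    size_t seen = 0;

    for (size_t i = 0; i < n_conf; i++) {
        for (size_t j = 0; j < n_alps; j++) {
            if (strcmp(conf[i], alps[j]) == 0) {
                seen++;
                break;          /* count each configured node once */
            }
        }
    }
    return seen;
}

/* Write the error text into buf and return 1 if ALPS sees fewer nodes
 * than slurm.conf lists; otherwise leave buf empty and return 0. */
int report_visibility(char *buf, size_t len, size_t seen, size_t total)
{
    if (seen >= total) {
        if (len > 0)
            buf[0] = '\0';
        return 0;
    }
    snprintf(buf, len, "ALPS sees only %zu/%zu slurm.conf nodes", seen, total);
    return 1;
}
```

The quadratic comparison is kept for brevity; on a large system a sorted list or
hash lookup would be the more likely choice.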

On the system which is missing a blade, the log output is now:
 [2011-05-16T16:09:54] error: ALPS sees only 12/16 slurm.conf nodes
 [2011-05-16T16:09:54] Recovered state of 16 nodes
 [2011-05-16T16:09:54] Recovered state of 2 front_end nodes
 [2011-05-16T16:09:54] Recovered information about 0 jobs
 [2011-05-16T16:09:54] error: nid00028: unknown coordinates - hardware failure?
 [2011-05-16T16:09:54] error: nid00029: unknown coordinates - hardware failure?
 [2011-05-16T16:09:54] error: nid00030: unknown coordinates - hardware failure?
 [2011-05-16T16:09:54] error: nid00031: unknown coordinates - hardware failure?
Patch from Gerrit Renker, CSCS.
parent 76066dcc