select/cray: fix failure to set nodes with NULL coordinates down
This case was observed after taking a blade out of a CLE 2.x system. ALPS does not list the removed nodes, but they still appear in the XTAdmin.processor table, with NULL coordinates. Hence, set a node down if at least one of its coordinates is NULL.

Also add a check comparing how many of the nodes in slurm.conf are visible to ALPS (the absence of this test masked the bug), always list DOWN nodes at startup, and clarify that not failing due to ALPS errors during the initial SLURM configuration is not an option.

On the system that is missing a blade, the log information now is:

  [2011-05-16T16:09:54] error: ALPS sees only 12/16 slurm.conf nodes
  [2011-05-16T16:09:54] Recovered state of 16 nodes
  [2011-05-16T16:09:54] Recovered state of 2 front_end nodes
  [2011-05-16T16:09:54] Recovered information about 0 jobs
  [2011-05-16T16:09:54] error: nid00028: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00029: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00030: unknown coordinates - hardware failure?
  [2011-05-16T16:09:54] error: nid00031: unknown coordinates - hardware failure?

Patch from Gerrit Renker, CSCS.