Commit dc8d97eb authored by Morris Jette

select/cray: increase robustness of initialisation code

This improves the initial configuration code:
 a) Better handling of DownNodes lines
    The previous basil_geometry() would set the node Reason field on failure,
    irrespective of whether that node had already been marked by a DownNodes line.

 b) Check all cases of nodes being invisible to ALPS
    Up until now basil_geometry() had to be fixed each time a new source of
    discrepancy between ALPS and SDB state was discovered (the most recent
    case was NULL coordinates when taking out a blade). Depending on ALPS
    interface changes, there may be other possibilities. Instead of fixing the
    SLURM code for each new case, it is better to check whether SLURM and ALPS
    agree. The price is a small delay at SLURM initialisation time (since each
    node is first looked up in the ALPS inventory), but it pays off well, as it
    eases system administration by pointing to the source of the error.
    Any node that has suddenly disappeared from the ALPS horizon will now show up
    in the logs, and will also be marked down in sinfo.

 c) At initialisation time, give a summary of how many ALPS nodes are online.

 d) Turn the ALPS-node-invisibility error into a warning message, since such nodes
    may already have been covered by a DownNodes statement.
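
The checks in a) to d) can be illustrated with a minimal, standalone sketch. It
is not the code from this patch: the types sdb_node and alps_inventory, the
helper alps_has_node() and the function reconcile_nodes() are hypothetical
names made up for illustration, and "visible to ALPS" stands in for the real
inventory lookup. It only shows the idea of looking up every configured node
in the inventory, warning about missing ones, leaving the Reason of nodes
already covered by a DownNodes line alone, and printing a summary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified types: NOT the real SLURM or BASIL structures. */
struct sdb_node {               /* a node as configured on the SLURM side */
	unsigned    id;
	bool        down_by_config; /* already covered by a DownNodes line */
	bool        marked_down;    /* result: node should show up as down */
	const char *reason;         /* result: set only if we assign a reason */
};

struct alps_inventory {         /* node IDs that ALPS currently reports */
	const unsigned *ids;
	size_t          count;
};

/* Linear lookup of one node ID in the (assumed) ALPS inventory. */
static bool alps_has_node(const struct alps_inventory *inv, unsigned id)
{
	for (size_t i = 0; i < inv->count; i++)
		if (inv->ids[i] == id)
			return true;
	return false;
}

/*
 * Check every configured node against the inventory.
 * Returns the number of nodes that ALPS can see.
 */
static size_t reconcile_nodes(struct sdb_node *nodes, size_t n,
			      const struct alps_inventory *inv)
{
	size_t visible = 0;

	assert(nodes != NULL && inv != NULL);
	for (size_t i = 0; i < n; i++) {
		if (alps_has_node(inv, nodes[i].id)) {
			visible++;
			continue;
		}
		/* d) a warning only: the node may be down on purpose */
		fprintf(stderr, "warning: node %u is not visible to ALPS\n",
			nodes[i].id);
		nodes[i].marked_down = true;
		/* a) do not overwrite the state of a DownNodes entry */
		if (!nodes[i].down_by_config)
			nodes[i].reason = "node not visible to ALPS";
	}
	/* c) summary at initialisation time */
	fprintf(stderr, "info: %zu of %zu nodes are visible to ALPS\n",
		visible, n);
	return visible;
}
```

In this sketch the lookup is a linear scan per node, which is where the small
initialisation delay mentioned in b) would come from; a real implementation
may well organise the inventory differently.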

By merging basil_get_initial_state() into basil_geometry(), the previously separate
knowledge about system state (database state, ALPS inventory) is combined, making
it easier to identify sources of failure.
Patch from Gerrit Renker, CSCS.
parent 03a8f312