select/cray: increase robustness of initialisation code
This improves the initial configuration code:

a) Better handling of DownNodes lines
   The previous basil_geometry() set the node Reason field on failure,
   irrespective of whether that node had already been marked down by a
   DownNodes line.

b) Check all cases of nodes being invisible to ALPS
   Up until now, basil_geometry() had to be fixed each time a new source of
   discrepancy between ALPS and SDB state was discovered (the most recent
   case was NULL coordinates when taking out a blade). Depending on ALPS
   interface changes, there may be other possibilities.
   Instead of fixing the SLURM code for each new case, it is better to check
   whether SLURM and ALPS agree. The price is a small delay at SLURM
   initialisation time (since each node is first looked up in the ALPS
   inventory), but it pays off by easing system administration: it points to
   the source of the error. Any node that has suddenly disappeared from the
   ALPS horizon will now show up in the logs and be marked down in sinfo.

c) At initialisation time, give a summary of how many ALPS nodes are online.

d) Turn the ALPS-node-invisibility error into a warning message, since such
   nodes may already have been covered by a DownNodes statement.

By merging basil_get_initial_state() into basil_geometry(), the previously
separate knowledge about system state (database state, ALPS inventory) is
combined, making it easier to identify sources of failure.

Patch from Gerrit Renker, CSCS.
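
For illustration only, the cross-check described above could be sketched roughly as follows. This is not the code from the patch: struct sdb_node, alps_has_node() and check_nodes() are hypothetical names invented for this sketch, and the ALPS inventory is reduced to a plain array of node IDs. It only shows the idea of looking up each SDB node in the inventory, warning about invisible nodes, leaving nodes already covered by DownNodes untouched, and returning the count used for the summary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a node record; NOT a real SLURM or ALPS type. */
struct sdb_node {
    unsigned node_id;     /* node ID as recorded in the SDB */
    bool in_down_nodes;   /* already covered by a DownNodes line */
    const char *reason;   /* NULL until a reason has been set */
};

/* Return true if node_id occurs in the (simplified) ALPS inventory. */
bool alps_has_node(const unsigned *inventory, size_t n_inv, unsigned node_id)
{
    for (size_t i = 0; i < n_inv; i++)
        if (inventory[i] == node_id)
            return true;
    return false;
}

/*
 * Look up every SDB node in the ALPS inventory.
 * Nodes missing from the inventory produce a warning; their reason is set
 * only if no DownNodes line (or earlier reason) already covers them.
 * Returns the number of nodes that ALPS can see.
 */
size_t check_nodes(struct sdb_node *nodes, size_t n_nodes,
                   const unsigned *inventory, size_t n_inv)
{
    size_t visible = 0;

    assert(nodes != NULL || n_nodes == 0);

    for (size_t i = 0; i < n_nodes; i++) {
        if (alps_has_node(inventory, n_inv, nodes[i].node_id)) {
            visible++;
            continue;
        }
        fprintf(stderr, "warning: node %u is not visible to ALPS\n",
                nodes[i].node_id);
        if (!nodes[i].in_down_nodes && nodes[i].reason == NULL)
            nodes[i].reason = "node not visible to ALPS";
    }
    return visible;
}
```

A caller could log the return value as the "N of M nodes online" summary; the linear lookup per node reflects the small initialisation delay mentioned in (b).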