1. 18 May, 2011 1 commit
    • select/cray: increase robustness of initialisation code · dc8d97eb
      Morris Jette authored
      This improves the initial configuration code:
       a) Better handling of DownNodes lines
          The previous basil_geometry() would set the node Reason field on failure,
          irrespective of whether that node had already been marked down using a
          DownNodes line.
      
       b) Check all cases of nodes being invisible to ALPS
          Up until now basil_geometry() had to be fixed each time a new source of
          discrepancy between ALPS and SDB state was discovered (the most recent
          case was NULL coordinates when taking out a blade). Depending on ALPS
          interface changes, there may be other possibilities. Instead of fixing the
          SLURM code for each new case, it is better to check whether SLURM and ALPS
          agree. The price is a small delay at SLURM initialisation time (since each
          node is first looked up in the ALPS inventory), but it pays off by easing
          system administration, since it points to the source of the error.
          Any node that has suddenly disappeared from the ALPS horizon will now show
          up in the logs and also be marked down in sinfo.
      
       c) At initialisation time, give a summary as to how many ALPS nodes are online.
      
       d) Turn ALPS-node-invisibility error into warning message, since such nodes may
          already have been covered in a DownNodes statement.
      
      By merging basil_get_initial_state() into basil_geometry(), the previously separate
      knowledge about system state (database state, ALPS inventory) is combined, making
      it easier to identify sources of failure.
      Patch from Gerrit Renker, CSCS.
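The agreement check described in dc8d97eb can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: `struct node_info`, `alps_has_node()` and `reconcile_with_alps()` are hypothetical names, the ALPS inventory is reduced to a plain array of node IDs, and the warning texts are only modelled on the log excerpt shown for f09febfe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified node record; not the actual SLURM node structure. */
struct node_info {
	unsigned int nid;    /* Cray node ID */
	bool marked_down;    /* already covered by a DownNodes line */
	bool is_down;        /* resulting node state */
	const char *reason;  /* Reason field, NULL if unset */
};

/* Linear lookup of a node ID in the (hypothetical) ALPS inventory list. */
static bool alps_has_node(const unsigned int *alps_nids, size_t n_alps,
			  unsigned int nid)
{
	for (size_t i = 0; i < n_alps; i++)
		if (alps_nids[i] == nid)
			return true;
	return false;
}

/*
 * Mark every configured node that ALPS does not list as down, without
 * overwriting a Reason that was already set via DownNodes.
 * Returns the number of configured nodes that are visible to ALPS.
 */
static size_t reconcile_with_alps(struct node_info *nodes, size_t n_nodes,
				  const unsigned int *alps_nids, size_t n_alps)
{
	size_t visible = 0;

	assert(nodes != NULL || n_nodes == 0);
	assert(alps_nids != NULL || n_alps == 0);

	for (size_t i = 0; i < n_nodes; i++) {
		if (alps_has_node(alps_nids, n_alps, nodes[i].nid)) {
			visible++;
			continue;
		}
		/* A warning rather than an error: the node may be in DownNodes. */
		fprintf(stderr, "warning: nid%05u: not visible to ALPS\n",
			nodes[i].nid);
		nodes[i].is_down = true;
		if (!nodes[i].marked_down)
			nodes[i].reason = "node not visible to ALPS";
	}
	/* Startup summary of how many configured nodes ALPS knows about. */
	if (visible < n_nodes)
		fprintf(stderr, "warning: ALPS sees only %zu/%zu slurm.conf nodes\n",
			visible, n_nodes);
	return visible;
}
```

The linear lookup keeps the sketch short; it costs one inventory scan per configured node, which corresponds to the small initialisation delay the commit message mentions.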
  2. 17 May, 2011 8 commits
    • Danny Auble's avatar
      03a8f312
    • Remove redundant AGENT_IS_THREAD parameter · f98781bf
      Morris Jette authored
    • Merge pull request #20 from SchedMD/master · 62ce4af3
      Morris Jette authored
      Latest Cray-specific modifications
    • select/cray: move local enum declaration back into function · 0f7b0ba3
      Morris Jette authored
      The enum is only needed and referenced in basil_geometry() and has otherwise
      no special meaning since it directly depends on the selected output columns.
      Patch from Gerrit Renker, CSCS.
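As an illustration of the point made in 0f7b0ba3, a function-local enum that names result columns might look like the sketch below. The function `format_coords()` and the column names are hypothetical and are not taken from basil_geometry(); the row is assumed to be an integer array whose order matches the query's selected columns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format one result row as "x,y,z". Returns the number of characters
 * written (as reported by snprintf), or -1 on a column-count mismatch.
 */
static int format_coords(const int *row, size_t n_cols,
			 char *buf, size_t buf_len)
{
	/*
	 * Column indices mirror the order of the selected output columns,
	 * so they have no meaning outside this function.
	 */
	enum { COL_X, COL_Y, COL_Z, COLUMN_COUNT };

	assert(row != NULL && buf != NULL);
	if (n_cols != (size_t)COLUMN_COUNT)
		return -1;
	return snprintf(buf, buf_len, "%d,%d,%d",
			row[COL_X], row[COL_Y], row[COL_Z]);
}
```

Keeping the enum inside the function means that a change to the selected columns only has to be reflected in one place.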
    • select/cray: fix failure to set nodes with NULL coordinates down · f09febfe
      Morris Jette authored
      This case was observed after taking a blade out of a CLE 2.x system. ALPS does not
      list the removed nodes, but they still appear in the XTAdmin.processor table, with
      NULL coordinates. Hence set node down if at least one coordinate is NULL.
      
      Also add a check of how many of the nodes in slurm.conf are visible to
      ALPS (the absence of this test masked the bug), always list DOWN nodes at startup,
      and clarify that ALPS errors during the initial SLURM configuration must be
      treated as fatal.
      
      On the system that is missing a blade, the log output is now:
       [2011-05-16T16:09:54] error: ALPS sees only 12/16 slurm.conf nodes
       [2011-05-16T16:09:54] Recovered state of 16 nodes
       [2011-05-16T16:09:54] Recovered state of 2 front_end nodes
       [2011-05-16T16:09:54] Recovered information about 0 jobs
       [2011-05-16T16:09:54] error: nid00028: unknown coordinates - hardware failure?
       [2011-05-16T16:09:54] error: nid00029: unknown coordinates - hardware failure?
       [2011-05-16T16:09:54] error: nid00030: unknown coordinates - hardware failure?
       [2011-05-16T16:09:54] error: nid00031: unknown coordinates - hardware failure?
      Patch from Gerrit Renker, CSCS.
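The rule from f09febfe ("set node down if at least one coordinate is NULL") can be sketched as below. This is only an illustration under assumptions: `struct coord_row`, `coords_unknown()` and `down_reason_for()` are hypothetical names, and a NULL database column is represented by a NULL pointer. The reason string is taken from the log excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row: a coordinate pointer is NULL when the column is NULL. */
struct coord_row {
	const int *x;
	const int *y;
	const int *z;
};

/* True if at least one of the three coordinates is missing. */
static bool coords_unknown(const struct coord_row *row)
{
	assert(row != NULL);
	return row->x == NULL || row->y == NULL || row->z == NULL;
}

/* Reason to record when setting the node down, or NULL if the row is fine. */
static const char *down_reason_for(const struct coord_row *row)
{
	return coords_unknown(row) ? "unknown coordinates - hardware failure?"
				   : NULL;
}
```

Checking all three columns with a logical OR treats a partially filled row the same as a fully empty one, matching the "at least one coordinate" wording of the commit message.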
    • select/cray: update documentation · 76066dcc
      Morris Jette authored
      This fixes some errors in the documentation of how memory is allocated, and adds missing bits.
      Patch from Gerrit Renker, CSCS.
    • Merge remote branch 'upstream/master' · 2477d9d7
      Morris Jette authored
    • BLUEGENE - Added block node cnt to be able to differentiate between a... · 5d8e9150
      Danny Auble authored
      BLUEGENE - Added block node cnt to be able to differentiate between a sub-block job and a regular full block job.
  3. 16 May, 2011 6 commits
  4. 14 May, 2011 1 commit
  5. 13 May, 2011 16 commits
  6. 12 May, 2011 8 commits