1. 06 Mar, 2011 3 commits
      select/cray: check on the ALPS side whether node is allocated · dd03bb07
      Moe Jette authored
      This fixes a bug in node handling: the code so far ignored whether nodes
      are still allocated to jobs.
      
      The patch therefore adds the following ALPS test:
      
       "If any node still has an ALPS reservation for CPUs or memory, it is
        considered allocated (has an active ALPS reservation associated with it)."
      
      Details of changes:
      -------------------
       1. general: resurrected node_is_allocated() libalps function
          - returns true if there is an ALPS reservation for CPUs/memory on a node;
       2. basil_get_initial_state():
          - clarified reliance on reset_job_bitmaps() and _sync_nodes_to_jobs(), to
            clean up associated jobs (the latter function to kill jobs on DOWN nodes),
          - added missing case for nodes that are still allocated after SLURM restart,
          - fixed an incorrect code comment about allocation;
       3. basil_inventory():
          - now looks at both SLURM/ALPS node-allocation state,
          - if ALPS-allocated and not SLURM-allocated, sets 'mismatch' flag (if this
            case is triggered by an orphaned ALPS reservation, the flag is set again),
          - if there is a SLURM/ALPS mismatch, scheduling is deferred.
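
The allocation test and the mismatch rule described above can be sketched as follows. This is a minimal illustration and not the libalps code: the struct `sketch_alps_node`, its fields, and the functions `sketch_node_is_allocated()` and `sketch_is_mismatch()` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one node as reported by ALPS. */
struct sketch_alps_node {
    uint32_t node_id;
    uint32_t cpus_reserved;    /* CPUs held by an ALPS reservation */
    uint32_t mem_reserved_mb;  /* memory held by an ALPS reservation */
};

/* True if any CPU or memory on the node is still reserved in ALPS. */
bool sketch_node_is_allocated(const struct sketch_alps_node *node)
{
    assert(node != NULL);
    return node->cpus_reserved > 0 || node->mem_reserved_mb > 0;
}

/*
 * True if ALPS considers the node allocated while SLURM does not.
 * Per the commit message, scheduling is deferred in that case.
 */
bool sketch_is_mismatch(const struct sketch_alps_node *node,
                        bool slurm_allocated)
{
    return sketch_node_is_allocated(node) && !slurm_allocated;
}
```

A node that is allocated on both sides, or idle on both sides, is not a mismatch in this sketch; only the ALPS-allocated, SLURM-idle combination sets the flag.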
      select/cray: better resiliency against bad nodes · 0a25539d
      Moe Jette authored
      This lets the select/cray code deal more gracefully with bad nodes:
       * avoid sscanf(NULL, ...) in basil_geometry();
       * avoid fatal() if node_ptr->name[0] == '\0'.
      
      The other three functions,
       * basil_node_ranking(),
       * basil_get_initial_state(), and
       * basil_inventory()
      rely on find_node_record() to return NULL on finding a bad node - which will
      trigger an error condition, but not cause the program to abort.
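
The two guards listed above (no `sscanf()` on a NULL pointer, no abort on an empty node name) follow a common defensive pattern, sketched below. The function `sketch_parse_node_id()` is hypothetical, and the `nid<number>` name format is an assumption made for illustration; the actual parsing in basil_geometry() may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Extract the numeric ID from a node name such as "nid00804".
 * Returns false instead of aborting when the name is NULL, empty,
 * or does not match the assumed "nid<number>" format.
 */
bool sketch_parse_node_id(const char *name, unsigned int *node_id)
{
    assert(node_id != NULL);

    /* Guard first: passing NULL to sscanf() is undefined behaviour. */
    if (name == NULL || name[0] == '\0')
        return false;

    return sscanf(name, "nid%u", node_id) == 1;
}
```

A caller can then log an error and skip the bad node rather than terminate the daemon.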
      select/cray: fix error in 'is_gemini' logic · 6c927b3f
      Moe Jette authored
      The is_gemini logic is too simple: as just observed on a SeaStar system, it can
      be fooled into the wrong result if more than 1 row has NULL coordinates. 
      
      This case happens if a blade has been powered down completely, so that the SeaStar
      network chip is also powered off. The routing system recognizes this case and 
      routes around the powered-down node in the torus. It is plausible that in such a
      case the torus coordinates are NULL, since the node(s) are no longer part of the
      torus. 
      
      (It is also possible to set all nodes on a blade down but leave power switched
       on. The SeaStar chip, which is independent of the motherboard, will continue to
       provide routing connectivity, i.e. the torus coordinates would all be non-NULL.
       The node can do no computing in this case, and its ALPS state is "ROUTING".)
      
      Here is the example which revealed this behaviour: one blade, nodes 804-807,
      had been powered down after system failure.
      
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1882 |                                    1878 | 
      +----------+-----------------------------------------+
      
      ==> There are 4 more node IDs than there are distinct coordinates.
      
      mysql> select processor_id,x_coord,y_coord,z_coord from processor\
             WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
      +--------------+---------+---------+---------+
      | processor_id | x_coord | y_coord | z_coord |
      +--------------+---------+---------+---------+
      |          804 |    NULL |    NULL |    NULL | 
      |          805 |    NULL |    NULL |    NULL | 
      |          806 |    NULL |    NULL |    NULL | 
      |          807 |    NULL |    NULL |    NULL | 
      +--------------+---------+---------+---------+
      
      ==> The corrected query, which excludes rows with NULL coordinates, gives
          the expected result (equality):
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
             WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1878 |                                    1878 | 
      +----------+-----------------------------------------+
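
The effect of the corrected query can be reproduced on in-memory rows, as in the sketch below. It assumes that the check compares the row count with the number of distinct coordinate triples, as the queries above suggest, and that inequality means at least two nodes share a coordinate. The struct `sketch_coord_row` and the function `sketch_coords_are_shared()` are hypothetical and not part of the select/cray code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row of the processor table; has_coords is false for NULLs. */
struct sketch_coord_row {
    bool has_coords;
    int  x, y, z;
};

/*
 * Among rows with complete coordinates, report whether any two rows
 * share the same (x, y, z). This mirrors comparing COUNT(*) with
 * COUNT(DISTINCT x, y, z) after filtering out rows with NULL coordinates.
 */
bool sketch_coords_are_shared(const struct sketch_coord_row *rows, size_t n)
{
    assert(rows != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!rows[i].has_coords)
            continue;               /* powered-down node: ignore */
        for (size_t j = i + 1; j < n; j++) {
            if (rows[j].has_coords &&
                rows[i].x == rows[j].x &&
                rows[i].y == rows[j].y &&
                rows[i].z == rows[j].z)
                return true;
        }
    }
    return false;
}
```

With the four powered-down nodes skipped, a system where every remaining node has its own coordinate yields false here, which corresponds to the equality shown in the last query.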
  2. 04 Mar, 2011 23 commits
  3. 03 Mar, 2011 14 commits