1. 06 Mar, 2011 2 commits
    • select/cray: better resiliency against bad nodes · 0a25539d
      Moe Jette authored
      This lets the select/cray code deal more gracefully with bad nodes:
       * avoid sscanf(NULL, ...) in basil_geometry();
       * avoid fatal() if node_ptr->name[0] == '\0'.
      
      The other three functions,
       * basil_node_ranking(),
       * basil_get_initial_state(), and
       * basil_inventory()
      rely on find_node_record() returning NULL for a bad node, which triggers an
      error condition but does not cause the program to abort.
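The two guards listed above can be illustrated with a minimal sketch. This is not the Slurm source: parse_int_field() and node_name_is_usable() are hypothetical helpers, and the assumption is that a bad node shows up either as a NULL string field or as an empty node name.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the Slurm source): parse one integer field
 * that may be NULL, as can happen for a bad node.
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
static int parse_int_field(const char *field, int *out)
{
    assert(out != NULL);
    if (field == NULL)      /* never hand NULL to sscanf() */
        return -1;
    if (sscanf(field, "%d", out) != 1)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: treat a NULL or empty node name as a bad node
 * that the caller can skip and report, instead of aborting.
 */
static int node_name_is_usable(const char *name)
{
    return name != NULL && name[0] != '\0';
}
```

A caller would check both results, log an error for the affected node, and continue with the next one.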
    • select/cray: fix error in 'is_gemini' logic · 6c927b3f
      Moe Jette authored
      The is_gemini logic is too simple: as just observed on a SeaStar system, it can
      be fooled into the wrong result if more than one row has NULL coordinates.
      
      This case happens if a blade has been powered down completely, so that the SeaStar
      network chip is also powered off. The routing system recognizes this case and 
      routes around the powered-down node in the torus. It is plausible that in such a
      case the torus coordinates are NULL, since the node(s) are no longer part of the
      torus. 
      
      (It is also possible to set all nodes on a blade down but leave the power
       switched on. The SeaStar chip, which is independent of the motherboard, then
       continues to provide routing connectivity, i.e. the torus coordinates are all
       non-NULL, but the node can do no computing and its ALPS state is "ROUTING".)
      
      Here is the example which revealed this behaviour: one blade, nodes 804-807,
      had been powered down after system failure.
      
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1882 |                                    1878 | 
      +----------+-----------------------------------------+
      
      ==> There are 4 more node IDs than there are distinct coordinates.
      
      mysql> select processor_id,x_coord,y_coord,z_coord from processor\
             WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
      +--------------+---------+---------+---------+
      | processor_id | x_coord | y_coord | z_coord |
      +--------------+---------+---------+---------+
      |          804 |    NULL |    NULL |    NULL | 
      |          805 |    NULL |    NULL |    NULL | 
      |          806 |    NULL |    NULL |    NULL | 
      |          807 |    NULL |    NULL |    NULL | 
      +--------------+---------+---------+---------+
      
      ==> The corrected query now also gives the correct result (equality):
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
             WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1878 |                                    1878 | 
      +----------+-----------------------------------------+
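The same comparison can be sketched in C over rows already read from the table. This is only an illustration under stated assumptions, not the select/cray code: struct node_row, count_coords() and coords_are_shared() are hypothetical names, and the check is assumed to treat "more rows than distinct coordinates" as the sign that nodes share a coordinate. Rows without coordinates are skipped, mirroring the IS NOT NULL filter in the corrected query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row type: has_coord is false when x/y/z are NULL in the table. */
struct node_row {
    bool has_coord;
    int x, y, z;
};

static bool same_coord(const struct node_row *a, const struct node_row *b)
{
    return a->x == b->x && a->y == b->y && a->z == b->z;
}

/*
 * Count the rows that have coordinates and the distinct coordinate
 * triples among them. Rows without coordinates are ignored in both
 * counts. Quadratic in n, which is acceptable for a sketch.
 */
static void count_coords(const struct node_row *rows, size_t n,
                         size_t *with_coord, size_t *distinct)
{
    assert(with_coord != NULL && distinct != NULL);
    *with_coord = 0;
    *distinct = 0;
    for (size_t i = 0; i < n; i++) {
        bool seen = false;

        if (!rows[i].has_coord)
            continue;
        (*with_coord)++;
        /* Look for an earlier row with the same coordinate triple. */
        for (size_t j = 0; j < i && !seen; j++)
            seen = rows[j].has_coord && same_coord(&rows[i], &rows[j]);
        if (!seen)
            (*distinct)++;
    }
}

/* Assumption: a surplus of rows over distinct coordinates means sharing. */
static bool coords_are_shared(const struct node_row *rows, size_t n)
{
    size_t with_coord, distinct;

    count_coords(rows, n, &with_coord, &distinct);
    return with_coord > distinct;
}
```

With the four powered-down nodes excluded, both counts agree and the check no longer misreads the SeaStar system.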
  2. 04 Mar, 2011 23 commits
  3. 03 Mar, 2011 15 commits