1. 08 Mar, 2011 6 commits
  2. 07 Mar, 2011 5 commits
  3. 06 Mar, 2011 10 commits
    • salloc: disable --no-shell mode · ca906680
      Moe Jette authored
      Since "aprun" is used on Cray instead of srun, the --no-shell option does not
      make any difference: with or without this option, the ALPS reservation is made, 
      and since it is confirmed using the SID of the current shell, aprun will run 
      even if the BASIL_RESERVATION_ID is not set.
      
      NB: the patch aborts with an error message. If this is later turned into a
          warning so that processing continues, opt.no_shell should be cleared,
          since otherwise interactive mode (and thus job control) is disabled.
    • slurmctld: remove dead code · 7b5a5dee
      Moe Jette authored
      return_hostlist is not populated in validate_nodes_via_front_end,
      hence never printed out.
    • select/cray: typos and outdated comments · 10f20cfc
      Moe Jette authored
      This patch
       * removes outdated and no longer applicable comments regarding
         consecutive node numbering (dating from an earlier revision);
       * fixes a typo and clarifies condition on XT/SeaStar systems.
    • libalps: use proper type for timestamps · d089c7c9
      Moe Jette authored
      This fixes an inconsistency: time_t is not necessarily u32, so a separate
      routine now parses the absolute value and the proper time_t type is used.
      
      Also tidied up code where possible.
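
As an illustration of the point about time_t, here is a minimal sketch of a dedicated timestamp parser. It is not the libalps routine: the name parse_timestamp() and its return convention are assumptions made for this example. It parses into a wide integer first and only then converts, so nothing depends on time_t being a 32-bit unsigned type.

```c
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper (not the libalps code): parse a non-negative decimal
 * timestamp into a time_t. Returns 0 on success, -1 on any error.
 */
int parse_timestamp(const char *str, time_t *out)
{
	char *end;
	long long val;

	if (str == NULL || out == NULL || *str == '\0')
		return -1;

	errno = 0;
	val = strtoll(str, &end, 10);
	if (errno != 0 || *end != '\0' || val < 0)
		return -1;

	/* Reject values that do not survive the conversion to time_t. */
	if ((long long)(time_t)val != val)
		return -1;

	*out = (time_t)val;
	return 0;
}
```

Going through strtoll() keeps the range check independent of the width of time_t on the build host.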
    • select/cray: handling errors in do_basil_release() · 70869e06
      Moe Jette authored
      This reduces the amount of error text printed on failure of do_basil_release():
       * parameter failures are caught by the existing calls to error(),
       * internal (ALPS) errors are printed by basil_release(),
       * there is no need to return additional error information via errno,
       * functions calling select_g_job_fini() just interpret the error, but no
         further action is taken, hence it is not necessary to indicate failure
         more than once.
      
      The following shows how setting SLURM_ERROR/errno produces unnecessarily long error text:
      
       [2011-02-09T18:19:51] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP uid=21215
       [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
       [2011-02-09T18:19:51] error: releasing ALPS resId 286 for JobId 2940 FAILED with -5
       [2011-02-09T18:19:51] error: select_g_job_fini(2940): No error
      
      With the patch, only
       [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
      would be printed, which is sufficient to diagnose the problem (resId 286 had been
      terminated by ALPS internally, after not receiving a confirmation quickly enough).
    • select/cray: perform "safe" release of ALPS reservations · e0801444
      Moe Jette authored
      This introduces a function which, before releasing an ALPS reservation,
      checks whether any application APIDs are still associated with it.
      
      If there are, it attempts to kill those presumably stray job steps using the
      Cray apkill(1) binary. In most cases this is sufficient to successfully release
      the reservation. If, on the other hand, the reservation is formally released
      while APIDs are still associated with it, the reservation will remain (and
      its resources will not be released) until the associated applications (APIDs)
      have terminated.
      
      Use of this function is restricted to cleaning up orphaned reservations. An
      attempt to also use it for normal (non-abortive) job termination produced
      false positives: error messages for APIDs that were still listed with the
      reservation although they had been released only moments before.
      
      The patch passed the following test case:
       1. set up an ALPS reservation: salloc -N 12
       2. spawn long-running apruns:  for i in {1..13};do aprun sleep 3600&done
       3. (in a different window)     kill -9 $(pidof salloc)
                                      scancel -u $USER
       4. after the job had completed within slurm, the following cleanup happened:
          [2011-03-05T13:19:07] debug2: purge_old_job: purged 1 old job records
          [2011-03-05T13:19:37] debug:  BASIL 3.1 INVENTORY: 128/176 batch nodes available
          [2011-03-05T13:19:37] debug:  ALPS: 12 node(s) still held
          [2011-03-05T13:19:37] error: orphaned ALPS reservation 147, trying to remove
          [2011-03-05T13:19:37] error: apkill live apid 168913 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168912 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168911 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168910 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168909 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168908 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168907 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168906 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168905 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168904 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168903 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168902 of ALPS resId 147
      
       ==> Subsequently, reservation 147 was released, and a new salloc could be granted.
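
The kill-then-release order described above can be sketched as follows. This is only an outline under stated assumptions, not the select/cray implementation: safe_release(), the two callback types and the demo_* stubs are hypothetical names, and the callbacks stand in for invoking apkill(1) and for the BASIL release request.

```c
#include <stddef.h>

/* Hypothetical callbacks: stand-ins for apkill(1) and the BASIL release. */
typedef int (*kill_apid_fn)(unsigned long apid);
typedef int (*release_resv_fn)(unsigned int resv_id);

/*
 * Kill every APID still attached to the reservation, then release it.
 * Kill failures are only counted, since the release may still succeed.
 * Returns whatever the release callback returns.
 */
int safe_release(unsigned int resv_id,
		 const unsigned long *apids, size_t n_apids,
		 kill_apid_fn kill_apid, release_resv_fn release_resv,
		 size_t *kill_failures)
{
	size_t failed = 0;

	for (size_t i = 0; i < n_apids; i++)
		if (kill_apid(apids[i]) != 0)
			failed++;

	if (kill_failures != NULL)
		*kill_failures = failed;
	return release_resv(resv_id);
}

/* Demo stubs that only record what was requested. */
static size_t demo_killed;
static unsigned int demo_released;

static int demo_kill(unsigned long apid)
{
	(void)apid;
	demo_killed++;
	return 0;
}

static int demo_release(unsigned int resv_id)
{
	demo_released = resv_id;
	return 0;
}
```

Passing the two operations in as callbacks keeps the ordering logic testable without an ALPS installation.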
    • select/cray: do not override the node reason field · cc4dc6ac
      Moe Jette authored
      With the current configuration, setting DownNodes in slurm.conf was not possible,
      since node_ptr->reason gets overwritten by basil_get_initial_state().
      
      The patch updates setting the initial state so that
       * initial 'reason' fields remain untouched;
       * a new 'reason' is set only if 
         - the node is not already recognized as down or
         - no reason has been set so far;
       * it frees any previously set 'reason' if the node is allocated or idle.
      
      This code has been tested to work while we were waiting for a missing replacement
      blade (marked as 'DownNodes' in slurm.conf).
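
The rules for the 'reason' field listed above can be sketched like this. It is a simplified model, not the code in basil_get_initial_state(): struct node_rec, set_node_down(), set_node_usable() and copy_string() are hypothetical names, and plain malloc()/free() stand in for whatever allocator the real code uses.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced node record: only the fields the rules mention. */
struct node_rec {
	bool down;
	char *reason;
};

/* Portable string copy (avoids relying on strdup() being declared). */
static char *copy_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy != NULL)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Mark a node down. A new reason is stored only if the node was not
 * already down, or if no reason has been set so far; an existing reason
 * on a node that is already down (e.g. from DownNodes) is left untouched.
 */
void set_node_down(struct node_rec *n, const char *why)
{
	if (!n->down || n->reason == NULL) {
		free(n->reason);
		n->reason = copy_string(why);
	}
	n->down = true;
}

/* Node is allocated or idle: drop any previously set reason. */
void set_node_usable(struct node_rec *n)
{
	free(n->reason);
	n->reason = NULL;
	n->down = false;
}
```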
    • select/cray: check on the ALPS side whether node is allocated · dd03bb07
      Moe Jette authored
      This fixes a bug in handling nodes: the code so far ignored whether nodes
      are still allocated to jobs.
      
      The patch therefore adds the following ALPS test:
      
       "If any node still has an ALPS reservation for CPUs or memory, it is
        considered allocated (has an active ALPS reservation associated with it)."
      
      Details of changes:
      -------------------
       1. general: resurrected node_is_allocated() libalps function
          - returns true if there is an ALPS reservation for CPUs/memory on a node;
       2. basil_get_initial_state():
          - clarified reliance on reset_job_bitmaps() and _sync_nodes_to_jobs(), to
            clean up associated jobs (the latter function to kill jobs on DOWN nodes),
          - added missing case for nodes that are still allocated after SLURM restart,
          - fixed an error in documentation: the comment about allocation was wrong;
       3. basil_inventory():
          - now looks at both SLURM/ALPS node-allocation state,
          - if ALPS-allocated and not SLURM-allocated, sets 'mismatch' flag (if this
            case is triggered by an orphaned ALPS reservation, the flag is set again),
          - if there is a SLURM/ALPS mismatch, scheduling is deferred.
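
The allocation test and the mismatch check described above might look roughly like the sketch below. The struct and function names (node_view, alps_allocated(), inventory_mismatch()) are invented for this illustration and do not mirror the libalps data structures; the memory unit is likewise an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node view combining the ALPS and SLURM sides. */
struct node_view {
	unsigned int alps_reserved_cpus;   /* CPUs held by an ALPS reservation */
	unsigned int alps_reserved_mem_mb; /* memory held by an ALPS reservation */
	bool slurm_allocated;              /* SLURM considers the node allocated */
};

/* A node counts as ALPS-allocated if any CPUs or memory are reserved. */
bool alps_allocated(const struct node_view *n)
{
	return n->alps_reserved_cpus > 0 || n->alps_reserved_mem_mb > 0;
}

/*
 * Returns true if at least one node is allocated according to ALPS but
 * not according to SLURM. In that case scheduling would be deferred.
 */
bool inventory_mismatch(const struct node_view *nodes, size_t count)
{
	for (size_t i = 0; i < count; i++)
		if (alps_allocated(&nodes[i]) && !nodes[i].slurm_allocated)
			return true;
	return false;
}
```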
    • select/cray: better resiliency against bad nodes · 0a25539d
      Moe Jette authored
      This lets the select/cray code deal more gracefully with bad nodes:
       * avoid sscanf(NULL, ...) in basil_geometry();
       * avoid fatal() if node_ptr->name[0] == '\0'.
      
      The other three functions,
       * basil_node_ranking(),
       * basil_get_initial_state(), and
       * basil_inventory()
      rely on find_node_record() to return NULL on finding a bad node - which will
      trigger an error condition, but not cause the program to abort.
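
The two guards named above (no sscanf() on a NULL pointer, no abort on an empty node name) follow a common pattern, sketched here. parse_node_id() is a hypothetical helper, and the "nid" name prefix is an assumption made for the example; the point is that a bad node yields an error return instead of a crash.

```c
#include <stdio.h>

/*
 * Hypothetical helper: extract the numeric ID from a node name such as
 * "nid00804". Returns 0 on success, -1 if the name is missing, empty,
 * or does not match, so the caller can skip the node and carry on.
 */
int parse_node_id(const char *name, unsigned int *id)
{
	if (name == NULL || name[0] == '\0')
		return -1;
	if (sscanf(name, "nid%u", id) != 1)
		return -1;
	return 0;
}
```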
    • select/cray: fix error in 'is_gemini' logic · 6c927b3f
      Moe Jette authored
      The is_gemini logic is too simple: as just observed on a SeaStar system, it can
      be fooled into the wrong result if more than 1 row has NULL coordinates. 
      
      This case happens if a blade has been powered down completely, so that the SeaStar
      network chip is also powered off. The routing system recognizes this case and 
      routes around the powered-down node in the torus. It is plausible that in such a
      case the torus coordinates are NULL, since the node(s) are no longer part of the
      torus. 
      
      (It is also possible to set all nodes on a blade down, but leave power switched
       on. The SeaStar chip, which is independent of the motherboard, will continue to
       provide routing connectivity, i.e. the torus coordinates would all be non-NULL,
       but no computing can be done by the node, the ALPS state is "ROUTING".)
      
      Here is the example which revealed this behaviour: one blade, nodes 804-807,
      had been powered down after system failure.
      
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1882 |                                    1878 | 
      +----------+-----------------------------------------+
      
      ==> There are 4 more node IDs than there are distinct coordinates.
      
      mysql> select processor_id,x_coord,y_coord,z_coord from processor\
             WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
      +--------------+---------+---------+---------+
      | processor_id | x_coord | y_coord | z_coord |
      +--------------+---------+---------+---------+
      |          804 |    NULL |    NULL |    NULL | 
      |          805 |    NULL |    NULL |    NULL | 
      |          806 |    NULL |    NULL |    NULL | 
      |          807 |    NULL |    NULL |    NULL | 
      +--------------+---------+---------+---------+
      
      ==> The corrected query now also gives the correct result (equality):
      mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
             WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
      +----------+-----------------------------------------+
      | COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
      +----------+-----------------------------------------+
      |     1878 |                                    1878 | 
      +----------+-----------------------------------------+
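
The corrected comparison can also be expressed outside SQL. The sketch below assumes, based on the queries above, that the check compares the number of rows against the number of distinct coordinate triples; rows with NULL coordinates are skipped on both sides, which is what the added WHERE clause achieves (MySQL's COUNT(DISTINCT ...) already ignores rows containing NULL, whereas COUNT(*) does not). The names torus_row and surplus_rows() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical row of the processor table; is_null marks NULL coordinates. */
struct torus_row {
	bool is_null;
	int x, y, z;
};

static bool same_coords(const struct torus_row *a, const struct torus_row *b)
{
	return a->x == b->x && a->y == b->y && a->z == b->z;
}

/*
 * Counts rows with non-NULL coordinates and subtracts the number of
 * distinct coordinate triples among them. A result of 0 means every
 * node has its own coordinates (the "equality" case shown above).
 * Quadratic, which is fine for a sketch.
 */
size_t surplus_rows(const struct torus_row *rows, size_t n)
{
	size_t total = 0, distinct = 0;

	for (size_t i = 0; i < n; i++) {
		bool seen = false;

		if (rows[i].is_null)
			continue;
		total++;
		for (size_t j = 0; j < i && !seen; j++)
			seen = !rows[j].is_null &&
			       same_coords(&rows[i], &rows[j]);
		if (!seen)
			distinct++;
	}
	return total - distinct;
}
```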
  4. 04 Mar, 2011 19 commits