1. 10 Mar, 2011 3 commits
  2. 09 Mar, 2011 8 commits
  3. 08 Mar, 2011 16 commits
  4. 07 Mar, 2011 6 commits
  5. 06 Mar, 2011 7 commits
    • salloc: disable --no-shell mode · ca906680
      Moe Jette authored
      Since "aprun" is used on Cray instead of srun, the --no-shell option does not
      make any difference: with or without this option, the ALPS reservation is made,
      and since the reservation is confirmed using the SID of the current shell, aprun
      will run even if BASIL_RESERVATION_ID is not set.
      NB: the patch aborts with an error message. If this is later turned into a
          warning so that processing continues, opt.no_shell should be cleared, since
          otherwise interactive mode (and thus job control) is disabled.
    • slurmctld: remove dead code · 7b5a5dee
      Moe Jette authored
      return_hostlist is not populated in validate_nodes_via_front_end,
      hence never printed out.
    • select/cray: typos and outdated comments · 10f20cfc
      Moe Jette authored
       * removes outdated and no longer applicable comments regarding
         consecutive node numbering (dating from an earlier revision);
       * fixes a typo and clarifies condition on XT/SeaStar systems.
    • libalps: use proper type for timestamps · d089c7c9
      Moe Jette authored
      This fixes an inconsistency: time_t is not necessarily u32, so a separate
      routine now parses the absolute value and stores it in a proper time_t.
      Also tidied up code where possible.
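      The idea of parsing a timestamp without assuming a 32-bit type can be sketched
      as follows. This is only an illustration, not the libalps routine: the name
      parse_timestamp() and its return convention are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper (not the libalps code): parse a non-negative
 * decimal timestamp into a time_t without assuming time_t is 32 bits.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_timestamp(const char *str, time_t *out)
{
	char *end = NULL;
	long long val;

	assert(out != NULL);
	if (str == NULL || *str == '\0')
		return -1;

	errno = 0;
	val = strtoll(str, &end, 10);
	/* Reject overflow, empty conversions, trailing junk and negatives. */
	if (errno == ERANGE || end == str || *end != '\0' || val < 0)
		return -1;

	/* Reject values that do not survive the conversion to time_t. */
	if ((long long)(time_t)val != val)
		return -1;

	*out = (time_t)val;
	return 0;
}
```

      Parsing into a wide intermediate type and checking the round trip keeps the
      helper independent of the actual width of time_t on the target platform.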
    • select/cray: handling errors in do_basil_release() · 70869e06
      Moe Jette authored
      This reduces the amount of error text printed on failure of do_basil_release():
       * parameter failures are caught by the existing calls to error(),
       * internal (ALPS) errors are printed by basil_release(),
       * there is no need to return additional error information via errno,
       * functions calling select_g_job_fini() just interpret the error, but no
         further action is taken, hence it is not necessary to indicate failure
         more than once.
      The following shows how setting SLURM_ERROR/errno produces unnecessarily long error text:
       [2011-02-09T18:19:51] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP uid=21215
       [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
       [2011-02-09T18:19:51] error: releasing ALPS resId 286 for JobId 2940 FAILED with -5
       [2011-02-09T18:19:51] error: select_g_job_fini(2940): No error
      With the patch, only
       [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
      would be printed, which is sufficient to diagnose the problem (resId 286 had been
      terminated by ALPS internally, after not receiving a confirmation quickly enough).
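      The "report each failure exactly once" pattern described above might look
      roughly like this. It is a simplified sketch, not the patch itself: log_error(),
      backend_release() and release_reservation() are hypothetical names, and treating
      reservation id 0 as invalid is an assumption made for the example.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static int error_lines;	/* number of error messages emitted so far */

/* Minimal stand-in for a logging call; counts each message it prints. */
static void log_error(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	fputs("error: ", stderr);
	vfprintf(stderr, fmt, ap);
	fputc('\n', stderr);
	va_end(ap);
	error_lines++;
}

/* Hypothetical backend call: it reports its own (ALPS) failures. */
static int backend_release(unsigned int resv_id, int backend_fails)
{
	if (backend_fails) {
		log_error("ALPS error: No entry for resId %u", resv_id);
		return -1;
	}
	return 0;
}

/*
 * Hypothetical wrapper: parameter problems are logged here; backend
 * problems were already logged, so the status is passed on silently
 * and errno is left alone.
 */
static int release_reservation(unsigned int resv_id, int backend_fails)
{
	assert(error_lines >= 0);
	if (resv_id == 0) {
		log_error("invalid reservation id");
		return -1;
	}
	return backend_release(resv_id, backend_fails);
}
```

      A caller only needs the return value to decide what to do; it does not add a
      further message for a failure that has already been logged.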
    • select/cray: perform "safe" release of ALPS reservations · e0801444
      Moe Jette authored
      This introduces a function which, before releasing an ALPS reservation, first
      checks whether any application APIDs are still associated with it.
      If there are, it attempts to kill those presumably stray job steps using the
      Cray apkill(1) binary. In most cases this is sufficient to successfully release
      the reservation. If, on the other hand, the reservation is formally released
      while APIDs are still associated with it, the reservation will remain (and
      its resources will not be released back) until the associated applications
      (APIDs) have terminated.
      Use of this function is restricted to cleaning up orphaned reservations. An
      attempt to also use it for normal (non-abortive) job termination resulted
      in error messages where the APID was still associated with the reservation
      but had been released just shortly before, i.e. it generated false positives.
      The patch passed the following test case:
       1. set up an ALPS reservation: salloc -N 12
       2. spawn long-running apruns:  for i in {1..13};do aprun sleep 3600&done
       3. (in a different window)     kill -9 $(pidof salloc)
                                      scancel -u $USER
       4. after the job had completed within slurm, the following cleanup happened:
          [2011-03-05T13:19:07] debug2: purge_old_job: purged 1 old job records
          [2011-03-05T13:19:37] debug:  BASIL 3.1 INVENTORY: 128/176 batch nodes available
          [2011-03-05T13:19:37] debug:  ALPS: 12 node(s) still held
          [2011-03-05T13:19:37] error: orphaned ALPS reservation 147, trying to remove
          [2011-03-05T13:19:37] error: apkill live apid 168913 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168912 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168911 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168910 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168909 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168908 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168907 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168906 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168905 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168904 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168903 of ALPS resId 147
          [2011-03-05T13:19:37] error: apkill live apid 168902 of ALPS resId 147
       ==> Subsequently, reservation 147 was released, and a new salloc could be granted.
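      The control flow of such a "safe" release (list live APIDs, kill them, then
      release) could be sketched as below. This is not the select/cray code: struct
      alps_ops, safe_release() and the in-memory fake backend are all hypothetical
      and exist only to make the sketch self-contained. Real code would query ALPS
      and invoke apkill(1) instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_APIDS 64

/* Hypothetical backend hooks (assumption, not a real ALPS interface). */
struct alps_ops {
	size_t (*list_apids)(uint32_t resv_id, uint64_t *apids, size_t max);
	int    (*kill_apid)(uint64_t apid);
	int    (*release)(uint32_t resv_id);
};

/* Kill any applications still attached, then release the reservation. */
static int safe_release(const struct alps_ops *ops, uint32_t resv_id)
{
	uint64_t apids[MAX_APIDS];
	size_t i, n;

	assert(ops != NULL);
	n = ops->list_apids(resv_id, apids, MAX_APIDS);
	for (i = 0; i < n; i++)
		(void)ops->kill_apid(apids[i]);	/* best effort */
	return ops->release(resv_id);
}

/* In-memory fake backend, used only to exercise the sketch. */
static uint64_t fake_apids[] = { 168902, 168903, 168904 };
static size_t   fake_live = 3;
static int      fake_released;

static size_t fake_list(uint32_t resv_id, uint64_t *apids, size_t max)
{
	size_t i, n = fake_live < max ? fake_live : max;

	(void)resv_id;
	for (i = 0; i < n; i++)
		apids[i] = fake_apids[i];
	return n;
}

static int fake_kill(uint64_t apid)
{
	size_t i;

	for (i = 0; i < fake_live; i++) {
		if (fake_apids[i] == apid) {
			/* Remove the entry by moving the last one into its slot. */
			fake_apids[i] = fake_apids[--fake_live];
			return 0;
		}
	}
	return -1;
}

/* The fake refuses to release while applications are still attached. */
static int fake_release(uint32_t resv_id)
{
	(void)resv_id;
	if (fake_live > 0)
		return -1;
	fake_released = 1;
	return 0;
}
```

      Keeping the backend behind function pointers is a choice made for the example
      so that the ordering (kill first, release second) can be checked without a
      Cray system.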
    • select/cray: do not override the node reason field · cc4dc6ac
      Moe Jette authored
      With the current configuration, setting DownNodes in slurm.conf was not possible,
      since node_ptr->reason gets overwritten by basil_get_initial_state().
      The patch changes how the initial state is set, so that
       * initial 'reason' fields remain untouched;
       * a new 'reason' is set only if
         - the node is not already recognized as down or
         - no reason has been set so far;
       * any previously set 'reason' is freed if the node is allocated or idle.
      This code was tested while we were waiting for a missing replacement
      blade (marked as 'DownNodes' in slurm.conf).
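      The rules for the 'reason' field listed above can be illustrated with a small
      sketch. It is not the basil_get_initial_state() code: struct node_rec,
      mark_node_down(), mark_node_up(), copy_string() and the reason strings are
      hypothetical and only model the three rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a node record. */
struct node_rec {
	bool  is_down;	/* node already recognized as down */
	char *reason;	/* heap-allocated reason string, or NULL */
};

/* Return a heap copy of src, or NULL if allocation fails. */
static char *copy_string(const char *src)
{
	char *dst = malloc(strlen(src) + 1);

	if (dst != NULL)
		strcpy(dst, src);
	return dst;
}

/*
 * Node reported unusable: set a new reason only if the node is not
 * already down or no reason has been set so far. An existing reason
 * on a node that is already down (e.g. from DownNodes) is kept.
 */
static void mark_node_down(struct node_rec *node, const char *new_reason)
{
	assert(node != NULL && new_reason != NULL);
	if (!node->is_down || node->reason == NULL) {
		free(node->reason);
		node->reason = copy_string(new_reason);
	}
	node->is_down = true;
}

/* Node reported allocated or idle: free any previously set reason. */
static void mark_node_up(struct node_rec *node)
{
	assert(node != NULL);
	free(node->reason);
	node->reason = NULL;
	node->is_down = false;
}
```

      With these rules, a reason configured by the administrator for a down node
      survives the initial state query, while stale reasons disappear once the node
      is usable again.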