1. 02 Jun, 2011 5 commits
  2. 01 Jun, 2011 6 commits
  3. 31 May, 2011 4 commits
  4. 30 May, 2011 1 commit
  5. 29 May, 2011 6 commits
    • select/cray alps emulation coordinate fix · c65510a9
      Morris Jette authored
      Fix a couple of problems in ALPS emulation mode caused by recent changes
      in the select/cray plugin: node coordinates and signal return code
    • select/cray: whitespace fixes and removal of unused code · 5761c40e
      Morris Jette authored
      select/cray: whitespace fixes and removal of unused code
      Patch 10_Cray_COSMETICS-whitespace.diff from Gerrit Renker, CSCS
    • slurmd: suppress frontend debug messages · 70d22622
      Morris Jette authored
      On the slurmd, the function build_all_frontend_info() is called before logging
      is fully initialized. This causes the frontend debug messages (which also get
      redundantly printed in the slurmctld log file) to be sent to stderr.
      
      On our system (where all slurmds get started remotely, via pdsh) this
      output to stderr caused the startup to hang.
      
      The patch uses a solution similar to build_all_node_line_info(), where a
      boolean flag is used to avoid repeating the slurmctld message in slurmd
      context.
      Patch 08_Multiple-Frontend_suppress_initial_debug_message.diff from Gerrit Renker, CSCS
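
      The boolean-flag approach described above can be sketched roughly as follows.
      This is an illustration only, not the code from the patch: the function name
      build_frontend_info_sketch(), its parameters and the message text are all
      hypothetical. It shows a shared parsing routine that prints its debug lines
      only when the caller asks for them, so a daemon whose logging is not yet
      initialized can call it silently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a config-parsing routine shared by two daemons.
 * Returns the number of debug lines written, so callers can check it.
 */
static int build_frontend_info_sketch(const char *const *names, int count,
				      bool emit_debug)
{
	int emitted = 0;

	assert(names != NULL && count >= 0);
	for (int i = 0; i < count; i++) {
		/* ... record names[i] in the frontend table (omitted) ... */
		if (emit_debug) {
			/* Only the caller with working logging prints this. */
			fprintf(stderr, "debug: frontend %s\n", names[i]);
			emitted++;
		}
	}
	return emitted;
}
```

      In this sketch the slurmd-like caller would pass false and the slurmctld-like
      caller would pass true, so the message is printed once, in one log.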
    • select/cray: fix race condition · ea3c31fe
      Morris Jette authored
      select/cray: fix race condition when canceling job during batch launch
      
      This fixes a race condition which occurs when a job is cancelled during batch launch.
      It is a bug because the condition causes the frontend node to be set to state DRAINING.
      
      The fix catches this particular condition and treats it as a non-fatal error.
      This keeps operation robust, since the entire frontend node is no longer
      drained.
      
      Short logfile dump of condition:
      ================================
       [2011-05-19T17:20:41] ALPS RESERVATION #2878, JobId 76343: BASIL -n 60 -N 0 -d 1 -m 1333
       [2011-05-19T17:20:41] backfill: Started JobId=76343 on nid0[1037,1549,1805,2061,2317]
       [2011-05-19T17:20:43] sched: Cancel of JobId=76343 by UID=21329, usec=389791
       [2011-05-19T17:20:45] error: slurmd error 4014 running JobId=76343 on front_end=rosa2: Slurmd could not set up environment for batch job
       [2011-05-19T17:20:45] update_front_end: set state of rosa2 to DRAINING
      
       apsched0519:
       17:20:41: File new reservation resId 2878 pagg 0
       17:20:41: Confirmed apid 125156 resId 2878 pagg 0 nids: 1037,1549,1805,2061,2317
       17:20:43: ...cancel_msg:249: cancel reservation resId 2878
       17:20:43: Canceled apid 125156 resId 2878 pagg 0
       17:20:45: type bind uid 0 gid 0 apid 0 pagg 13516639560892680485 resId 2878 numCmds 0
       17:20:45: placeApp message:0x1 cannot find resId 2878
      
       frontend node: rosa2.log
       [2011-05-19T17:20:41] Launching batch job 76343 for UID 21329
       [2011-05-19T17:20:45] Job 76343 killed while launch was in progress
       [2011-05-19T17:20:45] [76343] *** JOB 76343 CANCELLED AT 2011-05-19T17:20:45 ***
       [2011-05-19T17:20:45] [76343] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2878
       [2011-05-19T17:20:45] [76343] confirming ALPS resId 2878 of JobId 76343 FAILED: ALPS backend error
       [2011-05-19T17:20:45] [76343] could not confirm ALPS reservation #2878
       [2011-05-19T17:20:45] [76343] job_manager exiting abnormally, rc = 4014
      
      Detailed analysis:
      ==================
      The slurmctld first created a reservation in select_nodes() -> select_g_job_begin() -> do_basil_reserve():
       [2011-05-19T10:56:19] ALPS RESERVATION #2511, JobId 74991: BASIL -n 12 -N 0 -d 1 -m 1333
       [2011-05-19T10:56:19] backfill: Started JobId=74991 on nid01347
      
       10:56:19: File new reservation resId 2511 pagg 0
       10:56:19: Confirmed apid 123762 resId 2511 pagg 0 nids: 1347
      
      The next call after select_nodes() in backfill.c:_start_job() was launch_job(), which on the
      slurmd node rosa12 produced the following message in _rpc_batch_job() upon receipt
      of REQUEST_BATCH_JOB_LAUNCH:
      
       [2011-05-19T10:56:19] Launching batch job 74991 for UID 21487
      
      This caused the launch_mutex to be taken, followed by the call rc = _forkexec_slurmstepd().
      While this was in progress, the user decided to scancel the job, apparently with the default SIGTERM:
      
       [2011-05-19T10:56:20] sched: Cancel of JobId=74991 by UID=21487, usec=358632
       [2011-05-19T10:56:20] sched: Cancel of JobId=74994 by UID=21487, usec=783954
      
      This was in _slurm_rpc_job_step_kill() upon receiving REQUEST_CANCEL_JOB_STEP from scancel.
      While the slurmstepd was preparing the job steps, it signalled cancellation
      
       [2011-05-19T10:56:20] [74991] *** JOB 74991 CANCELLED AT 2011-05-19T10:56:20 ***
      
      via _rpc_signal_tasks() of the slurmd. Most likely this was from slurmctld:job_signal() -> _signal_batch_job(),
      which means that the reservation had already been cancelled via select_g_job_signal() -> do_basil_release():
      
       10:56:20: ...cancel_msg:249: cancel reservation resId 2511
       10:56:20: type cancel uid 0 gid 0 apid 0 pagg 0 resId 2511 numCmds 0
       10:56:20: Canceled apid 123762 resId 2511 pagg 0
      
      Meanwhile the slurmstepd continued to run by starting job_manager():
       [2011-05-19T10:56:20] [74991] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2511
       [2011-05-19T10:56:20] [74991] confirming ALPS resId 2511 of JobId 74991 FAILED: ALPS backend error
       [2011-05-19T10:56:20] [74991] could not confirm ALPS reservation #2511
       [2011-05-19T10:56:20] [74991] job_manager exiting abnormally, rc = 4014
      
      where the ALPS BACKEND error happened at the beginning of job_manager(), in rc = _select_cray_plugin_job_ready(job),
      which returned the result from select_g_job_ready() -> do_basil_confirm(). The result was READY_JOB_FATAL,
      since the ALPS error was not a transient error.
      
      Back in slurmstepd, the READY_JOB_FATAL was translated into ESLURMD_SETUP_ENVIRONMENT_ERROR, which then caused
      the node to drain.
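
      The ordering analysed above (reserve, release on cancel, then a late confirm)
      can be reproduced with a toy model. The sketch below is hypothetical throughout:
      struct toy_alps and the toy_* functions are invented for illustration and are
      not the ALPS or BASIL interface. It only shows why a confirm that arrives after
      the release finds no reservation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_RES 8

/* Toy reservation table; a hypothetical stand-in for ALPS state. */
struct toy_alps {
	unsigned ids[TOY_MAX_RES];
	bool	 live[TOY_MAX_RES];
	int	 used;
};

enum toy_rc { TOY_OK, TOY_NO_RESID, TOY_FULL };

/* Returns the slot of a live reservation, or -1 if there is none. */
static int toy_find(const struct toy_alps *a, unsigned id)
{
	assert(a != NULL);
	for (int i = 0; i < a->used; i++)
		if (a->live[i] && a->ids[i] == id)
			return i;
	return -1;
}

/* Step 1: the controller creates the reservation when the job starts. */
static enum toy_rc toy_reserve(struct toy_alps *a, unsigned id)
{
	if (a->used == TOY_MAX_RES)
		return TOY_FULL;
	a->ids[a->used] = id;
	a->live[a->used] = true;
	a->used++;
	return TOY_OK;
}

/* Step 2: a cancel request releases the reservation. */
static enum toy_rc toy_release(struct toy_alps *a, unsigned id)
{
	int i = toy_find(a, id);

	if (i < 0)
		return TOY_NO_RESID;	/* already released */
	a->live[i] = false;
	return TOY_OK;
}

/* Step 3: the launch path confirms; this fails if step 2 ran first. */
static enum toy_rc toy_confirm(const struct toy_alps *a, unsigned id)
{
	return toy_find(a, id) < 0 ? TOY_NO_RESID : TOY_OK;
}
```

      Calling toy_reserve(), toy_release() and then toy_confirm() with the same ID
      yields TOY_NO_RESID from the last call, which corresponds to the
      "cannot find resId" message in the logs.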
      
      Detailed description of fix
      ===========================
      The fix consists of
       * catching the condition "reservation ID not found" in the BasilResponse as 'BE_NO_RESID'
         (which is already used to catch errors from calling RELEASE more than once);
       * interpreting the return of BE_NO_RESID as a non-serious error condition during CONFIRM.
      
      If the "reservation ID not found" error was indeed caused by the race condition, the fix will prevent ALPS
      from introducing further complications (such as draining the node). If there is a separate ALPS problem
      behind it (which is not expected), jobs will continue to run, but without ALPS support (all aprun
      requests would fail). Such a condition (fixing ALPS issues) would need to be handled separately.
      Based upon 03_Cray_BUG-Fix_race-condition-on-job-cancel.diff by Gerrit Renker, CSCS
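
      The error classification described in this message might look roughly like the
      sketch below. It is not the patch itself: every identifier carries a sketch_ or
      SKETCH_ prefix to mark it as hypothetical, the set of error classes is assumed,
      and modelling the non-fatal outcome as "proceed" is an assumption based on the
      statement that jobs continue to run. Only the names BE_NO_RESID and
      READY_JOB_FATAL are taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed error classes; only BE_NO_RESID is named in the message above. */
enum sketch_basil_error {
	SKETCH_BE_NONE,
	SKETCH_BE_TRANSIENT,
	SKETCH_BE_NO_RESID,
	SKETCH_BE_OTHER
};

/* Assumed outcomes of the confirm step. */
enum sketch_ready {
	SKETCH_READY_JOB_PROCEED,
	SKETCH_READY_JOB_RETRY,
	SKETCH_READY_JOB_FATAL
};

/* Maps a backend error seen during CONFIRM to a job-ready outcome. */
static enum sketch_ready sketch_confirm_result(enum sketch_basil_error err)
{
	switch (err) {
	case SKETCH_BE_NONE:
		return SKETCH_READY_JOB_PROCEED;
	case SKETCH_BE_TRANSIENT:
		return SKETCH_READY_JOB_RETRY;
	case SKETCH_BE_NO_RESID:
		/* Reservation already gone (e.g. job cancelled): non-fatal. */
		return SKETCH_READY_JOB_PROCEED;
	default:
		return SKETCH_READY_JOB_FATAL;
	}
}

/* In the analysis above, only the fatal outcome led to a drained node. */
static bool sketch_drains_node(enum sketch_ready r)
{
	return r == SKETCH_READY_JOB_FATAL;
}

/* Small self-check exercising the mapping. */
void sketch_self_check(void)
{
	assert(!sketch_drains_node(sketch_confirm_result(SKETCH_BE_NO_RESID)));
	assert(sketch_drains_node(sketch_confirm_result(SKETCH_BE_OTHER)));
}
```

      The design point is that the missing-reservation case is separated from other
      permanent errors before it can be translated into a node-draining result.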
    • Restore local enum declaration to header · 0d9f3480
      Morris Jette authored
      This reverts commit 0f7b0ba3 (Mon 16 May),
      "select/cray: move local enum declaration back into function" since the
      emulation code depends on it.
      02_Cray_BUG-Fix-basil_geometry-column-names.diff from Gerrit Renker, CSCS
    • Cray documentation updates · 410f5abb
      Morris Jette authored
      01_Cray-documentation-update.diff from Gerrit Renker, CSCS
  6. 28 May, 2011 4 commits
  7. 27 May, 2011 8 commits
  8. 26 May, 2011 5 commits
  9. 25 May, 2011 1 commit