- 08 Jun, 2011 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
When using an emulated setting, the user is likely to specify both apbasil and apkill, hence the AlpsDir variable could be refactored, which is what the patch does. It falls back to the Cray-like defaults (/usr/bin) if nothing is set. 05_COSMETICS_Cray-conf-AlpsDir-simplification.diff from Gerrit Renker, CSCS
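The fallback described above can be pictured with a small Python sketch. It is only an illustration, not the plugin's code: the function name `resolve_alps_tool` and the `configured_path` parameter are hypothetical, and the example paths under /opt are made up. Only the tool names apbasil and apkill and the /usr/bin default come from the commit message.

```python
import posixpath

# Default directory named in the commit message ("Cray-like defaults").
DEFAULT_ALPS_BIN_DIR = "/usr/bin"


def resolve_alps_tool(tool_name, configured_path=None):
    """Return the configured path for a tool, or the /usr/bin default.

    `configured_path` is a hypothetical stand-in for whatever value the
    configuration supplies; None or "" means "nothing is set".
    """
    if configured_path:
        return configured_path
    return posixpath.join(DEFAULT_ALPS_BIN_DIR, tool_name)


# Emulated setup: both tools point at user-supplied (made-up) locations.
emulated = {
    "apbasil": resolve_alps_tool("apbasil", "/opt/emul/bin/apbasil"),
    "apkill": resolve_alps_tool("apkill", "/opt/emul/bin/apkill"),
}

# Real Cray setup: nothing configured, so the defaults apply.
cray = {name: resolve_alps_tool(name) for name in ("apbasil", "apkill")}
```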
-
Morris Jette authored
-
Morris Jette authored
In long listings of jobs or records, adjacent entries vary very little, so the output is repetitive and changes are hard to see quickly. A context-sensitive time format also makes it easier to spot outliers, as in the example below from the end of April, where a user had set a BeginTime about 1 month in advance:

  JOBID  USER      ACCOUNT  NAME            REASON      NODES  TIMELIMIT   START_TIME
  54936  teyssier  s201     cut_largemodes  Priority    16     1-00:00:00  18:37:30
  33544  jgomez    s89      Rh_12x12-4L     Priority    8      1-00:00:00  18:39:02
  40798  jgomez    s89      Rh_8x8x8_D3     Priority    20     23:59:00    18:40:29
  31994  xyy       s241     new-3g5u-lipid  Priority    22     1-00:00:00  18:40:29
  32072  gstarek   s241     alpha_vs_dp     Priority    22     1-00:00:00  18:40:29
  32078  gstarek   s241     control_vs      Priority    22     1-00:00:00  18:40:29
  31699  jgomez    s89      CHP_120         BeginTime   20     1-00:00:00  29 May 16:43
  22121  guerard   s263     an18QM370       Dependency  8      2:00:00     N/A

The patch introduces a context-sensitive time format which
 * is disabled by default,
 * is enabled via the SLURM_TIME_FORMAT environment variable,
 * prints 24-hour clock based time relative to "now",
 * takes up at most 12 characters (i.e. 1/2 character per hour).
The following compares the formats for a range of settings:

1) SLURM_TIME_FORMAT=standard
  now                  2011-06-06T12:57:22
  yesterday 2pm        2011-06-05T14:00:00
  19 jan 1904 3:15     1904-01-19T03:15:00
  -2 days 4:15pm       2011-06-04T16:15:00
  tomorrow             2011-06-07T12:57:22
  +2 days 2:17am       2011-06-08T02:17:00
  +3 days 2:18pm       2011-06-09T14:18:00
  -6 weeks             2011-04-25T12:57:22
  +3 weeks + 10 days   2011-07-07T12:57:22
  next year            2012-06-06T12:57:22

2) SLURM_TIME_FORMAT=relative
  now                  12:57:22
  yesterday 2pm        Ystday 14:00
  19 jan 1904 3:15     19 Jan 1904
  -2 days 4:15pm       4 Jun 16:15
  tomorrow             Tomorr 12:57
  +2 days 2:17am       Wed 02:17
  +3 days 2:18pm       Thu 14:18
  -6 weeks             25 Apr 12:57
  +3 weeks + 10 days   7 Jul 12:57
  next year            6 Jun 2012

3) SLURM_TIME_FORMAT="%a %T"
  now                  Mon 12:57:22
  yesterday 2pm        Sun 14:00:00
  19 jan 1904 3:15     Tue 03:15:00
  -2 days 4:15pm       Sat 16:15:00
  tomorrow             Tue 12:57:22
  +2 days 2:17am       Wed 02:17:00
  +3 days 2:18pm       Thu 14:18:00

src/common/parse_time.c
04_COSMETICS_context-dependent-timestring.diff
Patch by Gerrit Renker, CSCS
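The three settings compared above can be reproduced with a short Python sketch. It is not the code from the patch: the function name `format_time` is hypothetical, and the day-distance thresholds (same day, +/-1 day, up to 6 days ahead, more than 365 days away) are assumptions inferred from the example output rather than taken from the source.

```python
from datetime import datetime


def format_time(when, now, time_format="standard"):
    """Format `when` according to a SLURM_TIME_FORMAT-like setting.

    "standard" gives the ISO-style timestamp, "relative" gives the
    context-sensitive form, anything else is used as a strftime format.
    """
    if time_format == "standard":
        return when.strftime("%Y-%m-%dT%H:%M:%S")
    if time_format != "relative":
        return when.strftime(time_format)

    # Distance in calendar days between the two dates (assumed rule).
    delta_days = (when.date() - now.date()).days
    if delta_days == 0:                      # same day
        return when.strftime("%H:%M:%S")
    if delta_days == -1:                     # yesterday
        return when.strftime("Ystday %H:%M")
    if delta_days == 1:                      # tomorrow
        return when.strftime("Tomorr %H:%M")
    if abs(delta_days) > 365:                # far away: show the year
        return f"{when.day} {when.strftime('%b %Y')}"
    if delta_days < -1 or delta_days > 6:    # medium distance: show the date
        return f"{when.day} {when.strftime('%b %H:%M')}"
    return when.strftime("%a %H:%M")         # within the coming week


now = datetime(2011, 6, 6, 12, 57, 22)
samples = [
    datetime(2011, 6, 5, 14, 0),      # yesterday 2pm
    datetime(2011, 6, 8, 2, 17),      # +2 days 2:17am
    datetime(2011, 4, 25, 12, 57, 22) # -6 weeks
]
relative_strings = [format_time(t, now, "relative") for t in samples]
```

A caller would typically pick the setting with something like `os.environ.get("SLURM_TIME_FORMAT", "standard")`, which matches the "disabled by default" behaviour listed in the commit message.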
-
Morris Jette authored
03_COSMETICS_typos.diff from Gerrit Renker, CSCS
-
Morris Jette authored
Now possible thanks to Danny's -rpath addition to the slurm build process. 02_Cray-simpler-munge-build-process.diff from Gerrit Renker, CSCS
-
Morris Jette authored
This disables setting the kill command on Cray platforms, to ensure that either SIGHUP (interactive shells) or SIGTERM (all others) is always sent. Patch 01_Cray-salloc-no-kill-command.diff from Gerrit Renker, CSCS.
-
- 06 Jun, 2011 1 commit
-
-
Danny Auble authored
-
- 04 Jun, 2011 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 03 Jun, 2011 1 commit
-
-
Morris Jette authored
Add a configure option, --enable-salloc-kill-cmd, which causes the salloc command to signal its child processes when the job ends. Job signalling is the default on Cray systems; on other systems jobs are not signalled by default. SIGHUP is used for interactive jobs and SIGTERM for all other jobs.
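The signalling rule in this commit message can be written down as a tiny Python sketch. Both function names and their boolean parameters are hypothetical; how salloc actually decides whether a job is interactive is not stated in the commit message, so the caller is assumed to supply that flag.

```python
import signal


def should_signal_children(is_cray, kill_cmd_enabled=False):
    """Signal children at job end on Cray, or when the option is enabled.

    `kill_cmd_enabled` stands in for a build configured with
    --enable-salloc-kill-cmd (hypothetical parameter).
    """
    return is_cray or kill_cmd_enabled


def job_end_signal(interactive):
    """SIGHUP for interactive jobs, SIGTERM for all other jobs."""
    return signal.SIGHUP if interactive else signal.SIGTERM
```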
-
- 02 Jun, 2011 8 commits
-
-
Moe Jette authored
Change the reason that a node is marked DOWN and the log message from node "silent reboot" to "unexpected reboot"
-
Moe Jette authored
-
Moe Jette authored
Patch from Don Albert, Bull
-
Moe Jette authored
-
Moe Jette authored
With default configuration on non-Cray systems, enable salloc to be spawned as a background process. Based upon work by Don Albert (Bull) and Gerrit Renker (CSCS).
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 01 Jun, 2011 6 commits
-
-
Moe Jette authored
-
Moe Jette authored
Add support to salloc for a new environment variable SALLOC_KILL_CMD, which is equivalent to the -K/--kill-command option.
-
Moe Jette authored
-
Moe Jette authored
This fixes a bug reported by Don Albert. Whenever salloc exits with a child process in stopped state (suspended or stopped on terminal input/output), a zombie process is generated, since this case is not caught by the code evaluating the child status. This patch adds the missing case. It uses SIGKILL, which (apart from SIGCONT) is the only signal that acts on a stopped process. It was decided not to try to re-awaken the process using SIGCONT, since (a) this happens during session clean-up and (b) if the condition is due to SIGTTIN, the process immediately becomes stopped again. Patch from Gerrit Renker, CSCS.
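The missing case can be illustrated with a minimal Python sketch for a POSIX system. It is not salloc's code: `reap_child` and `spawn_stopped_child` are hypothetical names, and the use of `waitpid` with `WUNTRACED` is an assumption about how a stopped child would be detected.

```python
import os
import signal
import subprocess
import sys


def reap_child(pid):
    """Wait for a child; if it is stopped, kill it and collect its status."""
    # WUNTRACED makes waitpid() also report children that have stopped.
    _, status = os.waitpid(pid, os.WUNTRACED)
    if os.WIFSTOPPED(status):
        # A stopped process only reacts to SIGKILL or SIGCONT; SIGCONT is
        # avoided for the reasons given in the commit message.
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)
    return status


def spawn_stopped_child():
    """Demo helper: start a sleeping child and stop it with SIGSTOP."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    os.kill(child.pid, signal.SIGSTOP)
    return child
```

Calling `reap_child(spawn_stopped_child().pid)` returns a wait status for which `os.WIFSIGNALED` is true, so no stopped or unreaped child is left behind.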
-
Danny Auble authored
-
Moe Jette authored
Treat the specification of multiple cluster names as a fatal error.
-
- 31 May, 2011 4 commits
- 30 May, 2011 1 commit
-
-
Morris Jette authored
-
- 29 May, 2011 6 commits
-
-
Morris Jette authored
Fix a couple of problems in ALPS emulation mode caused by recent changes in the select/cray plugin: node coordinates and signal return code
-
Morris Jette authored
select/cray: whitespace fixes and removal of unused code. Patch 10_Cray_COSMETICS-whitespace.diff from Gerrit Renker, CSCS
-
Morris Jette authored
On the slurmd, the function build_all_frontend_info() is called before logging is fully initialized. This causes the frontend debug messages (which also get redundantly printed in the slurmctld log file) to be sent to stderr. On our system (where all slurmds get started remotely, via pdsh) the particular implementation caused the startup to hang. The patch uses a solution similar to build_all_node_line_info(), where a boolean flag is used to avoid repeating the slurmctld message in slurmd context. Patch 08_Multiple-Frontend_suppress_initial_debug_message.diff from Gerrit Renker, CSCS
-
Morris Jette authored
select/cray: fix race condition when canceling job during batch launch

This fixes a race condition which occurs when a job is cancelled during batch launch. It is a bug since the condition causes the frontend node to be set in state DRAINING. The fix is in catching this particular condition and isolating it as a non-fatal error. This ensures continued robustness of operation, by not draining the entire frontend node.

Short logfile dump of condition:
================================
  [2011-05-19T17:20:41] ALPS RESERVATION #2878, JobId 76343: BASIL -n 60 -N 0 -d 1 -m 1333
  [2011-05-19T17:20:41] backfill: Started JobId=76343 on nid0[1037,1549,1805,2061,2317]
  [2011-05-19T17:20:43] sched: Cancel of JobId=76343 by UID=21329, usec=389791
  [2011-05-19T17:20:45] error: slurmd error 4014 running JobId=76343 on front_end=rosa2: Slurmd could not set up environment for batch job
  [2011-05-19T17:20:45] update_front_end: set state of rosa2 to DRAINING

apsched0519:
  17:20:41: File new reservation resId 2878 pagg 0
  17:20:41: Confirmed apid 125156 resId 2878 pagg 0 nids: 1037,1549,1805,2061,2317
  17:20:43: ...cancel_msg:249: cancel reservation resId 2878
  17:20:43: Canceled apid 125156 resId 2878 pagg 0
  17:20:45: type bind uid 0 gid 0 apid 0 pagg 13516639560892680485 resId 2878 numCmds 0
  17:20:45: placeApp message:0x1 cannot find resId 2878

frontend node: rosa2.log
  [2011-05-19T17:20:41] Launching batch job 76343 for UID 21329
  [2011-05-19T17:20:45] Job 76343 killed while launch was in progress
  [2011-05-19T17:20:45] [76343] *** JOB 76343 CANCELLED AT 2011-05-19T17:20:45 ***
  [2011-05-19T17:20:45] [76343] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2878
  [2011-05-19T17:20:45] [76343] confirming ALPS resId 2878 of JobId 76343 FAILED: ALPS backend error
  [2011-05-19T17:20:45] [76343] could not confirm ALPS reservation #2878
  [2011-05-19T17:20:45] [76343] job_manager exiting abnormally, rc = 4014

Detailed analysis:
==================
The slurmctld first created a reservation in select_nodes() -> select_g_job_begin() -> do_basil_reserve():
  [2011-05-19T10:56:19] ALPS RESERVATION #2511, JobId 74991: BASIL -n 12 -N 0 -d 1 -m 1333
  [2011-05-19T10:56:19] backfill: Started JobId=74991 on nid01347
  10:56:19: File new reservation resId 2511 pagg 0
  10:56:19: Confirmed apid 123762 resId 2511 pagg 0 nids: 1347

The next call after select_nodes() in backfill.c:_start_job() was launch_job(), which on the slurmd node rosa12 produced the following message in _rpc_batch_job() upon receipt of REQUEST_BATCH_JOB_LAUNCH:
  [2011-05-19T10:56:19] Launching batch job 74991 for UID 21487

This caused the launch_mutex to be taken and then the subsequent rc = _forkexec_slurmstepd(). While this was in operation, the user decided to scancel his job, apparently with the default SIGTERM:
  [2011-05-19T10:56:20] sched: Cancel of JobId=74991 by UID=21487, usec=358632
  [2011-05-19T10:56:20] sched: Cancel of JobId=74994 by UID=21487, usec=783954

This was in _slurm_rpc_job_step_kill() upon receiving REQUEST_CANCEL_JOB_STEP from scancel. While the slurmstepd was preparing the job steps, it signalled cancellation
  [2011-05-19T10:56:20] [74991] *** JOB 74991 CANCELLED AT 2011-05-19T10:56:20 ***
via _rpc_signal_tasks() of the slurmd.
Most likely this was from slurmctld:job_signal() -> _signal_batch_job(), which means that the reservation had already been cancelled via select_g_job_signal() -> do_basil_release():
  10:56:20: ...cancel_msg:249: cancel reservation resId 2511
  10:56:20: type cancel uid 0 gid 0 apid 0 pagg 0 resId 2511 numCmds 0
  10:56:20: Canceled apid 123762 resId 2511 pagg 0

Meanwhile the slurmstepd continued to run by starting job_manager():
  [2011-05-19T10:56:20] [74991] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2511
  [2011-05-19T10:56:20] [74991] confirming ALPS resId 2511 of JobId 74991 FAILED: ALPS backend error
  [2011-05-19T10:56:20] [74991] could not confirm ALPS reservation #2511
  [2011-05-19T10:56:20] [74991] job_manager exiting abnormally, rc = 4014
where the ALPS BACKEND error happened at the beginning of job_manager(), in rc = _select_cray_plugin_job_ready(job), which returned the result from select_g_job_ready() -> do_basil_confirm(). The return result was READY_JOB_FATAL, since the ALPS error was not a transient error. Back in slurmstepd, the READY_JOB_FATAL was translated into ESLURMD_SETUP_ENVIRONMENT_ERROR, which then caused the node to drain.

Detailed description of fix
===========================
The fix is by
 * catching the condition "reservation ID not found" in the BasilResponse as 'BE_NO_RESID' (which is already used to catch errors calling RELEASE more than 1 time);
 * interpreting the return of BE_NO_RESID as a non-serious error condition during CONFIRM.

If the "reservation ID not found" was indeed caused by the race condition, the fix will prevent ALPS from introducing further complications (such as draining the node). If there is a separate ALPS problem behind it (which is not expected), jobs will continue to run, but without ALPS support (all aprun requests would fail). Such a condition (fixing ALPS issues) would need to be handled separately.

Based upon 03_Cray_BUG-Fix_race-condition-on-job-cancel.diff by Gerrit Renker, CSCS
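The effect of the fix on the CONFIRM step can be sketched in Python as a mapping from ALPS error class to job-ready result. This is a model of the behaviour described above, not the plugin's code. BE_NO_RESID and READY_JOB_FATAL are named in the commit message; the other enum members, the function `confirm_result`, and the `fix_applied` switch are hypothetical names introduced for illustration.

```python
from enum import Enum


class BasilError(Enum):
    NONE = "no error"
    NO_RESID = "BE_NO_RESID"       # "reservation ID not found"
    TRANSIENT = "transient error"  # hypothetical class: worth retrying
    OTHER = "other permanent error"


class ReadyJob(Enum):
    OK = "READY_JOB_OK"            # hypothetical name: job may proceed
    RETRY = "READY_JOB_ERROR"      # hypothetical name: try again later
    FATAL = "READY_JOB_FATAL"      # leads to the frontend node draining


def confirm_result(err, fix_applied=True):
    """Map the ALPS error seen during CONFIRM to a job-ready result."""
    if err is BasilError.NONE:
        return ReadyJob.OK
    if err is BasilError.TRANSIENT:
        return ReadyJob.RETRY
    if err is BasilError.NO_RESID and fix_applied:
        # With the fix, a vanished reservation (e.g. the job was cancelled
        # during launch) is a non-serious condition, so the node is spared.
        return ReadyJob.OK
    return ReadyJob.FATAL
```

With `fix_applied=False` the sketch reproduces the old behaviour, in which the cancelled reservation was treated like any other permanent ALPS error.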
-
Morris Jette authored
This reverts commit 0f7b0ba3 (Mon 16 May), "select/cray: move local enum declaration back into function" since the emulation code depends on it. 02_Cray_BUG-Fix-basil_geometry-column-names.diff from Gerrit Renker, CSCS
-
Morris Jette authored
01_Cray-documentation-update.diff from Gerrit Renker, CSCS
-
- 28 May, 2011 2 commits
-
-
Danny Auble authored
-
Moe Jette authored
Improve accuracy of REQUEST_JOB_WILL_RUN start time with respect to higher priority pending jobs.
-