Commits · 3eddb3b4efbb0ff3c40c8bbf52d979ae16dd8ac6 · Manuel G. Marciani / ces_slurm_simulator

02 Jun, 2011 5 commits
- Merge branch 'slurm-2.2' · 3eddb3b4
  Moe Jette authored Jun 02, 2011
  
  3eddb3b4
- Enable background salloc command · b7a4a70d
  Moe Jette authored Jun 02, 2011
```
With default configuration on non-Cray systems, enable salloc to be
spawned as a background process. Based upon work by Don Albert (Bull) and
Gerrit Renker (CSCS).
```
  b7a4a70d
- BLUEGENE - Added a bit of validation to the runjob plugin to verify step launch. · 506d00fd
  Danny Auble authored Jun 01, 2011
  
  506d00fd
- BLUEGENE - set extra information for a step so we can verify it with the runjob plugin. · fe1a6763
  Danny Auble authored Jun 01, 2011
  
  fe1a6763
- BLUEGENE - Fixed issue where the ionodes weren't set up correctly on a Q system · dd234777
  Danny Auble authored Jun 01, 2011
  
  dd234777
01 Jun, 2011 6 commits

Cast a variable, eliminate compiler warning · 262daa37
Moe Jette authored Jun 01, 2011

262daa37

salloc: add SALLOC_KILL_CMD env var support · 2cf9c230

Moe Jette authored Jun 01, 2011

Add support to salloc for a new environment variable SALLOC_KILL_CMD,
which is equivalent to the -K/--kill-command option.

2cf9c230
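
The commit above makes an environment variable equivalent to a command-line option. The following is a minimal Python sketch of that pattern, not the salloc implementation. The function name `resolve_kill_command`, the parser setup, and the rule that an explicit command-line option overrides the environment variable are assumptions made for illustration.

```python
import argparse


def resolve_kill_command(argv, environ):
    """Return the kill-command setting, or None if it was not requested.

    Assumption: an explicit -K/--kill-command on the command line takes
    precedence over the SALLOC_KILL_CMD environment variable.
    """
    parser = argparse.ArgumentParser(prog="salloc-sketch", add_help=False)
    # nargs="?" lets the option appear with or without a value; the empty
    # string marks "requested without a value" (a choice of this sketch).
    parser.add_argument("-K", "--kill-command", nargs="?", const="", default=None)
    args, _unknown = parser.parse_known_args(argv)
    if args.kill_command is not None:
        return args.kill_command
    # Fall back to the environment variable named in the commit message.
    return environ.get("SALLOC_KILL_CMD")
```

Passing the argument list and the environment as parameters keeps the lookup order easy to exercise without touching the real process environment.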

Merge branch 'slurm-2.2' · 769b96de
Moe Jette authored Jun 01, 2011

769b96de

salloc: clean up stopped child processes · 43e7394c

Moe Jette authored Jun 01, 2011

This fixes a bug reported by Don Albert.

The problem is that whenever salloc exits with a child process in stopped state
(suspended or stopped on terminal input/output), a zombie process is generated,
since this case is not caught by the code evaluating the child status.

This patch adds the missing case.  It uses SIGKILL, which is the only signal
that changes the state of a stopped process. It was decided not to try and
re-awaken the process using SIGCONT, since (a) this happens during session
clean-up and (b) if the condition is due to SIGTTIN, the process immediately
becomes stopped again.
Patch from Gerrit Renker, CSCS.

43e7394c
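
The case described in the commit above (a child that is stopped rather than exited when the parent cleans up) can be illustrated with a small POSIX-only Python sketch. This is not the salloc code. `reap_child` and `demo` are hypothetical names, and the child stops itself with SIGSTOP as a stand-in for being stopped on terminal input/output.

```python
import os
import signal


def reap_child(pid):
    """Wait for pid; if it is stopped rather than exited, kill and reap it."""
    # WUNTRACED makes waitpid() also report children that have stopped.
    _, status = os.waitpid(pid, os.WUNTRACED)
    if os.WIFSTOPPED(status):
        # This is the case the commit message says was previously missed.
        # SIGKILL terminates the process even though it is stopped.
        os.kill(pid, signal.SIGKILL)
        _, status = os.waitpid(pid, 0)
    return status


def demo():
    """Fork a child that stops itself, then clean it up."""
    pid = os.fork()
    if pid == 0:
        # Child: stop immediately; it never resumes in this sketch.
        os.kill(os.getpid(), signal.SIGSTOP)
        os._exit(0)
    status = reap_child(pid)
    return os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
```

The second `waitpid()` call collects the exit status of the killed child, so no unreaped process is left behind.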

Merge github.com:chaos/slurm · c74e1bfb
Danny Auble authored Jun 01, 2011

c74e1bfb
Note that sprio can only support one cluster. · fffdbca8
Moe Jette authored May 31, 2011
```
Treat the specification of multiple cluster names as a fatal error.
```
fffdbca8

31 May, 2011 4 commits
- Note that scontrol supports one cluster · da890f96
  Moe Jette authored May 31, 2011
```
Note that scontrol can only support a single cluster at one time.
```
  da890f96
- Add scancel support for --clusters option · 1240c1f6
  Moe Jette authored May 31, 2011
  
  1240c1f6
- Add -R/--reservation option to squeue command · 9b559a5c
  Moe Jette authored May 31, 2011
```
Add -R/--reservation option to squeue command as a job filter.
```
  9b559a5c
- Add select plugin design doc · 7bddad6f
  Moe Jette authored May 31, 2011
  
  7bddad6f
30 May, 2011 1 commit
- sview: Log no blocks on non-Bluegene systems · 5a2b7c70
  Morris Jette authored May 29, 2011
  
  5a2b7c70
29 May, 2011 6 commits

select/cray alps emulation coordinate fix · c65510a9

Morris Jette authored May 29, 2011

Fix a couple of problems in alps emulation mode caused by recent changes
in the select/cray plugin: node coordinates and signal return code.

c65510a9

select/cray: whitespace fixes and removal of unused code · 5761c40e

Morris Jette authored May 29, 2011

select/cray: whitespace fixes and removal of unused code
Patch 10_Cray_COSMETICS-whitespace.diff from Gerrit Renker, CSCS

5761c40e

slurmd: suppress frontend debug messages · 70d22622

Morris Jette authored May 29, 2011

On the slurmd, the function build_all_frontend_info() is called before logging
is fully initialized. This causes the frontend debug messages (which also get
redundantly printed in the slurmctld log file) to be sent to stderr.

On our system (where all slurmds get started remotely, via pdsh) the particular
implementation caused the startup to hang.

The patch uses a solution similar to build_all_node_line_info(), where a
boolean flag is used to avoid repeating the slurmctld message in slurmd
context.
Patch 08_Multiple-Frontend_suppress_initial_debug_message.diff from Gerrit Renker, CSCS

70d22622
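
The boolean-flag approach mentioned in the commit above can be sketched in Python as follows. This is an illustration only: `build_frontend_info`, its parameters, and the record layout are hypothetical and do not reproduce the signature of `build_all_frontend_info()`.

```python
import logging

log = logging.getLogger("frontend_sketch")


def build_frontend_info(names, log_details, debug=log.debug):
    """Build one record per frontend node.

    log_details is the boolean flag: in this sketch the controller would
    pass True and the node daemon False, so the same messages are not
    emitted a second time before logging is set up.
    """
    records = []
    for name in names:
        records.append({"name": name})
        if log_details:
            debug("frontend node %s configured" % name)
    return records
```

Both callers get identical records; only the logging side effect differs, which is what lets the node daemon skip it safely.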

select/cray: fix race condition · ea3c31fe

Morris Jette authored May 29, 2011

select/cray: fix race condition when canceling job during batch launch

This fixes a race condition which occurs when a job is cancelled during batch launch.
It is a bug since the condition causes the frontend node to be set in state DRAINING.

The fix is in catching this particular condition and isolating it as a non-fatal
error. This ensures continued robustness of operation, by not draining the entire
frontend node.

Short logfile dump of condition:
================================
 [2011-05-19T17:20:41] ALPS RESERVATION #2878, JobId 76343: BASIL -n 60 -N 0 -d 1 -m 1333
 [2011-05-19T17:20:41] backfill: Started JobId=76343 on nid0[1037,1549,1805,2061,2317]
 [2011-05-19T17:20:43] sched: Cancel of JobId=76343 by UID=21329, usec=389791
 [2011-05-19T17:20:45] error: slurmd error 4014 running JobId=76343 on front_end=rosa2: Slurmd could not set up environment for batch job
 [2011-05-19T17:20:45] update_front_end: set state of rosa2 to DRAINING

 apsched0519:
 17:20:41: File new reservation resId 2878 pagg 0
 17:20:41: Confirmed apid 125156 resId 2878 pagg 0 nids: 1037,1549,1805,2061,2317
 17:20:43: ...cancel_msg:249: cancel reservation resId 2878
 17:20:43: Canceled apid 125156 resId 2878 pagg 0
 17:20:45: type bind uid 0 gid 0 apid 0 pagg 13516639560892680485 resId 2878 numCmds 0
 17:20:45: placeApp message:0x1 cannot find resId 2878

 frontend node: rosa2.log
 [2011-05-19T17:20:41] Launching batch job 76343 for UID 21329
 [2011-05-19T17:20:45] Job 76343 killed while launch was in progress
 [2011-05-19T17:20:45] [76343] *** JOB 76343 CANCELLED AT 2011-05-19T17:20:45 ***
 [2011-05-19T17:20:45] [76343] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2878
 [2011-05-19T17:20:45] [76343] confirming ALPS resId 2878 of JobId 76343 FAILED: ALPS backend error
 [2011-05-19T17:20:45] [76343] could not confirm ALPS reservation #2878
 [2011-05-19T17:20:45] [76343] job_manager exiting abnormally, rc = 4014

Detailed analysis:
==================
The slurmctld first created a reservation in select_nodes() -> select_g_job_begin() -> do_basil_reserve():
 [2011-05-19T10:56:19] ALPS RESERVATION #2511, JobId 74991: BASIL -n 12 -N 0 -d 1 -m 1333
 [2011-05-19T10:56:19] backfill: Started JobId=74991 on nid01347

 10:56:19: File new reservation resId 2511 pagg 0
 10:56:19: Confirmed apid 123762 resId 2511 pagg 0 nids: 1347

The next call after select_nodes() in backfill.c:_start_job() was launch_job(), which on the
slurmd node rosa12 produced the following message in _rpc_batch_job() upon receipt
of REQUEST_BATCH_JOB_LAUNCH:

 [2011-05-19T10:56:19] Launching batch job 74991 for UID 21487

This caused the launch_mutex to be taken and then the subsequent rc = _forkexec_slurmstepd().
While this was in operation, the user decided to scancel his job, apparently with the default SIGTERM:

 [2011-05-19T10:56:20] sched: Cancel of JobId=74991 by UID=21487, usec=358632
 [2011-05-19T10:56:20] sched: Cancel of JobId=74994 by UID=21487, usec=783954

This was in _slurm_rpc_job_step_kill() upon receiving REQUEST_CANCEL_JOB_STEP from scancel.
While the slurmstepd was preparing the job steps, it signalled cancellation

 [2011-05-19T10:56:20] [74991] *** JOB 74991 CANCELLED AT 2011-05-19T10:56:20 ***

via _rpc_signal_tasks() of the slurmd. Most likely this was from slurmctld:job_signal() -> _signal_batch_job(),
which means that the reservation had already been cancelled via select_g_job_signal() -> do_basil_release():

 10:56:20: ...cancel_msg:249: cancel reservation resId 2511
 10:56:20: type cancel uid 0 gid 0 apid 0 pagg 0 resId 2511 numCmds 0
 10:56:20: Canceled apid 123762 resId 2511 pagg 0

Meanwhile the slurmstepd continued to run by starting job_manager():
 [2011-05-19T10:56:20] [74991] PERMANENT ALPS BACKEND error: ALPS error: cannot find resId 2511
 [2011-05-19T10:56:20] [74991] confirming ALPS resId 2511 of JobId 74991 FAILED: ALPS backend error
 [2011-05-19T10:56:20] [74991] could not confirm ALPS reservation #2511
 [2011-05-19T10:56:20] [74991] job_manager exiting abnormally, rc = 4014

where the ALPS BACKEND error happened at the beginning of job_manager(), in rc = _select_cray_plugin_job_ready(job),
which returned the result from select_g_job_ready() -> do_basil_confirm(). The return result was READY_JOB_FATAL,
since the ALPS error was not a transient error.

Back in slurmstepd, the READY_JOB_FATAL was translated into ESLURMD_SETUP_ENVIRONMENT_ERROR, which then caused
the node to drain.

Detailed description of fix
===========================
The fix is by
 * catching the condition "reservation ID not found" in the BasilResponse as 'BE_NO_RESID'
   (which is already used to catch errors calling RELEASE more than 1 time);
 * interpreting the return of BE_NO_RESID as non-serious error condition during CONFIRM.

If the "reservation ID not found" was indeed caused by the race condition, the fix will prevent ALPS
from introducing further complications (such as draining the node). If there is a separate ALPS problem
behind it (which is not expected), jobs will continue to run, but without ALPS support (all aprun
requests would fail). Such a condition (fixing ALPS issues) would need to be handled separately.
Based upon 03_Cray_BUG-Fix_race-condition-on-job-cancel.diff by Gerrit Renker, CSCS

ea3c31fe
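
The classification change described in the commit above can be sketched in Python. This is a simplified illustration, not the select/cray code: the enum members are hypothetical stand-ins (`NO_RESID` for `BE_NO_RESID`, `FATAL` for `READY_JOB_FATAL`), and mapping the "reservation ID not found" case to a ready state is an assumption based on the statement that jobs continue to run.

```python
from enum import Enum


class BasilError(Enum):
    NONE = 0       # confirmation succeeded
    TRANSIENT = 1  # hypothetical: temporary backend error, worth retrying
    NO_RESID = 2   # stands in for BE_NO_RESID ("reservation ID not found")
    OTHER = 3      # hypothetical: any other permanent backend error


class Ready(Enum):
    READY = "ready"
    RETRY = "retry"
    FATAL = "fatal"  # stands in for READY_JOB_FATAL


def confirm_result(err):
    """Map a backend error seen during CONFIRM to a job-ready state."""
    if err is BasilError.NONE:
        return Ready.READY
    if err is BasilError.TRANSIENT:
        return Ready.RETRY
    if err is BasilError.NO_RESID:
        # The reservation is already gone, most likely because the job was
        # cancelled during launch. Treat it as non-fatal so the frontend
        # node is not drained.
        return Ready.READY
    return Ready.FATAL
```

Only the `NO_RESID` branch differs from a plain "permanent errors are fatal" rule, which mirrors how narrowly the commit message scopes the fix.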

Restore local enum declaration to header · 0d9f3480

Morris Jette authored May 29, 2011

This reverts commit 0f7b0ba3 (Mon 16 May),
"select/cray: move local enum declaration back into function" since the
emulation code depends on it.
02_Cray_BUG-Fix-basil_geometry-column-names.diff from Gerrit Renker, CSCS

0d9f3480

Cray documentation updates · 410f5abb
Morris Jette authored May 29, 2011
```
01_Cray-documentation-update.diff from Gerrit Renker, CSCS
```
410f5abb

28 May, 2011 4 commits
- Fixed warnings about unincluded files. · dddd6ed1
  Danny Auble authored May 28, 2011
  
  dddd6ed1
- Improve accuracy of REQUEST_JOB_WILL_RUN RPC · 75728cb7
  Moe Jette authored May 27, 2011
```
Improve accuracy of REQUEST_JOB_WILL_RUN start time with respect to higher
priority pending jobs.
```
  75728cb7
- Expand explanation of change in NEWS · b57fd5d5
  Moe Jette authored May 27, 2011
```
Expand explanation of multiple DEFAULT values in slurm.conf
```
  b57fd5d5
- Propagate DebugFlag changes by scontrol · 02ff489a
  Moe Jette authored May 27, 2011
```
Propagate DebugFlags changes by scontrol to the various plugins and
other modules. DebugFlags is cached in some places and the changes
cause the cache value to be reset as needed.
```
  02ff489a
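
The DebugFlags propagation described in the last commit above amounts to resetting cached copies when the source value changes. A minimal Python sketch of that idea follows, assuming a simple callback registry. `Config`, `Plugin`, and all method names are hypothetical and are not SLURM interfaces.

```python
class Config:
    """Holds the authoritative debug flags and notifies cached copies."""

    def __init__(self, debug_flags=0):
        self.debug_flags = debug_flags
        self._listeners = []

    def register(self, callback):
        # Give the new listener the current value right away.
        self._listeners.append(callback)
        callback(self.debug_flags)

    def set_debug_flags(self, flags):
        # Update the source value, then reset every cached copy.
        self.debug_flags = flags
        for callback in self._listeners:
            callback(flags)


class Plugin:
    """Caches the flags locally so hot paths avoid a config lookup."""

    def __init__(self, config):
        self.cached_flags = 0
        config.register(self._refresh)

    def _refresh(self, flags):
        self.cached_flags = flags
```

For example, after `cfg = Config(1)` and `plugin = Plugin(cfg)`, calling `cfg.set_debug_flags(4)` leaves `plugin.cached_flags` equal to 4.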
27 May, 2011 8 commits
- clarify config parameter use FirstJobId/MaxJobId · ab8f1bd0
  Moe Jette authored May 27, 2011
  
  ab8f1bd0
- Add placeholder pointer to PHP interface · 171dda91
  Moe Jette authored May 27, 2011
```
Add links to SLURM PHP interface from Trinity College High Performance Computing Center
from SLURM's "Downloads" web page.
```
  171dda91
- Merge branch 'slurm-2.2' · 85fe3a50
  Moe Jette authored May 27, 2011
```
Conflicts:
	META
	NEWS
```
  85fe3a50
- Start NEWS for v2.2.7 · 75dea0ab
  Moe Jette authored May 27, 2011
  
  75dea0ab
- Update META for v2.2.6 tag · b7386ace
  Moe Jette authored May 27, 2011
  
  b7386ace
- no error if sbin/scch not found · 563a22ad
  Moe Jette authored May 27, 2011
```
If checkpoint/blcr is configured, only log that scch is not found
using debug() rather than info(). Add documentation about the file.
```
  563a22ad
- Disable SQUEUE_FORMAT in some cases · 6df797cf
  Moe Jette authored May 27, 2011
```
Disable use of SQUEUE_FORMAT env var if squeue -l, -o, or -s option is
used. Patch from Aaron Knister (UMBC).
```
  6df797cf
- Fix the same default association/wckey problem, but in the slurmdbd · 53e31250
  Danny Auble authored May 27, 2011
```
so you don't have to update your slurmctlds to have the problem go away.
This fix makes it so you only have to update your slurmdbd.
```
  53e31250
26 May, 2011 5 commits
- Updated the Normalized Usage section in priority_multifactor.shtml · 95471b26
  Don Lipari authored May 26, 2011
  
  95471b26
- Merge remote-tracking branch 'origin/slurm-2.2' · 07ee4c6e
  Danny Auble authored May 26, 2011
  
  07ee4c6e
- Fixed issue in accounting where it was possible for a new · 3668fecd
  Danny Auble authored May 26, 2011
```
association/wckey to be set incorrectly as a default when the new object
was added after an original default object already existed.  Before
the slurmctld would need to be restarted to fix the issue.
```
  3668fecd
- Merge remote-tracking branch 'origin/slurm-2.2' · 1bd6c97e
  Danny Auble authored May 25, 2011
  
  1bd6c97e
- fixed typo · 21169b71
  Danny Auble authored May 25, 2011
  
  21169b71
25 May, 2011 1 commit
- Further refinement to the sprio man page · 358ccafa
  Don Lipari authored May 25, 2011
  
  358ccafa