Commits · b698359ceb84b1742dc3fe8adabd27dc17db43a5 · Manuel G. Marciani / ces_slurm_simulator

04 Dec, 2012 3 commits
- BGQ - remove unneeded break · b698359c
  Danny Auble authored Dec 04, 2012
  
  b698359c
- BGQ - remove unused variables · ccdce0e9
  Danny Auble authored Dec 04, 2012
  
  ccdce0e9
- Fix for job allocation API with requested nodes and contiguous not set explicitly · 27ccc04d
  jette authored Dec 04, 2012
  
  27ccc04d
03 Dec, 2012 1 commit
- Clear job node state reason of WAIT_FRONT_END after front-end up · 71f00db9
  Morris Jette authored Dec 03, 2012
  
  71f00db9
30 Nov, 2012 4 commits
- Fix inconsistency for hostlists that have more than 1 range. · bc7b91cd
  Danny Auble authored Nov 30, 2012
  
  bc7b91cd
- BGQ - Handle shared blocks that need to be removed and have jobs running · ef81bf67
  Danny Auble authored Nov 29, 2012
```
on them.  This should only happen in extreme conditions.
```
  ef81bf67
- BGQ - add option to tell bg_requeue_job the slurmctld is locked · ee27ba89
  Danny Auble authored Nov 29, 2012
  
  ee27ba89
- BGQ - Sanity Check - Make sure block is resumed before destroying it. · 3ca9531c
  Danny Auble authored Nov 29, 2012
  
  3ca9531c
29 Nov, 2012 4 commits
- Accounting - Fix for if asking for users or accounts that were deleted · 34747e50
  Danny Auble authored Nov 29, 2012
```
with associations get the deleted associations as well.
```
  34747e50
- Fix issue in accounting if a user puts a '\' in their job name. · acc8c531
  Danny Auble authored Nov 28, 2012
  
  acc8c531
- If an salloc is waiting for an allocation to happen and is canceled by the · 74f1b3cf
  Danny Auble authored Nov 28, 2012
```
user mark the state canceled instead of completed.
```
  74f1b3cf
- Accounting - If a job start message fails to the SlurmDBD reset the db_inx · c03c9b46
  Danny Auble authored Nov 28, 2012
```
so it gets sent again.  This isn't a major problem since the start will
happen when the job ends, but this does make things cleaner.
```
  c03c9b46
28 Nov, 2012 2 commits
- BGP - Fix for HTC mode · 27e7b048
  Danny Auble authored Nov 27, 2012
  
  27e7b048
- Accounting - Fixed issue where if nodenames have changed on a system and · b87be8bb
  Danny Auble authored Nov 27, 2012
```
you query against that with -N and -E you will get all jobs during that
time instead of only the ones running on -N.

Signed-off-by: Danny Auble <da@schedmd.com>
```
  b87be8bb
27 Nov, 2012 6 commits
- Correction to error code on crypto_verify_sign call · 1bef9310
  Morris Jette authored Nov 27, 2012
  
  1bef9310
- BLUEGENE - clearer debug message · ba74dcd4
  Danny Auble authored Nov 27, 2012
  
  ba74dcd4
- BGQ - handle pending actions on a block better when trying to deallocate it. · e4431036
  Danny Auble authored Nov 27, 2012
  
  e4431036
- BLUEGENE - With Dynamic layout mode - Fix issue where if a larger block · 0dad50ff
  Danny Auble authored Nov 27, 2012
```
was already in error and isn't deallocating and underlying hardware goes
bad one could get overlapping blocks in error making the code assert when
a new job request comes in.
```
  0dad50ff
- Increase sbcast credential cache from 64 to 256 entries · 60143498
  Morris Jette authored Nov 27, 2012
  
  60143498
- BGQ - Add 64 tasks per node as a valid option for srun when used with · 4f085be3
  Danny Auble authored Nov 26, 2012
```
overcommit.
```
  4f085be3
21 Nov, 2012 5 commits
- Remove some currently unused variables · 3e8b10b3
  Morris Jette authored Nov 21, 2012
  
  3e8b10b3
- Very minor format change to conform with Linux kernel coding style · c4eb8b1a
  Morris Jette authored Nov 21, 2012
  
  c4eb8b1a

- slurmstepd: correct a bug in the IO thread termination monitoring · f297242e
  Matthieu Hautreux authored Nov 13, 2012

A dedicated thread (_kill_thr) is launched by slurmstepd at the end of a
step in order to destroy the IO thread if it does not manage to correctly
terminate by itself after 300 seconds.

Two bugs are corrected in this logic by this patch.

First, the sleep(300) call is not protected against interruption: if
slurmstepd receives a signal, the delay can shrink to a few seconds,
forcing the IO thread to terminate before the grace time expires. The
logic is modified to wrap the sleep() in a loop so that the full delay is
respected.

Second, to terminate the IO thread, a SIGKILL is delivered to the IO thread
using pthread_kill. However, sending SIGKILL using pthread_kill is a
process-wide operation (see man pthread_kill), so all the slurmstepd
threads are killed and slurmstepd is terminated. This logic is modified
to use pthread_cancel() instead of pthread_kill(), which gives the
pthread_join() in _wait_for_io() a chance to act as expected.

Without this patch, when _kill_thr is interrupted, slurmstepd is
terminated, leaving the step in an incomplete state, as the node may not
have been able to send the REQUEST_STEP_COMPLETE to the controller.
Thus, consecutive steps can no longer be executed and stay permanently in
the "Job step creation temporarily disabled, retrying" state.

f297242e
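
The two fixes described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual slurmstepd code: `sleep_full`, `watchdog`, `struct watchdog_args`, `stuck_io`, and `demo_cancel_stuck_thread` are hypothetical names introduced here, and the real `_kill_thr` logic may differ in detail.

```c
#include <pthread.h>
#include <unistd.h>

/* sleep() returns the number of unslept seconds when a signal interrupts
 * it, so loop until the whole grace period has elapsed. */
static void sleep_full(unsigned int seconds)
{
	while (seconds > 0)
		seconds = sleep(seconds);
}

/* Hypothetical argument block for the watchdog thread. */
struct watchdog_args {
	pthread_t target;	/* thread to cancel after the grace period */
	unsigned int grace;	/* grace period in seconds */
};

/* Wait out the grace period, then cancel only the target thread.
 * pthread_cancel() does not take the whole process down. */
static void *watchdog(void *arg)
{
	struct watchdog_args *w = arg;

	sleep_full(w->grace);
	pthread_cancel(w->target);
	return NULL;
}

/* Stand-in for an IO thread that never finishes on its own. */
static void *stuck_io(void *arg)
{
	(void) arg;
	for (;;)
		sleep_full(1);	/* sleep() is a cancellation point */
	return NULL;
}

/* Returns 1 if the stuck thread was cancelled and joined, -1 on error. */
static int demo_cancel_stuck_thread(unsigned int grace)
{
	pthread_t io, killer;
	struct watchdog_args w;
	void *ret = NULL;

	if (pthread_create(&io, NULL, stuck_io, NULL))
		return -1;
	w.target = io;
	w.grace = grace;
	if (pthread_create(&killer, NULL, watchdog, &w))
		return -1;
	pthread_join(io, &ret);		/* returns once the cancel is acted on */
	pthread_join(killer, NULL);
	return ret == PTHREAD_CANCELED;
}
```

Calling `demo_cancel_stuck_thread(1)` (built with `-pthread`) should return after about one second with the process still alive, which is the property the patch restores: the joining thread gets to run its normal completion path.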

- Correct a bug with -w in step management resulting in inadequate memory errors returned to srun · ac86cc37
  Matthieu Hautreux authored Nov 12, 2012

When requesting a particular nodelist for a step, if at least one of the nodes is
still used by a former step (no REQUEST_STEP_COMPLETE received from that node),
the current behavior is to return ESLURM_INVALID_TASK_MEMORY, and srun aborts
with "Memory required by task is not available".

This can be reproduced by launching consecutive steps with the -w parameter set
to $SLURM_NODELIST and introducing delays in the spank epilog on the execution
nodes.

The behavior is changed to only defer the execution of the step by returning
ESLURM_NODES_BUSY when it is detected that some nodes are blocked because of
already used memory.

ac86cc37

- Correct a bug in consecutive steps management due to asynchronous step completions · 4c97337d
  Matthieu Hautreux authored Nov 12, 2012

When using consecutive steps, it appears that in some cases, the time required
by the slurmstepd on the execution nodes to inform the controller of the completion
of the step is higher than the time required to request the following step.
In that scenario, the controller can reject the step by returning the error code
ESLURM_REQUESTED_NODE_CONFIG_UNAVAILABLE even if the step could be executed if
all the former steps were correctly finished.

This can be reproduced by launching consecutive steps and introducing delays in
the spank epilog on the execution nodes.

The behavior is changed to only defer the execution of the step by returning
ESLURM_NODES_BUSY when all the available nodes are not idle considering the
former steps.

4c97337d

20 Nov, 2012 2 commits
- Accounting - Fix issue where QOS usage was being zeroed out on a · 8b0b5ae7
  Danny Auble authored Nov 20, 2012
```
slurmctld restart.
```
  8b0b5ae7
- Reset node MAINT state flag when a reservation's nodes or flags change · cc97d84b
  Morris Jette authored Nov 19, 2012
  
  cc97d84b
19 Nov, 2012 3 commits
- BGQ - Fix job step timeout to actually happen when done from within an · 0500e007
  Danny Auble authored Nov 19, 2012
```
allocation.
```
  0500e007
- Modify use of OOM (out of memory protection) for Linux 2.6.36 kernel or later · 8ae5e73e
  Morris Jette authored Nov 19, 2012
```
NOTE: If you were setting the environment variable SLURMSTEPD_OOM_ADJ=-17,
it should be set to -1000 for Linux 2.6.36 kernel or later.
```
  8ae5e73e
- NEWS for e40883f1 · f42117c4
  Danny Auble authored Nov 19, 2012
  
  f42117c4
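
The NOTE in commit 8ae5e73e (use -1000 instead of -17 on Linux 2.6.36 or later) reflects the move from the legacy `/proc/<pid>/oom_adj` scale (-17 to 15) to the `/proc/<pid>/oom_score_adj` scale (-1000 to 1000). The sketch below shows one way to rescale a legacy value and pick the file to write. It is an illustration under assumptions, not the code from the commit: the function and macro names are hypothetical, and the linear rescaling is assumed to match what the kernel applies to legacy writes.

```c
#include <unistd.h>

#define LEGACY_OOM_DISABLE	(-17)	/* oom_adj value that disables OOM kills */
#define LEGACY_OOM_MAX		15	/* largest legacy oom_adj value */
#define OOM_SCORE_ADJ_MIN	(-1000)
#define OOM_SCORE_ADJ_MAX	1000

/* Rescale a legacy oom_adj value onto the oom_score_adj range.
 * Assumption: linear scaling, with the end points pinned. */
static int legacy_to_score_adj(int oom_adj)
{
	if (oom_adj <= LEGACY_OOM_DISABLE)
		return OOM_SCORE_ADJ_MIN;
	if (oom_adj >= LEGACY_OOM_MAX)
		return OOM_SCORE_ADJ_MAX;
	return (oom_adj * OOM_SCORE_ADJ_MAX) / -LEGACY_OOM_DISABLE;
}

/* Prefer the newer file when the running kernel provides it. */
static const char *oom_adj_path(void)
{
	if (access("/proc/self/oom_score_adj", F_OK) == 0)
		return "/proc/self/oom_score_adj";
	return "/proc/self/oom_adj";
}
```

With this mapping, a site that previously exported SLURMSTEPD_OOM_ADJ=-17 gets -1000, which matches the value recommended in the NOTE.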
09 Nov, 2012 1 commit
- BGQ - when srun -Q is given make runjob be quiet · e40883f1
  Danny Auble authored Nov 08, 2012
  
  e40883f1
07 Nov, 2012 4 commits
- BGQ - better fix for ntasks_per_node verification · e7d8ce15
  Danny Auble authored Nov 06, 2012
  
  e7d8ce15
- remove debug · 295394b2
  Danny Auble authored Nov 06, 2012
  
  295394b2
- BGQ - validate correct ntasks_per_node · 7eb1a451
  Danny Auble authored Nov 06, 2012
  
  7eb1a451
- BGQ - Fix issue when running srun outside of an allocation and only · 9e25da94
  Danny Auble authored Nov 06, 2012
```
specifying the number of tasks and not the number of nodes.
```
  9e25da94
05 Nov, 2012 2 commits
- Cray - Improve signal handling for spawned tasks on job cancel · 3ff9f17e
  Morris Jette authored Nov 05, 2012
```
On job kill request, send SIGCONT, SIGTERM, wait KillWait and send
SIGKILL. Previously just sent SIGKILL to tasks.
```
  3ff9f17e
- Add common function to return KillWait configuration parameter · 91be41da
  Morris Jette authored Nov 05, 2012
  
  91be41da
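
The signal sequence described in commit 3ff9f17e (SIGCONT, SIGTERM, wait KillWait, SIGKILL) can be sketched as below. This is a generic POSIX illustration, not the Cray-specific Slurm code: `terminate_with_grace` and `demo_terminate_child` are hypothetical names, and `kill_wait` stands in for the KillWait configuration parameter.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Escalate from a polite request to a forced kill.
 * SIGCONT goes first so that a stopped task can act on SIGTERM. */
static int terminate_with_grace(pid_t pid, unsigned int kill_wait)
{
	unsigned int left = kill_wait;

	if (kill(pid, SIGCONT) < 0 || kill(pid, SIGTERM) < 0)
		return -1;
	while (left > 0)		/* resume the wait if a signal interrupts it */
		left = sleep(left);
	if (kill(pid, SIGKILL) < 0 && errno != ESRCH)
		return -1;		/* ESRCH only means the task is already gone */
	return 0;
}

/* Fork a child that just waits for signals, terminate it, and return
 * the signal that ended it (or -1 on error). */
static int demo_terminate_child(void)
{
	int status = 0;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		for (;;)
			pause();
	}
	if (terminate_with_grace(child, 1) < 0)
		return -1;
	if (waitpid(child, &status, 0) != child)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A task that does not catch SIGTERM exits on it during the grace period, so the final SIGKILL only matters for tasks that block or ignore SIGTERM.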
02 Nov, 2012 3 commits
- Remove duplicate NEWS item · c3fde3ce
  Morris Jette authored Nov 02, 2012
  
  c3fde3ce
- Update NEWS for start of v2.4.5 work · 832ca7df
  Morris Jette authored Nov 02, 2012
  
  832ca7df
- Update META for v2.4.4 tag · b8d6a058
  Morris Jette authored Nov 02, 2012
  
  b8d6a058