- 14 Oct, 2014 16 commits
-
-
Morris Jette authored
This adds checks for NULL pointers in the gres data structures to avoid invalid memory references, including string comparisons against NULL pointers.
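
The kind of guard described above can be sketched as follows. This is only an illustration, not the code from the commit: `safe_strcmp`, `struct gres_rec`, and `gres_name_matches` are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not Slurm's actual code): compare two strings,
 * treating NULL as a valid value that sorts before any non-NULL string,
 * so strcmp() is never called with a NULL argument. */
static int safe_strcmp(const char *a, const char *b)
{
	if (a == NULL && b == NULL)
		return 0;
	if (a == NULL)
		return -1;
	if (b == NULL)
		return 1;
	return strcmp(a, b);
}

/* Hypothetical record standing in for one gres entry. */
struct gres_rec {
	const char *name;	/* may be NULL if never set */
};

/* Return 1 if the record exists and its name equals `name`, else 0.
 * Both the record pointer and its name are checked before use. */
static int gres_name_matches(const struct gres_rec *rec, const char *name)
{
	if (rec == NULL)
		return 0;
	return safe_strcmp(rec->name, name) == 0;
}
```

With a helper like this, a record whose name was never filled in simply fails to match instead of dereferencing a NULL pointer.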
-
Morris Jette authored
-
Morris Jette authored
Attempted to free a variable that was not a pointer. Bug 1166
-
Danny Auble authored
It was needed at one point because, in versions <=2.5, srun was only a wrapper for aprun that needed help. This is no longer the case, and hasn't been since we made srun do all the heavy lifting, so we can remove it.
-
Danny Auble authored
-
Brian Christiansen authored
The job could have been purged because of a short MinJobAge, leaving the trigger pointing to an invalid job. Bug #1144
-
Danny Auble authored
-
Morris Jette authored
Conflicts: src/slurmd/slurmd/slurmd.c
-
Morris Jette authored
Note that PlugStackConfig defaults to plugstack.conf in the same directory as slurm.conf. The added logic tests whether the file actually exists (using stat) and, if it is not found, does not fork/exec slurmstepd to invoke the spank prolog/epilog. This saves about 14 msec on startup and 14 msec on shutdown if no spank plugins are configured. It also eliminates some possible failures (e.g. if fork() fails, or the slurmstepd process cannot exec()). This logic also caches the PlugStackConfig value and reads it again on reconfigure, but avoids reading the value for each job. Bug 982
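
A minimal sketch of the check-and-cache idea described above, assuming a simplified design in which the path is refreshed at startup and on reconfigure and only a cached flag is consulted per job. The names `plugstack_cache_update` and `plugstack_needs_stepd` are hypothetical and do not come from the Slurm source.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Cached copy of the configured path and the result of the last stat(). */
static char *cached_path = NULL;
static bool cached_exists = false;

/* Hypothetical: call once at startup and again on every reconfigure.
 * Copies the path and records whether the file currently exists. */
static void plugstack_cache_update(const char *path)
{
	struct stat st;

	free(cached_path);
	cached_path = NULL;
	cached_exists = false;

	if (path == NULL)
		return;
	cached_path = malloc(strlen(path) + 1);
	if (cached_path == NULL)
		return;		/* out of memory: behave as if no file */
	strcpy(cached_path, path);
	cached_exists = (stat(cached_path, &st) == 0);
}

/* Hypothetical: per-job check. Uses only the cached flag, so no
 * configuration read or stat() call happens on the job path. */
static bool plugstack_needs_stepd(void)
{
	return cached_exists;
}
```

A caller would skip the fork/exec of the helper process whenever `plugstack_needs_stepd()` returns false. One consequence of caching is that a file created after startup is only noticed at the next reconfigure.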
-
Morris Jette authored
Add a "void" argument to a function and rename a local function to have a prefix of "_".
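
For readers unfamiliar with the two conventions mentioned above, here is a small illustration. The function names are hypothetical and unrelated to the functions changed in the commit.

```c
/* Before C23, `int f()` declares a function with unspecified parameters,
 * so calls with stray arguments are not diagnosed. Writing `(void)` states
 * that the function takes no arguments and lets the compiler check calls. */

/* File-local helper: `static` limits its visibility to this file, and the
 * leading "_" follows the naming convention the commit message refers to.
 * (Strictly speaking, ISO C reserves file-scope identifiers that begin
 * with an underscore, so this is a project convention, not a C idiom.) */
static int _next_id(void)
{
	static int id = 0;	/* persists across calls */
	return ++id;
}

/* Public entry point with an explicit empty parameter list. */
int alloc_id(void)
{
	return _next_id();
}
```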
-
Danny Auble authored
Issue is only in rc1. Fix regression from commit bfd4697b for bug
-
Danny Auble authored
-
Nicolas Joly authored
-
Danny Auble authored
9b00f12c
-
Nicolas Joly authored
Signed-off-by: Danny Auble <da@schedmd.com>
-
Morris Jette authored
-
- 13 Oct, 2014 4 commits
-
-
Nicolas Joly authored
-
jette authored
-
jette authored
- 11 Oct, 2014 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
If a node is down, then permit setting its state to power down, which causes the SuspendProgram to run and set the node state back to cloud.
-
Morris Jette authored
If a node is powered down, then do not power it up on slurmctld restart.
-
Morris Jette authored
The power up/down request only takes effect after the ResumeTimeout or SuspendTimeout is reached in order to avoid a race condition.
-
Danny Auble authored
Conflicts: NEWS
-
- 10 Oct, 2014 15 commits
-
-
David Bigagli authored
-
Danny Auble authored
-
Morris Jette authored
Major update to license management tests. "Manager" was changed to "ServerType", several tests were made more complete, and the reporting of failures was improved.
-
Morris Jette authored
For the sacctmgr command, the keyword "Manager" was changed to "ServerType" in some, but not all places. This changes the previously unchanged places.
-
Morris Jette authored
The test keeps failing due to a POE bug
-
Morris Jette authored
This fixes the advanced reservation test with a configuration that sets a node's CPU count to be equal to the core count rather than its thread count.
-
Morris Jette authored
-
Morris Jette authored
The original formatting had a bunch of lists rather than paragraphs, the numbers did not add up in the use case, and some wording was changed for clarity.
-
Brian Christiansen authored
Bug #1143
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:

    -bash-4.1$ salloc -N 1 --gres=gpfs:100
    salloc: Granted job allocation 384049
    bash-4.1$ srun -w j3c017 -n 1 hostname
    srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

    debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
    debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
    debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
    debug3: checkpoint-dir=/home/user checkpoint_int=0
    debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
    debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
    debug: Configuration for job 384049 complete
    _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
    _slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
-
Danny Auble authored
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:

    -bash-4.1$ salloc -N 1 --gres=gpfs:100
    salloc: Granted job allocation 384049
    bash-4.1$ srun -w j3c017 -n 1 hostname
    srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

    debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
    debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
    debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
    debug3: checkpoint-dir=/home/user checkpoint_int=0
    debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
    debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
    debug: Configuration for job 384049 complete
    _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
    _slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
-
Morris Jette authored
-
Morris Jette authored
-
Brian Christiansen authored
-