- 17 Oct, 2014 1 commit
-
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously, licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
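The accounting issue in this commit can be illustrated with a minimal sketch. This is not Slurm's implementation: the `Job` class, the state strings, and `recount_licenses` are hypothetical names chosen for illustration, and the license name is made up. The sketch only shows why skipping suspended jobs when rebuilding the in-use totals leaves room for over-subscription once those jobs resume.

```python
from dataclasses import dataclass, field

# Hypothetical job states for this sketch; not Slurm's real state constants.
RUNNING, SUSPENDED, PENDING = "RUNNING", "SUSPENDED", "PENDING"


@dataclass
class Job:
    job_id: int
    state: str
    licenses: dict = field(default_factory=dict)  # license name -> count held


def recount_licenses(jobs, count_suspended=True):
    """Rebuild in-use license totals from the job list, as after a restart."""
    holding = {RUNNING, SUSPENDED} if count_suspended else {RUNNING}
    in_use = {}
    for job in jobs:
        if job.state in holding:
            for name, count in job.licenses.items():
                in_use[name] = in_use.get(name, 0) + count
    return in_use


jobs = [
    Job(1, RUNNING, {"matlab": 2}),
    Job(2, SUSPENDED, {"matlab": 3}),
    Job(3, PENDING, {"matlab": 1}),
]

# Behaviour described as "previous": suspended jobs are ignored.
old_view = recount_licenses(jobs, count_suspended=False)
# Behaviour described as "corrected": suspended jobs still hold licenses.
new_view = recount_licenses(jobs)
print(old_view, new_view)
```

With a total of, say, 4 licenses, the first view would report 2 free and allow another job to start, even though resuming job 2 would then push usage past the limit.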
-
- 16 Oct, 2014 2 commits
-
-
Brian Christiansen authored
-
Morris Jette authored
Treat Cray MPI job calling exit() without mpi_fini() as fatal error for that specific task and let srun handle all timeout logic. Previous logic would cancel the entire job step and srun options for wait time and kill on exit were ignored. The new logic provides users with the following type of response:
    $ srun -n3 -K0 -N3 --wait=60 ./tmp
    Task:0 Cycle:1
    Task:2 Cycle:1
    Task:1 Cycle:1
    Task:0 Cycle:2
    Task:2 Cycle:2
    slurmstepd: step 14927.0 task 1 exited without calling mpi_fini()
    srun: error: tux2: task 1: Killed
    Task:0 Cycle:3
    Task:2 Cycle:3
    Task:0 Cycle:4
    ...
bug 1171
-
- 15 Oct, 2014 4 commits
-
-
Nicolas Joly authored
This reverts commit 4d03d0b4. Make sure the correct Author is attributed here.
-
Danny Auble authored
This reverts commit 1891936e.
-
Danny Auble authored
This has apparently been broken from the get go. This fixes bug 1172. test21.22 should be updated to test the dump and load of a file that is generated.
-
Danny Auble authored
using --ntasks-per-node. This is related to bug 1145. All the CPUs were being allocated on one socket instead of cyclically across sockets. While this is allowed, it is strange and resulted in this bug. There appears to be a different bug that explains why the tasks were laid out in a block fashion in the first place.
-
- 14 Oct, 2014 2 commits
-
-
Danny Auble authored
with no way to get them out. This fixes bug 1134. It is advised that the pro/epilog call xtprocadmin in the script instead of returning a non-zero exit code.
-
Nicolas Joly authored
Signed-off-by: Danny Auble <da@schedmd.com>
-
- 10 Oct, 2014 7 commits
-
-
Danny Auble authored
-
Brian Christiansen authored
Bug #1143
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:
    -bash-4.1$ salloc -N 1 --gres=gpfs:100
    salloc: Granted job allocation 384049
    bash-4.1$ srun -w j3c017 -n 1 hostname
    srun: Job step creation temporarily disabled, retrying
The slurmctld log showed:
    debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
    debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
    debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
    debug3: checkpoint-dir=/home/user checkpoint_int=0
    debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
    debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
    debug: Configuration for job 384049 complete
    _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
    _slurm_rpc_job_step_create for job 384049: Requested nodes are busy
If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
-
Danny Auble authored
(i.e. ArchiveJobs, PurgeJobs). This is only a cosmetic change.
-
Nicolas Joly authored
on slurmdbd startup.
-
Danny Auble authored
-
Danny Auble authored
lots of jobs.
-
- 09 Oct, 2014 2 commits
-
-
Danny Auble authored
did the ALPS reservation. Bug 1115
-
Morris Jette authored
Take more job options into consideration to estimate its node count.
-
- 08 Oct, 2014 3 commits
-
-
Danny Auble authored
-
inodb authored
At work in Sweden we often fika (coffee+buns and what have u) at 3PM. I sometimes accidentally give a start time of 'teatime', so when I return from 'fika' I see my job's just getting started. This fix should make life even easier for the Swedes.
-
Danny Auble authored
-
- 07 Oct, 2014 4 commits
-
-
Danny Auble authored
a reservation.
-
Danny Auble authored
which they have access to (rather than preventing them from seeing ANY reservation). Backport from 14.11 commit 77c2bd25.
-
Danny Auble authored
arbitrary layouts (test1.59).
-
Brian Christiansen authored
option (since it isn't).
-
- 04 Oct, 2014 1 commit
-
-
Morris Jette authored
Do not cause it to be rebooted (powered up).
-
- 03 Oct, 2014 4 commits
-
-
Morris Jette authored
When a node's state is set to power_down, then execute SuspendProgram even if previously executed for that node.
-
Morris Jette authored
Fix logic determining when job configuration (i.e. running node power up logic) is complete. (Will look at better solution for v14.11).
-
Morris Jette authored
When a node's state is set to power_up, then execute ResumeProgram even if previously executed for that node.
-
Danny Auble authored
different times when reservations are using the associations that are being deleted.
-
- 02 Oct, 2014 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 30 Sep, 2014 1 commit
-
-
Morris Jette authored
-
- 29 Sep, 2014 5 commits
-
-
Danny Auble authored
-
Morris Jette authored
Remove logic that was creating GRES bitmap for node when not needed (only needed when GRES mapped to specific files).
-
Morris Jette authored
Correct logic to support job GRES specification over 31 bits (problem in logic converting int to uint32_t).
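The commit message does not show the faulty code, so the following is only a sketch of the general class of problem, assuming a count passes through a signed 32-bit value before being treated as unsigned. The helper names `as_int32` and `as_uint32` are hypothetical, and the wraparound modelled here is the common two's complement behaviour (in C, converting an out-of-range value to a signed type is implementation-defined).

```python
def as_int32(value):
    """Model storing a value in a signed 32-bit int (two's complement wraparound)."""
    return ((value + 2**31) % 2**32) - 2**31


def as_uint32(value):
    """Model storing a value in an unsigned 32-bit int."""
    return value % 2**32


INT32_MAX = 2**31 - 1

# A hypothetical GRES count that fits in uint32_t but needs more than 31 bits.
requested = 3_000_000_000

via_int = as_int32(requested)   # what a signed intermediate would hold
direct = as_uint32(requested)   # what an unsigned variable would hold

print(requested > INT32_MAX, via_int, direct)
```

Any validity check applied to the signed intermediate, such as rejecting non-positive counts, would wrongly refuse this request, while handling the value as unsigned throughout keeps it intact.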
-
Morris Jette authored
-
Danny Auble authored
-
- 26 Sep, 2014 1 commit
-
-
David Bigagli authored
when terminating the job.
-