- 31 Oct, 2014 - 3 commits
-
-
Danny Auble authored
pack it this way so we will not change it in 15.08
-
Danny Auble authored
-
Danny Auble authored
This isn't that big of an issue for 14.03, but 14.11 added more to this string, which could overflow the buffer since sprintf is used instead of snprintf. Using xstrfmtcat fixes the issue and makes the code easier to read.
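A minimal sketch (not Slurm's code; append_fmt() below is a hypothetical stand-in for xstrfmtcat) of why the change matters: appending with sprintf() into a fixed-size buffer can overflow once the formatted string grows, while a helper that reallocates its destination cannot.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical xstrfmtcat-style helper: grow *dst and append formatted text. */
static void append_fmt(char **dst, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int need = vsnprintf(NULL, 0, fmt, ap);    /* length of the new text */
    va_end(ap);
    if (need < 0)
        return;

    size_t old = *dst ? strlen(*dst) : 0;
    char *tmp = realloc(*dst, old + (size_t) need + 1);
    if (!tmp)
        return;
    *dst = tmp;
    va_start(ap, fmt);
    vsnprintf(*dst + old, (size_t) need + 1, fmt, ap);
    va_end(ap);
}

int main(void)
{
    char fixed[32];
    /* Unsafe pattern: with the longer 14.11 string this would write past
     * the end of fixed[], so it is shown here but not executed. */
    /* sprintf(fixed, "nodes=%s cpus=%d", long_nodelist, 4096); */
    (void) fixed;

    /* Safe pattern: the destination grows to fit whatever is appended. */
    char *dyn = NULL;
    append_fmt(&dyn, "nodes=%s", "a-much-longer-nodelist-added-in-14.11");
    append_fmt(&dyn, " cpus=%d", 4096);
    printf("%s\n", dyn);
    free(dyn);
    return 0;
}
```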
-
- 27 Oct, 2014 - 2 commits
-
-
Danny Auble authored
are specified. This is a fix to commit b9cc5b31, which didn't account for mc_ptr->ntasks_per_core being initialized to INFINITE. Without this fix the node_cnt packed would be set to 1 in the user tools. This fixes bug 1148.
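A minimal sketch of the guard involved (the INFINITE constant and field meaning mirror Slurm's, but the code is illustrative, not the actual patch): ntasks_per_core defaults to INFINITE meaning "no limit", so treating that sentinel as a real per-core limit makes the estimated node count collapse to 1.

```c
#include <stdint.h>
#include <stdio.h>

#define INFINITE 0xffffffff    /* Slurm's "unset / no limit" sentinel */

/* Estimate how many nodes a job needs; illustrative only. */
static uint32_t estimate_node_cnt(uint32_t num_tasks, uint32_t ntasks_per_core,
                                  uint32_t cores_per_node)
{
    uint32_t tasks_per_node;

    if (ntasks_per_core == INFINITE || ntasks_per_core == 0)
        tasks_per_node = cores_per_node;    /* no per-core limit was set */
    else
        tasks_per_node = ntasks_per_core * cores_per_node;

    return (num_tasks + tasks_per_node - 1) / tasks_per_node;
}

int main(void)
{
    /* Without the INFINITE check, the huge sentinel value would make every
     * job appear to fit on a single node. */
    printf("%u\n", estimate_node_cnt(64, INFINITE, 16));    /* prints 4 */
    return 0;
}
```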
-
Morris Jette authored
bug 1207
-
- 24 Oct, 2014 - 1 commit
-
-
David Singleton authored
We've seen slurmctld crashes due to negative job array indices.
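A minimal sketch (not the actual slurmctld code) of the kind of validation that prevents this: reject array index tokens that parse as negative or out of range before they reach the controller's job-array structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Return true only for a cleanly parsed, non-negative index within max_index. */
static bool valid_array_index(const char *tok, long max_index)
{
    char *end = NULL;

    errno = 0;
    long v = strtol(tok, &end, 10);
    if (errno || end == tok || *end != '\0')
        return false;                       /* not a clean number */
    return (v >= 0) && (v <= max_index);    /* negative indices rejected here */
}

int main(void)
{
    printf("%d\n", valid_array_index("17", 1000));    /* 1 */
    printf("%d\n", valid_array_index("-3", 1000));    /* 0 */
    return 0;
}
```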
-
- 23 Oct, 2014 - 2 commits
-
-
Morris Jette authored
The previous patch should work in most cases, but this should work more reliably and the comment is clearer. Bug 1196.
-
Morris Jette authored
BGQ: Fix race condition when job fails due to hardware failure and is requeued. Previous code could result in slurmctld abort with NULL pointer. bug 1096
-
- 21 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Fix bug that prevented preservation of a job's GRES bitmap on slurmctld restart or reconfigure (the bug was introduced in 14.03.5, "Clear record of a job's gres when requeued", and only applies when GRES are mapped to specific files). Bug 1192.
-
- 17 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
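A minimal sketch, with assumed structures and names, of the idea behind the fix: when license usage is rebuilt after a reconfigure or restart, suspended jobs have to be counted alongside running ones, otherwise resuming them can push usage past the configured total.

```c
#include <stdio.h>

enum job_state { JOB_PENDING, JOB_RUNNING, JOB_SUSPENDED };

struct job {
    enum job_state state;
    int licenses_used;
};

static int rebuild_license_usage(const struct job *jobs, int njobs)
{
    int used = 0;
    for (int i = 0; i < njobs; i++) {
        /* Count RUNNING *and* SUSPENDED jobs; the old logic skipped
         * suspended ones, so resuming them could exceed the license total. */
        if (jobs[i].state == JOB_RUNNING || jobs[i].state == JOB_SUSPENDED)
            used += jobs[i].licenses_used;
    }
    return used;
}

int main(void)
{
    struct job jobs[] = { { JOB_RUNNING, 2 }, { JOB_SUSPENDED, 3 } };
    printf("licenses in use: %d\n", rebuild_license_usage(jobs, 2));    /* 5 */
    return 0;
}
```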
-
- 15 Oct, 2014 - 3 commits
-
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared when the node is manually modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
- 14 Oct, 2014 - 2 commits
-
-
Danny Auble authored
with no way to get them out. This fixes bug 1134. It is advised that the prolog/epilog call xtprocadmin in the script instead of returning a non-zero exit code.
-
Brian Christiansen authored
The job could have been purged due to a short MinJobAge, and the trigger would then point to an invalid job. Bug #1144.
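A minimal sketch (types and names assumed, not Slurm's) of the hazard: a trigger that resolves its job only once can end up pointing at a record that MinJobAge has already purged; looking the job up by id when the trigger fires, and tolerating "not found", avoids touching freed memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct job_record { uint32_t job_id; };

/* Toy job table standing in for slurmctld's job list. */
static struct job_record jobs[] = { { 101 }, { 103 } };

static struct job_record *find_job(uint32_t job_id)
{
    for (size_t i = 0; i < sizeof(jobs) / sizeof(jobs[0]); i++) {
        if (jobs[i].job_id == job_id)
            return &jobs[i];
    }
    return NULL;    /* job may have been purged already */
}

static void fire_trigger(uint32_t job_id)
{
    /* Resolve the job at fire time instead of caching a pointer. */
    struct job_record *job = find_job(job_id);
    if (!job) {
        printf("trigger for job %u: job already purged, ignoring\n", job_id);
        return;
    }
    printf("trigger fired for job %u\n", job->job_id);
}

int main(void)
{
    fire_trigger(101);    /* still present */
    fire_trigger(102);    /* purged by a short MinJobAge */
    return 0;
}
```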
-
- 11 Oct, 2014 - 2 commits
-
-
Morris Jette authored
If a node is down, then permit setting its state to power down, which causes the SuspendProgram to run and sets the node state back to cloud.
-
Morris Jette authored
If a node is powered down, then do not power it up on slurmctld restart.
-
- 10 Oct, 2014 - 1 commit
-
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:

    -bash-4.1$ salloc -N 1 --gres=gpfs:100
    salloc: Granted job allocation 384049
    bash-4.1$ srun -w j3c017 -n 1 hostname
    srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

    debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
    debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
    debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
    debug3: checkpoint-dir=/home/user checkpoint_int=0
    debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
    debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
    debug: Configuration for job 384049 complete
    _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
    _slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
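A minimal sketch, with assumed structure and field names, of the check this commit extends to the non-exclusive path: a step is only held back for "memory still in use" when memory is actually one of the allocation's tracked resources; otherwise the memory fields carry no meaning and the step may proceed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node_alloc {
    bool     mem_is_tracked;    /* is memory enforced as a resource here? */
    uint64_t mem_total_mb;      /* memory allocated to the job on this node */
    uint64_t mem_used_mb;       /* memory claimed by other steps of the job */
};

static bool step_can_use_node(const struct node_alloc *n, uint64_t mem_req_mb)
{
    if (!n->mem_is_tracked)     /* select/linear + gres case: don't block */
        return true;
    return n->mem_used_mb + mem_req_mb <= n->mem_total_mb;
}

int main(void)
{
    struct node_alloc n = { .mem_is_tracked = false,
                            .mem_total_mb = 62720, .mem_used_mb = 62720 };
    /* Without the check, the stale mem_used_mb value would block the step
     * forever; with it, the step is schedulable. */
    printf("schedulable: %d\n", step_can_use_node(&n, 1024));    /* 1 */
    return 0;
}
```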
-
- 09 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Take more job options into consideration when estimating its node count.
-
- 07 Oct, 2014 - 2 commits
-
-
Danny Auble authored
a reservation.
-
Danny Auble authored
which they have access to (rather than preventing them from seeing ANY reservation). Backport from 14.11 commit 77c2bd25.
-
- 04 Oct, 2014 - 2 commits
-
-
Morris Jette authored
Do not cause it to be rebooted (powered up).
-
Morris Jette authored
This permits a sys admin to power down a node that should already be powered down, but avoids setting the NO_RESPOND bit in the node state. Doing so under some conditions prevented the node from being scheduled. The downside is that the node could possibly be allocated when it really isn't ready for use.
-
- 03 Oct, 2014 - 5 commits
-
-
Morris Jette authored
When a node's state is set to power_down, then execute SuspendProgram even if previously executed for that node.
-
Danny Auble authored
which protects against race conditions with the reservations.
-
Morris Jette authored
Fix logic determining when job configuration (i.e. running node power up logic) is complete. (Will look at better solution for v14.11).
-
Morris Jette authored
When a node's state is set to power_up, then execute ResumeProgram even if previously executed for that node.
-
Danny Auble authored
different times when reservations are using the associations that are being deleted.
-
- 30 Sep, 2014 - 2 commits
-
-
Morris Jette authored
Prior logic would always try to reserve nodes. This also slightly modifies the reservation create logic for non-bluegene systems.
-
Morris Jette authored
-
- 29 Sep, 2014 - 1 commit
-
-
Danny Auble authored
-
- 22 Sep, 2014 - 2 commits
-
-
David Bigagli authored
modified.
-
Dr. Oliver Fortmeier authored
-
- 19 Sep, 2014 - 1 commit
-
-
Danny Auble authored
to avoid overlapping erroneously. Previously you could get overlapping reservations if you asked for a core-based reservation and then a whole-node reservation. This fixes that.
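A minimal sketch, assuming simplified reservation records, of the overlap rule involved: a whole-node reservation conflicts with any other reservation on the same node, while two core-based reservations only conflict if their core sets on that node intersect.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct resv {
    bool     whole_node;    /* node-based reservation */
    uint64_t core_mask;     /* cores reserved on this node (if core-based) */
};

/* Do two reservations conflict on the same node? */
static bool resv_conflict(const struct resv *a, const struct resv *b)
{
    if (a->whole_node || b->whole_node)
        return true;                            /* whole node blocks everything */
    return (a->core_mask & b->core_mask) != 0;  /* core-level intersection */
}

int main(void)
{
    struct resv cores = { .whole_node = false, .core_mask = 0x0f };
    struct resv node  = { .whole_node = true,  .core_mask = 0 };

    /* Previously the whole-node request could be granted on top of the
     * core-based one; with this rule it is detected as an overlap. */
    printf("conflict: %d\n", resv_conflict(&cores, &node));    /* 1 */
    return 0;
}
```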
-
- 17 Sep, 2014 - 1 commit
-
-
Morris Jette authored
Test 3.11 was failing in some configurations without this, as the CPU count in the RPC was lower than the number of nodes in the required node list.
-
- 16 Sep, 2014 - 3 commits
-
-
David Bigagli authored
and abort the job.
-
Danny Auble authored
MaxNode limit.
-
Danny Auble authored
only needs to be called once.
-
- 11 Sep, 2014 - 1 commit
-
-
Danny Auble authored
warning.
-
- 09 Sep, 2014 - 1 commit
-
-
Morris Jette authored
Eliminate race condition in enforcement of the MaxJobCount limit for job arrays. The job count limit was checked for a job array before taking the slurmctld job locks. If new jobs were submitted between the test and the job array creation such that creating the array would exceed MaxJobCount, then a fatal error would result. Bug 1091.
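A minimal sketch (locking and names simplified and assumed, not slurmctld's actual code) of the race and its fix: the MaxJobCount check must happen, or at least be repeated, under the same lock that protects job creation, otherwise jobs submitted in between can push the array creation past the limit.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t job_write_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned job_count;
static unsigned max_job_count = 10000;

static bool submit_job_array(unsigned array_size)
{
    bool ok = false;

    pthread_mutex_lock(&job_write_lock);
    /* Check under the same lock that protects job creation, not before:
     * a check done earlier could be stale by the time records are made. */
    if (job_count + array_size <= max_job_count) {
        job_count += array_size;    /* create the array's job records */
        ok = true;
    }
    pthread_mutex_unlock(&job_write_lock);
    return ok;
}

int main(void)
{
    printf("%s\n", submit_job_array(100) ? "accepted" : "rejected");
    return 0;
}
```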
-