- 21 Mar, 2016 5 commits
-
-
Danny Auble authored
# Conflicts: # src/plugins/burst_buffer/cray/burst_buffer_cray.c
-
Danny Auble authored
gang scheduling before doing code for gang scheduling.
-
Morris Jette authored
burst_buffer/cray: Set environment variables just before starting job rather than at job submission time to reflect persistent buffers created or modified while the job is pending. bug 2545
-
Danny Auble authored
buffer is found. Bug 2576. What happened was a function was doing a double read lock, which isn't awesome to begin with but not really horrible (if all you are doing is read locks anyway). The problem was that after the first lock was taken, a different thread was going for a write lock, so when the second read lock came in it created a deadlock.
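The failure mode described in this commit message can be illustrated with a toy, single-threaded model of a writer-preferring read/write lock. This is only a sketch: `toy_rwlock_t` and its functions are hypothetical names, not Slurm's locking API, and the model assumes the lock stops admitting new readers once a writer is waiting.

```c
#include <stdbool.h>

/* Toy model of a writer-preferring read/write lock (hypothetical, not Slurm code). */
typedef struct {
    int  readers;        /* read locks currently held */
    bool writer_waiting; /* a writer is queued behind the current readers */
} toy_rwlock_t;

/* Returns true if a read lock would be granted now, false if the caller would block. */
static bool toy_try_read_lock(toy_rwlock_t *l)
{
    if (l->writer_waiting)
        return false;    /* new readers queue behind the waiting writer */
    l->readers++;
    return true;
}

/* Returns true if the write lock would be granted now; otherwise records the waiting writer. */
static bool toy_try_write_lock(toy_rwlock_t *l)
{
    if (l->readers > 0) {
        l->writer_waiting = true;
        return false;
    }
    return true;
}
```

In this model, a thread that holds one read lock and asks for a second after another thread has requested the write lock is refused. The writer is waiting for the first read lock to be released, and the reader is waiting behind the writer, so neither can make progress.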
-
Tim Wickberg authored
Coverity 77851.
-
- 18 Mar, 2016 4 commits
-
-
Morris Jette authored
Jobs below the specified threshold will not have resources reserved for them. bug 2565
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
Avoid possibly aborting srun that gets simultaneous SIGSTOP+SIGCONT while creating the job step. The result of those signals is that the signal handler gets an argument (the signal received) of zero. Here's a log.
Window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step
Window 2:
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
....
bug 2494
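One way to guard against the zero signal number seen in the log is sketched here. The helper name `toy_signal_should_cancel` is hypothetical and this is not the actual srun change. It assumes that a value of zero means no real signal was delivered and that SIGCONT should never cancel a pending step.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical helper: should this signal number cancel a pending job step? */
static bool toy_signal_should_cancel(int signo)
{
    switch (signo) {
    case 0:        /* assumption: zero means no real signal was delivered */
    case SIGCONT:  /* process resumed after SIGSTOP; keep retrying */
        return false;
    default:
        return true;
    }
}
```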
-
- 17 Mar, 2016 5 commits
-
-
Morris Jette authored
Copy logic from select/cons_res to select/serial that is equivalent to commit ec50cb2f
-
Morris Jette authored
Change how a node's allocated CPU count is calculated to avoid double counting CPUs allocated to multiple jobs at the same time. Previous logic would sum the maximum number of CPUs allocated by each partition for any time slice, which could double count CPUs allocated to multiple jobs. New logic ORs bitmap of allocated CPUs for every partition and time slice, then counts the total for a given node. This avoids double counting CPUs allocated to multiple jobs, but does not remove from the count CPUs which have been allocated to jobs which might be suspended by the gang scheduler (either for time slicing or preemption).
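The OR-then-count approach described in this commit message can be sketched as follows. This is an illustration only: the function names are hypothetical, and it assumes a node has at most 64 CPUs so that each partition and time slice row fits in a single 64-bit word, which is not how Slurm's own bitmaps are laid out.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 64-bit word without compiler builtins. */
static int toy_popcount64(uint64_t w)
{
    int n = 0;
    while (w) {
        w &= w - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * OR together the allocated-CPU bitmaps for one node (one word per
 * partition and time slice row), then count the set bits. A CPU that
 * appears in several rows is counted once.
 */
static int toy_node_alloc_cpus(const uint64_t *row_bitmaps, size_t nrows)
{
    uint64_t merged = 0;
    for (size_t i = 0; i < nrows; i++)
        merged |= row_bitmaps[i];
    return toy_popcount64(merged);
}
```

With rows `0x3` (CPUs 0 and 1) and `0x6` (CPUs 1 and 2), the merged bitmap is `0x7`, so CPU 1 is counted only once.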
-
Tim Wickberg authored
-
Tim Wickberg authored
Update NEWS as well.
-
Tim Wickberg authored
The uid is used as part of the hash function, so the old reference must be removed and the hash recalculated whenever the uid may change. Otherwise _delete_assoc_hash will not find the entry again when the association is removed, causing slurmctld to segfault. Bug 2560.
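The remove, update, re-insert pattern this commit message describes can be sketched with a small chained hash table. All names here (`toy_assoc_t`, `toy_insert`, `toy_remove`, `toy_set_uid`) and the hash function itself are hypothetical; this is not the assoc_mgr code.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 8

typedef struct toy_assoc {
    uint32_t id;
    uint32_t uid;             /* part of the hash key */
    struct toy_assoc *next;   /* chain within one bucket */
} toy_assoc_t;

static toy_assoc_t *toy_hash[TOY_HASH_SIZE];

static size_t toy_bucket(const toy_assoc_t *a)
{
    return (a->id + a->uid) % TOY_HASH_SIZE;
}

static void toy_insert(toy_assoc_t *a)
{
    size_t b = toy_bucket(a);
    a->next = toy_hash[b];
    toy_hash[b] = a;
}

/* Returns 1 if the entry was found in its expected bucket and unlinked, else 0. */
static int toy_remove(toy_assoc_t *a)
{
    toy_assoc_t **pp = &toy_hash[toy_bucket(a)];
    for (; *pp; pp = &(*pp)->next) {
        if (*pp == a) {
            *pp = a->next;
            a->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Unlink under the old uid, change the uid, then re-insert under the new one. */
static void toy_set_uid(toy_assoc_t *a, uint32_t new_uid)
{
    toy_remove(a);
    a->uid = new_uid;
    toy_insert(a);
}
```

If the uid were changed in place instead, a later `toy_remove` would search the bucket for the new key and miss the entry, leaving a stale pointer in the old bucket.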
-
- 16 Mar, 2016 15 commits
-
-
Morris Jette authored
Add --gres-flags=enforce-binding option to salloc, sbatch and srun commands. If set, the only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced rather than advisory). bug 1725
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
Previous gang scheduling logic maintained information about resources originally allocated to the job and made scheduling decisions on that basis. bug 2494
-
Morris Jette authored
This will improve ability to diagnose problems if the srun is killed by a signal.
-
Morris Jette authored
Update gang scheduling table when job manually suspended or resumed. Prior logic could mess up job suspend/resume sequencing. bug 2494
-
Danny Auble authored
-
Danny Auble authored
time. https://bugs.schedmd.com/show_bug.cgi?id=2547 The code just wasn't fully baked before and was probably written before a lot of the other supporting code was done, i.e. assoc_mgr_set_assoc|qos_tres_cnt were done specifically for this kind of thing. Many of the usage structures weren't realloced either, and the tres_cnt local to each qos and assoc wasn't updated. So all in all pretty bad code - bad Danny. This makes sure all of this is set up and no memory corruption happens.
-
Morris Jette authored
-
Morris Jette authored
Generate burst buffer use completion email immediately after teardown completes rather than at job purge time (likely minutes later). bug 2539
-
Morris Jette authored
Change burst buffer use completion message from "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
-
Alejandro Sanchez authored
-
Morris Jette authored
This is being fixed shortly by creating a separate library for bcast functionality.
-
Brian Christiansen authored
-
Brian Christiansen authored
Bug 2396
-
- 15 Mar, 2016 9 commits
-
-
Alejandro Sanchez authored
-
Morris Jette authored
-
Tim Wickberg authored
Bug 2548. No functional change, documentation only.
-
Tim Wickberg authored
Conflicts: src/plugins/burst_buffer/generic/burst_buffer_generic.c
-
Tim Wickberg authored
Otherwise "not found" value of -1 for tres_pos would cause out-of-bounds memory access.
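The kind of guard this commit message implies can be sketched as below. The lookup and accessor names are hypothetical and the fallback value of 0 is an assumption. The only point is that an index of -1 must be rejected before it is used to subscript an array.

```c
#include <stdint.h>

/* Hypothetical lookup: returns the index of id in ids[], or -1 when not found. */
static int toy_find_pos(const uint32_t *ids, int cnt, uint32_t id)
{
    for (int i = 0; i < cnt; i++) {
        if (ids[i] == id)
            return i;
    }
    return -1;
}

/* Read counts[pos] only when the lookup succeeded; fall back to 0 otherwise. */
static uint64_t toy_get_count(const uint32_t *ids, const uint64_t *counts,
                              int cnt, uint32_t id)
{
    int pos = toy_find_pos(ids, cnt, id);
    if (pos < 0)
        return 0;  /* without this check, counts[-1] would be read */
    return counts[pos];
}
```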
-
Tim Wickberg authored
Conflicts: src/plugins/burst_buffer/cray/burst_buffer_cray.c
-
Tim Wickberg authored
Bug 2543.
-
Tim Wickberg authored
Fix bad cast in 3a604563, and update pct to 64-bits to prevent truncation of intermediate value (pct * 100).
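Why a 64-bit intermediate matters for a `value * 100` percentage calculation can be shown with a small sketch. The function names and the exact expression are assumptions for illustration. The commit message only states that pct was widened to 64 bits.

```c
#include <stdint.h>

/* 32-bit intermediate: used * 100 wraps once used exceeds UINT32_MAX / 100. */
static uint32_t toy_pct_narrow(uint32_t used, uint32_t total)
{
    uint32_t pct = used * 100;            /* truncated modulo 2^32 */
    return pct / total;
}

/* 64-bit intermediate: the product is held in full before dividing. */
static uint32_t toy_pct_wide(uint32_t used, uint32_t total)
{
    uint64_t pct = (uint64_t)used * 100;  /* widen before multiplying */
    return (uint32_t)(pct / total);
}
```

For example, with `used = 50000000` and `total = 100000000`, the product 5,000,000,000 does not fit in 32 bits, so the narrow version divides a truncated value while the wide version yields 50.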
-
Morris Jette authored
-
- 14 Mar, 2016 2 commits
-
-
Danny Auble authored
resolve NoInAddrAny when doing a strstr. Continuation of commit 775c46de.
-
Danny Auble authored
on only one port like TopologyParam=NoInAddrAny does for everything else.
-