- 07 Apr, 2016 2 commits
-
-
Sami Ilvonen authored
-
Morris Jette authored
Fix for job "--contiguous" option that could cause job allocation/launch failure or slurmctld crash. bug 2573
-
- 06 Apr, 2016 7 commits
-
-
Morris Jette authored
-
Danny Auble authored
constraints mattered in a job. Details include: A job doesn't request memory but the system is running with CR_*MEMORY with no default memory limit and the job requests nodes with features of different sizes. Previously the order of constraints mattered where the smaller memory node would need to be requested first or the job would fail. Bug 2608
-
Danny Auble authored
This reverts commit f559a55c.
-
Danny Auble authored
constraints mattered in a job. Details include: A job doesn't request memory but the system is running with CR_*MEMORY with no default memory limit and the job requests nodes with features of different sizes. Previously the order of constraints mattered where the smaller memory node would need to be requested first or the job would fail. Bug 2608
-
Morris Jette authored
Previous logic would get an account and/or QOS time limit and use that value to overwrite the incoming RPC's NO_VAL value, which would change a job's time limit when changing an unrelated field (e.g. priority, QOS, etc.). bug 2610
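A minimal sketch of the behavior described above, not the slurmctld code. The function names, the dictionary layout, and the `NO_VAL` value used here are assumptions made for illustration; the only point is that a "not specified" sentinel must stay untouched when an unrelated field is updated.

```python
# Assumption: sentinel meaning "this field was not specified in the update RPC".
NO_VAL = 0xFFFFFFFE


def apply_update_old(job, request, assoc_time_limit):
    """Model of the old logic: the sentinel is replaced by the association limit."""
    job = dict(job)
    time_limit = request.get("time_limit", NO_VAL)
    if time_limit == NO_VAL and assoc_time_limit is not None:
        # Bug: the "unchanged" sentinel is overwritten with the account/QOS limit.
        time_limit = assoc_time_limit
    if time_limit != NO_VAL:
        job["time_limit"] = time_limit
    if "priority" in request:
        job["priority"] = request["priority"]
    return job


def apply_update_fixed(job, request, assoc_time_limit):
    """Model of the fixed logic: NO_VAL leaves the job's time limit alone.

    assoc_time_limit is kept in the signature for symmetry but is unused here.
    """
    job = dict(job)
    time_limit = request.get("time_limit", NO_VAL)
    if time_limit != NO_VAL:
        job["time_limit"] = time_limit
    if "priority" in request:
        job["priority"] = request["priority"]
    return job


job = {"time_limit": 30, "priority": 10}
priority_only = {"priority": 50}  # update of an unrelated field
old_result = apply_update_old(job, priority_only, assoc_time_limit=120)
new_result = apply_update_fixed(job, priority_only, assoc_time_limit=120)
```

In this model the old path changes the time limit from 30 to 120 on a priority-only update, while the fixed path keeps it at 30.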
-
Danny Auble authored
-
Morris Jette authored
bug 2609
-
- 05 Apr, 2016 1 commit
-
-
Morris Jette authored
Fix backfill scheduler race condition that could cause invalid pointer in select/cons_res plugin. Bug introduced in 15.08.9, commit: efd9d35e
The scenario is as follows:
1. Backfill scheduler is running, then releases locks.
2. Main scheduling loop starts a job "A".
3. Backfill scheduler resumes, finds job "A" in its queue and resets its partition pointer.
4. Job "A" completes and tries to remove resource allocation record from select/cons_res data structure, but fails to find it because it is looking in the table for the wrong partition.
5. Job "A" record gets purged from slurmctld.
6. Select/cons_res plugin attempts to operate on resource allocation data structure, finds pointer into the now purged data structure of job "A" and aborts or gets SEGV.
Bug 2603
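The scenario above can be replayed deterministically with a toy model. This is only a sketch: the `Job` class, the per-partition table, and the partition names are hypothetical stand-ins, not the select/cons_res data structures.

```python
class Job:
    """Hypothetical job record holding a partition 'pointer' (here a name)."""

    def __init__(self, job_id, partition):
        self.job_id = job_id
        self.partition = partition


def start_job(alloc_table, job):
    # Record the allocation under the partition the job started in.
    alloc_table.setdefault(job.partition, {})[job.job_id] = job


def remove_allocation(alloc_table, job):
    """Return True if the allocation record was found and removed."""
    return alloc_table.get(job.partition, {}).pop(job.job_id, None) is not None


def replay(reset_partition_pointer):
    alloc_table = {}
    job = Job("A", "debug")
    start_job(alloc_table, job)                    # step 2: main loop starts job A
    if reset_partition_pointer:
        job.partition = "batch"                    # step 3: backfill resets the pointer
    removed = remove_allocation(alloc_table, job)  # step 4: lookup in the wrong table
    # Step 5 would purge the job record; anything left here is a stale reference.
    stale = [jid for part in alloc_table.values() for jid in part]
    return removed, stale


with_race = replay(reset_partition_pointer=True)
without_race = replay(reset_partition_pointer=False)
```

With the pointer reset, the removal fails and job "A" remains in the table after the record would have been purged, which corresponds to the dangling pointer in step 6.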
-
- 04 Apr, 2016 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
canceled while launching.
-
- 02 Apr, 2016 2 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
- 01 Apr, 2016 1 commit
-
-
Morris Jette authored
Rename partition configuration from "Shared" to "OverSubscribe". Rename salloc, sbatch, srun option from "--shared" to "--oversubscribe". The old options will continue to function. Output field names also changed in scontrol, sinfo, squeue, and sview.
-
- 31 Mar, 2016 2 commits
-
-
Morris Jette authored
Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are not in a ready state, the API returned an error for ALL nodes rather than valid data for nodes in ready state. bug 2332
-
Matthieu Hautreux authored
and retries are done making the error message a little misleading.
-
- 30 Mar, 2016 5 commits
-
-
Morris Jette authored
Update a node's socket and cores per socket counts as needed after a node boot to reflect configuration changes which can occur on KNL processors. Note that the node's total core count must not change, only the distribution of cores across varying socket counts (KNL NUMA nodes treated as sockets by Slurm).
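The constraint in the note can be written as a small check. This is a sketch under assumptions: the dictionary fields and the choice to raise an error on a mismatch are illustrative, not what slurmctld does.

```python
def update_node_topology(node, sockets, cores_per_socket):
    """Accept a new socket/core layout only if the total core count is unchanged."""
    old_total = node["sockets"] * node["cores_per_socket"]
    new_total = sockets * cores_per_socket
    if new_total != old_total:
        # Assumption: a changed total is treated as an error in this sketch.
        raise ValueError(f"total core count changed: {old_total} -> {new_total}")
    node["sockets"] = sockets
    node["cores_per_socket"] = cores_per_socket
    return node


# Hypothetical KNL node rebooted from one NUMA node to four.
knl = {"sockets": 1, "cores_per_socket": 64}
update_node_topology(knl, sockets=4, cores_per_socket=16)
```

Here 1 x 64 and 4 x 16 both give 64 cores, so the update is accepted; a layout such as 4 x 17 would be rejected.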
-
Danny Auble authored
rollup would effectively never run again. bug 2575 and sort of bug 2596
-
Morris Jette authored
Remove the SchedulerParameters option of "assoc_limit_continue", making it the default behavior. Add option of "assoc_limit_stop". If "assoc_limit_stop" is set and a job cannot start due to association limits, then do not attempt to initiate any lower priority jobs in that partition. Setting this can decrease system throughput and utilization, but avoids potentially starving larger jobs by preventing them from launching indefinitely.
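A simplified model of the two behaviors, assuming jobs arrive already sorted by priority. The job fields and the `schedule` function are hypothetical and ignore everything else the scheduler considers.

```python
def schedule(jobs, assoc_limit_stop=False):
    """Return the names of jobs that would be started.

    jobs: list of dicts ordered from highest to lowest priority, each with
    'name', 'partition' and 'within_assoc_limits' (hypothetical fields).
    """
    started = []
    stopped_partitions = set()
    for job in jobs:
        if job["partition"] in stopped_partitions:
            continue  # a higher priority job in this partition hit its limit
        if job["within_assoc_limits"]:
            started.append(job["name"])
        elif assoc_limit_stop:
            # Hold back all lower priority jobs in the same partition.
            stopped_partitions.add(job["partition"])
    return started


queue = [
    {"name": "big", "partition": "debug", "within_assoc_limits": False},
    {"name": "small", "partition": "debug", "within_assoc_limits": True},
    {"name": "other", "partition": "batch", "within_assoc_limits": True},
]
default_started = schedule(queue)
stop_started = schedule(queue, assoc_limit_stop=True)
```

By default "small" starts ahead of the blocked "big" job. With assoc_limit_stop only "other", which is in a different partition, starts.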
-
Morris Jette authored
-
Morris Jette authored
-
- 29 Mar, 2016 1 commit
-
-
Danny Auble authored
launching a job, instead the job will fail and drain the node if the env isn't loaded normally. bug 2546
-
- 28 Mar, 2016 4 commits
-
-
Danny Auble authored
with the slurmdbd.
-
Danny Auble authored
make the wait to return data only hit after 500 nodes and configurable based on the TcpTimeout value.
-
Morris Jette authored
There was a subtle bug in how tasks were bound to CPUs which could result in an "infinite loop" error. The problem was that various socket/core/thread calculations were based upon the resources allocated to a step rather than all resources on the node, and rounding errors could occur. Consider for example a node with 2 sockets, 6 cores per socket and 2 threads per core. On the idle node, a job requesting 14 CPUs is submitted. That job would be allocated 4 cores on the first socket and 3 cores on the second socket. The old logic would get the number of sockets for the job at 2 and the number of cores at 7, then calculate the number of cores per socket at 7/2 or 3 (rounding down to an integer). The logic laying out tasks would bind the first 3 cores on each socket to the job then not find any remaining cores, report the "infinite loop" error to the user, and run the job without one of the expected cores. The problem gets even worse when there are some allocated cores on a node. In a more extreme case, a job might be allocated 6 cores on one socket and 1 core on a second socket. In that case, 3 of that job's cores would be unused. bug 2502
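The arithmetic in this message can be checked with a few lines. This is a sketch of the rounding problem only, not the task layout code; the function name is made up for illustration.

```python
def cores_bound_old(alloc_cores_per_socket):
    """Cores the old logic would bind, given the job's allocated cores per socket."""
    sockets = sum(1 for c in alloc_cores_per_socket if c > 0)
    total_cores = sum(alloc_cores_per_socket)
    cores_per_socket = total_cores // sockets  # integer division rounds down
    # At most cores_per_socket cores are used on each socket.
    return sum(min(cores_per_socket, c) for c in alloc_cores_per_socket)


# 14 CPUs at 2 threads per core: 4 cores on socket 0, 3 cores on socket 1.
bound_4_3 = cores_bound_old([4, 3])
# More extreme split: 6 cores on one socket, 1 core on the other.
bound_6_1 = cores_bound_old([6, 1])
```

For the 4+3 split, 7 // 2 is 3, so 6 of the 7 cores are bound. For the 6+1 split only 4 cores are bound, leaving 3 unused.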
-
Morris Jette authored
This is a revision to commit 1ed38f26. The root problem is that a pthread is passed an argument which is a pointer to a variable on the stack. If that variable is overwritten, the signal number received will be garbage, and that bad signal number will be interpreted by srun to possibly abort the request.
-
- 26 Mar, 2016 1 commit
-
-
Morris Jette authored
The previous commit obviously fixed a problem, but introduced a different set of problems. This will be pursued later, perhaps in version 16.05.
-
- 25 Mar, 2016 3 commits
-
-
Morris Jette authored
With some configurations and systems, errors of the following sort were occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
-
Nathan Yee authored
Bug 1706
-
Morris Jette authored
burst_buffer/cray - If the pre-run operation fails then don't issue duplicate job cancel/requeue unless the job is still in run state. Prevents jobs hung in COMPLETING state. bug 2587
-
- 24 Mar, 2016 1 commit
-
-
Danny Auble authored
isn't kept up to date in the cache.
-
- 23 Mar, 2016 4 commits
-
-
Morris Jette authored
Fix gang scheduling resource selection bug which could prevent multiple jobs from being allocated the same resources. Bug was introduced in 15.08.6, commit 44f491b8
-
Morris Jette authored
Here's how to reproduce on smd-server with 2 sockets, 6 cores per socket and 2 threads per core: just run the following command line 3 times in quick succession (all active at the same time):
srun --cpus-per-task=4 -m block sleep 30
What was happening is:
The first job would be allocated cores 0+1.
The second job would be allocated cores 2+3.
The third job would test use of cores 0-3 then exit because the job only needs 4 CPUs. The resulting core binding would include NO CPUs.
The new logic tests that the core being considered for use actually has some resources available to the job before updating the counter which is being tested against the needed CPU counter.
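A toy version of the counter bug, under stated assumptions: the counter advances by one per core tested (chosen to mirror the "cores 0-3" description), and the third job is assumed to hold cores 4 and 5, which the message does not state. The function names are hypothetical.

```python
def bind_cores_old(job_cores, total_cores, needed):
    """Old logic: the counter advances for every core tested."""
    counted = 0
    bound = []
    for core in range(total_cores):
        if counted >= needed:
            break
        counted += 1  # bug: counted even when the core is not available to the job
        if core in job_cores:
            bound.append(core)
    return bound


def bind_cores_new(job_cores, total_cores, needed):
    """New logic: only cores with resources available to the job are counted."""
    counted = 0
    bound = []
    for core in range(total_cores):
        if counted >= needed:
            break
        if core not in job_cores:
            continue
        counted += 1
        bound.append(core)
    return bound


third_job_cores = {4, 5}  # assumption: cores 0-3 belong to the first two jobs
old_binding = bind_cores_old(third_job_cores, total_cores=12, needed=4)
new_binding = bind_cores_new(third_job_cores, total_cores=12, needed=4)
```

The old loop stops after testing cores 0-3 and binds nothing; the new loop skips those cores and binds 4 and 5.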
-
Morris Jette authored
Specifically add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when loading configuration from HWLOC library. Previous logic in task/cgroup did not do this, which was different behaviour from how slurmd gets configuration information. Here's the HWLOC documentation:
HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM Detect the whole system, ignore reservations and offline settings. Gather all resources, even if some were disabled by the administrator. For instance, ignore Linux Cpusets and gather all processors and memory nodes, and ignore the fact that some resources may be offline.
Without this flag, I was on rare occasions observing a bad core count, which resulted in the logic laying out tasks incorrectly and generating an error:
task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic
bug 2502
-
Danny Auble authored
-
- 21 Mar, 2016 2 commits
-
-
Morris Jette authored
burst_buffer/cray: Set environment variables just before starting job rather than at job submission time to reflect persistent buffers created or modified while the job is pending. bug 2545
-
Danny Auble authored
buffer is found. Bug 2576 What happened was a function was doing a double read lock, which isn't awesome to begin with, but not really horrible (if all you are doing is read locks anyway). The problem was that after the first lock was taken, a different thread was going for a write lock, and so when the second read lock came in it created a deadlock.
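The deadlock can be illustrated with a non-blocking model of a read/write lock. This is a sketch assuming a lock that makes new readers wait once a writer is queued; the class and its methods are hypothetical and are not the slurmctld locking code.

```python
class WriterPreferringRWLock:
    """Toy lock state: try_* methods return False where a real lock would block."""

    def __init__(self):
        self.readers = 0
        self.writer_waiting = False
        self.writer_active = False

    def try_read_lock(self):
        # Assumption: new readers are held back while a writer is active or queued.
        if self.writer_active or self.writer_waiting:
            return False
        self.readers += 1
        return True

    def try_write_lock(self):
        if self.readers > 0 or self.writer_active:
            self.writer_waiting = True  # writer queues behind current readers
            return False
        self.writer_waiting = False
        self.writer_active = True
        return True


lock = WriterPreferringRWLock()
first_read = lock.try_read_lock()    # thread 1 takes its first read lock
write_try = lock.try_write_lock()    # thread 2 asks for the write lock and must wait
second_read = lock.try_read_lock()   # thread 1 asks again and would block
```

Thread 1 now waits for the queued writer, and the writer waits for thread 1 to release its first read lock, so neither can proceed.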
-
- 18 Mar, 2016 2 commits
-
-
Morris Jette authored
Jobs below the specified threshold will not have resources reserved for them. bug 2565
-
Morris Jette authored
Avoid possibly aborting srun that gets simultaneous SIGSTOP+SIGCONT while creating the job step. The result is that the signal handler gets an argument (the signal received) of zero.
Here's a log, window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step
Window 2:
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
....
bug 2494
-