- 28 Mar, 2016 2 commits
-
-
Morris Jette authored
There was a subtle bug in how tasks were bound to CPUs which could result in an "infinite loop" error. The problem was that various socket/core/thread calculations were based upon the resources allocated to a step rather than all resources on the node, so rounding errors could occur. Consider for example a node with 2 sockets, 6 cores per socket and 2 threads per core. On the idle node, a job requesting 14 CPUs is submitted. That job would be allocated 4 cores on the first socket and 3 cores on the second socket. The old logic would get the number of sockets for the job as 2 and the number of cores as 7, then calculate the number of cores per socket as 7/2, or 3 (rounding down to an integer). The logic laying out tasks would bind the first 3 cores on each socket to the job, then not find any remaining cores, report the "infinite loop" error to the user, and run the job without one of the expected cores. The problem gets even worse when some cores on the node are already allocated. In a more extreme case, a job might be allocated 6 cores on one socket and 1 core on a second socket. In that case, 3 of that job's cores would be unused. bug 2502
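The rounding described above can be checked with a few lines of arithmetic. This is a minimal sketch, not Slurm's code: the function names `old_cores_bound` and `unused_cores` are hypothetical, and the model assumes the old logic bound at most the rounded-down cores-per-socket value on each socket.

```python
def old_cores_bound(cores_per_socket_alloc):
    """Model of the old logic: derive one cores-per-socket value from step totals."""
    sockets = len(cores_per_socket_alloc)
    cores = sum(cores_per_socket_alloc)
    per_socket = cores // sockets  # integer division rounds down
    # Each socket contributes at most per_socket cores, and never more
    # than were actually allocated on that socket.
    return sum(min(per_socket, allocated) for allocated in cores_per_socket_alloc)


def unused_cores(cores_per_socket_alloc):
    """Cores allocated to the job that the old layout never binds."""
    return sum(cores_per_socket_alloc) - old_cores_bound(cores_per_socket_alloc)


print(unused_cores([4, 3]))  # 14-CPU job: 7 // 2 == 3 per socket, one core lost
print(unused_cores([6, 1]))  # extreme case: three cores lost
```

Computing the layout from the node's full socket and core counts, as the commit does, avoids deriving a per-socket figure from an uneven allocation.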
-
Morris Jette authored
This is a revision to commit 1ed38f26. The root problem is that a pthread is passed an argument which is a pointer to a variable on the stack. If that variable is overwritten, the signal number received will be garbage, and that bad signal number will be interpreted by srun, possibly aborting the request.
-
- 26 Mar, 2016 1 commit
-
-
Morris Jette authored
The previous commit obviously fixed a problem, but introduced a different set of problems. This will be pursued later, perhaps in version 16.05.
-
- 25 Mar, 2016 2 commits
-
-
Morris Jette authored
With some configurations and systems, errors of the following sort were occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
-
Morris Jette authored
burst_buffer/cray - If the pre-run operation fails then don't issue duplicate job cancel/requeue unless the job is still in run state. Prevents jobs hung in COMPLETING state. bug 2587
-
- 24 Mar, 2016 4 commits
-
-
Morris Jette authored
Running "scontrol reconfig" releases resources for jobs waiting for the completion of Node Health Check so that other jobs can run. Cray says to always wait for NHC to complete, but in extreme cases that can be 2 hours, during which the entire resource allocation for a job may be unusable. Per advice from NERSC, the logic to release resources is unchanged, but logging is added here.
-
Danny Auble authored
isn't kept up to date in the cache.
-
Danny Auble authored
-
Danny Auble authored
as will.
-
- 23 Mar, 2016 4 commits
-
-
Morris Jette authored
Fix gang scheduling resource selection bug which could prevent multiple jobs from being allocated the same resources. Bug was introduced in 15.08.6, commit 44f491b8
-
Morris Jette authored
Here's how to reproduce on smd-server with 2 sockets, 6 cores per socket and 2 threads per core: run the following command line 3 times in quick succession (all active at the same time): srun --cpus-per-task=4 -m block sleep 30 What was happening is that the first job would be allocated cores 0+1. The second job would be allocated cores 2+3. The third job would test use of cores 0-3, then exit because the job only needs 4 CPUs. The resulting core binding would include NO CPUs. The new logic tests that the core being considered for use actually has some resources available to the job before updating the counter which is being tested against the needed CPU counter.
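The difference between the old and new counting can be sketched as follows. This is an illustrative model only: `bind_old`, `bind_new` and their parameters are hypothetical names, and the sketch assumes each core adds two CPUs to the counter, so the old loop here stops after fewer cores than in the description above. The point it shows is the empty binding.

```python
THREADS_PER_CORE = 2  # assumption, matching the 2-threads-per-core node above


def bind_old(needed_cpus, node_cores, job_cores):
    """Old behaviour: the counter advances for every core tested."""
    bound, counted = [], 0
    for core in node_cores:
        if counted >= needed_cpus:
            break
        if core in job_cores:
            bound.append(core)
        counted += THREADS_PER_CORE  # counted even when the job cannot use the core
    return bound


def bind_new(needed_cpus, node_cores, job_cores):
    """New behaviour: only cores with resources for this job are counted."""
    bound, counted = [], 0
    for core in node_cores:
        if counted >= needed_cpus:
            break
        if core not in job_cores:
            continue  # skip without touching the counter
        bound.append(core)
        counted += THREADS_PER_CORE
    return bound


node_cores = range(12)   # 2 sockets x 6 cores
third_job = {4, 5}       # cores 0-3 belong to the first two jobs
print(bind_old(4, node_cores, third_job))
print(bind_new(4, node_cores, third_job))
```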
-
Morris Jette authored
Specifically add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when loading configuration from the HWLOC library. Previous logic in task/cgroup did not do this, which differed from how slurmd gets configuration information. Here's the HWLOC documentation: HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM Detect the whole system, ignore reservations and offline settings. Gather all resources, even if some were disabled by the administrator. For instance, ignore Linux Cpusets and gather all processors and memory nodes, and ignore the fact that some resources may be offline. Without this flag, I was on rare occasions observing a bad core count, which resulted in the logic laying out tasks incorrectly and generating an error: task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic bug 2502
-
Danny Auble authored
-
- 22 Mar, 2016 1 commit
-
-
Morris Jette authored
Just in case some job fails to terminate as expected.
-
- 21 Mar, 2016 3 commits
-
-
Danny Auble authored
gang scheduling before doing code for gang scheduling.
-
Morris Jette authored
burst_buffer/cray: Set environment variables just before starting job rather than at job submission time to reflect persistent buffers created or modified while the job is pending. bug 2545
-
Danny Auble authored
buffer is found. Bug 2576 What happened was that a function was taking a double read lock, which isn't awesome to begin with, but not really horrible (if all you are doing is read locks anyway). The problem was that after the first read lock was taken, a different thread went for a write lock, so when the second read lock came in, the two threads deadlocked.
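The interleaving described above can be modelled without real threads. This is a minimal sketch under a stated assumption: the lock is writer-preferring, meaning new readers are refused while a writer is waiting. The commit message does not say how Slurm's lock is implemented, and the class `WriterPreferringLock` and its methods are hypothetical.

```python
class WriterPreferringLock:
    """State-only model: the try_* methods report whether a caller could proceed."""

    def __init__(self):
        self.readers = 0
        self.writers_waiting = 0
        self.writer_active = False

    def try_read(self):
        # Assumption: a waiting writer blocks new readers.
        if self.writer_active or self.writers_waiting:
            return False
        self.readers += 1
        return True

    def request_write(self):
        self.writers_waiting += 1

    def try_write(self):
        # A writer needs every reader to have released the lock.
        if self.readers or self.writer_active:
            return False
        self.writers_waiting -= 1
        self.writer_active = True
        return True


lock = WriterPreferringLock()
first_read = lock.try_read()    # thread A takes its first read lock
lock.request_write()            # thread B queues for the write lock
writer_runs = lock.try_write()  # B must wait for A to release
second_read = lock.try_read()   # A's nested read lock must wait for B
print(first_read, writer_runs, second_read)
```

In this model thread A waits on B and B waits on A, so neither can make progress. Taking the read lock only once per call path removes the cycle.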
-
- 18 Mar, 2016 2 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
Avoid possibly aborting an srun that gets simultaneous SIGSTOP+SIGCONT while creating the job step. The result is that the signal handler gets an argument (the signal received) of zero. Here's a log.
Window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step
Window 2:
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
....
bug 2494
-
- 17 Mar, 2016 2 commits
-
-
Tim Wickberg authored
Update NEWS as well.
-
Tim Wickberg authored
The uid is used as part of the hash function, so the old reference must be removed and the hash recalculated whenever the uid may change. Otherwise _delete_assoc_hash will not find the association again when it is removed, causing slurmctld to segfault. Bug 2560.
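Why the entry must be removed before the uid changes can be seen in a small bucket table. This is a sketch of the general technique, not Slurm's association hash: the class `AssocHash`, its methods and the record layout are hypothetical, and the bucket index here depends on the uid alone.

```python
class AssocHash:
    """Tiny bucket table keyed on a field that can change (the uid)."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, assoc):
        return self.buckets[assoc["uid"] % len(self.buckets)]

    def add(self, assoc):
        self._bucket(assoc).append(assoc)

    def delete(self, assoc):
        bucket = self._bucket(assoc)  # looked up with the *current* uid
        if assoc in bucket:
            bucket.remove(assoc)
            return True
        return False

    def set_uid(self, assoc, new_uid):
        # Remove under the old uid, change it, then re-insert under the new one.
        self.delete(assoc)
        assoc["uid"] = new_uid
        self.add(assoc)


table = AssocHash()

stale = {"id": 1, "uid": 1000}
table.add(stale)
stale["uid"] = 1001                  # changed in place: entry stays in the old bucket
stale_found = table.delete(stale)    # searches the wrong bucket

safe = {"id": 2, "uid": 1000}
table.add(safe)
table.set_uid(safe, 1001)
safe_found = table.delete(safe)

print(stale_found, safe_found)
```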
-
- 16 Mar, 2016 6 commits
-
-
Morris Jette authored
Previous gang scheduling logic maintained information about resources originally allocated to the job and made scheduling decisions on that basis. bug 2494
-
Morris Jette authored
This will improve ability to diagnose problems if the srun is killed by a signal.
-
Morris Jette authored
Update gang scheduling table when job manually suspended or resumed. Prior logic could mess up job suspend/resume sequencing. bug 2494
-
Danny Auble authored
time. https://bugs.schedmd.com/show_bug.cgi?id=2547 The code just wasn't fully baked before and was probably written before a lot of the other supporting code was done, i.e. assoc_mgr_set_assoc|qos_tres_cnt were done specifically for this kind of thing. Many of the usage structures weren't realloced either, and the tres_cnt local to each qos and assoc wasn't updated. So all in all pretty bad code - bad Danny. This makes sure all of this is set up and no memory corruption happens.
-
Morris Jette authored
Generate burst buffer use completion email immediately after teardown completes rather than at job purge time (likely minutes later). bug 2539
-
Morris Jette authored
Change burst buffer use completion message from "SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to "SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
-
- 15 Mar, 2016 5 commits
-
-
Alejandro Sanchez authored
-
Morris Jette authored
-
Tim Wickberg authored
Bug 2548. No functional change, documentation only.
-
Tim Wickberg authored
Otherwise "not found" value of -1 for tres_pos would cause out-of-bounds memory access.
-
Tim Wickberg authored
Bug 2543.
-
- 14 Mar, 2016 3 commits
-
-
Danny Auble authored
resolve NoInAddrAny when doing a strstr. Continuation of commit 775c46de.
-
Danny Auble authored
on only one port like TopologyParam=NoInAddrAny does for everything else.
-
Tim Wickberg authored
There's no /proc on *BSD, and BSD handles OOM in a completely different way.
-
- 11 Mar, 2016 2 commits
-
-
Tim Wickberg authored
-
Tim Wickberg authored
Return [0-100:2] formatting, rather than [0,2,4,6,8,...], when using a step function. Was inadvertently broken in 14.11 with commit 5ffdca92. Bug 2535.
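The two output styles mentioned above can be illustrated with a short formatter. This is a sketch of the notation only, not Slurm's formatting code: the function `format_stepped` is hypothetical and handles just a single evenly spaced run.

```python
def format_stepped(values):
    """Format integers as [lo-hi:step] when evenly spaced, else as a plain list."""
    values = sorted(set(values))
    plain = "[" + ",".join(str(v) for v in values) + "]"
    if len(values) < 3:
        return plain
    step = values[1] - values[0]
    if any(b - a != step for a, b in zip(values, values[1:])):
        return plain  # uneven spacing: fall back to listing every value
    if step == 1:
        return f"[{values[0]}-{values[-1]}]"
    return f"[{values[0]}-{values[-1]}:{step}]"


print(format_stepped(range(0, 101, 2)))  # [0-100:2]
print(format_stepped([0, 2, 5]))         # [0,2,5]
```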
-
- 10 Mar, 2016 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray plugin: Prevent a requeued job from being restarted while file stage-out is still in progress. Previous logic could restart the job and not perform a new stage-in. bug 2584, comment #45
-
- 09 Mar, 2016 1 commit
-
-
Morris Jette authored
Fix Cray NHC spawning on job requeue. Previous logic would leave nodes allocated to a requeued job as non-usable on job termination. Specifically, each job has a "cleaning/cleaned" flag. Once a job terminates, the cleaning flag is set, then after the job node health check completes, the value gets set to cleaned. If the job is requeued, on its second (or subsequent) termination, the select/cray plugin is called to launch the NHC. The plugin sees the "cleaned" flag already set, logs
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the job NHC triggers releasing job resources (CPUs, memory, and GRES), those resources are never released for use by other jobs. Bug 2384
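The flag sequence described above can be modelled as a small state machine. This is a sketch, not the select/cray plugin: the names `Job`, `job_fini` and `requeue` are hypothetical, and clearing the flag on requeue is an assumption made for illustration, since the commit message does not say how the fix is implemented.

```python
class Job:
    def __init__(self):
        self.clean_state = None      # None, "cleaning" or "cleaned"
        self.nhc_runs = 0
        self.resources_held = True


def job_fini(job):
    """Called at each job termination; returns True if the NHC was launched."""
    if job.clean_state == "cleaned":
        return False                 # "Cleaned flag already set": NHC is skipped
    job.clean_state = "cleaning"
    job.nhc_runs += 1                # node health check runs
    job.clean_state = "cleaned"
    job.resources_held = False       # NHC completion releases CPUs, memory, GRES
    return True


def requeue(job, reset_flag):
    job.resources_held = True        # the requeued job is allocated resources again
    if reset_flag:
        job.clean_state = None       # assumed fix: forget the previous run's flag


def run_twice(reset_flag):
    job = Job()
    job_fini(job)                    # first termination
    requeue(job, reset_flag)
    job_fini(job)                    # second termination
    return job


print(run_twice(reset_flag=False).resources_held)  # flag left set: resources stay held
print(run_twice(reset_flag=True).resources_held)   # flag cleared: resources released
```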
-