Commits · ddf6d9a44a81c7338e082d9e09c69548d3951c91 · Manuel G. Marciani / ces_slurm_simulator

28 Mar, 2016 2 commits

task/cgroup - Fix task binding to CPUs bug · ddf6d9a4

Morris Jette authored Mar 28, 2016

There was a subtle bug in how tasks were bound to CPUs which could result
in an "infinite loop" error. The problem was that various socket/core/thread
calculations were based upon the resources allocated to a step rather than
all resources on the node, so rounding errors could occur. Consider, for
example, a node with 2 sockets, 6 cores per socket and 2 threads per core.
On the idle node, a job requesting 14 CPUs is submitted. That job would
be allocated 4 cores on the first socket and 3 cores on the second socket.
The old logic would get the number of sockets for the job as 2 and the
number of cores as 7, then calculate the number of cores per socket as
7/2 or 3 (rounding down to an integer). The logic laying out tasks
would bind the first 3 cores on each socket to the job, then not find any
remaining cores, report the "infinite loop" error to the user, and run
the job without one of the expected cores. The problem gets even worse
when some cores on the node are already allocated. In a more extreme case,
a job might be allocated 6 cores on one socket and 1 core on a second
socket. In that case, 3 of that job's cores would be unused.
bug 2502

ddf6d9a4
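
The rounding problem described in ddf6d9a4 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the task/cgroup code itself; the function names and the per-socket list representation are illustrative assumptions. It binds at most a given number of cores on each socket, which is how the commit message describes the layout logic behaving.

```python
def cores_bound(allocated_per_socket, cores_per_socket):
    """Count cores bound when at most cores_per_socket are taken on each socket.

    Hypothetical model of the layout behaviour described in the commit message.
    """
    return sum(min(alloc, cores_per_socket) for alloc in allocated_per_socket)


def old_cores_per_socket(allocated_per_socket):
    """Old logic: derive cores per socket from the job's own allocation."""
    job_sockets = sum(1 for alloc in allocated_per_socket if alloc > 0)
    job_cores = sum(allocated_per_socket)
    return job_cores // job_sockets  # integer division rounds down


NODE_CORES_PER_SOCKET = 6  # the example node: 2 sockets x 6 cores x 2 threads

for allocation in ([4, 3], [6, 1]):
    old = cores_bound(allocation, old_cores_per_socket(allocation))
    new = cores_bound(allocation, NODE_CORES_PER_SOCKET)
    print(allocation, "job-derived value binds", old, "of", sum(allocation),
          "cores; node-derived value binds", new)
```

With the 4+3 allocation the job-derived value leaves one core unbound, and with 6+1 it leaves three, matching the counts in the commit message.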

Fix for srun signal handling threading problem · c8d36dba

Morris Jette authored Mar 28, 2016

This is a revision to commit 1ed38f26.
The root problem is that a pthread is passed an argument which is
a pointer to a variable on the stack. If that variable is overwritten,
the signal number received will be garbage, and that bad signal
number may be interpreted by srun as a reason to abort the request.

c8d36dba
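
The hazard described in c8d36dba is specific to C pthreads, but the shape of the problem can be shown in Python as an analogy. In this sketch (all names are hypothetical, and this is not the srun fix itself) a one-element list stands in for the caller's stack variable: handing the thread the list models passing a pointer, while handing it the integer models passing a copy. The value 18 is the signal number that appears in the 1ed38f26 log.

```python
import threading


def deliver_signal(pass_by_reference):
    """Start a handler thread and return the signal number it ends up seeing."""
    slot = [18]                      # stands in for the caller's stack variable
    caller_moved_on = threading.Event()
    seen = []

    def handler(arg):
        caller_moved_on.wait()       # the thread reads its argument late
        seen.append(arg[0] if pass_by_reference else arg)

    # Either hand over the shared storage itself, or a copy of its current value.
    arg = slot if pass_by_reference else slot[0]
    thread = threading.Thread(target=handler, args=(arg,))
    thread.start()

    slot[0] = 0                      # the caller reuses the storage
    caller_moved_on.set()
    thread.join()
    return seen[0]


print(deliver_signal(pass_by_reference=True))   # handler sees the overwritten value
print(deliver_signal(pass_by_reference=False))  # handler sees the original value
```

The event only makes the ordering deterministic for the demonstration; in real code the overwrite would happen at an unpredictable time.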

26 Mar, 2016 1 commit

Revert commit · c1dde86c

Morris Jette authored Mar 25, 2016

The previous commit obviously fixed a problem, but introduced a different
set of problems. This will be pursued later, perhaps in version 16.05.

c1dde86c

25 Mar, 2016 2 commits

Revert commit · f5920b77

Morris Jette authored Mar 25, 2016

With some configurations and systems, errors of the following sort were
occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'

f5920b77

burst_buffer/cray - pre-run fail fix · 5a48207e

Morris Jette authored Mar 25, 2016

burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
bug 2587

5a48207e

24 Mar, 2016 4 commits

Select/cray - Log NHC run time on "scontrol reconfig" · 58627d02

Morris Jette authored Mar 24, 2016

Running "scontrol reconfig" releases resources for jobs waiting for
  the completion of Node Health Check so that other jobs can run.
  Cray says to always wait for NHC to complete, but in extreme
  cases that can be 2 hours, during which the entire resource
  allocation for a job may be unusable. Per advice from NERSC,
  the logic to release resources is unchanged, but logging is
  added here.

58627d02

Remove Rgt from the association output of scontrol show assoc_mgr. Rgt · ddfd2781
Danny Auble authored Mar 24, 2016

isn't kept up to date in the cache.

ddfd2781
Fix incorrect info from commit ce6d3717 · 60979f52
Danny Auble authored Mar 24, 2016

60979f52
Add new expect get_curr_line_num so you can print out the script line · ce6d3717
Danny Auble authored Mar 24, 2016

as well.

ce6d3717

23 Mar, 2016 4 commits

gang scheduling bug fix · 5f1e78f6

Morris Jette authored Mar 23, 2016

Fix gang scheduling resource selection bug which could prevent multiple jobs
    from being allocated the same resources. Bug was introduced in 15.08.6,
    commit 44f491b8

5f1e78f6

task/cgroup: Fix for task binding anomaly · efa83a02

Morris Jette authored Mar 23, 2016

To reproduce on smd-server (2 sockets, 6 cores per socket and 2 threads
per core), run the following command line 3 times in quick succession
so that all three are active at the same time:
srun --cpus-per-task=4 -m block sleep 30
What was happening is that the first job would be allocated cores 0+1.
The second job would be allocated cores 2+3.
The third job would test use of cores 0-3, then exit because the
 job only needs 4 CPUs. The resulting core binding would include
 NO CPUs. The new logic tests that the core being considered for
 use actually has some resources available to the job before
 updating the counter which is being tested against the needed
 CPU counter.

efa83a02
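
The counter behaviour described in efa83a02 can be modelled in a few lines. This is a simplified sketch, not the task/cgroup implementation: it walks abstract "compute elements" in index order, and the function name, the `count_unavailable` switch and the element numbering are all illustrative assumptions.

```python
def bind_elements(available, needed, total, count_unavailable):
    """Walk compute elements in order and return the ones bound to the job.

    available: set of element indexes the job is allowed to use.
    count_unavailable: True models the old behaviour, where every element
    considered advances the counter whether or not the job can use it.
    """
    bound = []
    counted = 0
    for element in range(total):
        if counted >= needed:
            break
        usable = element in available
        if usable:
            bound.append(element)
        if usable or count_unavailable:
            counted += 1
    return bound


third_job = {4, 5, 6, 7}   # hypothetical: elements 0-3 belong to the first two jobs
print(bind_elements(third_job, needed=4, total=24, count_unavailable=True))   # old
print(bind_elements(third_job, needed=4, total=24, count_unavailable=False))  # new
```

In the old mode the counter reaches the needed value while still looking at elements owned by other jobs, so the binding comes back empty. Counting only usable elements lets the walk continue until the job's own resources are found.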

task/cgroup: Fix for task layout logic when resources are disabled. · 6c14b969

Morris Jette authored Mar 23, 2016

Specifically, add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when
loading configuration from the HWLOC library. Previous logic in
task/cgroup did not do this, which differed from how slurmd gets
configuration information. Here's the HWLOC
documentation:
HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM
Detect the whole system, ignore reservations and offline settings.
Gather all resources, even if some were disabled by the administrator.
For instance, ignore Linux Cpusets and gather all processors and memory
nodes, and ignore the fact that some resources may be offline.

Without this flag, I was occasionally observing a bad core count, which
resulted in the logic laying out tasks incorrectly and generating an error:
task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic

bug 2502

6c14b969

Fix check of per-user qos limits on the initial run by a user. · 2ed5c7fb
Danny Auble authored Mar 23, 2016

2ed5c7fb

22 Mar, 2016 1 commit
- Add cancel_job calls to test · d2e38a2b
  Morris Jette authored Mar 22, 2016

  Just in case some job fails to terminate as expected.

  d2e38a2b
21 Mar, 2016 3 commits

Continuation to commit 701917cc to make sure we are running · 264b950d
Danny Auble authored Mar 21, 2016

gang scheduling before doing code for gang scheduling.

264b950d

Change point where burst buffer env vars are set · 54f314e7

Morris Jette authored Mar 21, 2016

burst_buffer/cray: Set environment variables just before starting job rather
    than at job submission time to reflect persistent buffers created or
    modified while the job is pending.
bug 2545

54f314e7

Fix deadlock issue with burst_buffer/cray when a newly created burst · dcfa6ec0

Danny Auble authored Mar 21, 2016

buffer is found.

Bug 2576

What happened was that a function was taking a double read lock, which isn't
awesome to begin with, but not really horrible (if all you are doing is
read locks anyway).  The problem was that after the first lock was taken, a
different thread was going for a write lock, so when the second
read lock came in it created a deadlock.

dcfa6ec0
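
The sequence described in dcfa6ec0 can be traced with a bookkeeping-only model of a read/write lock. This is a sketch under a stated assumption: the lock holds back new readers once a writer is waiting. The class and method names are hypothetical, and nothing here blocks; each `try_` call just reports whether a real lock would have let the caller through.

```python
class WriterPreferringLock:
    """Non-blocking model of a read/write lock that favours waiting writers."""

    def __init__(self):
        self.readers = 0
        self.writer_active = False
        self.writers_waiting = 0

    def try_read(self):
        # New readers are held back while a writer is active or queued.
        if self.writer_active or self.writers_waiting:
            return False
        self.readers += 1
        return True

    def try_write(self):
        # A writer needs exclusive access; otherwise it joins the queue.
        if self.readers or self.writer_active:
            self.writers_waiting += 1
            return False
        self.writer_active = True
        return True


lock = WriterPreferringLock()
first_read = lock.try_read()     # thread A: outer read lock succeeds
write = lock.try_write()         # thread B: must wait for A to unlock
second_read = lock.try_read()    # thread A: nested read lock must wait for B
print(first_read, write, second_read)
```

Thread A cannot release its first read lock until the nested one is granted, and thread B cannot proceed until A releases, so neither makes progress.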

18 Mar, 2016 2 commits

Fix typo. · 886df85b
Tim Wickberg authored Mar 18, 2016

886df85b

Fix for srun abort on SIGSTOP+SIGCONT · 1ed38f26

Morris Jette authored Mar 18, 2016

Avoid possibly aborting an srun that gets simultaneous SIGSTOP+SIGCONT while
    creating the job step. The result of those signals is that the signal
    handler gets an argument (the signal received) of zero.

Here's a log, window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step

Window 2:
$  kill -STOP 18696 ; kill -CONT 18696
$  kill -STOP 18696 ; kill -CONT 18696
$  kill -STOP 18696 ; kill -CONT 18696
....

bug 2494

1ed38f26

17 Mar, 2016 2 commits

Merge branch 'slurm-14.11' into slurm-15.08 · dd2324a7
Tim Wickberg authored Mar 17, 2016

Update NEWS as well.

dd2324a7

Prevent uid update from corrupting assoc_hash table. · 60b58b70

Tim Wickberg authored Mar 17, 2016

The uid is used as part of the hash function, so the old reference must
be removed and the hash recalculated whenever the uid may change.
Otherwise _delete_assoc_hash will not find the entry again when the
association is removed, causing slurmctld to segfault.

Bug 2560.

60b58b70
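
The failure mode in 60b58b70 applies to any hash table whose key can change after insertion. Below is a toy sketch with hypothetical class names; the real hash function uses the uid only as one part of its input, while this model uses the uid alone to keep the example short.

```python
class Assoc:
    def __init__(self, assoc_id, uid):
        self.assoc_id = assoc_id
        self.uid = uid


class AssocHash:
    """Toy hash table whose bucket index depends on the association's uid."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, assoc):
        return self.buckets[assoc.uid % len(self.buckets)]

    def add(self, assoc):
        self._bucket(assoc).append(assoc)

    def delete(self, assoc):
        bucket = self._bucket(assoc)
        if assoc in bucket:
            bucket.remove(assoc)
            return True
        return False          # the entry is stranded in another bucket

    def update_uid(self, assoc, new_uid):
        # Remove under the old uid, change it, then re-insert under the new one.
        self.delete(assoc)
        assoc.uid = new_uid
        self.add(assoc)


table = AssocHash()
stale = Assoc(1, uid=1001)
table.add(stale)
stale.uid = 2002                 # changed in place: the bucket no longer matches
print(table.delete(stale))       # lookup goes to the wrong bucket

safe = Assoc(2, uid=1001)
table.add(safe)
table.update_uid(safe, 2002)
print(table.delete(safe))        # found, because the entry was re-hashed
```

In C the stranded entry is worse than a failed lookup, since the table keeps a pointer to memory that is later freed.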

16 Mar, 2016 6 commits

Update gang scheduling data structures when job changes in size · 701917cc

Morris Jette authored Mar 16, 2016

Previous gang scheduling logic maintained information about resources
  originally allocated to the job and made scheduling decisions on
  that basis.
bug 2494

701917cc

Add signal number to error message · 83400184
Morris Jette authored Mar 16, 2016

This will improve ability to diagnose problems if the srun is
killed by a signal.

83400184

gang scheduling fix for manual job suspend/resume · 344d2eab

Morris Jette authored Mar 16, 2016

Update gang scheduling table when a job is manually suspended or resumed. Prior
    logic could mess up job suspend/resume sequencing.
bug 2494

344d2eab

Fix issue when adding a new TRES to AccountingStorageTRES for the first · 6c436e34

Danny Auble authored Mar 16, 2016

time.

https://bugs.schedmd.com/show_bug.cgi?id=2547

The code just wasn't fully baked before and was probably written before
a lot of the other supporting code was done, i.e.
assoc_mgr_set_assoc|qos_tres_cnt were done specifically for this kind of
thing.  Many of the usage structures weren't realloced, and the tres_cnt
local to each qos and assoc wasn't updated either.  So all in all
pretty bad code - bad Danny.  This makes sure all of this is set up and no
memory corruption happens.

6c436e34

Send burst buffer teardown immediately · d85cdcc7

Morris Jette authored Mar 16, 2016

Generate burst buffer use completion email immediately after teardown
    completes rather than at job purge time (likely minutes later).
bug 2539

d85cdcc7

Modify burst buffer stage out message · fae4c3d3

Morris Jette authored Mar 16, 2016

Change burst buffer use completion message from
"SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
"SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"

fae4c3d3

15 Mar, 2016 5 commits
- acct_gather_energy/ipmi - add threshold for message logging · 18608974
  Alejandro Sanchez authored Mar 15, 2016
  
  18608974
- Document how to create slurmstepd core file · 4305fb7c
  Morris Jette authored Mar 15, 2016
  
  4305fb7c
- Clarify language for NoReserve flag. · 3c98f608
  Tim Wickberg authored Mar 15, 2016

  Bug 2548. No functional change, documentation only.

  3c98f608
- Continue 5708037d with checks for tres_pos > 0. · 7274ef9f
  Tim Wickberg authored Mar 15, 2016

  Otherwise "not found" value of -1 for tres_pos would cause
  out-of-bounds memory access.

  7274ef9f
- Check that bb_state.tres_pos is set correctly to avoid overwriting CPU TRES. · 5708037d
  Tim Wickberg authored Mar 15, 2016

  Bug 2543.

  5708037d
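
The guard described in 7274ef9f and 5708037d above can be sketched as follows. This is an illustration only: the function name, the list layout and the upper-bound check are assumptions, as is the reading that slot 0 holds the CPU count (suggested by the title of 5708037d) and that -1 means "not found".

```python
def add_bb_usage(tres_cnt, tres_pos, amount):
    """Add burst buffer usage to tres_cnt only when tres_pos is a valid slot."""
    # Assumption: slot 0 holds the CPU count and -1 means "not found",
    # so only positions greater than zero are accepted.
    if tres_pos <= 0 or tres_pos >= len(tres_cnt):
        return False
    tres_cnt[tres_pos] += amount
    return True


counts = [16, 64000, 0]                  # hypothetical layout: cpu, mem, burst buffer
print(add_bb_usage(counts, -1, 100))     # rejected, counts left untouched
print(add_bb_usage(counts, 2, 100))      # accepted
print(counts)
```

Without the guard, a Python list would silently update its last slot for an index of -1, while C would write outside the array; the check matters in either language.
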
14 Mar, 2016 3 commits
- Change NoInAddrAnyCtld to NoCtldInAddrAny so as to not have it also · f5b5e605
  Danny Auble authored Mar 14, 2016

  resolve NoInAddrAny when doing a strstr.  Continuation of commit 775c46de.

  f5b5e605
- Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen · 775c46de
  Danny Auble authored Mar 14, 2016

  on only one port like TopologyParam=NoInAddrAny does for everything else.

  775c46de
- FreeBSD - set_oom_adj is Linux-specific, stub out to avoid errors. · b3f2359f
  Tim Wickberg authored Mar 14, 2016

  There's no /proc on *BSD, and BSD handles OOM in a completely different way.

  b3f2359f
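
The rename in f5b5e605 above follows from how substring matching behaves. The sketch below mimics a strstr-style check with Python's `in` operator; the helper names are hypothetical, and the token-based variant is shown only as an alternative approach (assuming comma-separated options), not as what the commit does.

```python
def substring_match(topology_param, option):
    """Mimic a strstr-style check: true if option appears anywhere in the string."""
    return option in topology_param


def token_match(topology_param, option):
    """Alternative approach: compare whole comma-separated tokens."""
    return option in topology_param.split(",")


print(substring_match("NoInAddrAnyCtld", "NoInAddrAny"))  # old name also matches NoInAddrAny
print(substring_match("NoCtldInAddrAny", "NoInAddrAny"))  # renamed option does not
print(token_match("NoInAddrAnyCtld", "NoInAddrAny"))      # whole-token comparison
```

Renaming the option keeps the existing substring check and removes the overlap between the two names.
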
11 Mar, 2016 2 commits
- Merge branch 'slurm-14.11' into slurm-15.08 · a912fb3b
  Tim Wickberg authored Mar 11, 2016
  
  a912fb3b
- Fix job array step function printout. · 03d29e24
  Tim Wickberg authored Mar 11, 2016

  Return [0-100:2] formatting, rather than [0,2,4,6,8,...] when using
  a step function.

  Was inadvertently broken in 14.11 with commit 5ffdca92.

  Bug 2535.

  03d29e24
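
The two output formats mentioned in 03d29e24 above can be compared with a short sketch. The function below is hypothetical and is not Slurm's formatting code; it handles only the evenly stepped case from the commit message and falls back to a comma-separated list for everything else.

```python
def format_task_ids(task_ids):
    """Render task IDs as [first-last:step] when they are evenly spaced."""
    ids = sorted(set(task_ids))
    if len(ids) >= 3:
        step = ids[1] - ids[0]
        evenly_spaced = all(b - a == step for a, b in zip(ids, ids[1:]))
        if step > 1 and evenly_spaced:
            return f"[{ids[0]}-{ids[-1]}:{step}]"
    # Fallback: list every ID explicitly.
    return "[" + ",".join(str(i) for i in ids) + "]"


print(format_task_ids(range(0, 101, 2)))   # compact step form
print(format_task_ids([1, 4, 9]))          # irregular spacing, explicit list
```

The compact form stays the same length however many tasks the array has, which is the point of preferring it over the expanded list.
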
10 Mar, 2016 2 commits

Add NEWS for commit 3bb2e602 · a0be0dc5
Morris Jette authored Mar 10, 2016

a0be0dc5

Cray Datawarp job requeue bug fix · 3bb2e602

Morris Jette authored Mar 10, 2016

burst_buffer/cray plugin: Prevent a requeued job from being restarted while
    file stage-out is still in progress. Previous logic could restart the job
    and not perform a new stage-in.
bug 2584, comment #45

3bb2e602

09 Mar, 2016 1 commit

cray job requeue bug · fec5e03b

Morris Jette authored Mar 09, 2016

Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.

Specifically, each job has a "cleaning/cleaned" flag. Once a job
terminates, the cleaning flag is set, then after the job node health
check completes, the value gets set to cleaned. If the job is requeued,
on its second (or subsequent) termination, the select/cray plugin
is called to launch the NHC. The plugin sees the "cleaned" flag
already set, so it logs:
error: select_p_job_fini: Cleaned flag already set for job 1283858, this should never happen
and returns, never launching the NHC. Since the termination of the
job NHC triggers releasing job resources (CPUs, memory, and GRES),
those resources are never released for use by other jobs.

Bug 2384

fec5e03b
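
The flag sequence described in fec5e03b can be traced with a small state model. This is a sketch with hypothetical names, not the select/cray plugin code. The commit message does not say how the fix works, so the `reset_flag` option below is purely an assumption used to contrast the two outcomes.

```python
class Job:
    def __init__(self):
        self.cleaned = False
        self.nhc_runs = 0
        self.resources_held = True


def job_fini(job):
    """Model of job termination: launch NHC unless the cleaned flag is already set."""
    if job.cleaned:
        return False              # error is logged and NHC is never launched
    job.nhc_runs += 1             # NHC runs; its completion releases resources
    job.cleaned = True
    job.resources_held = False
    return True


def requeue(job, reset_flag):
    """Model of a requeue; reset_flag is an assumed remedy, not taken from the source."""
    job.resources_held = True
    if reset_flag:
        job.cleaned = False


for reset_flag in (False, True):
    job = Job()
    job_fini(job)                 # first termination
    requeue(job, reset_flag)
    job_fini(job)                 # second termination
    print(reset_flag, job.nhc_runs, job.resources_held)
```

When the flag survives the requeue, the second termination skips the NHC and the job's resources stay held, which is the stuck state the commit message describes.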