Commits · 02876290ac35cfd476accfc11e63e0326e07bf8c · Manuel G. Marciani / ces_slurm_simulator

02 Jun, 2016 3 commits
- Fix wrong NEWS entry location from dbfbcfe9. · 02876290
  Tim Wickberg authored Jun 02, 2016
  
  02876290
- Print correct return code on failure to update node features through sview · dbfbcfe9
  Tim Wickberg authored Jun 02, 2016
```
Wrong order of operations results in the return code being 0/1.
```
  dbfbcfe9
- Fix issue where slurmd could core when running the ipmi energy plugin. · 880823f7
  Danny Auble authored Jun 02, 2016
```
If the plugin ever returns an error the variables weren't initialized so
when they were freed they could corrupt memory.

Bug 2790
```
  880823f7
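
The failure mode described in 880823f7 (freeing variables that were never initialized because an earlier call failed) can be sketched as follows. This is only an illustration, not the ipmi plugin code: `read_sensor` and `poll_once` are hypothetical names, and plain `malloc`/`free` stand in for whatever allocator the plugin uses.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reader: on failure it returns -1 and leaves *out untouched. */
int read_sensor(int fail, char **out)
{
    if (fail)
        return -1;
    *out = malloc(8);
    if (*out == NULL)
        return -1;
    strcpy(*out, "42W");
    return 0;
}

/* Hypothetical caller with a single cleanup path. */
int poll_once(int fail)
{
    char *reading = NULL;   /* initialized, so the free() below is always safe */
    int rc = read_sensor(fail, &reading);

    free(reading);          /* free(NULL) is a no-op; freeing garbage is not */
    return rc;
}
```

Without the `= NULL` initializer, the error path would pass an indeterminate pointer to `free()`, which is undefined behavior and can corrupt the heap.
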
31 May, 2016 4 commits
- Start NEWS for v15.08.13 · 29143a27
  Morris Jette authored May 31, 2016
  
  29143a27
- Fix backwards compatibility with sreport going to <= 14.11 coming from · d82263d2
  Danny Auble authored May 31, 2016
```
>= 15.08 for some reports.
```
  d82263d2
- Fix bad order-of-operations in _connect_srun_cr. · c53ce3f6
  Tim Wickberg authored May 31, 2016
```
Prevents correct error handling by rc being 0/1 instead of the original
return code.

Also fix slurm_send_only_controller_msg and slurm_send_only_node_msg
although these only result in bad printed values in the debug message.
```
  c53ce3f6
- Fix Hidden error during _rpc_forward_data call. · ecc2f7a4
  Artem Polyakov authored May 31, 2016
```
Bug 2120
```
  ecc2f7a4
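
The order-of-operations problem named in c53ce3f6 (and in dbfbcfe9) follows from C operator precedence: a comparison binds tighter than an assignment. A minimal sketch, assuming a hypothetical `send_msg` that fails with a specific negative code (the real Slurm functions are not reproduced here):

```c
/* Hypothetical stand-in for a call that fails with a specific error code. */
int send_msg(void)
{
    return -5;
}

/* Buggy form: '<' is evaluated first, so rc stores the comparison result. */
int rc_buggy(void)
{
    int rc;

    if ((rc = send_msg() < 0))
        return rc;          /* rc is 1 here, the original code is lost */
    return 0;
}

/* Fixed form: parenthesize the assignment, then compare. */
int rc_fixed(void)
{
    int rc;

    if ((rc = send_msg()) < 0)
        return rc;          /* rc is the code returned by send_msg() */
    return 0;
}
```

With this stand-in, `rc_buggy()` yields 1 while `rc_fixed()` yields -5, which is the "0/1 instead of the original return code" symptom the commit messages describe.
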
27 May, 2016 3 commits

Fix for tracking a node's allocated CPUs with gang scheduling. · 4ce62678

Morris Jette authored May 26, 2016

This bug was introduced by commit 21c52d2f
which fixed a different problem tracking resources associated with suspended
jobs. There are subtle differences between jobs that are suspended by a
user/administrator and jobs suspended by gang scheduling which resulted in
undercounting allocated CPUs when a job suspended by gang scheduling
was active at the same time as a slurmctld reconfiguration request.
See bug 2353 (the original bug, related to commit 21c52d2f)
and bug 2765.

4ce62678

If no default account is given for a user when creating (only a list of · 9917c49d

Danny Auble authored May 26, 2016

accounts), no default account is printed; previously NULL was printed.

This change only suppresses the printing, but the whole function should
probably be revisited, as the rigmarole can probably be avoided: we always
know what the default is going to be if none is specified (first off the list).

The problem with that, though, is if the user has been added to a cluster
already and they have a default, but is then added to a new cluster where
they don't have a default. In this case you want to keep the first
cluster's default, but set the default for the second cluster.

Bug 2725

9917c49d

Make it so --mail-type=NONE doesn't throw an invalid error. · 2a817734

Danny Auble authored May 26, 2016

2a817734

25 May, 2016 2 commits
- Prevent possible deadlock in acct_gather_filesystem/lustre · 8bb7711e
  Tim Wickberg authored May 24, 2016
```
Add missing unlock before return. Coverity 44888.
```
  8bb7711e
- Prevent multiple responses to REQUEST_UPDATE_JOB_STEP from missing break. · 3fa3ad1e
  Tim Wickberg authored May 24, 2016
```
Coverity 44891.
```
  3fa3ad1e
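
The deadlock fixed in 8bb7711e comes from an early return that skips the unlock. A minimal sketch of the pattern, assuming a hypothetical `update` function guarded by a pthread mutex (not the lustre plugin code itself):

```c
#include <pthread.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int total;

/* Hypothetical update routine with an early error return. */
int update(int value)
{
    pthread_mutex_lock(&state_lock);
    if (value < 0) {
        /* The unlock that is easy to forget on an error path. */
        pthread_mutex_unlock(&state_lock);
        return -1;
    }
    total += value;
    pthread_mutex_unlock(&state_lock);
    return 0;
}
```

If the unlock on the error path were missing, the next call to `update()` would block forever on `state_lock`.
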
24 May, 2016 6 commits
- Fix assignment instead of comparison to prevent infinite loop on failed execve. · a8601e91
  Tim Wickberg authored May 24, 2016
```
Coverity 44992.
```
  a8601e91
- Prevent possible deadlock in proctrack/lua. · 1d3da383
  Tim Wickberg authored May 24, 2016
```
Needs to unlock here, not re-lock the lock.
```
  1d3da383
- Add missing break to plugstack to prevent wrong verbose message. · 5ac6b927
  Tim Wickberg authored May 24, 2016
  
  5ac6b927
- Add missing break to sbcast option parsing. · 8013d254
  Tim Wickberg authored May 24, 2016
```
Prevent '--preserve' from being inadvertently enabled by '-j'.
```
  8013d254
- Add missing break to prevent sbatch --array from inadvertently enabling debug. · ada6f5c8
  Tim Wickberg authored May 24, 2016
  
  ada6f5c8
- Add missing break statement to fix CFULL_BLOCK distribution type. · 1189c849
  Tim Wickberg authored May 24, 2016
  
  1189c849
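
Several of the commits above add a missing `break`. In a C `switch`, a case without `break` falls through into the next case, so one option silently enables another. A minimal sketch, assuming a hypothetical option struct and option letters (not the actual sbcast or sbatch parsers):

```c
#include <stdbool.h>

/* Hypothetical option state for this sketch. */
struct opts {
    bool jobid_set;
    bool preserve;
};

/* Parse one short option letter; the letters are illustrative only. */
void parse_opt(struct opts *o, char c)
{
    switch (c) {
    case 'j':
        o->jobid_set = true;
        break;              /* without this, control falls into case 'p' */
    case 'p':
        o->preserve = true;
        break;
    default:
        break;
    }
}
```

With the `break` in place, parsing `'j'` leaves `preserve` untouched; removing it would set both flags.
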
18 May, 2016 2 commits
- MYSQL - Fix order of operations issue where if the database is locked up · d378b297
  Danny Auble authored May 18, 2016
```
and the slurmctld doesn't wait long enough for the response it would give
up leaving the connection open and create a situation where the next message
sent could receive the response of the first one.

Bug 2739
```
  d378b297
- Fix MemSpecLimit to depend on TaskPlugin=task/cgroup and ConstrainRAMSpace. · f4ebc793
  Alejandro Sanchez authored May 18, 2016
```
Bug #2713.
```
  f4ebc793
17 May, 2016 1 commit
- Fix jobcomp/elasticsearch plugin build when curl is installed in a non standard location. · 370e397f
  Tim Wickberg authored May 17, 2016
  
  370e397f
16 May, 2016 2 commits
- Apply six patches from pkgsrc package to fix NetBSD build. · c8116026
  Jason Bacon authored May 16, 2016
  
  c8116026
- Fix typo in NEWS · c2cd292a
  Morris Jette authored May 16, 2016
  
  c2cd292a
13 May, 2016 1 commit

Fix race condition with respect to cleaning up the profiling threads · b1fbeb85

Danny Auble authored May 12, 2016

when in use.

The problem here is the polling threads in the various acct_gather codes
were detached and could possibly still be polling after the plugin had
been unloaded, causing a segfault with a backtrace like this...

#0  0x00007fe7af008c00 in ?? ()
#1  0x00007fe7b1138479 in __nptl_deallocate_tsd () at pthread_create.c:175
#2  0x00007fe7b11398b0 in __nptl_deallocate_tsd () at pthread_create.c:326
#3  start_thread (arg=0x7fe7b1f12700) at pthread_create.c:346
#4  0x00007fe7b0e6fb5d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The fix was to make the threads non-detached and join them before calling
a dlclose.

b1fbeb85
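
The fix described above (joinable threads, joined before the plugin is unloaded) can be sketched roughly as below. The names `plugin_start`, `plugin_stop`, and `poll_loop` are hypothetical, and the actual `dlclose()` call is left out; the point is only that `pthread_join()` guarantees the polling thread has exited before its code is unmapped.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool shutdown_flag;
static atomic_int polls;
static pthread_t poll_tid;

/* Hypothetical polling loop; runs until asked to stop. */
static void *poll_loop(void *arg)
{
    (void)arg;
    while (!atomic_load(&shutdown_flag)) {
        atomic_fetch_add(&polls, 1);
        sched_yield();
    }
    return NULL;
}

int plugin_start(void)
{
    atomic_store(&shutdown_flag, false);
    /* Default attributes create a joinable (non-detached) thread. */
    return pthread_create(&poll_tid, NULL, poll_loop, NULL);
}

int plugin_stop(void)
{
    atomic_store(&shutdown_flag, true);
    /* Returns only after poll_loop has exited; unloading is safe after this. */
    return pthread_join(poll_tid, NULL);
}
```

A detached thread offers no such synchronization point, so the loader could unmap the plugin while the thread is still executing inside it.
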

12 May, 2016 1 commit

If the cluster name and state are stored on NFS (with root_squash), · e422127c

Danny Auble authored May 12, 2016

trying to verify the cluster name (which may try to /create/ files or
directories) *before* dropping privs results in a fatal error as
slurmctld tries to create items, which ultimately fails.  Deferring
this step until after the privs and uid have changed allows
the process to succeed.

Reported by Jon Nelson <jdnelson@dyn.com>

Bug 2728

e422127c

11 May, 2016 2 commits
- MySQL - Fix potential memory leak when rolling up data. · 24ff890f
  Danny Auble authored May 10, 2016
  
  24ff890f
- Fix issue where, when TopologyParam=NoInAddrAny is set, the responses wouldn't · 03a9e836
  Danny Auble authored May 10, 2016
```
make it to the slurmctld when using message aggregation.
```
  03a9e836
10 May, 2016 4 commits

Perlapi - Remove unneeded/undefined mutex. · 7276f432

Danny Auble authored May 10, 2016

7276f432

Add array_inx to job_submit.lua plugin for job arrays. · f52a3317

Tim Wickberg authored May 10, 2016

f52a3317

Changes have been made to the cray/select plugin aeld code · 2959a1e6

Marlys Kohnke authored May 10, 2016

    for better robustness.  This cray/select plugin code has
    been modified to remove a possible timing window where two
    aeld pthreads could exist, interfering with each other through
    the global aeld_running variable.

    An additional validity check has been added to the data provided
    to aeld through an alpsc_ev_set_application_info() call.
    If an error is returned from that call, only certain errors
    need the current socket connection closed to aeld and a new
    connection established.  Other error returns will log an
    error message and keep the current session established with
    aeld.

2959a1e6

Fix issue where daemons would only listen on specific address given in · 79c9a499

Danny Auble authored May 09, 2016

slurm.conf instead of all.  If looking for specific addresses use
TopologyParam options No*InAddrAny.

This was broken in 15.08 with the advent of the referenced TopologyParams;
the commits 9378f195 and c5312f52 are no longer needed.

Bug 2696

79c9a499

09 May, 2016 2 commits
- Move code into #ifdef if there is no hwloc · 9129192b
  Danny Auble authored May 09, 2016
  
  9129192b
- MySQL - Fix for possible race condition when archiving multiple clusters · eb208763
  Moe Jette authored May 09, 2016
```
at the same time.

Bug 2683

Turns out making a variable static in a function will make it not safe
when dealing with threads.
```
  eb208763
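
The remark in eb208763 about static variables can be illustrated with a function that returns a pointer to a static buffer: every caller, and therefore every thread, shares the same storage. This is a sketch only; the function names and the `_job_table` suffix are hypothetical and not taken from the accounting storage code.

```c
#include <stdio.h>
#include <string.h>

/* Unsafe with threads: all callers share the one static buffer. */
const char *table_name_static(const char *cluster)
{
    static char buf[64];

    snprintf(buf, sizeof(buf), "%s_job_table", cluster);
    return buf;
}

/* Safer: each caller supplies its own buffer. */
const char *table_name_r(const char *cluster, char *buf, size_t len)
{
    snprintf(buf, len, "%s_job_table", cluster);
    return buf;
}
```

Even in a single thread, two calls to `table_name_static()` return the same pointer and the second call overwrites the first result. With two threads archiving different clusters at once, the same overwrite becomes a race.
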
06 May, 2016 1 commit

Fix for slurmstepd segfault · db0fe22e

John Thiltges authored May 06, 2016

With slurm-15.08.10, we're seeing occasional segfaults in slurmstepd. The logs point to the following line: slurm-15.08.10/src/slurmd/slurmstepd/mgr.c:2612

On that line, _get_primary_group() is accessing the results of getpwnam_r():
    *gid = pwd0->pw_gid;

If getpwnam_r() cannot find a matching password record, it will set the result (pwd0) to NULL, but still return 0. When the pointer is accessed, it will cause a segfault.

Checking the result variable (pwd0) to determine success should fix the issue.

db0fe22e

05 May, 2016 2 commits
- Correct NEWS header · 82a05778
  Morris Jette authored May 05, 2016
  
  82a05778
- Don't power down dead node · b4904661
  Morris Jette authored May 05, 2016
```
Do not attempt to power down a node which has never responded if the
slurmctld daemon restarts without state.

Bug 2698
```
  b4904661
03 May, 2016 4 commits
- Fix sacctmgr to remove a user who has no associations. · 7651754d
  Danny Auble authored May 03, 2016
  
  7651754d
- Fix potential gres underflow on restart of slurmctld · bd93b6e6
  Brian Christiansen authored May 03, 2016
  
  bd93b6e6
- Clarify behavior of 'srun --export=NONE' in man page. · 370695a1
  Tim Wickberg authored May 02, 2016
  
  370695a1
- Correct prolog_epilog.shtml with current behavior if Prolog fails. · 0c3f2709
  Eric Martin authored May 02, 2016
  
  0c3f2709