Commits · 1da1889b6f9256ba84a916e300d90485befd62cb · Manuel G. Marciani / ces_slurm_simulator

13 Dec, 2016 1 commit

Do not attempt to lookup task program in slurmd. · e3ca013b

Tim Wickberg authored Dec 13, 2016

Reverts most of commit 84023f27.

Searching the PATH in slurmd can fail due to root_squash'd NFS
filesystems, leading to the "wrong" program being launched.

If you'd like the performance benefit from avoiding this lookup
during each separate task launch, set SLURM_TEST_EXEC=1 instead
which will perform the lookup once within srun, which then ensures
the lookup happens under the user's own environment and not that
of the slurmd.

Bug 2992.

e3ca013b

09 Dec, 2016 1 commit
- Remove StoragePass from being printed out in the slurmdbd log at debug2 · b5a0fd98
  Danny Auble authored Dec 09, 2016
```
level.
```
  b5a0fd98
08 Dec, 2016 6 commits

add 16.05.8 for next release to NEWS · cb22a0c3
Danny Auble authored Dec 08, 2016

cb22a0c3

Fix race condition with getgrouplist(). · 8cb636dd

Tim Wickberg authored Dec 04, 2016

If the second call to getgrouplist() found additional groups,
ngroups will be overwritten with this new larger value, while
the gids list would be truncated. (ngroups is a value-result arg.)
This will then lead to _gids_cache_lookup() returning the wrong
number of groups including invalid parts of memory, which are likely
to include some zeros.

Those zeros could then make it to the setgroups() call and thus
give the user access to the root group. Especially as setgroups
will succeed as long as the array does not contain -1 as a gid.

Bug 3320.

8cb636dd
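The value-result behaviour of `ngroups` described in the commit above can be illustrated with a minimal sketch. This is not Slurm's code: `fetch_gids` is a hypothetical helper, and the sketch assumes glibc's `getgrouplist()` semantics (returns -1 and writes the required count into `ngroups` when the array is too small). The point is that the array is always resized to match before the count is trusted.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes getgrouplist() on glibc */
#endif
#include <assert.h>
#include <grp.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not Slurm's implementation): fetch the group list
 * for `user`. On success returns the number of gids and stores a malloc'd
 * array in *out (caller frees). Returns -1 on allocation failure.
 */
int fetch_gids(const char *user, gid_t primary, gid_t **out)
{
    assert(user != NULL && out != NULL);

    int capacity = 8;            /* initial guess, grown on demand */
    gid_t *gids = NULL;

    for (;;) {
        gid_t *tmp = realloc(gids, (size_t)capacity * sizeof(*gids));
        if (tmp == NULL) {
            free(gids);
            return -1;
        }
        gids = tmp;

        int ngroups = capacity;  /* in: array size, out: groups found/needed */
        if (getgrouplist(user, primary, gids, &ngroups) != -1) {
            *out = gids;
            return ngroups;      /* never larger than the array we passed */
        }

        /* Too small. Membership may change between calls, so loop
         * until a call succeeds instead of assuming two calls suffice. */
        capacity = (ngroups > capacity) ? ngroups : capacity * 2;
    }
}
```

Because the returned count always comes from a successful call made with a matching array size, a caller never reads past the entries that were actually filled in.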

Fix NEWS line from 1ccf8a72 . · b0838df0
Tim Wickberg authored Dec 08, 2016

b0838df0
Fix issue where task/cgroup would not always honor --cpu_bind=threads. · 1ccf8a72
Danny Auble authored Dec 08, 2016

1ccf8a72

Change task/cgroup error message · e6ef1f0c

Morris Jette authored Dec 08, 2016

task/cgroup - Change error message if CPU binding can not take place to
better identify the root cause of the problem. Specifically, if
the hwloc_get_obj_below_by_type() function call completely fails
that is likely due to task/affinity not being configured, so
cpusets are not configured. Previous message was
"task/cgroup: task[%u] infinite loop broken while trying to provision compute elements using %s (bitmap:%s)"
The new message is
"task/cgroup: hwloc_get_obj_below_by_type() failing, task/affinity plugin also required"

e6ef1f0c

Fix printf format specifier in elasticsearch plugin. %u not %hu. · fee2645d
Dominik Bartkiewicz authored Dec 08, 2016
```
uint32_t needs %u on 32-bit platforms. Noticed by clang/travisci.
```
fee2645d
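As a small illustration of the format mismatch above, here is a sketch with a hypothetical helper, `format_u32`. The commit itself switches to `%u`; the sketch uses `PRIu32` from `<inttypes.h>` as a portable alternative, which is an assumption of this example and not what the plugin does.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a 32-bit value into buf.
 * Returns whatever snprintf() returns. */
int format_u32(char *buf, size_t size, uint32_t value)
{
    assert(buf != NULL && size > 0);

    /* PRIu32 expands to the conversion that matches uint32_t on this
     * platform. "%hu" asks printf to treat the argument as an unsigned
     * short, which on platforms with a 16-bit short typically drops
     * the high bits of larger values. */
    return snprintf(buf, size, "%" PRIu32, value);
}
```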

07 Dec, 2016 2 commits
- Fix possible memory corruption if a job is using GRES and changing size. · 2973fd06
  Danny Auble authored Dec 07, 2016
```
Bug 3258
```
  2973fd06
- Revert "Fix possible memory corruption if a job is using GRES and changing size." · 4e6df565
  Danny Auble authored Dec 06, 2016
```
This reverts commit 817c2ca4.

# Conflicts:
#	NEWS
```
  4e6df565
06 Dec, 2016 7 commits

Remove error messages about gres counts changing when a job is resized on · 883af4f2
Danny Auble authored Dec 06, 2016
```
a slurmctld restart or reconfig, as they aren't really error messages.

Bug 3258
```
883af4f2
Fix possible memory corruption if a job is using GRES and changing size. · 817c2ca4
Danny Auble authored Dec 06, 2016
```
Bug 3258
```
817c2ca4

Minor re-wording in NEWS · 835b9e11

Morris Jette authored Dec 06, 2016

Done just to run "git push" again after internal github error on
  previous push:
remote: Resolving deltas: 100% (4/4), completed with 4 local objects.
remote: Unexpected system error after push was received.
remote: These changes may not be reflected on github.com!
remote: Your unique error code: bdecb7b0f321368fe1f037a81a6e9c2c

835b9e11

Add missing early return to _drop_privileges() if _initgroups() call fails. · b5954e60

Tim Wickberg authored Dec 04, 2016

Note that this does not protect against all possible problems here.
The setgroups() call in Linux at least is willing to set any gid_t
value except -1 on a group, so calls will not always fail on corrupted
group lists.

Bug 3320.

b5954e60

Convert _file_bcast_register_file() to use _gids_cache_lookup(). · b3bdf30d
Tim Wickberg authored Dec 04, 2016
```
Remove uncached _get_grouplist() call which was only used here.

Bug 3315.
```
b3bdf30d
Fix test1.92 parsing · 020d4b64
Morris Jette authored Dec 05, 2016
```
Fix parsing in regression test1.92 for some prompts.
bug 2792
```
020d4b64

Don't use NUMA count from slurm.conf with KNL and FastSchedule=0 · 1ce9a7c4

Morris Jette authored Dec 05, 2016

Recognize a KNL's proper NUMA count (rather than setting it to the value
    in slurm.conf) when using FastSchedule=0. Previous logic would change
    the NUMA count on the node to match what was in slurm.conf, which would
    mess up task layout with respect to the sockets.
bug 3306

1ce9a7c4

05 Dec, 2016 2 commits

On state restore in the slurmctld don't overwrite the mem_spec_limit given · 1eeb9e45
Danny Auble authored Dec 05, 2016
```
from the slurm.conf when using FastSchedule=0.
```
1eeb9e45

cray/burst_buffer - slurmctld restart fix · a88a961c

Morris Jette authored Dec 05, 2016

cray/burst_buffer - If slurmctld daemon restarts with pending job and burst
    buffer having unknown file stage-in status, teardown the buffer, defer the
    job, and start stage-in over again.
bug 3295

a88a961c

02 Dec, 2016 3 commits
- NRT - Make it so you can have more than 1 protocol listed in MP_MSG_API · b3b7cf2e
  Danny Auble authored Dec 02, 2016
```
bug 3314
```
  b3b7cf2e
- NRT - Make it so protocols pgas and test are allowed to be used. · adaab822
  Danny Auble authored Dec 02, 2016
  
  adaab822
- Make it so a system running against IBM's PE will work with PE version 1.3 · a037af18
  Danny Auble authored Dec 02, 2016
  
  a037af18
01 Dec, 2016 2 commits

Make sure if a job can't run because of resources we also check accounting · 031c467f

Dominik Bartkiewicz authored Dec 01, 2016

limits after the node selection to make sure it doesn't violate those limits
and if it does change the reason for waiting so we don't reserve resources
on jobs violating accounting limits.

Bug 3029

031c467f

knl_cray: Fix KNL mode/feature race condition · 46128f2b

Morris Jette authored Nov 30, 2016

node_features/knl_cray - Fix possible race condition when changing node
    state that could result in old KNL mode as an active feature.
bug 3235

46128f2b

30 Nov, 2016 2 commits

cray/burst_buffer - Increase timer · b4763c75

Morris Jette authored Nov 30, 2016

cray/burst_buffer - Increase time to synchronize operations between threads
    from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
    This should fix a race condition between a thread performing a buffer
    creation (setup) and a thread looking for unexpected buffers. If a
    buffer is found during the time window allowed for creation, its
    space will be counted twice. First by the status checking thread
    and second by the thread doing the creation. The deallocation only
    happens once, so the used space information can be left with an
    invalid value.
bug 3295

b4763c75

sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9

Tim Wickberg authored Nov 30, 2016

static variable means multiple active decompression streams will corrupt
zlib's internal state, which can lead to a segfault.

Bug 3299.

8c5765c9
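The hazard described above (one `static` state shared by several active streams) can be sketched without zlib. In this hedged example all names (`stream_ctx`, `stream_init`, `stream_feed`) are hypothetical, and a running checksum stands in for the decompressor state. A real fix would keep something like a `z_stream` per transfer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stream state. In a real decompressor this would hold
 * the library's stream object instead of a toy checksum. */
struct stream_ctx {
    unsigned long bytes_seen;
    unsigned long checksum;
};

void stream_init(struct stream_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->bytes_seen = 0;
    ctx->checksum = 0;
}

/* Feed one chunk into ctx. No function-level static is used, so two
 * interleaved streams never touch each other's state. */
void stream_feed(struct stream_ctx *ctx, const unsigned char *chunk, size_t len)
{
    assert(ctx != NULL && (chunk != NULL || len == 0));
    for (size_t i = 0; i < len; i++)
        ctx->checksum = ctx->checksum * 31 + chunk[i];
    ctx->bytes_seen += len;
}
```

With caller-owned contexts, chunks from different transfers can arrive in any interleaving and each stream still ends up with the result it would have had on its own.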

29 Nov, 2016 1 commit

Fix SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP. · bb06dd65

Alejandro Sanchez authored Nov 29, 2016

On a reconfig, the exc_node_bitmap is cleared but then it was
not built again since last_work_scan was declared as a local static
variable in _do_power_work(). The fix is to make it global within the
plugin and reinitialize it to 0 on _init_power_config().

Bug 3078.

bb06dd65

28 Nov, 2016 3 commits
- Make the openssl crypto plugin compile with openssl >= 1.1. · fd747355
  Alejandro Sanchez authored Nov 28, 2016
  
  fd747355
- sacctmgr - prevent segfault when trying to reset usage for an invalid account · 9e028071
  Dominik Bartkiewicz authored Nov 28, 2016
```
Bug 3267.
```
  9e028071
- srun - prevent segfault in launch plugin when terminating not-yet-created step. · d4aa1998
  Dominik Bartkiewicz authored Nov 28, 2016
```
Termination can race against step creation if, e.g., ill-behaved SPANK plugins
are in use.

Bug 3248.
```
  d4aa1998
22 Nov, 2016 5 commits

Correct malloc data type · a12e1a1c

Morris Jette authored Nov 22, 2016

sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int). No failures observed, but if type "int" were smaller than
"uint32_t", it could result in an invalid memory reference.

a12e1a1c
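One common way to keep an allocation in step with its declared type is to size it from the pointer itself. The following is a sketch only: `alloc_u32_array` is a hypothetical helper, not the code from the backfill plugin.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an array of `count` uint32_t values.
 * Returns NULL on allocation failure; the caller frees the result. */
uint32_t *alloc_u32_array(size_t count)
{
    assert(count > 0);

    uint32_t *arr;
    /* sizeof(*arr) follows the declared element type, so the element
     * size cannot drift from the declaration the way a hand-written
     * sizeof(int) can. */
    arr = malloc(count * sizeof(*arr));
    return arr;
}
```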

Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e

Sergey Meirovich authored Nov 22, 2016

Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
in turn slurm_job_cpus_allocated_str_on_node() to return correct
results for anything but the first node. This was caused by missing logic
to calculate the first bit belonging to a particular node. Lookup was always
starting from bit 0.

Bug 3266.

0ed6488e
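The missing step named above, finding where a given node's bits start, can be sketched as follows. The layout is an assumption made for illustration (one flat bitmap with each node's bits stored back to back), and `first_bit_for_node` and `bits_per_node` are hypothetical names, not the Slurm API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: a single flat bitmap holds every node's bits back to
 * back, and bits_per_node[i] is the number of bits owned by node i.
 * Returns the index of the first bit that belongs to node_id. */
size_t first_bit_for_node(const uint16_t *bits_per_node,
                          size_t node_cnt, size_t node_id)
{
    assert(bits_per_node != NULL && node_id < node_cnt);

    size_t offset = 0;
    for (size_t i = 0; i < node_id; i++)
        offset += bits_per_node[i];   /* skip bits owned by earlier nodes */
    return offset;
}
```

Starting every lookup at bit 0 is only correct for node 0. For any later node the scan has to begin at the offset computed here.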

backfill algorithm logic · e089b63a

Morris Jette authored Nov 22, 2016

After one second of wall time, simulate the termination of all remaining
   running jobs in order to respond in a reasonable time frame.
bug 3275

e089b63a

Modify backfill algorithm · 6008b021

Morris Jette authored Nov 22, 2016

Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using a
    time window that grows exponentially rather than linearly. The original
    window sizes were (in units of minutes):
    0, 1, 2, 3, 4, 5, 6, 7, ... minutes
    The new window sizes are:
    0.5, 1, 2, 4, 8, 16, 32, ... minutes
    This can dramatically reduce the number of instances where the very time
    consuming "can the pending job run now" operation is executed, especially
    if there are 1000+ running jobs.
bug 3275

6008b021
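The doubling sequence listed above (0.5, 1, 2, 4, ... minutes) can be reproduced with a short sketch. It only generates the window sizes and is not the backfill scheduler's logic. `window_seconds` and `windows_to_reach` are hypothetical names.

```c
#include <assert.h>

/* Width in seconds of the i-th grouping window: 30 s, 60 s, 120 s, ...
 * which is the 0.5, 1, 2, 4, ... minute sequence. */
unsigned long window_seconds(unsigned int i)
{
    assert(i < 26);   /* 30 << 25 still fits in a 32-bit unsigned long */
    return 30UL << i;
}

/* Number of doubling windows generated before one is at least
 * horizon_s seconds wide (capped by the shift limit above). */
unsigned int windows_to_reach(unsigned long horizon_s)
{
    unsigned int i = 0;
    while (i < 25 && window_seconds(i) < horizon_s)
        i++;
    return i + 1;
}
```

Under this sketch, reaching a window one day wide (86400 seconds) takes 13 doubling steps, whereas growing the window by one minute at a time would take 1440 steps.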

testsuite - fix job id output in test17.39 · 44241006
Nicolas Joly authored Nov 22, 2016

44241006

14 Nov, 2016 1 commit

avoid additional job allocations on booting nodes · b927fb08

Morris Jette authored Nov 14, 2016

If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
bug 3256

b927fb08

13 Nov, 2016 1 commit
- cgroup plugins - fix two minor memory leaks · 85ab952a
  Alejandro Sanchez authored Nov 13, 2016
```
Found with valgrind. Bug 2846.
```
  85ab952a
11 Nov, 2016 3 commits
- knl_cray plugin - Avoid abort from backup slurmctld at start time · 9702ec22
  Morris Jette authored Nov 11, 2016
```
Move where we set the configuration table bitmaps in order to support
  the backup slurmctld starting and recovering previously saved
  KNL mode information (which can necessitate rebuilding the node
  configuration table).
bug 3241
```
  9702ec22
- Docs - elaborate on how to clear TRES limits in sacctmgr. · cdc737e6
  Tim Wickberg authored Nov 11, 2016
```
Bug 3255.
```
  cdc737e6
- switch/cray plugin - fix use after free in debug message. · e391622e
  David Gloe authored Nov 10, 2016
```
Bug 3253.
```
  e391622e