Commits · 0aadbcc4437c547bd341175918d1500ec3b2f2cf · Manuel G. Marciani / ces_slurm_simulator

05 Dec, 2016 4 commits

Document that KNL requires HWLOC · 0aadbcc4
Morris Jette authored Dec 05, 2016
```
HWLOC is required to properly determine topology
```
0aadbcc4
On state restore in the slurmctld don't overwrite the mem_spec_limit given · 1eeb9e45
Danny Auble authored Dec 05, 2016
```
from the slurm.conf when using FastSchedule=0.
```
1eeb9e45

cray/burst_buffer - slurmctld restart fix · a88a961c

Morris Jette authored Dec 05, 2016

cray/burst_buffer - If slurmctld daemon restarts with pending job and burst
    buffer having unknown file stage-in status, teardown the buffer, defer the
    job, and start stage-in over again.
bug 3295

a88a961c

burst_buffer/cray: Improve logging · 1f0d150b

Morris Jette authored Dec 05, 2016

Add more detail to log message and change from error to debug2
  with an explanation of how this happens

1f0d150b

02 Dec, 2016 3 commits
- NRT - Make it so you can have more than 1 protocol listed in MP_MSG_API · b3b7cf2e
  Danny Auble authored Dec 02, 2016
```
bug 3314
```
  b3b7cf2e
- NRT - Make it so protocols pgas and test are allowed to be used. · adaab822
  Danny Auble authored Dec 02, 2016
  
  adaab822
- Make it so a system running against IBM's PE will work with PE version 1.3 · a037af18
  Danny Auble authored Dec 02, 2016
  
  a037af18
01 Dec, 2016 5 commits
- Fix whitespace from commit 031c467f · ff2f4754
  Dominik Bartkiewicz authored Dec 01, 2016
  
  ff2f4754
- Make sure if a job can't run because of resources we also check accounting · 031c467f
  Dominik Bartkiewicz authored Dec 01, 2016
```
limits after the node selection to make sure it doesn't violate those limits
and if it does change the reason for waiting so we don't reserve resources
on jobs violating accounting limits.

Bug 3029
```
  031c467f
- Clarify core_spec selection algorithm · 49ef3bfc
  Morris Jette authored Dec 01, 2016
  
  49ef3bfc
- Fix misspelling of SLURM_SUCCESS in comments and documentation. · 83e9e0b7
  Nicolas Joly authored Dec 01, 2016
```
Bug 3301.
```
  83e9e0b7
- knl_cray: Fix KNL mode/feature race condition · 46128f2b
  Morris Jette authored Nov 30, 2016
```
node_features/knl_cray - Fix possible race condition when changing node
    state that could result in old KNL mode as an active feature.
bug 3235
```
  46128f2b
30 Nov, 2016 5 commits

Update to comments, no change to code · 9dd934c1
Morris Jette authored Nov 30, 2016

9dd934c1
Change variable name for better clarity · 2b3122b0
Morris Jette authored Nov 30, 2016
```
No change in logic
```
2b3122b0

cray/burst_buffer - Increase timer · b4763c75

Morris Jette authored Nov 30, 2016

cray/burst_buffer - Increase time to synchronize operations between threads
    from 5 to 60 seconds ("setup" operation time observed over 17 seconds).
    This should fix a race condition between a thread performing a buffer
    creation (setup) and a thread looking for unexpected buffers. If a
    buffer is found during the time window allowed for creation, its
    space will be counted twice. First by the status checking thread
    and second by the thread doing the creation. The deallocation only
    happens once, so the used space information can be left with an
    invalid value.
bug 3295

b4763c75

file_bcast.h - remove unused max_blocks member of file_bcast_info · 156e62e9

Tim Wickberg authored Nov 30, 2016

Never used and is uninitialized making backtraces more confusing.
Fix whitespace in bcast_parameters struct while here.

No functional change.

156e62e9

sbcast - prevent segfault in slurmd from multiple zlib compressed transfers · 8c5765c9

Tim Wickberg authored Nov 30, 2016

static variable means multiple active decompression streams will corrupt
zlib's internal state, which can lead to a segfault.

Bug 3299.

8c5765c9
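
The failure mode described in this commit (one shared decompression state used by several active streams) can be illustrated outside of slurmd. The following is a minimal sketch in Python, not the slurmd code: the `Transfer` class and the `run_interleaved` helper are hypothetical names introduced here, and the only library behavior relied on is that each `zlib.decompressobj()` carries its own stream state.

```python
import zlib
from itertools import zip_longest


class Transfer:
    """One in-flight compressed transfer that owns its decompression state."""

    def __init__(self):
        # A fresh decompressor per transfer; nothing is shared between streams.
        self._stream = zlib.decompressobj()
        self._data = bytearray()

    def feed(self, chunk):
        # Decompress the next chunk of this transfer only.
        self._data += self._stream.decompress(chunk)

    def finish(self):
        # Flush whatever the decompressor still buffers and return the payload.
        self._data += self._stream.flush()
        return bytes(self._data)


def split_chunks(blob, size):
    """Cut a byte string into fixed-size pieces (the last one may be shorter)."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]


def run_interleaved(payloads, chunk_size=8):
    """Compress each payload, then feed the chunks round-robin, as if several
    transfers were arriving at the same time (hypothetical test harness)."""
    compressed = [split_chunks(zlib.compress(p), chunk_size) for p in payloads]
    transfers = [Transfer() for _ in payloads]
    for round_of_chunks in zip_longest(*compressed):
        for transfer, chunk in zip(transfers, round_of_chunks):
            if chunk is not None:
                transfer.feed(chunk)
    return [t.finish() for t in transfers]
```

Because every `Transfer` holds its own decompressor, interleaving chunks from different transfers leaves each result intact. A single decompressor shared by all transfers would mix the streams, which is the kind of state corruption the commit message attributes to the static variable.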

29 Nov, 2016 4 commits
- Fix SuspendExcNodes and SuspendExcParts on slurmctld SIGHUP. · bb06dd65
  Alejandro Sanchez authored Nov 29, 2016
```
On a reconfig, the exc_node_bitmap is cleared but then it was
not built again since last_work_scan was declared as a local static
variable in _do_power_work(). The fix is to make it global within the
plugin and reinitialize it to 0 on _init_power_config().

Bug 3078.
```
  bb06dd65
- docs - change style.css to fix last bullet point in ul lists · 0addff39
  Tim Wickberg authored Nov 29, 2016
  
  0addff39
- docs - reformat publications page to fit in with new style. · 0c42ed74
  Tim Wickberg authored Nov 29, 2016
  
  0c42ed74
- docs - update publications with additional SC16 material · dd8f9ba1
  Tim Wickberg authored Nov 29, 2016
  
  dd8f9ba1
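
The SuspendExcNodes fix listed above (bb06dd65) comes down to a cached value that must be rebuilt after a reconfiguration. The following is a rough sketch in Python of that pattern, not the Slurm source: the class `PowerSaveState` and its methods are hypothetical, and the rule "rebuild the exclusion set only when `last_work_scan` is 0" is an assumption inferred from the commit message.

```python
class PowerSaveState:
    """Hypothetical stand-in for plugin-wide power-save state."""

    def __init__(self, excluded_nodes):
        self.configured_exclusions = set(excluded_nodes)
        self.exc_nodes = None      # built lazily on the first work scan
        self.last_work_scan = 0    # plugin-wide, so reconfiguration can reset it

    def init_power_config(self, excluded_nodes):
        """Reconfigure: clear the cached exclusions and reset the scan time."""
        self.configured_exclusions = set(excluded_nodes)
        self.exc_nodes = None
        # Without this reset the next scan would skip the rebuild below
        # (assumed to be the behavior the commit fixes).
        self.last_work_scan = 0

    def do_power_work(self, now, idle_nodes):
        """Return the idle nodes that may be suspended at time `now`."""
        if self.last_work_scan == 0:
            # First scan after start or reconfiguration: rebuild the cache.
            self.exc_nodes = set(self.configured_exclusions)
        self.last_work_scan = now
        return sorted(n for n in idle_nodes if n not in self.exc_nodes)
```

In this sketch, keeping `last_work_scan` on the object plays the role of making the variable global within the plugin: `init_power_config()` can reach it, which a function-local static would not allow.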
28 Nov, 2016 4 commits
- Make the openssl crypto plugin compile with openssl >= 1.1. · fd747355
  Alejandro Sanchez authored Nov 28, 2016
  
  fd747355
- docs - update publications list with initial set of SC16 booth / BOF talks · 7a1e2041
  Tim Wickberg authored Nov 28, 2016
  
  7a1e2041
- sacctmgr - prevent segfault when trying to reset usage for an invalid account · 9e028071
  Dominik Bartkiewicz authored Nov 28, 2016
```
Bug 3267.
```
  9e028071
- srun - prevent segfault in launch plugin when terminating not-yet-created step. · d4aa1998
  Dominik Bartkiewicz authored Nov 28, 2016
```
Termination can race against step creation if, e.g., ill-behaved SPANK plugins
are in use.

Bug 3248.
```
  d4aa1998
23 Nov, 2016 1 commit
- Fix issue with search not displaying things correctly. · d0884f81
  Danny Auble authored Nov 23, 2016
  
  d0884f81
22 Nov, 2016 5 commits

Correct malloc data type · a12e1a1c

Morris Jette authored Nov 22, 2016

sched/backfill plugin: Make malloc match data type (defined as uint32_t and
allocated as int). No failures were observed, but if type "int" is smaller
than "uint32_t", it could result in an invalid memory reference.

a12e1a1c

Fix slurm_job_cpus_allocated_str_on_node_id() API call. · 0ed6488e

Sergey Meirovich authored Nov 22, 2016

Fix API call: slurm_job_cpus_allocated_str_on_node_id() and
in turn slurm_job_cpus_allocated_str_on_node() to return correct
results for anything but the first node. This was caused by missing logic
to calculate the first bit belonging to a particular node. The lookup was
always starting from bit 0.

Bug 3266.

0ed6488e
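
The offset problem described in this commit can be sketched as follows. This is an illustration in Python under stated assumptions, not the Slurm API: it assumes a job's allocation is one flat bitmap in which each node owns a contiguous block of bits, and the names `first_bit_for_node`, `cpus_on_node`, and `bits_per_node` are hypothetical.

```python
def first_bit_for_node(bits_per_node, node_id):
    """Offset of the node's first bit: the sum of all preceding nodes' widths."""
    return sum(bits_per_node[:node_id])


def cpus_on_node(bitmap, bits_per_node, node_id):
    """Node-local indexes of the CPUs allocated on one node.

    Starting the scan at bit 0 for every node would return the first node's
    CPUs each time; starting at the computed offset reads the right block.
    """
    start = first_bit_for_node(bits_per_node, node_id)
    width = bits_per_node[node_id]
    return [i for i in range(width) if bitmap[start + i]]
```

For example, with `bits_per_node = [4, 4, 2]` and `bitmap = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]`, node 1 starts at bit 4 and `cpus_on_node(bitmap, bits_per_node, 1)` yields the node-local CPUs `[2, 3]`.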

backfill algorithm logic · e089b63a

Morris Jette authored Nov 22, 2016

After one second of wall time, simulate the termination of all remaining
   running jobs in order to respond in a reasonable time frame.
bug 3275

e089b63a

Modify backfill algorithm · 6008b021

Morris Jette authored Nov 22, 2016

Modify backfill algorithm to improve performance with large numbers of
    running jobs. Group running jobs that end in a "similar" time frame using a
    time window that grows exponentially rather than linearly. The original
    window sizes were (in units of minutes):
    0, 1, 2, 3, 4, 5, 6, 7, ... minutes
    The new window sizes are:
    0.5, 1, 2, 4, 8, 16, 32, ... minutes
    This can dramatically reduce the number of instances where the very time
    consuming "can the pending job run now" operation is executed, especially
    if there are 1000+ running jobs.
bug 3275

6008b021
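
To see why the exponential windows in this commit reduce the work, the two sequences can be generated and compared. The following is a minimal sketch in Python, not the sched/backfill code: the function names, the one-day horizon in the usage note, and the rule for assigning a job to a window (the first window edge at or after its remaining run time) are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def linear_windows(horizon_min):
    """Window edges 0, 1, 2, ... minutes, up to and including the horizon."""
    return list(range(0, horizon_min + 1))


def exponential_windows(horizon_min, first=0.5):
    """Window edges 0.5, 1, 2, 4, ... minutes, doubling until the horizon
    is covered."""
    edges = []
    edge = first
    while True:
        edges.append(edge)
        if edge >= horizon_min:
            break
        edge *= 2
    return edges


def group_by_window(end_minutes, edges):
    """Group jobs by the first window edge at or after their remaining time
    (an assumed grouping rule; jobs past the last edge go into the last
    window)."""
    groups = defaultdict(list)
    for minutes in end_minutes:
        idx = min(bisect_left(edges, minutes), len(edges) - 1)
        groups[edges[idx]].append(minutes)
    return dict(groups)
```

Over a hypothetical horizon of 1440 minutes, `linear_windows(1440)` has 1441 edges while `exponential_windows(1440)` has 13, so far fewer distinct end-time groups need the expensive "can the pending job run now" test.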

testsuite - fix job id output in test17.39 · 44241006
Nicolas Joly authored Nov 22, 2016

44241006

20 Nov, 2016 1 commit
- Fix formatting problem on man page · 924e1dfe
  Morris Jette authored Nov 20, 2016
  
  924e1dfe
15 Nov, 2016 1 commit
- doc/html - style adjustment · c048cace
  Tim Wickberg authored Nov 14, 2016
```
Prevent a scrollbar from appearing on the SchedMD logo in the top left.
```
  c048cace
14 Nov, 2016 5 commits
- avoid additional job allocations on booting nodes · b927fb08
  Morris Jette authored Nov 14, 2016
```
If a node is booting for some job, don't allocate additional jobs to the
    node until the boot completes.
bug 3256
```
  b927fb08
- Change http to https on schedmd as well as slurm references. · 41b87940
  Danny Auble authored Nov 14, 2016
  
  41b87940
- Add note about switching accounting_storage plugins at the same time · df00db73
  Danny Auble authored Nov 14, 2016
```
doing an upgrade.

It isn't advised.  Do one then the other.  Basically if you are using the
mysql plugin make sure you add the cluster to the system as the mysql
plugin doesn't do that explicitly.

Bug 3131
```
  df00db73
- Fix spelling in docs. · 9bb15992
  Brian Christiansen authored Nov 14, 2016
  
  9bb15992
- Make it so pages wrap if needed. Before it would just cut them off · 68b5b823
  Danny Auble authored Nov 13, 2016
```
and you wouldn't be able to read anything after the cut.
```
  68b5b823
13 Nov, 2016 2 commits
- cgroup plugins - fix two minor memory leaks · 85ab952a
  Alejandro Sanchez authored Nov 13, 2016
```
Found with valgrind. Bug 2846.
```
  85ab952a
- Make it so the pixels all line up. · 148e6552
  Danny Auble authored Nov 12, 2016
  
  148e6552