Commits · 07ce0773ad7b101a3797704613bb8e64e69c1e6e · Manuel G. Marciani / ces_slurm_simulator

13 Mar, 2017 1 commit

Alejandro Sanchez authored Mar 13, 2017

The code calls list_find_first to search resv_list for the name requested
for the new reservation. If that name already exists, resv_ptr is
set to point to the existing reservation. The code then jumps to the
bad_parse label and xfrees that resv_ptr, corrupting the list data
by freeing the existing reservation. This is fixed by freeing memory only for the
new local resv_ptr instead of always freeing it. xfree alone is also not
sufficient to free the memory; we needed to call _del_resv_rec() or we would
leak the memory we had transferred from the resv_desc_ptr. This also involved
NULLing out the other variables freed after bad_parse, or you would get
double frees.

Bug 3558.

07ce0773
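
The ownership rule behind this fix can be illustrated with a minimal, self-contained C sketch. It is not the Slurm code: `resv_t`, `find_resv`, `create_resv`, and `free_resv` are hypothetical stand-ins for the reservation record, the list_find_first lookup, the create routine, and _del_resv_rec(). The point is that the error path releases only a record allocated by the current call, never one the lookup returned from the list.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a reservation record. */
typedef struct resv {
    char *name;
    struct resv *next;
} resv_t;

/* Return the reservation named `name` from the list, or NULL. */
static resv_t *find_resv(resv_t *head, const char *name)
{
    for (resv_t *r = head; r; r = r->next)
        if (strcmp(r->name, name) == 0)
            return r;
    return NULL;
}

/* Free a record and everything it owns (NULL is a no-op). */
static void free_resv(resv_t *r)
{
    if (!r)
        return;
    free(r->name);
    free(r);
}

/* Add a reservation; return 0 on success, -1 on error. */
static int create_resv(resv_t **head, const char *name)
{
    resv_t *new_resv = NULL;   /* owned by this call until linked in */
    int rc = -1;

    if (find_resv(*head, name))
        goto bad_parse;        /* the existing record is not ours to free */

    new_resv = calloc(1, sizeof(*new_resv));
    if (!new_resv)
        goto bad_parse;
    new_resv->name = malloc(strlen(name) + 1);
    if (!new_resv->name)
        goto bad_parse;
    strcpy(new_resv->name, name);

    new_resv->next = *head;
    *head = new_resv;
    new_resv = NULL;           /* ownership moved to the list */
    rc = 0;

bad_parse:
    free_resv(new_resv);       /* no-op unless this call still owns a record */
    return rc;
}
```

Setting the local pointer to NULL once ownership moves to the list mirrors the "NULLing out" step in the commit message: the shared cleanup label can then run unconditionally without risking a double free.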

11 Mar, 2017 1 commit
- Make sure locks are always in place when calling · b2bf5a06
  Danny Auble authored Mar 10, 2017
```
assoc_mgr_make_tres_str_from_array.
```
  b2bf5a06
10 Mar, 2017 5 commits
- Fix double read lock of tres when updating gres or licenses on a job. · 7af1dbb4
  Danny Auble authored Mar 10, 2017
  
  7af1dbb4
- Fix sacct from returning way more than requested when querying against a job · fc0221f2
  Tim Shaw authored Mar 10, 2017
```
array task id.

This is in combination with the previous commit.

Bug 3563
```
  fc0221f2
- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work. · 5a5347c7
  Danny Auble authored Mar 10, 2017
  
  5a5347c7
- Rename seff commands to match other scripts we do. The end product is the same. · 3c75a140
  Danny Auble authored Mar 10, 2017
```
I also made seff work on a non-standard perl install.

I would have loved to do this in multiple commits, but that wasn't possible.
```
  3c75a140
- Fix memory error when updating a job's licenses. · cfb90cfb
  Dominik Bartkiewicz authored Mar 09, 2017
  
  cfb90cfb
08 Mar, 2017 10 commits
- Fix possible double insertion into database when a job is updated at the · daa2d49a
  Danny Auble authored Mar 08, 2017
```
moment the dbd is assigning a db_index.

Bug 3512

This is a definite culprit of creating runaway jobs.
```
  daa2d49a
- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated · a6254afa
  Danny Auble authored Mar 08, 2017
```
because the dbd is setting a db_index.
```
  a6254afa
- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was · 9d3737e7
  Danny Auble authored Mar 08, 2017
```
in the middle of setting db_indexes.
```
  9d3737e7
- Fix missing unlock when job_list doesn't exist when starting priority/ · c545515b
  Danny Auble authored Mar 08, 2017
```
multifactor.  Almost impossible to make happen, but still a bug.
```
  c545515b
- Fix pam_slurm_adopt. · 68e64e69
  Tim Wickberg authored Mar 08, 2017
```
The stepd->protocol_version field is not set at this point.
If it is not handled here, the following call to stepd_get_uid
will fail, as a protocol_version of zero is rejected.

This appeared to work in 16.05 and older as the stepd_get_uid()
API call used to ignore the invalid protocol_version and
return the correct result anyways, but that was fixed in c177ff95.

Bug .
```
  68e64e69
- Fix gres output of a job if it is updated while pending to be displayed · fec995e0
  Tim Shaw authored Mar 08, 2017
```
correctly with Slurm tools.

What this does is replace the incoming string with the gres actually chosen.

Continuation of last commit.

Bug 3521
```
  fec995e0
- Fix rare segfault when shutting down slurmctld and still sending data to · 44eb7c52
  Danny Auble authored Mar 07, 2017
```
the database.
```
  44eb7c52
- Add s_p_get_uint64 to slurm_xlator.h to fix programs linking against libslurm. · a7699ba4
  Thomas Opfer authored Mar 08, 2017
```
Symbol required for read_slurm_cgroup_conf() call to work.

Bug 3550.
```
  a7699ba4
- burst_buffer/cray - Fix for hostlist parsing · 73328605
  Morris Jette authored Mar 08, 2017
```
burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
    allocation of "20,22" must be expressed as "20\n22" when passed to
    the dw_wlm_cli program.
bugs 3540 and 3544
```
  73328605
- Improve the srun documentation for the --resv-ports option · caefb2b4
  Morris Jette authored Mar 07, 2017
  
  caefb2b4
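
The burst_buffer/cray parsing fix above (73328605) states that a discontinuous allocation such as "20,22" must reach dw_wlm_cli as "20\n22". A minimal sketch of that conversion follows. It assumes the input is already a flat comma-separated list of node IDs with no bracketed ranges. `commas_to_newlines` is a hypothetical helper, not the function Slurm uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of `nodes` with every
 * ',' replaced by '\n'. The caller frees the result. Returns NULL on
 * allocation failure or NULL input.
 */
static char *commas_to_newlines(const char *nodes)
{
    if (!nodes)
        return NULL;
    size_t len = strlen(nodes);
    char *out = malloc(len + 1);
    if (!out)
        return NULL;
    for (size_t i = 0; i <= len; i++)   /* <= also copies the terminating NUL */
        out[i] = (nodes[i] == ',') ? '\n' : nodes[i];
    return out;
}
```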
07 Mar, 2017 3 commits

capmc_resume not changing node state · cecaf222

Morris Jette authored Mar 07, 2017

capmc_resume (Cray resume node script) - Do not disable changing a node's
    active features if SyscfgPath is configured in the knl.conf file.
bug 3533

cecaf222

Fix for job cancel when node reconfig fails · 2e21d9f7

Morris Jette authored Mar 07, 2017

If a job is cancelled by the user while its allocated nodes are being
reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
and the node reconfiguration fails (i.e. the reboot fails), then don't
requeue the job but leave it in a cancelled state. Note the JOB_RECONFIG_FAIL
state flag is currently only used by capmc_resume, but could be used for
other programs responsible for node reboots.
bug 3392

2e21d9f7

burst_buffer/cray - Add support for line continuation · c8146969
Morris Jette authored Mar 06, 2017
```
bug 3538
```
c8146969

03 Mar, 2017 2 commits
- Update hyperlink to LBNL Node Health Check program · 0aad41d4
  Morris Jette authored Mar 03, 2017
  
  0aad41d4
- Replace clock_gettime with time(NULL) for very old systems without the call. · 3ef2f41e
  Danny Auble authored Mar 02, 2017
  
  3ef2f41e
02 Mar, 2017 9 commits

Update NEWS for start of v17.02.2 work · 40a9bc5f
Morris Jette authored Mar 02, 2017

40a9bc5f
Start NEWS for v16.05.11 · 90f45b7a
Morris Jette authored Mar 02, 2017

90f45b7a

Fix for CPU binding for job steps run under a batch job · 5e80c115

Morris Jette authored Mar 02, 2017

This is a partial reversion of commit 69684648
NOTE: sbatch does not support --cpu_bind (although the documentation does list
  the option) and the --mem_bind options set SBATCH_* environment variables
  that nothing ever looks at. In other words, it needs some work.
Bugs 3519 and 3188

5e80c115

Added "SyscfgTimeout" parameter to knl.conf configuration file · 32ded0c3
Felip Moll authored Mar 02, 2017
```
bug 3525
```
32ded0c3

Update NEWS for v16.05 mods · 3f649cec

Morris Jette authored Mar 02, 2017

Copy NEWS item updates from v16.05 applied since v17.02.0 tag to
  NEWS for v17.02.1

3f649cec

power management data structure change · 79b76b40

Morris Jette authored Mar 02, 2017

Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
bug 3516

79b76b40

Do not set cpu frequency / governor on batch or extern steps. · 6ea92f5d

Tim Wickberg authored Mar 02, 2017

This now matches the behavior documented in sbatch.

This resolves a problem where the maximum cpu frequency would be set
to the minimum available on the node by the batch step. This is due
to the batch step leaving cpu_freq_{min,max,gov} uninitialized to zero,
which is then translated to a request to set the frequency to the lowest
available in the node. This did not impact 16.05 or earlier, as a request
for a zero frequency was ignored by a quirk of _cpu_freq_freqspec_num.
This quirk was removed by commit f40e1c01 before 17.02.0-rc1.

Bug 3510.

6ea92f5d
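
The failure mode described in 6ea92f5d can be modelled with a short C sketch. This is a hypothetical model, not Slurm's _cpu_freq_freqspec_num: `pick_freq_strict` maps any request onto the available range, so an unset (zero) request becomes the lowest frequency, while `pick_freq_guarded` treats zero as "no request" and leaves the current setting alone. The actual fix is to skip frequency handling for batch and extern steps. The guard here only illustrates why acting on an unset value is harmful.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model: return the lowest available frequency that is
 * >= `requested`, or the highest available one if none is. `avail` must be
 * sorted ascending and non-empty. A request of 0 therefore maps to avail[0].
 */
static uint32_t pick_freq_strict(const uint32_t *avail, size_t n,
                                 uint32_t requested)
{
    for (size_t i = 0; i < n; i++)
        if (avail[i] >= requested)
            return avail[i];
    return avail[n - 1];
}

/* Same, but a request of 0 means "not set": keep `current` unchanged. */
static uint32_t pick_freq_guarded(const uint32_t *avail, size_t n,
                                  uint32_t requested, uint32_t current)
{
    if (requested == 0)
        return current;
    return pick_freq_strict(avail, n, requested);
}
```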

Refactor slurmctld agent logic to eliminate some pthreads · e58c2282
Morris Jette authored Mar 02, 2017
```
bug 3516
```
e58c2282
Increase number of concurrent ResumePrograms that can be supported · d87f6de5
Morris Jette authored Mar 01, 2017
```
from 10 to 100.
bug 3516
```
d87f6de5

01 Mar, 2017 3 commits
- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly. · f338c4eb
  Alejandro Sanchez authored Mar 01, 2017
  
  f338c4eb
- Print formatted tres string when creating/updating a reservation. · 8dfffa28
  Danny Auble authored Feb 28, 2017
  
  8dfffa28
- Fix print of consumed energy in sstat when no energy is being collected. · 9a168d20
  Danny Auble authored Feb 28, 2017
  
  9a168d20
28 Feb, 2017 4 commits

If gres is NULL on a job don't try to process it when returning detailed · f7a24285
Dominik Bartkiewicz authored Feb 28, 2017
```
information about a job to scontrol.
```
f7a24285
Fix missing locks in gres logic to avoid potential memory race. · 58a2f450
Dominik Bartkiewicz authored Feb 28, 2017

58a2f450

Remove unneeded job lock when running assoc_mgr cache. This lock could · b17e2aee

Danny Auble authored Feb 28, 2017

cause potential deadlock when/if TRES changed in the database and the
slurmctld wasn't made aware of the change.  This would be very rare.

The lock was originally there to keep new jobs from grabbing the assoc
information.  If the lock was done afterwards the worst case is we get the new
information.

b17e2aee

Fix deadlock scenario when dumping configuration in the slurmctld. · 0c5e3508

Danny Auble authored Feb 28, 2017

It was determined we didn't need the write locks on the job and no locks were
needed on the node either.

Doing these different locks beforehand would create a window where you could
get a config write lock

0c5e3508

27 Feb, 2017 2 commits

Update slurm.spec file to note obsolete RPMs. · 95cf960a
Daniel Letai authored Feb 27, 2017

95cf960a

Reset job update time when job is held after begin failure · e62d288d

Morris Jette authored Feb 27, 2017

This will be triggered after either a burst buffer job_begin function
  or select plugin job_begin function fails. Without this change, the
  "squeue -i" and "scontrol show job" commands can report old job
  state information.
bug 3504

e62d288d