Commits · 1d2545ca746ea0a2f9653664a601052cfd5eb8ad · Manuel G. Marciani / ces_slurm_simulator

26 Aug, 2015 3 commits

Prevent wrong job array task ID from being shown · 1d2545ca

Morris Jette authored Aug 26, 2015

Prevent job array task ID from being reported as NO_VAL if last task in the
array gets requeued. The problem is that when that task starts, the task
bitmap entry for it stays set, but the task counter gets decremented.
If that job then gets requeued, under some conditions a failure to schedule
it results in the array_task_id in the job record getting set to NO_VAL.
Then, when building the job info to report for squeue/scontrol, the string
showing the pending task IDs is not rebuilt because that counter is
zero. All indications are that the job runs fine; only the information
reported to squeue/scontrol is wrong.
bug 1790
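
The failure mode described above can be illustrated with a toy model. This is only a sketch, not Slurm code: the class, the field names, the `trust_counter` switch, and the sentinel value are all hypothetical stand-ins, and the actual fix in this commit may work differently.

```python
from dataclasses import dataclass, field

NO_VAL = 0xFFFFFFFE  # hypothetical stand-in for a "no value" sentinel


@dataclass
class ToyArrayJob:
    """Toy job-array record; all field names are illustrative only."""
    task_bitmap: set = field(default_factory=set)  # task IDs whose bitmap entry is set
    task_cnt: int = 0                              # counter of pending tasks
    array_task_id: int = NO_VAL                    # task ID of this record, if any
    task_id_str: str = ""                          # cached string of pending task IDs


def reported_task_id(job: ToyArrayJob, trust_counter: bool) -> str:
    """Return the task ID text a status tool would show for this record."""
    if job.array_task_id != NO_VAL:
        return str(job.array_task_id)
    # Decide whether to rebuild the cached string from the counter
    # (the problematic behaviour) or from the bitmap contents.
    needs_rebuild = job.task_cnt > 0 if trust_counter else bool(job.task_bitmap)
    if needs_rebuild:
        job.task_id_str = ",".join(str(t) for t in sorted(job.task_bitmap))
    return job.task_id_str or "NO_VAL"


# The last task (ID 2) has started: its bitmap entry is still set,
# but the counter has already been decremented to zero.
job = ToyArrayJob(task_bitmap={2}, task_cnt=0, array_task_id=2)
# The task is requeued and a scheduling failure resets the task ID.
job.array_task_id = NO_VAL

print(reported_task_id(job, trust_counter=True))   # NO_VAL
print(reported_task_id(job, trust_counter=False))  # 2
```

In this model the job itself is unaffected; only the reported string depends on whether the counter or the bitmap is consulted.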


ALPS - Make it so srun --hint=nomultithread works correctly. · 9fc01907

Danny Auble authored Aug 25, 2015

MySQL - Fix minor memory leak if a connection ever goes away whilst using it. · c3d88da8

Danny Auble authored Aug 25, 2015

25 Aug, 2015 2 commits
- Fix testing for CR_Memory when CR_Memory and CR_ONE_TASK_PER_CORE are used with select/linear. · 3227b364
  Brian Christiansen authored Aug 25, 2015
```
Bug 1873
```
- ALPS - Fix compile to not link against -ljob and -lexpat with every lib · 92b6e921
  Danny Auble authored Aug 24, 2015
```
or binary.
```
21 Aug, 2015 5 commits
- Fix segfault in sreport when there was no response from the dbd. · 307bd82d
  Brian Christiansen authored Aug 21, 2015
```
Bug 1831
```
- Sort job arrays in job queue according to array_task_id when priorities are equal. · 39ea3440
  Brian Christiansen authored Aug 21, 2015
```
Bug 1869
```
- Fix a bug in squeue which prevented squeue -tPD from printing array jobs. · 703dd6a6
  David Bigagli authored Aug 21, 2015
  
- Fix preemption issue that could kill job at startup · 47f96cae
  Morris Jette authored Aug 20, 2015
```
Fix gang scheduling/preemption issue that could cancel job at startup.
I have not been able to reproduce the reported problem, but this
should prevent the reported problem.
bug 1880
```
- Fix a couple of typos in NEWS · ce20585a
  Morris Jette authored Aug 20, 2015
  
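
The tie-break introduced by commit 39ea3440 in the list above (order by array_task_id when priorities are equal) can be sketched as a compound sort key. This is a minimal illustration, assuming that higher priority values run first; the `QueuedTask` type and `queue_order` function are hypothetical and do not mirror the scheduler's real data structures.

```python
from typing import NamedTuple


class QueuedTask(NamedTuple):
    """Hypothetical queue entry for one job array task."""
    job_id: int
    priority: int
    array_task_id: int


def queue_order(tasks):
    """Sort by descending priority, then ascending array_task_id on ties."""
    return sorted(tasks, key=lambda t: (-t.priority, t.array_task_id))


tasks = [
    QueuedTask(job_id=10, priority=100, array_task_id=3),
    QueuedTask(job_id=10, priority=100, array_task_id=1),
    QueuedTask(job_id=11, priority=200, array_task_id=0),
    QueuedTask(job_id=10, priority=100, array_task_id=2),
]
for task in queue_order(tasks):
    print(task.job_id, task.priority, task.array_task_id)
```
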
20 Aug, 2015 1 commit

Revert "Fix bug in conversion unit." · 696a4917

David Bigagli authored Jul 31, 2015

This reverts commit 406cb529.

Conflicts:
	NEWS

This works OK up to 1048576M. At that point it sets the divisor to
1048576, but it divides only once instead of twice,
returning 1G instead of 1T.
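
The arithmetic behind that failure can be sketched as follows. This is an assumption-laden reconstruction of the behaviour described above, not the reverted code: both function names and the `UNITS` table are hypothetical.

```python
UNITS = ["M", "G", "T", "P"]  # hypothetical unit ladder, starting at megabytes


def convert_single_step(mb: int) -> str:
    """Mimic the described bug: pick one large divisor, but advance only one unit."""
    divisor = 1
    while mb >= divisor * 1024 and mb % (divisor * 1024) == 0:
        divisor *= 1024
    if divisor == 1:
        return f"{mb}{UNITS[0]}"
    # The value is divided once by the full divisor, yet the unit
    # moves only from M to G, no matter how large the divisor was.
    return f"{mb // divisor}{UNITS[1]}"


def convert_repeated(mb: int) -> str:
    """Divide by 1024 one step at a time, advancing the unit on every step."""
    idx = 0
    while mb >= 1024 and mb % 1024 == 0 and idx < len(UNITS) - 1:
        mb //= 1024
        idx += 1
    return f"{mb}{UNITS[idx]}"


print(convert_single_step(2048))     # 2G
print(convert_single_step(1048576))  # 1G
print(convert_repeated(1048576))     # 1T
```

Both versions agree below 1048576M, which matches the note that the problem only appears at that size.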

We have decided to wait until 15.08 for a no convert option to be added.


14 Aug, 2015 2 commits

Non-default settings of AuthInfo fix · 02c96859

Daniel Ahlin authored Aug 13, 2015

We are using a non-default AuthInfo configuration and, based on the
log messages we see, I believe it is not properly handled in certain
parts of the code.

Typical log messages:
Aug 12 17:06:15 t02n20 slurmd[27001]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Aug 12 17:06:15 t02n20 slurmd[27001]: error: Creating authentication credential: Socket communication error
Aug 12 17:06:15 t02n20 slurmd[27001]: error: stepd_connect to 3165.0 failed: Protocol authentication error
Aug 12 17:06:15 t02n20 slurmd[27001]: error: If munged is up, restart with --num-threads=10


Rebuild core reservation bitmap as needed · 931d1814

Morris Jette authored Aug 13, 2015

Unfortunately, the reservation's core_bitmap is a global bitmap, and the
node count may have changed, or nodes may have been added or removed.
Make a best effort to rebuild the reservation's core_bitmap
from the limited information currently available. The specific cores might
change, but this logic at least keeps their count constant and uses the
same nodes.
bug 1850


11 Aug, 2015 2 commits
- Fix bug in slurmctld which was displaying jobs in the CF (CONFIGURING) state as RUNNING. · aa81c550
  David Bigagli authored Aug 11, 2015
  
- Update NEWS · 0d8bc878
  Brian Christiansen authored Aug 10, 2015
```
Bug 1770
```
06 Aug, 2015 1 commit
- ALPS - Fix issue when using --exclusive flag on srun to do the correct · e84965ff
  Danny Auble authored Aug 06, 2015
```
thing (-F exclusive) instead of -F share.

bug 1849
```
04 Aug, 2015 2 commits
- Don't reboot nodes with suspended hosts. · 44901a6f
  David Bigagli authored Aug 04, 2015
  
- Fix for job array priority ordering · 0a51f0ec
  Morris Jette authored Aug 03, 2015
```
Fix scheduling anomaly with job arrays submitted to multiple partitions,
    jobs could be started out of priority order.
bug 1892
```
03 Aug, 2015 2 commits
- MYSQL - Fix issue where if an association id ever changes while at least a · 4796e7d4
  Danny Auble authored Aug 03, 2015
```
portion of a job array is pending after its initial start in the
database, it could create another row for the remaining array instead
of using the already existing row.

bug 1840
```
- MYSQL - Update mod_time when updating a start job record or adding one. · 097e169a
  Danny Auble authored Aug 03, 2015
  
31 Jul, 2015 1 commit
- Fix bug in conversion unit. · 406cb529
  David Bigagli authored Jul 31, 2015
  
28 Jul, 2015 4 commits
- Fix segfault in Bluegene setups where RebootQOSList is defined in... · 9bf78c9b
  Brian Christiansen authored Jul 28, 2015
```
Fix segfault in Bluegene setups where RebootQOSList is defined in bluegene.conf and accounting is not set up.
```
- Read in correct number of nodes from SLURM_HOSTFILE when specifying nodes and... · 802988d6
  Nathan Yee authored Jul 28, 2015
```
Read in correct number of nodes from SLURM_HOSTFILE when specifying nodes and --distribution=arbitrary.

Bug 1808
```
- Update NEWS · 428514ca
  Brian Christiansen authored Jul 28, 2015
```
From eee63695
```
- Set block distribution when srun requests 0 memory. · eee63695
  Brian Christiansen authored Jul 28, 2015
```
Bug 1810
```
27 Jul, 2015 1 commit

Fix bug in node selection with topology optimization · 9dad2ff7

Morris Jette authored Jul 27, 2015

If node definitions in slurm.conf are spread across multiple lines
and topology/tree is configured, then sub-optimal node selection can
occur.
bug 1645


22 Jul, 2015 2 commits
- Capture salloc/srun information in sdiag statistics · dc079ea8
  Nicolas Joly authored Jul 22, 2015
```
Previously only batch job completions were being captured.
bug 1820
```
- Correct the sacct.a man page. · 8f1c1a80
  David Bigagli authored Jul 22, 2015
  
21 Jul, 2015 2 commits

fix incorrect reading of cpuinfo on POWER systems · 962dea86

Chandler Wilkerson authored Jul 21, 2015

This patch provides a rewrite of how /proc/cpuinfo is parsed in common_jag.c, as the original code made the incorrect assumption that cpuinfo follows a sane format across architectures ;-)

The motivation for this patch is that the original code was producing stack smashing on a POWER7 running RHEL 6.4. Red Hat adds -fstack-protector along with a lot of other protective CFLAGS when building RPMs. The code ran okay with -fno-stack-protector, but that is not the best workaround.

So, the relevant /proc/cpuinfo line on an Intel (Xeon X5675) system looks like:

cpu MHz                : 3066.915

Whereas the relevant line in a POWER7 system is

clock                : 3550.000000MHz

My patch replaces the assumption that the relevant number starts 11 characters into the string with another assumption: That the relevant number starts two characters after a colon in a string that matches (M|G)Hz.

All in all, the function has a few more calls, which may be a performance issue if it has to be called multiple times, but since the section I edited only gets evaluated if we don't know the cpu frequency, getting it right will actually result in fewer string operations and unnecessary opens of /proc/cpuinfo for systems likewise affected.

Finally, I also read the actual value into a double and multiply it up to the size indicated by the suffix, so we end up with kHz? It was unclear what the original code intended, since it matched both MHz and GHz, replaced the decimal point with a zero, and read the result as an int.
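
The parsing rule described above can be sketched in Python for illustration. The patch itself is C code in common_jag.c; the function name `cpu_freq_khz`, the scale table, and the choice to return an integer number of kHz are assumptions made for this sketch, not details taken from the patch.

```python
import re

# Multipliers that turn a value with the given suffix into kHz (assumed target unit).
_SCALE_TO_KHZ = {"MHz": 1_000, "GHz": 1_000_000}


def cpu_freq_khz(line: str):
    """Parse one /proc/cpuinfo line; return the frequency in kHz, or None."""
    unit = re.search(r"(M|G)Hz", line)
    colon = line.find(":")
    if unit is None or colon < 0:
        return None
    # The number is assumed to start two characters after the colon.
    number = re.match(r"\d+(?:\.\d+)?", line[colon + 2:])
    if number is None:
        return None
    return int(round(float(number.group()) * _SCALE_TO_KHZ[unit.group()]))


print(cpu_freq_khz("cpu MHz                : 3066.915"))      # Intel-style line
print(cpu_freq_khz("clock                : 3550.000000MHz"))  # POWER7-style line
```

Lines that mention MHz or GHz but carry no number right after the colon simply yield None in this sketch.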

--
Chandler Wilkerson
Center for Research Computing
Rice University


Revert "ALPS - Remove sanity code to work like it did in 2.5. This is an addition" · 6ab26aa6

Danny Auble authored Oct 16, 2014

This reverts commit 2c95e2d2.

Conflicts:
	src/plugins/select/alps/basil_interface.c

This is related to bug 1822.  It isn't clear why the code was taken out in
this commit in the first place and based off of commit 2e2de6a4 (which is
the reason for the conflict) we tried unsuccessfully to put it back.

It appears the only difference here is that mppnppn is now
always set to 1, instead of always to
job_ptr->details->ntasks_per_node, when no ntasks is set.

This appears to only be an issue with salloc or sbatch as ntasks
is always set for srun.


18 Jul, 2015 1 commit
- Permit update of reservation with no nodes · 4895b81b
  Brian Christiansen authored Jul 17, 2015
```
Prevent slurmctld abort on update of advanced reservation that contains no
    nodes.
bug 1814
```
17 Jul, 2015 4 commits

srun memory option fix · bbb92b99

Morris Jette authored Jul 17, 2015

An srun command-line option of either --mem or --mem-per-cpu will override
both the SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE environment variables.
Without this change, salloc or sbatch setting --mem-per-cpu (or a
configuration of DefMemPerCPU) would override the step's --mem
value.
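
The precedence described above can be sketched as a small resolver. This is a simplified illustration, not srun's option handling: the function name, the dictionary shape of `cli`, and the order in which the two environment variables are consulted when no option is given are all assumptions.

```python
def resolve_step_memory(cli: dict, env: dict):
    """Return (kind, megabytes) for a job step, or None if nothing is set.

    cli: hypothetical parsed options, e.g. {"mem": 2048} or {"mem_per_cpu": 512}
    env: environment mapping, e.g. os.environ
    """
    # Either command-line option wins over both environment variables.
    if cli.get("mem") is not None:
        return ("per_node", cli["mem"])
    if cli.get("mem_per_cpu") is not None:
        return ("per_cpu", cli["mem_per_cpu"])
    # Fallback order between the two variables is an assumption of this sketch.
    if "SLURM_MEM_PER_CPU" in env:
        return ("per_cpu", int(env["SLURM_MEM_PER_CPU"]))
    if "SLURM_MEM_PER_NODE" in env:
        return ("per_node", int(env["SLURM_MEM_PER_NODE"]))
    return None


# An allocation exported a per-CPU limit, but the step asks for --mem=2048.
print(resolve_step_memory({"mem": 2048}, {"SLURM_MEM_PER_CPU": "512"}))
```

With the option present, the inherited per-CPU value is ignored, which is the behaviour the commit message describes.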


MYSQL - Fix minor memory leak when modifying an association but no · 340c25fa

Danny Auble authored Jul 17, 2015

change was made.

MYSQL - Close chance of setting the wrong limit on an association · 22189a75

Danny Auble authored Jul 16, 2015

when removing a limit from an association on multiple clusters
at the same time.

MYSQL - Make it so you don't have to restart the slurmctld · 9b10f0fb

Danny Auble authored Jul 16, 2015

to gain the correct limit when a parent account is root and you
remove a subaccount's limit which exists on the parent account.


16 Jul, 2015 1 commit
- Add NEWS for previous commit · 195d4cc8
  Morris Jette authored Jul 16, 2015
  
15 Jul, 2015 3 commits
- squeue: Enable filtering for job state SPECIAL_EXIT · d4d51de7
  Morris Jette authored Jul 15, 2015
  
- squeue: Removed the new line from job array ID · 8247cb26
  Nathan Yee authored Jul 15, 2015
  
- Fix plane distribution to allocate in blocks rather than cyclically. · 9d7f1507
  Nathan Yee authored Jul 15, 2015
```
Bug 1798
```
14 Jul, 2015 1 commit
- CRAY - Fix seg fault if a blade is replaced and slurmctld is restarted. · 43d0ad6f
  Danny Auble authored Jul 14, 2015
  