Commits · 02c96859e0d1df5f98b9422a64059820cc7035c7 · Manuel G. Marciani / ces_slurm_simulator

14 Aug, 2015 3 commits

Non-default settings of AuthInfo fix · 02c96859

Daniel Ahlin authored Aug 13, 2015

We are using a non-default AuthInfo configuration and, based on the
log messages we see, I believe this is not properly handled in certain
parts of the code.

Typical log message:
Aug 12 17:06:15 t02n20 slurmd[27001]: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
Aug 12 17:06:15 t02n20 slurmd[27001]: error: Creating authentication credential: Socket communication error
Aug 12 17:06:15 t02n20 slurmd[27001]: error: stepd_connect to 3165.0 failed: Protocol authentication error
Aug 12 17:06:15 t02n20 slurmd[27001]: error: If munged is up, restart with --num-threads=10

02c96859
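
The log in commit 02c96859 shows slurmd trying to reach /var/run/munge/munge.socket.2, which looks like the default MUNGE socket, even though the site configures a non-default AuthInfo. The following is a minimal sketch of the general idea, not the actual patch: any code path that creates a credential would take the socket from the configured AuthInfo string instead of falling back to the default. The helper name `auth_info_socket` is hypothetical, and the sketch assumes the AuthInfo string is a comma-separated option list that may contain `socket=<path>`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the Slurm sources): return a
 * malloc'd copy of the value of "socket=" found in a comma-separated
 * option string, or NULL if the string is NULL or has no such option.
 * The caller must free() the result.
 */
char *auth_info_socket(const char *auth_info)
{
    static const char key[] = "socket=";
    const size_t klen = sizeof(key) - 1;
    const char *p = auth_info;

    while (p != NULL && *p != '\0') {
        /* Each option ends at the next comma or at the end of the string. */
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);

        if (len > klen && strncmp(p, key, klen) == 0) {
            char *path = malloc(len - klen + 1);
            if (path == NULL)
                return NULL;
            memcpy(path, p + klen, len - klen);
            path[len - klen] = '\0';
            return path;
        }
        p = comma ? comma + 1 : NULL;
    }
    return NULL;
}
```

With a made-up value such as `"ttl=60,socket=/tmp/munge.sock"` the helper returns `"/tmp/munge.sock"`. With no `socket=` option it returns NULL, and the caller would then use the default socket.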

Correct description in function header · 5410bdc2
Morris Jette authored Aug 13, 2015

5410bdc2

Rebuild core reservation bitmap as needed · 931d1814

Morris Jette authored Aug 13, 2015

Unfortunately the reservation's core_bitmap is a global bitmap, and the nodes
in the system may have changed in terms of their node count, or nodes may have
been added or removed. Make a best effort to rebuild the reservation's
core_bitmap from the limited information currently available. The specific
cores might change, but this logic at least leaves their count constant and
uses the same nodes.
bug 1850

931d1814

11 Aug, 2015 8 commits
- job configuring state fix · 53b85d34
  Morris Jette authored Aug 11, 2015
```
This propagates the logic from commit aa81c550
into another place where the job CONFIGURING flag is set.
```
  53b85d34
- Change spaces to tabs in commit aa81c550 · fc386185
  Morris Jette authored Aug 11, 2015
  
  fc386185
- Fix build problem on BG caused by aa81c550. · 55743ca0
  David Bigagli authored Aug 11, 2015
  
  55743ca0
- Fix bug in slurmctld which was displaying job in CF as RUNNING. · aa81c550
  David Bigagli authored Aug 11, 2015
  
  aa81c550
- Update NEWS · 0d8bc878
  Brian Christiansen authored Aug 10, 2015
```
Bug 1770
```
  0d8bc878
- Additional fixes to perl API. · 1297e9bb
  Brian Christiansen authored Aug 10, 2015
```
Bug 1770
Prevent segfaults and return reservation with NULL partition.
```
  1297e9bb
- More memory leak fixes in Perl API. · a8776440
  Brian Christiansen authored Aug 10, 2015
```
Bug 1770
Continuation of ad8279d010.
Add custom typemaps that free pointer after it has been copied onto the stack.
```
  a8776440
- Fix memory leaks in Perl API. · 5c1a84da
  Brian Christiansen authored Aug 10, 2015
```
Bug 1770
Just return the static memory, it's not going to increase.
```
  5c1a84da
10 Aug, 2015 1 commit
- Fix minor typo in man page. · 4f2a1e2e
  David Bigagli authored Aug 10, 2015
  
  4f2a1e2e
06 Aug, 2015 1 commit
- ALPS - Fix issue when using --exclusive flag on srun to do the correct · e84965ff
  Danny Auble authored Aug 06, 2015
```
thing (-F exclusive) instead of -F share.

bug 1849
```
  e84965ff
04 Aug, 2015 3 commits
- Don't reboot nodes with suspended hosts. · 44901a6f
  David Bigagli authored Aug 04, 2015
  
  44901a6f
- Correction to commit 0a51f0ec · 93973af2
  Morris Jette authored Aug 03, 2015
  
  93973af2
- Fix for job array priority ordering · 0a51f0ec
  Morris Jette authored Aug 03, 2015
```
Fix scheduling anomaly with job arrays submitted to multiple partitions:
jobs could be started out of priority order.
bug 1892
```
  0a51f0ec
03 Aug, 2015 3 commits
- MYSQL - Fix issue where if an association id ever changes on at least a · 4796e7d4
  Danny Auble authored Aug 03, 2015
```
portion of a job array is pending after its initial start in the
database it could create another row for the remaining array instead
of using the already existing row.

bug 1840
```
  4796e7d4
- MYSQL - Update mod_time when updating a start job record or adding one. · 097e169a
  Danny Auble authored Aug 03, 2015
  
  097e169a
- Simplify code to continue on case instead of do something on the · 710daf7e
  Danny Auble authored Aug 03, 2015
```
opposite.
```
  710daf7e
31 Jul, 2015 1 commit
- Fix bug in conversion unit. · 406cb529
  David Bigagli authored Jul 31, 2015
  
  406cb529
28 Jul, 2015 6 commits
- remove whitespace · 67fc2cf9
  Danny Auble authored Jul 28, 2015
  
  67fc2cf9
- update FreeBSD note in quickstart admin · 962685ce
  Danny Auble authored Jul 28, 2015
  
  962685ce
- Fix segfault in Bluegene setups where RebootQOSList is defined in... · 9bf78c9b
  Brian Christiansen authored Jul 28, 2015
```
Fix segfault in Bluegene setups where RebootQOSList is defined in bluegene.conf and accounting is not setup.
```
  9bf78c9b
- Read in correct number of nodes from SLURM_HOSTFILE when specifying nodes and... · 802988d6
  Nathan Yee authored Jul 28, 2015
```
Read in correct number of nodes from SLURM_HOSTFILE when specifying nodes and --distribution=arbitrary.

Bug 1808
```
  802988d6
- Update NEWS · 428514ca
  Brian Christiansen authored Jul 28, 2015
```
From eee63695
```
  428514ca
- Set block distribution when srun requests 0 memory. · eee63695
  Brian Christiansen authored Jul 28, 2015
```
Bug 1810
```
  eee63695
27 Jul, 2015 6 commits
- Move logic to optimize performance · 23f6ec89
  Morris Jette authored Jul 27, 2015
```
No change in functionality
```
  23f6ec89
- Fix bug in node selection with topology optimization · 9dad2ff7
  Morris Jette authored Jul 27, 2015
```
If node definitions in slurm.conf are spread across multiple lines
and topology/tree is configured, then sub-optimal node selection can
occur.
bug 1645
```
  9dad2ff7
- Prevent slurmctld segv when delete reservation name is NULL · 18bbf378
  Dorian Krause authored Jul 27, 2015
  
  18bbf378
- More accurate log message · 34b6f814
  Dominik Bartkiewicz authored Jul 27, 2015
  
  34b6f814
- Log build job queue timeout less often · aee694eb
  Morris Jette authored Jul 27, 2015
```
Rather than generating loads of log messages about too much time
being used to build the job queue every few seconds, log it only
every 10 minutes.
bug 1827
```
  aee694eb
- Log job-partition pair count in job queue · 4f40d604
  Morris Jette authored Jul 27, 2015
```
Log the number of job-partition pairs added to the queue for job
scheduling.
bug 1827
```
  4f40d604
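
Commit aee694eb above limits a recurring log message to once every 10 minutes. A minimal sketch of that kind of throttling follows. It is not the code from the commit: the helper name `should_log`, the constant `LOG_INTERVAL_SECS`, and the convention that a stored time of 0 means "never logged" are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <time.h>

#define LOG_INTERVAL_SECS (10 * 60)

/*
 * Hypothetical helper: decide whether a recurring message may be logged
 * at time "now". "last_logged" holds the time the message was last
 * logged (0 means it has never been logged) and is updated whenever the
 * function returns true.
 */
bool should_log(time_t now, time_t *last_logged)
{
    if (*last_logged != 0 &&
        difftime(now, *last_logged) < LOG_INTERVAL_SECS)
        return false;   /* still inside the quiet period */

    *last_logged = now;
    return true;
}
```

A caller would keep one `time_t` per message, initialized to 0, and wrap the log call in `if (should_log(time(NULL), &last)) ...`. Passing the time in explicitly keeps the helper easy to test.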
22 Jul, 2015 2 commits
- Capture salloc/srun information in sdiag statistics · dc079ea8
  Nicolas Joly authored Jul 22, 2015
```
Previously only batch job completions were being captured.
bug 1820
```
  dc079ea8
- Correct the sacct.a man page. · 8f1c1a80
  David Bigagli authored Jul 22, 2015
  
  8f1c1a80
21 Jul, 2015 5 commits

Fix typo on web page · 49560d80
Morris Jette authored Jul 21, 2015

49560d80

fix incorrect reading of cpuinfo on POWER systems · 962dea86

Chandler Wilkerson authored Jul 21, 2015

This patch provides a rewrite of how /proc/cpuinfo is parsed in common_jag.c, as the original code made the incorrect assumption that cpuinfo follows a sane format across architectures ;-)

The motivation for this patch is that the original code was producing stack smashing on a POWER7 running RHEL6.4. Red Hat adds -fstack-protector along with a lot of other protective CFLAGS when building RPMs. The code ran okay with -fno-stack-protector, but that is not the best work-around.

So, the relevant /proc/cpuinfo line on an Intel (Xeon X5675) system looks like:

cpu MHz                : 3066.915

Whereas the relevant line in a POWER7 system is

clock                : 3550.000000MHz

My patch replaces the assumption that the relevant number starts 11 characters into the string with another assumption: that the relevant number starts two characters after a colon in a string that matches (M|G)Hz.

All in all, the function has a few more calls, which may be a performance issue if it has to be called multiple times, but since the section I edited only gets evaluated if we don't know the cpu frequency, getting it right will actually result in fewer string operations and unnecessary opens of /proc/cpuinfo for systems likewise affected.

Finally, I also read the actual value into a double and multiply it up to the size indicated by the suffix, so we end up with KHz? It was unclear what the original code intended, since it matched both MHz and GHz, replaced the decimal point with a zero, and read the result as an int.

--
Chandler Wilkerson
Center for Research Computing
Rice University

962dea86
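
The parsing rule described in commit 962dea86 can be sketched as follows. This is an illustration under stated assumptions, not the code from common_jag.c: the function name `cpuinfo_line_to_khz` is hypothetical, the result unit (kHz) follows the commit message's own tentative "so we end up with KHz?", and the check that the value starts with a digit is an extra guard added here.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse one /proc/cpuinfo line and return the
 * clock frequency in kHz, or 0 if the line carries no usable frequency.
 */
unsigned long cpuinfo_line_to_khz(const char *line)
{
    const char *colon, *num;
    char *end;
    double value;
    int is_ghz = (strstr(line, "GHz") != NULL);

    /* Only lines that mention MHz or GHz are candidates. */
    if (!is_ghz && strstr(line, "MHz") == NULL)
        return 0;

    colon = strchr(line, ':');
    if (colon == NULL || colon[1] == '\0')
        return 0;

    /* The number is assumed to start two characters after the colon. */
    num = colon + 2;
    if (!isdigit((unsigned char)*num))
        return 0;

    value = strtod(num, &end);
    if (end == num)
        return 0;

    /* Scale to kHz according to the unit found on the line. */
    value *= is_ghz ? 1e6 : 1e3;
    return (unsigned long)(value + 0.5);
}
```

Both sample lines quoted in the commit message are handled: the Intel line `cpu MHz                : 3066.915` gives 3066915, and the POWER7 line `clock                : 3550.000000MHz` gives 3550000. A line such as `model name : Intel(R) ... @ 3.07GHz` is rejected because no digit follows the colon.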

Revert "ALPS - Remove sanity code to work like it did in 2.5. This is an addition" · 6ab26aa6

Danny Auble authored Oct 16, 2014

This reverts commit 2c95e2d2.

Conflicts:
	src/plugins/select/alps/basil_interface.c

This is related to bug 1822.  It isn't clear why the code was taken out in
this commit in the first place and based off of commit 2e2de6a4 (which is
the reason for the conflict) we tried unsuccessfully to put it back.

It appears the only difference here is that mppnppn is now always
set to 1, instead of always to job_ptr->details->ntasks_per_node,
when no ntasks is set.

This appears to only be an issue with salloc or sbatch as ntasks
is always set for srun.

6ab26aa6

Clarify GraceTime configuration · afae90b1
Morris Jette authored Jul 20, 2015

afae90b1

Enhance the error message. · 2c937879
david authored Jul 21, 2015

2c937879

20 Jul, 2015 1 commit
- Update SLUG agenda · 6edf1be4
  Morris Jette authored Jul 20, 2015
  
  6edf1be4