- 27 Jul, 2015 6 commits
-
-
Morris Jette authored
No change in functionality
-
Morris Jette authored
If node definitions in slurm.conf are spread across multiple lines and topology/tree is configured, then sub-optimal node selection can occur. bug 1645
-
Dorian Krause authored
-
Dominik Bartkiewicz authored
-
Morris Jette authored
Rather than logging a message every few seconds about too much time being used to build the job queue, log it only every 10 minutes. bug 1827
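A minimal sketch of this kind of time-based log throttling (illustrative only, not the actual slurmctld code; the function and message are made up):

    #include <stdio.h>
    #include <time.h>

    /* Illustrative sketch: emit the warning at most once every 10 minutes
     * instead of every time a slow queue build is detected. */
    static void warn_slow_queue_build(double build_secs)
    {
        static time_t last_log = 0;
        time_t now = time(NULL);

        if (difftime(now, last_log) >= 600) {   /* 10 minutes */
            printf("Warning: building the job queue took %.3f seconds\n",
                   build_secs);
            last_log = now;
        }
    }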
-
Morris Jette authored
Log the number of job-partition pairs added to the queue for job scheduling. bug 1827
-
- 22 Jul, 2015 2 commits
-
-
Nicolas Joly authored
Previously only batch job completions were being captured. bug 1820
-
David Bigagli authored
-
- 21 Jul, 2015 5 commits
-
-
Morris Jette authored
-
Chandler Wilkerson authored
This patch provides a rewrite of how /proc/cpuinfo is parsed in common_jag.c, as the original code made the incorrect assumption that cpuinfo follows a sane format across architectures ;-)

The motivation for this patch is that the original code was producing stack smashing on a POWER7 running RHEL6.4. Red Hat adds -fstack-protector along with a lot of other protective CFLAGS when building RPMs. The code ran okay with -fno-stack-protector, but that is not the best work-around.

So, the relevant /proc/cpuinfo line on an Intel (Xeon X5675) system looks like:

    cpu MHz : 3066.915

whereas the relevant line on a POWER7 system is:

    clock : 3550.000000MHz

My patch replaces the assumption that the relevant number starts 11 characters into the string with another assumption: that the relevant number starts two characters after a colon in a string that matches (M|G)Hz.

All in all, the function has a few more calls, which may be a performance issue if it has to be called multiple times, but since the section I edited only gets evaluated if we don't know the cpu frequency, getting it right will actually result in fewer string operations and unnecessary opens of /proc/cpuinfo for systems likewise affected.

Finally, I also read the actual value into a double and multiply it up to the size indicated by the suffix, so we end up with KHz? It was unclear what the original code intended, since it matched both MHz and GHz, replaced the decimal point with a zero, and read the result as an int.

-- Chandler Wilkerson, Center for Research Computing, Rice University
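A minimal sketch of the described approach (illustrative only, not the actual common_jag.c code): take the value after the colon and scale it by the MHz/GHz suffix so the result is in kHz. Instead of the patch's "two characters after the colon" rule, this sketch lets strtod skip the whitespace, which amounts to the same thing for the lines shown.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative sketch: parse a /proc/cpuinfo frequency line, return kHz. */
    static long cpufreq_khz_from_line(const char *line)
    {
        const char *colon = strchr(line, ':');
        if (colon == NULL)
            return -1;
        double val = strtod(colon + 1, NULL);   /* skips leading blanks */
        if (strstr(line, "GHz"))
            return (long) (val * 1000000.0);    /* GHz -> kHz */
        if (strstr(line, "MHz"))
            return (long) (val * 1000.0);       /* MHz -> kHz */
        return (long) val;                      /* no suffix: assume kHz */
    }

    int main(void)
    {
        printf("%ld\n", cpufreq_khz_from_line("cpu MHz\t\t: 3066.915"));      /* Intel: 3066915 */
        printf("%ld\n", cpufreq_khz_from_line("clock\t\t: 3550.000000MHz"));  /* POWER7: 3550000 */
        return 0;
    }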
-
Danny Auble authored
This reverts commit 2c95e2d2.

Conflicts:
    src/plugins/select/alps/basil_interface.c

This is related to bug 1822. It isn't clear why the code was taken out in that commit in the first place, and based off of commit 2e2de6a4 (which is the reason for the conflict) we tried unsuccessfully to put it back. The only difference here appears to be that mppnppn is now always set to 1, instead of to job_ptr->details->ntasks_per_node, when no ntasks is set. This appears to only be an issue with salloc or sbatch, as ntasks is always set for srun.
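A hypothetical sketch of the difference described above (the struct is a stand-in for the real job details, not the actual select/alps code):

    #include <stdint.h>

    /* Hypothetical stand-in for the relevant job detail fields. */
    struct job_details { uint32_t num_tasks; uint32_t ntasks_per_node; };

    static uint32_t pick_mppnppn(const struct job_details *details)
    {
        if (details->num_tasks)         /* ntasks was specified (always true for srun) */
            return details->ntasks_per_node;
        return 1;                       /* previously: ntasks_per_node even when ntasks was unset */
    }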
-
Morris Jette authored
-
david authored
-
- 20 Jul, 2015 1 commit
-
-
Morris Jette authored
-
- 18 Jul, 2015 2 commits
-
-
Brian Christiansen authored
Prevent slurmctld abort on update of advanced reservation that contains no nodes. bug 1814
-
Brian Christiansen authored
Bug 1810
-
- 17 Jul, 2015 6 commits
-
-
Morris Jette authored
An srun command line option of either --mem or --mem-per-cpu will override both the SLURM_MEM_PER_CPU and SLURM_MEM_PER_NODE environment variables. Without this change, salloc or sbatch setting --mem-per-cpu (or a DefMemPerCPU configuration) would override the step's --mem value.
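A hypothetical sketch of the resulting precedence (the flag names are illustrative, not the actual srun option structure):

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical sketch: a memory option given on the srun command line
     * wins over both memory environment variables inherited from
     * salloc/sbatch. */
    static void apply_cmdline_mem_override(bool mem_set, bool mem_per_cpu_set)
    {
        if (mem_set || mem_per_cpu_set) {
            unsetenv("SLURM_MEM_PER_CPU");
            unsetenv("SLURM_MEM_PER_NODE");
        }
    }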
-
Danny Auble authored
change was made.
-
Morris Jette authored
-
Nicolas Joly authored
-
Danny Auble authored
when removing a limit from an association on multiple clusters at the same time.
-
Danny Auble authored
to gain the correct limit when a parent account is root and you remove a subaccount's limit which exists on the parent account.
-
- 16 Jul, 2015 6 commits
-
-
Morris Jette authored
This fixes changes made in commit d4d51de7, which fail for a job state of Pending (needed to special-case the zero value).
-
Morris Jette authored
-
Morris Jette authored
Abort if the specified acct_gather_energy plugin can not be loaded, rather than deadlocking. bug 1797
-
Morris Jette authored
-
Morris Jette authored
Under some conditions, if an attempt to schedule the last task of a job array (the meta-record of the job array) fails, its task ID will be changed from the appropriate value to NO_VAL. bug 1790
-
Morris Jette authored
-
- 15 Jul, 2015 7 commits
-
-
Morris Jette authored
-
Nathan Yee authored
-
Nathan Yee authored
-
Nathan Yee authored
Bug 1798
-
Morris Jette authored
If a job can only be started by preempting other jobs, the old logic could report the error: "cons_res: sync loop not progressing, holding job #" due to the usable CPUs and GRES needed by the pending job not matching. This change prevents the error message and job hold when job preemption logic is being used. The error message and job hold still take place for job scheduling outside of preemption, which will match CPUs and GRES at the beginning. bug 1750
-
Morris Jette authored
Under some conditions the select/cons_res plugin will hold a job, setting its priority to zero and reason to HELD. The logic in slurmctld's main scheduling loop previously kept its priority at zero, but changed the reason from HELD to RESOURCES. This change leaves the proper job state as set by the select plugin. This may be related to bug 1750.
-
Morris Jette authored
The backfill scheduler will periodically release locks for other actions. If a job was held during the time that locks were released, that job might still have been scheduled by the backfill scheduler (i.e. it failed to check for a job with a priority of zero). This could be a root cause for bug 1750.
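A minimal sketch of the missing check (illustrative only; the struct is a stand-in for the real job record): after the backfill scheduler reacquires its locks, any job that was held in the meantime has to be skipped.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stand-in for the real job record. */
    struct job_record { uint32_t priority; };

    /* A job whose priority dropped to zero while the locks were released
     * was held and must not be started. */
    static bool still_runnable_after_lock_yield(const struct job_record *job_ptr)
    {
        return job_ptr->priority != 0;
    }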
-
- 14 Jul, 2015 3 commits
-
-
Danny Auble authored
-
Morris Jette authored
Previous logic could fail to update some tasks of a job array for some fields. bug 1777
-
Morris Jette authored
Add level to switch table information logged by select plugin
-
- 13 Jul, 2015 2 commits
-
-
Morris Jette authored
Old logic could purge the job record for a job that was still in completing state (if there were also a large number of agent threads). This change prevents purging job records for completing jobs.
-
Morris Jette authored
Fix to job array update logic that can result in a task ID of 4294967294. To reproduce:

    $ sbatch --exclusive -a 1,3,5 tmp
    Submitted batch job 11825
    $ scontrol update jobid=11825_[3,4,5] timelimit=3
    $ squeue
      JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
    11825_3     debug   tmp  jette PD  0:00     1 (None)
    11825_4     debug   tmp  jette PD  0:00     1 (None)
    11825_5     debug   tmp  jette PD  0:00     1 (None)
      11825     debug   tmp  jette PD  0:00     1 (Resources)

A new job array entry was created for task ID 4 and the "master" job array record now has a task ID of 4294967294. The logic with the bug was using the wrong variable in a test. bug 1790
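For context, 4294967294 is 0xfffffffe, matching the NO_VAL sentinel used for an unset task ID; a minimal sketch (the NO_VAL definition is assumed here, check slurm.h):

    #include <stdio.h>
    #include <stdint.h>

    #define NO_VAL (0xfffffffe)     /* assumed definition, as in slurm.h */

    int main(void)
    {
        uint32_t array_task_id = NO_VAL;        /* "master" record after the buggy update */
        printf("%u\n", (unsigned) array_task_id);  /* prints 4294967294 */
        return 0;
    }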
-