Commits · 9ee5af4fa0ac1d62dd6eab6e7ec442297fbaeba3 · Manuel G. Marciani / ces_slurm_simulator

10 May, 2012 1 commit
- Document nature of performance improvements · 9ee5af4f
  Morris Jette authored May 10, 2012
  
  9ee5af4f
09 May, 2012 6 commits
- correct test_id · 8b5f101c
  Morris Jette authored May 09, 2012
  
  8b5f101c
- Revert some v2.3 mods not applicable to v2.4 · 4cbaf7b0
  Morris Jette authored May 09, 2012
  
  4cbaf7b0
- Clarify job step default gres allocation · ca55aee4
  Morris Jette authored May 09, 2012
  
  ca55aee4
- Reset priority of system held jobs when dependency is satisfied · 9e9298b1
  Don Lipari authored May 09, 2012
```
The symptom is that SLURM schedules lower priority jobs to run when higher priority, dependent jobs have their dependencies satisfied.  This happens because dependent jobs still have a priority of 1 when the job queue is sorted in the schedule() function.  The proposed fix forces jobs to have their priority updated when their dependencies are satisfied.
```
  9e9298b1
- sview fix to handle correct values. · 0b7bcc99
  Danny Auble authored May 09, 2012
  
  0b7bcc99
- Subtle changes to credential logic for performance · 3ddb9c92
  Morris Jette authored May 08, 2012
  
  3ddb9c92
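
The symptom described in 9e9298b1 can be illustrated with a minimal C sketch. This is not the slurmctld code: `struct dep_job`, `full_priority`, `dep_satisfied`, and `sort_job_queue()` are hypothetical names invented for the illustration, and the only behavior taken from the commit message is that a dependent job sits at priority 1 until its priority is updated once the dependency is satisfied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a job record (not struct job_record). */
struct dep_job {
    uint32_t job_id;
    uint32_t priority;       /* 1 while the job waits on a dependency */
    uint32_t full_priority;  /* assumed: priority the job gets once it is eligible */
    bool     dep_satisfied;  /* assumed flag: dependency has been met */
};

/* qsort comparator: highest priority first. */
static int cmp_priority_desc(const void *a, const void *b)
{
    const struct dep_job *ja = a;
    const struct dep_job *jb = b;

    if (ja->priority < jb->priority)
        return 1;
    if (ja->priority > jb->priority)
        return -1;
    return 0;
}

/*
 * Refresh the priority of jobs whose dependency is now satisfied,
 * then sort the queue. Without the refresh step such jobs would be
 * sorted with priority 1 and land behind lower-priority work.
 */
static void sort_job_queue(struct dep_job *jobs, size_t count)
{
    assert(jobs != NULL || count == 0);
    if (count == 0)
        return;

    for (size_t i = 0; i < count; i++) {
        if (jobs[i].dep_satisfied && jobs[i].priority == 1)
            jobs[i].priority = jobs[i].full_priority;
    }
    qsort(jobs, count, sizeof(jobs[0]), cmp_priority_desc);
}
```

The point of the sketch is the ordering: the priority update has to happen before the sort, otherwise the queue order reflects the stale value.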
07 May, 2012 4 commits

- Merge from v2.3 with slight logic change · ec996c21
  Morris Jette authored May 07, 2012
```
Job priority of 1 is no longer used as a special case in slurm v2.4
```
  ec996c21
- Merge branch 'slurm-2.3' · b2c0cff8
  Morris Jette authored May 07, 2012
  
  b2c0cff8
- Enable zero node allocation only for Cray batch script · 1490e835
  Morris Jette authored May 02, 2012
  
  1490e835
- Job priority reset bug on slurmctld restart · 5e9dca41
  Don Lipari authored May 07, 2012
```
The commit 8b14f388 on Jan 19, 2011 is causing problems with Moab cluster-scheduled machines.  Under this case, Moab hands off every job submitted immediately to SLURM which gets a zero priority.  Once Moab schedules the job, Moab raises the job's priority to 10,000,000 and the job runs.

When you happen to restart the slurmctld under such conditions, the sync_job_priorities() function runs which attempts to raise job priorities into a higher range if they are getting too close to zero.  The problem as I see it is that you include the "boost" for zero priority jobs.  Hence the problem we are seeing is that once the slurmctld is restarted, a bunch of zero priority jobs are suddenly eligible.  So there becomes a disconnect between the top priority job Moab is trying to start and the top priority job SLURM sees.

I believe the fix is simple:

diff job_mgr.c~ job_mgr.c
6328,6329c6328,6331
<       while ((job_ptr = (struct job_record *) list_next(job_iterator)))
<               job_ptr->priority += prio_boost;
---
>       while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
>               if (job_ptr->priority)
>                       job_ptr->priority += prio_boost;
>       }

Do you agree?

Don
```
  5e9dca41
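
The fix proposed in 5e9dca41 can be tried out in isolation with a small sketch. It assumes a plain array in place of slurmctld's job list and iterator; `struct demo_job` and `boost_priorities()` are hypothetical names for this illustration, and only the `if (priority)` guard around the `prio_boost` addition comes from the diff in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for slurmctld's struct job_record. */
struct demo_job {
    uint32_t job_id;
    uint32_t priority;  /* zero priority: the job must not become eligible */
};

/*
 * Add prio_boost to every job with a non-zero priority and return how
 * many jobs were boosted. Zero-priority jobs are skipped, so a restart
 * does not suddenly make them eligible to run.
 */
static size_t boost_priorities(struct demo_job *jobs, size_t count,
                               uint32_t prio_boost)
{
    size_t boosted = 0;

    assert(jobs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (jobs[i].priority) {  /* same guard as in the proposed diff */
            jobs[i].priority += prio_boost;
            boosted++;
        }
    }
    return boosted;
}
```

Returning the count of boosted jobs is a choice made for the sketch so the effect is easy to check; the original loop returns nothing.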

04 May, 2012 4 commits
- split test 22.1 out into 4 different sub tests · e6537e95
  Nathan Yee authored May 04, 2012
  
  e6537e95
- Modifications to Bjørn-Helge Mevik's patch to be more friendly for future · 72d594bf
  Danny Auble authored May 04, 2012
```
developments.
```
  72d594bf
- Original Patch - New feature: GrpMEM limit for QOSes and associations · ec2e363d
  Bjørn-Helge Mevik authored May 04, 2012
```
from Bjørn-Helge Mevik
```
  ec2e363d
- update munge home page · b138060e
  Danny Auble authored May 03, 2012
  
  b138060e
03 May, 2012 5 commits
- Merge branch 'slurm-2.3' · 8c15bc34
  Morris Jette authored May 03, 2012
  
  8c15bc34
- Pick step's relative nodes based upon nodes allocated to job, not nodes available to job · 63833965
  Matthieu Hautreux authored May 03, 2012
  
  63833965
- Fix segv in slurmctld for job step with relative option · 9bb178c3
  Matthieu Hautreux authored May 03, 2012
```
Here is the way to reproduce it:
[root@cuzco27 georgioy]# salloc -n64 -N4 --exclusive
salloc: Granted job allocation 8
[root@cuzco27 georgioy]# srun -r 0 -n 30 -N 2 sleep 300&
[root@cuzco27 georgioy]# srun -r 1 -n 40 -N 3 sleep 300&
[root@cuzco27 georgioy]# srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
srun: error: Unable to create job step: Zero Bytes were transmitted or received
```
  9bb178c3
- Remove vestigial sinfo "%R" format option, use "%E" instead for node reason · 33b9018a
  Morris Jette authored May 02, 2012
  
  33b9018a
- BLUEGENE - fix more issues where the max nodes of a partition weren't · c655c75c
  Danny Auble authored May 02, 2012
```
honored correctly.  I also put in nice notes where the values aren't
to be altered.
```
  c655c75c
02 May, 2012 10 commits
- More changes to support zero compute node cray allocation · 13dc12d7
  Morris Jette authored May 02, 2012
```
* Specify MinNodes via "scontrol update partition".
* Whenever the zero-node allocation ends, the frontend node is left in a state of COMPLETING until scontrol reconfigure is issued (this doesn't appear to impact the performance of the front end node as other jobs can still be submitted including other zero-node jobs).
```
  13dc12d7
- set total_cpus to be the same as the block_map_size to allow for faking · 4c8005c3
  Danny Auble authored May 02, 2012
```
system of different size than actual hardware.
```
  4c8005c3
- minor formatting · 7ae35983
  Danny Auble authored May 01, 2012
  
  7ae35983
- minor formatting issue and added comments. · f5f51bc0
  Danny Auble authored May 01, 2012
  
  f5f51bc0
- move function to where other static functions are. · fde95d06
  Danny Auble authored May 01, 2012
  
  fde95d06
- Simplify the way the hwloc_cpuset_t to hwloc_bitmap_t conversion is · 5e5f272b
  Danny Auble authored May 01, 2012
```
handled.
```
  5e5f272b
- minor formatting fixes · ec7fda3d
  Danny Auble authored May 01, 2012
  
  ec7fda3d
- code cleanup · dc4bcb9e
  Danny Auble authored May 01, 2012
  
  dc4bcb9e
- move static functions to be with the other static functions · ba97d486
  Danny Auble authored May 01, 2012
  
  ba97d486
- original patch from Martin for Support for cyclic distribution of · 69eff678
  Martin Perry authored May 01, 2012
```
cpus in task/cgroup plugin
```
  69eff678
01 May, 2012 1 commit
- Move variable definitions to eliminate compiler warning · a7768f49
  Morris Jette authored Apr 30, 2012
  
  a7768f49
27 Apr, 2012 9 commits
- Cray - Add support for batch job with zero compute nodes · cd6fb7e5
  Morris Jette authored Apr 27, 2012
```
Cray - Add support for zero compute node resource allocation to run batch
script on front-end node with no ALPS reservation. Useful for pre- or post-
processing. NOTE: The partition must be configured with MinNodes=0.
```
  cd6fb7e5
- Merge remote-tracking branch 'origin/slurm-2.3' · c0bf968a
  Danny Auble authored Apr 27, 2012
  
  c0bf968a
- Fix minor issue where uid and gid were switched in sview for submitting · 8e5da472
  Danny Auble authored Apr 27, 2012
```
batch jobs.
```
  8e5da472
- BLUEGENE - get rid of debug messages from the database · f140e93d
  Danny Auble authored Apr 27, 2012
  
  f140e93d
- BGQ - better logic for figuring out what conn-type should be. · 7eb60dda
  Danny Auble authored Apr 27, 2012
  
  7eb60dda
- BLUEGENE - when submitting a job request set up the conn_type correctly. · 51276dcc
  Danny Auble authored Apr 27, 2012
```
Previously it could break, which could mess things up on a Q system.
```
  51276dcc
- BLUEGENE - make sure any NO_VAL coming through gets translated to · 537126da
  Danny Auble authored Apr 27, 2012
```
SELECT_NAV
```
  537126da
- BLUEGENE - move logic to set the correct conn-type if NAV into the · e64a2eaa
  Danny Auble authored Apr 27, 2012
```
respective block allocators.  This also catches the conn-types like
T,T,N,N on a Q system where before those didn't work correctly.
```
  e64a2eaa
- BLUEGENE - if smap resolve is given a bad string and the slurmctld is up · c0caadca
  Danny Auble authored Apr 27, 2012
```
just return a bad result instead of talking to the database.
```
  c0caadca