- 09 May, 2012 3 commits
-
-
Don Lipari authored
The symptom is that SLURM schedules lower priority jobs to run when higher priority, dependent jobs have their dependencies satisfied. This happens because dependent jobs still have a priority of 1 when the job queue is sorted in the schedule() function. The proposed fix forces jobs to have their priority updated when their dependencies are satisfied.
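The ordering bug described above can be sketched in a few lines of C. This is a hypothetical, self-contained illustration (none of these names are SLURM's actual code): dependent jobs carry a placeholder priority of 1, so unless the priority is refreshed once the dependency clears, sorting the queue lets a lower-priority independent job run first.

```c
/*
 * Minimal sketch of the fix: refresh a job's priority as soon as its
 * dependency is satisfied, before the scheduler sorts the queue.
 * All structure and function names here are hypothetical.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct job {
    int job_id;
    int priority;        /* 1 == placeholder while a dependency is pending */
    int dependency_done; /* nonzero once the dependency is satisfied */
    int real_priority;   /* priority the job should get once eligible */
};

/* The proposed fix: update priorities of jobs whose dependencies cleared. */
static void update_satisfied_deps(struct job *jobs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (jobs[i].dependency_done && jobs[i].priority == 1)
            jobs[i].priority = jobs[i].real_priority;
}

static int by_prio_desc(const void *a, const void *b)
{
    const struct job *ja = a, *jb = b;
    return jb->priority - ja->priority;
}

/* Returns the job id that would be scheduled first. */
int first_scheduled(struct job *jobs, size_t n)
{
    update_satisfied_deps(jobs, n);
    qsort(jobs, n, sizeof(*jobs), by_prio_desc);
    return jobs[0].job_id;
}
```

Without the `update_satisfied_deps()` call, the dependent job would still sort with priority 1 and lose to any independent job.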
-
Danny Auble authored
-
Morris Jette authored
-
- 07 May, 2012 4 commits
-
-
Morris Jette authored
Job priority of 1 is no longer used as a special case in slurm v2.4
-
Morris Jette authored
-
Morris Jette authored
-
Don Lipari authored
The commit 8b14f388 on Jan 19, 2011 is causing problems with Moab cluster-scheduled machines. Under this case, Moab hands off every job submitted immediately to SLURM, which gets a zero priority. Once Moab schedules the job, Moab raises the job's priority to 10,000,000 and the job runs.

When you happen to restart the slurmctld under such conditions, the sync_job_priorities() function runs, which attempts to raise job priorities into a higher range if they are getting too close to zero. The problem as I see it is that you include the "boost" for zero priority jobs. Hence the problem we are seeing is that once the slurmctld is restarted, a bunch of zero priority jobs are suddenly eligible. So there becomes a disconnect between the top priority job Moab is trying to start and the top priority job SLURM sees.

I believe the fix is simple:

    diff job_mgr.c~ job_mgr.c
    6328,6329c6328,6331
    <     while ((job_ptr = (struct job_record *) list_next(job_iterator)))
    <         job_ptr->priority += prio_boost;
    ---
    >     while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
    >         if (job_ptr->priority)
    >             job_ptr->priority += prio_boost;
    >     }

Do you agree?

Don
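The guarded loop proposed in the diff above can be sketched as a standalone function. This is a hypothetical illustration, not SLURM's actual code: jobs held at priority 0 (e.g. jobs Moab has not yet scheduled) must keep priority 0 through the boost, otherwise they suddenly become eligible after a slurmctld restart.

```c
/*
 * Sketch of the guarded priority boost: raise every job's priority by
 * prio_boost, but skip held (zero-priority) jobs so they stay held.
 * Function name and signature are hypothetical.
 */
#include <stddef.h>
#include <stdint.h>

void boost_priorities(uint32_t *prio, size_t n, uint32_t prio_boost)
{
    for (size_t i = 0; i < n; i++) {
        if (prio[i])                  /* skip held (zero-priority) jobs */
            prio[i] += prio_boost;
    }
}
```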
-
- 04 May, 2012 4 commits
-
-
Nathan Yee authored
-
Danny Auble authored
developments.
-
Bjørn-Helge Mevik authored
from Bjørn-Helge Mevik
-
Danny Auble authored
-
- 03 May, 2012 5 commits
-
-
Morris Jette authored
-
Matthieu Hautreux authored
-
Matthieu Hautreux authored
Here is the way to reproduce it:

    [root@cuzco27 georgioy]# salloc -n64 -N4 --exclusive
    salloc: Granted job allocation 8
    [root@cuzco27 georgioy]# srun -r 0 -n 30 -N 2 sleep 300&
    [root@cuzco27 georgioy]# srun -r 1 -n 40 -N 3 sleep 300&
    [root@cuzco27 georgioy]#
    srun: error: slurm_receive_msg: Zero Bytes were transmitted or received
    srun: error: Unable to create job step: Zero Bytes were transmitted or received
-
Morris Jette authored
-
Danny Auble authored
honored correctly. I also put in notes where the values must not be altered.
-
- 02 May, 2012 10 commits
-
-
Morris Jette authored
* Specify MinNodes via "scontrol update partition".
* Whenever the zero-node allocation ends, the front-end node is left in a COMPLETING state until "scontrol reconfigure" is issued (this doesn't appear to impact the performance of the front-end node, as other jobs can still be submitted, including other zero-node jobs).
-
Danny Auble authored
system of a different size than the actual hardware.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
handled.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Martin Perry authored
cpus in task/cgroup plugin
-
- 01 May, 2012 1 commit
-
-
Morris Jette authored
-
- 27 Apr, 2012 10 commits
-
-
Morris Jette authored
Cray - Add support for zero compute node resource allocation to run a batch script on the front-end node with no ALPS reservation. Useful for pre- or post-processing. NOTE: The partition must be configured with MinNodes=0.
-
Danny Auble authored
-
Danny Auble authored
batch jobs.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Previously it could break, which could mess things up on a Q.
-
Danny Auble authored
SELECT_NAV
-
Danny Auble authored
respected block allocators. This also catches conn-types like T,T,N,N on a Q system, which previously didn't work correctly.
-
Danny Auble authored
just return a bad result instead of talking to the database.
-
Danny Auble authored
clause
-
- 26 Apr, 2012 3 commits
-
-
Morris Jette authored
Sinfo output format of "%P" now prints "*" after the default partition even if a field width is specified (previously "*" was included only when no field width was specified). Added output format of "%R" to print the partition name only, without marking the default partition with "*".
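The formatting change can be sketched in C. This is a hypothetical illustration, not sinfo's actual code: the "*" marking the default partition is appended to the name before any field-width padding is applied, so it survives whether or not a width was given.

```c
/*
 * Sketch of "%P"-style formatting: append "*" for the default partition,
 * then apply the optional field width. Names are hypothetical.
 */
#include <stdio.h>
#include <string.h>

void format_partition(char *buf, size_t len, const char *name,
                      int is_default, int width)
{
    char tmp[64];

    /* Mark the default partition first, so padding never drops the "*". */
    snprintf(tmp, sizeof(tmp), "%s%s", name, is_default ? "*" : "");
    if (width > 0)
        snprintf(buf, len, "%-*s", width, tmp);  /* left-justified, padded */
    else
        snprintf(buf, len, "%s", tmp);
}
```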
-
Morris Jette authored
-
Morris Jette authored
-