- 06 Apr, 2013 2 commits
-
-
jette authored
(at initiating pending job steps), interrupt driven rather than retry based.
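A minimal sketch of the distinction being described, using generic pthread primitives and hypothetical names (this is not the actual slurmctld code): a retry-based scheduler wakes periodically to re-attempt pending steps, while an interrupt-driven one blocks until it is explicitly notified that something changed.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical shared state guarded by a single lock. */
static pthread_mutex_t step_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  step_cond = PTHREAD_COND_INITIALIZER;
static bool step_state_changed = false;

/* Called whenever resources free up (e.g. a running step completes). */
void notify_step_scheduler(void)
{
    pthread_mutex_lock(&step_lock);
    step_state_changed = true;
    pthread_cond_signal(&step_cond);
    pthread_mutex_unlock(&step_lock);
}

/* Interrupt-driven wait: block until notified, instead of waking
 * every few seconds to retry the pending step creation. */
void wait_for_step_event(void)
{
    pthread_mutex_lock(&step_lock);
    while (!step_state_changed)
        pthread_cond_wait(&step_cond, &step_lock);
    step_state_changed = false;
    pthread_mutex_unlock(&step_lock);
}
```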
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit exceeds the partition limit but whose minimum time limit permits them to start. Related to bug 251
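A sketch of the check this message describes, with assumed field names (not the real sched/backfill data structures): a job submitted with a time-limit range should be startable when its minimum limit fits under the partition limit, even though its maximum does not.

```c
#include <stdint.h>

/* Hypothetical job fields, in minutes. */
struct bf_job {
    uint32_t time_limit; /* requested maximum time limit */
    uint32_t time_min;   /* minimum acceptable time limit */
};

/* Return the time limit to start the job with, or 0 if it cannot
 * start in this partition. The fix is the middle case: fall back
 * to a limit within [time_min, part_time_limit] rather than
 * rejecting the job because its maximum exceeds the partition. */
static uint32_t usable_time_limit(const struct bf_job *job,
                                  uint32_t part_time_limit)
{
    if (job->time_limit <= part_time_limit)
        return job->time_limit;     /* fits as requested */
    if (job->time_min <= part_time_limit)
        return part_time_limit;     /* start with a reduced limit */
    return 0;                       /* even the minimum is too long */
}
```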
-
- 05 Apr, 2013 5 commits
-
-
Morris Jette authored
Conflicts: src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
This is the only remaining use of the function. Also added a boards parameter to the function and removed some unused parameters.
-
Morris Jette authored
Mostly to split long lines
-
- 04 Apr, 2013 4 commits
-
-
Stephen Trofinoff authored
I am sending the latest update of my NPPCU-support patch for Slurm 2.5.0. As before, this patch is applied over my basic BASIL 1.3 support patch. The reason for this latest version is that it came to my attention that certain jobs that should have been rejected by Slurm were allowed through. I then further noticed that this would cause the backfill algorithm to slow down dramatically (often not being able to process any other jobs).

The cause of the problem was that when I introduced the functionality into Slurm to properly set the "nppcu" (number of processors per compute unit) attribute in the XML reservation request to ALPS, I didn't also adjust the tests earlier in the code that eliminate nodes from consideration that do not have sufficient resources. In other words, jobs that would exceed the absolute total number of processors on the node would be rejected as always (this is good). Jobs that required the reduced number of "visible" processors on the node or less were allocated and worked fine (this is good). Unfortunately, jobs that needed a number of processors somewhere in between these limits (let's call them the soft and hard limits) were allowed through by Slurm. Making matters worse, when Slurm would subsequently try to request the ALPS reservation, ALPS would correctly reject it, but Slurm would keep trying, which would then kill the backfilling.

In my opinion, these jobs should have been rejected from the outset by Slurm, as they are asking for more processors per node than can be supplied. If the user wants this number of processors, they should specify "--ntasks-per-core=..." (in our case "2", as that is the full number of hardware threads per core). Obviously, this problem only appeared when I used CR_ONE_TASK_PER_CORE in slurm.conf, as I had modified the code to set nppcu to 1 when Slurm was configured with that option and the user didn't explicitly specify a different value.

The patch appears to be working well for us now, so I am submitting it to you for your review.
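A minimal sketch of the per-node feasibility test described above, with hypothetical names (not the actual Slurm/BASIL code): with CR_ONE_TASK_PER_CORE and nppcu = 1, only one hardware thread per compute unit is usable, so a request must be checked against the reduced ("soft") CPU count rather than the full ("hard") count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical node description; not Slurm's real node record. */
struct node_info {
    uint32_t total_cpus;       /* hard limit: all hardware threads */
    uint32_t threads_per_core; /* e.g. 2 on the nodes described */
    uint32_t nppcu;            /* processors used per compute unit */
};

/* Reject requests that exceed the CPUs actually reservable via ALPS.
 * Before the fix, requests between the soft and hard limits slipped
 * past Slurm, were rejected by ALPS, and stalled the backfill loop. */
static bool node_can_satisfy(const struct node_info *node,
                             uint32_t cpus_requested)
{
    uint32_t visible_cpus =
        (node->total_cpus / node->threads_per_core) * node->nppcu;
    return cpus_requested <= visible_cpus;
}
```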
-
Morris Jette authored
-
Morris Jette authored
This fixes a bug introduced in commit f1cf6d2d. Without this change, a job with a partition-related reason for not running (e.g. MaxNodes) plus some other reason (e.g. dependency) could, after the second reason (e.g. dependency) was satisfied, have the select_nodes() function executed and return SUCCESS on an allocation request without actually allocating resources, since select_nodes() would interpret the request as a test of whether the job could ever run (e.g. will_run). Bug 263
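A schematic of the failure mode, with invented names (the real select_nodes() interface is more involved): when the mode is derived from stale job state rather than the caller's intent, a genuine allocation request can be handled as a will-run test, returning success without allocating anything.

```c
#include <stdbool.h>

/* Hypothetical simplification of the caller's intent. */
enum select_mode {
    SELECT_MODE_ALLOCATE, /* allocate resources now */
    SELECT_MODE_WILL_RUN  /* only test whether the job could ever run */
};

static bool allocate_resources(void) { /* ... */ return true; }
static bool could_ever_run(void)     { /* ... */ return true; }

/* The bug pattern: if 'mode' is inferred from a leftover
 * partition-related pending reason instead of from the request
 * itself, an allocation call falls into the will-run branch and
 * reports success even though no resources were allocated. */
static bool select_nodes_sketch(enum select_mode mode)
{
    if (mode == SELECT_MODE_WILL_RUN)
        return could_ever_run();
    return allocate_resources();
}
```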
-
Morris Jette authored
-
- 03 Apr, 2013 5 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
This logic strips leading zeros off node IDs for use with the aprun -L option. Without this change, node IDs with leading zeros are interpreted as octal. Partial fix for bug 260.
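A small illustration of why the stripping matters (a standalone example, not the actual aprun or Slurm code): parsers that auto-detect the radix treat a leading zero as octal, so a node ID such as "012" is read as 10 unless the zeros are stripped or base 10 is forced.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *node_id = "012";

    /* Base 0 auto-detects the radix: the leading zero marks the
     * string as octal, so "012" parses as 10. */
    long octal_view = strtol(node_id, NULL, 0);

    /* Forcing base 10 (or stripping leading zeros first) yields
     * the intended node ID. */
    long decimal_view = strtol(node_id, NULL, 10);

    printf("base 0: %ld, base 10: %ld\n", octal_view, decimal_view);
    return 0;
}
```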
-
Danny Auble authored
-
- 02 Apr, 2013 9 commits
-
-
Danny Auble authored
never looked at to determine the eligibility of a backfillable job.
-
Morris Jette authored
-
Morris Jette authored
A fix for this problem will require more study. This one causes an xassert failure when an attempt to start a job results in it not being started by sched/backfill due to the partition time limit.
-
Danny Auble authored
and when reading in state from DB2 we find a block that can't be created. You can now do a clean start to get rid of the bad block.
-
Danny Auble authored
the slurmctld there were software errors on some nodes.
-
Danny Auble authored
without it still existing there. This is extremely rare.
-
Danny Auble authored
a pending job on it we don't kill the job.
-
Danny Auble authored
while it was free cnodes would go into software error and kill the job.
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit exceeds the partition limit but whose minimum time limit permits them to start. Related to bug 251
-
- 01 Apr, 2013 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Fix for bug 224
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 30 Mar, 2013 1 commit
-
-
Morris Jette authored
-
- 29 Mar, 2013 8 commits
-
-
Danny Auble authored
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
Conflicts: src/plugins/priority/multifactor/priority_multifactor.c
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-