Commits · 2d8a8e4b514e954a6cb822eea5cb647ea683bebb · Manuel G. Marciani / ces_slurm_simulator

09 Jan, 2013 6 commits
- Fixed bad variables · 2d8a8e4b
  Danny Auble authored Jan 09, 2013
  
  2d8a8e4b
- Added salloc to test8.11 for testing. · 8ca946b1
  Nathan Yee authored Jan 09, 2013
  
  8ca946b1
- Update contributor agreement page to advise SchedMD contact · 9c0f8832
  Morris Jette authored Jan 09, 2013
  
  9c0f8832
- Add contributor agreements to web pages · 15c4f769
  Morris Jette authored Jan 09, 2013
  
  15c4f769
- Print new-line after scontrol EOF · 3f52e3ee
  David Bigagli authored Jan 09, 2013
  
  3f52e3ee
- Add missing "safe" flag from print of AccountStorageEnforce option. · 720153b7
  Danny Auble authored Jan 09, 2013
  
  720153b7
08 Jan, 2013 6 commits

BLUEGENE - fix for QOS/Association node limits. · b50e2269
Danny Auble authored Jan 08, 2013

b50e2269
In select/serial enforce both node and core count for reservation creation · f00745ea
jette authored Jan 08, 2013

f00745ea
Disable a test for select/serial plugin · 9bc0cf0b
jette authored Jan 08, 2013

9bc0cf0b
Modify test for change in how squeue reports node state alloc or idle · 82e0bb50
Morris Jette authored Jan 08, 2013

82e0bb50

Report node state as MAINT only if no allocated jobs · 2af5ce33

Rod Schultz authored Jan 08, 2013

One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the allocated state while scontrol shows it to be in the maintenance state.

This can happen when a node is not completely allocated (select cons_res, a partition which is not Shared=EXCLUSIVE, jobs allocated without --exclusive, or jobs that are allocated only some of the CPUs on a node).

Execution paths in scontrol leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact test for allocated_cpus less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while similar paths in sinfo do not.

I think this is probably a bug, since the mixed state is defined, and I think it is desirable that both commands return the same result.

The problem can be fixed with two logic changes (in multiple places):

1) node_state_string and node_state_string_compact have to check for mixed as well as allocated before returning the MAINT state. This means that the reported state for the node with the allocated job will be MIXED.

2) sinfo must also check for allocated_cpus less than total_cpus and set the state to MIXED before calling either node_state_string or node_state_string_compact.

The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
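
The two logic changes described above can be sketched as follows. This is a minimal illustration, not the code from the patch: the enum, the function names, and the parameters are all hypothetical and only model the decision order the message describes (derive MIXED from the CPU counts first, then report MAINT only when the node is neither ALLOCATED nor MIXED).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified node states; NOT Slurm's real enum or flags. */
enum sketch_state { SK_IDLE, SK_MIXED, SK_ALLOCATED, SK_MAINT };

/*
 * Change 2 (sketch): derive the base state from the CPU counts, so that a
 * partially allocated node becomes MIXED in sinfo as it does in scontrol.
 */
enum sketch_state sketch_base_state(uint16_t alloc_cpus, uint16_t total_cpus)
{
    if (alloc_cpus == 0)
        return SK_IDLE;
    if (alloc_cpus < total_cpus)
        return SK_MIXED;
    return SK_ALLOCATED;
}

/*
 * Change 1 (sketch): a maintenance flag only turns into MAINT when the node
 * is neither ALLOCATED nor MIXED, i.e. when no job is running on it.
 */
enum sketch_state sketch_display_state(enum sketch_state base, bool maint_flag)
{
    if (maint_flag && base != SK_ALLOCATED && base != SK_MIXED)
        return SK_MAINT;
    return base;
}
```

Under these assumptions, a node with 4 of 16 CPUs allocated inside a maintenance reservation is displayed as MIXED, while an idle node in the same reservation is displayed as MAINT.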

2af5ce33

Fix advanced reservation recovery logic when upgrading from version 2.4. · 604b869e
Morris Jette authored Jan 08, 2013

604b869e

03 Jan, 2013 8 commits
- Start news for V2.5.2 · 64048706
  Morris Jette authored Jan 03, 2013
  
  64048706
- Update META for v2.5.1 tag · 00f20189
  Morris Jette authored Jan 03, 2013
  
  00f20189
- Disable job --exclusive option with select/serial plugin · 54a3a7db
  Morris Jette authored Jan 03, 2013
  
  54a3a7db
- Modify test for changed scontrol output and disable with select/serial · 108edd31
  Morris Jette authored Jan 03, 2013
  
  108edd31
- Correction to commit 9eb37af0 · da312784
  Morris Jette authored Jan 03, 2013
```
Command line argument would not be processed, but scontrol would
exit immediately
```
  da312784
- Exit scontrol command on stdin EOF · 9eb37af0
  Morris Jette authored Jan 03, 2013
  
  9eb37af0
- Fix typo · e3cba0ed
  Morris Jette authored Jan 03, 2013
  
  e3cba0ed
- Correct core reservation logic for use with select/serial plugin. · 0a06566a
  Morris Jette authored Jan 02, 2013
  
  0a06566a
28 Dec, 2012 3 commits
- Support multiple versions of account gather RPCs · 5f087561
  Morris Jette authored Dec 28, 2012
  
  5f087561
- Support version 2.5 slurmctld with version 2.4 slurmd daemons. · 844f70a2
  Morris Jette authored Dec 28, 2012
  
  844f70a2
- Improve srun error handling if launch plugin not found · 977c02cb
  jette authored Dec 27, 2012
  
  977c02cb
22 Dec, 2012 1 commit
- Alter hostlist logic to allocate large grid dynamically instead of on · 8b5045f2
  Danny Auble authored Dec 21, 2012
```
stack.
```
  8b5045f2
21 Dec, 2012 1 commit

Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982

Morris Jette authored Dec 21, 2012

If sched/backfill starts a job with a QOS having NO_RESERVE and no job
time limit, start it with the partition time limit (or one year if the
partition has no time limit) rather than NO_VAL (140 year time limit).

If a standby job, which in this case has the NO_RESERVE flag set, is
submitted without a time limit and is backfilled, it will get an EndTime
way into the future.

JobId=99 Name=cmdll
   UserId=eckert(1043) GroupId=eckert(1043)
   Priority=12083 Account=sa QOS=standby
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
   StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

So I looked at the code in /src/plugins/sched/backfill:

                if (job_ptr->start_time <= now) {
                        int rc = _start_job(job_ptr, resv_bitmap);
                        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                job_ptr->time_limit = orig_time_limit;
                                job_ptr->end_time = job_ptr->start_time +
                                                    (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified
time limit, the job_ptr->time_limit is equal to NO_VAL when it hits
this code.
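
The fallback described at the top of this message can be sketched as below. This is an illustration under stated assumptions, not the committed fix: the function names are hypothetical, limits are taken to be in minutes, and the two sentinel values are assumed to mirror Slurm's NO_VAL and INFINITE. The second helper shows how multiplying a NO_VAL limit by 60 in 32-bit unsigned arithmetic wraps to roughly 136 years of seconds, which appears consistent with the EndTime in the job record above.

```c
#include <stdint.h>

/* Assumed sentinel values, mirroring Slurm's NO_VAL and INFINITE. */
#define SKETCH_NO_VAL   0xfffffffeu
#define SKETCH_INFINITE 0xffffffffu
/* One year expressed in minutes, the fallback named in the message. */
#define SKETCH_ONE_YEAR_MINUTES (365u * 24u * 60u)

/*
 * Hypothetical helper: pick the time limit (in minutes) for a backfilled
 * NO_RESERVE job. A job without its own limit gets the partition limit,
 * or one year if the partition has no limit either.
 */
uint32_t sketch_backfill_time_limit(uint32_t job_limit, uint32_t part_limit)
{
    if (job_limit != SKETCH_NO_VAL)
        return job_limit;
    if (part_limit != SKETCH_INFINITE && part_limit != SKETCH_NO_VAL)
        return part_limit;
    return SKETCH_ONE_YEAR_MINUTES;
}

/*
 * Minutes to seconds in 32-bit unsigned arithmetic. With the NO_VAL
 * sentinel as input the product wraps modulo 2^32 instead of failing.
 */
uint32_t sketch_limit_seconds_u32(uint32_t minutes)
{
    return minutes * 60u;
}
```

With these definitions, `sketch_limit_seconds_u32(SKETCH_NO_VAL)` evaluates to 4294967176 seconds (2^32 minus 120), while `sketch_backfill_time_limit(SKETCH_NO_VAL, SKETCH_INFINITE)` yields 525600 minutes.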

4652e982

20 Dec, 2012 2 commits
- FRONTEND - fixed issue where if a system's nodes weren't defined in the · 4d5d3e9f
  Danny Auble authored Dec 20, 2012
```
slurm.conf with NodeAddr's, signals going to a step could be handled
incorrectly.
```
  4d5d3e9f
- Fixed issue where if an srun dies inside of an allocation abnormally it · ea7046d2
  Danny Auble authored Dec 20, 2012
```
would have also killed the allocation.
```
  ea7046d2
19 Dec, 2012 5 commits
- BLUEGENE - Fix issue with preemption when needing to preempt multiple jobs · 8910442c
  Danny Auble authored Dec 19, 2012
```
to make one job run.
```
  8910442c
- BLUEGENE - Correct method to update conn_type of a job. · d26fa575
  Danny Auble authored Dec 19, 2012
  
  d26fa575
- BGQ - in emulation make it so we can pretend to run large jobs (>64k nodes) · 5eab3ceb
  Danny Auble authored Dec 19, 2012
  
  5eab3ceb
- insert missing header · 6c423749
  Danny Auble authored Dec 18, 2012
  
  6c423749
- BGQ - Fix for salloc/sbatch to do the correct allocation when asking for · 3e89da11
  Danny Auble authored Dec 18, 2012
```
-N1 -n#.
```
  3e89da11
18 Dec, 2012 1 commit

Add "default_account" as a LUA job submit plugin variable. · 1a64a229

Kent Engström authored Dec 18, 2012

This is useful in a submit plugin script that needs to do
different things depending on the account, as the setting
of account from default account does not happen until after
the script has run.

1a64a229

17 Dec, 2012 5 commits
- Another fix for handling power state resume on restart of · bf2ac92b
  Danny Auble authored Dec 17, 2012
```
slurmctld.
```
  bf2ac92b
- RAPL - Fix for handling errors when opening msr files. · 32ad081a
  Danny Auble authored Dec 17, 2012
  
  32ad081a
- Update Documentation for SlurmdSpoolDir · 54f5e546
  Danny Auble authored Dec 17, 2012
  
  54f5e546
- Merge pull request #34 from cread/master · 65b9d385
  Morris Jette authored Dec 17, 2012
```
Fix spelling of my surname
```
  65b9d385
- Fix spelling of my surname · 3ca1c8c0
  Chris Read authored Dec 17, 2012
  
  3ca1c8c0
14 Dec, 2012 2 commits
- Fix for node being set down due to "unexpected reboot", but really timing issue · 07c428a2
  Morris Jette authored Dec 14, 2012
  
  07c428a2
- CRAY - Fix for setting up the aprun for a large job (+2000 nodes). · 4e40d6d5
  Danny Auble authored Dec 14, 2012
  
  4e40d6d5