- 03 Jan, 2013 10 commits
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Command line arguments would not be processed; scontrol would exit immediately instead
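A rough sketch of the intended behavior (the dispatcher below is a hypothetical stand-in, not scontrol's actual code): if commands are given on the command line, execute them and exit with their status, rather than exiting before they are processed.

    #include <stdio.h>
    #include <string.h>

    /* hypothetical stand-in for scontrol's command dispatcher */
    static int process_command(int argc, char **argv)
    {
        if (argc > 0 && !strcmp(argv[0], "ping")) {
            printf("pong\n");
            return 0;
        }
        fprintf(stderr, "unrecognized command\n");
        return 1;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1) {
            /* one-shot mode: run the command from the command
             * line, then exit with its status */
            return process_command(argc - 1, argv + 1);
        }
        /* ... otherwise enter the interactive command loop ... */
        return 0;
    }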
-
Morris Jette authored
Conflicts: src/scontrol/scontrol.c
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 02 Jan, 2013 1 commit
-
Morris Jette authored
The original patch works fine to avoid cancelling a job when all of its nodes go unresponsive, but I don't see any way to easily address nodes coming back into service. We want to cancel jobs that have some up nodes and some down nodes, but the nodes will come back into service individually rather than all at once.
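A sketch of the policy being discussed, using invented structures rather than slurmctld's real ones: cancel a job only when its nodes are a mix of responding and non-responding, since a job whose nodes all went quiet together is more likely the victim of a global failure.

    #include <stdbool.h>
    #include <stddef.h>

    struct node { bool responding; };
    struct job  { struct node *nodes; size_t node_cnt; };

    /* Cancel only jobs with a mix of up and down nodes; a job whose
     * nodes are all unresponsive is assumed to be hit by a global
     * failure (e.g. the network) and is left running. */
    static bool should_cancel(const struct job *job)
    {
        size_t up = 0, down = 0;

        for (size_t i = 0; i < job->node_cnt; i++) {
            if (job->nodes[i].responding)
                up++;
            else
                down++;
        }
        return (up > 0) && (down > 0);
    }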
-
- 31 Dec, 2012 1 commit
-
jette authored
The job will be aborted if any of its nodes is set DOWN while responding, when "scontrol reconfig" is executed, or when the slurmctld restarts, but this should respond better to global failures, like the network going down.
-
- 29 Dec, 2012 3 commits
-
jette authored
-
Danny Auble authored
Fix broken build when HAVE_READLINE is false
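The usual shape of such a guard, sketched here rather than quoted from the patch: when readline is not available, fall back to fgets() so the interactive loop still compiles.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #ifdef HAVE_READLINE
    #  include <readline/readline.h>
    #  include <readline/history.h>
    #endif

    /* Read one command line, with or without readline support. */
    static char *get_command_line(const char *prompt)
    {
    #ifdef HAVE_READLINE
        char *line = readline(prompt);
        if (line && *line)
            add_history(line);
        return line;                    /* caller frees */
    #else
        static char buf[1024];
        printf("%s", prompt);
        fflush(stdout);
        if (!fgets(buf, sizeof(buf), stdin))
            return NULL;
        buf[strcspn(buf, "\n")] = '\0';
        return strdup(buf);             /* caller frees */
    #endif
    }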
-
Ralph Castain authored
-
- 28 Dec, 2012 8 commits
-
Morris Jette authored
There are far fewer RPCs that are not supported than are supported, so this should be faster and easier to maintain.
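The idea, sketched with a made-up message-type enum rather than the real protocol definitions: enumerate only the unsupported RPC types in a switch and treat everything else as supported by default.

    #include <stdbool.h>

    /* illustrative stand-ins for real protocol message types */
    typedef enum {
        REQUEST_PING,
        REQUEST_JOB_INFO,
        REQUEST_LEGACY_FOO,     /* no longer supported */
        REQUEST_LEGACY_BAR,     /* no longer supported */
    } msg_type_t;

    /* List only the unsupported types; the default case keeps the
     * common path short, and new RPCs are supported automatically. */
    static bool rpc_supported(msg_type_t type)
    {
        switch (type) {
        case REQUEST_LEGACY_FOO:
        case REQUEST_LEGACY_BAR:
            return false;
        default:
            return true;
        }
    }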
-
Morris Jette authored
Conflicts: src/common/slurm_protocol_util.c
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
jette authored
-
- 27 Dec, 2012 4 commits
- 22 Dec, 2012 2 commits
-
Danny Auble authored
-
Danny Auble authored
stack.
-
- 21 Dec, 2012 8 commits
-
Morris Jette authored
If sched/backfill starts a job with a QOS having NO_RESERVE and no job time limit, start it with the partition time limit (or one year if the partition has no time limit) rather than NO_VAL (a 140 year time limit).

If a standby job, which in this case has the NO_RESERVE flag set, is submitted without a time limit and is backfilled, it gets an EndTime far into the future:

    JobId=99 Name=cmdll
       UserId=eckert(1043) GroupId=eckert(1043)
       Priority=12083 Account=sa QOS=standby
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
       RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
       SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
       StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

So I looked at the code in /src/plugins/sched/backfill:

    if (job_ptr->start_time <= now) {
        int rc = _start_job(job_ptr, resv_bitmap);
        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
            job_ptr->time_limit = orig_time_limit;
            job_ptr->end_time = job_ptr->start_time +
                                (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified time limit, job_ptr->time_limit is equal to NO_VAL when it hits this code.
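A sketch of the fix as described, with invented structure and constant definitions standing in for the real ones: when the job carries no time limit of its own, fall back to the partition limit, and to one year if the partition is also unlimited, so end_time is never computed from NO_VAL.

    #include <stdint.h>
    #include <time.h>

    #define NO_VAL_LIMIT 0xfffffffe          /* stand-in for SLURM's NO_VAL */
    #define ONE_YEAR_MIN (365 * 24 * 60)     /* one year, in minutes */

    struct part_record { uint32_t max_time; };  /* minutes, or NO_VAL_LIMIT */
    struct job_record {
        uint32_t time_limit;                    /* minutes, or NO_VAL_LIMIT */
        time_t   start_time;
        time_t   end_time;
        struct part_record *part_ptr;
    };

    /* Choose a finite time limit before computing end_time, so a job
     * without a limit never gets an EndTime ~140 years out. */
    static void set_end_time(struct job_record *job_ptr)
    {
        uint32_t limit = job_ptr->time_limit;

        if (limit == NO_VAL_LIMIT) {
            if (job_ptr->part_ptr &&
                (job_ptr->part_ptr->max_time != NO_VAL_LIMIT))
                limit = job_ptr->part_ptr->max_time; /* partition limit */
            else
                limit = ONE_YEAR_MIN;                /* last-resort cap */
        }
        job_ptr->end_time = job_ptr->start_time + (time_t)limit * 60;
    }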
-
Danny Auble authored
-
Danny Auble authored
2.5.
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Identify node states on which HealthCheckProgram should be executed.
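If this refers to the HealthCheckNodeState option added to slurm.conf, usage looks roughly like the following; treat the exact parameter values as an assumption to be checked against the slurm.conf man page.

    # Run HealthCheckProgram only on nodes in the listed states,
    # e.g. only on idle nodes so running jobs are not disturbed:
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300
    HealthCheckNodeState=IDLE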
-
- 20 Dec, 2012 3 commits
-
Danny Auble authored
-
Danny Auble authored
slurm.conf with NodeAddrs, signals going to a step could be handled incorrectly.
-
Danny Auble authored
would have also killed the allocation.
-