Commits · fd5b0e566bcadb79832e464e8ac8beef062b7b09 · Manuel G. Marciani / ces_slurm_simulator

04 Jan, 2013 3 commits

mpi/mvapich: Don't set MPIRUN_PROCESSES by default · fd5b0e56

Mark A. Grondona authored Jan 22, 2012

The MPIRUN_PROCESSES variable set by the mpi/mvapich plugin probably
is not needed for most if not all recent versions of mvapich.
This environment variable also negatively affects job scalability
since its length is proportional to the number of tasks in a job.
In fact, for very large jobs, the increased environment size can
lead to failures in execve(2).

Since MPIRUN_PROCESSES *might* be required in some older versions of
mvapich, this patch disables the setting of that variable completely
only if SLURM_NEED_MVAPICH_MPIRUN_PROCESSES is not set in the job's
environment. (Thus, by default MPIRUN_PROCESSES is disabled, but
the old behavior may be restored by setting the environment variable
above)

fd5b0e56
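
The opt-in behavior described in this commit message can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the helper names `need_mpirun_processes` and `maybe_set_mpirun_processes` are hypothetical, and the sketch reads the calling process's environment with `getenv` for simplicity, whereas the message refers to the job's environment.

```c
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the Slurm source): the old behavior is
 * requested by the presence of SLURM_NEED_MVAPICH_MPIRUN_PROCESSES.
 */
static bool need_mpirun_processes(void)
{
	return getenv("SLURM_NEED_MVAPICH_MPIRUN_PROCESSES") != NULL;
}

/*
 * Export MPIRUN_PROCESSES only on request. `process_list` stands in for
 * the per-task string whose length grows with the number of tasks.
 * Returns 1 if the variable was set, 0 if it was skipped, -1 on error.
 */
static int maybe_set_mpirun_processes(const char *process_list)
{
	assert(process_list != NULL);

	if (!need_mpirun_processes())
		return 0;	/* default: keep the environment small */

	return (setenv("MPIRUN_PROCESSES", process_list, 1) == 0) ? 1 : -1;
}
```

With this shape, a job that exports nothing gets no MPIRUN_PROCESSES, while a job that sets SLURM_NEED_MVAPICH_MPIRUN_PROCESSES gets the previous behavior.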

Merge branch 'master' of https://github.com/SchedMD/slurm · b196f153
jette authored Jan 03, 2013

b196f153
Fix logic in hostset_create for invalid input · 33cb1e40
jette authored Jan 03, 2013

33cb1e40

03 Jan, 2013 16 commits
- Merge branch 'slurm-2.5' · c79de9f1
  Morris Jette authored Jan 03, 2013
```
Conflicts:
	META
	NEWS
```
  c79de9f1
- Start news for V2.5.2 · 64048706
  Morris Jette authored Jan 03, 2013
  
  64048706
- Update META for v2.5.1 tag · 00f20189
  Morris Jette authored Jan 03, 2013
  
  00f20189
- Correct bad year in date, time to get used to 2013... · aa6570c1
  Morris Jette authored Jan 03, 2013
  
  aa6570c1
- Add man pages for slurm_load_job_user and slurm_load_node_single · 5e87910b
  Nathan Yee authored Jan 03, 2013
  
  5e87910b
- Merge branch 'slurm-2.5' · a754551f
  Morris Jette authored Jan 03, 2013
  
  a754551f
- Disable job --exclusive option with select/serial plugin · 54a3a7db
  Morris Jette authored Jan 03, 2013
  
  54a3a7db
- Modify test for changed scontrol output and disable with select/serial · 108edd31
  Morris Jette authored Jan 03, 2013
  
  108edd31
- Correction to commit 9eb37af0 · da312784
  Morris Jette authored Jan 03, 2013
```
Command line argument would not be processed, but scontrol would
exit immediately
```
  da312784
- Merge branch 'slurm-2.5' · 374a8e14
  Morris Jette authored Jan 03, 2013
```
Conflicts:
	src/scontrol/scontrol.c
```
  374a8e14
- Merge branch 'master' of https://github.com/SchedMD/slurm · 6dc2db1f
  jette authored Jan 03, 2013
  
  6dc2db1f
- Exit scontrol command on stdin EOF · 9eb37af0
  Morris Jette authored Jan 03, 2013
  
  9eb37af0
- Fix typo · e3cba0ed
  Morris Jette authored Jan 03, 2013
  
  e3cba0ed
- Merge branch 'slurm-2.5' · 3d0d31a4
  Morris Jette authored Jan 02, 2013
  
  3d0d31a4
- Correct core reservation logic for use with select/serial plugin. · 0a06566a
  Morris Jette authored Jan 02, 2013
  
  0a06566a
- Tweak some tests for select/serial configuration · 3cc1e93d
  jette authored Jan 02, 2013
  
  3cc1e93d
02 Jan, 2013 1 commit

Revert commit · ac27d503

Morris Jette authored Jan 02, 2013

The original patch works fine to avoid cancelling a job when all
of its nodes go unresponsive, but I don't see any way to easily
address nodes coming back into service. We want to cancel jobs
that have some up nodes and some down nodes, but the nodes will
come back into service individually rather than all at once.

ac27d503

31 Dec, 2012 1 commit

Do not abort a job if ALL of its nodes are unavailable and not responding · b2c18ec1

jette authored Dec 31, 2012

The job will be aborted if any node is set DOWN while responding
or when "scontrol reconfig" is executed or the slurmctld restarts,
but it should respond better to global failures, like the network
going down.

b2c18ec1

29 Dec, 2012 3 commits
- Add Ralph Castain to contributor list · 17f5d8c7
  jette authored Dec 28, 2012
  
  17f5d8c7
- Merge pull request #35 from rhc54/master · 43847a4c
  Danny Auble authored Dec 28, 2012
```
Fix broken build when HAVE_READLINE is false
```
  43847a4c
- Add missing semicolon so code will compile when HAVE_READLINE is false · daf5b196
  Ralph Castain authored Dec 28, 2012
  
  daf5b196
28 Dec, 2012 8 commits
- Change RPC header check to identify non-supported RPCs · 7c0a32de
  Morris Jette authored Dec 28, 2012
```
There are far fewer RPCs not supported than are supported, so this
should be faster and easier to maintain.
```
  7c0a32de
- Merge branch 'slurm-2.5' · 474a20ac
  Morris Jette authored Dec 28, 2012
```
Conflicts:
	src/common/slurm_protocol_util.c
```
  474a20ac
- Support multiple versions of account gather RPCs · 5f087561
  Morris Jette authored Dec 28, 2012
  
  5f087561
- Support version 2.5 slurmctld with version 2.4 slurmd daemons. · 844f70a2
  Morris Jette authored Dec 28, 2012
  
  844f70a2
- Document and require topology/none on BlueGene systems · b64a9268
  Morris Jette authored Dec 21, 2012
  
  b64a9268
- Merge branch 'slurm-2.5' · b8927946
  jette authored Dec 27, 2012
  
  b8927946
- Improve srun error handling if launch plugin not found · 977c02cb
  jette authored Dec 27, 2012
  
  977c02cb
- Changes to more cleanly support out of memory condition · aef77900
  jette authored Dec 27, 2012
  
  aef77900
27 Dec, 2012 4 commits
- More work moving malloc failure handling lower in code · 012d667c
  jette authored Dec 27, 2012
  
  012d667c
- Remove redundant checks for malloc failures · 6708096b
  jette authored Dec 27, 2012
  
  6708096b
- Convert bitstring functions to treat malloc failure as fatal error · 5c5d8d74
  jette authored Dec 27, 2012
  
  5c5d8d74
- Convert list functions to log malloc failure and exit rather than return NULL · 617f6a38
  jette authored Dec 27, 2012
```
For Slurm, we always want to treat a malloc failure as fatal.
```
  617f6a38
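
The "malloc failure is fatal" policy from the commits above can be sketched like this. It is an illustration under stated assumptions, not Slurm's implementation: `alloc_or_die`, `struct node`, and `list_push` are hypothetical names, and the real code logs through Slurm's own logging facilities rather than `fprintf`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical allocation wrapper: log and exit on failure so that
 * callers never have to check for NULL.
 */
static void *alloc_or_die(size_t size)
{
	void *p;

	assert(size > 0);
	p = calloc(1, size);
	if (p == NULL) {
		fprintf(stderr, "fatal: calloc(%zu) failed\n", size);
		exit(EXIT_FAILURE);
	}
	return p;
}

/* Minimal singly linked list node, for illustration only. */
struct node {
	struct node *next;
	void *data;
};

/* Prepend an element; never returns NULL because allocation is fatal. */
static struct node *list_push(struct node *head, void *data)
{
	struct node *n = alloc_or_die(sizeof(*n));

	n->data = data;
	n->next = head;
	return n;
}
```

Pushing the failure handling into the allocator is what makes the per-call NULL checks in higher layers redundant, which matches the intent of the "Remove redundant checks for malloc failures" commit.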
22 Dec, 2012 2 commits
- Merge remote-tracking branch 'origin/slurm-2.5' · c22fc658
  Danny Auble authored Dec 21, 2012
  
  c22fc658
- Alter hostlist logic to allocate large grid dynamically instead of on stack · 8b5045f2
  Danny Auble authored Dec 21, 2012
  
  8b5045f2
21 Dec, 2012 2 commits

Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982

Morris Jette authored Dec 21, 2012

If sched/backfill starts a job with a QOS having NO_RESERVE and no job
time limit, start it with the partition time limit (or one year if the
partition has no time limit) rather than NO_VAL (140 year time limit).

If a standby job, which in this case has the NO_RESERVE flag set, is
submitted without a time limit, and is backfilled, it will get an
EndTime waaayyyy into the future.

JobId=99 Name=cmdll
   UserId=eckert(1043) GroupId=eckert(1043)
   Priority=12083 Account=sa QOS=standby
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
   StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

so I looked at the code in /src/plugins/sched/backfill:

                if (job_ptr->start_time <= now) {
                        int rc = _start_job(job_ptr, resv_bitmap);
                        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                job_ptr->time_limit = orig_time_limit;
                                job_ptr->end_time = job_ptr->start_time +
                                                    (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified
time limit, the job_ptr->time_limit is equal to NO_VAL when it hits
this code.

4652e982
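
The fallback described in this commit message can be sketched as a small helper. This is a hedged illustration, not the code in the backfill plugin: `effective_time_limit`, `compute_end_time`, and `SKETCH_NO_VAL` are hypothetical names, the sentinel value is an assumption made for the example, and the sketch assumes that a partition without a time limit is represented by the same sentinel.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumed "no value" sentinel for this sketch; time limits are in minutes. */
#define SKETCH_NO_VAL     0xfffffffeU
#define ONE_YEAR_MINUTES  (365U * 24U * 60U)

/*
 * Pick the limit to use when starting the job: the job's own limit if it
 * has one, else the partition limit, else one year.
 */
static uint32_t effective_time_limit(uint32_t job_limit, uint32_t part_limit)
{
	if (job_limit != SKETCH_NO_VAL)
		return job_limit;
	if (part_limit != SKETCH_NO_VAL)
		return part_limit;
	return ONE_YEAR_MINUTES;
}

/* Convert a limit in minutes into an end time; the sentinel is rejected. */
static time_t compute_end_time(time_t start, uint32_t limit_minutes)
{
	assert(limit_minutes != SKETCH_NO_VAL);
	return start + (time_t)limit_minutes * 60;
}
```

Resolving the limit before computing the end time avoids feeding the sentinel into the `start_time + (time_limit * 60)` arithmetic, which is what produced the far-future EndTime in the report above.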

Fix unused variable on frontend system · d934fe6e
Danny Auble authored Dec 21, 2012

d934fe6e