1. 02 Apr, 2014 2 commits
    • Minor tweak to scheduler cycle timing · 8fb863f9
      Morris Jette authored
      Decrease the maximum scheduler main loop run time from 10 secs to
      4 secs for improved performance.
      If running with sched/backfill, do not run through all jobs in the
      periodic scheduling loop, but only to the default depth. The
      backfill scheduler can go through more jobs anyway due to its
      ability to relinquish and recover locks.
      See bug 616
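      The change above is a policy tweak in slurmctld's main scheduling
      loop. Below is a minimal C sketch of that policy under stated
      assumptions: it is not Slurm's actual code, and MAX_SCHED_SECS,
      DEFAULT_QUEUE_DEPTH, job_t and try_sched are illustrative names.
      It caps the loop's wall-clock time and, when a backfill scheduler
      is also running, stops at a default queue depth.

      #include <stdbool.h>
      #include <stdio.h>
      #include <time.h>

      #define MAX_SCHED_SECS       4    /* was 10; lowered for responsiveness */
      #define DEFAULT_QUEUE_DEPTH  100  /* depth limit when backfill is active */

      typedef struct {
          int job_id;
      } job_t;

      /* Attempt to start one job; return true if resources were found. */
      static bool try_sched(const job_t *job)
      {
          printf("considering job %d\n", job->job_id);
          return false;   /* placeholder: no resources in this sketch */
      }

      static void main_sched_loop(job_t *queue, int queue_len, bool backfill_active)
      {
          time_t start = time(NULL);

          for (int i = 0; i < queue_len; i++) {
              /* Stop early so other RPCs are not locked out for too long. */
              if (time(NULL) - start >= MAX_SCHED_SECS)
                  break;
              /* With backfill, leave the deep queue walk to the backfill
               * thread, which can periodically release and reacquire locks. */
              if (backfill_active && i >= DEFAULT_QUEUE_DEPTH)
                  break;
              (void) try_sched(&queue[i]);
          }
      }

      int main(void)
      {
          job_t queue[] = { {101}, {102}, {103} };
          main_sched_loop(queue, 3, true);
          return 0;
      }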
    • launch/poe - fix network value · ad7100b8
      Morris Jette authored
      If a job step's network value was set by poe, either by executing
      poe directly or by srun launching poe, that value was not being
      propagated to the job step creation RPC and the network was not
      being set up for the proper protocol (e.g. mpi, lapi, pami, etc.).
      The previous logic would only work if the srun command line
      explicitly set the protocol using the --network option.
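      As a hedged illustration of this fix's intent (the types and
      functions below are hypothetical, not Slurm's real RPC structures):
      the network value chosen by poe should be carried into the step
      creation request whenever the srun command line did not set one
      with --network.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      typedef struct {
          char *network;   /* protocol/network spec, e.g. "mpi" */
      } step_create_req_t;

      /* Prefer an explicit srun --network value; otherwise fall back to
       * the value poe requested, so the step RPC always carries it. */
      static void set_step_network(step_create_req_t *req,
                                   const char *srun_network,
                                   const char *poe_network)
      {
          const char *value = srun_network ? srun_network : poe_network;
          free(req->network);
          req->network = value ? strdup(value) : NULL;
      }

      int main(void)
      {
          step_create_req_t req = { NULL };
          /* poe set the protocol, srun did not: the poe value must win. */
          set_step_network(&req, NULL, "mpi");
          printf("network = %s\n", req.network ? req.network : "(none)");
          free(req.network);
          return 0;
      }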
  2. 31 Mar, 2014 2 commits
  3. 26 Mar, 2014 1 commit
  4. 25 Mar, 2014 1 commit
  5. 24 Mar, 2014 1 commit
    • job array dependency recovery fix · fca71890
      Morris Jette authored
      When slurmctld restarted, it would not recover dependencies on
      job array elements and would just discard the dependency. This
      corrects the parsing problem so the dependency is recovered. The old
      code would print a message like this and discard it:
      slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*
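      To make the parsing issue concrete, here is an illustrative sketch
      (a hypothetical helper, not the slurmctld parser) that accepts a
      dependency job specification such as "47_*", where an "_*" or
      "_<index>" suffix names a job array element, instead of rejecting
      it as invalid:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      /* Parse "<jobid>", "<jobid>_<index>" or "<jobid>_*". Returns the
       * base job ID, or 0 on error. *array_task is set to the task index,
       * -1 for "_*" (every element), or -2 if no array suffix is present. */
      static unsigned long parse_dep_job(const char *spec, long *array_task)
      {
          char *end = NULL;
          unsigned long job_id = strtoul(spec, &end, 10);

          *array_task = -2;
          if (end == spec)
              return 0;               /* no job ID digits at all */
          if (*end == '\0')
              return job_id;          /* plain job ID */
          if (*end != '_')
              return 0;               /* unexpected trailing characters */
          end++;
          if (strcmp(end, "*") == 0) {
              *array_task = -1;       /* depend on every array element */
              return job_id;
          }
          *array_task = strtol(end, &end, 10);
          return (*end == '\0') ? job_id : 0;
      }

      int main(void)
      {
          long task;
          unsigned long id = parse_dep_job("47_*", &task);
          printf("job=%lu task=%ld\n", id, task);   /* job=47 task=-1 */
          return 0;
      }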
  6. 21 Mar, 2014 2 commits
    • NRT - Fix minor typos · 675b25ad
      Danny Auble authored
    • NRT - Fix issue with 1 node jobs. It turns out the network does need to · 440932df
      Danny Auble authored
      be set up for 1 node jobs.  Here are some of the reasons from IBM...
      
      1. PE expects it.
      2. For failover, if there was some challenge or difficulty with the
         shared-memory method of data transfer, the protocol stack might
         want to go through the adapter instead.
      3. For flexibility, the protocol stack might want to be able to transfer
         data using some variable combination of shared memory and adapter-based
         communication, and
      4. Possibly most important, for overall performance, it might be that
         bandwidth or efficiency (BW per CPU cycles) might be better using the
         adapter resources.  (An obvious case is large messages: it might
         require a lot fewer CPU cycles to program the DMA engines on the
         adapter to move data between tasks, rather than depend on the CPU
         to move the data with loads and stores, or page re-mapping -- and
         a DMA engine might actually move the data more quickly, if it's well
         integrated with the memory system, as it is in the P775 case.)
  7. 20 Mar, 2014 2 commits
  8. 19 Mar, 2014 2 commits
  9. 18 Mar, 2014 4 commits
  10. 17 Mar, 2014 1 commit
  11. 15 Mar, 2014 2 commits
  12. 14 Mar, 2014 3 commits
  13. 12 Mar, 2014 1 commit
  14. 11 Mar, 2014 6 commits
  15. 10 Mar, 2014 1 commit
  16. 08 Mar, 2014 2 commits
  17. 07 Mar, 2014 6 commits
  18. 06 Mar, 2014 1 commit