- 18 Apr, 2014 1 commit
Morris Jette authored
On switch resource allocation failure, free the partial allocation. The failure mode was that CAU could be allocated on some nodes but not others, and the CAU allocated on nodes and switches up to the failure point was never released.
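The shape of the fix is the usual allocate-all-or-unwind pattern: on the first per-node failure, walk back over whatever was already allocated. A minimal C sketch under that assumption, with hypothetical cau_alloc_on_node()/cau_free_on_node() helpers standing in for the switch plugin's real CAU calls (node 2 is made to fail so the unwind path runs):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the switch plugin's per-node CAU
     * calls; node 2 is made to fail so the unwind path runs. */
    static bool cau_alloc_on_node(int node)
    {
        if (node == 2)
            return false;
        printf("allocated CAU on node %d\n", node);
        return true;
    }

    static void cau_free_on_node(int node)
    {
        printf("released CAU on node %d\n", node);
    }

    /* Allocate CAU on every node or on none: on failure, release the
     * partial allocation instead of leaking it (the original bug). */
    static bool cau_alloc_all(const int *nodes, size_t cnt)
    {
        for (size_t i = 0; i < cnt; i++) {
            if (!cau_alloc_on_node(nodes[i])) {
                while (i-- > 0)             /* unwind nodes 0 .. i-1 */
                    cau_free_on_node(nodes[i]);
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        int nodes[] = { 0, 1, 2, 3 };
        if (!cau_alloc_all(nodes, 4))
            fprintf(stderr, "allocation failed; partial allocation freed\n");
        return 0;
    }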
- 08 Apr, 2014 5 commits
Morris Jette authored
Morris Jette authored
Morris Jette authored
Fix logic bugs in the SchedulerParameters option max_rpc_cnt. Scheduling would be delayed for job arrays, and backfill scheduling would be disabled, unless max_rpc_cnt > 0.
Danny Auble authored
Danny Auble authored
on Mixed state.
- 07 Apr, 2014 7 commits
Morris Jette authored
This largely reverts commit 0ec2af27, just to cut down on some logging.
Morris Jette authored
Morris Jette authored
Morris Jette authored
Danny Auble authored
in it. Signed-off-by: Danny Auble <da@schedmd.com>
Danny Auble authored
Danny Auble authored
- 05 Apr, 2014 6 commits
Morris Jette authored
Rather than treating invalid SchedulerParameters options as a fatal error, print an error and use the default value.
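In other words, the parser downgrades a fatal condition to an error message plus a fallback. A hypothetical C sketch of that pattern for one numeric option; parse_opt() and the default are illustrative, not Slurm's actual parsing code:

    #include <stdio.h>
    #include <stdlib.h>

    /* Parse one numeric option; on bad input, log an error and fall
     * back to the default instead of treating it as fatal. */
    static int parse_opt(const char *name, const char *value, int dflt)
    {
        char *end;
        long v = strtol(value, &end, 10);
        if (value[0] == '\0' || *end != '\0' || v < 0) {
            fprintf(stderr, "error: invalid SchedulerParameters %s=%s, "
                    "using default %d\n", name, value, dflt);
            return dflt;
        }
        return (int) v;
    }

    int main(void)
    {
        int max_rpc_cnt = parse_opt("max_rpc_cnt", "oops", 0);
        printf("max_rpc_cnt=%d\n", max_rpc_cnt);
        return 0;
    }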
Morris Jette authored
Disables job scheduling when there are too many pending RPCs
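As described, this amounts to a threshold check ahead of each scheduling pass. A minimal sketch, assuming a server-maintained count of pending RPCs and the max_rpc_cnt option mentioned above; the variable names are illustrative:

    #include <stdbool.h>
    #include <stdio.h>

    static int rpc_cnt;                /* pending RPCs, tracked by the server */
    static int max_rpc_cnt = 10;       /* SchedulerParameters: max_rpc_cnt */

    /* Skip a scheduling pass while the RPC backlog is above the
     * configured threshold; max_rpc_cnt == 0 disables the check. */
    static bool defer_sched(void)
    {
        return (max_rpc_cnt > 0) && (rpc_cnt > max_rpc_cnt);
    }

    int main(void)
    {
        rpc_cnt = 25;
        if (defer_sched())
            printf("deferring scheduling: %d pending RPCs > max_rpc_cnt=%d\n",
                   rpc_cnt, max_rpc_cnt);
        return 0;
    }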
Morris Jette authored
Morris Jette authored
This is related to deferring batch job scheduling when there are many requests pending.
Morris Jette authored
rather than 1 sec interval retries
Morris Jette authored
If a pthread_create() call fails, decrease the sleep before retrying from 1 sec to 0.1 sec.
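pthread_create() can fail transiently (e.g. EAGAIN when resources are briefly exhausted), so retrying after a short sleep is reasonable. A self-contained sketch of such a retry loop with the 0.1 sec delay the message describes; the 10-attempt cap is an assumption:

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        (void) arg;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        int rc = -1;

        /* Retry transient pthread_create() failures; sleep 0.1 sec
         * between attempts instead of a full second. */
        for (int attempt = 0; attempt < 10; attempt++) {
            rc = pthread_create(&tid, NULL, worker, NULL);
            if (rc == 0)
                break;
            usleep(100000);   /* 0.1 sec, down from 1 sec */
        }
        if (rc == 0)
            pthread_join(tid, NULL);
        else
            fprintf(stderr, "pthread_create: error %d\n", rc);
        return rc == 0 ? 0 : 1;
    }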
- 04 Apr, 2014 8 commits
Danny Auble authored
Danny Auble authored
Danny Auble authored
This also reverts commits 8cff3b08 and ced2fa3f.
Danny Auble authored
Danny Auble authored
slurmdbd plugin.
Danny Auble authored
Danny Auble authored
9368ff2d
Danny Auble authored
- 03 Apr, 2014 2 commits
Danny Auble authored
new associations were added since it was started.
Morris Jette authored
Permit multiple batch job submissions to be made for each run of the scheduler logic if the job submissions occur at nearly the same time. Bug 616.
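One way to read this is as coalescing: submissions that arrive close together are counted, and a single run of the scheduler logic services them all rather than one run per submission. A hypothetical sketch of that pattern, not the actual slurmctld code:

    #include <stdio.h>

    static int pending_submissions;    /* set by the submit RPC handler */

    /* Count submissions instead of running the scheduler once per job. */
    static void on_job_submit(void)
    {
        pending_submissions++;
    }

    /* One pass of the scheduler logic covers every submission that
     * arrived since the last pass. */
    static void scheduler_pass(void)
    {
        if (pending_submissions == 0)
            return;
        printf("one scheduler pass for %d submissions\n", pending_submissions);
        pending_submissions = 0;
    }

    int main(void)
    {
        on_job_submit();   /* three jobs submitted at nearly the same time */
        on_job_submit();
        on_job_submit();
        scheduler_pass();
        return 0;
    }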
- 02 Apr, 2014 2 commits
Morris Jette authored
Decrease the maximum scheduler main loop run time from 10 secs to 4 secs for improved performance. If running with sched/backfill, do not run through all jobs on the periodic scheduling loop, but only the default depth. The backfill scheduler can go through more jobs anyway due to its ability to relinquish and recover locks. See bug 616.
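Both knobs here are loop-exit conditions: a wall-clock budget and a queue depth. A minimal sketch using the 4-second budget from the message; the depth value and the loop body are placeholders:

    #include <stdio.h>
    #include <time.h>

    #define MAX_SCHED_SECS  4     /* down from 10 secs, per the commit */
    #define DEFAULT_DEPTH 100     /* illustrative default queue depth */

    int main(void)
    {
        time_t start = time(NULL);
        int queued = 100000;      /* pretend length of the job queue */

        for (int depth = 0; depth < queued; depth++) {
            /* sched/backfill configured: stop at the default depth */
            if (depth >= DEFAULT_DEPTH)
                break;
            /* wall-clock budget for the main scheduling loop */
            if (difftime(time(NULL), start) >= MAX_SCHED_SECS)
                break;
            /* ... try to start the job at this queue position ... */
        }
        printf("scheduling pass done\n");
        return 0;
    }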
Morris Jette authored
If a job step's network value is set by poe, either by executing poe directly or by srun launching poe, that value was not being propagated to the job step creation RPC, and the network was not being set up for the proper protocol (e.g. mpi, lapi, pami, etc.). The previous logic only worked if the srun execute line explicitly set the protocol using the --network option.
- 31 Mar, 2014 2 commits
Marcin Stolarek authored
Do not overcommit partitions with PreemptMode=off
Marcin Stolarek authored
Prevent preemption of jobs in partitions where PreemptMode=off
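Both fixes reduce to checking a partition's PreemptMode before acting on its jobs. A hypothetical C sketch of that guard; the enum and struct are illustrative, not Slurm's real types:

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative types; Slurm's real partition record differs. */
    typedef enum { PREEMPT_OFF, PREEMPT_REQUEUE, PREEMPT_SUSPEND } preempt_mode_t;

    struct part_record {
        const char *name;
        preempt_mode_t preempt_mode;
    };

    /* Jobs in a PreemptMode=off partition must never be preemption
     * candidates, and such partitions must not be overcommitted. */
    static bool job_preemptable(const struct part_record *part)
    {
        return part->preempt_mode != PREEMPT_OFF;
    }

    int main(void)
    {
        struct part_record debug = { "debug", PREEMPT_OFF };
        if (!job_preemptable(&debug))
            printf("skipping preemption in partition %s\n", debug.name);
        return 0;
    }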
- 26 Mar, 2014 1 commit
David Bigagli authored
processes.
- 25 Mar, 2014 1 commit
Danny Auble authored
- 24 Mar, 2014 1 commit
Morris Jette authored
When slurmctld restarted, it would not recover dependencies on job array elements and would just discard the dependency. This corrects the parsing problem so the dependency is recovered. The old code would print a message like this and discard it: slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*
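The failing input has the form afterany:<job_id>_*, where the _* suffix means every element of that job array. A small C sketch of a parser that accepts the wildcard, much simplified relative to slurmctld's real dependency parser:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Parse "afterany:47_*": job id 47, and "_*" marks a dependency
     * on every element of that job array. */
    static int parse_dep(const char *spec, long *job_id, int *whole_array)
    {
        const char *colon = strchr(spec, ':');
        if (colon == NULL)
            return -1;
        char *end;
        *job_id = strtol(colon + 1, &end, 10);
        if (end == colon + 1)
            return -1;                      /* no job id after the ':' */
        *whole_array = (strcmp(end, "_*") == 0);
        return (*end == '\0' || *whole_array) ? 0 : -1;
    }

    int main(void)
    {
        long job_id;
        int whole_array;
        if (parse_dep("afterany:47_*", &job_id, &whole_array) == 0)
            printf("depends on job %ld, whole array: %s\n",
                   job_id, whole_array ? "yes" : "no");
        return 0;
    }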
- 21 Mar, 2014 2 commits
Danny Auble authored
Danny Auble authored
be set up for 1 node jobs. Here are some of the reasons from IBM:
1. PE expects it.
2. For failover: if there were some challenge or difficulty with the shared-memory method of data transfer, the protocol stack might want to go through the adapter instead.
3. For flexibility: the protocol stack might want to transfer data using some variable combination of shared memory and adapter-based communication.
4. Possibly most important, for overall performance: bandwidth or efficiency (BW per CPU cycle) might be better using the adapter resources. An obvious case is large messages, where it might take far fewer CPU cycles to program the DMA engines on the adapter to move data between tasks than to have the CPU move the data with loads and stores or page re-mapping; a DMA engine might also move the data more quickly if it is well integrated with the memory system, as it is in the P775 case.
- 20 Mar, 2014 2 commits
Danny Auble authored
than you really have.
Danny Auble authored
doesn't get chopped off.