- 31 Oct, 2014 - 3 commits
-
-
Danny Auble authored
pack it this way so we will not change it in 15.08
-
Danny Auble authored
-
Danny Auble authored
This isn't that big of an issue for 14.03, but 14.11 added more to this string, which could overflow the buffer since sprintf is used instead of snprintf. Using xstrfmtcat fixes the issue and makes the code easier to read.
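A minimal sketch (not Slurm's code; append_fmt() below is a hypothetical stand-in for xstrfmtcat) of why the change matters: appending with sprintf() into a fixed-size buffer can overflow once the formatted string grows, while a helper that reallocates its destination cannot.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical xstrfmtcat-style helper: grow *dst and append formatted text. */
static void append_fmt(char **dst, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int need = vsnprintf(NULL, 0, fmt, ap);    /* length of the new text */
    va_end(ap);
    if (need < 0)
        return;

    size_t old = *dst ? strlen(*dst) : 0;
    char *tmp = realloc(*dst, old + (size_t) need + 1);
    if (!tmp)
        return;
    *dst = tmp;
    va_start(ap, fmt);
    vsnprintf(*dst + old, (size_t) need + 1, fmt, ap);
    va_end(ap);
}

int main(void)
{
    char fixed[32];
    /* Unsafe pattern: with the longer 14.11 string this would write past
     * the end of fixed[], so it is shown here but not executed. */
    /* sprintf(fixed, "nodes=%s cpus=%d", long_nodelist, 4096); */
    (void) fixed;

    /* Safe pattern: the destination grows to fit whatever is appended. */
    char *dyn = NULL;
    append_fmt(&dyn, "nodes=%s", "a-much-longer-nodelist-added-in-14.11");
    append_fmt(&dyn, " cpus=%d", 4096);
    printf("%s\n", dyn);
    free(dyn);
    return 0;
}
```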
-
- 27 Oct, 2014 - 2 commits
-
-
Danny Auble authored
are specified. This is a fix to commit b9cc5b31, which didn't account for mc_ptr->ntasks_per_core being initialized to INFINITE. Without this fix the node_cnt packed would be set to 1 in the user tools. This fixes bug 1148.
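A minimal sketch of the guard involved (the INFINITE constant and field meaning mirror Slurm's, but the code is illustrative, not the actual patch): ntasks_per_core defaults to INFINITE meaning "no limit", so treating that sentinel as a real per-core limit makes the estimated node count collapse to 1.

```c
#include <stdint.h>
#include <stdio.h>

#define INFINITE 0xffffffff    /* Slurm's "unset / no limit" sentinel */

/* Estimate how many nodes a job needs; illustrative only. */
static uint32_t estimate_node_cnt(uint32_t num_tasks, uint32_t ntasks_per_core,
                                  uint32_t cores_per_node)
{
    uint32_t tasks_per_node;

    if (ntasks_per_core == INFINITE || ntasks_per_core == 0)
        tasks_per_node = cores_per_node;    /* no per-core limit was set */
    else
        tasks_per_node = ntasks_per_core * cores_per_node;

    return (num_tasks + tasks_per_node - 1) / tasks_per_node;
}

int main(void)
{
    /* Without the INFINITE check, the huge sentinel value would make every
     * job appear to fit on a single node. */
    printf("%u\n", estimate_node_cnt(64, INFINITE, 16));    /* prints 4 */
    return 0;
}
```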
-
Morris Jette authored
bug 1207
-
- 24 Oct, 2014 - 1 commit
-
-
David Singleton authored
We've seen slurmctld crashes due to negative job array indices.
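A minimal sketch (not the actual slurmctld code) of the kind of validation that prevents this: reject array index tokens that parse as negative or out of range before they reach the controller's job-array structures.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Return true only for a cleanly parsed, non-negative index within max_index. */
static bool valid_array_index(const char *tok, long max_index)
{
    char *end = NULL;

    errno = 0;
    long v = strtol(tok, &end, 10);
    if (errno || end == tok || *end != '\0')
        return false;                       /* not a clean number */
    return (v >= 0) && (v <= max_index);    /* negative indices rejected here */
}

int main(void)
{
    printf("%d\n", valid_array_index("17", 1000));    /* 1 */
    printf("%d\n", valid_array_index("-3", 1000));    /* 0 */
    return 0;
}
```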
-
- 23 Oct, 2014 - 2 commits
-
-
Morris Jette authored
The previous patch should work in most cases, but this should work more reliably and the comment is clearer. Bug 1196.
-
Morris Jette authored
BGQ: Fix race condition when job fails due to hardware failure and is requeued. Previous code could result in slurmctld abort with NULL pointer. bug 1096
-
- 21 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Fix bug that prevented preservation of a job's GRES bitmap on slurmctld restart or reconfigure (the bug was introduced in 14.03.5, "Clear record of a job's gres when requeued", and only applies when GRES are mapped to specific files). Bug 1192.
-
- 17 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Correct tracking of licenses for suspended jobs on slurmctld reconfigure or restart. Previously licenses for suspended jobs were not counted, so the license count could be exceeded when those jobs were resumed.
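A minimal sketch, with assumed structures and names, of the idea behind the fix: when license usage is rebuilt after a reconfigure or restart, suspended jobs have to be counted alongside running ones, otherwise resuming them can push usage past the configured total.

```c
#include <stdio.h>

enum job_state { JOB_PENDING, JOB_RUNNING, JOB_SUSPENDED };

struct job {
    enum job_state state;
    int licenses_used;
};

static int rebuild_license_usage(const struct job *jobs, int njobs)
{
    int used = 0;
    for (int i = 0; i < njobs; i++) {
        /* Count RUNNING *and* SUSPENDED jobs; the old logic skipped
         * suspended ones, so resuming them could exceed the license total. */
        if (jobs[i].state == JOB_RUNNING || jobs[i].state == JOB_SUSPENDED)
            used += jobs[i].licenses_used;
    }
    return used;
}

int main(void)
{
    struct job jobs[] = { { JOB_RUNNING, 2 }, { JOB_SUSPENDED, 3 } };
    printf("licenses in use: %d\n", rebuild_license_usage(jobs, 2));    /* 5 */
    return 0;
}
```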
-
- 15 Oct, 2014 - 3 commits
-
-
Morris Jette authored
This fixes a race condition if the slurmctld needed to power up a node shortly after startup. Previously it would execute the ResumeProgram twice for affected nodes.
-
Morris Jette authored
Without this change, a node in the cloud that failed to power up would not have its NoResponding flag cleared, which would prevent its later use. The NoResponding flag is now cleared when the node is manually modified to PowerDown.
-
Morris Jette authored
If a batch job launch to the cloud fails, permit an unlimited number of job requeues. Previously the job would abort on the second launch failure.
-
- 14 Oct, 2014 - 2 commits
-
-
Danny Auble authored
with no way to get them out. This fixes bug 1134. It is advised that the prolog/epilog call xtprocadmin in the script instead of returning a non-zero exit code.
-
Brian Christiansen authored
The job could have been purged due to a short MinJobAge, and the trigger would then point to an invalid job. Bug #1144.
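A minimal sketch (types and names assumed, not Slurm's) of the hazard: a trigger that resolves its job only once can end up pointing at a record that MinJobAge has already purged; looking the job up by id when the trigger fires, and tolerating "not found", avoids touching freed memory.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct job_record { uint32_t job_id; };

/* Toy job table standing in for slurmctld's job list. */
static struct job_record jobs[] = { { 101 }, { 103 } };

static struct job_record *find_job(uint32_t job_id)
{
    for (size_t i = 0; i < sizeof(jobs) / sizeof(jobs[0]); i++) {
        if (jobs[i].job_id == job_id)
            return &jobs[i];
    }
    return NULL;    /* job may have been purged already */
}

static void fire_trigger(uint32_t job_id)
{
    /* Resolve the job at fire time instead of caching a pointer. */
    struct job_record *job = find_job(job_id);
    if (!job) {
        printf("trigger for job %u: job already purged, ignoring\n", job_id);
        return;
    }
    printf("trigger fired for job %u\n", job->job_id);
}

int main(void)
{
    fire_trigger(101);    /* still present */
    fire_trigger(102);    /* purged by a short MinJobAge */
    return 0;
}
```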
-
- 11 Oct, 2014 - 2 commits
-
-
Morris Jette authored
If a node is down, then permit setting its state to power down, which causes the SuspendProgram to run and sets the node state back to cloud.
-
Morris Jette authored
If a node is powered down, then do not power it up on slurmctld restart.
-
- 10 Oct, 2014 - 1 commit
-
-
Dorian Krause authored
This commit fixes a bug we observed when combining select/linear with gres. If an allocation was requested with a --gres argument, an srun execution within that allocation would stall indefinitely:

    -bash-4.1$ salloc -N 1 --gres=gpfs:100
    salloc: Granted job allocation 384049
    bash-4.1$ srun -w j3c017 -n 1 hostname
    srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

    debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
    debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
    debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
    debug3: checkpoint-dir=/home/user checkpoint_int=0
    debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
    debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
    debug: Configuration for job 384049 complete
    _pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
    _slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine. The reason is that in exclusive mode the code properly checks whether memory is a reserved resource in the _pick_step_nodes() function. This commit modifies the alternate code path to do the same.
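A minimal sketch, with assumed structure and field names, of the check this commit extends to the non-exclusive path: a step is only held back for "memory still in use" when memory is actually one of the allocation's tracked resources; otherwise the memory fields carry no meaning and the step may proceed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node_alloc {
    bool     mem_is_tracked;    /* is memory enforced as a resource here? */
    uint64_t mem_total_mb;      /* memory allocated to the job on this node */
    uint64_t mem_used_mb;       /* memory claimed by other steps of the job */
};

static bool step_can_use_node(const struct node_alloc *n, uint64_t mem_req_mb)
{
    if (!n->mem_is_tracked)     /* select/linear + gres case: don't block */
        return true;
    return n->mem_used_mb + mem_req_mb <= n->mem_total_mb;
}

int main(void)
{
    struct node_alloc n = { .mem_is_tracked = false,
                            .mem_total_mb = 62720, .mem_used_mb = 62720 };
    /* Without the check, the stale mem_used_mb value would block the step
     * forever; with it, the step is schedulable. */
    printf("schedulable: %d\n", step_can_use_node(&n, 1024));    /* 1 */
    return 0;
}
```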
-
- 09 Oct, 2014 - 1 commit
-
-
Morris Jette authored
Take more job options into consideration when estimating its node count.
-
- 07 Oct, 2014 - 2 commits
-
-
Danny Auble authored
a reservation.
-
Danny Auble authored
which they have access to (rather than preventing them from seeing ANY reservation). Backport from 14.11 commit 77c2bd25.
-
- 04 Oct, 2014 - 2 commits
-
-
Morris Jette authored
Do not cause it to be rebooted (powered up).
-
Morris Jette authored
This permits a sys admin to power down a node that should already be powered down, but avoids setting the NO_RESPOND bit in the node state. Doing so under some conditions prevented the node from being scheduled. The downside is that the node could possibly be allocated when it really isn't ready for use.
-
- 03 Oct, 2014 - 5 commits
-
-
Morris Jette authored
When a node's state is set to power_down, then execute SuspendProgram even if previously executed for that node.
-
Danny Auble authored
which protects against race conditions with the reservations.
-
Morris Jette authored
Fix logic determining when job configuration (i.e. running node power up logic) is complete. (Will look at better solution for v14.11).
-
Morris Jette authored
When a node's state is set to power_up, then execute ResumeProgram even if previously executed for that node.
-
Danny Auble authored
different times when reservations are using the associations that are being deleted.
-
- 30 Sep, 2014 - 2 commits
-
-
Morris Jette authored
Prior logic would always try to reserve nodes. This also slightly modifies the reservation create logic for non-bluegene systems.
-
Morris Jette authored
-
- 29 Sep, 2014 - 1 commit
-
-
Danny Auble authored
-
- 22 Sep, 2014 - 2 commits
-
-
David Bigagli authored
modified.
-
Dr. Oliver Fortmeier authored
-
- 19 Sep, 2014 - 1 commit
-
-
Danny Auble authored
to avoid overlapping erroneously. Previously you could get overlapping reservations if you asked for a core-based reservation and then a whole-node reservation. This fixes that.
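A minimal sketch, assuming simplified reservation records, of the overlap rule involved: a whole-node reservation conflicts with any other reservation on the same node, while two core-based reservations only conflict if their core sets on that node intersect.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct resv {
    bool     whole_node;    /* node-based reservation */
    uint64_t core_mask;     /* cores reserved on this node (if core-based) */
};

/* Do two reservations conflict on the same node? */
static bool resv_conflict(const struct resv *a, const struct resv *b)
{
    if (a->whole_node || b->whole_node)
        return true;                            /* whole node blocks everything */
    return (a->core_mask & b->core_mask) != 0;  /* core-level intersection */
}

int main(void)
{
    struct resv cores = { .whole_node = false, .core_mask = 0x0f };
    struct resv node  = { .whole_node = true,  .core_mask = 0 };

    /* Previously the whole-node request could be granted on top of the
     * core-based one; with this rule it is detected as an overlap. */
    printf("conflict: %d\n", resv_conflict(&cores, &node));    /* 1 */
    return 0;
}
```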
-
- 17 Sep, 2014 - 1 commit
-
-
Morris Jette authored
Test 3.11 was failing in some configurations without this, as the CPU count in the RPC was lower than the number of nodes in the required node list.
-
- 16 Sep, 2014 - 3 commits
-
-
David Bigagli authored
and abort the job.
-
Danny Auble authored
MaxNode limit.
-
Danny Auble authored
only needs to be called once.
-
- 11 Sep, 2014 - 1 commit
-
-
Danny Auble authored
warning.
-
- 09 Sep, 2014 - 1 commit
-
-
Morris Jette authored
Eliminate race condition in enforcement of the MaxJobCount limit for job arrays. The job count limit was checked for a job array before taking the slurmctld job locks. If new jobs were submitted between the test and the job array creation such that creating the array would exceed MaxJobCount, then a fatal error would result. Bug 1091.
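A minimal sketch (locking and names simplified and assumed, not slurmctld's actual code) of the race and its fix: the MaxJobCount check must happen, or at least be repeated, under the same lock that protects job creation, otherwise jobs submitted in between can push the array creation past the limit.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t job_write_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned job_count;
static unsigned max_job_count = 10000;

static bool submit_job_array(unsigned array_size)
{
    bool ok = false;

    pthread_mutex_lock(&job_write_lock);
    /* Check under the same lock that protects job creation, not before:
     * a check done earlier could be stale by the time records are made. */
    if (job_count + array_size <= max_job_count) {
        job_count += array_size;    /* create the array's job records */
        ok = true;
    }
    pthread_mutex_unlock(&job_write_lock);
    return ok;
}

int main(void)
{
    printf("%s\n", submit_job_array(100) ? "accepted" : "rejected");
    return 0;
}
```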
-