- 17 Feb, 2017 9 commits
-
Josh Samuelson authored
Make the code dealing with returns from _validate_tres_usage_limits_for_qos and _validate_tres_usage_limits_for_assoc use an enum.
-
Josh Samuelson authored
Make the return from _validate_tres_usage_limits_for_assoc an enum (tres_usage_t).
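As a rough illustration of what that change looks like in practice, here is a minimal, self-contained sketch: the enumerator and helper names below are hypothetical, not the actual definitions in the Slurm source, but they show the pattern of callers switching on a typed result instead of interpreting raw integers.

```c
#include <stdio.h>

/* Hypothetical enumerators for illustration; not the actual values defined
 * in the Slurm source. */
typedef enum {
	TRES_USAGE_OKAY,         /* all limits satisfied          */
	TRES_USAGE_CUR_EXCEEDS,  /* current usage exceeds a limit */
	TRES_USAGE_REQ_EXCEEDS   /* request alone exceeds a limit */
} tres_usage_t;

/* Stand-in for the real checks: compare usage and request against a limit. */
static tres_usage_t _validate_tres_usage(long usage, long request, long limit)
{
	if (usage > limit)
		return TRES_USAGE_CUR_EXCEEDS;
	if (request > limit)
		return TRES_USAGE_REQ_EXCEEDS;
	return TRES_USAGE_OKAY;
}

int main(void)
{
	/* Callers switch on a typed result instead of comparing raw ints. */
	switch (_validate_tres_usage(10, 5, 8)) {
	case TRES_USAGE_OKAY:
		puts("runnable");
		break;
	case TRES_USAGE_CUR_EXCEEDS:
	case TRES_USAGE_REQ_EXCEEDS:
		puts("held: limit exceeded");
		break;
	}
	return 0;
}
```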
-
Tim Wickberg authored
-
Tim Wickberg authored
Collapse begin_job_resv_check and fini_job_resv_check into job_resv_check, and call it directly from controller.c rather than including it in job_time_limit.
-
Dominik Bartkiewicz authored
Introduced by commit 059275f6. When the timer is triggered, releasing the locks means that job_ptr may point to an element that was deleted by a different thread in the meantime. Restructuring the code to advance the iterator first prevents this; the iterator itself does not have this issue, as the List structure manages the position during the sleep(). While here, move the reservation update handling outside of this loop to simplify operation. It does not need to piggy-back on the scan of the job_list, and switching to list_for_each should mitigate some of the performance loss from needing a second full pass. Bug 3414.
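For context, the underlying pattern is to grab the next element before any step that can let another thread delete the current one. The sketch below is a generic illustration only, using a hand-rolled list rather than Slurm's List API:

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal singly linked list; purely illustrative, not Slurm's List type. */
struct node {
	int job_id;
	struct node *next;
};

/* Simulates work that may drop locks and allow another thread to delete
 * list elements; here it simply frees the node it is given. */
static void maybe_delete(struct node *n)
{
	free(n);
}

int main(void)
{
	struct node *head = NULL;

	/* Build a small list of job ids. */
	for (int i = 1; i <= 3; i++) {
		struct node *n = malloc(sizeof(*n));
		n->job_id = i;
		n->next = head;
		head = n;
	}

	/* Advance the cursor *before* the step that may invalidate the
	 * current element, so a freed node is never dereferenced. */
	for (struct node *cur = head, *next; cur; cur = next) {
		next = cur->next;       /* grab the successor first */
		printf("checking job %d\n", cur->job_id);
		maybe_delete(cur);      /* cur may be gone after this */
	}
	return 0;
}
```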
-
Danny Auble authored
# Conflicts:
#	RELEASE_NOTES
-
Danny Auble authored
-
Tim Wickberg authored
-
Tim Wickberg authored
These were mis-calculated previously, and are internal implementation details that weren't meant to be exposed.
-
- 16 Feb, 2017 17 commits
-
Josh Samuelson authored
association GrpWall limit.
-
Danny Auble authored
limits.
-
Danny Auble authored
Bug 3476
-
Josh Samuelson authored
Bug 3476
-
Danny Auble authored
old ones. This is cosmetic only, no code change. Bug 3476
-
Brian Christiansen authored
-
Brian Christiansen authored
When an interactive allocation request comes to a controller, it fills in the job's resp_host from the incoming address. The controller then uses resp_host and the alloc_resp_port, sent from srun/salloc, to respond to the listening srun/salloc. In a federation, the origin cluster needs to pass its initial resp_host to the sibling clusters. Otherwise the siblings set resp_host to the host of the origin cluster itself, and the sibling clusters won't be able to contact the listening srun/salloc.
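A simplified sketch of the idea follows; the structure and field names are hypothetical, not Slurm's actual internals, but they show why a resp_host that is already set must be preserved when the request is forwarded to a sibling:

```c
#include <stdio.h>

/* Hypothetical, simplified job descriptor; field names are illustrative
 * only and do not match Slurm's internal structures exactly. */
struct job_desc {
	char resp_host[64];           /* where the waiting srun/salloc listens */
	unsigned short alloc_resp_port;
};

/* A controller fills in resp_host from the incoming connection's address,
 * but only if it has not already been set by the origin cluster. */
static void fill_resp_host(struct job_desc *job, const char *incoming_addr)
{
	if (job->resp_host[0] == '\0')
		snprintf(job->resp_host, sizeof(job->resp_host), "%s",
			 incoming_addr);
}

int main(void)
{
	struct job_desc job = { "", 60123 };

	/* Origin cluster: the incoming address is the srun/salloc host. */
	fill_resp_host(&job, "login01");

	/* Sibling cluster: the incoming address is the origin controller,
	 * so the already-set resp_host must be kept, not overwritten. */
	fill_resp_host(&job, "origin-ctld");

	printf("respond to %s:%hu\n", job.resp_host, job.alloc_resp_port);
	return 0;
}
```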
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Continuation of 0098c0c0, which removed the ability to submit a batch step within an existing allocation. Remove test15.17 since the functionality was removed. This was undocumented behavior, noted as being for LSF, which isn't supported. There is also the problem that if you submit two batch steps in an existing allocation, the job is killed and the node drained because the slurmd sees duplicate job ids.
-
Alejandro Sanchez authored
-
Isaac Hartung authored
-
Isaac Hartung authored
-
Morris Jette authored
Conflicts: src/common/slurmdbd_defs.c
-
Morris Jette authored
The code checked for a suffix of "k" twice rather than "k" and "K". The same problem existed for the "m" suffix.
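For illustration, a small standalone parser that handles both cases of the suffix; this is not the Slurm code the commit fixes, just the general technique of comparing the suffix case-insensitively:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Accept both lower- and upper-case unit suffixes by lower-casing the
 * character that follows the number before comparing it. */
static long long parse_size(const char *str)
{
	char *end = NULL;
	long long val = strtoll(str, &end, 10);

	switch (tolower((unsigned char)*end)) {
	case 'k':
		return val * 1024;
	case 'm':
		return val * 1024 * 1024;
	case '\0':
		return val;
	default:
		return -1;	/* unknown suffix */
	}
}

int main(void)
{
	printf("%lld %lld %lld\n",
	       parse_size("4k"), parse_size("4K"), parse_size("2M"));
	return 0;
}
```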
-
- 15 Feb, 2017 14 commits
-
Danny Auble authored
-
Danny Auble authored
Missed a sanity check.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
to have a buffer to fill in.
-
Danny Auble authored
-
Brian Christiansen authored
Fix federated will-run requests so that the same federated job id is used on each cluster when a will-run is done. Also fix it so that a federated will-run checks the start times of existing jobs.
-
Danny Auble authored
Bug 3472
-
Tim Wickberg authored
-
Tim Wickberg authored
regcomp() is not safe to use across a fork in older glibc versions. Reinitialize the keyvalue_re structure after the fork through an atfork() handler. Bug 3276.
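A minimal sketch of the approach, assuming a module-level compiled regex like the keyvalue_re mentioned above (the pattern and helper names here are made up for the example): the child-side handler registered with pthread_atfork() rebuilds the compiled state after fork().

```c
#include <pthread.h>
#include <regex.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the compiled regex referred to in the commit message. */
static regex_t keyvalue_re;
static const char *keyvalue_pattern = "^([^=]+)=(.*)$";

static void _compile_keyvalue_re(void)
{
	regcomp(&keyvalue_re, keyvalue_pattern, REG_EXTENDED);
}

/* Child-side atfork handler: discard the state inherited from the parent
 * and recompile, since the compiled state is not safe to reuse across
 * fork() with older glibc versions. */
static void _atfork_child(void)
{
	regfree(&keyvalue_re);
	_compile_keyvalue_re();
}

int main(void)
{
	_compile_keyvalue_re();
	pthread_atfork(NULL, NULL, _atfork_child);

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: keyvalue_re was reinitialized by the handler. */
		int rc = regexec(&keyvalue_re, "Name=value", 0, NULL, 0);
		printf("child match rc=%d\n", rc);
		return 0;
	}
	waitpid(pid, NULL, 0);
	return 0;
}
```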
-
Brian Christiansen authored
The problem was that the fed_mgr sets the job id before it is validated, so that the id can be the same when submitting to all siblings. _validate_job_desc() checks that only the slurm_user or root can specify job ids. In a federation, _validate_job_desc() no longer needs to validate the job id, since specifying a specific job id is disabled in federations.
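As a rough sketch of the resulting check (all names below are hypothetical stand-ins, not Slurm's actual symbols), the idea is that a job id assigned by the federation manager before validation must not be rejected as a user-specified id:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical constants for the example only. */
#define ROOT_UID   0
#define SLURM_UID  990          /* assumed SlurmUser uid for this sketch */

struct job_desc {
	unsigned job_id;        /* 0 means "not specified by the user" */
};

/* Only root/SlurmUser may pick a job id; an id already assigned by the
 * federation manager before validation must not trip this check. */
static bool job_id_allowed(const struct job_desc *job, unsigned uid,
			   bool set_by_fed_mgr)
{
	if (job->job_id == 0 || set_by_fed_mgr)
		return true;
	return (uid == ROOT_UID || uid == SLURM_UID);
}

int main(void)
{
	struct job_desc job = { 1234 };
	printf("user-picked id, normal user: %d\n",
	       job_id_allowed(&job, 1001, false));   /* rejected */
	printf("fed_mgr-assigned id:         %d\n",
	       job_id_allowed(&job, 1001, true));    /* accepted */
	return 0;
}
```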
-
Brian Christiansen authored
Since a federation determines where a job originated by looking at the job's id, the user would have to give a job id that matches the origin cluster. Another option would be to submit jobs with specified job ids as non-federated jobs, but for now this is being disabled.
-
Brian Christiansen authored
-
Morris Jette authored
-