Commits · f91d135b3ffd6699cc9c93c5058645d96cd85c4b · Manuel G. Marciani / ces_slurm_simulator

20 Jul, 2016 8 commits
- Merge branch 'slurm-16.05' · f91d135b
  Morris Jette authored Jul 20, 2016
  
  f91d135b
- Prevent slurmctld abort on kill of job waiting node reboot · 1aa7af7d
  Morris Jette authored Jul 20, 2016
```
Prevent slurmctld abort if job is killed or requeued while waiting for
    reboot of its allocated compute nodes. The _wait_boot() would
    reference job_ptr->node_bitmap, which would be NULL.
```
  1aa7af7d
- Merge remote-tracking branch 'origin/slurm-16.05' · 5a6488d2
  Danny Auble authored Jul 20, 2016
  
  5a6488d2
- Fixed race condition in PMIx Fence logic · cf6733be
  Boris Karasev authored Jul 20, 2016
```
Bug 2908
```
  cf6733be
- Continuation of commit 65b4f283 · 71ddc0a5
  Danny Auble authored Jul 20, 2016
  
  71ddc0a5
- Prevent segfault when attempting to cleanup a SLURM_PENDING_STEP. · 3b914e5b
  Tim Wickberg authored Jul 20, 2016
```
Step hasn't been assigned resources, so the select_jobinfo struct
hasn't yet been populated. Calling select_g_step_finish will dereference
it, causing a segfault.

Bug 2922.
```
  3b914e5b
- Add burst buffer job array test · 7dd26078
  Morris Jette authored Jul 20, 2016
  
  7dd26078
- Merge branch 'slurm-16.05' · 74cd7acc
  Morris Jette authored Jul 20, 2016
  
  74cd7acc
19 Jul, 2016 13 commits

Add routing queue info to Slurm FAQ web page · f88119ff
Morris Jette authored Jul 19, 2016

f88119ff

Show running jobs from all to be deleted clusters · d04f2d21

Brian Christiansen authored Jul 19, 2016

Allow "sacctmgr delete cluster" to show running jobs on multiple
clusters when attempting to delete clusters with running jobs.

Freeing "object" when jobs were already found on other clusters
prevents _check_jobs_before_remove_assoc() from selecting jobs from the
cluster because "cluster_name" will be NULL.

d04f2d21

Fix small mem leak when deleting clusters from db · eb18e53a
Brian Christiansen authored Jul 19, 2016

eb18e53a
Fix invalid read when deleting cluster from db. · 401886ab
Brian Christiansen authored Jul 19, 2016
```
Happens when there are running jobs on the cluster.
```
401886ab
Don't call extra func unless there is work to do · 6aae9b76
Brian Christiansen authored Jul 19, 2016
```
_process_running_jobs_result() only does something if there were results
returned.
```
6aae9b76
Fix small mem leak when job fails to load state · 6c4df688
Brian Christiansen authored Jul 19, 2016

6c4df688
Fix some typos in comments and logs · 5a45503c
Gennaro Oliva authored Jul 19, 2016

5a45503c
Merge branch 'slurm-16.05' · 6977482d
Morris Jette authored Jul 19, 2016

6977482d

Improve partition AllowGroups caching · 7e381982

Morris Jette authored Jul 19, 2016

If the user is not allowed to use the partition,
    then do not check that user's group access again for 5 seconds.
bug 2913

7e381982

Improve partition AllowGroups caching · 98dc38b2

Morris Jette authored Jul 19, 2016

Improve partition AllowGroups caching. Update the table of UIDs permitted to
    use a partition based upon its AllowGroups configuration parameter as new
    valid UIDs are found rather than looking up that user's group information
    for every job they submit, which can involve considerable overhead for
    some systems.
bug 2913

98dc38b2
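
The two AllowGroups caching commits above (7e381982, 98dc38b2) can be pictured with a minimal Python sketch. It is not Slurm's C code: the class, method, and parameter names are hypothetical, and it reads the 5-second rule as applying to a user whose group check has just failed, which is an interpretation of the commit message. Only the 5-second window and the idea of growing a table of permitted UIDs come from the messages themselves.

```python
import time

NEGATIVE_CACHE_SECONDS = 5  # window taken from the commit message


class PartitionAccessCache:
    """Hypothetical cache of UIDs permitted to use one partition."""

    def __init__(self, lookup_groups, allowed_groups, clock=time.monotonic):
        self._lookup_groups = lookup_groups   # expensive call: uid -> group names
        self._allowed_groups = set(allowed_groups)
        self._clock = clock
        self._allowed_uids = set()            # grows as valid UIDs are found
        self._denied_at = {}                  # uid -> time of last failed check

    def is_allowed(self, uid):
        # Known-good UID: answer without any group lookup.
        if uid in self._allowed_uids:
            return True
        now = self._clock()
        denied = self._denied_at.get(uid)
        # Recently denied UID: skip the lookup until the window expires.
        if denied is not None and now - denied < NEGATIVE_CACHE_SECONDS:
            return False
        # Otherwise pay for one lookup and remember the outcome.
        if self._allowed_groups & set(self._lookup_groups(uid)):
            self._allowed_uids.add(uid)
            self._denied_at.pop(uid, None)
            return True
        self._denied_at[uid] = now
        return False
```

The clock is injectable so that expiry of the window can be exercised without sleeping. A permitted user costs one lookup in total, and a denied user costs at most one lookup per window.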

Merge branch 'slurm-16.05' · c4835a73
Morris Jette authored Jul 18, 2016

c4835a73

Minimize preempted jobs · b9f17b18

Morris Jette authored Jul 18, 2016

Minimize preempted jobs for configurations with multiple jobs per node.
  Previous logic would preempt every job on node allocated to pending
  job.
bug 2906

b9f17b18

gres-flags=enforce-binding fix · 5df8509f

Morris Jette authored Jul 18, 2016

Fix for core selection with job --gres-flags=enforce-binding option.
    Previous logic would in some cases allocate a job zero cores, resulting in
    slurmctld abort.
bug 2808

5df8509f

18 Jul, 2016 4 commits
- Improve GRES log format · b5e54e11
  Morris Jette authored Jul 18, 2016
```
Add some indentation so that GRES topology-specific information
  logged is more readable.
```
  b5e54e11
- Merge branch 'slurm-16.05' · 5115dabf
  Morris Jette authored Jul 18, 2016
  
  5115dabf
- Select/cons_res memory corruption fix · c06db0de
  Morris Jette authored Jul 18, 2016
```
A job allocation selecting nodes and no cores/CPUs could write
  off the end of arrays and corrupt memory. Now to figure out how
  the logic reached this point in the first place.
bug 2808
```
  c06db0de
- Add SLUGM16 dinner info · 6dc074c8
  Morris Jette authored Jul 18, 2016
  
  6dc074c8
16 Jul, 2016 5 commits

Add SLURM_PENDING_STEP id so it won't be confused with SLURM_EXTERN_CONT. · 0c7bd6d0

Danny Auble authored Jul 15, 2016

In commit b8190e5d many places that were meant to be pending step ids
were changed to be extern_step id.  The main problem was when we came up
with the idea of the extern step we reused -1 (INFINITE) for the id.  So
pending steps also appeared to be extern steps as well.  Hopefully this
fixes the situation.

Bug 2907

0c7bd6d0
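
The id collision described in commit 0c7bd6d0 can be shown with a small Python sketch. The constant names, the helper function, and the value chosen for the new pending id are all hypothetical and are not taken from the Slurm headers. The only detail from the commit message is that the extern step reused -1 (INFINITE).

```python
INFINITE = 0xFFFFFFFF  # -1 as an unsigned 32-bit value, per the commit message

# Before the fix: one sentinel doing two jobs (hypothetical names).
OLD_EXTERN_STEP = INFINITE
OLD_PENDING_STEP = INFINITE

# After the fix: a dedicated value for pending steps. The number is made up
# for this sketch.
EXTERN_STEP = INFINITE
PENDING_STEP = 0xFFFFFFFD


def describe_step(step_id, extern_id, pending_id):
    """Classify a step id, testing the extern sentinel first."""
    if step_id == extern_id:
        return "extern"
    if step_id == pending_id:
        return "pending"
    return "regular"
```

With the shared sentinel, `describe_step(OLD_PENDING_STEP, OLD_EXTERN_STEP, OLD_PENDING_STEP)` classifies a pending step as an extern step. With distinct values, the same call on `PENDING_STEP` reaches the second comparison and is classified correctly.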

Merge branch 'slurm-16.05' · b8705d7f
Morris Jette authored Jul 15, 2016

b8705d7f
Remove vestigial comment · 71800937
Morris Jette authored Jul 15, 2016

71800937

Move startup of power save thread · fb8e3558

Morris Jette authored Jul 15, 2016

Start power save thread only after the partition information is read
  in order to avoid trying to interpret the SuspendExcParts configuration
  information before the partition information is available, which would
  result in a slurmctld abort.

fb8e3558

Prevent slurmctld race condition · c7cae55b

Morris Jette authored Jul 15, 2016

Do not try to access part_list variable (partition list pointer)
  if not yet initialized. Return NULL pointer rather than aborting
  with NULL pointer.

c7cae55b

15 Jul, 2016 10 commits
- Fix spelling of hierarchy in comments · 4f3a0a02
  Tim Wickberg authored Jul 15, 2016
  
  4f3a0a02
- Do not schedule powered down nodes in FAILED state · 310de98d
  Jacek Budzowski authored Jul 15, 2016
```
bug 2900
```
  310de98d
- Remove unnecessary test for super user in regression test · 2a7d01a5
  Nicolas Joly authored Jul 15, 2016
  
  2a7d01a5
- Cleanup generated files if test cannot run due to inappropriate conditions. · b9abe288
  Nicolas Joly authored Jul 15, 2016
  
  b9abe288
- Fix user message in test1.32 to report correct signal USR2. · 7f98f056
  Nicolas Joly authored Jul 15, 2016
  
  7f98f056
- Update LRZ site report in SLUG16 agenda · 48dc2bec
  Morris Jette authored Jul 15, 2016
  
  48dc2bec
- burst_buffer.conf document - Remove info about old release · d371bed5
  Morris Jette authored Jul 15, 2016
  
  d371bed5
- burst_buffer/cray newly found buffer timeout · 6de710be
  Morris Jette authored Jul 15, 2016
```
Don't register newly found buffers that are less than OtherTimeout
  old to avoid possible duplicates.
```
  6de710be
- burst_buffer/cray race condition · 91bc07b8
  Morris Jette authored Jul 15, 2016
```
This hardens the code with respect to a race condition if the
  slurmctld restarts and a burst buffer creation for a job is
  in progress. Eliminate the possibility of a duplicate job
  allocation record.
```
  91bc07b8
- burst_buffer/cray: Move some logic around for better clarity · 6669f1f1
  Morris Jette authored Jul 15, 2016
```
No change in functionality, just moved function call and added
  comment
```
  6669f1f1