Commits · a2638db41398beccdbefbc10037d8a7e6fb18885 · Manuel G. Marciani / ces_slurm_simulator

20 Jan, 2017 20 commits
- Refactor out common fed_mgr_job_revoke function · a2638db4
  Brian Christiansen authored Dec 05, 2016
  
- Add helper function for determining fed job · 847bb657
  Brian Christiansen authored Nov 29, 2016
  
- Save fed job details to state · 77af869c
  Brian Christiansen authored Nov 29, 2016
  
- Init resp inside of _send_recv_msg · b214caa4
  Brian Christiansen authored Nov 29, 2016
```
like it does in slurm_send_recv_msg. The resp needs to be inited before
_check_send is called.
```
- Add scheduling of federated batch jobs · 616de6f3
  Brian Christiansen authored Nov 22, 2016
```
Sibling jobs have to get a lock from the origin cluster in order to attempt to
allocate nodes. If a sibling gets the allocation, it lets the origin cluster
know, and the origin cluster sets the other sibling jobs, if any, into a REVOKED
state and purges them. If the sibling job is the only sibling, it assumes the
lock and attempts to start the job to avoid extra communication. If nodes can't
be allocated, the job releases the lock for another cluster to try.
```
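The lock handshake described in this commit message can be sketched roughly as below. This is an illustration only, not the fed_mgr code: the struct `fed_job`, the `sib_state` values, and every function name are hypothetical, node allocation is reduced to a `can_alloc` flag, and the only-sibling shortcut is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SIBLINGS 8
#define NO_OWNER (-1)

/* Hypothetical sibling states for this sketch; not Slurm's real values. */
enum sib_state { SIB_PENDING, SIB_RUNNING, SIB_REVOKED };

/* Hypothetical origin-side record of one federated job. */
struct fed_job {
	int lock_owner;			/* cluster index holding the lock, or NO_OWNER */
	size_t sibling_cnt;		/* number of sibling clusters */
	enum sib_state sib[MAX_SIBLINGS];
};

/* A sibling asks the origin for the lock; granted only if nobody else holds it. */
static bool fed_job_lock(struct fed_job *job, int cluster)
{
	assert(cluster >= 0 && (size_t)cluster < job->sibling_cnt);
	if (job->lock_owner != NO_OWNER && job->lock_owner != cluster)
		return false;
	job->lock_owner = cluster;
	return true;
}

/* Give the lock back so another cluster can try. */
static void fed_job_unlock(struct fed_job *job, int cluster)
{
	if (job->lock_owner == cluster)
		job->lock_owner = NO_OWNER;
}

/* The lock holder reports a successful start: revoke every other sibling. */
static bool fed_job_start(struct fed_job *job, int cluster)
{
	if (job->lock_owner != cluster)
		return false;
	for (size_t i = 0; i < job->sibling_cnt; i++)
		job->sib[i] = ((int)i == cluster) ? SIB_RUNNING : SIB_REVOKED;
	return true;
}

/* One scheduling attempt by a sibling; can_alloc stands in for node allocation. */
static bool sibling_try_schedule(struct fed_job *job, int cluster, bool can_alloc)
{
	if (!fed_job_lock(job, cluster))
		return false;
	if (!can_alloc) {
		fed_job_unlock(job, cluster);
		return false;
	}
	return fed_job_start(job, cluster);
}
```

With three siblings, a call such as `sibling_try_schedule(&job, 1, true)` leaves sibling 1 running and marks siblings 0 and 2 revoked, while a failed allocation leaves the lock free for the next cluster.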
- Only [un]pack sib_msg data_buffer if one exists · 0f855ef3
  Brian Christiansen authored Nov 22, 2016
  
- Add JOB_REVOKED state · 7427659e
  Brian Christiansen authored Nov 22, 2016
```
for fed sibling jobs that don't start.
```
- Extract helper function to get fed cluster by id · 22bbba85
  Brian Christiansen authored Nov 22, 2016
  
- Refactor protocol funcs in fed_mgr · 88739c7c
  Brian Christiansen authored Nov 22, 2016
  
- Make state in db an unsigned 32bit · edd2e602
  Brian Christiansen authored Nov 22, 2016
```
To handle JOB_REVOKED
```
- Fit on one line. · e98948d6
  Brian Christiansen authored Nov 21, 2016
  
- Display dbd persist conn's cluster name in threads · c88bd9a7
  Brian Christiansen authored Nov 21, 2016
  
- Only free sib_msg->data if not NULL · f87bb6e0
  Brian Christiansen authored Nov 08, 2016
  
- Put fed rpc logic in routines · cbf02772
  Brian Christiansen authored Nov 07, 2016
  
- Fix unpack_bit_str_hex() problem from commit c40f8809. · e4ad6adb
  Tim Wickberg authored Jan 20, 2017
```
Overwriting _size leads to the bitmap being the wrong length.
```
- Change strncpy to strlcpy to ensure null termination in _connect_srun_cr(). · 37ca2731
  Tim Wickberg authored Jan 19, 2017
```
CID 160063.
```
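For context, strncpy() leaves the destination without a terminating NUL when the source is at least as long as the buffer, which is the case strlcpy() avoids. Because strlcpy() is not available in every libc, the sketch below uses a hypothetical stand-in named `bounded_copy` with the same contract. It is not the code from _connect_srun_cr().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Portable stand-in for strlcpy(); the name bounded_copy is hypothetical.
 * Copies at most dst_size - 1 bytes and always NUL-terminates when
 * dst_size > 0. Returns strlen(src) so callers can detect truncation
 * by comparing the result against dst_size.
 */
static size_t bounded_copy(char *dst, const char *src, size_t dst_size)
{
	size_t src_len = strlen(src);

	assert(dst != NULL);
	if (dst_size > 0) {
		size_t n = (src_len < dst_size - 1) ? src_len : dst_size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return src_len;
}
```

A return value greater than or equal to the buffer size tells the caller that the copy was truncated, which is useful for fixed-size fields such as socket paths.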
- Fix potential memory leak in unpack_bit_str_hex(). · c40f8809
  Tim Wickberg authored Jan 19, 2017
```
The safe_unpackstr_xmalloc() call could jump to unpack_error,
which would leak memory allocated for bitmap. Allocate only after
the unpackstr has succeeded.

Coverity 160092 (+ more due to macro expansion leading to repeats).
```
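The allocate-after-unpack ordering described in this message can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: `struct reader`, `read_str`, and `unpack_hex_bits` are hypothetical stand-ins for Slurm's buffer type and unpack macros, and the wire format (one length byte followed by hex digits) is invented.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical input cursor standing in for a packed message buffer. */
struct reader {
	const uint8_t *data;
	size_t len;
	size_t pos;
};

/* Read a 1-byte length followed by that many bytes into a new C string. */
static int read_str(struct reader *r, char **out)
{
	size_t n;
	char *s;

	*out = NULL;
	if (r->pos >= r->len)
		return -1;
	n = r->data[r->pos++];
	if (n > r->len - r->pos)
		return -1;		/* truncated input */
	s = malloc(n + 1);
	if (!s)
		return -1;
	memcpy(s, r->data + r->pos, n);
	s[n] = '\0';
	r->pos += n;
	*out = s;
	return 0;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	if (c >= 'A' && c <= 'F') return c - 'A' + 10;
	return -1;
}

/* Decode a hex string into a newly allocated bit array of nbits bits. */
static int unpack_hex_bits(struct reader *r, size_t nbits, uint8_t **bitmap)
{
	char *hex = NULL;
	uint8_t *bits;
	size_t hex_len;

	assert(r != NULL && bitmap != NULL);
	*bitmap = NULL;
	if (read_str(r, &hex) != 0)
		return -1;		/* bitmap not allocated yet, so nothing leaks */

	/* Allocate the bitmap only after the string unpack has succeeded. */
	bits = calloc(nbits / 8 + 1, 1);
	if (!bits) {
		free(hex);
		return -1;
	}
	hex_len = strlen(hex);
	for (size_t i = 0; i < hex_len; i++) {
		/* The last character holds the lowest four bits. */
		int v = hex_val(hex[hex_len - 1 - i]);

		if (v < 0) {
			free(bits);
			free(hex);
			return -1;
		}
		for (size_t b = 0; b < 4; b++) {
			size_t bit = 4 * i + b;

			if (bit < nbits && (v & (1 << b)))
				bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
		}
	}
	free(hex);
	*bitmap = bits;
	return 0;
}
```

Because the early return happens before any bitmap exists, the error path needs no cleanup, which is the property the commit message is after.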
- Fix _cpu_freq_freqspec_num() to return the highest frequency when a greater value is requested. · f40e1c01
  Tim Wickberg authored Jan 19, 2017
```
Code would return NO_VAL if the requested frequency was greater than
the highest available.

While here, improve the errors printed to the slurmstepd log location,
and change the initial check against nfreq to test for zero (rather
than (uint8_t) NO_VAL, which it could never be set to).

Bug 3335.
```
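The clamping behavior described here might look roughly like the sketch below. It is not the body of _cpu_freq_freqspec_num(): the function name `pick_freq` and the `FREQ_NONE` sentinel are hypothetical, and the sketch assumes the available frequencies are sorted in ascending order and that a request between two entries resolves to the next higher one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FREQ_NONE UINT32_MAX	/* hypothetical "no value" sentinel */

/*
 * Pick an available frequency for a numeric request.
 * avail must be sorted ascending (an assumption of this sketch).
 */
static uint32_t pick_freq(const uint32_t *avail, size_t nfreq,
			  uint32_t requested)
{
	/* Test the count for zero instead of comparing it to a sentinel. */
	if (nfreq == 0)
		return FREQ_NONE;
	assert(avail != NULL);

	for (size_t i = 0; i < nfreq; i++) {
		if (avail[i] >= requested)
			return avail[i];
	}
	/* Request exceeds every entry: fall back to the highest one. */
	return avail[nfreq - 1];
}
```

With entries 1200000, 1800000, and 2400000, a request for 3000000 yields 2400000 instead of the sentinel.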
- Add the ability to purge transactions from the database. · a778826c
  Danny Auble authored Jan 19, 2017
```
Bug 2508
```
- Remove outdated link in auth/munge's plugin name. · 3fd56ff3
  Tim Wickberg authored Jan 19, 2017
  
19 Jan, 2017 15 commits
- Don't charge job for node power up time · 7e9d9af1
  Morris Jette authored Jan 19, 2017
```
If a job is allocated nodes which are powered down, then reset the job start
time when the nodes are ready and do not charge the job for power up time.
bug 3411
```
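The accounting effect of resetting the start time can be shown with a small sketch. The struct `job_rec` and the three functions are hypothetical and only model the idea in this message. They are not slurmctld code.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical job record holding only what this sketch needs. */
struct job_rec {
	time_t start_time;		/* time from which usage is charged */
	bool power_up_pending;		/* allocated nodes are still booting */
};

/* Allocation: start the clock, remembering whether nodes must power up. */
static void job_allocate(struct job_rec *job, time_t now,
			 bool nodes_powered_down)
{
	assert(job != NULL);
	job->start_time = now;
	job->power_up_pending = nodes_powered_down;
}

/* Nodes became ready: restart the clock so power up time is not charged. */
static void job_nodes_ready(struct job_rec *job, time_t now)
{
	assert(job != NULL);
	if (job->power_up_pending) {
		job->start_time = now;
		job->power_up_pending = false;
	}
}

/* Seconds charged to the job when it ends at the given time. */
static double job_charged_seconds(const struct job_rec *job, time_t end)
{
	return difftime(end, job->start_time);
}
```

A job allocated at t=1000 whose nodes become ready at t=1300 and which ends at t=1900 is charged 600 seconds in this model, not 900.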
- cosmetic changes · c9ca3c20
  Morris Jette authored Jan 19, 2017
```
No changes in logic
```
- Testsuite: change get_my_nuid to get_my_user_name and use consistently. · 9d269d7c
  Isaac Hartung authored Jan 19, 2017
```
Modify get_my_user_name to return a FAILURE if the user_name cannot
be determined, since that case was handled inconsistently.
```
- Add on to commit 8464d75f to handle other things that needed to be freed in · 0d094df0
  Danny Auble authored Jan 19, 2017
```
the function.
```
- Remove duplicate debug message. · 46d30dcd
  Danny Auble authored Jan 19, 2017
  
- Simplify code by using the same correct variable instead of handling it different ways. · 8950e9e8
  Danny Auble authored Jan 19, 2017
  
- If a stepd can't kill a job because it hasn't started, abort after UnkillableStepTimeout. · f18390e8
  Danny Auble authored Jan 19, 2017
```
Bug 3312 and 3332
```
- Break the ending of the step out into a function. · 8464d75f
  Danny Auble authored Jan 19, 2017
  
- Change the argument to step_terminate_monitor_start to stepd_step_rec_t instead of · 023e72e7
  Danny Auble authored Jan 19, 2017
```
jobid and stepid to be used if needing to kill job because it hadn't started yet.
```
- node_features/knl_cray: add UME monitoring · 0c596661
  Morris Jette authored Jan 19, 2017
```
Add logic to monitor Uncorrectable Memory Errors (UME) and notify
  active jobs in case they run for a while afterwards. This copies
  logic from knl_generic to knl_cray. There may be a different UME
  monitoring system for Cray systems in the future. The original
  knl_generic development is in commit 56ff27da
bug 3341
```
- node_features/knl_generic: cosmetic changes · a01f1bbf
  Morris Jette authored Jan 19, 2017
  
- Testsuite - added global procedure stop_root_user. · 9e945329
  Isaac Hartung authored Jan 13, 2017
```
Replaced identical code in tests with a call to this proc.
Also replace instances of code getting uid with a call to the
global proc get_my_uid / get_my_nuid.
```
- Burst_buffer/cray - Avoid stage-out operation if job never started · 39fb283f
  Morris Jette authored Jan 19, 2017
```
bug 3390
```
- Remove unused function from sacct.c. · b0547a8e
  Tim Wickberg authored Jan 18, 2017
  
- Add INFINITE16 macro to complement NO_VAL16. · e8e34db5
  Tim Wickberg authored Jan 18, 2017
  
18 Jan, 2017 5 commits
- Change http to https for all links to SchedMD sites. · d7770b9b
  Tim Wickberg authored Jan 18, 2017
  
- Add test for srun ignoring stdin · 40e89091
  Morris Jette authored Jan 18, 2017
```
see bug 3166
```
- ensure stepd -> srun client socket fully shutdown · c08f8922
  Aaron Knister authored Jan 18, 2017
```
ensure eio objects get explicitly shut down when
eio_handle_mainloop exits. currently this depends on
the order in which eio_handle_mainloop and
eio_signal_shutdown get called relative to each other

when stepd is instructed to shut down the socket use
SHUT_RDWR instead of SHUT_RD. just using SHUT_RD can
cause srun to receive ECONNRESET if there's outstanding
data that's been sent to stepd that the task has not
read.

bug 3166
```
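The second half of this message concerns the `how` argument of the POSIX shutdown() call. A minimal sketch of a full shutdown follows. The helper name `shutdown_client_socket` is hypothetical and this is not the eio code from the commit.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: shut down both directions of a connected socket.
 * SHUT_RD would only disable further receives on this end; SHUT_RDWR
 * also disables sends, so the peer sees end-of-file on its next read.
 * Returns 0 on success and -1 on error, as shutdown() does.
 */
static int shutdown_client_socket(int fd)
{
	assert(fd >= 0);
	return shutdown(fd, SHUT_RDWR);
}
```

On a connected AF_UNIX stream pair, calling this on one end makes a subsequent read() on the other end return 0. The descriptor itself still has to be released with close() afterwards.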
- Cosmetic changes · f4b9b51a
  Morris Jette authored Jan 18, 2017
```
No change in logic
```
- Remove local pkill script in testsuite. Was used for AIX. · 79f912d5
  Tim Wickberg authored Jan 18, 2017
  