- 20 Jan, 2017 20 commits
-
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
like it does in slurm_send_recv_msg. The resp needs to be initialized before _check_send is called.
-
Brian Christiansen authored
A sibling job has to get the lock from the origin cluster before it can attempt to allocate nodes. If it gets the allocation, it lets the origin cluster know, and the origin cluster sets the other sibling jobs, if any, into a REVOKED state and purges them. If the sibling job is the only sibling, it assumes the lock and attempts to start the job, avoiding extra communication. If nodes can't be allocated, the job releases the lock so that another cluster can try.
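The lock handshake described above can be sketched from the origin cluster's side. This is a minimal illustration under stated assumptions, not the Slurm implementation: the struct, the function names, and the integer cluster ids are all hypothetical.

```c
#include <stdbool.h>

#define NO_CLUSTER -1

/* Hypothetical origin-side record for one federated job. */
struct fed_job {
    int lock_holder; /* cluster id holding the lock, or NO_CLUSTER */
    int started_on;  /* cluster id that started the job, or NO_CLUSTER */
};

/* A sibling asks the origin for the lock before trying to allocate nodes.
 * Only one sibling may hold the lock, and none once the job has started. */
static bool origin_try_lock(struct fed_job *job, int cluster_id)
{
    if (job->started_on != NO_CLUSTER)
        return false;
    if (job->lock_holder != NO_CLUSTER && job->lock_holder != cluster_id)
        return false;
    job->lock_holder = cluster_id;
    return true;
}

/* A sibling that could not allocate nodes releases the lock
 * so that another cluster can try. */
static void origin_unlock(struct fed_job *job, int cluster_id)
{
    if (job->lock_holder == cluster_id)
        job->lock_holder = NO_CLUSTER;
}

/* The lock holder reports a successful allocation. At this point the
 * origin would revoke and purge the other sibling jobs (not modeled). */
static bool origin_job_started(struct fed_job *job, int cluster_id)
{
    if (job->lock_holder != cluster_id)
        return false;
    job->started_on = cluster_id;
    return true;
}
```

In this sketch a second cluster is refused the lock until the first one releases it, and every later lock request is refused once a start has been reported.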
-
Brian Christiansen authored
-
Brian Christiansen authored
for fed sibling jobs that don't start.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
To handle JOB_REVOKED
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Tim Wickberg authored
Overwriting _size leads to the bitmap being the wrong length.
-
Tim Wickberg authored
CID 160063.
-
Tim Wickberg authored
The safe_unpackstr_xmalloc() call could jump to unpack_error, which would leak the memory allocated for the bitmap. Allocate only after the unpackstr has succeeded. Coverity 160092 (+ more due to macro expansion leading to repeats).
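The ordering fix can be illustrated in plain C. This is only a sketch of the pattern: unpack_str and unpack_record are hypothetical stand-ins, not the Slurm macros or functions, and the bitmap is modeled as a zeroed byte array.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an unpack step that can fail. */
static int unpack_str(const char *src, char **out)
{
    if (!src)
        return -1;
    *out = malloc(strlen(src) + 1);
    if (!*out)
        return -1;
    strcpy(*out, src);
    return 0;
}

/* Returns 0 on success. On failure nothing is left allocated and the
 * output pointers are not touched. */
static int unpack_record(const char *src, size_t nbits,
                         char **str_out, unsigned char **bitmap_out)
{
    char *str = NULL;
    unsigned char *bitmap = NULL;
    size_t nbytes = (nbits + 7) / 8;

    /* Step that may fail runs first, while nothing else is allocated. */
    if (unpack_str(src, &str) != 0)
        goto unpack_error;

    /* Allocate the bitmap only after the unpack has succeeded. */
    bitmap = calloc(nbytes ? nbytes : 1, 1);
    if (!bitmap)
        goto unpack_error;

    *str_out = str;
    *bitmap_out = bitmap;
    return 0;

unpack_error:
    free(str);
    free(bitmap);
    return -1;
}
```

Because the allocation follows the fallible step, the early error path has no bitmap to leak. The error label still frees both pointers so that later failure points stay safe.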
-
Tim Wickberg authored
Code would return NO_VAL if the requested frequency was greater than the highest available. While here, improve the errors printed to the slurmstepd log location, and change the initial check against nfreq to test for zero (rather than (uint8_t) NO_VAL, which it could never be set to). Bug 3335.
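One way to picture the corrected lookup is the sketch below. It assumes the available frequencies are sorted in ascending order and that a request above the highest one falls back to the highest. Both are assumptions, since the commit message does not state the replacement behavior. pick_freq and FREQ_NONE are hypothetical names, and FREQ_NONE is only a stand-in sentinel, not Slurm's NO_VAL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for "no frequency available". */
#define FREQ_NONE UINT32_MAX

/* Pick the lowest available frequency that satisfies the request.
 * avail[] is assumed to be sorted in ascending order. */
static uint32_t pick_freq(const uint32_t *avail, size_t nfreq,
                          uint32_t requested)
{
    /* Initial check tests for zero entries, not for a sentinel value. */
    if (nfreq == 0)
        return FREQ_NONE;

    for (size_t i = 0; i < nfreq; i++) {
        if (avail[i] >= requested)
            return avail[i];
    }

    /* Request exceeds every entry: fall back to the highest available
     * instead of reporting "no value". */
    return avail[nfreq - 1];
}
```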
-
Danny Auble authored
Bug 2508
-
Tim Wickberg authored
-
- 19 Jan, 2017 15 commits
-
-
Morris Jette authored
If a job is allocated nodes that are powered down, reset the job start time when the nodes are ready and do not charge the job for power-up time. Bug 3411.
-
Morris Jette authored
No changes in logic
-
Isaac Hartung authored
Modify get_my_user_name to return FAILURE if the user_name cannot be determined; this case was previously handled inconsistently.
-
Danny Auble authored
the function.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Bug 3312 and 3332
-
Danny Auble authored
-
Danny Auble authored
jobid and stepid to be used if needing to kill job because it hadn't started yet.
-
Morris Jette authored
Add logic to monitor Uncorrectable Memory Errors (UME) and notify active jobs in case they run for a while afterwards. This copies logic from knl_generic to knl_cray. There may be a different UME monitoring system for Cray systems in the future. The original knl_generic development is in commit 56ff27da. Bug 3341.
-
Morris Jette authored
-
Isaac Hartung authored
Replace identical code in tests with a call to this proc. Also replace instances of code getting the uid with a call to the global proc get_my_uid / get_my_nuid.
-
Morris Jette authored
bug 3390
-
Tim Wickberg authored
-
Tim Wickberg authored
-
- 18 Jan, 2017 5 commits
-
-
Tim Wickberg authored
-
Morris Jette authored
see bug 3166
-
Aaron Knister authored
Ensure eio objects get explicitly shut down when eio_handle_mainloop exits. Currently this depends on the order in which eio_handle_mainloop and eio_signal_shutdown get called relative to each other when stepd is instructed to shut down. Use SHUT_RDWR instead of SHUT_RD on the socket: using only SHUT_RD can cause srun to receive ECONNRESET if there is outstanding data that has been sent to stepd and that the task has not read. Bug 3166.
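For reference, the difference comes down to the second argument of the POSIX shutdown() call. The helper below is a hypothetical illustration, not the eio code: SHUT_RDWR disables both receiving and sending on the descriptor, whereas SHUT_RD disables only receiving.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: stop both directions on a connected socket.
 * The peer then sees end-of-file on its next read instead of waiting. */
static int shutdown_both(int fd)
{
    return shutdown(fd, SHUT_RDWR);
}
```

A quick way to try it is with a socketpair(): after shutdown_both() on one end, a read() on the other end returns 0.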
-
Morris Jette authored
No change in logic
-
Tim Wickberg authored
-