- 22 Jun, 2017 29 commits
-
-
Brian Christiansen authored
The controller keeps the job in the job_list until the origin cluster comes back up and finds out about it then.
-
Brian Christiansen authored
This allows the origin cluster to sync up jobs after it has been down.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
(void *)(intptr_t)0 is treated as a NULL pointer.
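A minimal sketch of the cast pattern involved (illustrative code, not the actual Slurm change): when an integer is passed through a void * argument, a value of 0 compares equal to NULL, so the callee cannot tell "value 0" apart from "no argument".

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only (not the Slurm code): integer values smuggled
 * through a void * argument, as list/callback APIs often do. */
static void handle(void *arg)
{
	if (arg == NULL) {
		/* (void *)(intptr_t)0 compares equal to NULL, so a stored
		 * value of 0 is indistinguishable from "no argument". */
		printf("got NULL\n");
		return;
	}
	printf("got %ld\n", (long)(intptr_t)arg);
}

int main(void)
{
	handle((void *)(intptr_t)0);	/* prints "got NULL" */
	handle((void *)(intptr_t)42);	/* prints "got 42" */
	return 0;
}
```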
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
_cleanup_removed_origin_jobs() could have been called even if the cluster had never been part of a federation.
-
Brian Christiansen authored
Job could have been requeued if the nodes failed.
-
Isaac Hartung authored
Commit bef69448 changed slurm_addto_char_list() so that it now adds an empty string to the list if no constraints or clusters are given; the code here previously expected an empty List.
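A minimal caller-side sketch of the difference this creates (hypothetical data layout and helper name, not the actual Slurm List API): a list built from an empty constraints/clusters string now holds one empty-string element instead of being empty, so code that only checked for an empty list has to handle both cases.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a parsed name list: a NULL-terminated
 * array of strings instead of Slurm's List type. */
static int list_is_effectively_empty(char **names)
{
	if (!names || !names[0])
		return 1;			/* no elements at all */
	if (!names[1] && names[0][0] == '\0')
		return 1;			/* single empty string */
	return 0;
}

int main(void)
{
	char *old_style[] = { NULL };		/* previously: empty list */
	char *new_style[] = { "", NULL };	/* now: one empty string */

	printf("%d %d\n",
	       list_is_effectively_empty(old_style),
	       list_is_effectively_empty(new_style));	/* prints "1 1" */
	return 0;
}
```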
-
Brian Christiansen authored
Like sview, it wasn't mapping the job's node indexes to the correct nodes, since federated nodes are merged into one array.
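A hedged illustration of the indexing problem (made-up structures, not the sview/smap code): once every cluster's nodes are concatenated into one federated array, a job's per-cluster node index only points at the right node after adding the owning cluster's offset into that merged array.

```c
#include <stdio.h>

/* Illustrative only: map a job's per-cluster node index into a merged
 * federated node array by adding the owning cluster's offset. */
struct cluster_nodes {
	const char *name;
	int offset;		/* index of this cluster's first node */
	int node_cnt;
};

static int fed_node_index(const struct cluster_nodes *c, int local_index)
{
	return c->offset + local_index;
}

int main(void)
{
	struct cluster_nodes clusters[] = {
		{ "cluster_a",   0, 100 },
		{ "cluster_b", 100,  50 },
	};

	/* Node 3 on cluster_b is node 103 in the merged array. */
	printf("%d\n", fed_node_index(&clusters[1], 3));
	return 0;
}
```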
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
while the cluster is down. The cluster will figure out what changed after starting up or after recovering from running off the cache.
-
Brian Christiansen authored
When the controller starts up and the dbd is not up, it waits until the dbd comes up. At that point, the controller needs to find out whether anything has changed in the federation (e.g. other clusters, or the cluster itself, removed from the federation).
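A rough, runnable sketch of that ordering with hypothetical stand-in functions (the real slurmctld/slurmdbd interfaces are not shown): keep running from the cached federation state, and once the dbd connection is established, re-query the federation so anything that changed while the controller was down gets picked up.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stubs; the real controller talks to the slurmdbd. */
static bool dbd_up = false;

static void dbd_connected(void)
{
	dbd_up = true;
}

/* Sketch of the step described above: after the dbd finally comes up,
 * re-read the federation so clusters added or removed while we were
 * down (including this cluster itself) are noticed. */
static void refresh_federation(void)
{
	if (!dbd_up)
		return;		/* keep using cached federation state */
	printf("re-syncing federation info from the dbd\n");
}

int main(void)
{
	refresh_federation();	/* dbd still down: nothing to do yet */
	dbd_connected();
	refresh_federation();	/* dbd up: re-sync federation state */
	return 0;
}
```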
-
Brian Christiansen authored
-
Isaac Hartung authored
-
Isaac Hartung authored
When a non-origin cluster is removed:
- running jobs remain
- fed_details are removed so the job can't call home
- the origin cluster removes the tracking job for running jobs
- pending jobs are removed
- pending srun/salloc commands don't get notified
- other clusters remove the removed cluster from their viable and active siblings

When an origin cluster is removed:
- all pending jobs are removed from all clusters that had the job
- pending srun/salloc commands are notified of termination
- running jobs remain
-
Isaac Hartung authored
-
Brian Christiansen authored
-
Isaac Hartung authored
-
Brian Christiansen authored
-
Danny Auble authored
The SLURM_ID_HASH used for Cray systems has changed to use the full 64 bits of the hash. Previously the stepid was multiplied by 10,000,000,000 to make it easy to read both the jobid and the stepid in the hash, separated by at least a couple of zeros, but this led to overflow of the hash for steps like the batch and extern steps, which use all 32 bits to represent the step. While the new method loses the easy readability, it fixes the more important overflow issue. This will most likely go unnoticed by most; it is just a note of the change.
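The overflow can be seen with a few lines of arithmetic (the constants below only mirror the description; the exact SLURM_ID_HASH macro definitions are not reproduced here):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Step ids such as the batch/extern steps use the full 32-bit
	 * range; the values below are only illustrative. */
	uint64_t step_id = 0xfffffffe;
	uint64_t job_id  = 1234;

	/* Old scheme per the description: stepid * 10,000,000,000 keeps
	 * jobid and stepid visually separated, but 0xfffffffe * 1e10
	 * (~4.3e19) exceeds UINT64_MAX (~1.8e19) and wraps. */
	uint64_t old_hash = step_id * 10000000000ULL + job_id;

	/* A full-64-bit packing (e.g. stepid in the high 32 bits) cannot
	 * overflow, at the cost of readability. */
	uint64_t new_hash = (step_id << 32) | job_id;

	printf("old: %" PRIu64 " (wrapped)\nnew: %" PRIu64 "\n",
	       old_hash, new_hash);
	return 0;
}
```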
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Danny Auble authored
# Conflicts:
#	NEWS
-
Hongjia Cao authored
Bug 3919
-
- 21 Jun, 2017 1 commit
-
-
Dominik Bartkiewicz authored
Bug 3757
-
- 20 Jun, 2017 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
more than 1 partition or when the partition is changed with scontrol. Bug 3849
-
- 19 Jun, 2017 7 commits
-
-
Danny Auble authored
-
Danny Auble authored
submitted to a QOS/association. Bug 3849
-
Isaac Hartung authored
Continuation of b9719be2
-
Danny Auble authored
-
Brian Christiansen authored
CID: 170772, 170773
Introduced by commit: 250378c2
-
Danny Auble authored
-
Morris Jette authored
Correct error message when ClusterName in configuration files does not match the name in the slurmctld daemon's state save file.
-