- 23 Jun, 2017 7 commits
-
-
Morris Jette authored
bug 3886
-
Morris Jette authored
-
Tim Shaw authored
Not necessarily fatal(), but of potential interest when debugging odd slurmctld crashes. Cannot go where the limit is originally set, as the logging infrastructure is not available at that point. Bug 3886.
-
Morris Jette authored
test1.91 fails with non-default binding
-
Tim Wickberg authored
-
Tim Shaw authored
Bug 3581.
-
Morris Jette authored
Fix for commit 250378c2. test7.3 was failing without this patch. Bug 3502.
-
- 22 Jun, 2017 33 commits
-
-
Morris Jette authored
-
Morris Jette authored
test17.12 was leaving slurm-#.out files around. Explicitly set output file to /dev/null and set time limit to 1 minute to avoid vestigial jobs.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
Handle syncing between siblings (not the origin). Job could have been cancelled or started while the origin was down. The sibling should see this and remove its copy if the job is running or cancelled on the remote cluster. Handle case where job was cancelled or finished while the origin was down and the siblings were up.
-
Brian Christiansen authored
The controller will keep the job in the job_list until the origin comes back up and will find out about it then.
-
Brian Christiansen authored
This allows the origin to be able to sync up jobs after it has been down.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
(void *)(intptr_t)0 is treated as a NULL
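The pitfall named in this message can be sketched in a few lines of C. This is only an illustration, not the code touched by the commit: every helper name below is hypothetical, and the integer-to-pointer conversion at run time is implementation-defined (zero maps to a null pointer on mainstream platforms, which is what the sketch assumes). It shows why an integer 0 stored in a `void *` slot cannot be told apart from "no item", and one possible workaround (biasing the stored value by one).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack a small integer into a void * slot. */
static void *int_to_ptr(intptr_t value)
{
	return (void *)value;
}

/* Hypothetical check: NULL conventionally means "nothing found". */
static bool looks_not_found(const void *item)
{
	return item == NULL;
}

/* Workaround sketch (an assumption, not Slurm's fix): store value + 1
 * so that a legitimate 0 never collides with NULL. */
static void *int_to_ptr_biased(intptr_t value)
{
	return (void *)(value + 1);
}

/* Undo the bias applied by int_to_ptr_biased(). */
static intptr_t ptr_to_int_biased(const void *item)
{
	return (intptr_t)item - 1;
}

/* Small self-check demonstrating the collision and the workaround. */
static void zero_collision_demo(void)
{
	/* A stored 0 is indistinguishable from "not found". */
	assert(looks_not_found(int_to_ptr(0)));
	/* A non-zero value does not collide. */
	assert(!looks_not_found(int_to_ptr(7)));
	/* With the bias, 0 survives a round trip and is not NULL. */
	assert(!looks_not_found(int_to_ptr_biased(0)));
	assert(ptr_to_int_biased(int_to_ptr_biased(0)) == 0);
}
```

Calling `zero_collision_demo()` from a test harness exercises all three cases. Other designs, such as storing a pointer to a heap-allocated integer, avoid the collision without a bias at the cost of an allocation.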
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
_cleanup_removed_origin_jobs() could have been called without ever being part of a federation.
-
Brian Christiansen authored
Job could have been requeued if the nodes failed.
-
Isaac Hartung authored
bef69448 was fixed/changed so that slurm_addto_char_list() would now add an empty string to the list if no constraints or clusters were given. The code was expecting an empty List previously.
-
Brian Christiansen authored
Like sview, it wasn't mapping the job's node indexes to the correct nodes, since federated nodes are merged into one array.
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
-
Brian Christiansen authored
while the cluster is down. The cluster will figure out what changed after starting up or after resuming from using the cache.
-
Brian Christiansen authored
When the controller starts up and the dbd is not up, it waits until the dbd comes up. At this point, the controller needs to find out if anything has changed in the federation (e.g. other clusters or self removed from cluster).
-
Brian Christiansen authored
-
Isaac Hartung authored
-
Isaac Hartung authored
When a non-origin cluster is removed:
  - running jobs remain
  - fed_details removed so it can't call home.
  - origin cluster removes tracking job for running jobs
  - pending jobs are removed.
  - pending srun/sallocs don't get notified.
  - other clusters remove removed cluster from viable and active sibs
When an origin cluster is removed:
  - all pending jobs are removed from all clusters that had job.
  - pending srun/sallocs are notified of termination
  - running jobs remain.
-
Isaac Hartung authored
-
Brian Christiansen authored
-
Isaac Hartung authored
-