- 17 Apr, 2011 8 commits
-
-
Moe Jette authored
select/cray: special case for "no such resId"

For robustness, it makes sense to call the RELEASE method multiple times. The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests the following in section 3.3.5, "Improve RELEASE method response": "Periodically send RELEASE method requests until the RELEASE method response indicates the reservation is gone (via an error response)".

The typical error message for this case is (also shown in the document on page 11):

  [2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
  [2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543

There is already at least one use case for this type of error: cancelling a salloc session via scancel:
 1. scancel causes job_signal() to be invoked,
 2. job_signal() defers to select_g_job_signal(),
    - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
    - at this stage the reservation is already released,
 3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
 4. this calls select_g_job_fini(),
    - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
    - since at this stage the reservation has already been released, the error message results.

I don't like this error message myself, but avoiding it in all call paths is complicated. Also, for robustness, I would very much prefer that do_basil_release() is always called from deallocate_nodes().

Hence this patch creates a custom error class for the case "no entry for resId xxx". This allows the calling function to still catch the error, but the unnecessary warning is no longer printed in the logfiles.

The callers of this method are:
 * do_basil_release() - which is already set up to handle the error/non-error case;
 * basil_safe_release() - this does no extra error checking, since it is called when trying to remove orphaned reservations; any failure in attempting to release the reservation will result in repeated "orphaned ALPS reservation ..." messages.
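
The idea of a separate error class for "no entry for resId" can be sketched roughly as follows. This is not the patch's code: the enum, its members, and both function names are hypothetical and chosen only for illustration. The only detail taken from the commit message is the "No entry for resId" text in the ALPS error string.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical error classes; the names are illustrative, not the plugin's. */
enum release_error {
	REL_OK = 0,	/* RELEASE returned no error */
	REL_NO_RESID,	/* reservation already gone ("No entry for resId") */
	REL_BACKEND	/* any other ALPS backend error */
};

/* Classify an ALPS error string; NULL or an empty string means no error. */
enum release_error classify_release_error(const char *msg)
{
	if (msg == NULL || msg[0] == '\0')
		return REL_OK;
	if (strstr(msg, "No entry for resId") != NULL)
		return REL_NO_RESID;
	return REL_BACKEND;
}

/* Assumption: only genuine backend failures deserve a line in the log. */
bool release_error_should_log(enum release_error err)
{
	return err == REL_BACKEND;
}
```

With a distinct class, a caller such as do_basil_release() can still see that the RELEASE request failed, while the logging path stays silent for a reservation that had already been released.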
-
Moe Jette authored
This patch implements the same principle as an earlier one to fix issues when signalling aprun job steps via apkill: to avoid race conditions where further aprun lines get started while the current one is still in progress, always release the reservation first.
-
Moe Jette authored
This refactors the code to parse Basil 4.0 response data, removing code that is applicable to both Basil 3.1 and 4.0.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
SLURM's logic used to support Cray systems.
-
Moe Jette authored
the key and value (e.g. "-N2" gets translated to "-N 2" for the perl parser).
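
The key/value split mentioned above ("-N2" becoming "-N 2") can be sketched as below. The helper name and its signature are hypothetical, and the sketch assumes every fused short option carries a value; a real wrapper would also need to know which options take arguments, since a cluster of flags such as "-vv" would be split as well.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite a fused short option such as "-N2" as "-N 2".
 * Arguments that are not of the form "-<char><value>" (bare short options,
 * long options starting with "--", plain words) are copied unchanged.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
int split_short_option(const char *arg, char *out, size_t out_len)
{
	int n;

	if (arg[0] == '-' && arg[1] != '\0' && arg[1] != '-' && arg[2] != '\0')
		n = snprintf(out, out_len, "-%c %s", arg[1], arg + 2);
	else
		n = snprintf(out, out_len, "%s", arg);

	/* snprintf reports the length it wanted to write; detect truncation. */
	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

Calling split_short_option("-N2", buf, sizeof(buf)) with a sufficiently large buffer leaves "-N 2" in buf, while "-N" and "--nodes=2" pass through untouched.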
-
- 16 Apr, 2011 8 commits
-
-
Moe Jette authored
-
Moe Jette authored
This is so that the select/cray plugin can read its configuration file and still be used by the perl wrappers.
-
Moe Jette authored
This is the size of the container ID in the current SGI_JOB (PAGG) library.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
-
Moe Jette authored
which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
-
- 15 Apr, 2011 3 commits
- 14 Apr, 2011 7 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
-
- 13 Apr, 2011 5 commits
-
-
Moe Jette authored
-
Danny Auble authored
BLUEGENE - when running in overlap mode make sure to check the connection type so you can create overlapping blocks on the exact same nodes with different connection types (i.e. one torus, one mesh).
-
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
-
- 12 Apr, 2011 3 commits
-
-
Don Lipari authored
HWLOC_API_VERSION
-
Danny Auble authored
-
Don Lipari authored
-
- 11 Apr, 2011 6 commits
-
-
Don Lipari authored
the presence of the man2html utility.
-
-
Danny Auble authored
-
Moe Jette authored
20.2, 20.4 and 7.3 from running with select/bluegene
-
Moe Jette authored
-
Moe Jette authored
so that we can resolve xstrdup() to slurm_xstrdup().
-