- 19 Apr, 2011 5 commits
-
-
Moe Jette authored
scripts.
-
Moe Jette authored
command. Dependent upon .rpmmacros parameter of "%_with_srun2aprun"
-
-
Don Lipari authored
-
Moe Jette authored
Just use whichever StorageType plugin the user specifies to the configurator
-
- 18 Apr, 2011 6 commits
-
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
Patch from Bill Brophy
-
Moe Jette authored
--cpus-per-task option. Patch from Martin Perry, Bull.
-
Danny Auble authored
-
Danny Auble authored
-
- 17 Apr, 2011 13 commits
-
-
Moe Jette authored
-
Moe Jette authored
This allows scripted modification of job records by exposing the job_ptr->direct_set_prio, job_ptr->priority, and job_ptr->details->nice fields to the job_submit.lua script.
-
Moe Jette authored
This allows the job_submit plugin to set priority values directly. If it assigns a priority value other than 0 or NO_VAL, the priority is marked as "fixed" via job_ptr->direct_set_prio. To enable this, the permission check for directly set priorities is now done before calling the job_submit plugin, which also lets the plugin influence the nice value of the job.
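The rule described above can be sketched in a few lines of C. This is only an illustration, not the code from the commit: struct sketch_job and sketch_apply_plugin_priority() are hypothetical names, and SKETCH_NO_VAL is assumed to mirror SLURM's NO_VAL sentinel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stands in for SLURM's NO_VAL ("value not set") sentinel. */
#define SKETCH_NO_VAL 0xfffffffeu

/* Hypothetical, trimmed-down job record; not the real struct job_record. */
struct sketch_job {
    uint32_t priority;
    bool direct_set_prio;
};

/*
 * Apply the priority chosen by a job_submit-style plugin.
 * A value other than 0 and SKETCH_NO_VAL is stored and marked as "fixed";
 * 0 and SKETCH_NO_VAL leave the record untouched in this sketch.
 */
void sketch_apply_plugin_priority(struct sketch_job *job, uint32_t plugin_prio)
{
    if (plugin_prio != 0 && plugin_prio != SKETCH_NO_VAL) {
        job->priority = plugin_prio;
        job->direct_set_prio = true;
    }
}
```

A priority of 0 is deliberately left out here, since the source treats it as a hold request rather than a fixed priority.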
-
Moe Jette authored
This reorders the code of _job_create() so that the job_submit plugin can put a job on hold (by setting the job priority to 0). To prevent the user from releasing such jobs, jobs put on hold by the job_submit plugin use WAIT_HELD rather than WAIT_HELD_USER.
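The distinction between the two hold reasons might look roughly like this. The enums and functions below are hypothetical stand-ins, and the sketch assumes that holds placed by the user carry WAIT_HELD_USER and that a user may only release holds with that reason.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the state reasons named in the commit message. */
enum sketch_reason {
    SKETCH_WAIT_NO_REASON,
    SKETCH_WAIT_HELD,       /* held by the plugin (or an administrator) */
    SKETCH_WAIT_HELD_USER   /* held by the job's owner */
};

/* Hypothetical: who requested the hold. */
enum sketch_hold_source { SKETCH_HOLD_BY_USER, SKETCH_HOLD_BY_PLUGIN };

/* A priority of 0 means "held"; the reason records who asked for the hold. */
enum sketch_reason sketch_hold_reason(uint32_t priority,
                                      enum sketch_hold_source who)
{
    if (priority != 0)
        return SKETCH_WAIT_NO_REASON;
    return (who == SKETCH_HOLD_BY_PLUGIN) ? SKETCH_WAIT_HELD
                                          : SKETCH_WAIT_HELD_USER;
}

/* Assumption: an unprivileged user may only undo a hold placed by the user. */
bool sketch_user_may_release(enum sketch_reason reason)
{
    return reason == SKETCH_WAIT_HELD_USER;
}
```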
-
Moe Jette authored
This increases robustness in releasing ALPS reservations. Previously the reservation was only released through
* select_g_job_fini() for interactive (salloc) sessions;
* batch_finish() by slurmstepd for batch sessions.
That arrangement was a single point of failure for batch jobs, since a failure of batch_finish() would mean that the reservation could only be released much later, through the detection of orphaned ALPS reservations in basil_inventory().
For batch jobs that terminate normally, this patch means that the RELEASE method is called twice: first in job_complete(), and then in batch_finish(). The Basil 1.2 design document by Ben Landsteiner (dated 15 Feb 2011) suggests in section 3.3.5 repeated calls of RELEASE as one possible way of improving the response of the RELEASE method. There will be additional "entry not found" messages in the apschedMMDD logs, but (due to the preceding patch) not in the SLURM logs.
For jobs that have to be terminated (e.g. job_timed_out, job_requeue, job_fail), this patch means that RELEASE is called much sooner, which is expected to improve efficiency.
For interactive salloc sessions that are cancelled via scancel, there is no longer a warning message about the ALPS reservation that no longer exists (since the release happens first through select_p_job_signal and then through job_complete -> deallocate_nodes -> select_p_job_fini).
-
Moe Jette authored
select/cray: special case for "no such resId"
For robustness, it makes sense to call the RELEASE method multiple times. The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests the following in section 3.3.5, "Improve RELEASE method response": "Periodically send RELEASE method requests until the RELEASE method response indicates the reservation is gone (via an error response)". The typical error message for this case (also shown in the document on page 11) is:
[2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
[2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543
There is already at least one use case for this type of error, cancelling a salloc session via scancel:
1. scancel causes job_signal() to be invoked,
2. job_signal() defers to select_g_job_signal(),
   - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
   - at this stage the reservation is already released,
3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
4. this calls select_g_job_fini(),
   - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
   - since at this stage the reservation has already been released, the error message results.
I don't like this error message myself, but avoiding it in all call paths is complicated. Also, for robustness, I would very much prefer that do_basil_release() is always called from deallocate_nodes(). Hence this patch creates a custom error class for the case "no entry for resId xxx". This allows the calling function to still catch the error, but the unnecessary warning is no longer printed in the logfiles.
The callers of this method are:
* do_basil_release() - which is already set up to handle the error/non-error case;
* basil_safe_release() - this does no extra error checking, since it is called when trying to remove orphaned reservations; any failure in attempting to release the reservation will result in repeated "orphaned ALPS reservation ..." messages.
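The idea of a dedicated error class can be sketched as follows. None of these names come from the select/cray sources: the enum, sketch_classify(), and sketch_release() are hypothetical. Matching on the message text is an assumption made for illustration, as is treating an already-released reservation as success.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical error classes; "no entry for resId" gets a class of its own. */
enum sketch_basil_error {
    SKETCH_BE_NONE,
    SKETCH_BE_NO_RESID,
    SKETCH_BE_OTHER
};

/* Classify an ALPS response message (NULL or empty means no error). */
enum sketch_basil_error sketch_classify(const char *alps_msg)
{
    if (alps_msg == NULL || alps_msg[0] == '\0')
        return SKETCH_BE_NONE;
    if (strstr(alps_msg, "No entry for resId") != NULL)
        return SKETCH_BE_NO_RESID;
    return SKETCH_BE_OTHER;
}

/*
 * Caller-side handling: returns 0 when the reservation is gone (released now
 * or released earlier) and -1 otherwise. Only unexpected errors are logged;
 * *lines_logged counts how many log lines were written.
 */
int sketch_release(const char *alps_msg, int *lines_logged)
{
    switch (sketch_classify(alps_msg)) {
    case SKETCH_BE_NONE:
    case SKETCH_BE_NO_RESID:    /* already released: stay quiet */
        return 0;
    default:
        fprintf(stderr, "error: %s\n", alps_msg);
        (*lines_logged)++;
        return -1;
    }
}
```

Because the class is still returned by sketch_classify(), a caller that cares about the difference can inspect it, while repeated RELEASE calls no longer add noise to the log.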
-
Moe Jette authored
This patch implements the same principle as an earlier one to fix issues when signalling aprun job steps via apkill: to avoid race conditions where further aprun lines get started while the current one is still in progress, always release the reservation first.
-
Moe Jette authored
This refactors the code to parse Basil 4.0 response data, removing code that is applicable to both Basil 3.1 and 4.0.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
SLURM's logic used to support Cray systems.
-
Moe Jette authored
the key and value (e.g. "-N2" gets translated to "-N 2" for the perl parser).
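The translation in the example ("-N2" becoming "-N 2") could be sketched in C as below. The commit itself concerns the perl wrappers, so this is only an illustration of the splitting step; sketch_split_short_opt() is a hypothetical helper that assumes a single-letter option immediately followed by its value.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Write a normalized copy of arg into out.
 * "-N2" becomes "-N 2"; anything that is not a single-letter option with an
 * attached value (e.g. "-N", "--nodes=2", "a.out") is copied unchanged.
 * Returns 1 if the argument was split, 0 otherwise.
 */
int sketch_split_short_opt(const char *arg, char *out, size_t out_len)
{
    if (arg[0] == '-' && isalpha((unsigned char)arg[1]) && arg[2] != '\0') {
        snprintf(out, out_len, "-%c %s", arg[1], arg + 2);
        return 1;
    }
    snprintf(out, out_len, "%s", arg);
    return 0;
}
```

The sketch does not distinguish bundled flags (such as "-abc") from an option with a value; a real parser would need the list of options that take arguments.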
-
- 16 Apr, 2011 8 commits
-
-
Moe Jette authored
-
Moe Jette authored
This is so that the select/cray plugin can read its configuration file and still be used by the perl wrappers.
-
Moe Jette authored
This is the size of the container ID in the current SGI_JOB (PAGG) library.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
-
Moe Jette authored
which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
-
- 15 Apr, 2011 3 commits
- 14 Apr, 2011 5 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-