- 16 Apr, 2011 (4 commits)
  - Moe Jette authored
  - Moe Jette authored
  - Moe Jette authored
    which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
  - Moe Jette authored
    which is a wrapper over Cray's aprun command and supports many srun options. Without this option, the srun command will advise the user to use the aprun command.
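    As a rough illustration of what such a wrapper does, here is a minimal, hypothetical C sketch (not SLURM's actual implementation) that translates one srun option into its aprun equivalent and then replaces itself with aprun; the -c to -d (cpus-per-task to depth) mapping is an assumption for illustration:

      /* hypothetical srun-over-aprun wrapper sketch */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      int main(int argc, char **argv)
      {
          /* worst case: every original argument plus "aprun" and NULL */
          char **av = malloc((argc + 2) * sizeof(char *));
          int i, j = 0;

          av[j++] = "aprun";
          for (i = 1; i < argc; i++) {
              if (!strcmp(argv[i], "-c") && (i + 1 < argc)) {
                  av[j++] = "-d";        /* srun cpus-per-task -> aprun depth */
                  av[j++] = argv[++i];
              } else {
                  av[j++] = argv[i];     /* e.g. -n passes through unchanged */
              }
          }
          av[j] = NULL;

          execvp("aprun", av);           /* returns only on error */
          perror("execvp aprun");
          return 1;
      }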
- 15 Apr, 2011 (3 commits)
- 14 Apr, 2011 (7 commits)
  - Danny Auble authored
  - Danny Auble authored
  - Moe Jette authored
  - Danny Auble authored
  - Danny Auble authored
  - Moe Jette authored
- 13 Apr, 2011 (5 commits)
  - Moe Jette authored
  - Danny Auble authored
    BLUEGENE - when running in overlap mode, make sure to check the connection type, so that overlapping blocks can be created on the exact same nodes with different connection types (i.e. one torus, one mesh).
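    A hedged sketch of the kind of comparison this implies (the structure and field names are hypothetical stand-ins, not the actual select/bluegene code): two blocks only count as the same block when both their node sets and their connection types match.

      #include <stdbool.h>
      #include <stddef.h>
      #include <string.h>

      typedef enum { CONN_TORUS, CONN_MESH } conn_type_t;

      typedef struct {
          unsigned long *node_bitmap;   /* nodes covered by the block */
          size_t         bitmap_words;
          conn_type_t    conn_type;     /* torus or mesh */
      } bg_block_t;

      /* blocks on exactly the same nodes are still distinct when their
       * connection types differ (one torus, one mesh) */
      static bool block_matches(const bg_block_t *a, const bg_block_t *b)
      {
          if (a->bitmap_words != b->bitmap_words ||
              memcmp(a->node_bitmap, b->node_bitmap,
                     a->bitmap_words * sizeof(unsigned long)))
              return false;
          return a->conn_type == b->conn_type;
      }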
  - Danny Auble authored
  - Moe Jette authored
  - Moe Jette authored
- 12 Apr, 2011 (3 commits)
  - Don Lipari authored
    HWLOC_API_VERSION
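    HWLOC_API_VERSION is hwloc's compile-time API version macro, typically used to keep one source tree building against several hwloc releases. A hedged example of the usual pattern (illustrative, not necessarily the exact change here), based on hwloc 1.0's rename of HWLOC_OBJ_PROC to HWLOC_OBJ_PU:

      #include <hwloc.h>

      /* pre-1.0 hwloc calls a logical processor HWLOC_OBJ_PROC */
      #if HWLOC_API_VERSION < 0x00010000
      #  define HWLOC_OBJ_PU HWLOC_OBJ_PROC
      #endif

      /* count the logical processors on the local machine */
      int count_pus(void)
      {
          hwloc_topology_t topo;
          int n;

          hwloc_topology_init(&topo);
          hwloc_topology_load(topo);
          n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
          hwloc_topology_destroy(topo);
          return n;
      }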
  - Danny Auble authored
  - Don Lipari authored
- 11 Apr, 2011 (7 commits)
  - Don Lipari authored
    the presence of the man2html utility.
  - Danny Auble authored
  - Moe Jette authored
    20.2, 20.4 and 7.3 from running with select/bluegene
  - Moe Jette authored
  - Moe Jette authored
    so that we can resolve xstrdup() to slurm_xstrdup().
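    A hedged sketch of one common way to achieve this (illustrative; the real SLURM headers may differ): keep the short name in the sources via a macro while the exported symbol carries the slurm_ prefix, so the linker resolves xstrdup() to slurm_xstrdup().

      /* header: the short name maps onto the prefixed symbol */
      extern char *slurm_xstrdup(const char *str);
      #define xstrdup(str) slurm_xstrdup(str)

      /* implementation: only the prefixed symbol is defined */
      #include <stdlib.h>
      #include <string.h>

      char *slurm_xstrdup(const char *str)
      {
          char *copy = malloc(strlen(str) + 1);

          if (copy)
              strcpy(copy, str);
          return copy;
      }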
  - Moe Jette authored
- 10 Apr, 2011 (11 commits)
  - Moe Jette authored
    and dependency clearing logic
  - Moe Jette authored
    This removes the function "slurm_pack_msg_no_header", which is referenced nowhere in the src tree and is not listed in any of the slurm man pages. As far as I understand the documentation, each slurm message needs to have a header; this function may therefore stem from very old or initial code.
  - Moe Jette authored
  - Moe Jette authored
    This removes a test statement that appeared twice in identical form.
  - Moe Jette authored
    This adds support for the SLURM_CLUSTERS environment variable to sprio as well. It also makes the test for the priority plugin type depend on whether sprio is running with multiple-cluster support.
  - Moe Jette authored
    On our frontend host we support multiple clusters (Cray and non-Cray) by setting the SLURM_CLUSTERS environment variable accordingly. In order to use scontrol (e.g. for hold/release of a user job) from a frontend host to control jobs on a remote Cray system, we need support for the SLURM_CLUSTERS environment variable in scontrol as well.
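    A minimal C sketch of the environment lookup this implies (the fallback wording is an assumption; the real commands feed the value into their cluster-selection code):

      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
          /* comma-separated list of clusters to operate on */
          const char *clusters = getenv("SLURM_CLUSTERS");

          if (clusters && *clusters)
              printf("sending requests to cluster(s): %s\n", clusters);
          else
              printf("operating on the local cluster\n");
          return 0;
      }

    With such support in place, something like 'SLURM_CLUSTERS=xtcluster scontrol release 1234' (cluster name and job id hypothetical) releases a job on the remote system from the frontend host.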
  - Moe Jette authored
    The current code erases the old nice value (both negative and positive) when a job is put on hold, so the job has a 0 nice component upon release. This interaction causes difficulties if the nice value set at submission time had been set there for a reason, for instance when:
    * a system administrator has allowed a negative nice value to be set;
    * the user wanted to keep this as a low-priority job and wants his/her other jobs to go first (independent of the hold option);
    * the nice value is used for other semantics - at our site, for instance, we use it for computed "base priority values" that are computed by looking at how much of its quota a given group has already (over)used.
    Here is an example which illustrates the loss of original nice values:
      [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 55
      [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 66
      [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 77
      [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 88
      [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 99
      [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 110
    This is from user 'kraused', whose project 's310' is within its allocated quota and thus has an initial nice value of -542 (set via the job_submit/lua plugin). However, by putting his jobs on hold, he has lost this advantage:
      JOBID  USER     PRIORITY  AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
      55     kraused  15181     153  0          5028     10000      0    0
      66     kraused  15181     153  0          5028     10000      0    0
      77     kraused  15181     153  0          5028     10000      0    0
      88     kraused  15178     150  0          5028     10000      0    0
      99     kraused  15178     150  0          5028     10000      0    0
      110    kraused  15178     150  0          5028     10000      0    0
    I believe that resetting the nice value has been there for a reason, so the patch still resets the current nice value whenever the operation is not a user/administrator hold, and preserves it otherwise.
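    A hedged sketch of the resulting logic (the stand-in types and flag name are hypothetical; the NICE_OFFSET value shown is illustrative):

      #include <stdbool.h>
      #include <stdint.h>

      #define NICE_OFFSET 10000   /* illustrative neutral nice base */

      struct job_details { uint16_t nice; };
      struct job_record  { uint32_t priority; struct job_details *details; };

      /* hold the job, but keep the submit-time nice value when the hold
       * comes from the user or an administrator, so it still applies
       * after 'scontrol release' */
      static void hold_job(struct job_record *job, bool user_or_admin_hold)
      {
          job->priority = 0;                   /* priority 0 == held */
          if (!user_or_admin_hold)             /* old code: always reset */
              job->details->nice = NICE_OFFSET;
      }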
  - Moe Jette authored
    This fixes a problem when trying to move a pending job from one partition to another without supplying any other parameters:
    * if a partition value is present, the job is pending, and no min_nodes are supplied, job_specs->min_nodes gets set from the detail_ptr value;
    * this causes subsequent tests for job_specs->min_nodes ==/!= NO_VAL to fail.
    The following illustrates the behaviour; the example is taken from our system:
      palu2:0 ~> scontrol update jobid=3944 partition=night
      slurm_update error: Requested operation not supported on this system
    slurmctld.log:
      [2011-04-06T14:39:51] update_job: setting partition to night for job_id 3944
      [2011-04-06T14:39:51] Change of size for job 3944 not supported
      [2011-04-06T14:39:51] updating accounting
      [2011-04-06T14:39:51] _slurm_rpc_update_job JobId=3944 uid=21215: Requested operation not supported on this system
    ==> The 'Change of size for job 3944' message reveals that the !select_g_job_expand_allow() case was triggered, after job_specs->min_nodes was set as a result of supplying a job_specs->partition.
    Fix: since the test for select_g_job_expand_allow() does not depend on the job state, it is moved up, before the test for job_specs->partition. The equality test for INFINITE/NO_VAL min_nodes values is moved to the same place. The tests for job_specs->min_nodes below the job_specs->partition setting depend on the job state:
    * the 'Reset min and max node counts as needed, insure consistency' block requires pending state;
    * the other remaining test applies only to IS_JOB_RUNNING/SUSPENDED jobs.
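    A hedged sketch of the reordering (the NO_VAL/INFINITE macros match SLURM's public values; the types and helpers are much-reduced hypothetical stand-ins for the slurmctld code):

      #include <stdbool.h>
      #include <stdint.h>

      #define NO_VAL   0xfffffffe
      #define INFINITE 0xffffffff

      struct job_desc { uint32_t min_nodes; char *partition; };
      struct job_rec  { bool running_or_suspended; };
      bool select_g_job_expand_allow(void);

      /* state-independent min_nodes checks now run before the partition
       * update, so setting a partition on a pending job no longer trips
       * the "Change of size ... not supported" path */
      int update_job_sketch(struct job_rec *job, struct job_desc *specs)
      {
          if (specs->min_nodes != NO_VAL) {
              if (specs->min_nodes == INFINITE)
                  specs->min_nodes = NO_VAL;   /* treat as "unchanged" */
              else if (job->running_or_suspended &&
                       !select_g_job_expand_allow())
                  return -1;                   /* resize not supported */
          }
          if (specs->partition) {
              /* ... partition update; for a pending job this may fill in
               * specs->min_nodes from the job's stored details ... */
          }
          return 0;
      }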
  - Moe Jette authored
    This patch ensures that the priority is recalculated on 'scontrol release', which previously did not happen when an authorized operator released a job, or when the job was released e.g. via the job_submit plugin. The patch reorders the tests in update_job() to:
    * first test whether the job has been held by the user and, only if not,
    * test whether an authorized operator changed the priority or the updated priority is being reduced.
    Due to the earlier permission checks we have either job_ptr->user_id == uid or 'authorized'; in both cases the release-user-hold operation is permitted.
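    A hedged sketch of the reordered tests (hypothetical stand-in types and helpers, not the literal update_job() code):

      #include <stdbool.h>
      #include <stdint.h>

      struct job { uint32_t user_id, priority; bool held_by_user; };

      uint32_t recalc_priority(const struct job *j);   /* hypothetical */

      /* the user-hold case is checked first, so releasing such a hold
       * always triggers a priority recalculation, even when the release
       * comes from an operator or the job_submit plugin */
      void set_priority_sketch(struct job *j, bool authorized,
                               uint32_t new_priority)
      {
          if (j->held_by_user) {
              /* permissions were verified earlier: either the requester
               * owns the job or is authorized */
              j->held_by_user = false;
              j->priority = recalc_priority(j);
          } else if (authorized || new_priority < j->priority) {
              j->priority = new_priority;   /* explicit priority change */
          }
      }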
  - Moe Jette authored
    This fix is related to an earlier one and was observed when trying to 'scontrol release' a job previously submitted via 'sbatch --hold' by the same user. Within the job_submit/lua plugin, the user is automatically assigned a partition. So, even though no submitter uid checks are normally expected, a part_check can be performed in the process of releasing a job. In this case the error message was:
      [2011-03-30T18:37:17] _part_access_check: uid 4294967294 access to partition usup denied, bad group
      [2011-03-30T18:37:17] error: _slurm_rpc_update_job JobId=12856 uid=21215: User's group not permitted to use this partition
    As before (in scontrol_update_job()), this was fixed by supplying the UID of the requesting user.
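    A hedged sketch of the fix (the message struct is a hypothetical stand-in; 4294967294 is 0xfffffffe, i.e. an unset uid value):

      #include <stdint.h>
      #include <unistd.h>

      struct job_update_msg { uint32_t job_id, user_id; };

      /* fill in the requesting user's uid before sending the update, so
       * any partition access check triggered during the release runs
       * against a real uid instead of the unset value */
      void build_release_msg(struct job_update_msg *msg, uint32_t job_id)
      {
          msg->job_id  = job_id;
          msg->user_id = (uint32_t) getuid();
      }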
  - Moe Jette authored