  1. 16 Apr, 2011 4 commits
  2. 15 Apr, 2011 3 commits
  3. 14 Apr, 2011 7 commits
  4. 13 Apr, 2011 5 commits
  5. 12 Apr, 2011 3 commits
  6. 11 Apr, 2011 7 commits
  7. 10 Apr, 2011 11 commits
    • tweaks to some tests to reflect recent changes in priority change · 5498bb90
      Moe Jette authored
      and dependency clearing logic
      5498bb90
    • api: remove unreferenced and undocumented function · 22ece52e
      Moe Jette authored
      This removes the function "slurm_pack_msg_no_header", which is not referenced
      anywhere in the src tree and is not listed in any of the slurm man pages.
      
      As far as I understand the documentation, each slurm message needs to have a
      header; this function may therefore be a leftover from very old or initial code.
      22ece52e
    • scontrol: refactor if/else statement · 367c71ba
      Moe Jette authored
      367c71ba
    • protocol_defs: remove duplicate/identical test · 5c13acad
      Moe Jette authored
      This removes a test statement which appears identically twice.
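
      A minimal, self-contained C illustration of the kind of duplication removed here (hypothetical
      names, not the actual protocol_defs code): the second, identical test is unreachable dead code.

        #include <stdio.h>

        enum msg_type { MSG_A = 1, MSG_B = 2 };

        static void classify(enum msg_type type)
        {
                if (type == MSG_A)
                        printf("pack A\n");
                else if (type == MSG_A)         /* duplicate of the test above: dead branch */
                        printf("pack A\n");
                else
                        printf("pack other\n");
        }

        int main(void)
        {
                classify(MSG_A);
                classify(MSG_B);
                return 0;
        }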
      5c13acad
    • sprio: add support for the SLURM_CLUSTERS environment variable · 0a0efdf2
      Moe Jette authored
      This adds support for the SLURM_CLUSTERS environment variable to sprio as well.
      It also makes the test for the priority plugin type dependent on whether sprio is
      running with multiple-cluster support or not.
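
      A minimal sketch of the general pattern, honoring SLURM_CLUSTERS from the environment; the
      variable names and the exact condition for skipping the plugin-type check are assumptions,
      not the actual sprio code:

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
                const char *clusters = getenv("SLURM_CLUSTERS");
                /* Assumption: treat a comma-separated list as "multiple cluster" mode. */
                int multi_cluster = clusters && strchr(clusters, ',');

                if (clusters)
                        printf("targeting cluster(s): %s\n", clusters);

                if (!multi_cluster) {
                        /* Only meaningful against the local slurmctld: verify that the
                         * configured priority plugin is priority/multifactor. */
                        printf("checking local priority plugin type\n");
                }
                return 0;
        }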
      0a0efdf2
    • scontrol: add support for the SLURM_CLUSTERS environment variable · c7045c83
      Moe Jette authored
      On our frontend host we support multiple clusters (Cray and non-Cray) by
      setting the SLURM_CLUSTERS environment variable accordingly.
      
      In order to use scontrol (e.g. for hold/release of a user job) from a
      frontend host to control jobs on a remote Cray system, we need support for
      the SLURM_CLUSTERS environment variable in scontrol as well.
      c7045c83
    • slurmctld: keep original nice value when putting job on hold · b414712e
      Moe Jette authored
      The current code erases the old nice value (both negative and positive) when a job is
      put on hold so that the job has a 0 nice component upon release.
      
      This interaction causes difficulties if the nice value set at submission time had been
      set there for a reason, for instance when
       * a system administrator has allowed a negative nice value to be set;
       * the user wants to keep this as a low-priority job and wants his/her other jobs
         to go first (independent of the hold option);
       * the nice value carries other semantics - at our site, for instance, we use it
         for "base priority values" that are computed by looking at how much of
         its quota a given group has already (over)used.
      
      Here is an example which illustrates the loss of original nice values:
      
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 55
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 66
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 77
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 88
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 99
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 110
      
      This is from user 'kraused' whose project 's310' is within the allocated quota and thus
      has an initial nice value of -542 (set via the job_submit/lua plugin).
      
      However, by putting his jobs on hold, he has lost this advantage:
      
        JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
           55  kraused      15181        153          0       5028      10000          0      0
           66  kraused      15181        153          0       5028      10000          0      0
           77  kraused      15181        153          0       5028      10000          0      0
           88  kraused      15178        150          0       5028      10000          0      0
           99  kraused      15178        150          0       5028      10000          0      0
          110  kraused      15178        150          0       5028      10000          0      0
      
      I believe that resetting the nice value is done for a reason, thus the patch performs the
      reset of the current nice value only if the operation is not a user/administrator hold.
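
      As a rough C sketch of that guarded reset (illustrative only; the struct and function here are
      simplified stand-ins, not slurmctld's actual job record or update_job() code):

        #include <stdbool.h>
        #include <stdio.h>

        struct job_rec {
                unsigned priority;      /* 0 means held */
                int      nice;          /* submission-time nice value */
        };

        static void set_priority(struct job_rec *job, unsigned new_prio, bool is_hold)
        {
                job->priority = new_prio;
                if (!is_hold)
                        job->nice = 0;  /* reset only for non-hold updates; a hold keeps the nice */
        }

        int main(void)
        {
                struct job_rec job = { .priority = 15181, .nice = -542 };
                set_priority(&job, 0, true);            /* user/administrator hold */
                printf("held:    priority=%u nice=%d\n", job.priority, job.nice);
                set_priority(&job, 20000, false);       /* ordinary priority update */
                printf("updated: priority=%u nice=%d\n", job.priority, job.nice);
                return 0;
        }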
      b414712e
    • slurmctld: test job_specs->min_nodes before altering the value via partition setting · f8ea48bb
      Moe Jette authored
      This fixes a problem when trying to move a pending job from one partition to another
      while not supplying any other parameters:
       * if a partition value is present, the job is pending and no min_nodes are supplied,
         the job_specs->min_nodes gets set from the detail_ptr value;
       * this causes subsequent tests for job_specs->min_nodes ==/!= NO_VAL to fail.
      
      The following illustrates the behaviour; the example is taken from our system:
        palu2:0 ~>scontrol update jobid=3944 partition=night
        slurm_update error: Requested operation not supported on this system
      
        slurmctld.log
        [2011-04-06T14:39:51] update_job: setting partition to night for job_id 3944
        [2011-04-06T14:39:51] Change of size for job 3944 not supported
        [2011-04-06T14:39:51] updating accounting
        [2011-04-06T14:39:51] _slurm_rpc_update_job JobId=3944 uid=21215: Requested operation not supported on this system
      
      ==> The 'Change of size for job 3944' reveals that the !select_g_job_expand_allow() case was triggered,
          after setting the job_specs->min_nodes due to supplying a job_specs->partition.
      
      Fix:
      ====
       Since the test for select_g_job_expand_allow() does not depend on the job state, it is moved up,
       before the test for job_specs->partition. At the same time, the equality check for INFINITE/NO_VAL
       min_nodes values is moved to the same place.
       The tests for job_specs->min_nodes below the job_specs->partition setting depend on the job state:
       - the 'Reset min and max node counts as needed, insure consistency' block requires pending state;
       - the other remaining test applies only to IS_JOB_RUNNING/SUSPENDED.
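
       A condensed C sketch of the reordering (stubbed-out, hypothetical helpers and fields; the real
       update_job() is far larger and the details below are assumptions, not the actual patch):

        #include <stdbool.h>
        #include <stdio.h>

        #define NO_VAL 0xffffffffu

        struct job_specs { unsigned min_nodes; const char *partition; };
        struct job_rec   { unsigned det_min_nodes; bool pending; };

        static bool select_g_job_expand_allow(void) { return false; }  /* e.g. on a Cray system */

        static int update_job(struct job_specs *specs, struct job_rec *job)
        {
                /* 1. State-independent size checks first, while min_nodes still holds
                 *    the caller-supplied value (possibly NO_VAL). */
                if ((specs->min_nodes != NO_VAL) && !select_g_job_expand_allow()) {
                        printf("Change of size not supported\n");
                        return -1;
                }

                /* 2. Only afterwards fill in defaults as part of the partition change. */
                if (specs->partition && job->pending && (specs->min_nodes == NO_VAL))
                        specs->min_nodes = job->det_min_nodes;

                printf("partition set to %s, min_nodes=%u\n", specs->partition, specs->min_nodes);
                return 0;
        }

        int main(void)
        {
                struct job_specs specs = { NO_VAL, "night" };   /* only a partition supplied */
                struct job_rec   job   = { 16, true };          /* pending job, 16 nodes in details */
                return update_job(&specs, &job) ? 1 : 0;
        }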
      f8ea48bb
    • slurmctld: case of authorized operator releasing user hold · 1895a10a
      Moe Jette authored
      This patch fixes a case in which the priority is not recalculated on 'scontrol release',
      which happens when an authorized operator releases a job, or when the job is
      released via e.g. the job_submit plugin.
      
      The patch reorders the tests in update_job() to 
       * test first if the job has been held by the user and, only if not,
       * test whether an authorized operator changed the priority or
         the updated priority is being reduced.
      
      Due to earlier permission checks, we have either
       * job_ptr->user_id == uid, or
       * authorized,
      and in both cases the release-user-hold operation is permitted.
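
      In rough C terms, the reordered logic looks like the sketch below (illustrative; the fields,
      flags and recalculation step are simplified stand-ins for slurmctld's update_job() code):

        #include <stdbool.h>
        #include <stdio.h>

        struct job {
                unsigned priority;      /* 0 == held */
                bool     user_held;     /* hold was requested by the job owner */
        };

        static void update_priority(struct job *j, unsigned new_prio, bool authorized)
        {
                if (j->user_held && (new_prio != 0)) {
                        /* Release of a user hold: clear the hold and let the priority
                         * plugin recalculate, even if the caller is an operator. */
                        j->user_held = false;
                        printf("release of user hold: recalculate priority\n");
                } else if (authorized || (new_prio < j->priority)) {
                        /* Otherwise: direct priority change by an operator, or a reduction. */
                        j->priority = new_prio;
                        printf("direct priority change to %u\n", j->priority);
                } else {
                        printf("priority change not permitted\n");
                }
        }

        int main(void)
        {
                struct job j = { 0, true };     /* job held by its owner */
                update_priority(&j, 1, true);   /* authorized operator releases it */
                return 0;
        }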
      1895a10a
    • scontrol: set uid when releasing a job · 6353467b
      Moe Jette authored
      This fix is related to an earlier one; the problem was observed when trying to 'scontrol release'
      a job previously submitted via 'sbatch --hold' by the same user.
      
      Within the job_submit/lua plugin, the user is automatically assigned a partition. So,
      even though no submitter uid checks are usually expected, a part_check can be performed
      in the process of releasing a job.
      
      In this case, the error message was
      
      [2011-03-30T18:37:17] _part_access_check: uid 4294967294 access to partition usup denied, bad group
      [2011-03-30T18:37:17] error: _slurm_rpc_update_job JobId=12856 uid=21215: User's group not permitted to use this partition
      
      and, as before (in scontrol_update_job()), it was fixed by supplying the UID of the requesting user.
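
      A minimal sketch of the client-side idea against the SLURM job-update API; whether the release
      path really sets the priority to INFINITE here is an assumption, as is the rest of the snippet:

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <slurm/slurm.h>
        #include <slurm/slurm_errno.h>

        int main(int argc, char **argv)
        {
                job_desc_msg_t job_msg;

                slurm_init_job_desc_msg(&job_msg);
                job_msg.job_id   = (argc > 1) ? (uint32_t) atoi(argv[1]) : 0;
                job_msg.user_id  = getuid();    /* the fix: identify the requesting user */
                job_msg.priority = INFINITE;    /* assumption: release restores the priority */

                if (slurm_update_job(&job_msg) != SLURM_SUCCESS) {
                        slurm_perror("slurm_update_job");
                        return 1;
                }
                printf("job %u released\n", job_msg.job_id);
                return 0;
        }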
      6353467b
    • add function args to header · a269b6f4
      Moe Jette authored
      a269b6f4