10 Apr, 2011 15 commits
    • tweaks to some tests to reflect recent changes in priority and dependency clearing logic · 5498bb90
      Moe Jette authored
    • api: remove unreferenced and undocumented function · 22ece52e
      Moe Jette authored
      This removes the function "slurm_pack_msg_no_header", which is not referenced
      anywhere in the src tree and is not listed in any of the slurm manpages.
      
      As far as I understand the documentation, each slurm message needs to have a
      header, so this function may be a leftover from very old or initial code.
    • scontrol: refactor if/else statement · 367c71ba
      Moe Jette authored
    • protocol_defs: remove duplicate/identical test · 5c13acad
      Moe Jette authored
      This removes a test statement that appeared identically twice.
    • sprio: add support for the SLURM_CLUSTERS environment variable · 0a0efdf2
      Moe Jette authored
      This adds support for the SLURM_CLUSTERS environment variable to sprio as well.
      It also makes the test for the priority plugin type dependent on whether sprio
      is running with multiple-cluster support.
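      As a rough illustration only (this is not the actual sprio source; the helper
      names below are made up), a client tool can pick up SLURM_CLUSTERS via getenv()
      and skip the local priority-plugin check when a remote cluster is addressed:

        /* sketch: gate the priority-plugin check on multi-cluster mode */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        static int multi_cluster_mode(void)
        {
            /* SLURM_CLUSTERS names the remote cluster(s) to address */
            const char *clusters = getenv("SLURM_CLUSTERS");
            return (clusters && *clusters) ? 1 : 0;
        }

        static int check_priority_plugin(const char *plugin_type)
        {
            /* a remote slurmctld does its own priority bookkeeping,
             * so the locally configured plugin type is irrelevant */
            if (multi_cluster_mode())
                return 0;
            if (strcmp(plugin_type, "priority/multifactor") != 0) {
                fprintf(stderr, "not running priority/multifactor\n");
                return -1;
            }
            return 0;
        }

        int main(void)
        {
            return check_priority_plugin("priority/multifactor") ? 1 : 0;
        }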
    • scontrol: add support for the SLURM_CLUSTERS environment variable · c7045c83
      Moe Jette authored
      On our frontend host we support multiple clusters (Cray and non-Cray) by
      setting the SLURM_CLUSTERS environment variable accordingly.
      
      In order to use scontrol (e.g. for hold/release of a user job) from a
      frontend host to control jobs on a remote Cray system, we need support for
      the SLURM_CLUSTERS environment variable in scontrol as well.
    • slurmctld: keep original nice value when putting job on hold · b414712e
      Moe Jette authored
      The current code erases the old nice value (both negative and positive) when a job is
      put on hold so that the job has a 0 nice component upon release.
      
      This causes difficulties if the nice value set at submission time was set for a
      reason, for instance when
       * a system administrator has allowed a negative nice value to be set;
       * the user wanted to keep this as a low-priority job and wants his/her other jobs
         to go first (independent of the hold option);
       * the nice value carries other semantics - at our site, for instance, we use it
         for "base priority values" that are computed from how much of its quota a given
         group has already (over)used.
      
      Here is an example which illustrates the loss of original nice values:
      
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 55
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 66
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 77
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 88
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 99
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 110
      
      This is from user 'kraused' whose project 's310' is within the allocated quota and thus
      has an initial nice value of -542 (set via the job_submit/lua plugin).
      
      However, by putting his jobs on hold, he has lost this advantage:
      
        JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
           55  kraused      15181        153          0       5028      10000          0      0
           66  kraused      15181        153          0       5028      10000          0      0
           77  kraused      15181        153          0       5028      10000          0      0
           88  kraused      15178        150          0       5028      10000          0      0
           99  kraused      15178        150          0       5028      10000          0      0
          110  kraused      15178        150          0       5028      10000          0      0
      
      I believe that resetting the nice value has been there for a reason, thus the patch
      performs the reset of the current nice value only if the operation is not a
      user/administrator hold.
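      A minimal sketch of the intended behaviour (made-up structure and field names,
      and the historical nice offset of 10000 assumed; this is not the actual slurmctld
      code): the nice component is only cleared for a real priority change, not when
      the job is merely being put on hold:

        #include <stdbool.h>
        #include <stdint.h>

        #define NICE_OFFSET 10000    /* nice is stored offset-encoded (assumption) */

        struct job_record_sketch {
            uint32_t priority;       /* 0 == held */
            uint16_t nice;           /* offset-encoded nice value */
        };

        static void set_priority(struct job_record_sketch *job,
                                 uint32_t new_priority, bool is_hold)
        {
            job->priority = new_priority;
            if (new_priority == 0 && is_hold)
                return;              /* hold: keep submit-time nice for release */
            job->nice = NICE_OFFSET; /* explicit priority change: reset to neutral */
        }

        int main(void)
        {
            /* the -542 example from above: the nice value must survive a hold */
            struct job_record_sketch job = { 15181, NICE_OFFSET - 542 };
            set_priority(&job, 0, true);
            return (job.nice == NICE_OFFSET - 542) ? 0 : 1;
        }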
    • slurmctld: test job_specs->min_nodes before altering the value via partition setting · f8ea48bb
      Moe Jette authored
      This fixes a problem when trying to move a pending job from one partition to another
      without supplying any other parameters:
       * if a partition value is present, the job is pending, and no min_nodes value is
         supplied, job_specs->min_nodes gets set from the detail_ptr value;
       * this causes subsequent tests for job_specs->min_nodes ==/!= NO_VAL to fail.
      
      The following illustrates the behaviour; the example is taken from our system:
        palu2:0 ~>scontrol update jobid=3944 partition=night
        slurm_update error: Requested operation not supported on this system
      
        slurmctld.log
        [2011-04-06T14:39:51] update_job: setting partition to night for job_id 3944
        [2011-04-06T14:39:51] Change of size for job 3944 not supported
        [2011-04-06T14:39:51] updating accounting
        [2011-04-06T14:39:51] _slurm_rpc_update_job JobId=3944 uid=21215: Requested operation not supported on this system
      
      ==> The 'Change of size for job 3944' reveals that the !select_g_job_expand_allow() case was triggered,
          after setting the job_specs->min_nodes due to supplying a job_specs->partition.
      
      Fix:
      ====
       Since the test for select_g_job_expand_allow() does not depend on the job state, it has
       been moved up, before the test for job_specs->partition. At the same time, the equality
       check for INFINITE/NO_VAL min_nodes values has been moved to the same place.
       The tests for job_specs->min_nodes below the job_specs->partition setting depend on the
       job state:
       - the 'Reset min and max node counts as needed, insure consistency' part requires the
         pending state;
       - the other remaining test applies only to IS_JOB_RUNNING/SUSPENDED jobs.
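      An illustrative sketch of the new ordering (stub types and names, not the real
      update_job() code): the state-independent size check runs before the partition
      handling may fill in job_specs->min_nodes from the stored job details:

        #include <stdint.h>

        #define NO_VAL 0xfffffffe    /* "not supplied" marker */

        struct job_specs_sketch {
            const char *partition;
            uint32_t    min_nodes;
        };

        struct detail_sketch {
            uint32_t min_nodes;
        };

        static int select_allows_expand(void) { return 0; }    /* stub */

        static int update_job_sketch(struct job_specs_sketch *specs,
                                     const struct detail_sketch *detail)
        {
            /* 1. resize check first: only a request that really supplies
             *    min_nodes can be rejected as an unsupported size change */
            if ((specs->min_nodes != NO_VAL) && !select_allows_expand())
                return -1;    /* "Change of size ... not supported" */

            /* 2. only now may the partition handling default min_nodes */
            if (specs->partition && (specs->min_nodes == NO_VAL))
                specs->min_nodes = detail->min_nodes;

            return 0;
        }

        int main(void)
        {
            struct detail_sketch detail = { 4 };
            struct job_specs_sketch specs = { "night", NO_VAL };
            /* a pure partition change must no longer be rejected as a resize */
            return update_job_sketch(&specs, &detail) ? 1 : 0;
        }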
    • slurmctld: case of authorized operator releasing user hold · 1895a10a
      Moe Jette authored
      This patch fixes a case in which the priority is not recalculated on 'scontrol release',
      which happens when an authorized operator releases a job, or when the job is
      released via e.g. the job_submit plugin.
      
      The patch reorders the tests in update_job() to 
       * test first if the job has been held by the user and, only if not,
       * test whether an authorized operator changed the priority or
         the updated priority is being reduced.
      
      Due to earlier permission checks, we have either
       * job_ptr->user_id == uid or 
       * authorized,
      and in both cases releasing the user hold is a permitted operation.
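      A small sketch of the reordered test (made-up names and a stub priority
      calculation, not the real update_job() code): a released user hold always
      triggers a recalculation, and only otherwise does the operator /
      priority-reduction rule apply:

        #include <stdbool.h>
        #include <stdint.h>

        struct held_job_sketch {
            uint32_t priority;       /* 0 == held */
            bool     user_held;      /* held by the job owner */
        };

        static uint32_t recalc_priority(void) { return 12345; }    /* stub */

        static void update_priority(struct held_job_sketch *job,
                                    uint32_t new_prio, bool authorized)
        {
            if (job->user_held && (new_prio != 0)) {
                /* release of a user hold: always recompute the priority */
                job->priority = recalc_priority();
                job->user_held = false;
            } else if (authorized || (new_prio < job->priority)) {
                /* explicit priority change by an operator, or a reduction */
                job->priority = new_prio;
            }
        }

        int main(void)
        {
            struct held_job_sketch job = { 0, true };
            update_priority(&job, 1, true);    /* operator releases the user hold */
            return (job.priority == 12345) ? 0 : 1;    /* recomputed, not just set */
        }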
    • scontrol: set uid when releasing a job · 6353467b
      Moe Jette authored
      This fix is related to an earlier one; the problem was observed when trying to
      'scontrol release' a job previously submitted via 'sbatch --hold' by the same user.
      
      Within the job_submit/lua plugin, the user is automatically assigned a partition. So
      even though no submitter uid checks are usually expected, a partition check
      (part_check) can be performed in the process of releasing a job.
      
      In this case, the error message was
      
      [2011-03-30T18:37:17] _part_access_check: uid 4294967294 access to partition usup denied, bad group
      [2011-03-30T18:37:17] error: _slurm_rpc_update_job JobId=12856 uid=21215: User's group not permitted to use this partition
      
      and, like before (in scontrol_update_job()), it was fixed by supplying the UID of the requesting user.
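      A minimal sketch of the client-side idea (a stub message type, not the real
      job_desc_msg_t): the release request carries the uid of the requesting user, so
      the partition access check no longer sees the unset marker (uid 4294967294):

        #include <stdint.h>
        #include <sys/types.h>
        #include <unistd.h>

        struct update_msg_sketch {
            uint32_t job_id;
            uint32_t user_id;
            uint32_t priority;
        };

        static void build_release_request(struct update_msg_sketch *msg,
                                          uint32_t job_id)
        {
            msg->job_id   = job_id;
            msg->priority = 1;    /* simplified: non-zero priority means "release" */
            /* the fix: supply the requester's uid instead of leaving it unset */
            msg->user_id  = (uint32_t) getuid();
        }

        int main(void)
        {
            struct update_msg_sketch msg = { 0, 0, 0 };
            build_release_request(&msg, 12856);
            return (msg.user_id == (uint32_t) getuid()) ? 0 : 1;
        }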
    • add function args to header · a269b6f4
      Moe Jette authored
    • slurmstepd: avoid coredump in case of NULL job · e0d92b8a
      Moe Jette authored
      We build slurm with --enable-memory-leak-debug and twice encountered the same core
      dump when user 'root' was trying to run jobs during a maintenance session.

      The root user is not in the accounting database, which explains the errors seen
      in the log below. The gdb session shows that in this invocation the job pointer was NULL.
      
      palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core 
      ...
      Modify: 2011-04-04 19:34:44.000000000 +0200
      
      slurmctld.log
      [2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
      [2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
      [2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
      [2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
      [2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
      [2011-04-04T19:34:44] completing job 3254
      [2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
      [2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
      [2011-04-04T19:34:44] requeue batch job 3254
      [2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285
      
      (gdb) core-file palu7-slurmstepd-6602.core 
      [New Thread 6604]
      Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
      Program terminated with signal 11, Segmentation fault.
      #0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
      413             jobacct_gather_g_destroy(job->jobacct);
      (gdb) print job
      $1 = (slurmd_job_t *) 0x0
      (gdb) list
      408
      409     #ifdef MEMORY_LEAK_DEBUG
      410     static void
      411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      412     {
      413             jobacct_gather_g_destroy(job->jobacct);
      414             if (!job->batch)
      415                     job_destroy(job);
      416             /*
      417              * The message cannot be freed until the jobstep is complete
      (gdb) print msg
      $2 = (slurm_msg_t *) 0x916008
      (gdb) print rc
      $3 = -1
      (gdb) 
      
      The patch tests for a NULL job argument before the calls that need to dereference the job pointer.
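      A sketch of the guarded cleanup with simplified stand-in types (the real code
      lives in slurmstepd.c under MEMORY_LEAK_DEBUG): the job pointer is only
      dereferenced when it is non-NULL:

        #include <stdbool.h>
        #include <stddef.h>

        struct job_sketch {
            void *jobacct;
            bool  batch;
        };

        static void jobacct_destroy_sketch(void *jobacct) { (void) jobacct; }
        static void job_destroy_sketch(struct job_sketch *job) { (void) job; }

        static void step_cleanup_sketch(struct job_sketch *job)
        {
            if (job) {    /* the step may fail before a job struct exists */
                jobacct_destroy_sketch(job->jobacct);
                if (!job->batch)
                    job_destroy_sketch(job);
            }
            /* message cleanup that does not touch the job pointer continues here */
        }

        int main(void)
        {
            step_cleanup_sketch(NULL);    /* must not crash on a NULL job */
            return 0;
        }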
    • select/cray: zero reservation ID is not an error · 03f984aa
      Moe Jette authored
      This avoids meaningless error messages that warn about a zero reservation ID:
      
       [2011-04-07T15:31:26] _slurm_rpc_submit_batch_job JobId=2870 usec=33390
                             ... a minute later the user decides to scancel the queued job:
       [2011-04-07T15:32:34] error: JobId=2870 has invalid (ZERO) resId
       [2011-04-07T15:32:34] sched: Cancel of JobId=2870 by UID=21770, usec=230
      
      To keep things simple, that test has been removed.
      
      (The patch is also necessary because job_signal() may now trigger a basil_release()
       of a pending job which has no ALPS reservation yet.)
    • select/cray: release ALPS reservation on termination signals · 12772a3a
      Moe Jette authored
      On rosa we experienced severe problems when jobs were killed via scancel or
      as a result of a job timeout. Job cleanup took several minutes and created stray
      processes that consumed resources on the slurmd node, leaving the system unable
      to schedule for long stretches.
      
      This problem did not show up on the smaller 2-cabinet XE system (which also
      runs a more recent ALPS version). The fix is to prevent new script lines from
      starting, by sending apkill only after the reservation has been formally
      released.
      
      For all signals whose default disposition is to terminate or to dump core,
      the reservation is released before signalling the aprun job steps. This
      prevents a race condition where further aprun lines get executed while the
      apkill of the current aprun line in the job script is in progress.
      
      We did a before/after test on rosa under full load and the problem disappeared.
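      An illustrative sketch of the signalling order (stub functions stand in for the
      ALPS release and the step signalling; this is not the actual select/cray plugin
      code):

        #include <signal.h>
        #include <stdbool.h>

        /* signals whose default disposition is to terminate or dump core */
        static bool is_terminating(int sig)
        {
            switch (sig) {
            case SIGHUP: case SIGINT:  case SIGQUIT: case SIGABRT:
            case SIGBUS: case SIGSEGV: case SIGTERM: case SIGKILL:
                return true;
            default:
                return false;
            }
        }

        static void release_alps_reservation(void) { }            /* stub */
        static void signal_job_steps(int sig) { (void) sig; }     /* stub */

        static void signal_job_sketch(int sig)
        {
            if (is_terminating(sig))
                release_alps_reservation();    /* release first, so ALPS starts
                                                  no further aprun lines */
            signal_job_steps(sig);             /* then apkill the running apruns */
        }

        int main(void)
        {
            signal_job_sketch(SIGTERM);    /* fatal: reservation released first */
            signal_job_sketch(SIGSTOP);    /* non-fatal: reservation is kept */
            return 0;
        }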
    • add testimonial from CSCS · 44bec602
      Moe Jette authored