1. 04 Nov, 2016 3 commits
    • cray/burst_buffer - Preserve job ID · 42a90020
      Morris Jette authored
      cray/burst_buffer - Preserve job ID and don't translate to job array ID
        after slurmctld restart. Prior logic would not set array_task_id to
        NO_VAL, so all job-buffer IDs would be reported in the form
        "JobID=0_0(123)" rather than "JobID=123"
      42a90020
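The reporting difference described above can be sketched as follows. This is a minimal illustration, not the Slurm source: the struct, the function name, and the `SKETCH_NO_VAL` constant (including its value) are assumptions made for the example. It only shows why a task ID left at 0 instead of the "no value" sentinel produces the array-style form.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Sentinel meaning "field not set"; the name and value are assumptions. */
#define SKETCH_NO_VAL UINT32_C(0xfffffffe)

/* Hypothetical, simplified job record holding only what the sketch needs. */
struct sketch_job {
    uint32_t job_id;
    uint32_t array_job_id;
    uint32_t array_task_id;
};

/*
 * Format a job identifier into buf and return snprintf()'s result.
 * A job whose array_task_id equals the sentinel is printed as a plain
 * job ID; anything else is printed in the array form.
 */
int sketch_fmt_job_id(const struct sketch_job *job, char *buf, size_t len)
{
    assert(job != NULL && buf != NULL);

    if (job->array_task_id == SKETCH_NO_VAL)
        return snprintf(buf, len, "JobID=%" PRIu32, job->job_id);

    return snprintf(buf, len, "JobID=%" PRIu32 "_%" PRIu32 "(%" PRIu32 ")",
                    job->array_job_id, job->array_task_id, job->job_id);
}
```

With `array_task_id` set to the sentinel, job 123 formats as "JobID=123". With the field left at 0, as could happen after a restart under the prior logic, the same job formats as "JobID=0_0(123)".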
    • Burst_buffer/cray space tracking fix · 1548086f
      Morris Jette authored
      cray/burst_buffer - Internally track both allocated and unusable space.
          The reported UsedSpace in a pool is now the allocated space (previously was
          unusable space). Base available space on whichever value leaves least free
          space.
      bug 3222
      1548086f
    • Remove last pieces of dynalloc plugin. · 11b9ca4d
      Tim Wickberg authored
      Previously disconnected from build system, and most code removed by
      commit 0b14a3a7 back on 15.08-pre1.
      11b9ca4d
  2. 03 Nov, 2016 8 commits
  3. 01 Nov, 2016 4 commits
  4. 28 Oct, 2016 1 commit
    • Fix issue in the priority/multifactor plugin where on a slurmctld restart · be924b88
      Danny Auble authored
      more time than should be allowed would be accounted for.
      
      This only happened on jobs in the completing state when the slurmctld
      was shut down.
      
      This will be improved further in 17.02, because the job's end_time_exp,
      which is needed to determine whether the job has already been through
      the decay_thread at end of job, is not currently stored.
      
      Bug 3162
      be924b88
  5. 27 Oct, 2016 8 commits
  6. 26 Oct, 2016 7 commits
  7. 25 Oct, 2016 4 commits
  8. 24 Oct, 2016 2 commits
    • Fix for sstat on multi-node batch jobs · 8589ff40
      Jacek Budzowski authored
      There is a problem with gathering batch step statistics for jobs which are allocated on more than one node.
      
      sstat asks the wrong node for batch step stats. It requests info from the last node in the hostlist, whereas it should ask the first host in the hostlist (i.e. BatchHost), because the batch step actually executes only on the first node.
      
      For example, suppose a job is allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat, adding the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). The output then includes these lines:
      
      sstat: debug:  slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
      sstat: debug:  job step 1234.4294967294 has already completed
      
      The problem lies in the sstat source code. For the batch step, the host is taken from the hostlist with the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached.
      
      bug 2975
      8589ff40
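The pop versus shift distinction described in this commit can be sketched with a small stand-in list. Everything here is hypothetical: the struct and the `sketch_shift` and `sketch_pop` helpers are written for this example and only mimic the first-host and last-host behaviour the message attributes to hostlist_shift and hostlist_pop. They are not the Slurm hostlist API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a host list: a window over an array of names. */
struct sketch_hostlist {
    const char **hosts;  /* backing array of host names */
    size_t first;        /* index of the first remaining host */
    size_t count;        /* number of remaining hosts */
};

/* Remove and return the FIRST remaining host, or NULL if the list is empty. */
const char *sketch_shift(struct sketch_hostlist *hl)
{
    assert(hl != NULL);
    if (hl->count == 0)
        return NULL;
    hl->count--;
    return hl->hosts[hl->first++];
}

/* Remove and return the LAST remaining host, or NULL if the list is empty. */
const char *sketch_pop(struct sketch_hostlist *hl)
{
    assert(hl != NULL);
    if (hl->count == 0)
        return NULL;
    hl->count--;
    return hl->hosts[hl->first + hl->count];
}
```

For a list holding n0001 and n0002, `sketch_shift` yields n0001, the node where the batch step runs, while `sketch_pop` yields n0002, which matches the node queried in the debug output above.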
    • Fix use-after-free in srun · 2c7c5459
      Dorian Krause authored
      This commit fixes a bug in the multi-prog handling. When running
      salloc -N 2 srun -O --multi-prog mp.conf where mp.conf reads
      
      0-192 true
      
      srun crashes can be observed. valgrind reports:
      
      ==6857== Invalid read of size 4
      ==6857==    at 0x45938D: bit_realloc (bitstring.c:189)
      ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
      ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
      ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
      ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
      ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
      ==6857==    by 0x58E556: _opt_args (opt.c:1927)
      ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
      ==6857==    by 0x591F82: init_srun (srun_job.c:459)
      ==6857==    by 0x427E70: srun (srun.c:193)
      ==6857==    by 0x428E23: main (srun.wrapper.c:17)
      ==6857==  Address 0x5ace440 is 16 bytes inside a block of size 28 free'd
      ==6857==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==6857==    by 0x446886: slurm_xrealloc (xmalloc.c:139)
      ==6857==    by 0x45944C: bit_realloc (bitstring.c:191)
      ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
      ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
      ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
      ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
      ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
      ==6857==    by 0x58E556: _opt_args (opt.c:1927)
      ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
      ==6857==    by 0x591F82: init_srun (srun_job.c:459)
      ==6857==    by 0x427E70: srun (srun.c:193)
      2c7c5459
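The trace above reports a read from a block that realloc had already released. As a general illustration of that class of hazard, and not the actual change made in this commit, the sketch below grows a buffer while taking care never to touch the old pointer after realloc returns. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer used only for this sketch. */
struct sketch_buf {
    unsigned char *data;
    size_t size;
};

/*
 * Grow buf to new_size bytes and zero-fill the newly added tail.
 * Returns 0 on success, -1 on allocation failure (buf is left unchanged).
 */
int sketch_grow(struct sketch_buf *buf, size_t new_size)
{
    size_t old_size;
    unsigned char *p;

    assert(buf != NULL);
    /* Read everything needed from the old state BEFORE calling realloc. */
    old_size = buf->size;
    if (new_size <= old_size)
        return 0;

    p = realloc(buf->data, new_size);
    if (p == NULL)
        return -1;

    /*
     * realloc may have moved the block and freed the old one, so only p
     * is valid from here on. Using the old buf->data value would be a
     * use-after-free.
     */
    memset(p + old_size, 0, new_size - old_size);
    buf->data = p;
    buf->size = new_size;
    return 0;
}
```

A caller starts from an empty buffer (`data` set to NULL, `size` set to 0), calls `sketch_grow` as needed, and must reload `buf->data` after every call instead of caching it across calls.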
  9. 20 Oct, 2016 3 commits