1. 08 Sep, 2017 1 commit
  2. 07 Sep, 2017 2 commits
  3. 05 Sep, 2017 1 commit
  4. 04 Sep, 2017 1 commit
    • Fix to test job mem against MaxMemPer[CPU|Node] limits at scheduling time. · 24365514
      Alejandro Sanchez authored
      Initially, job memory limits were tested only at submission time, through
      _validate_min_mem_partition() -> _valid_pn_min_mem(), and not tested
      again at scheduling time. Jobs could therefore be incorrectly scheduled
      in partitions whose MaxMemPer* limit they exceeded
      (a limit that can in turn be inherited from the system-wide limit).
      
      NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support
      this new waiting reason.
      
      Bug 2291.
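The scheduling-time check described in the commit above can be pictured with a minimal sketch. This is not Slurm's _valid_pn_min_mem(); the helper names, the `SKETCH_` prefixed identifiers, and the `cpus_per_node` parameter are hypothetical. It assumes that a value with the top bit set is a per-CPU amount (as Slurm's MEM_PER_CPU flag marks it) and that a limit of 0 means no limit is configured.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the top bit marks a per-CPU value, like Slurm's MEM_PER_CPU. */
#define SKETCH_MEM_PER_CPU UINT64_C(0x8000000000000000)

/* Hypothetical stand-ins for job_state_reason values. */
enum sketch_reason { SKETCH_WAIT_NO_REASON, SKETCH_WAIT_PN_MEM_LIMIT };

/* Normalize a per-CPU or per-node value (in MB) to a per-node value. */
static uint64_t sketch_per_node_mb(uint64_t mem, uint32_t cpus_per_node)
{
    if (mem & SKETCH_MEM_PER_CPU)
        return (mem & ~SKETCH_MEM_PER_CPU) * cpus_per_node;
    return mem;
}

/* True when the job's memory request fits under the partition limit. */
static bool sketch_mem_within_limit(uint64_t job_mem, uint64_t limit,
                                    uint32_t cpus_per_node)
{
    if (limit == 0) /* assumption: 0 means no limit configured */
        return true;
    return sketch_per_node_mb(job_mem, cpus_per_node) <=
           sketch_per_node_mb(limit, cpus_per_node);
}

/* Re-run the same test at scheduling time and report a waiting reason. */
static enum sketch_reason sketch_sched_reason(uint64_t job_mem, uint64_t limit,
                                              uint32_t cpus_per_node)
{
    return sketch_mem_within_limit(job_mem, limit, cpus_per_node)
               ? SKETCH_WAIT_NO_REASON
               : SKETCH_WAIT_PN_MEM_LIMIT;
}
```

The point of the sketch is only that the same predicate is evaluated again for each candidate partition at scheduling time, so a job that slipped past submission-time validation is held with a dedicated reason instead of being started.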
  5. 02 Sep, 2017 1 commit
  6. 01 Sep, 2017 4 commits
  7. 31 Aug, 2017 1 commit
  8. 30 Aug, 2017 3 commits
  9. 29 Aug, 2017 5 commits
  10. 25 Aug, 2017 2 commits
  11. 24 Aug, 2017 2 commits
    • Add file bcast support for pack jobs · 58b21490
      Morris Jette authored
      Modify sbcast command and srun's --bcast option to support heterogeneous
      jobs.
      Bug 4099
    • Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
      call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.
      
      This scenario happened when cgroup.conf has ConstrainDevices=yes and
      task_cgroup_devices_create() tries to collect the GRES devices
      while gres_cpu_cnt=0, so that p->cpus_bitmap = bit_alloc(gres_cpu_cnt)
      creates a zero-size bitmap, which is then passed as an argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but have no GresTypes or GRES option in slurm.conf or the node config definition.
      
      Bug 3974
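The failure mode in the commit above (clearing a bit range with start=0 and stop=-1 on a zero-size bitmap) can be illustrated with a toy bitmap. This is a sketch only: the `toy_` types and functions are hypothetical, they are not Slurm's bitstring API, and the guard shown is not necessarily how the actual fix was written.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy bitmap; not Slurm's bitstr_t. */
typedef struct {
    int64_t nbits;  /* number of valid bits, may be 0 */
    uint8_t *bits;  /* backing storage, always at least 1 byte */
} toy_bitmap_t;

static toy_bitmap_t *toy_alloc(int64_t nbits)
{
    toy_bitmap_t *b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    /* Allocate at least one byte so a zero-size bitmap is still valid. */
    size_t nbytes = nbits > 0 ? (size_t)((nbits + 7) / 8) : 1;
    b->bits = calloc(nbytes, 1);
    if (!b->bits) {
        free(b);
        return NULL;
    }
    b->nbits = nbits;
    return b;
}

static void toy_free(toy_bitmap_t *b)
{
    if (b) {
        free(b->bits);
        free(b);
    }
}

/*
 * Clear every bit. Without the size check, the range would be
 * [0, nbits - 1] = [0, -1] for an empty bitmap, which is the
 * invalid range described in the commit message.
 */
static bool toy_clear_all(toy_bitmap_t *b)
{
    if (!b || b->nbits <= 0)
        return false; /* nothing to clear; refuse the invalid range */
    for (int64_t i = 0; i < b->nbits; i++)
        b->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
    return true;
}
```

A caller would treat a false return as "no CPUs to parse" and skip the bitmap handling, rather than reaching an assertion deep inside the range-clear routine.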
  12. 23 Aug, 2017 1 commit
    • jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
      the corresponding job_node->serialized_job, but the originally generated buffer
      wasn't freed afterwards. The fix changes the transfer so that, instead of
      xstrdup'ing the char *, we just assign the pointer and NULL the buffer.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
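The ownership transfer described in the commit above (assign the pointer, then NULL the source) can be sketched with plain C pointers. The struct and function names below are hypothetical illustrations; the sketch uses standard malloc/free rather than Slurm's xmalloc family.

```c
#include <stddef.h>

/* Hypothetical stand-in for the job node that stores the serialized record. */
struct sketch_job_node {
    char *serialized_job;
};

/*
 * Move ownership of *buffer into the node.
 *
 * The leaking pattern was, in spirit:
 *     node->serialized_job = duplicate(buffer);   // buffer never freed
 * Moving the pointer avoids both the copy and the leak: after the call
 * the node is the only owner, and the caller's pointer is NULL so a
 * later free of it is a harmless no-op.
 */
static void sketch_transfer(struct sketch_job_node *node, char **buffer)
{
    node->serialized_job = *buffer;
    *buffer = NULL;
}
```

After the transfer, exactly one free of node->serialized_job releases the memory, which matches the commit's note that the node's copy was already freed properly once the job was indexed.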
  13. 22 Aug, 2017 2 commits
  14. 21 Aug, 2017 2 commits
    • Print numbers using exponential format as needed · c125759d
      Isaac Hartung authored
      Print numbers using exponential format if required to fit in allocated
      field width. The sacctmgr and sshare commands are impacted.
      Bug 1749
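One way to picture "exponential format if required to fit" is a formatter that tries plain notation first and falls back to scientific notation with as much precision as the field allows. This is a sketch under that assumption; the function name is hypothetical and it is not the code used by sacctmgr or sshare.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format value into out so that it occupies at most width characters.
 * Returns the formatted length, which exceeds width only when even the
 * shortest exponential form does not fit.
 */
static int sketch_format_number(char *out, size_t out_len,
                                double value, int width)
{
    /* First choice: plain notation without a fractional part. */
    int n = snprintf(out, out_len, "%.0f", value);
    if (n >= 0 && n <= width)
        return n;

    /* Too wide: use exponential notation, dropping digits until it fits. */
    for (int prec = width; prec >= 0; prec--) {
        n = snprintf(out, out_len, "%.*e", prec, value);
        if (n >= 0 && n <= width)
            return n;
    }
    return n; /* still too wide for the field */
}
```

The output buffer should be comfortably larger than the field width (for example 64 bytes), since the early, high-precision attempts are written there before a shorter form is found.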
    • select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d
      Alejandro Sanchez authored
      Given a configuration where TopologyParam includes the Dragonfly option, if a
      job requested a --switches count, the count timeout specified by either
      the job request or the max_switch_wait SchedulerParameters option was not
      respected. This was because the leaf_switch_count variable was not
      incremented in the _eval_nodes_dfly() function when needed, as is done in
      _eval_nodes_topo(), the latter being an execution path that already waits
      correctly for the switch count timeout.
      
      Bug 4056
  15. 18 Aug, 2017 3 commits
  16. 17 Aug, 2017 2 commits
  17. 16 Aug, 2017 3 commits
  18. 15 Aug, 2017 4 commits