1. 14 Sep, 2017 1 commit
    • Tim Wickberg's avatar
      Prevent a second PMI2_Init call from leaving a hung slurmstepd process. · b2aa25d5
      Tim Wickberg authored
      A second PMI2_Init() within the same step is invalid, and cannot succeed.
      
      Return an error code back to the client end, and close the fd to force the
      step to terminate immediately.
      
      Due to a bug in our libpmi code, just returning a cmd=response_to_init with
      an appropriate rc number will not tear down the connection properly, so
      send back something else that will trigger the error path.
      
      Bug 3520.
      b2aa25d5
  2. 13 Sep, 2017 2 commits
  3. 12 Sep, 2017 6 commits
  4. 11 Sep, 2017 1 commit
  5. 08 Sep, 2017 4 commits
  6. 07 Sep, 2017 2 commits
  7. 06 Sep, 2017 9 commits
  8. 01 Sep, 2017 2 commits
  9. 31 Aug, 2017 4 commits
  10. 30 Aug, 2017 2 commits
  11. 29 Aug, 2017 1 commit
  12. 28 Aug, 2017 2 commits
  13. 24 Aug, 2017 1 commit
    • Alejandro Sanchez's avatar
      Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
      call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.
      
      This scenario happened when cgroup.conf has ConstrainDevices=yes and
      task_cgroup_devices_create() tries to collect the GRES devices
      but gres_cpu_cnt=0, thus creating a p->cpus_bitmap = bit_alloc(gres_cpu_cnt);
      of zero size which is passed by argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but have no GresTypes nor GRES option in the slurm.conf / node config def.
      
      Bug 3974
      3e1fffb6
  14. 23 Aug, 2017 1 commit
    • Alejandro Sanchez's avatar
      jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
      the corresponding job_node->serialized_job, but the originally generated buffer
      wasn't freed afterwards. The fix consists in change the transfer so that instead
      of xstrdup'ing the char * we just assign the pointer and NULL the buffer.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
      8172b7df
  15. 22 Aug, 2017 2 commits