1. 19 Sep, 2017 2 commits
  2. 14 Sep, 2017 1 commit
    • Prevent a second PMI2_Init call from leaving a hung slurmstepd process. · b2aa25d5
      Tim Wickberg authored
      A second PMI2_Init() within the same step is invalid, and cannot succeed.
      
      Return an error code back to the client end, and close the fd to force the
      step to terminate immediately.
      
      Due to a bug in our libpmi code, just returning a cmd=response_to_init with
      an appropriate rc number will not tear down the connection properly, so
      send back something else that will trigger the error path.
      
      Bug 3520.
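
      A minimal sketch of the guard described above, assuming a per-step request
      handler; the names pmi2_handle_init(), init_seen, and the reply string are
      illustrative, not the actual libpmi2 symbols, and the exact reply chosen by
      the fix is not shown in this message:

      #include <stdbool.h>
      #include <string.h>
      #include <unistd.h>

      static bool init_seen = false;  /* set once the first PMI2_Init succeeds */

      static int pmi2_handle_init(int client_fd)
      {
              if (init_seen) {
                      /* A second init within the same step cannot succeed.
                       * Reply with something the client treats as fatal,
                       * then close the fd so the step terminates instead
                       * of hanging. */
                      const char *reply = "cmd=unexpected;rc=-1;\n";
                      (void) write(client_fd, reply, strlen(reply));
                      close(client_fd);
                      return -1;
              }
              init_seen = true;
              /* ... normal first-time init handling continues here ... */
              return 0;
      }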
  3. 13 Sep, 2017 1 commit
  4. 12 Sep, 2017 3 commits
  5. 08 Sep, 2017 2 commits
  6. 07 Sep, 2017 2 commits
  7. 01 Sep, 2017 2 commits
  8. 24 Aug, 2017 1 commit
    • Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
      call to bit_nclear() with start=0 and stop=-1, which triggers the ABRT.
      
      This scenario happened when cgroup.conf has ConstrainDevices=yes and
      task_cgroup_devices_create() tries to collect the GRES devices while
      gres_cpu_cnt=0, so p->cpus_bitmap = bit_alloc(gres_cpu_cnt); creates a
      zero-size bitmap that is then passed to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but no GresTypes or GRES option is defined in slurm.conf / the node
      configuration.
      
      Bug 3974
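
      A rough sketch of the defensive check this implies, assuming Slurm's
      bitstring API (bit_alloc()/bit_unfmt()/bit_free(), prototypes approximated
      here); the helper build_cpus_bitmap() and the cpu_range argument are
      illustrative, not the real task_cgroup_devices_create() code:

      #include <stdint.h>
      #include <stddef.h>

      typedef struct bitstr bitstr_t;            /* opaque bitstring type */
      extern bitstr_t *bit_alloc(int64_t nbits); /* approximate prototypes */
      extern int bit_unfmt(bitstr_t *b, char *str);
      extern void bit_free(bitstr_t *b);

      static bitstr_t *build_cpus_bitmap(int gres_cpu_cnt, char *cpu_range)
      {
              bitstr_t *cpus_bitmap;

              if (gres_cpu_cnt <= 0)
                      return NULL;    /* avoid bit_unfmt() on a zero-size
                                       * bitmap, which ended in
                                       * bit_nclear(0, -1) and the ABRT */

              cpus_bitmap = bit_alloc(gres_cpu_cnt);
              if (bit_unfmt(cpus_bitmap, cpu_range) != 0) {
                      bit_free(cpus_bitmap);
                      return NULL;
              }
              return cpus_bitmap;
      }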
  9. 23 Aug, 2017 1 commit
    • jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
      the corresponding job_node->serialized_job, but the originally generated buffer
      wasn't freed afterwards. The fix consists of changing the transfer so that,
      instead of xstrdup'ing the char *, we just assign the pointer and NULL the
      original buffer.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
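
      A minimal sketch of the ownership transfer described above, using plain
      strdup()/free() in place of Slurm's xstrdup()/xfree(); the struct and
      function names are illustrative:

      #include <stdlib.h>
      #include <string.h>

      struct job_node {
              char *serialized_job;
      };

      /* Before: duplicating the buffer left the original allocation
       * unfreed, hence the "definitely lost" bytes reported by valgrind. */
      static void transfer_leaky(struct job_node *jn, char *buffer)
      {
              jn->serialized_job = strdup(buffer);
              /* buffer is never freed -> leak */
      }

      /* After: hand over the pointer and NULL the source, so there is a
       * single owner; it is freed later when the job is indexed. */
      static void transfer_fixed(struct job_node *jn, char **buffer)
      {
              jn->serialized_job = *buffer;
              *buffer = NULL;
      }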
  10. 22 Aug, 2017 2 commits
  11. 21 Aug, 2017 1 commit
    • select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d
      Alejandro Sanchez authored
      Given a configuration with TopologyParam including the Dragonfly option, if a
      job requested a --switches count, the count timeout specified by either
      the job request or the max_switch_wait SchedulerParameters was not respected.
      This was due to the leaf_switch_count variable not being incremented in
      _eval_nodes_dfly() when needed, as is done in _eval_nodes_topo(), the
      latter being an execution path that already waits correctly for the
      switch count timeout.
      
      Bug 4056
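
      A simplified sketch of the missing accounting, assuming the dragonfly path
      should track how many leaf switches are in use the same way the
      _eval_nodes_topo() path does; the names switch_node_cnt and
      count_leaf_switches() are illustrative, not the actual cons_res code:

      static int count_leaf_switches(const int *switch_node_cnt, int switch_cnt)
      {
              int leaf_switch_count = 0;

              for (int i = 0; i < switch_cnt; i++) {
                      /* _eval_nodes_dfly() previously never incremented
                       * this counter, so the --switches timeout check
                       * downstream never took effect */
                      if (switch_node_cnt[i] > 0)
                              leaf_switch_count++;
              }
              return leaf_switch_count;
      }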
  12. 17 Aug, 2017 1 commit
  13. 16 Aug, 2017 1 commit
  14. 15 Aug, 2017 1 commit
  15. 14 Aug, 2017 3 commits
  16. 11 Aug, 2017 3 commits
  17. 07 Aug, 2017 2 commits
  18. 04 Aug, 2017 4 commits
  19. 02 Aug, 2017 2 commits
    • Fix starting ctld w/out existing StateSaveLocation · ec78d45a
      Marshall Garey authored
      Startup would fail when trying to create the clustername file because the
      StateSaveLocation path didn't exist yet.
      
      Bug 3988
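
      A hedged sketch of the startup fix: make sure the StateSaveLocation
      directory exists before writing the clustername file into it. The function
      name and the single-level mkdir() are simplifications of what slurmctld
      actually does:

      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/stat.h>
      #include <unistd.h>

      static int create_clustername_file(const char *state_save_loc,
                                         const char *cluster_name)
      {
              char path[4096];
              int fd;

              /* Previously this step was missing, so the open() below
               * failed with ENOENT when the directory did not exist yet. */
              if (mkdir(state_save_loc, 0755) < 0 && errno != EEXIST)
                      return -1;

              snprintf(path, sizeof(path), "%s/clustername", state_save_loc);
              fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
              if (fd < 0)
                      return -1;
              if (write(fd, cluster_name, strlen(cluster_name)) < 0) {
                      close(fd);
                      return -1;
              }
              close(fd);
              return 0;
      }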
    • Fix srun jobs to run in high prio partition · 948de46b
      Marshall Garey authored
      srun jobs that could start immediately and requested multiple partitions
      didn't run in the highest priority partition if the highest priority
      partition wasn't listed first.
      
      Note that scontrol show job may now display the partition list in
      priority order, since the job's partition list gets sorted by
      priority.
      
      Bug 4015
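
      An illustrative sketch of sorting a job's requested partitions so the
      highest-priority one comes first; struct part_record and the function
      names here are stand-ins, not Slurm's actual structures:

      #include <stdlib.h>

      struct part_record {
              const char *name;
              unsigned int priority;  /* partition priority, e.g. PriorityTier */
      };

      /* descending order: higher-priority partitions sort first */
      static int _cmp_part_prio(const void *a, const void *b)
      {
              const struct part_record *pa = *(const struct part_record **) a;
              const struct part_record *pb = *(const struct part_record **) b;

              if (pa->priority > pb->priority)
                      return -1;
              if (pa->priority < pb->priority)
                      return 1;
              return 0;
      }

      static void sort_job_partitions(struct part_record **parts, size_t cnt)
      {
              qsort(parts, cnt, sizeof(parts[0]), _cmp_part_prio);
      }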
  20. 01 Aug, 2017 2 commits
  21. 31 Jul, 2017 1 commit
  22. 28 Jul, 2017 2 commits
    • Fix issue when using an alternate munge key while communicating on a persistent connection. · 591dc036
      Danny Auble authored
      
      Bug 4009
    • jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a
      Alejandro Sanchez authored
      jobcomp/elasticsearch saves/loads its state to/from elasticsearch_state.  Since
      the jobcomp API isn't designed with save/load state operations, the plugin's
      _save_state() isn't extern and isn't available from outside the plugin itself,
      so it is tightly coupled to the fini() function. This state doesn't follow the
      same execution path as the rest of the Slurm state, which is all independently
      scheduled in save_all_state(). So we save it manually here on an RPC of
      type REQUEST_CONTROL.
      
      This makes it so that when the Primary ctld issues a REQUEST_CONTROL to the
      Backup which is currently in controller mode, the Backup saves the state, and
      when the Primary assumes control again it can process the saved pending jobs.
      The other direction was already handled, because when the Primary is running
      in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is
      shut down, and when breaking out of the ctld main() while(1) loop there was
      already a g_slurm_jobcomp_fini() call in place.
      
      Bug 3908
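
      A simplified sketch of the idea, assuming the REQUEST_CONTROL handler
      flushes the jobcomp state by way of g_slurm_jobcomp_fini() (which, for
      jobcomp/elasticsearch, saves the pending-jobs state); whether the actual
      patch calls fini() or a dedicated hook is not shown in this message, and
      _handle_request_control() is an illustrative name:

      extern int g_slurm_jobcomp_fini(void);  /* plugin unload; elasticsearch
                                               * saves its state in fini() */

      static void _handle_request_control(void)
      {
              /* The backup controller is handing control back to the
               * primary: flush the jobcomp/elasticsearch state now, since
               * this RPC path does not pass through the normal shutdown
               * of main()'s while(1) loop. */
              (void) g_slurm_jobcomp_fini();

              /* ... relinquish control, stop scheduling, etc. ... */
      }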