1. 12 Sep, 2017 3 commits
  2. 08 Sep, 2017 2 commits
  3. 07 Sep, 2017 2 commits
  4. 01 Sep, 2017 2 commits
  5. 24 Aug, 2017 1 commit
    • Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a
      subsequent call to bit_nclear() with start=0 and stop=-1, which
      triggers the ABRT.
      
      This scenario happened when cgroup.conf had ConstrainDevices=yes and
      task_cgroup_devices_create() tried to collect the GRES devices while
      gres_cpu_cnt=0, so p->cpus_bitmap = bit_alloc(gres_cpu_cnt); created
      a zero-size bitmap that was then passed by argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but neither GresTypes nor a Gres option is set in slurm.conf / the
      node configuration definition (see the guard sketch below).
      
      Bug 3974
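      
      A minimal sketch of the kind of guard that avoids the crash, using
      the names quoted above plus hypothetical surroundings (the actual
      change in 3e1fffb6 may differ):
      
          /* Only build and parse the CPUs bitmap when CPUs are known
           * for this GRES: bit_unfmt() on a zero-size bitmap ends up
           * in bit_nclear(bitmap, 0, -1) and aborts the slurmstepd. */
          if (gres_cpu_cnt > 0) {
              p->cpus_bitmap = bit_alloc(gres_cpu_cnt);
              if (bit_unfmt(p->cpus_bitmap, p->cpus))
                  error("Invalid GRES CPUs specification: %s", p->cpus);
          }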
  6. 23 Aug, 2017 1 commit
    • jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the buffer generated in slurm_jobcomp_log_record() was
      xstrdup'ed to the corresponding job_node->serialized_job, but the
      originally generated buffer was never freed afterwards. The fix
      changes the transfer: instead of xstrdup'ing the char *, just assign
      the pointer and NULL the source buffer, as sketched below.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
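      
      A minimal sketch of the ownership transfer described above (variable
      names follow the description; surrounding code is illustrative):
      
          /* Before: duplicate the buffer, leaking the original. */
          job_node->serialized_job = xstrdup(buffer);
      
          /* After: hand the buffer over and NULL the source pointer.
           * job_node->serialized_job is still xfree'd later, once the
           * job has been indexed, so nothing leaks or double-frees. */
          job_node->serialized_job = buffer;
          buffer = NULL;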
  7. 22 Aug, 2017 2 commits
  8. 21 Aug, 2017 1 commit
    • select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d
      Alejandro Sanchez authored
      Given a configuration with TopologyParam including the Dragonfly
      option, if a job requested a --switches count, the count timeout
      specified by either the job request or the max_switch_wait
      SchedulerParameters option was not respected. This was due to the
      leaf_switch_count variable not being incremented in
      _eval_nodes_dfly() when needed, as is done in _eval_nodes_topo(),
      the latter being an execution path that already waited correctly for
      the switch count timeout (see the sketch below).
      
      Bug 4056
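      
      A minimal sketch of the missing accounting, with a hypothetical
      switch_used array and loop context; _eval_nodes_topo() shows the
      working pattern:
      
          /* Count each leaf switch newly selected for the job so the
           * requested --switches count and its timeout can later be
           * evaluated against leaf_switch_count. */
          if (!switch_used[i]) {
              switch_used[i] = true;
              leaf_switch_count++;
          }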
  9. 17 Aug, 2017 1 commit
  10. 16 Aug, 2017 1 commit
  11. 15 Aug, 2017 1 commit
  12. 14 Aug, 2017 3 commits
  13. 11 Aug, 2017 3 commits
  14. 07 Aug, 2017 2 commits
  15. 04 Aug, 2017 4 commits
  16. 02 Aug, 2017 2 commits
    • Fix starting ctld w/out existing StateSaveLocation · ec78d45a
      Marshall Garey authored
      slurmctld would fail when trying to create the clustername file
      because the StateSaveLocation path didn't exist yet (see the sketch
      below).
      
      Bug 3988
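      
      A minimal sketch of the general pattern, assuming a hypothetical
      state_save_loc path variable (the actual fix in ec78d45a may
      differ):
      
          /* Create the StateSaveLocation directory if it is missing,
           * before writing the clustername file inside it. */
          if ((mkdir(state_save_loc, 0755) < 0) && (errno != EEXIST))
              fatal("Unable to create StateSaveLocation %s: %m",
                    state_save_loc);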
    • Fix srun jobs to run in high prio partition · 948de46b
      Marshall Garey authored
      srun jobs that could start immediately and requested multiple partitions
      didn't run in the highest priority partition if the highest priority
      partition wasn't listed first.
      
      As a side effect, scontrol show job may now display the partition
      list in priority order, since the job's partition list gets sorted
      by priority, as sketched below.
      
      Bug 4015
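      
      A minimal sketch of sorting a job's partition list by descending
      priority; the comparator shape follows Slurm's list_sort(), but the
      type and field names here are illustrative:
      
          /* Higher priority partitions sort first. */
          static int _sort_part_prio(void *x, void *y)
          {
              part_record_t *p1 = *(part_record_t **) x;
              part_record_t *p2 = *(part_record_t **) y;
      
              return (int) (p2->priority - p1->priority);
          }
      
          list_sort(job_ptr->part_ptr_list, _sort_part_prio);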
  17. 01 Aug, 2017 2 commits
  18. 31 Jul, 2017 1 commit
  19. 28 Jul, 2017 2 commits
    • Fix issue when using an alternate munge key while communicating on a persistent connection. · 591dc036
      Danny Auble authored
      
      Bug 4009
    • jobcomp/elasticsearch - save state on REQUEST_CONTROL. · 8944b77a
      Alejandro Sanchez authored
      jobcomp/elasticsearch saves/loads its state to/from
      elasticsearch_state. Since the jobcomp API isn't designed with
      save/load state operations, the plugin's _save_state() isn't extern
      and is not available from outside the plugin itself; it is thus
      tightly coupled to the fini() function. This state doesn't follow
      the same execution path as the rest of the Slurm states, which are
      all independently scheduled in save_all_state(). So we save it
      manually here on an RPC of type REQUEST_CONTROL.
      
      This ensures that when the primary ctld issues a REQUEST_CONTROL to
      a backup currently in controller mode, the backup saves the state,
      so that when the primary assumes control again it can process the
      saved pending jobs. The other direction was already handled: when
      the primary is running in controller mode and the backup issues a
      REQUEST_CONTROL, the primary is shut down, and on breaking out of
      the ctld main() while(1) loop there was already a
      g_slurm_jobcomp_fini() call in place (see the sketch below).
      
      Bug 3908
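      
      A minimal sketch of the idea in the REQUEST_CONTROL handler; only
      g_slurm_jobcomp_fini() is taken from the description above, its
      placement here is illustrative:
      
          /* Before handing control back to the primary, run the
           * jobcomp plugin's fini() so jobcomp/elasticsearch gets a
           * chance to call its internal _save_state(). */
          g_slurm_jobcomp_fini();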
  20. 27 Jul, 2017 1 commit
    • Fix bug when tracking multiple simultaneous spawned ping cycles · f7463ef5
      Alejandro Sanchez authored
      When more than one ping cycle is spawned simultaneously (for
      instance REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the
      selected nodes), we do not track a separate ping_start time for each
      cycle. When ping_begin() is called, the information about the
      previous ping cycle is lost. Then, when ping_end() is called for the
      first of the two cycles, ping_start is set to 0, which is then
      incorrectly used to check whether the last cycle ran for more than
      PING_TIMEOUT seconds (100s), spuriously triggering the:
      
       error("Node ping apparently hung, many nodes may be DOWN or configured "
             "SlurmdTimeout should be increased");
      
      Bug 3914
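      
      A minimal sketch of the flawed check, with illustrative
      surroundings; a fix needs one start time per spawned cycle rather
      than the single shared ping_start:
      
          /* ping_start is shared by all in-flight cycles. ping_end()
           * for the first cycle resets it to 0, so this test computes
           * (now - 0) > PING_TIMEOUT and fires spuriously. */
          if ((time(NULL) - ping_start) > PING_TIMEOUT)
              error("Node ping apparently hung, many nodes may be DOWN "
                    "or configured SlurmdTimeout should be increased");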
  21. 26 Jul, 2017 3 commits