  1. 13 Oct, 2017 5 commits
  2. 10 Oct, 2017 4 commits
  3. 05 Oct, 2017 1 commit
      Show correct MaxTRESPerNode limit assoc reasons · 6e806f2d
      Brian Christiansen authored
      Before:
      $ sbatch --wrap="sleep 300"
      Submitted batch job 228
      $ squeue
                   JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
                     228     debug     wrap    brian PD       0:00      1 (AssocMaxUnknownPerNode)
      
      Fixed:
      $ squeue
                   JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
                     229     debug     wrap    brian PD       0:00      1 (AssocMaxCpuPerNode)
      
      $ sacctmgr mod account stuff set maxtrespernode=cpu=-1,mem=1
      $ squeue
                   JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
                     229     debug     wrap    brian PD       0:00      1 (AssocMaxMemPerNode)
      
      $ sbatch --wrap="sleep 300" --gres=blah:2 -pgpu
      Submitted batch job 235
      $ squeue
                   JOBID PARTITION     NAME     USER ST       TIME   CPUS NODELIST(REASON)
                     235       gpu     wrap    brian PD       0:00      1 (AssocMaxGRESPerNode)
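The squeue outputs above suggest that the pending reason now depends on which TRES hit the per-node limit. A minimal sketch of such a mapping follows. The function name `toy_reason_for_tres` and the assumption that GRES-type TRES names begin with `gres/` are illustrative and not taken from the commit; only the four reason strings come from the output shown.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not Slurm code): pick a pending-reason string
 * for the TRES that exceeded its MaxTRESPerNode limit.
 * Assumption: GRES-type TRES names start with "gres/".
 */
static const char *toy_reason_for_tres(const char *tres)
{
    assert(tres != NULL);

    if (strcmp(tres, "cpu") == 0)
        return "AssocMaxCpuPerNode";
    if (strcmp(tres, "mem") == 0)
        return "AssocMaxMemPerNode";
    if (strncmp(tres, "gres/", 5) == 0)
        return "AssocMaxGRESPerNode";

    /* Fallback for TRES types without a dedicated reason. */
    return "AssocMaxUnknownPerNode";
}
```

With this sketch, a call such as `toy_reason_for_tres("gres/blah")` selects the GRES reason, and an unrecognized name falls through to the generic one.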
  4. 04 Oct, 2017 1 commit
      burst_buffer/cray plugin updated for Cray UP06 software · 859f6c82
      Morris Jette authored
      burst_buffer/cray plugin modified to work with changes in Cray UP06
         software release.
      Specific changes: Cray software now returns an error if a stage_in
         or stage_out script is processed that doesn't actually request a
         stage in or out (previously silently ignored).
      Also, the warning message for tearing down a buffer that is already
         gone has changed.
  5. 02 Oct, 2017 2 commits
  6. 29 Sep, 2017 2 commits
  7. 27 Sep, 2017 2 commits
  8. 19 Sep, 2017 3 commits
  9. 14 Sep, 2017 1 commit
      Prevent a second PMI2_Init call from leaving a hung slurmstepd process. · b2aa25d5
      Tim Wickberg authored
      A second PMI2_Init() within the same step is invalid, and cannot succeed.
      
      Return an error code back to the client end, and close the fd to force the
      step to terminate immediately.
      
      Due to a bug in our libpmi code, just returning a cmd=response_to_init with
      an appropriate rc number will not tear down the connection properly, so
      send back something else that will trigger the error path.
      
      Bug 3520.
  10. 13 Sep, 2017 1 commit
  11. 12 Sep, 2017 3 commits
  12. 08 Sep, 2017 2 commits
  13. 07 Sep, 2017 2 commits
  14. 01 Sep, 2017 2 commits
  15. 24 Aug, 2017 1 commit
      Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
      call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.
      
      This scenario occurs when cgroup.conf has ConstrainDevices=yes and
      task_cgroup_devices_create() tries to collect the GRES devices while
      gres_cpu_cnt=0, so p->cpus_bitmap = bit_alloc(gres_cpu_cnt) is created
      with zero size and then passed as an argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but have neither GresTypes nor a GRES option in the slurm.conf / node config def.
      
      Bug 3974
  16. 23 Aug, 2017 1 commit
      jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
      the corresponding job_node->serialized_job, but the originally generated buffer
      wasn't freed afterwards. The fix changes the transfer so that, instead of
      xstrdup'ing the char *, we just assign the pointer and NULL the buffer.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
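The pointer hand-off described above can be illustrated in plain C. This is a sketch using the standard allocator rather than Slurm's xmalloc family, and `toy_job_node`, `toy_transfer`, and `toy_node_release` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the plugin's queued job record. */
struct toy_job_node {
    char *serialized_job;
};

/*
 * Move ownership of *buf into the node. No copy is made, and the
 * caller's pointer is set to NULL so it cannot be freed or reused
 * by mistake. Duplicating the string here instead, without freeing
 * the original, is the kind of leak the commit message describes.
 */
static void toy_transfer(struct toy_job_node *node, char **buf)
{
    assert(node != NULL && buf != NULL);

    node->serialized_job = *buf;
    *buf = NULL;
}

/* Release the node's buffer once the record has been consumed. */
static void toy_node_release(struct toy_job_node *node)
{
    assert(node != NULL);

    free(node->serialized_job);
    node->serialized_job = NULL;
}
```

After `toy_transfer(&node, &buf)`, only the node owns the allocation, so a single `toy_node_release(&node)` frees it exactly once.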
  17. 22 Aug, 2017 2 commits
  18. 21 Aug, 2017 1 commit
      select/cons_res - fix bug with Dragonfly and --switches count timeout · 46c0919d
      Alejandro Sanchez authored
      Given a configuration with TopologyParam including the Dragonfly option, if a
      job requested a --switches count, the count timeout specified by either
      the job request or max_switch_wait in SchedulerParameters was not respected.
      This was due to the leaf_switch_count variable not being incremented in
      the _eval_nodes_dfly() function when needed, as is done in _eval_nodes_topo(),
      the latter being an execution path that already waits correctly for the
      switch count timeout.
      
      Bug 4056
  19. 17 Aug, 2017 1 commit
  20. 16 Aug, 2017 1 commit
  21. 15 Aug, 2017 1 commit
  22. 14 Aug, 2017 1 commit