1. 12 Sep, 2017 6 commits
    • Pack job debugger synchronization · a8d2a04f
      Morris Jette authored
      Don't flag a job as "SPAWNED" for debugger until all process
        information is available for all pack job components.
    • Fix autoconf test for libcurl when clang is the compiler. · d670de2d
      Tim Wickberg authored
      Adding a newline prevents this error:
      conftest.c:154:8: error: if statement has empty body [-Werror,-Wempty-body]
    • If creating/altering a core based reservation with scontrol/sview on a · 3b3e67e1
      Alejandro Sanchez authored
      remote cluster correctly determine the select type.
      
      Bug 2329
    • Modify srun --mpi=list output · 83058057
      Morris Jette authored
      Modify "srun --mpi=list" output to match valid option input by removing the
          "mpi/" prefix on each line of output.
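The prefix removal described in this commit can be illustrated with a minimal sketch. The function name `strip_mpi_prefix` is hypothetical and is not taken from the Slurm source; it only shows one way a plugin type string such as "mpi/pmi2" could be printed in the form that `--mpi=` accepts.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not Slurm's code): return the plugin name without
 * a leading "mpi/" prefix. The returned pointer aliases the input string.
 */
const char *strip_mpi_prefix(const char *plugin_type)
{
    static const char prefix[] = "mpi/";
    const size_t len = sizeof(prefix) - 1;   /* length without the NUL */

    assert(plugin_type != NULL);
    if (strncmp(plugin_type, prefix, len) == 0)
        return plugin_type + len;            /* skip past "mpi/" */
    return plugin_type;                      /* no prefix: leave unchanged */
}
```

Because the result points into the original string, no allocation or cleanup is needed when printing each line.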
    • Disable heterogeneous steps by default · fe2daac7
      Morris Jette authored
      Enable them only with SchedulerParameters=enable_hetero_jobs or when the
        MPI type is "none".
    • Speedup arbitrary distribution algorithm · 011db71c
      Brian Christiansen authored
      Do pointer comparisons rather than strcmps.
      ~80x speedup
      Bug 3529
      
      e.g.
      1000 nodes
      8000 tasks
      [Sep 11 14:24:15.873639 20992 srvcn        0x7f8c1cdda700] _task_layout_hostfile: hostfile processing took usec=2152678 (orig)
      [Sep 11 14:27:46.173424 20992 srvcn        0x7f8c1c6d3700] _task_layout_hostfile: hostfile processing took usec=2142997 (orig)
      [Sep 11 14:32:32.245420  4037 srvcn        0x7f12de4e4700] _task_layout_hostfile: hostfile processing took usec=26198 (node ptrs)
      [Sep 11 14:36:12.88769   4037 srvcn        0x7f12de6e6700] _task_layout_hostfile: hostfile processing took usec=25515 (node ptrs)
      [Sep 11 14:41:38.339162  4037 srvcn        0x7f132c8d5700] _task_layout_hostfile: hostfile processing took usec=27459 (node ptrs)
      [Sep 11 15:16:59.575189  1874 srvcn        0x7f3dae3f0700] _task_layout_hostfile: hostfile processing took usec=30129 (node ptrs)
      [Sep 11 15:20:50.365004  1874 srvcn        0x7f3dc8b34700] _task_layout_hostfile: hostfile processing took usec=29884 (node ptrs)
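The idea of replacing string comparisons with pointer comparisons can be sketched as follows. This is an illustration under assumptions, not the code from the commit: the `node_rec` type and all function names are hypothetical. Each host name is resolved to a node record once, after which matching a task to a node is a single address comparison instead of a `strcmp` per task.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node record; not Slurm's actual node structure. */
typedef struct {
    const char *name;
} node_rec;

/* Resolve a host name to its record once (linear scan for brevity). */
const node_rec *find_node(const node_rec *nodes, size_t n_nodes,
                          const char *name)
{
    for (size_t i = 0; i < n_nodes; i++) {
        if (strcmp(nodes[i].name, name) == 0)
            return &nodes[i];
    }
    return NULL;
}

/* Slow variant: one string comparison per task. */
size_t count_tasks_by_name(const char *const *task_hosts, size_t n_tasks,
                           const char *name)
{
    size_t count = 0;
    for (size_t i = 0; i < n_tasks; i++) {
        if (strcmp(task_hosts[i], name) == 0)
            count++;
    }
    return count;
}

/* Fast variant: tasks already hold node pointers, so compare addresses. */
size_t count_tasks_by_ptr(const node_rec *const *task_nodes, size_t n_tasks,
                          const node_rec *node)
{
    size_t count = 0;
    for (size_t i = 0; i < n_tasks; i++) {
        if (task_nodes[i] == node)
            count++;
    }
    return count;
}
```

The pointer variant is only valid when every task entry refers to the same table of records, so that equal names always map to equal addresses.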
  2. 11 Sep, 2017 2 commits
  3. 08 Sep, 2017 6 commits
  4. 07 Sep, 2017 2 commits
  5. 05 Sep, 2017 1 commit
  6. 04 Sep, 2017 1 commit
    • Fix to test job mem against MaxMemPer[CPU|Node] limits at scheduling time. · 24365514
      Alejandro Sanchez authored
      Initially job mem limits were tested at submission time through
      _validate_min_mem_partition() -> _valid_pn_min_mem(), but not tested
      again at scheduling time, thus leading to jobs incorrectly being scheduled
      against partitions where the job exceeded their MaxMemPer* limit
      (which can in turn be inherited from the system wide limit too).
      
      NOTE: New WAIT_PN_MEM_LIMIT job_state_reason enum component added to support
      this new waiting reason.
      
      Bug 2291.
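A scheduling-time check of this kind might look like the sketch below. Everything here is an assumption made for illustration: the enum, the function names, and the convention that a limit of 0 means "no limit" are not taken from the commit, and the sketch ignores the per-CPU versus per-node distinction. It only shows the shape of the logic: pick the partition limit, fall back to the system-wide one, and report a waiting reason when the job exceeds it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes; the real enum component is WAIT_PN_MEM_LIMIT. */
enum sketch_reason {
    SKETCH_OK = 0,
    SKETCH_WAIT_PN_MEM_LIMIT
};

/* Assumption: 0 means "unset", so the partition inherits the system limit. */
uint64_t effective_mem_limit(uint64_t part_limit, uint64_t sys_limit)
{
    return part_limit != 0 ? part_limit : sys_limit;
}

/* Re-test the job's memory request each time it is considered for a partition. */
enum sketch_reason check_mem_at_schedule(uint64_t job_mem,
                                         uint64_t part_limit,
                                         uint64_t sys_limit)
{
    uint64_t limit = effective_mem_limit(part_limit, sys_limit);

    if (limit != 0 && job_mem > limit)
        return SKETCH_WAIT_PN_MEM_LIMIT;   /* job stays pending */
    return SKETCH_OK;
}
```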
  7. 02 Sep, 2017 1 commit
  8. 01 Sep, 2017 4 commits
  9. 31 Aug, 2017 1 commit
  10. 30 Aug, 2017 3 commits
  11. 29 Aug, 2017 5 commits
  12. 25 Aug, 2017 2 commits
  13. 24 Aug, 2017 2 commits
    • Add file bcast support for pack jobs · 58b21490
      Morris Jette authored
      Modify sbcast command and srun's --bcast option to support heterogeneous
      jobs.
      Bug 4099
    • Prevent slurmstepd ABRT when parsing gres.conf CPUs. · 3e1fffb6
      Alejandro Sanchez authored
      Calling bit_unfmt() with a zero bit_size() bitmap leads to a later
      call to bit_nclear() with start=0 and stop=-1, leading to the ABRT.
      
      This scenario happened when cgroup.conf has ConstrainDevices=yes and
      task_cgroup_devices_create() tries to collect the GRES devices
      but gres_cpu_cnt=0, thus creating a p->cpus_bitmap = bit_alloc(gres_cpu_cnt);
      of zero size which is passed by argument to bit_unfmt().
      
      gres_cpu_cnt is 0 because we have defined a gres.conf like this:
      
      Name=gpu Type=tesla File=/tmp/gres/tesla0 CPUs=0,1
      Name=gpu Type=tesla File=/tmp/gres/tesla1 CPUs=0,1
      Name=gpu Type=kepler File=/tmp/gres/kepler0 CPUs=2,3
      Name=gpu Type=kepler File=/tmp/gres/kepler1 CPUs=2,3
      
      but have no GresTypes nor GRES option in the slurm.conf / node config def.
      
      Bug 3974
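The failure mode described above (a zero-size bitmap producing a clear range of start=0, stop=-1) can be reproduced with a small stand-alone sketch. The `tiny_bitmap` type and both functions are hypothetical stand-ins, not Slurm's bitstring API, and the commit does not say that its fix takes this form. The sketch only shows how a range check and an explicit zero-size case avoid the invalid call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-capacity bitmap (at most 128 bits) for illustration. */
typedef struct {
    int64_t nbits;
    uint8_t bits[16];
} tiny_bitmap;

/* Clear bits start..stop inclusive. Returns 0 on success, -1 on a bad range. */
int tiny_nclear(tiny_bitmap *b, int64_t start, int64_t stop)
{
    if (b == NULL || start < 0 || stop < start || stop >= b->nbits)
        return -1;                      /* rejects start=0, stop=-1 */
    for (int64_t i = start; i <= stop; i++)
        b->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
    return 0;
}

/* Clear the whole bitmap, treating a zero-size bitmap as a no-op. */
int tiny_clear_all(tiny_bitmap *b)
{
    if (b == NULL)
        return -1;
    if (b->nbits == 0)
        return 0;                       /* nothing to clear */
    return tiny_nclear(b, 0, b->nbits - 1);
}
```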
  14. 23 Aug, 2017 1 commit
    • jobcomp/elasticsearch - fix memory leak when transferring generated buffer. · 8172b7df
      Alejandro Sanchez authored
      Running slurmctld under valgrind while operating with jobcomp/elasticsearch
      reported the following bytes definitely lost:
      
      ==27403== 658 bytes in 1 blocks are definitely lost in loss record 301 of 342
      ==27403==    at 0x4C2FD4F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==27403==    by 0x2281B3: slurm_xrealloc (xmalloc.c:137)
      ==27403==    by 0x22856A: makespace (xstring.c:114)
      ==27403==    by 0x2285D0: _xstrcat (xstring.c:132)
      ==27403==    by 0x228CE0: _xstrfmtcat (xstring.c:291)
      ==27403==    by 0x83C5BCD: ???
      ==27403==    by 0x30A913: g_slurm_jobcomp_write (slurm_jobcomp.c:172)
      ==27403==    by 0x18D8FC: job_completion_logger (job_mgr.c:13652)
      
      It turns out the generated buffer in slurm_jobcomp_log_record was xstrdup'ed to
      the corresponding job_node->serialized_job, but the originally generated buffer
      wasn't freed afterwards. The fix changes the transfer so that, instead
      of xstrdup'ing the char *, we just assign the pointer and NULL the buffer.
      
      The job_node->serialized_job was already xfree'd properly later when the job
      was indexed.
      
      Discovered while working on Bug 4065.
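The difference between duplicating the buffer and handing over ownership can be sketched in plain C. The struct and function names below are hypothetical, and standard `malloc`/`free` stand in for Slurm's xmalloc family; the field name `serialized_job` is the only identifier borrowed from the commit message.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the job node that keeps the serialized record. */
typedef struct {
    char *serialized_job;
} job_node_sketch;

/*
 * Leaky pattern: the node gets its own copy, and the caller's buffer is
 * lost unless the caller remembers to free it separately.
 */
int copy_buffer(job_node_sketch *node, const char *buffer)
{
    size_t len = strlen(buffer) + 1;

    node->serialized_job = malloc(len);
    if (node->serialized_job == NULL)
        return -1;
    memcpy(node->serialized_job, buffer, len);
    return 0;
}

/*
 * Ownership transfer: assign the pointer and NULL the source, so exactly
 * one owner remains and a later free of the node's field releases it.
 */
void transfer_buffer(job_node_sketch *node, char **buffer)
{
    assert(node != NULL && buffer != NULL);
    node->serialized_job = *buffer;
    *buffer = NULL;
}
```

Setting the source pointer to NULL also makes an accidental second free of the original variable harmless, since `free(NULL)` is a no-op.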
  15. 22 Aug, 2017 2 commits
  16. 21 Aug, 2017 1 commit