1. 25 Oct, 2016 1 commit
  2. 24 Oct, 2016 17 commits
    • Morris Jette's avatar
      1dee28db
    • Yu Watanabe's avatar
      Fix typo in src/api/Makefile.am · 3da8021f
      Yu Watanabe authored
      bug 2390
      3da8021f
    • Tim Wickberg's avatar
      Prevent feeding strerror() a negative value. · d3dd24ad
      Tim Wickberg authored
      Many open Coverity issues point back to this. The results
      of passing a negative value to strerror() are undefined,
      return a static string instead.
      d3dd24ad
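The guard described in commit d3dd24ad can be sketched roughly as follows. This is a minimal illustration, not the actual Slurm change: the helper name `safe_strerror` and the fallback text are hypothetical, and the only behavior taken from the commit message is "return a static string instead of passing a negative value to strerror()".

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not Slurm's real function): never hand a
 * negative value to strerror(); return a fixed string instead. */
static const char *safe_strerror(int errnum)
{
	/* Assumption: the fallback wording below is made up for this sketch. */
	if (errnum < 0)
		return "Unknown negative error number";

	/* Non-negative values are passed through to the C library. */
	return strerror(errnum);
}
```

A caller would use `safe_strerror(rc)` wherever `rc` may hold a negated or otherwise unvalidated error code.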
    • Danny Auble's avatar
      Add missing frees from commit 133a4249 · 1b010388
      Danny Auble authored
      1b010388
    • Danny Auble's avatar
      Alphabetize structure, no real code change. · 24b45e39
      Danny Auble authored
      24b45e39
    • Danny Auble's avatar
      Put in missing free from last commit. · c6e3bc97
      Danny Auble authored
      c6e3bc97
    • Doug Parisek's avatar
      Add env variables to the kill_job_msg_t to add pack jobs env vars to the pro/epilog of jobs. · ece58780
      Doug Parisek authored
      ece58780
    • Tim Wickberg's avatar
      0a8cfa67
    • Yu Watanabe's avatar
      Do not install man pages for missing commands · a494efde
      Yu Watanabe authored
      Even if several commands, e.g. sview, are not built and installed, the corresponding man pages are still installed.
      
      The attached patch stops installing man pages when the corresponding commands are not built and installed.
      bug 2393
      a494efde
    • Morris Jette's avatar
      Merge branch 'slurm-16.05' · 9a2b977e
      Morris Jette authored
      9a2b977e
    • Danny Auble's avatar
      Revert "On a fatal, abort so we get a core file instead of just exiting." · be4bc31c
      Danny Auble authored
      This reverts commit 428347cf.
      
      Decided we didn't want a core dump on every fatal, as fatal is used
      in other programs instead of just the daemons.
      be4bc31c
    • Jacek Budzowski's avatar
      Fix for sstat on multi-node batch jobs · 8589ff40
      Jacek Budzowski authored
      There is a problem with gathering batch step statistics for jobs which are allocated on more than one node.
      
      sstat asks the wrong node for batch step stats. It requests info from the last node in the hostlist, whereas it should ask the first host in the hostlist (i.e. BatchHost), because the batch step actually executes only on the first node.
      
      For example, suppose you have a job allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat with the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). You will then see the lines:
      
      sstat: debug:  slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
      sstat: debug:  job step 1234.4294967294 has already completed
      
      The problem lies in the sstat source code. For the batch step, the host is taken from the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached.
      
      bug 2975
      8589ff40
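The difference between taking the last and the first host, as described in commit 8589ff40, can be illustrated with a toy list. This sketch does not use Slurm's hostlist API: `struct toy_hostlist`, `toy_shift` and `toy_pop` are hypothetical stand-ins that only mimic the "remove and return first" versus "remove and return last" behavior the commit message attributes to hostlist_shift and hostlist_pop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model of a host list: a window over an array of names. */
struct toy_hostlist {
	const char **hosts;	/* backing array of host names */
	size_t first;		/* index of the first remaining host */
	size_t count;		/* number of remaining hosts */
};

/* Remove and return the first remaining host, or NULL if the list is empty. */
static const char *toy_shift(struct toy_hostlist *hl)
{
	if (hl->count == 0)
		return NULL;
	hl->count--;
	return hl->hosts[hl->first++];
}

/* Remove and return the last remaining host, or NULL if the list is empty. */
static const char *toy_pop(struct toy_hostlist *hl)
{
	if (hl->count == 0)
		return NULL;
	hl->count--;
	return hl->hosts[hl->first + hl->count];
}
```

With the hosts n0001 and n0002 from the example above, `toy_shift` yields n0001 (the node where the batch step runs), while `toy_pop` yields n0002, which matches the node queried in the debug output.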
    • Morris Jette's avatar
      8e461b60
    • Morris Jette's avatar
      burst_buffer/cray: accept jobs without dw_wlm_cli · 5acd7e76
      Morris Jette authored
      burst_buffer/cray: Accept new jobs on backup slurmctld daemon without access
          to dw_wlm_cli command. No burst buffer actions will take place. Newly
          submitted jobs will be accepted and stay in pending state. Jobs dependent
          upon stage-in or stage-out will remain in their current state until the
          action can take place.
      5acd7e76
    • Brian Christiansen's avatar
      Only free the msg once. · 10c9b0d2
      Brian Christiansen authored
      10c9b0d2
    • Morris Jette's avatar
      Merge branch 'slurm-16.05' · 6c05c82e
      Morris Jette authored
      6c05c82e
    • Dorian Krause's avatar
      Fix use-after-free in srun · 2c7c5459
      Dorian Krause authored
      This commit fixes a bug in the multi-prog handling. When running
      salloc -N 2 srun -O --multi-prog mp.conf where mp.conf reads
      
      0-192 true
      
      srun crashes can be observed. valgrind reports:
      
      ==6857== Invalid read of size 4
      ==6857==    at 0x45938D: bit_realloc (bitstring.c:189)
      ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
      ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
      ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
      ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
      ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
      ==6857==    by 0x58E556: _opt_args (opt.c:1927)
      ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
      ==6857==    by 0x591F82: init_srun (srun_job.c:459)
      ==6857==    by 0x427E70: srun (srun.c:193)
      ==6857==    by 0x428E23: main (srun.wrapper.c:17)
      ==6857==  Address 0x5ace440 is 16 bytes inside a block of size 28 free'd
      ==6857==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
      ==6857==    by 0x446886: slurm_xrealloc (xmalloc.c:139)
      ==6857==    by 0x45944C: bit_realloc (bitstring.c:191)
      ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
      ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
      ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
      ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
      ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
      ==6857==    by 0x58E556: _opt_args (opt.c:1927)
      ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
      ==6857==    by 0x591F82: init_srun (srun_job.c:459)
      ==6857==    by 0x427E70: srun (srun.c:193)
      2c7c5459
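The valgrind trace in commit 2c7c5459 shows a read from a block that an earlier realloc had already released. The commit message does not show the actual fix, so the following is only a general sketch of how this class of bug is commonly avoided: the grow routine takes the address of the caller's pointer and updates it, so no stale copy survives a reallocation. The name `grow_mask` and the byte-array representation are hypothetical and unrelated to Slurm's bitstring code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: grow a zero-extended byte mask in place.
 * Returns 0 on success, -1 if the allocation fails (the old block
 * then remains valid and *mask is left unchanged). */
static int grow_mask(unsigned char **mask, size_t *len, size_t new_len)
{
	unsigned char *tmp;

	if (new_len <= *len)
		return 0;	/* already large enough */

	tmp = realloc(*mask, new_len);
	if (!tmp)
		return -1;

	/* Zero only the newly added tail; existing bytes are preserved. */
	memset(tmp + *len, 0, new_len - *len);

	/* Publish the possibly moved block through the caller's pointer,
	 * so the caller never keeps using the old address. */
	*mask = tmp;
	*len = new_len;
	return 0;
}
```

The design point is that every holder of the old address must be updated after the call; passing a pointer-to-pointer makes that update part of the function's contract instead of leaving it to each caller.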
  3. 21 Oct, 2016 5 commits
  4. 20 Oct, 2016 5 commits
  5. 19 Oct, 2016 8 commits
  6. 18 Oct, 2016 4 commits