- 25 Oct, 2016 1 commit
-
-
Morris Jette authored
Do not include SLURM_JOB_DERIVED_EC, SLURM_JOB_EXIT_CODE, or SLURM_JOB_EXIT_CODE2 in PrologSlurmctld environment (not available yet). bug 1431
-
- 24 Oct, 2016 17 commits
-
-
Morris Jette authored
-
Yu Watanabe authored
bug 2390
-
Tim Wickberg authored
Many open Coverity issues point back to this. The results of passing a negative value to strerror() are undefined; return a static string instead.
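
The guard described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `safe_strerror` and the wording of the static string are hypothetical and are not taken from the Slurm source.

```c
#include <string.h>

/*
 * Hypothetical helper (not Slurm's actual function): translate an
 * errno-style value into a message, but never hand a negative value
 * to strerror(). Negative inputs get a fixed, static string instead.
 */
static const char *safe_strerror(int errnum)
{
	if (errnum < 0)
		return "Unknown negative error number";
	return strerror(errnum);
}
```

Callers that previously passed an unchecked return code straight to strerror() would call the wrapper instead, so a negative code yields a predictable message.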
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Doug Parisek authored
the pro/epilog of jobs.
-
Tim Wickberg authored
-
Yu Watanabe authored
Even if several commands, e.g. sview, are not built and installed, the corresponding man pages are still installed. The attached patch stops installing man pages when the corresponding commands are not built and installed. bug 2393
-
Morris Jette authored
-
Danny Auble authored
This reverts commit 428347cf. Decided we didn't want a core dump on every fatal, as fatal is used in other programs instead of just the daemons.
-
Jacek Budzowski authored
There is a problem with gathering batch step statistics for jobs that are allocated on more than one node. sstat asks the wrong node for batch step stats: it requests info from the last node in the hostlist, while it should ask the first host in the hostlist (i.e. BatchHost), because the batch step executes only on the first node.

For example, take a job allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat, with the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). Then you can see these lines:

    sstat: debug: slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
    sstat: debug: job step 1234.4294967294 has already completed

The problem lies in the sstat source code. For the batch step, the host is taken from the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached. bug 2975
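
The difference between taking the first and the last host can be illustrated with a small stand-in. This sketch does not use Slurm's hostlist API: `struct mini_hostlist`, `mini_shift` and `mini_pop` are hypothetical names that only mimic the behavior the message attributes to hostlist_shift and hostlist_pop.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a host list; NOT Slurm's hostlist_t. */
struct mini_hostlist {
	const char **hosts;	/* borrowed array of host names */
	size_t first;		/* index of the first remaining host */
	size_t count;		/* number of remaining hosts */
};

/* Remove and return the first host (behavior described for hostlist_shift). */
static const char *mini_shift(struct mini_hostlist *hl)
{
	assert(hl != NULL);
	if (hl->count == 0)
		return NULL;
	hl->count--;
	return hl->hosts[hl->first++];
}

/* Remove and return the last host (behavior described for hostlist_pop). */
static const char *mini_pop(struct mini_hostlist *hl)
{
	assert(hl != NULL);
	if (hl->count == 0)
		return NULL;
	hl->count--;
	return hl->hosts[hl->first + hl->count];
}
```

For a two-node list holding n0001 and n0002, `mini_shift` yields n0001 (the node where the batch step runs), while `mini_pop` yields n0002, which matches the node queried in the debug output above.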
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray: Accept new jobs on backup slurmctld daemon without access to dw_wlm_cli command. No burst buffer actions will take place. Newly submitted jobs will be accepted and stay in pending state. Jobs dependent upon stage-in or stage-out will remain in their current state until the action can take place.
-
Brian Christiansen authored
-
Morris Jette authored
-
Dorian Krause authored
This commit fixes a bug in the multi-prog handling. When running

    salloc -N 2 srun -O --multi-prog mp.conf

where mp.conf reads

    0-192 true

srun crashes can be observed. valgrind reports:

    ==6857== Invalid read of size 4
    ==6857==    at 0x45938D: bit_realloc (bitstring.c:189)
    ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
    ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
    ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
    ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
    ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
    ==6857==    by 0x58E556: _opt_args (opt.c:1927)
    ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
    ==6857==    by 0x591F82: init_srun (srun_job.c:459)
    ==6857==    by 0x427E70: srun (srun.c:193)
    ==6857==    by 0x428E23: main (srun.wrapper.c:17)
    ==6857==  Address 0x5ace440 is 16 bytes inside a block of size 28 free'd
    ==6857==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==6857==    by 0x446886: slurm_xrealloc (xmalloc.c:139)
    ==6857==    by 0x45944C: bit_realloc (bitstring.c:191)
    ==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
    ==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
    ==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
    ==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
    ==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
    ==6857==    by 0x58E556: _opt_args (opt.c:1927)
    ==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
    ==6857==    by 0x591F82: init_srun (srun_job.c:459)
    ==6857==    by 0x427E70: srun (srun.c:193)
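
The trace shows a read from a block that an earlier realloc had already released, which is the usual symptom of a pointer kept across a reallocation that moved the data. The message does not say how the fix was made, so the following is only a general sketch of the safe pattern: the owner's pointer is always replaced with the value realloc returns. `struct mask`, `mask_grow` and `mask_set` are hypothetical names and do not correspond to Slurm's bitstring code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte mask; not Slurm's bitstr_t. */
struct mask {
	unsigned char *bits;	/* heap block, may move when grown */
	size_t nbytes;		/* current size of the block in bytes */
};

/*
 * Grow the mask to at least new_nbytes, zero-filling the new tail.
 * Returns 0 on success, -1 on allocation failure (mask left unchanged).
 */
static int mask_grow(struct mask *m, size_t new_nbytes)
{
	unsigned char *tmp;

	assert(m != NULL);
	if (new_nbytes <= m->nbytes)
		return 0;
	tmp = realloc(m->bits, new_nbytes);
	if (tmp == NULL)
		return -1;
	memset(tmp + m->nbytes, 0, new_nbytes - m->nbytes);
	m->bits = tmp;		/* always adopt the possibly moved block */
	m->nbytes = new_nbytes;
	return 0;
}

/* Set bit number `bit`, growing the mask first if it is too small. */
static int mask_set(struct mask *m, size_t bit)
{
	size_t need = bit / 8 + 1;

	if (mask_grow(m, need) != 0)
		return -1;
	m->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
	return 0;
}
```

Because every access goes through `m->bits` after the grow step, no caller holds a stale copy of the old address, even when a rank such as 192 forces the block to be enlarged.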
-
- 21 Oct, 2016 5 commits
-
-
Morris Jette authored
Do not process SALLOC_HINT, SBATCH_HINT or SLURM_HINT environment variables if any of the following salloc, sbatch or srun command line options are specified: -B, --cpu_bind, --hint, --ntasks-per-core, or --threads-per-core. bug 3118
-
Tim Wickberg authored
-
Morris Jette authored
Without this change, only the error was available, but no identification of the specific plugin that failed.
-
Morris Jette authored
Coverity was complaining that the return value of s_p_get_* was ignored. I added typecasting of the return value to (void) where needed. No change in logic, just making Coverity happy ;)
-
Morris Jette authored
-
- 20 Oct, 2016 5 commits
-
-
Tim Wickberg authored
_select_nodes_parts() was resetting state_reason to an admin hold without regard to admin vs user hold state. state_reason is the only place that user vs. admin is distinguished, so this prevented users from releasing these jobs. Bug introduced by commit fb46c84b in 16.05.5. Bug 3197.
-
Tim Wickberg authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
This is an addition to commit cb7ed937
-
- 19 Oct, 2016 8 commits
-
-
Morris Jette authored
Related to commit 974878cd
-
Morris Jette authored
The kmem limit was added in commit 084c9308
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
-
Ole H Nielsen authored
bug 3191
-
Ole H Nielsen authored
bug 3192
-
Morris Jette authored
-
- 18 Oct, 2016 4 commits
-
-
Martin Perry authored
-
Tim Wickberg authored
Continuation of commit 2fd4d7a6. MySQL dropped support for 'ignore'; unconditionally remove it to avoid issues with mismatching client + server versions like in bug 3189.
-
Dominik Bartkiewicz authored
Improve reported estimates of start and end times for pending jobs. bug 3184
-
Morris Jette authored
Cray: Prevent abort in backfill scheduling logic for requeued job that has been cancelled while NHC is running. bug 3185
-