- 01 Nov, 2016 2 commits
-
-
Joseph Mingrone authored
Add the POLLRDHUP hack used elsewhere to work around non-standard flag use. Bug 3227.
-
Morris Jette authored
cray/burst_buffer - If total_space in a pool decreases, reset used_space rather than trying to account for buffer allocations in progress. bug 3222
-
- 28 Oct, 2016 1 commit
-
-
Danny Auble authored
more time than should be allowed would be accounted for. This only happened on jobs in the completing state when the slurmctld was shut down. This will also be enhanced in 17.02, as the job's end_time_exp is not stored, which is needed to determine if the job has already been through the decay_thread at end of job. Bug 3162
-
- 27 Oct, 2016 8 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
issue with gang scheduling. Bug 3211
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Morris Jette authored
This option specifies minimum characteristics of the compute nodes which should be considered for use, not the resource allocation size. bug 3118
-
Alejandro Sanchez authored
Create separate check_hosts_contiguous procedure in globals and use it for both test1.83 and test15.21. Bug 3006.
-
Morris Jette authored
-
- 26 Oct, 2016 7 commits
-
-
Morris Jette authored
Fix bug that was clearing MAINT mode on nodes scheduled for reboot (bug introduced in version 16.05.5 to address bug in overlapping reservations, commit 5eee1d28). Note that a node's MAINT flag is used for both a requested reboot and a maintenance reservation. What I'd like to do is add a new node state flag to differentiate between these two cases, but that involves some significant changes that could introduce instability, so it will be deferred to version 17.02. bug 3210
-
Morris Jette authored
Correct/expand description of NODE_STATE_FLAG
-
Danny Auble authored
-
Danny Auble authored
requested with -n tasks < hosts from -w hostlist.
-
Dominik Bartkiewicz authored
Bug 3171.
-
Morris Jette authored
bug 2149
-
Morris Jette authored
In addition to "show <ENTITY> <ID>", Slurm also accepts "show <ENTITY>=<ID>". bug 2162
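The two accepted spellings can be handled by a single parser. The following is a minimal sketch only, assuming the arguments after "show" arrive as an argv-style array; parse_show_target and its parameters are hypothetical names for illustration and are not taken from the scontrol source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not Slurm code): accept either
 *   {"job", "1234"}   i.e. show <ENTITY> <ID>
 *   {"job=1234"}      i.e. show <ENTITY>=<ID>
 * Copies the entity name into 'entity' and points '*id' at the ID
 * (NULL if no ID was given). Returns 0 on success, -1 on bad input.
 */
static int parse_show_target(int argc, const char *const *argv,
                             char *entity, size_t entity_sz,
                             const char **id)
{
    const char *eq;
    size_t len;

    if (argc < 1 || argv[0][0] == '\0')
        return -1;

    eq = strchr(argv[0], '=');
    if (eq) {
        /* <ENTITY>=<ID> form: split at the first '=' */
        len = (size_t)(eq - argv[0]);
        *id = eq + 1;
    } else {
        /* <ENTITY> <ID> form: the ID is the next argument, if any */
        len = strlen(argv[0]);
        *id = (argc > 1) ? argv[1] : NULL;
    }

    if (len == 0 || len >= entity_sz)
        return -1;
    memcpy(entity, argv[0], len);
    entity[len] = '\0';
    return 0;
}
```

Both forms yield the same entity/ID pair, so the rest of the command can stay unaware of which spelling the user typed.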
-
- 25 Oct, 2016 4 commits
-
-
Dominik Bartkiewicz authored
Bug 3194
-
Morris Jette authored
-
Morris Jette authored
Document that node Weight takes precedence over load with LLN scheduling. bug 3204
-
Tim Wickberg authored
task/cray's _get_numa_nodes() function needs to run before task/cgroup cleans up the cgroup hierarchies, otherwise ALPS memory compaction will never run. Also move task_p_add_pid() outside the #ifdef HAVE_NATIVE_CRAY block so that the plugin will load (albeit without any functionality) on non-Cray systems for testing purposes. Revise documentation and provided slurm.conf templates as well. Bug 3154.
-
- 24 Oct, 2016 2 commits
-
-
Jacek Budzowski authored
There is a problem with gathering batch step statistics for jobs that are allocated on more than one node: sstat asks the wrong node for the batch step stats. It requests the information from the last node in the hostlist, while it should ask the first host in the hostlist (i.e. BatchHost), because the batch step actually executes only on the first node.

For example, take a job allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat, optionally with the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). You then see the lines:

sstat: debug: slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
sstat: debug: job step 1234.4294967294 has already completed

The problem lies in the sstat source code. For the batch step, the host is taken from the hostlist with the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached. bug 2975
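The difference between taking the last and the first host can be sketched as follows. This is an illustration only: struct host_list and the three helpers are hypothetical stand-ins, not Slurm's hostlist_t API, and unlike a real pop or shift they do not remove the element from the list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a job's host list; not Slurm's hostlist_t. */
struct host_list {
    const char **hosts;
    size_t count;
};

/* Analogue of a "pop": yields the LAST host, or NULL if the list is empty. */
static const char *host_list_last(const struct host_list *hl)
{
    return hl->count ? hl->hosts[hl->count - 1] : NULL;
}

/* Analogue of a "shift": yields the FIRST host, or NULL if the list is empty. */
static const char *host_list_first(const struct host_list *hl)
{
    return hl->count ? hl->hosts[0] : NULL;
}

/*
 * The batch step runs only on the first node of the allocation,
 * so that is the node to query for its statistics.
 */
static const char *batch_step_host(const struct host_list *hl)
{
    return host_list_first(hl);
}
```

For an allocation on n0001 and n0002, batch_step_host() selects n0001, whereas host_list_last() would select n0002, the node that reports the step as already completed.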
-
Dorian Krause authored
This commit fixes a bug in the multi-prog handling. When running

salloc -N 2 srun -O --multi-prog mp.conf

where mp.conf reads

0-192 true

srun crashes can be observed. valgrind reports:

==6857== Invalid read of size 4
==6857==    at 0x45938D: bit_realloc (bitstring.c:189)
==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
==6857==    by 0x58E556: _opt_args (opt.c:1927)
==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
==6857==    by 0x591F82: init_srun (srun_job.c:459)
==6857==    by 0x427E70: srun (srun.c:193)
==6857==    by 0x428E23: main (srun.wrapper.c:17)
==6857==  Address 0x5ace440 is 16 bytes inside a block of size 28 free'd
==6857==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6857==    by 0x446886: slurm_xrealloc (xmalloc.c:139)
==6857==    by 0x45944C: bit_realloc (bitstring.c:191)
==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
==6857==    by 0x58E556: _opt_args (opt.c:1927)
==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
==6857==    by 0x591F82: init_srun (srun_job.c:459)
==6857==    by 0x427E70: srun (srun.c:193)
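The trace is consistent with a read through a pointer that still refers to a block realloc has already released. The actual change to multi_prog.c is not shown in this log, so the following is only a minimal sketch of the general safe pattern, assuming a simple byte-array bitmask; struct task_mask and its helpers are hypothetical and are not Slurm's bitstring API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable bitmask; not Slurm's bitstr_t. */
struct task_mask {
    uint8_t *bits;   /* heap storage, may move when grown */
    size_t nbits;    /* number of valid bits */
};

/* Grow the mask to hold new_nbits bits. Returns 0 on success, -1 on failure. */
static int task_mask_grow(struct task_mask *m, size_t new_nbits)
{
    size_t old_bytes = (m->nbits + 7) / 8;
    size_t new_bytes = (new_nbits + 7) / 8;
    uint8_t *p;

    if (new_nbits <= m->nbits)
        return 0;

    /* Read everything needed from the old block BEFORE calling realloc. */
    p = realloc(m->bits, new_bytes);
    if (!p)
        return -1;          /* old block is still valid on failure */

    /* From here on, only the returned pointer may be used. */
    memset(p + old_bytes, 0, new_bytes - old_bytes);
    m->bits = p;
    m->nbits = new_nbits;
    return 0;
}

/* Set one bit, growing the mask on demand. */
static int task_mask_set(struct task_mask *m, size_t bit)
{
    if (bit >= m->nbits && task_mask_grow(m, bit + 1) != 0)
        return -1;
    m->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    return 0;
}

/* Test one bit; bits beyond the current size read as unset. */
static int task_mask_test(const struct task_mask *m, size_t bit)
{
    return bit < m->nbits && ((m->bits[bit / 8] >> (bit % 8)) & 1);
}
```

The key point is that the owner's pointer is updated from realloc's return value and no copy of the old pointer is dereferenced afterwards.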
-
- 20 Oct, 2016 4 commits
-
-
Tim Wickberg authored
_select_nodes_parts() was resetting state_reason to an admin hold without regard to admin vs user hold state. state_reason is the only place that user vs. admin is distinguished, so this prevented users from releasing these jobs. Bug introduced by commit fb46c84b in 16.05.5. Bug 3197.
-
Tim Wickberg authored
-
Danny Auble authored
-
Danny Auble authored
This is an addition to commit cb7ed937
-
- 19 Oct, 2016 1 commit
-
-
Ole H Nielsen authored
bug 3191
-
- 18 Oct, 2016 2 commits
-
-
Dominik Bartkiewicz authored
Improve reported estimates of start and end times for pending jobs. bug 3184
-
Morris Jette authored
Cray: Prevent abort in backfill scheduling logic for requeued job that has been cancelled while NHC is running. bug 3185
-
- 17 Oct, 2016 4 commits
-
-
Morris Jette authored
Modify DataWarp example to use an environment variable rather than absolute path
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
new glibc 2.24+ that deprecates readdir_r.
-
- 15 Oct, 2016 1 commit
-
-
Morris Jette authored
-
- 14 Oct, 2016 4 commits
-
-
Tim Wickberg authored
This reverts commit 8bc8e7e6.
-
Morris Jette authored
The slurm_strcasestr() which existed in v16.05 was removed in v17.02.
-
Morris Jette authored
Found by Coverity
-
Morris Jette authored
Fix for possibly treating a negative number as a positive. Problem reported by Coverity.
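The log does not say which code was affected, so the following is only a generic sketch of this class of defect: a signed value converted to an unsigned type without a range check, so that a negative input becomes a large positive one. The helper name to_u32_checked is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a signed value to uint32_t only if it
 * fits. Returns 0 and stores the result on success, -1 if the value is
 * negative or too large. Without the check, (uint32_t)-1 would silently
 * become 4294967295.
 */
static int to_u32_checked(int64_t value, uint32_t *out)
{
    if (value < 0 || value > (int64_t)UINT32_MAX)
        return -1;
    *out = (uint32_t)value;
    return 0;
}
```

Callers can then treat a -1 return as a parse or validation error instead of acting on a wrapped value.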
-