- 25 Oct, 2016 1 commit
-
-
Tim Wickberg authored
task/cray's _get_numa_nodes() function needs to run before task/cgroup cleans up the cgroup hierarchies, otherwise ALPS memory compaction will never run. Also move task_p_add_pid() outside the #ifdef HAVE_NATIVE_CRAY block so that the plugin will load (albeit without any functionality) on non-Cray systems for testing purposes. Revise documentation and provided slurm.conf templates as well. Bug 3154.
-
- 24 Oct, 2016 2 commits
-
-
Jacek Budzowski authored
There is a problem with gathering batch step statistics for jobs which are allocated on more than one node. sstat asks the wrong node for batch step stats: it requests the information from the last node in the hostlist, while it should ask the first host in the hostlist (i.e. BatchHost), because the batch step actually executes only on the first node.

For example, take a job allocated on nodes n000[1-2] with BatchHost=n0001. You should be able to check its statistics by running sstat, with the -vv switch for more verbose output (e.g. sstat -j 1234.batch -vv). Instead you see these lines:

sstat: debug: slurm_job_step_stat: getting pid information of job 1234.4294967294 on nodes n0002
sstat: debug: job step 1234.4294967294 has already completed

The problem lies in the sstat source code. For the batch step, the host is taken from the hostlist with the hostlist_pop function, which returns the last host from the given hostlist. This should be replaced with the hostlist_shift function, which returns the first host from the given hostlist. Patch attached. bug 2975
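The difference between taking the last and the first host can be sketched with a toy list. This is only an illustration under stated assumptions: struct toy_hostlist, toy_pop() and toy_shift() are hypothetical stand-ins written for this example, not Slurm's hostlist API, and they only mimic the "return last" versus "return first" behaviour described above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a host list: an array plus a [head, tail) window.
 * Hypothetical type for illustration only, NOT Slurm's hostlist_t. */
struct toy_hostlist {
    const char **hosts;
    size_t head;   /* index of the first remaining host */
    size_t tail;   /* one past the last remaining host */
};

/* Remove and return the LAST remaining host, or NULL if the list is empty.
 * This mirrors the pop-style behaviour that picked the wrong node. */
static const char *toy_pop(struct toy_hostlist *hl)
{
    assert(hl != NULL);
    if (hl->head == hl->tail)
        return NULL;
    return hl->hosts[--hl->tail];
}

/* Remove and return the FIRST remaining host, or NULL if the list is empty.
 * This mirrors the shift-style behaviour that yields the batch host. */
static const char *toy_shift(struct toy_hostlist *hl)
{
    assert(hl != NULL);
    if (hl->head == hl->tail)
        return NULL;
    return hl->hosts[hl->head++];
}
```

With the two-node list from the example, toy_pop() yields n0002 while toy_shift() yields n0001, the node on which the batch step runs.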
-
Dorian Krause authored
This commit fixes a bug in the multi-prog handling. When running

salloc -N 2 srun -O --multi-prog mp.conf

where mp.conf reads

0-192 true

srun crashes can be observed. valgrind reports:

==6857== Invalid read of size 4
==6857==    at 0x45938D: bit_realloc (bitstring.c:189)
==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
==6857==    by 0x58E556: _opt_args (opt.c:1927)
==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
==6857==    by 0x591F82: init_srun (srun_job.c:459)
==6857==    by 0x427E70: srun (srun.c:193)
==6857==    by 0x428E23: main (srun.wrapper.c:17)
==6857== Address 0x5ace440 is 16 bytes inside a block of size 28 free'd
==6857==    at 0x4C2BB4A: realloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==6857==    by 0x446886: slurm_xrealloc (xmalloc.c:139)
==6857==    by 0x45944C: bit_realloc (bitstring.c:191)
==6857==    by 0x5977A9: _update_task_mask (multi_prog.c:335)
==6857==    by 0x597A5E: _validate_ranks (multi_prog.c:403)
==6857==    by 0x597D1E: verify_multi_name (multi_prog.c:469)
==6857==    by 0x6E7B4BE: launch_p_handle_multi_prog_verify (launch_slurm.c:453)
==6857==    by 0x58A25D: launch_g_handle_multi_prog_verify (launch.c:493)
==6857==    by 0x58E556: _opt_args (opt.c:1927)
==6857==    by 0x58A3B9: initialize_and_process_args (opt.c:270)
==6857==    by 0x591F82: init_srun (srun_job.c:459)
==6857==    by 0x427E70: srun (srun.c:193)
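The trace shows a read from a block that an earlier realloc had already released, which is the typical symptom of a pointer that keeps being used after the buffer it referred to was moved. The following is a minimal sketch of the general defensive pattern, assuming that reading of the trace; toy_grow() is a hypothetical helper written for this example and is neither Slurm's bit_realloc() nor the actual fix in this commit.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for illustration only (not Slurm code).
 * Grow *buf from old_len to new_len bytes and zero the newly added tail.
 * Returns 0 on success, -1 if the allocation fails. */
static int toy_grow(unsigned char **buf, size_t old_len, size_t new_len)
{
    unsigned char *tmp;

    assert(buf != NULL && new_len > 0 && new_len >= old_len);

    tmp = realloc(*buf, new_len);
    if (tmp == NULL)
        return -1;              /* *buf is still valid and unchanged */

    memset(tmp + old_len, 0, new_len - old_len);

    /* realloc may have moved the block: publish the new address through
     * the caller's pointer so the stale address is never used again. */
    *buf = tmp;
    return 0;
}
```

Passing the address of the caller's pointer means the caller cannot accidentally keep the old value after the buffer has grown.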
-
- 20 Oct, 2016 4 commits
-
-
Tim Wickberg authored
_select_nodes_parts() was resetting state_reason to an admin hold without regard to admin vs user hold state. state_reason is the only place that user vs. admin is distinguished, so this prevented users from releasing these jobs. Bug introduced by commit fb46c84b in 16.05.5. Bug 3197.
-
Tim Wickberg authored
-
Danny Auble authored
-
Danny Auble authored
This is an addition to commit cb7ed937
-
- 19 Oct, 2016 1 commit
-
-
Ole H Nielsen authored
bug 3191
-
- 18 Oct, 2016 2 commits
-
-
Dominik Bartkiewicz authored
Improve reported estimates of start and end times for pending jobs. bug 3184
-
Morris Jette authored
Cray: Prevent abort in backfill scheduling logic for requeued job that has been cancelled while NHC is running. bug 3185
-
- 17 Oct, 2016 4 commits
-
-
Morris Jette authored
Modify DataWarp example to use an environment variable rather than an absolute path
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
new glibc 2.24+ that deprecates readdir_r.
-
- 15 Oct, 2016 1 commit
-
-
Morris Jette authored
-
- 14 Oct, 2016 4 commits
-
-
Tim Wickberg authored
This reverts commit 8bc8e7e6.
-
Morris Jette authored
The slurm_strcasestr() which existed in v16.05 was removed in v17.02.
-
Morris Jette authored
Found by Coverity
-
Morris Jette authored
Fix for possibly treating a negative number as a positive. Problem reported by Coverity.
-
- 13 Oct, 2016 8 commits
-
-
Morris Jette authored
-
Morris Jette authored
Change the default syscfg path to /usr/bin/syscfg. Don't load the plugin if ResumeProgram is configured. Update documentation on web page.
-
Morris Jette authored
-
Morris Jette authored
This applies only to knl_generic plugin. The original node active features should be the same as the features field at startup. The logic was assuming the original value was NULL.
-
Morris Jette authored
This was a new function added to support knl_generic and it was originally assigned the wrong name.
-
Morris Jette authored
Added node_features/knl_generic plugin for KNL support on non-Cray systems. NOTE: This plugin is still under development.
-
Morris Jette authored
Do not propagate SLURM_UMASK environment variable to batch script. bug 2609
-
Bjørn-Helge Mevik authored
Correct a bitmap test function (used only by the select/bluegene plugin). The effect of this bug is probably very limited as it will in almost all cases revert prematurely to a bit-by-bit test rather than using a full-word test. bug 3145
-
- 12 Oct, 2016 11 commits
-
-
Tim Wickberg authored
Cannot use ClusterName without reading a config file that may not exist. Bug 3026.
-
Tim Wickberg authored
This introduced an inadvertent dependency on the config file, which does not exist when setting up a new cluster. Bug 3026. This reverts commit c39f9ac9.
-
Tim Wickberg authored
-
Morris Jette authored
task/affinity plugin: Honor a job's --ntasks-per-socket and --ntasks-per-core options in task binding. bug 3118
-
Pär Lindfors authored
-
Brian Christiansen authored
Changed in df70b651
-
Brian Gilmer authored
-
Morris Jette authored
Preserve non-KNL node features when updating the KNL node features for a multi-node job in which the non-KNL node features vary by node.
-
Morris Jette authored
node_features/knl_cray plugin: If the reconfiguration of nodes for an interactive job fails, kill the job (it can't be requeued like a batch job).
-
Morris Jette authored
Execute "capmc node_status" at more frequent intervals to handle nodes getting added or removed from the system using Cray tools (i.e. try to keep Slurm and Cray software better synchronized).
-
Morris Jette authored
node_features/knl_cray plugin: Add separate thread to interact with capmc in response to unexpected node reboots. bug 3153
-
- 11 Oct, 2016 2 commits
-
-
Alejandro Sanchez authored
bug 3091
-
Morris Jette authored
-