- 28 Feb, 2018 4 commits
-
-
Tim Wickberg authored
-
Isaac Hartung authored
Also throws spurious errors of: "slurmd: error: Domain socket directory /var/spool/slurmd: No such file or directory" if your SlurmdSpoolDir is located elsewhere. Bug 4289.
-
Dominik Bartkiewicz authored
Add additional protection in slurmctld as well. Bug 4826.
-
Alejandro Sanchez authored
job_limits_check() uses the job desc to call _valid_pn_min_mem(). This second function might adjust the following values (to date): cpus_per_task, pn_min_memory, min_cpus, max_cpus, pn_min_cpus. If the function returns success, these adjusted members need to be copied back to the job_record. It turns out pn_min_cpus wasn't copied back, so the logs claimed to automatically increase pn_min_cpus, but the job record wasn't actually modified and the select plugin tried to allocate the wrong amount of resources. Bug 4823.
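The copy-back step described above can be sketched as follows. This is a minimal Python model, not Slurm's C code: the member names come from the commit message, while `JobFields`, `ADJUSTABLE_MEMBERS`, and `copy_back_adjusted` are hypothetical names used only for illustration. Driving the copy from a single list is one way to make it harder to forget a member such as pn_min_cpus.

```python
from dataclasses import dataclass

# Member names come from the commit message; everything else here is hypothetical.
ADJUSTABLE_MEMBERS = (
    "cpus_per_task", "pn_min_memory", "min_cpus", "max_cpus", "pn_min_cpus",
)


@dataclass
class JobFields:
    """Stand-in for the limit-related members shared by the job desc and job record."""
    cpus_per_task: int = 1
    pn_min_memory: int = 0
    min_cpus: int = 1
    max_cpus: int = 1
    pn_min_cpus: int = 1


def copy_back_adjusted(job_desc: JobFields, job_record: JobFields) -> None:
    """Copy every member the validation step may have adjusted, driven by one list."""
    for name in ADJUSTABLE_MEMBERS:
        setattr(job_record, name, getattr(job_desc, name))
```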
-
- 27 Feb, 2018 1 commit
-
-
Tim Wickberg authored
No longer needed, and will cause errors on FreeBSD systems built with WITHOUT_KERBEROS. Bug 4805.
-
- 23 Feb, 2018 1 commit
-
-
Morris Jette authored
Bug 4783
-
- 22 Feb, 2018 6 commits
-
-
Morris Jette authored
"#DW destroy_persistent" directives available in Cray CLE6.0UP06. This will be supported in Slurm version 18.08. Use "#BB" directives until then. Bug 4302
-
Felip Moll authored
Only a single io_timeout_thread should be created for each sls struct. Creating multiple, while seemingly harmless in operation, can lead to fatal() messages when srun shuts down by destroying mutex locks that are in use by threads that srun does not expect to still be running. Regression caused by a1185f04. Bug 4596
-
Felip Moll authored
This patch fixes a situation in which features are not recognized when a node features plugin is active and features are also defined for nodes in slurm.conf. It also preserves KNL node features, including active and available modes, when slurmctld daemons are reconfigured. Features not belonging to the node features plugin are reset to what is in slurm.conf when restarting or reconfiguring. Bug 4734
-
Alejandro Sanchez authored
_resv_overlap function was only checking the flags for the updated reservation, but not for the other existing reservations. As a result, whether the overlap permitted by these flags was honored depended on the order in which reservations were updated. Bug 4806.
-
Alejandro Sanchez authored
After commit b31fa177, we do not defer slurmd node registration if HealthCheckProgram fails. So at slurmd startup, slurmd executes run_script_health_check(); _spawn_registration_engine(); and does not keep spinning if NHC fails. Now if there are nodes managed by the Power Save logic and they are requested to POWER_UP because a job is allocated resources, then at slurmd startup NHC is executed before the node registers. The problem comes when this NHC execution fails: if the NHC program decides to update the node to DRAIN, the job was already allocated before this update, so it will attempt to start RUNNING but might fail since NHC detected something wrong. This change detects DRAIN/FAIL node update requests, checks whether the node is ALLOC/MIXED and POWER_[SAVE|UP], and if so forces a requeue so that the job doesn't start on a failed node. Bug 4689.
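The requeue condition described above can be written as a small predicate. This is an illustrative Python sketch, not the slurmctld implementation: the state names are taken from the commit message, while the function name `should_force_requeue` and the use of strings and a set for node state are assumptions made for readability.

```python
def should_force_requeue(requested_state: str, node_state: str, power_flags: set) -> bool:
    """Return True when a node update should force a requeue of the allocated job.

    Hypothetical model: states are plain strings and power flags a set of strings.
    """
    update_marks_node_bad = requested_state in {"DRAIN", "FAIL"}
    node_has_job = node_state in {"ALLOC", "MIXED"}
    node_is_power_managed = bool(power_flags & {"POWER_SAVE", "POWER_UP"})
    return update_marks_node_bad and node_has_job and node_is_power_managed
```

All three conditions must hold; a DRAIN request on an allocated node that is not under Power Save control is left alone in this model.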
-
Felip Moll authored
Can frequently throw scary-sounding messages on short-lived processes that disappear while the stats are collected. Bug 4759.
-
- 21 Feb, 2018 7 commits
-
-
Brian Christiansen authored
Bug 4504
-
Brian Christiansen authored
submission option. Bug 4548
-
Brian Christiansen authored
Bug 4548
-
Brian Christiansen authored
Continuation of 873871ed Bug 4758
-
Brian Christiansen authored
Bug 4800
-
Danny Auble authored
Bug 4698.
-
Dominik Bartkiewicz authored
NEWS entry also applies to prior 31667d5d. Bug 4809.
-
- 20 Feb, 2018 6 commits
-
-
Marshall Garey authored
Bug 4636
-
Brian Christiansen authored
Bug 4716
-
Brian Christiansen authored
Bug 4716
-
Felip Moll authored
In perl tools, fix a regexp that caused extra results to be shown incorrectly. Bug 4766
-
Danny Auble authored
requesting any tasks. Bug 4730
-
Morris Jette authored
wait to retry or not. I discovered this bug during regression testing. Some similar situations will result in srun continuously issuing step create requests and the launch_common_create_job_step() function not sleeping between RPCs. Basically launch_common_create_job_step() sleeps for some error codes and srun retries the step create on some error codes. The problem is that those error codes do not match in both places, resulting in constant retries without sleeps. This situation is very likely with job preemption combined with salloc, but other conditions can trigger the same event. The following errno values all trigger this situation: EAGAIN, ESLURM_DISABLED, ESLURM_POWER_NOT_AVAIL, ESLURM_POWER_RESERVED, ESLURM_PROLOG_RUNNING, ESLURM_INTERCONNECT_BUSY. Bug 4786
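One way to avoid this class of mismatch is to let the retry decision and the sleep decision consult the same set of error codes. The sketch below is a Python illustration under that assumption and is not how srun is actually structured: the errno names are those listed in the commit message, while `RETRYABLE`, `create_step_with_retry`, and its parameters are hypothetical.

```python
# Errno names from the commit message; the set and function names are hypothetical.
RETRYABLE = frozenset({
    "EAGAIN",
    "ESLURM_DISABLED",
    "ESLURM_POWER_NOT_AVAIL",
    "ESLURM_POWER_RESERVED",
    "ESLURM_PROLOG_RUNNING",
    "ESLURM_INTERCONNECT_BUSY",
})


def create_step_with_retry(try_create, sleep, max_tries=10, delay_s=1):
    """Call try_create() until it succeeds (returns None) or fails permanently.

    Every retry is preceded by a sleep, because both decisions use RETRYABLE.
    """
    for _ in range(max_tries):
        err = try_create()
        if err is None:
            return True
        if err not in RETRYABLE:
            return False  # permanent failure: neither sleep nor retry
        sleep(delay_s)
    return False
```

In real use `sleep` would be `time.sleep`; passing it in keeps the sketch easy to exercise without waiting.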
-
- 16 Feb, 2018 2 commits
-
-
Brian Christiansen authored
Bug 4772
-
Brian Christiansen authored
Was checking time ranges against a badly cast variable. Also, MaxQueryTimeRange was stored in minutes but compared against the difference of two time_t values, which is in seconds. Bug 4772
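The unit mismatch is easy to see in a small example. This is a Python sketch with hypothetical names (`exceeds_max_query_range`, `max_range_minutes`), assuming timestamps are epoch seconds as with time_t; it is not the slurmdbd code.

```python
def exceeds_max_query_range(start: int, end: int, max_range_minutes: int) -> bool:
    """Compare a query window (epoch seconds) against a limit stored in minutes."""
    max_range_seconds = max_range_minutes * 60  # convert before comparing
    return (end - start) > max_range_seconds
```

With a 60-minute limit, a one-hour query (3600 seconds) is accepted here, whereas comparing 3600 directly against 60 would reject it.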
-
- 15 Feb, 2018 2 commits
-
-
Doug Jacobsen authored
Fixes issue with bf_min_prio_reserve not being respected, which significantly degraded backfill performance. Bug 4760.
-
Morris Jette authored
Log that support for the ChosLoc configuration parameter will end in Slurm version 18.08. Bug 4791
-
- 14 Feb, 2018 1 commit
-
-
Tim Wickberg authored
And would prevent transmission of a file. Bug 4787.
-
- 13 Feb, 2018 4 commits
-
-
Danny Auble authored
Bug 4736 Bug 4784
-
Morris Jette authored
Partial write can happen under high system load, leading to step termination when finishing the write would let the step launch properly instead. Fix suggested by Matthieu Hautreux. Bug 4778.
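Handling a partial write generally means looping until every byte has been written rather than treating a short count as failure. The following is a generic Python sketch of that pattern, not the Slurm patch itself: `write_all` and its injectable `write` parameter are hypothetical and exist only to make the behaviour easy to demonstrate.

```python
import os


def write_all(fd, data, write=os.write):
    """Write all of data to fd, continuing after short (partial) writes."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        written = write(fd, view[total:])  # may write fewer bytes than requested
        if written == 0:
            raise OSError("write made no progress")
        total += written
    return total
```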
-
Felip Moll authored
Add a privilege check for when an unprivileged user tries to modify a resource. Min level set to Operator. Bug 4735.
-
Felip Moll authored
Bug 4747.
-
- 12 Feb, 2018 1 commit
-
-
Felip Moll authored
Fixes some issues around differences in lua package naming. Bug 4568.
-
- 08 Feb, 2018 1 commit
-
-
Dominik Bartkiewicz authored
Bug 4709.
-
- 07 Feb, 2018 4 commits
-
-
Alejandro Sanchez authored
Previously it was taking the MIN, without respecting the order. Also add a note to the resource_limits.html page to clarify the exception for Max[Wall|Time] and/or [Max|Min]Nodes limits, where by default the partition limit takes precedence unless the respective job's QOS flags Partition[Min|Max|Time]Limit are set. Bug 4681.
-
Danny Auble authored
This prevents a hard-to-diagnose issue where slurmstepd may fail to start due to a missing library. This now ensures slurmd will fail, and keep the node down until the library issue can be fixed. Bug 4645, 4644.
-
Danny Auble authored
fatal() calls exit(1), which precludes getting a backtrace. That's fine for configuration issues and other types of problems, but when hitting "impossible" edge cases a core dump may be the only way to isolate the issue. Adding to 17.11 so we can easily provide diagnostic patches without needing users to back-port this implementation. Further use will come in 18.08. Bug 4599.
-
Tim Wickberg authored
-