- 16 Aug, 2017 2 commits
-
-
Morris Jette authored
Set SLURM_NTASKS environment variable to reflect global task count (needed by MPI).
-
Morris Jette authored
Set SLURM_PROCID environment variable to reflect global task rank (needed by MPI).
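As an illustration of how a launched task might consume these two variables, here is a minimal sketch. The helper name `task_identity` and the single-task fallback used when the variables are absent are assumptions made for this example, not something the commits define.

```python
import os


def task_identity(env=None):
    """Return (rank, ntasks) read from SLURM_PROCID and SLURM_NTASKS.

    Falls back to (0, 1) when the variables are missing; that fallback
    is an assumption of this sketch, not Slurm behavior.
    """
    env = os.environ if env is None else env
    rank = int(env.get("SLURM_PROCID", "0"))
    ntasks = int(env.get("SLURM_NTASKS", "1"))
    # A global rank must fall inside the global task count.
    if not 0 <= rank < ntasks:
        raise ValueError(f"rank {rank} outside 0..{ntasks - 1}")
    return rank, ntasks
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise outside a job.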
-
- 15 Aug, 2017 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
bug 3217
-
Morris Jette authored
-
Morris Jette authored
If srun lacks application specification for some component, the next one specified will be used for earlier components.
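The fill rule described here can be sketched as a backward pass over the component list. This is only an illustration: the function name `fill_missing_apps` and the use of `None` for "no application given" are hypothetical, and leaving trailing gaps unset is an assumption, since the commit message does not say what happens when no later component names an application.

```python
def fill_missing_apps(apps):
    """Give each component without an application the next one specified.

    `apps` is a list with one entry per component; None marks a missing
    application. Entries with no later application stay None (assumed).
    """
    filled = list(apps)
    next_app = None
    # Walk from the last component to the first so that each gap picks
    # up the nearest application specified after it.
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_app
        else:
            next_app = filled[i]
    return filled
```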
-
- 14 Aug, 2017 3 commits
-
-
Morris Jette authored
-
Danny Auble authored
This reverts commit 00a691b9.
-
Morris Jette authored
-
- 12 Aug, 2017 1 commit
-
-
Morris Jette authored
Modify scontrol job hold/release and update to operate with heterogeneous job id specification (e.g. "scontrol hold 123+4").
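A rough sketch of how an ID such as "123+4" could be split into its job ID and component offset follows. The function name `parse_het_job_id` and the choice to return `None` for a missing offset are assumptions for illustration; this is not scontrol's actual parser.

```python
import re

# "<job id>" optionally followed by "+<component offset>".
_HET_ID = re.compile(r"(\d+)(?:\+(\d+))?")


def parse_het_job_id(spec):
    """Split "123+4" into (123, 4); a plain "123" yields (123, None)."""
    match = _HET_ID.fullmatch(spec)
    if match is None:
        raise ValueError(f"invalid job id specification: {spec!r}")
    job_id = int(match.group(1))
    offset = int(match.group(2)) if match.group(2) is not None else None
    return job_id, offset
```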
-
- 11 Aug, 2017 5 commits
-
-
Alejandro Sanchez authored
Fix sview to avoid messages to stderr when modifying a block, partition, or reservation. bug 3217
-
Danny Auble authored
This will allow Dell's custom syscfg to work correctly. NOTE: Dell calls flat memory just "memory". Bug 4034
-
Morris Jette authored
Doing so would break the current scheduling logic.
-
Danny Auble authored
Bug 4059
-
Dominik Bartkiewicz authored
-
- 10 Aug, 2017 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
- 07 Aug, 2017 2 commits
-
-
Danny Auble authored
-
Dominik Bartkiewicz authored
Bug 4019
-
- 04 Aug, 2017 6 commits
-
-
Morris Jette authored
truncation of core specification and not reserving the specified cores. Fixes Coverity CID 45174 and 45175 Bug 4053
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
the tree. Bug 4050
-
Morris Jette authored
Modify launch/slurm plugin to signal all components of a pack job rather than just the one (modify to use a list of step context records).
-
Morris Jette authored
If prolog is running when attempting to signal a step, then return EAGAIN and retry rather than simply returning SLURM_ERROR and aborting.
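The retry-on-EAGAIN pattern can be sketched as below. The callable `send_signal`, the retry count, and the delay are all hypothetical stand-ins; the commit message does not state how many times or how often the real code retries.

```python
import errno
import time


def signal_with_retry(send_signal, max_tries=5, delay=0.0):
    """Call send_signal() until it returns something other than EAGAIN.

    `send_signal` is a hypothetical callable returning 0 on success or
    an errno value on failure. EAGAIN means "busy, try again later".
    """
    for _ in range(max_tries):
        rc = send_signal()
        if rc != errno.EAGAIN:
            return rc  # success or a hard error: stop retrying
        time.sleep(delay)
    # Still busy after all attempts: report that to the caller.
    return errno.EAGAIN
```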
-
- 03 Aug, 2017 1 commit
-
-
Morris Jette authored
Fix I/O race condition on step termination for srun launching multiple pack job groups. Without this change application output might be lost and/or the srun command might hang after some tasks exit.
-
- 02 Aug, 2017 4 commits
-
-
Marshall Garey authored
Would fail when trying to create the clustername file because the StateSaveLocation path didn't exist yet. Bug 3988
-
Marshall Garey authored
srun jobs that could start immediately and requested multiple partitions didn't run in the highest priority partition if that partition wasn't listed first. Because the job's partition list now gets sorted by priority, scontrol show jobs may show the partition list in priority order. Bug 4015
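Ordering a job's requested partitions by priority can be sketched as a stable sort. The function name, the `priority` mapping, and the convention that a larger number means higher priority are assumptions made for this example.

```python
def sort_partitions_by_priority(requested, priority):
    """Return the requested partition names, highest priority first.

    `priority` maps partition name to a number; larger is assumed to
    mean higher priority. The sort is stable, so partitions with equal
    priority keep the order in which they were requested.
    """
    return sorted(requested, key=lambda name: priority[name], reverse=True)
```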
-
Tim Wickberg authored
Bug 3956.
-
Morris Jette authored
Add pack_job_id and pack_job_offset to accounting database. Modified sacct to accept pack job ID specification using "#+#" notation. Modified sstat to accept pack job ID specification using "#+#" notation.
-
- 01 Aug, 2017 3 commits
-
-
Tim Shaw authored
Bug 3999
-
Tim Shaw authored
Default to 1, unless set to 0. Allow it to be set to 0 even if GroupUpdateTime was not set before. Move down to alphabetical position in read_config.c as well. Bug 3956.
-
Dominik Bartkiewicz authored
Fix bug in selection of GRES bound to specific CPUs where the GRES count is 2 or more. Previous logic could allocate CPUs not available to the job. bug 4029
-
- 31 Jul, 2017 1 commit
-
-
Tim Shaw authored
This will be fixed before 17.11, but is being left as-is on 17.02. Bug 3956.
-
- 28 Jul, 2017 3 commits
-
-
Danny Auble authored
connection. Bug 4009
-
Alejandro Sanchez authored
jobcomp/elasticsearch saves and loads its state to and from elasticsearch_state. Since the jobcomp API isn't designed with save/load state operations, the plugin's _save_state() isn't extern and isn't available outside the plugin itself, so it is tightly coupled to the fini() function. This state doesn't follow the same execution path as the rest of the Slurm state, which is all independently scheduled in save_all_state(). So we save it manually here on an RPC of type REQUEST_CONTROL. As a result, when the Primary ctld issues a REQUEST_CONTROL to the Backup, which is currently in controller mode, the Backup saves the state, and when the Primary assumes control again it can process the saved pending jobs. The other direction was already handled: when the Primary is running in controller mode and the Backup issues a REQUEST_CONTROL, the Primary is shut down, and on breaking out of the ctld main() function's while(1) loop there was already a g_slurm_jobcomp_fini() call in place. Bug 3908
-
Morris Jette authored
Perform limit check on heterogeneous job as a whole at submit time to reject jobs that will never be able to run. Accepting pack jobs that can never start will have a significant effect on scheduling in general (blocking the queue).
-
- 27 Jul, 2017 1 commit
-
-
Alejandro Sanchez authored
When more than one ping cycle is spawned simultaneously (for instance REQUEST_PING + REQUEST_NODE_REGISTRATION_STATUS for the selected nodes), we do not track a separate ping_start time for each cycle. When ping_begin() is called, the information about the previous ping cycle is lost. Then when ping_end() is called for the first of the two cycles, we set ping_start=0, which is incorrectly used to see if the last cycle ran for more than PING_TIMEOUT seconds (100s), thus incorrectly triggering the message: error("Node ping apparently hung, many nodes may be DOWN or configured " "SlurmdTimeout should be increased"); Bug 3914
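One way to avoid this kind of false alarm is to count the ping cycles in flight and only clear the start time when the last one ends. The sketch below illustrates that idea under stated assumptions: the `PingTracker` class and its method names are hypothetical, and this is not necessarily how the commit itself fixes the problem.

```python
PING_TIMEOUT = 100  # seconds, the value quoted in the commit message


class PingTracker:
    """Track overlapping ping cycles with a counter (illustrative only)."""

    def __init__(self):
        self.active = 0   # number of ping cycles currently in flight
        self.start = 0.0  # start time of the most recent cycle

    def begin(self, now):
        self.active += 1
        self.start = now

    def end(self):
        if self.active > 0:
            self.active -= 1
        # Only forget the start time once no cycle is left running.
        if self.active == 0:
            self.start = 0.0

    def hung(self, now):
        # With no cycle in flight there is nothing that can be hung.
        return self.active > 0 and (now - self.start) > PING_TIMEOUT
```

With two overlapping cycles, ending the first one leaves the start time of the second intact, so the elapsed-time check is never made against a zeroed timestamp.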
-
- 26 Jul, 2017 2 commits
-
-
Danny Auble authored
-
Isaac Hartung authored
-- Add slurm.conf configuration parameters SlurmctldSyslogDebug and SlurmdSyslogDebug to control which messages from the slurmctld and slurmd daemons get written to syslog.
-- Add slurmdbd.conf configuration parameter DebugLevelSyslog to control which messages from the slurmdbd daemon get written to syslog.
bug 3933
-