- 29 Aug, 2013 3 commits
-
-
Morris Jette authored
The Cray version of those libraries must be used. bug 407
-
Morris Jette authored
switch/generic - propagate switch information from srun down to slurmd and slurmstepd. Previously the information was not going past the srun command.
-
Morris Jette authored
-
- 28 Aug, 2013 2 commits
-
-
Morris Jette authored
due to multiple free calls caused by job arrays submitted to multiple partitions. The root cause is the job priority array of the original job being re-used by the subsequent job array entries. A similar problem that could be induced by the user specifying a job accounting frequency when submitting a job array is also fixed. bug 401
-
Danny Auble authored
sacctmgr.
-
- 27 Aug, 2013 4 commits
-
-
Morris Jette authored
If reservation create request included a CoreCnt value and more nodes are required than configured, the logic in select/cons_res could go off the end of the core_cnt array. This patch adds a check for a zero value in the core_cnt array, which terminates the user-specified array. Back-port from master of commit 211c224b
-
Morris Jette authored
-
Morris Jette authored
If reservation create request included a CoreCnt value and more nodes are required than configured, the logic in select/cons_res could go off the end of the core_cnt array. This patch adds a check for a zero value in the core_cnt array, which terminates the user-specified array.
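The bounds check described in this entry can be sketched as follows. This is a minimal illustration, not the actual select/cons_res code: the function name `total_requested_cores` and its parameters are hypothetical, and it assumes only what the message states, namely that a zero entry terminates the user-specified `core_cnt` array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not Slurm's real code): sum the cores requested
 * for up to node_cnt nodes from a zero-terminated core_cnt array.
 * The loop stops at the first zero entry, so it never reads past the
 * user-specified values even when node_cnt is larger than the number
 * of entries the user supplied.
 */
uint32_t total_requested_cores(const uint32_t *core_cnt, size_t node_cnt)
{
	uint32_t total = 0;

	if (core_cnt == NULL)
		return 0;

	for (size_t i = 0; i < node_cnt; i++) {
		if (core_cnt[i] == 0)	/* zero marks the end of the array */
			break;
		total += core_cnt[i];
	}
	return total;
}
```

With an array such as `{4, 2, 0}` and a node count of 5, the loop stops at the third element instead of reading two entries past the terminator.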
-
Danny Auble authored
the node health check script for steps and allocations respectively.
-
- 26 Aug, 2013 1 commit
-
-
Morris Jette authored
Used job terminations due to failure to boot its allocated nodes or BlueGene block. bug 213
-
- 24 Aug, 2013 1 commit
-
-
Danny Auble authored
-
- 23 Aug, 2013 1 commit
-
-
Morris Jette authored
This corrects a bug introduced in commit https://github.com/SchedMD/slurm/commit/ac44db862c8d1f460e55ad09017d058942ff6499. That commit eliminated the need for squeue to read the node state information, for performance reasons (mostly for large parallel systems in which the Prolog ran squeue, generating many simultaneous RPCs and slowing down the job launch process). It also assumed 1 CPU per node. If a pending job specified a node count of 1 and a task count larger than one, squeue reported the job's node count as equal to its task count. This patch moves the calculation of a pending job's minimum node count into slurmctld, so squeue still does not need to read the node information but can report the correct node count for pending jobs with minimal overhead.
-
- 22 Aug, 2013 5 commits
-
-
Danny Auble authored
to avoid it thinking we don't have a cluster name.
-
Nathan Yee authored
-
Danny Auble authored
to avoid it thinking we don't have a cluster name.
-
Nathan Yee authored
%o and %Z respectively
-
Danny Auble authored
-
- 21 Aug, 2013 1 commit
-
-
Hongjia Cao authored
If there are completing jobs, a reconfigure will set the wrong job/node state: all nodes of the completing job will be set to allocated, and the job will not be removed even after the completing nodes are released. The state can only be restored by restarting slurmctld after the completing nodes are released.
-
- 20 Aug, 2013 2 commits
-
-
Danny Auble authored
-
Morris Jette authored
Added boards_per_node, sockets_per_board, ntasks_per_node, ntasks_per_board, ntasks_per_socket, ntasks_per_core, and nice.
-
- 19 Aug, 2013 5 commits
-
-
Chris Read authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 17 Aug, 2013 1 commit
-
-
Morris Jette authored
-
- 16 Aug, 2013 2 commits
-
-
Morris Jette authored
This makes it consistent with the value of default_queue_depth. The backfill scheduler should easily handle this value, or a much higher one, for pretty much any configuration.
-
Danny Auble authored
-
- 15 Aug, 2013 5 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
could end up before the job started. Bug 371
-
Morris Jette authored
This function can now be called to test for processes which are dumping core, in order to avoid sending them a SIGKILL until the dump completes. Change in logic required for job_container/cray.
-
Danny Auble authored
-
- 14 Aug, 2013 5 commits
-
-
Danny Auble authored
plugins more maintainable.
-
Morris Jette authored
This avoids waiting for the job's initiation to fail.
-
Morris Jette authored
Fix job state recovery logic in which a job's accounting frequency was not set. This would result in a value of 65534 seconds being used (the equivalent of NO_VAL in uint16_t), which could result in the job being requeued or aborted.
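The 65534 figure follows from truncating a 32-bit sentinel to 16 bits. The sketch below illustrates that arithmetic and one way to guard against it. The names `SKETCH_NO_VAL`, `SKETCH_DEFAULT_ACCT_FREQ`, and `effective_acct_freq` are hypothetical; the sentinel value is assumed to mirror Slurm's `NO_VAL`, and the default of 30 seconds is purely illustrative.

```c
#include <stdint.h>

/* Assumed to mirror Slurm's 32-bit "no value" sentinel (illustrative). */
#define SKETCH_NO_VAL 0xfffffffeU

/* Hypothetical fallback used when no frequency was recorded. */
#define SKETCH_DEFAULT_ACCT_FREQ 30

/*
 * Hypothetical helper: return a usable accounting frequency.
 * Casting the 32-bit sentinel to uint16_t keeps only the low 16 bits,
 * 0xfffe, which is 65534. A stored value equal to that is treated as
 * "not set" rather than as a real interval in seconds.
 */
uint16_t effective_acct_freq(uint16_t stored_freq)
{
	if (stored_freq == (uint16_t) SKETCH_NO_VAL)
		return SKETCH_DEFAULT_ACCT_FREQ;
	return stored_freq;
}
```

Without such a check, the unset value would be interpreted as an interval of 65534 seconds, which matches the symptom described in the commit message.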
-
David Bigagli authored
-
Morris Jette authored
Problem reported by BYU. slurm.conf included a file one byte in length. The logic created a buffer one byte long and used fgets() to read the file. fgets() reads at most one byte less than the buffer size, leaving room for a trailing '\0', so it failed to read the file.
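The off-by-one can be avoided by sizing the buffer one byte larger than the file. The following is a minimal sketch of that idea, not the actual Slurm configuration parser: the function `read_first_line` and its parameters are hypothetical, and the caller is assumed to know the file size already (for example from `stat()`).

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: read the first line of a file whose size in bytes
 * is file_size, returning a newly allocated NUL-terminated string that the
 * caller must free(). The buffer is file_size + 1 bytes because fgets()
 * stores at most size - 1 characters and then appends '\0'.
 */
char *read_first_line(const char *path, size_t file_size)
{
	FILE *fp = fopen(path, "r");
	if (fp == NULL)
		return NULL;

	size_t buf_size = file_size + 1;	/* room for the trailing '\0' */
	char *buf = malloc(buf_size);
	if (buf == NULL) {
		fclose(fp);
		return NULL;
	}

	/* On EOF or error, return an empty string instead of garbage. */
	if (fgets(buf, (int) buf_size, fp) == NULL)
		buf[0] = '\0';

	fclose(fp);
	return buf;
}
```

For a one-byte file, `buf_size` is 2, so fgets() can store the single character followed by the terminator. With a one-byte buffer it would have no room for any data.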
-
- 13 Aug, 2013 2 commits
-
-
Morris Jette authored
-
jette authored
This problem was reported by Harvard University and could be reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id". With other job types, the error message could be logged many times for each job. This change logs the error once per job and only if the job request does not include the -O/--overcommit option.
-