- 22 Jan, 2013 5 commits
-
-
Morris Jette authored
-
jette authored
Conflicts:
	doc/html/Makefile.am
	doc/html/Makefile.in
-
Magnus Jonsson authored
-
jette authored
Correction to CPU allocation logic for cores without hyperthreading. Backport of https://github.com/SchedMD/slurm/commit/1ef41ac9590e018e631eaefb31254622984b7d2d
-
jette authored
-
- 19 Jan, 2013 2 commits
- 18 Jan, 2013 15 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
From Chris Holmes, HP: After several days of brainstorming and debugging, I have identified a bug in SLURM 2.5.0rc2 related to the 'tree' topology. It occurs so early in the execution of the whole SLURM machinery that it took me some time to figure out (say, 100 or 200 jobs showing the issue, with debugging levels increased and extra instrumentation, with sometimes uncertain reliability)... For every "switch", a bitmap of the nodes seen below that switch is built as the topology is discovered through 'topology.conf'. There is code in read_config.c, executed when the SLURM control daemon starts, that reorders the nodes (by hostname, by default) after the switch table (i.e. the bitmaps) has already been built. Reordering the nodes makes the switch bitmaps wrong.
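The failure mode can be shown with a minimal, self-contained C sketch (the toy uint64_t bitmap, struct sw, and node_index() below are illustrative stand-ins, not SLURM's actual data structures): a bitmap that records pre-sort node indexes silently points at the wrong node once the table is sorted.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct sw {
	const char *name;
	uint64_t node_bitmap;	/* bit i set => node_table[i] is below this switch */
};

static int node_index(char *node_table[], int cnt, const char *name)
{
	for (int i = 0; i < cnt; i++)
		if (!strcmp(node_table[i], name))
			return i;
	return -1;
}

int main(void)
{
	char *node_table[] = { "tux2", "tux0", "tux1" };	/* as parsed, unsorted */
	int cnt = 3;

	/* Bitmap built from topology.conf while the table is still unsorted:
	 * switch s0 sees tux0, which currently sits at index 1. */
	struct sw s = { "s0", 0 };
	s.node_bitmap |= 1ULL << node_index(node_table, cnt, "tux0");

	/* read_config.c later sorts the nodes by hostname; indexes change. */
	char *sorted[] = { "tux0", "tux1", "tux2" };
	memcpy(node_table, sorted, sizeof(sorted));

	/* The stale bitmap now resolves to tux1 instead of tux0. */
	for (int i = 0; i < cnt; i++)
		if (s.node_bitmap & (1ULL << i))
			printf("switch %s believes it sees %s\n", s.name, node_table[i]);
	return 0;
}

Building (or rebuilding) the per-switch bitmaps only after the node table has reached its final ordering avoids the mismatch.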
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
slurm.MEM_PER_CPU, slurm.NO_VAL, etc.
-
Morris Jette authored
-
Chris Harwell authored
"indeces" is not a word, but "indices" is. However, the choice of "indexes" has already been made in other parts of this code. Standardize: sed -i -e's/indecies/indexes/g' doc/man/man1/sbatch.1 src/sbatch/opt.c src/slurmctld/slurmctld.h src/common/node_conf.c src/common/node_conf.h
-
Morris Jette authored
Conflicts:
	doc/html/documentation.shtml
-
Morris Jette authored
-
Morris Jette authored
-
Phil Eckert authored
About a year ago I submitted a modification that you incorporated into SLURM 2.4, which allows an admin to modify a job to use a QOS even though the user does not have access to that QOS. However, I must have tested it without Accounting set to enforce QOS's. So, if an admin modifies a job to use a QOS the user doesn't have access to, the job will be modified, but it will end up in an InvalidQOS state. That is reasonable, since it handles the case where a user has their QOS removed. A problem, however, is that even though the scheduler won't schedule the job, backfill still will. One approach would be to fix backfill to be consistent with the scheduler (which should probably happen regardless), but my thought would be to modify the scheduler to allow the QOS as long as it was set by an admin, since that was the intent of the modification to begin with. I believe it would only take a single line to change, just adding a check on job_ptr->limit_set_qos to make sure it was set by an admin:

if (job_ptr->qos_id) {
	slurmdb_association_rec_t *assoc_ptr;
	assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
	if (assoc_ptr &&
	    !bit_test(assoc_ptr->usage->valid_qos, job_ptr->qos_id) &&
	    !job_ptr->limit_set_qos) {
		info("sched: JobId=%u has invalid QOS", job_ptr->job_id);
		xfree(job_ptr->state_desc);
		job_ptr->state_reason = FAIL_QOS;
		continue;
	} else if (job_ptr->state_reason == FAIL_QOS) {
		xfree(job_ptr->state_desc);
		job_ptr->state_reason = WAIT_NO_REASON;
	}
}

Phil
-
jette authored
The shutdown call was causing all pending I/O to be discarded. Linger waits for pending I/O to complete before the close call returns.
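A hedged sketch of the fix, assuming a plain Berkeley-sockets descriptor (the helper name close_with_linger and the 30-second timeout are illustrative, not SLURM's actual values): with SO_LINGER enabled, close() blocks until queued output has been transmitted or the timeout expires, rather than discarding it.

#include <sys/socket.h>
#include <unistd.h>

static int close_with_linger(int fd)
{
	struct linger lg = {
		.l_onoff  = 1,	/* linger on close */
		.l_linger = 30,	/* wait up to 30 s for queued data to drain */
	};

	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
		return -1;
	return close(fd);	/* returns once pending I/O completes or times out */
}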
-
- 17 Jan, 2013 6 commits
-
-
Morris Jette authored
Conflicts:
	src/sacctmgr/sacctmgr.c
	src/sreport/sreport.c
-
David Bigagli authored
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
From Matthieu Hautreux: However, after discussing the point with the onsite Bull support team and looking at the slurmstepd code concerning stdout/err/in redirection, we would like to recommend two things for future versions of SLURM:
- shutdown(..., SHUT_WR) should be performed when managing the TCP sockets (see the sketch after this list). No shutdown(..., SHUT_WR) is performed on the TCP socket in slurmstepd eio management, so close() cannot reliably inform the other end of the socket that the transmission is done (no TCP FIN transmitted). As the close is followed by an exit(), the kernel is the only entity that knows the close may not have been taken into account by the other side (which might be our initial problem), so no retry can be performed, leaving the server side of the socket (srun) in a position where it can wait on a read until the end of time.
- TCP keepalive addition. No TCP keepalive seems to be configured in SLURM TCP exchanges, potentially leaving the system deadlocked if a remote host disappears while the local host is waiting on a read (a write would instead result in EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes) could improve the resilience of SLURM against unexpected packet/connection loss without much impact on the scalability of the solution. The timeout could be configurable in case it is found too aggressive for particular configurations.
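A minimal sketch of the first recommendation (the helper name finish_write_side and the drain loop are illustrative, not slurmstepd's actual eio code): half-closing with shutdown(SHUT_WR) transmits the FIN so the peer's read() returns 0 instead of blocking forever, and the local side can still drain any remaining data before close().

#include <sys/socket.h>
#include <unistd.h>

static void finish_write_side(int fd)
{
	char buf[4096];

	/* Half-close: send TCP FIN, keeping the read side open. */
	shutdown(fd, SHUT_WR);

	/* Drain until the peer closes its side, then fully close. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
}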
-
Morris Jette authored
Added "KeepAliveTime" configuration parameter From Matthieu Hautreux: TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP exchanges, thus letting the system potentially deadlocked if a remote host dissapear and the local host is waiting on a read (the write would result in a EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes), could enhance the resilience of SLURM for unexpected packet/connection loss without too much implication on the scalability of the solution. The timeout could be configurable in case it is find too aggresive for particular configurations.
-
- 16 Jan, 2013 12 commits
-
-
Morris Jette authored
Conflicts:
	src/slurmctld/proc_req.c
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Without this change a high priority batch job may not start at submit time. In addition, a pending job with multiple partitions may be cancelled when the scheduler runs if any of its partitions cannot be used by the job.
-
David Bigagli authored
-
Morris Jette authored
The original work this was based upon has been replaced with new logic.
-
Morris Jette authored
Without this patch, if the first listed partition lacks nodes with the required features, the job would be rejected.
-
Morris Jette authored
While this will validate a job at submit time, it results in redundant looping when scheduling jobs. Working on an alternate patch now.
-
Danny Auble authored
-
Danny Auble authored
submission.
-