- 03 Jan, 2013 10 commits
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Command line arguments would not be processed; scontrol would exit immediately instead
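A rough sketch of the intended behavior (the dispatcher below is a hypothetical stand-in, not scontrol's actual code): if commands are given on the command line, execute them and exit with their status, rather than exiting before they are processed.

    #include <stdio.h>
    #include <string.h>

    /* hypothetical stand-in for scontrol's command dispatcher */
    static int process_command(int argc, char **argv)
    {
        if (argc > 0 && !strcmp(argv[0], "ping")) {
            printf("pong\n");
            return 0;
        }
        fprintf(stderr, "unrecognized command\n");
        return 1;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1) {
            /* one-shot mode: run the command from the command
             * line, then exit with its status */
            return process_command(argc - 1, argv + 1);
        }
        /* ... otherwise enter the interactive command loop ... */
        return 0;
    }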
-
Morris Jette authored
Conflicts: src/scontrol/scontrol.c
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 02 Jan, 2013 1 commit
-
Morris Jette authored
The original patch works fine to avoid cancelling a job when all of its nodes go unresponsive, but I don't see any way to easily address nodes coming back into service. We want to cancel jobs that have some up nodes and some down nodes, but the nodes will come back into service individually rather than all at once.
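A sketch of the policy being discussed, using invented structures rather than slurmctld's real ones: cancel a job only when its nodes are a mix of responding and non-responding, since a job whose nodes all went quiet together is more likely the victim of a global failure.

    #include <stdbool.h>
    #include <stddef.h>

    struct node { bool responding; };
    struct job  { struct node *nodes; size_t node_cnt; };

    /* Cancel only jobs with a mix of up and down nodes; a job whose
     * nodes are all unresponsive is assumed to be hit by a global
     * failure (e.g. the network) and is left running. */
    static bool should_cancel(const struct job *job)
    {
        size_t up = 0, down = 0;

        for (size_t i = 0; i < job->node_cnt; i++) {
            if (job->nodes[i].responding)
                up++;
            else
                down++;
        }
        return (up > 0) && (down > 0);
    }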
-
- 31 Dec, 2012 1 commit
-
jette authored
The job will be aborted if any of its nodes is set DOWN while responding, when "scontrol reconfig" is executed, or when the slurmctld restarts, but this should respond better to global failures, like the network going down.
-
- 29 Dec, 2012 3 commits
-
jette authored
-
Danny Auble authored
Fix broken build when HAVE_READLINE is false
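The usual shape of such a guard, sketched here rather than quoted from the patch: when readline is not available, fall back to fgets() so the interactive loop still compiles.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #ifdef HAVE_READLINE
    #  include <readline/readline.h>
    #  include <readline/history.h>
    #endif

    /* Read one command line, with or without readline support. */
    static char *get_command_line(const char *prompt)
    {
    #ifdef HAVE_READLINE
        char *line = readline(prompt);
        if (line && *line)
            add_history(line);
        return line;                    /* caller frees */
    #else
        static char buf[1024];
        printf("%s", prompt);
        fflush(stdout);
        if (!fgets(buf, sizeof(buf), stdin))
            return NULL;
        buf[strcspn(buf, "\n")] = '\0';
        return strdup(buf);             /* caller frees */
    #endif
    }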
-
Ralph Castain authored
-
- 28 Dec, 2012 8 commits
-
Morris Jette authored
There are far fewer RPCs that are not supported than are supported, so this should be faster and easier to maintain.
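The idea, sketched with a made-up message-type enum rather than the real protocol definitions: enumerate only the unsupported RPC types in a switch and treat everything else as supported by default.

    #include <stdbool.h>

    /* illustrative stand-ins for real protocol message types */
    typedef enum {
        REQUEST_PING,
        REQUEST_JOB_INFO,
        REQUEST_LEGACY_FOO,     /* no longer supported */
        REQUEST_LEGACY_BAR,     /* no longer supported */
    } msg_type_t;

    /* List only the unsupported types; the default case keeps the
     * common path short, and new RPCs are supported automatically. */
    static bool rpc_supported(msg_type_t type)
    {
        switch (type) {
        case REQUEST_LEGACY_FOO:
        case REQUEST_LEGACY_BAR:
            return false;
        default:
            return true;
        }
    }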
-
Morris Jette authored
Conflicts: src/common/slurm_protocol_util.c
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
-
jette authored
-
jette authored
-
- 27 Dec, 2012 4 commits
- 22 Dec, 2012 2 commits
-
Danny Auble authored
-
Danny Auble authored
stack.
-
- 21 Dec, 2012 8 commits
-
Morris Jette authored
If sched/backfill starts a job with a QOS having NO_RESERVE and no job time limit, start it with the partition time limit (or one year if the partition has no time limit) rather than NO_VAL (a 140 year time limit).

If a standby job, which in this case has the NO_RESERVE flag set, is submitted without a time limit and is backfilled, it gets an EndTime far into the future:

    JobId=99 Name=cmdll
       UserId=eckert(1043) GroupId=eckert(1043)
       Priority=12083 Account=sa QOS=standby
       JobState=RUNNING Reason=None Dependency=(null)
       Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
       RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
       SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
       StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

So I looked at the code in /src/plugins/sched/backfill:

    if (job_ptr->start_time <= now) {
        int rc = _start_job(job_ptr, resv_bitmap);
        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
            job_ptr->time_limit = orig_time_limit;
            job_ptr->end_time = job_ptr->start_time +
                                (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified time limit, job_ptr->time_limit is equal to NO_VAL when it hits this code.
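A sketch of the fix as described, with invented structure and constant definitions standing in for the real ones: when the job carries no time limit of its own, fall back to the partition limit, and to one year if the partition is also unlimited, so end_time is never computed from NO_VAL.

    #include <stdint.h>
    #include <time.h>

    #define NO_VAL_LIMIT 0xfffffffe          /* stand-in for SLURM's NO_VAL */
    #define ONE_YEAR_MIN (365 * 24 * 60)     /* one year, in minutes */

    struct part_record { uint32_t max_time; };  /* minutes, or NO_VAL_LIMIT */
    struct job_record {
        uint32_t time_limit;                    /* minutes, or NO_VAL_LIMIT */
        time_t   start_time;
        time_t   end_time;
        struct part_record *part_ptr;
    };

    /* Choose a finite time limit before computing end_time, so a job
     * without a limit never gets an EndTime ~140 years out. */
    static void set_end_time(struct job_record *job_ptr)
    {
        uint32_t limit = job_ptr->time_limit;

        if (limit == NO_VAL_LIMIT) {
            if (job_ptr->part_ptr &&
                (job_ptr->part_ptr->max_time != NO_VAL_LIMIT))
                limit = job_ptr->part_ptr->max_time; /* partition limit */
            else
                limit = ONE_YEAR_MIN;                /* last-resort cap */
        }
        job_ptr->end_time = job_ptr->start_time + (time_t)limit * 60;
    }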
-
Danny Auble authored
-
Danny Auble authored
2.5.
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Identify node states on which HealthCheckProgram should be executed.
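If this refers to the HealthCheckNodeState option added to slurm.conf, usage looks roughly like the following; treat the exact parameter values as an assumption to be checked against the slurm.conf man page.

    # Run HealthCheckProgram only on nodes in the listed states,
    # e.g. only on idle nodes so running jobs are not disturbed:
    HealthCheckProgram=/usr/sbin/nhc
    HealthCheckInterval=300
    HealthCheckNodeState=IDLE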
-
- 20 Dec, 2012 3 commits
-
Danny Auble authored
-
Danny Auble authored
slurm.conf with NodeAddrs, signals going to a step could be handled incorrectly.
-
Danny Auble authored
would have also killed the allocation.
-