- 12 Sep, 2003 6 commits
-
-
Moe Jette authored
when the job does not exist).
-
Moe Jette authored
it is a duplicate record.
-
Mark Grondona authored
-
Mark Grondona authored
o check for a job step state of STARTED before issuing kill_job rpc
-
Moe Jette authored
was only going to 65500 for the job_id and the step_id was always zero. This change does not eliminate the possibility of an error, but reduces its probability by a factor of about 65000. (gnats:276)
-
Moe Jette authored
to job_kill request and slurmctld leaves node and job in COMPLETING state until the slurmd issues an EPILOG_COMPLETE RPC on each node. This permits better support for non-killable processes and/or long-running epilog scripts. Several minor changes in node registration handling and slurmctld agent logic to better address a flood of incoming RPCs (typically when the system restarts). (gnats:268)
-
- 11 Sep, 2003 1 commit
-
-
Moe Jette authored
-
- 10 Sep, 2003 3 commits
- 09 Sep, 2003 8 commits
-
-
Mark Grondona authored
-
Mark Grondona authored
-
Mark Grondona authored
may result in multiple executions of system epilog for a single job (gnats:267)
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
- 05 Sep, 2003 8 commits
-
-
Moe Jette authored
of socket communications. Previously, a legitimate SLURM error code was sometimes overwritten with the fcntl error code EINTR.
-
Moe Jette authored
sort of slurm error.
-
Moe Jette authored
-
Moe Jette authored
on a job kill. Let the KILL_JOB RPC do all of the cleanup. This removes a redundant RPC. - Moe
-
Moe Jette authored
-
Moe Jette authored
occur naturally if a srun, scontrol, scancel, sinfo, or squeue command is killed by the user with a communication to slurmctld in progress. This seems to occur fairly regularly as part of batch job termination.
-
Moe Jette authored
-
Moe Jette authored
send/receive, function (poll, timeout, send, recv, etc.), and the error message are all reported.
-
- 04 Sep, 2003 3 commits
- 03 Sep, 2003 3 commits
-
-
Moe Jette authored
It was picking zero nodes and failing.
-
Moe Jette authored
-
Mark Grondona authored
problem when debugging remote tasks. (and error should have only printed once anyway)
-
- 02 Sep, 2003 1 commit
-
-
Mark Grondona authored
not SIGXCPU on reaching timelimit.
-
- 20 Aug, 2003 1 commit
-
-
jwindley authored
-
- 14 Aug, 2003 2 commits
- 13 Aug, 2003 4 commits
-
-
Mark Grondona authored
-
Moe Jette authored
to match that of a job's run time (TIME).
-
Mark Grondona authored
-
Moe Jette authored
hand if race condition starting all daemons).
-