- 15 Sep, 2003 5 commits
-
-
Mark Grondona authored
-
Mark Grondona authored
setting SLURM_NODELIST in the environment)
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
in slurmd killing itself if the KILL_JOB RPC arrived before the job began execution (the pid in the data structure was still zero).
-
- 13 Sep, 2003 1 commit
-
-
Moe Jette authored
cases. Exit code is now 0 only if all commands execute without error. Exit code is 1 if any failure occurs for any command executed. (gnats:278)
-
- 12 Sep, 2003 8 commits
-
-
Mark Grondona authored
-
Mark Grondona authored
-
Moe Jette authored
when the job does not exist).
-
Moe Jette authored
it is a duplicate record.
-
Mark Grondona authored
-
Mark Grondona authored
o check for a job step state of STARTED before issuing kill_job RPC
-
Moe Jette authored
was only going to 65500 for the job_id and the step_id was always zero. This change does not eliminate the possibility of an error, but reduces its probability by a factor of about 65000. (gnats:276)
-
Moe Jette authored
to job_kill request and slurmctld leaves node and job in COMPLETING state until the slurmd issues an EPILOG_COMPLETE RPC on each node. This permits better support for non-killable processes and/or long-running epilog scripts. Several minor changes in node registration handling and slurmctld agent logic to better address a flood of incoming RPCs (typically when the system restarts). (gnats:268)
-
- 11 Sep, 2003 1 commit
-
-
Moe Jette authored
-
- 10 Sep, 2003 3 commits
- 09 Sep, 2003 8 commits
-
-
Mark Grondona authored
-
Mark Grondona authored
-
Mark Grondona authored
may result in multiple executions of system epilog for a single job (gnats:267)
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
- 05 Sep, 2003 8 commits
-
-
Moe Jette authored
of socket communications. Previously, a legitimate SLURM error code was sometimes overwritten with the fcntl error code EINTR.
-
Moe Jette authored
sort of slurm error.
-
Moe Jette authored
-
Moe Jette authored
on a job kill. Let the KILL_JOB RPC do all of the cleanup. This removes a redundant RPC. - Moe
-
Moe Jette authored
-
Moe Jette authored
occur naturally if an srun, scontrol, scancel, sinfo, or squeue command is killed by the user while a communication with slurmctld is in progress. This seems to occur fairly regularly as part of batch job termination.
-
Moe Jette authored
-
Moe Jette authored
send/receive, function (poll, timeout, send, recv, etc.), and the error message are all reported.
-
- 04 Sep, 2003 3 commits
- 03 Sep, 2003 3 commits
-
-
Moe Jette authored
It was picking zero nodes and failing.
-
Moe Jette authored
-
Mark Grondona authored
problem when debugging remote tasks. (The error should only have been printed once anyway.)
-