1. Why is my job/node in completing state?
2. Why do I see the error "Can't propagate RLIMIT_..."?
3. Why is my job not running?
1. Why is my job/node in completing state?
When a job is terminating, both the job and its nodes enter the state
completing.
As the SLURM daemon on each node determines that all processes associated
with the job have terminated, that node changes state to idle or
some other appropriate state.
When every node allocated to a job has determined that all processes
associated with it have terminated, the job changes state to completed
or some other appropriate state.
Normally this happens within a fraction of one second.
However, if the job has processes which cannot be terminated with a SIGKILL,
the job and one or more nodes can remain in the completing state for
an extended period of time.
This may be indicative of processes hung waiting for a core file to complete
I/O, or of an operating system failure.
If this state persists, the system administrator should use the scontrol
command to change the node's state to down, reboot the node, then reset
the node's state to idle.
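For example, the recovery might look something like the following (the
node name linux0123 is hypothetical, and the exact scontrol syntax may
vary with the SLURM version):
  # Mark the stuck node down
  scontrol update NodeName=linux0123 State=DOWN
  # ... reboot the node ...
  # Once the node is back up, return it to service
  scontrol update NodeName=linux0123 State=IDLE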
2. Why do I see the error "Can't propagate RLIMIT_..."?
When the srun command executes, it captures the resource limits in
effect at that time.
These limits are propagated to the allocated nodes before initiating the
user's job.
If the soft resource limits on the job submit host are higher than the
hard resource limits on the allocated host, SLURM will be unable to
propagate the resource limits and will print an error of the type shown above.
It is recommended that the system administrator establish uniform hard
resource limits on all nodes within a cluster to prevent this from occurring.
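As a rough sanity check, you can compare a soft limit on the submit host
with the corresponding hard limit on an allocated node; the example below
uses the open-file limit (RLIMIT_NOFILE) and assumes srun can launch a
shell on one node:
  # Soft limit in effect on the submit host
  ulimit -S -n
  # Hard limit on an allocated node
  srun -N1 /bin/sh -c 'ulimit -H -n'
If the first number is larger than the second, the propagation error
described above can occur.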
3. Why is my job not running?
The answer to this question depends upon the scheduler used by SLURM.
Executing the command
scontrol show config | grep SchedulerType will supply this information.
If the scheduler type is builtin, then jobs will be executed in the
order of submission for a given partition.
Even if resources are available to initiate your job immediately, it will
be deferred until no previously submitted job is pending.
If the scheduler type is backfill, then jobs will generally be
executed in the order of submission for a given partition with one
exception: later submitted jobs will be initiated early if doing so does
not delay the expected execution time of an earlier submitted job.
In order for backfill scheduling to be effective, users' jobs should specify
reasonable time limits.
If jobs do not specify time limits then all jobs will receive the same
time limit, that associated with the partition, and the ability to backfill
schedule jobs will be limited.
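For example, a job expected to run for no more than 30 minutes could
request that limit explicitly (my_program is a placeholder; see the srun
man page for the time-limit option in your SLURM version):
  # Request four nodes and a 30-minute time limit so the
  # backfill scheduler can fit the job into an open window
  srun -N4 -t 30 my_program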
The backfill scheduler does not alter job specifications of required or
excluded nodes, so jobs which specify nodes will substantially reduce
the effectiveness of backfill scheduling.
If the scheduler type is wiki, this represents the Maui Scheduler.
Please refer to its documentation for help.
For any scheduler, you can check priorities of jobs using the command
scontrol show job.
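For example, assuming the job record reported by scontrol includes a
Priority field, you could filter for it (the job ID 1234 is hypothetical):
  scontrol show job 1234 | grep Priority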