
Frequently Asked Questions
- Why is my job/node in "completing" state?
- Why do I see the error "Can't propagate RLIMIT_..."?
- Why is my job not running?
1. Why is my job/node in "completing" state?
When a job is terminating, both the job and its nodes enter the state "completing."
As the SLURM daemon on each node determines that all processes associated with
the job have terminated, that node changes state to "idle" or some other
appropriate state. When every node allocated to a job has determined that all
processes associated with it have terminated, the job changes state to "completed"
or some other appropriate state. Normally, this happens within a fraction of one
second. However, if the job has processes that cannot be terminated with a SIGKILL,
the job and one or more nodes can remain in the completing state for an extended
period of time. This may indicate processes hung waiting for a core file
write to complete, or an operating system failure. If this state persists, the system
administrator should use the scontrol command
to change the node's state to "down," reboot the node, then reset the
node's state to idle.
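The recovery sequence described above can be sketched with scontrol roughly as follows (the node name node001 and the Reason string are placeholders for your site's values):

```shell
# Mark the stuck node down; a Reason string should be supplied when
# downing a node so other administrators know why
scontrol update NodeName=node001 State=DOWN Reason="processes stuck in completing"

# ... reboot the node by whatever means your site uses ...

# After the reboot, return the node to service
scontrol update NodeName=node001 State=IDLE
```

These commands require administrator privileges and a running slurmctld, so they are shown here only as a sketch of the procedure.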
2. Why do I see the error "Can't propagate RLIMIT_..."?
When the srun command executes, it captures the
resource limits in effect at that time. These limits are propagated to the allocated
nodes before initiating the user's job. If the soft resource limits on the job
submit host are higher than the hard resource limits on the allocated host, SLURM
will be unable to propagate the resource limits and print an error of the type
shown above. It is recommended that the system administrator establish uniform
hard resource limits on all nodes within a cluster to prevent this from occurring.
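One way to diagnose such a mismatch is to compare soft and hard limits host by host. As a sketch, the shell's ulimit builtin reports both for a given resource (the open-file limit, RLIMIT_NOFILE, is used here as an example; run this on both the submit host and a compute node and compare):

```shell
# Show the soft and hard limits for the open-file count (RLIMIT_NOFILE)
# on the current host; repeat on submit and compute hosts to compare.
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "soft=$soft hard=$hard"
```

If the soft limit on the submit host exceeds the hard limit on an allocated node, srun will report the propagation error for that resource.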
3. Why is my job not running?
The answer to this question depends upon the scheduler used by SLURM. Executing
the command
scontrol show config | grep SchedulerType
will supply this information. If the scheduler type is builtin, then
jobs will be executed in the order of submission for a given partition. Even if
resources are available to initiate your job immediately, it will be deferred
until no previously submitted job is pending. If the scheduler type is backfill,
then jobs will generally be executed in the order of submission for a given partition
with one exception: later submitted jobs will be initiated early if doing so does
not delay the expected execution time of an earlier submitted job. In order for
backfill scheduling to be effective, users' jobs should specify reasonable time
limits. If jobs do not specify time limits, then all jobs will receive the same
time limit (that associated with the partition), and the ability to backfill schedule
jobs will be limited. The backfill scheduler does not alter job specifications
of required or excluded nodes, so jobs which specify nodes will substantially
reduce the effectiveness of backfill scheduling. If the scheduler type is wiki,
SLURM is using The Maui Scheduler; please refer to its documentation for help.
For any scheduler, you can check the priorities of jobs using the command
scontrol show job.
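Putting the above together, a sketch of checking the scheduler and submitting a job with an explicit time limit (the program name my_program and the 30-minute limit are illustrative, not prescribed by SLURM):

```shell
# Determine which scheduler is configured
scontrol show config | grep SchedulerType

# Run a job with an explicit 30-minute time limit so the backfill
# scheduler can fit it around earlier pending jobs
srun --time=30 -N2 my_program
```

These commands require access to a running SLURM cluster, so they are shown only as an illustration of the submission options discussed above.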