Commits · d9bd45db4d34f182b66c2844cebe4415ca06e261 · Manuel G. Marciani / ces_slurm_simulator

23 Jan, 2013 1 commit

In select/cons_res, correct logic when job removed from only some nodes. · eb3c1046

jette authored Jan 23, 2013

I ran into a problem with slurm-2.5.1 in which IDLE nodes cannot be
allocated to jobs. This can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to
allocate nodes exclusively by default). Then set one of the nodes
allocated to the job (cn2) to state DOWN:

srun: error: Node failure on cn2
srun: error: Node failure on cn2
srun: error: cn2: task 0: Killed
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 1: running
srun: task 0: exited abnormally
^Csrun: sending Ctrl-C to job 22605.0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Force Terminated job step 22605.0

Then change the state of the node back to IDLE. It still cannot be allocated
to jobs:

srun: job 22606 queued and waiting for resources

  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  22606      work hostname     root  PD       0:00      1 (Resources)
  22604      work   sbatch     root   R       3:06      1 cn1

NodeName=cn2 Arch=x86_64 CoresPerSocket=8
   CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
   Gres=(null)
   NodeAddr=cn2 NodeHostName=cn2
   OS=Linux RealMemory=30000 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence
is:

slurmctld/node_mgr.c: update_node() =>
slurmctld/job_mgr.c: kill_running_job_by_node_name() =>
excise_node_from_job() =>
plugins/select/cons_res/select_cons_res.c: select_p_job_resized() =>
_rm_job_from_one_node() => _build_row_bitmaps() =>
common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row
bitmap will not be set correctly. In the example above, before
_build_row_bitmaps(), output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00]   row0: num_jobs 2: bitmap: 16,32-47

Cores of cn2 are not marked as available. Instead, cores of other nodes
are released. When another job requires the node cn2, the following log
message appears:

[2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and I do not know
how to fix this. But it seems that _build_row_bitmaps() should not be
called, since the job is not removed entirely; only one of its nodes is
released.
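
The expected effect of removing a job from a single node can be sketched as follows. This is only an illustration, not the select/cons_res code: it assumes a row bitmap of at most 64 cores, 16 cores per node, and that cn2 occupies core offsets 32-47 (the report does not state the mapping). The helpers core_range() and release_node_cores() are hypothetical names.

```c
#include <stdint.h>

#define CORES_PER_NODE 16	/* assumption: uniform 16-core nodes */

/* Build a mask with bits first..last (inclusive) set; hypothetical helper. */
static uint64_t core_range(int first, int last)
{
	uint64_t mask = 0;
	for (int i = first; i <= last; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}

/* Clear only the cores that belong to node_inx; all other bits are kept. */
static uint64_t release_node_cores(uint64_t row_bitmap, int node_inx)
{
	int first = node_inx * CORES_PER_NODE;
	return row_bitmap & ~core_range(first, first + CORES_PER_NODE - 1);
}
```

Under these assumptions, starting from the row bitmap 16,32-63 and releasing node index 2 leaves 16,48-63, whereas the log above shows 16,32-47, i.e. the wrong node's cores were released.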

eb3c1046

22 Jan, 2013 1 commit
- In select/cons_res, correct logic to allocate whole sockets to jobs · 47a39777
  Magnus Jonsson authored Jan 21, 2013
  
  47a39777
18 Jan, 2013 3 commits

Fix topology/tree logic when nodes defined in slurm.conf get re-ordered · 29df4c83

Morris Jette authored Jan 18, 2013

From Chris Holmes, HP:
After several days of brainstorming and debugging, I have identified
a bug in SLURM 2.5.0rc2, related to the 'tree' topology. It was so
early in the execution of the whole SLURM machinery that it took me
some time to figure it out (say, 100 or 200 jobs showing the issue,
with more or less debugging levels increased and extra
instrumentation, with sometimes an uncertain reliability)...

For every “switch” a bitmap of nodes (seen down by the switch) is
built as the topology is discovered through 'topology.conf'.

There is code in read_config.c, executed when the SLURM control
daemon starts, that reorders the nodes (according to their hostname
by default), while the switches table (i.e. the bitmaps) has already
been built. Reordering the nodes means that the bitmaps of the
switches become wrong.
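
Why a reorder invalidates the switch bitmaps can be illustrated with a small sketch. It is not taken from the commit (the actual fix may simply rebuild the switch table after the reorder); it assumes at most 64 nodes and a hypothetical new_index[] array giving each node's position after sorting.

```c
#include <stdint.h>

/*
 * Translate a per-switch node bitmap from the old node ordering to the
 * new one. new_index[i] is the position, after reordering, of the node
 * that used to sit at index i (hypothetical representation).
 */
static uint64_t remap_switch_bitmap(uint64_t old_bitmap,
				    const int *new_index, int node_cnt)
{
	uint64_t remapped = 0;
	for (int i = 0; i < node_cnt; i++) {
		if (old_bitmap & (UINT64_C(1) << i))
			remapped |= UINT64_C(1) << new_index[i];
	}
	return remapped;
}
```

For example, if the nodes are listed as n3, n1, n2 and then sorted by name, new_index is {2, 0, 1}. A switch bitmap covering the first two listed nodes (bits 0 and 1) must become bits 2 and 0; left untouched, it would wrongly describe n1 and n2.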

29df4c83

Make more variables available to job_submit/lua plugin · 28740196
Morris Jette authored Jan 18, 2013
```
slurm.MEM_PER_CPU, slurm.NO_VAL, etc.
```
28740196

Permit job with invalid QOS to run if QOS set by administrator · 7aef4f80

Phil Eckert authored Jan 17, 2013

About a year ago I submitted a modification that you incorporated
into SLURM 2.4, which was to allow an admin to modify a job to use
a QOS even though the user did not have access to the QOS.

However, I must have tested it without having the Accounting set
to enforce QOS's. So, if an admin modifies a job to use a QOS the user
doesn't have access to, it will be modified, but the job will result
in a state of InvalidQOS, which is reasonable, since this would
handle the case where a user has their QOS removed. A problem,
however, is that even though the scheduler won't schedule the job,
backfill still will.

One approach would be to fix backfill to be consistent with
the scheduler (which should probably occur regardless), but
my thought would be to modify the scheduler to allow the QOS
as long as it was set by an admin, since that was the intent
of the modification to begin with.

I believe it would only take a single line to change, just
adding a check on the job_ptr->limit_set_qos, to make sure
it was set by an admin:

                if (job_ptr->qos_id) {
                        slurmdb_association_rec_t *assoc_ptr;
                        assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
                        if (assoc_ptr &&
                            !bit_test(assoc_ptr->usage->valid_qos,
                                      job_ptr->qos_id) &&
                            !job_ptr->limit_set_qos) {
                                info("sched: JobId=%u has invalid QOS",
                                        job_ptr->job_id);
                                xfree(job_ptr->state_desc);
                                job_ptr->state_reason = FAIL_QOS;
                                continue;
                        } else if (job_ptr->state_reason == FAIL_QOS) {
                                xfree(job_ptr->state_desc);
                                job_ptr->state_reason = WAIT_NO_REASON;
                        }
                }

Phil

7aef4f80

17 Jan, 2013 1 commit

Add support for configurable keep alive time for srun/slurmstep communications · 5752c6ce

Morris Jette authored Jan 16, 2013

Added "KeepAliveTime" configuration parameter

From Matthieu Hautreux:
TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
exchanges, thus leaving the system potentially deadlocked if a remote host
disappears and the local host is waiting on a read (the write would result in an
EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
relatively large timeout value (5 minutes) could enhance the resilience of
SLURM to unexpected packet/connection loss without too much impact on the
scalability of the solution. The timeout could be configurable in case it is
found too aggressive for particular configurations.
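
For readers unfamiliar with the socket options involved, the following is a minimal sketch of enabling keepalive on a TCP socket. It is not the code added by this commit; enable_keepalive() is a hypothetical helper, and TCP_KEEPIDLE is a Linux-specific option, so it is guarded by an #ifdef.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Turn on keepalive probes for fd and, where supported, set the idle
 * time before the first probe. Returns 0 on success, -1 on failure
 * (errno is set by setsockopt).
 */
static int enable_keepalive(int fd, int idle_secs)
{
	int on = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
		return -1;
#ifdef TCP_KEEPIDLE
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
		       &idle_secs, sizeof(idle_secs)) < 0)
		return -1;
#endif
	return 0;
}
```

Calling enable_keepalive(fd, 300) would correspond to the 5 minute timeout suggested above; a blocked read on a dead peer then eventually fails instead of waiting forever.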

5752c6ce

16 Jan, 2013 4 commits
- Fix for job request with multiple partitions and node features · d84ad203
  Morris Jette authored Jan 16, 2013
```
Without this patch, if the first listed partition lacks nodes
with required features the job would be rejected.
```
  d84ad203
- Revert commit 3d69c635 · 0b9c3c58
  Morris Jette authored Jan 16, 2013
```
While this will validate job at submit time, it results in redundant
looping when scheduling jobs. Working on alternate patch now.
```
  0b9c3c58
- Check all requested partitions to make sure at least one is valid on job · 3d69c635
  Danny Auble authored Jan 16, 2013
```
submission.
```
  3d69c635
- Fix issue where sjstat and sjobexitmod were installed in 2 different rpms · 8b91fd63
  Danny Auble authored Jan 15, 2013
  
  8b91fd63
15 Jan, 2013 2 commits

QoS limits enforcement: correct a bug with 0-valued per user used limits · 4136520d

Matthieu Hautreux authored Jan 15, 2013

QoS limits enforcement on the controller side is based on a list of used_limits
per user. When a user is not yet added to the list, which is common when the
controller is restarted and the user has no running jobs, the current logic is
to not check some of the "per user limits" and let the submission succeed.
However, if one of these limits is a zero-valued limit, the check should
fail, as it means that no job should be submitted at all: any submission would
necessarily result in a crossing of the limit.

This patch ensures that even when a user is not yet present in the per user
used_limits list, the 0-valued limits are correctly treated.
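
The rule described above can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: struct used_limits, LIMIT_UNSET, and submit_allowed() are hypothetical names, and only a single "max jobs per user" limit is modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LIMIT_UNSET UINT32_MAX	/* hypothetical "no limit configured" marker */

/* Hypothetical per-user usage record kept by the controller. */
struct used_limits {
	uint32_t jobs;
};

/*
 * Decide whether one more job may be submitted. used is NULL when the
 * user has no record yet; usage is then treated as zero instead of
 * skipping the check, so a 0-valued limit still rejects the job.
 */
static bool submit_allowed(uint32_t max_jobs_per_user,
			   const struct used_limits *used)
{
	uint32_t in_use = used ? used->jobs : 0;

	if (max_jobs_per_user == LIMIT_UNSET)
		return true;
	return in_use < max_jobs_per_user;
}
```

The key design point is that a missing record is not a reason to bypass the comparison: with a limit of 0, the test `0 < 0` fails and the submission is refused.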

4136520d

Merge priority/multifactor2 plugin into priority/multifactor · 6596f17e
David Bigagli authored Jan 14, 2013
```
Add PriorityFlags value of "TICKET_BASED".
```
6596f17e

14 Jan, 2013 3 commits
- Handle srun task launch failure without duplicate error messages or abort. · 6e83dc4c
  jette authored Jan 14, 2013
  
  6e83dc4c
- Added new SelectTypeParameter value of "CR_ALLOCATE_FULL_SOCKET". · cdf679d0
  Morris Jette authored Jan 14, 2013
  
  cdf679d0
- select/cons_res plugin: CPU allocation logic fix · 1ef41ac9
  Morris Jette authored Jan 14, 2013
```
Correction to CPU allocation count logic for cores without hyperthreading.
```
  1ef41ac9
11 Jan, 2013 1 commit
- Modify switch/nrt logic to permit build without libnrt.so library. · d534d48b
  Morris Jette authored Jan 10, 2013
  
  d534d48b
10 Jan, 2013 3 commits
- Cray - Add new cray.conf parameter of "AlpsEngine" · 35a61b44
  Morris Jette authored Jan 10, 2013
```
Used to specify the communication protocol to be used for ALPS/BASIL.
```
  35a61b44
- Add job_submit/all_partitions plugin · f07f173b
  jette authored Jan 10, 2013
  
  f07f173b
- Fix logic to optimize GRES topology with respect to allocated CPUs. · a6bfae71
  Morris Jette authored Jan 09, 2013
  
  a6bfae71
09 Jan, 2013 2 commits
- Add missing "safe" flag from print of AccountStorageEnforce option. · 720153b7
  Danny Auble authored Jan 09, 2013
  
  720153b7
- Add new AccountStorageEnforce options of 'nojobs' and 'nosteps' which will · caa974ae
  Danny Auble authored Jan 08, 2013
```
allow the use of accounting features like associations, qos and limits but
not keep track of jobs or steps in accounting.
```
  caa974ae
08 Jan, 2013 3 commits
- BLUEGENE - fix for QOS/Association node limits. · b50e2269
  Danny Auble authored Jan 08, 2013
  
  b50e2269
- Fix advanced reservation recovery logic when upgrading from version 2.4. · 604b869e
  Morris Jette authored Jan 08, 2013
  
  604b869e
- Added support for job arrays. · 2993b423
  Morris Jette authored Jan 07, 2013
```
Phase 1 of effort. See "man sbatch" option -a/--array option for details.
Creates job records using sbatch. Reports job arrays using scontrol or
squeue. More work coming soon...
```
  2993b423
03 Jan, 2013 5 commits
- Start news for V2.5.2 · 64048706
  Morris Jette authored Jan 03, 2013
  
  64048706
- Disable job --exclusive option with select/serial plugin · 54a3a7db
  Morris Jette authored Jan 03, 2013
  
  54a3a7db
- Exit scontrol command on stdin EOF · 9eb37af0
  Morris Jette authored Jan 03, 2013
  
  9eb37af0
- Fix typo · e3cba0ed
  Morris Jette authored Jan 03, 2013
  
  e3cba0ed
- Correct core reservation logic for use with select/serial plugin. · 0a06566a
  Morris Jette authored Jan 02, 2013
  
  0a06566a
28 Dec, 2012 1 commit
- Support version 2.5 slurmctld with version 2.4 slurmd daemons. · 844f70a2
  Morris Jette authored Dec 28, 2012
  
  844f70a2
22 Dec, 2012 1 commit
- Alter hostlist logic to allocate large grid dynamically instead of on · 8b5045f2
  Danny Auble authored Dec 21, 2012
```
stack.
```
  8b5045f2
21 Dec, 2012 3 commits

Correct job time limit for sched/backfill when job has QOS with NO_RESERVE flag · 4652e982

Morris Jette authored Dec 21, 2012

If sched/backfill starts a job with a QOS having NO_RESERVE and no job
time limit, start it with the partition time limit (or one year if the
partition has no time limit) rather than NO_VAL (140 year time limit).

If a standby job, which in this case has the NO_RESERVE flag set, is
submitted without a time limit and is backfilled, it will get an EndTime
way into the future.

JobId=99 Name=cmdll
   UserId=eckert(1043) GroupId=eckert(1043)
   Priority=12083 Account=sa QOS=standby
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:14 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2012-12-20T11:49:36 EligibleTime=2012-12-20T11:49:36
   StartTime=2012-12-20T11:49:44 EndTime=2149-01-26T18:16:00

So I looked at the code in /src/plugins/sched/backfill:

                if (job_ptr->start_time <= now) {
                        int rc = _start_job(job_ptr, resv_bitmap);
                        if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
                                job_ptr->time_limit = orig_time_limit;
                                job_ptr->end_time = job_ptr->start_time +
                                                    (orig_time_limit * 60);

Using the debugger I found that if the job does not have a specified
time limit, the job_ptr->time_limit is equal to NO_VAL when it hits
this code.
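
The fallback described in the commit summary can be sketched as below. This is not the backfill plugin code: the SKETCH_* constants are stand-ins whose values are assumptions, and effective_limit_minutes() and end_time_for() are hypothetical helpers.

```c
#include <stdint.h>
#include <time.h>

/* Stand-in sentinels; the values are assumptions for this sketch. */
#define SKETCH_NO_VAL		0xfffffffeU	/* "no value specified" */
#define SKETCH_INFINITE		0xffffffffU	/* "no time limit" */
#define ONE_YEAR_MINUTES	(365U * 24U * 60U)

/*
 * Pick the time limit (in minutes) to apply when starting the job:
 * the job's own limit if it has one, else the partition limit, else
 * one year.
 */
static uint32_t effective_limit_minutes(uint32_t job_limit,
					uint32_t part_limit)
{
	if (job_limit != SKETCH_NO_VAL)
		return job_limit;
	if (part_limit != SKETCH_INFINITE && part_limit != SKETCH_NO_VAL)
		return part_limit;
	return ONE_YEAR_MINUTES;
}

/* End time is the start time plus the limit converted to seconds. */
static time_t end_time_for(time_t start, uint32_t limit_minutes)
{
	return start + (time_t)limit_minutes * 60;
}
```

With a sentinel such as NO_VAL fed directly into the end-time arithmetic, the result lands far in the future, as in the EndTime=2149 output above; resolving the limit first avoids that.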

4652e982

Remove sacct --dump --formatted-dump options which were deprecated in · b71523e8
Danny Auble authored Dec 20, 2012
```
2.5.
```
b71523e8
Added "HealthCheckNodeState" configuration parameter · b139f654
Morris Jette authored Dec 20, 2012
```
Identify node states on which HealthCheckProgram should be executed.
```
b139f654

20 Dec, 2012 4 commits
- FRONTEND - fixed issue where if a system's nodes weren't defined in the · 4d5d3e9f
  Danny Auble authored Dec 20, 2012
```
slurm.conf with NodeAddr's signals going to a step could be handled
incorrectly.
```
  4d5d3e9f
- Fixed issue where if an srun dies inside of an allocation abnormally it · ea7046d2
  Danny Auble authored Dec 20, 2012
```
would have also killed the allocation.
```
  ea7046d2
- Added "slurm_load_node_single" function · 2a889727
  Morris Jette authored Dec 20, 2012
```
This is a variation of "slurm_load_nodes", but accepts a node name argument,
potentially resulting in substantial performance improvement for
"sinfo --nodes=NAME".
```
  2a889727
- Added "slurm_load_job_user" function. · 78011bab
  Morris Jette authored Dec 19, 2012
```
This is a variation of "slurm_load_jobs", but accepts a user ID argument,
potentially resulting in substantial performance improvement.
```
  78011bab
19 Dec, 2012 2 commits
- BLUEGENE - Fix issue with preemption when needing to preempt multiple jobs · 8910442c
  Danny Auble authored Dec 19, 2012
```
to make one job run.
```
  8910442c
- BLUEGENE - Correct method to update conn_type of a job. · d26fa575
  Danny Auble authored Dec 19, 2012
  
  d26fa575