1. 19 Mar, 2013 5 commits
    • Note nature of latest change · 8e038b5c
      Morris Jette authored
    • Fix for case where idle nodes cannot be allocated · 4ea9850a
      Hongjia Cao authored
      Avoid adding or removing a job's node resources if the node was lost by a resize.
      
      I found another case in which an idle node cannot be allocated. It can be
      reproduced as follows:
      
      1. Run a job with the -k (do not kill the job on node failure) option:
      
          [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
          srun: error: Node failure on cn28
          srun: error: Node failure on cn28
          srun: error: cn28: task 10: Killed
          ^Csrun: interrupt (one more within 1 sec to abort)
          srun: tasks 0-9: running
          srun: task 10: exited abnormally
          ^Csrun: sending Ctrl-C to job 106120.0
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      
      2. Set a node down and then set it idle:
      
          [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
          [root@mn0 ~]# scontrol update nodename=cn28 state=idle
      
      3. Restart slurmctld:
      
          [root@mn0 ~]# service slurm restart
          stopping slurmctld:                                        [  OK  ]
          slurmctld is stopped
          starting slurmctld:                                        [  OK  ]
      
      4. Cancel the job.

      Then the node that was set down is left unavailable:
      
          [root@mn0 ~]# sinfo -n cn[18-28]
          PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
          work*        up   infinite     11   idle cn[18-28]
      
          [root@mn0 ~]# srun -w cn[18-28] hostname
          srun: job 106122 queued and waiting for resources
      
          [root@mn0 slurm]# grep cn28 slurmctld.log
          [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
          [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
      
      I made an attempt to fix this with the attached patch; please review it.
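
      A minimal, hedged sketch of the guard described at the top of this message;
      it uses a made-up toy type rather than slurmctld's real job and node records,
      and only illustrates the idea of skipping resource accounting for a node
      dropped by a resize:

          /* Sketch only: skip per-node resource accounting for a node that is
           * no longer part of the job's allocation, e.g. because it was dropped
           * by a resize after a node failure. */
          #include <stdbool.h>
          #include <stdio.h>

          struct toy_job {
                  bool node_in_alloc[4];  /* hypothetical: nodes the job still holds */
          };

          static void adjust_node_resources(struct toy_job *job, int node_inx, int delta)
          {
                  if (!job->node_in_alloc[node_inx])
                          return;  /* node lost by resize: do not add/remove its resources */
                  printf("adjust node %d by %d CPUs\n", node_inx, delta);
          }

          int main(void)
          {
                  struct toy_job job = { .node_in_alloc = { true, true, false, true } };
                  adjust_node_resources(&job, 2, -1);  /* skipped: node 2 was lost */
                  adjust_node_resources(&job, 1, -1);  /* applied */
                  return 0;
          }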
    • Correction in logic issuing call to account for change in job time limit · 9f5a7a0e
      Morris Jette authored
      I don't believe save_time_limit was redundant.  At least in this case:
      
      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)){
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
      [...]
      
      So later, when updating the db,
      
          if (save_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                              job_ptr);
      will cause the db to be updated, while
      
          if (orig_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn,
                              job_ptr);
      
      will not because job_ptr->time_limit now equals orig_time_limit.
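
      A hedged, self-contained illustration of that argument; the values below are
      made up purely to show when the two comparisons diverge, and the real fields
      of course live on job_ptr:

          #include <stdio.h>
          #include <stdint.h>

          int main(void)
          {
                  uint32_t orig_time_limit = 60;  /* illustrative: job's original limit */
                  uint32_t save_time_limit = 30;  /* illustrative: limit saved before this pass */
                  uint32_t comp_time_limit = 45;  /* illustrative: limit computed by backfill */
                  uint32_t time_limit = comp_time_limit;
                  int no_reserve = 1;             /* pretend QOS_FLAG_NO_RESERVE is set */

                  if (no_reserve)
                          time_limit = orig_time_limit;  /* restored, as in the snippet above */

                  /* Fires, so the db would be updated. */
                  printf("save_time_limit != time_limit: %d\n", save_time_limit != time_limit);
                  /* Never fires once the limit has been restored to orig_time_limit. */
                  printf("orig_time_limit != time_limit: %d\n", orig_time_limit != time_limit);
                  return 0;
          }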
    • Record updated job time limit if modified by backfill · 46348f91
      Don Lipari authored
      Without this change, if the job's time limit is modified down
      toward --time-min by the backfill scheduler, the new time limit
      is not recorded in the database.
  2. 14 Mar, 2013 4 commits
  3. 13 Mar, 2013 5 commits
  4. 12 Mar, 2013 2 commits
    • Minor format changes from previous commit · f5a89755
      Morris Jette authored
    • Fix scheduling if node in more than one partition · fcef06b4
      Magnus Jonsson authored
      I found a bug in cons_res/select_p_select_nodeinfo_set_all.
      
      If a node is part of two (or more) partitions, the code only counts the number of cores/CPUs in the partition that has the most running jobs on that node.
      
      Patch attached to fix the problem.
      
      I also added a new function to bitstring to count the number of set bits in a range (bit_set_count_range, sketched after this message) and made a minor improvement to bit_set_count while reviewing the range version.
      
      Best regards,
      Magnus
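
      A minimal sketch of what a bit_set_count_range()-style helper can look like,
      assuming a plain word-array bitmap; SLURM's actual bitstring implementation
      differs (bitstr_t carries its own header and macros), so this only illustrates
      the idea:

          #include <stdio.h>
          #include <stdint.h>
          #include <stddef.h>

          /* Count the bits set in positions [start, end) of a word-array bitmap. */
          static int bit_set_count_range(const uint64_t *bitmap, size_t start, size_t end)
          {
                  int count = 0;
                  for (size_t i = start; i < end; i++) {
                          if (bitmap[i / 64] & (1ULL << (i % 64)))
                                  count++;
                  }
                  return count;
          }

          int main(void)
          {
                  uint64_t bits[1] = { 0xF0F0 };  /* bits 4-7 and 12-15 set */
                  printf("%d\n", bit_set_count_range(bits, 0, 8));   /* prints 4 */
                  printf("%d\n", bit_set_count_range(bits, 6, 14));  /* prints 4 */
                  return 0;
          }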
  5. 11 Mar, 2013 8 commits
  6. 08 Mar, 2013 7 commits
  7. 07 Mar, 2013 1 commit
    • GRES topology bug in core selection logic fixed. · 07eb5d24
      jette authored
      This problem would affect systems in which specific GRES are associated
      with specific CPUs.
      One possible result is that the CPUs identified as usable could be
      inappropriate, and the job would be held when trying to lay out the tasks
      on CPUs (all done as part of the job allocation process).
      The other problem is that if multiple GRES are linked to specific CPUs,
      there was a CPU bitmap OR which should have been an AND, resulting in
      some CPUs being identified as usable but not available to all GRES.
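
      A small, self-contained illustration of the OR-versus-AND point above; the
      bit layout is made up, but it shows why the union over-reports usable CPUs
      while the intersection keeps only CPUs available to every GRES:

          #include <stdio.h>
          #include <stdint.h>

          int main(void)
          {
                  uint8_t gres1_cpus = 0x0F;  /* CPUs 0-3 usable with the first GRES  */
                  uint8_t gres2_cpus = 0x3C;  /* CPUs 2-5 usable with the second GRES */

                  uint8_t wrong = gres1_cpus | gres2_cpus;  /* union: claims CPUs 0-5 usable */
                  uint8_t right = gres1_cpus & gres2_cpus;  /* intersection: only CPUs 2-3   */

                  printf("OR -> 0x%02x, AND -> 0x%02x\n", (unsigned) wrong, (unsigned) right);
                  return 0;
          }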
  8. 06 Mar, 2013 8 commits