1. 26 Mar, 2013 3 commits
  2. 25 Mar, 2013 6 commits
  3. 24 Mar, 2013 1 commit
  4. 23 Mar, 2013 1 commit
  5. 22 Mar, 2013 2 commits
    • Add path for liblua · 9112d154
      Andy Wettstein authored
      On Red Hat 6-based distros the Lua library is named liblua-5.1.so.
      Installing the lua-devel package creates the liblua.so symlink, but
      if that package isn't installed, the lua job submit plugin will fail
      to load.
      I'm attaching a patch that adds liblua-5.1.so to the search path.
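      A minimal sketch of the fallback this patch describes, assuming a
      dlopen()-based loader; the helper name try_load_lua is hypothetical,
      and only the two library names come from the message above:

          #include <dlfcn.h>
          #include <stddef.h>
          #include <stdio.h>

          /* Try the generic name first, then the versioned name that
           * Red Hat 6 ships when lua-devel (and thus the liblua.so
           * symlink) is not installed. Link with -ldl. */
          static void *try_load_lua(void)
          {
              const char *names[] = { "liblua.so", "liblua-5.1.so" };
              for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
                  void *handle = dlopen(names[i], RTLD_NOW | RTLD_GLOBAL);
                  if (handle)
                      return handle;
              }
              fprintf(stderr, "could not load liblua: %s\n", dlerror());
              return NULL;
          }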
    • Select/cray - Modify build to enable direct use of libslurm library. · 7d4f145a
      Morris Jette authored
      These changes are required so that select/cray can load select/linear,
        which is a bit more complex than the other select plugin structures.
      Export plugin_context_create and plugin_context_destroy symbols from
        libslurm.so.
      Correct a typo in the exported hostlist_sort symbol name.
      Define some functions in select/cray to avoid undefined symbols if
        the plugin is loaded via libslurm rather than from a slurm command
        (which has all of the required symbols).
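      As background on why the exported symbols matter: a plugin opened
      with dlopen() can only resolve its undefined references against
      symbols already exported by the process or by libraries loaded with
      RTLD_GLOBAL. A generic illustration, not SLURM's actual plugin
      code; the .so paths here are hypothetical:

          #include <dlfcn.h>
          #include <stdio.h>

          int main(void)
          {
              /* Load the base library with RTLD_GLOBAL so that its
               * exported symbols (e.g. plugin_context_create in the
               * libslurm case) are visible to plugins loaded later. */
              void *base = dlopen("./libslurm.so", RTLD_NOW | RTLD_GLOBAL);
              if (!base) {
                  fprintf(stderr, "dlopen: %s\n", dlerror());
                  return 1;
              }
              /* If the needed symbols were not exported, this dlopen
               * would fail with an undefined-symbol error. */
              void *plug = dlopen("./select_cray.so", RTLD_NOW);
              if (!plug)
                  fprintf(stderr, "plugin: %s\n", dlerror());
              return plug ? 0 : 1;
          }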
  6. 21 Mar, 2013 1 commit
  7. 20 Mar, 2013 5 commits
  8. 19 Mar, 2013 8 commits
    • Don Lipari
    • Morris Jette
    • Change select() to poll() when waiting for a socket to be readable · 3175cf91
      Hongjia Cao authored
      select()/FD_ISSET() does not work for file descriptors larger than 1023.
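      A minimal sketch of the replacement pattern: select()'s fd_set is a
      fixed-size bitmap (FD_SETSIZE, normally 1024), so FD_SET() on a
      larger descriptor writes out of bounds, while poll() takes the
      descriptor value directly. The helper name is illustrative:

          #include <poll.h>

          /* Wait up to timeout_ms for fd to become readable; returns
           * 1 if readable (or hung up/errored), 0 on timeout, -1 on
           * error. Safe for any fd value, unlike select()/FD_ISSET(). */
          static int wait_readable(int fd, int timeout_ms)
          {
              struct pollfd pfd = { .fd = fd, .events = POLLIN };
              int rc = poll(&pfd, 1, timeout_ms);
              if (rc <= 0)
                  return rc;
              return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
          }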
    • Note nature of latest change · 8e038b5c
      Morris Jette authored
    • Fix idle nodes that cannot be allocated · 4ea9850a
      Hongjia Cao authored
      Avoid adding/removing a job's node resources if the node was lost by a resize.
      
      I found another case where an idle node cannot be allocated. It can
      be reproduced as follows:
      
      1. run a job with the -k option:
      
          [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
          srun: error: Node failure on cn28
          srun: error: Node failure on cn28
          srun: error: cn28: task 10: Killed
          ^Csrun: interrupt (one more within 1 sec to abort)
          srun: tasks 0-9: running
          srun: task 10: exited abnormally
          ^Csrun: sending Ctrl-C to job 106120.0
          srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      
      2. set a node down and then set it idle:
      
          [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
          [root@mn0 ~]# scontrol update nodename=cn28 state=idle
      
      3. restart slurmctld
      
          [root@mn0 ~]# service slurm restart
          stopping slurmctld:                                        [  OK  ]
          slurmctld is stopped
          starting slurmctld:                                        [  OK  ]
      
      4. cancel the job
      
      Then the node that was set down is left unavailable:
      
          [root@mn0 ~]# sinfo -n cn[18-28]
          PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
          work*        up   infinite     11   idle cn[18-28]
      
          [root@mn0 ~]# srun -w cn[18-28] hostname
          srun: job 106122 queued and waiting for resources
      
          [root@mn0 slurm]# grep cn28 slurmctld.log
          [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
          [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
      
      I made an attempt to fix this with the attached patch. Please review it.
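      A minimal sketch of the shape of the fix described in the first
      line of this message; all structure and function names here are
      hypothetical, not SLURM's actual code:

          #include <stdint.h>

          struct job_rec {
              uint8_t *node_bitmap;   /* bit set = job still holds node */
          };

          static int job_holds_node(const struct job_rec *job, int node_inx)
          {
              return (job->node_bitmap[node_inx / 8] >> (node_inx % 8)) & 1;
          }

          /* Skip resource accounting for a node the job no longer holds
           * (e.g. lost by a resize); adding or removing its resources
           * anyway leaves the node's allocation count wrong, so it can
           * look busy even while sinfo reports it idle. */
          static void account_node(struct job_rec *job, int node_inx,
                                   int add_resources)
          {
              if (!job_holds_node(job, node_inx))
                  return;
              (void)add_resources;  /* ... adjust CPUs/memory here ... */
          }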
    • Correction in logic issuing call to account for change in job time limit · 9f5a7a0e
      Morris Jette authored
      I don't believe save_time_limit was redundant.  At least in this case:
      
      if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
          if (orig_time_limit == NO_VAL)
              orig_time_limit = comp_time_limit;
          job_ptr->time_limit = orig_time_limit;
      [...]
      
      So later, when updating the db,
      
          if (save_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn, job_ptr);

      will cause the db to be updated, while

          if (orig_time_limit != job_ptr->time_limit)
              jobacct_storage_g_job_start(acct_db_conn, job_ptr);

      will not, because job_ptr->time_limit now equals orig_time_limit.
    • Morris Jette
    • Record updated job time limit if modified by backfill · 46348f91
      Don Lipari authored
      Without this change, the job's time limit is not updated in the
      database when the backfill scheduler modifies the limit down toward
      --time-min.
  9. 14 Mar, 2013 4 commits
  10. 13 Mar, 2013 5 commits
  11. 12 Mar, 2013 2 commits
    • Minor format changes from previous commit · f5a89755
      Morris Jette authored
    • Fix scheduling if node in more than one partition · fcef06b4
      Magnus Jonsson authored
      I found a bug in cons_res/select_p_select_nodeinfo_set_all.
      
      If a node is part of two (or more) partitions, the code will only count the number of cores/CPUs in the partition that has the most running jobs on that node.
      
      Patch attached to fix the problem.
      
      I also added a new function to bitstring to count the number of bits in a range (bit_set_count_range) and made a minor improvement to bit_set_count while reviewing the range version.
      
      Best regards,
      Magnus
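      As an aside on the bit_set_count_range mentioned above, a minimal
      sketch of what such a helper does, assuming a bare word-array
      bitmap rather than SLURM's actual bitstr_t type (names and
      representation are illustrative only):

          #include <stdint.h>

          /* Count the set bits at positions [start, end). A tuned
           * version would mask the partial words at each end of the
           * range and use a word-at-a-time population count between
           * them instead of testing bit by bit. */
          static int count_bits_in_range(const uint64_t *words,
                                         int start, int end)
          {
              int count = 0;
              for (int i = start; i < end; i++) {
                  if (words[i / 64] & ((uint64_t)1 << (i % 64)))
                      count++;
              }
              return count;
          }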
  12. 11 Mar, 2013 2 commits