Commits · 5cbe0429c205f81fab50d85884964ddaa904345a · Manuel G. Marciani / ces_slurm_simulator

22 Mar, 2013 2 commits
- sched/backfill -fix to new front-end ACL support · 5cbe0429
  Morris Jette authored Mar 22, 2013
  
  5cbe0429
- Add Allow/Deny Groups/User fields to front end node configuration · 9a127c23
  Morris Jette authored Mar 21, 2013
  
  9a127c23
21 Mar, 2013 3 commits
- Correction to failing node callback function · 03243a75
  Morris Jette authored Mar 21, 2013
  
  03243a75
- Add another callback for slurmctld plugins · 7afa5561
  Morris Jette authored Mar 21, 2013
  
  7afa5561
- Change time limit on a test in case default partition time limit is small · d3be6669
  jette authored Mar 21, 2013
  
  d3be6669
20 Mar, 2013 9 commits
- Remove slurmctld/sock_insec plugin · 274117c9
  Morris Jette authored Mar 20, 2013
  
  274117c9
- add username to the filename pattern in the batch script · c4d93160
  Luis Cabellos authored Mar 20, 2013
  
  c4d93160
- Merge branch 'slurm-2.5' · bde80059
  Morris Jette authored Mar 20, 2013
  
  bde80059
- [PATCH] fix of job requiring contiguous nodes can not run · e416e35f
  Hongjia Cao authored Mar 20, 2013
  
  e416e35f
- SlurmDBD - fix to allow user root along with the slurm user to register a cluster. · 485cb062
  Danny Auble authored Mar 20, 2013
  
  485cb062
- Merge branch 'slurm-2.5' · 2da3b228
  jette authored Mar 19, 2013
  
  2da3b228
- Decrease time limit in a test in case of small partition time limit · e0020ed1
  jette authored Mar 19, 2013
  
  e0020ed1
- Add more logging information to a test · b912fad0
  jette authored Mar 19, 2013
  
  b912fad0
- initialize timer string to avoid garbage in log messages · 73996996
  Morris Jette authored Mar 19, 2013
  
  73996996
19 Mar, 2013 11 commits

Merge branch 'slurm-2.5' · 4322b420
Morris Jette authored Mar 19, 2013
```
Conflicts:
	src/plugins/sched/backfill/backfill.c
```
4322b420
Log when a job's time limit is changed by backfill scheduling · 03ad76cf
Don Lipari authored Mar 19, 2013

03ad76cf
Select/cons_res - Tighter packing of job allocations on sockets. · 7fcdc7e5
Morris Jette authored Mar 19, 2013

7fcdc7e5
change select() to poll() in waiting for a socket to be readable · 3175cf91
Hongjia Cao authored Mar 19, 2013
```
select()/FD_ISSET() does not work for file descriptors larger than 1023.
```
3175cf91
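
The limitation named in the 3175cf91 message comes from `select()` using a fixed-size `fd_set` bitmask (commonly `FD_SETSIZE` = 1024 entries), whereas `poll()` takes an array of descriptors with no such bound. The following is a minimal sketch of waiting for a socket to become readable with `poll()`. The helper names `wait_readable` and `read_timeout` are hypothetical and are not taken from the Slurm source.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the Slurm source: wait until fd is
 * readable or timeout_ms elapses.
 * Returns 1 if readable (or the peer hung up), 0 on timeout, -1 on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;            /* any valid descriptor, including those >= 1024 */
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        /* restart (with the full timeout) if a signal interrupted the wait */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;          /* 0 = timeout, -1 = error */
    return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}

/*
 * Hypothetical wrapper: read up to len bytes, waiting at most timeout_ms.
 * Returns bytes read, 0 on timeout or end of file, -1 on error.
 */
static ssize_t read_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    int rc = wait_readable(fd, timeout_ms);

    if (rc <= 0)
        return rc;
    return read(fd, buf, len);
}
```

Because the descriptor is passed by value in a `struct pollfd`, the same call works whether the process has a handful of open files or several thousand.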
Note nature of latest change · 8e038b5c
Morris Jette authored Mar 19, 2013

8e038b5c

fix of idle nodes cannot be allocated · 4ea9850a

Hongjia Cao authored Mar 19, 2013

avoid add/remove node resource of job if the node is lost by resize

I found another case in which an idle node cannot be allocated. It can be
reproduced as follows:

1. Run a job with the -k option:

    [root@mn0 ~]# srun -w cn[18-28] -k sleep 1000
    srun: error: Node failure on cn28
    srun: error: Node failure on cn28
    srun: error: cn28: task 10: Killed
    ^Csrun: interrupt (one more within 1 sec to abort)
    srun: tasks 0-9: running
    srun: task 10: exited abnormally
    ^Csrun: sending Ctrl-C to job 106120.0
    srun: Job step aborted: Waiting up to 2 seconds for job step to finish.

2. Set a node down and then set it idle:

    [root@mn0 ~]# scontrol update nodename=cn28 state=down reason="hjcao test"
    [root@mn0 ~]# scontrol update nodename=cn28 state=idle

3. Restart slurmctld:

    [root@mn0 ~]# service slurm restart
    stopping slurmctld:                                        [  OK  ]
    slurmctld is stopped
    starting slurmctld:                                        [  OK  ]

4. Cancel the job.

The node that was set down is then left unavailable:

    [root@mn0 ~]# sinfo -n cn[18-28]
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    work*        up   infinite     11   idle cn[18-28]

    [root@mn0 ~]# srun -w cn[18-28] hostname
    srun: job 106122 queued and waiting for resources

    [root@mn0 slurm]# grep cn28 slurmctld.log
    [2013-03-18T15:28:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use
    [2013-03-18T15:29:02+08:00] debug3: cons_res: _vns: node cn28 in exclusive use

I attempted to fix this with the attached patch. Please review it.

4ea9850a

Merge branch 'slurm-2.5' · 6dd90805
Morris Jette authored Mar 19, 2013

6dd90805

Correction in logic issuing call to account for change in job time limit · 9f5a7a0e

Morris Jette authored Mar 19, 2013

I don't believe save_time_limit was redundant.  At least in this case:

    if (qos_ptr && (qos_ptr->flags & QOS_FLAG_NO_RESERVE)) {
        if (orig_time_limit == NO_VAL)
            orig_time_limit = comp_time_limit;
        job_ptr->time_limit = orig_time_limit;
    [...]

So later, when updating the db,

    if (save_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn,
                                    job_ptr);

will cause the db to be updated, while,

    if (orig_time_limit != job_ptr->time_limit)
        jobacct_storage_g_job_start(acct_db_conn,
                                    job_ptr);

will not because job_ptr->time_limit now equals orig_time_limit.

9f5a7a0e
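
The argument in the 9f5a7a0e message can be checked with a small standalone model. This is only a sketch under stated assumptions: `struct job`, `struct update_check`, and `check_update` are hypothetical stand-ins, the constant values are placeholders rather than Slurm's definitions, and `save_time_limit` is assumed to hold the job's time limit as it was just before the quoted branch ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values for illustration only; the real definitions live in
 * the Slurm headers and are not reproduced here. */
#define NO_VAL              0xfffffffeu
#define QOS_FLAG_NO_RESERVE 0x1u

/* Hypothetical stand-in for the job record. */
struct job {
    uint32_t time_limit;
};

/* Which of the two comparisons would trigger the accounting update. */
struct update_check {
    bool by_save;   /* save_time_limit != job_ptr->time_limit */
    bool by_orig;   /* orig_time_limit != job_ptr->time_limit */
};

/* Models the quoted branch, then evaluates both candidate comparisons. */
static struct update_check check_update(struct job *job_ptr,
                                        uint32_t qos_flags,
                                        uint32_t orig_time_limit,
                                        uint32_t comp_time_limit,
                                        uint32_t save_time_limit)
{
    struct update_check res;

    assert(job_ptr != NULL);
    if (qos_flags & QOS_FLAG_NO_RESERVE) {
        if (orig_time_limit == NO_VAL)
            orig_time_limit = comp_time_limit;
        /* After this line the job's limit equals orig_time_limit. */
        job_ptr->time_limit = orig_time_limit;
    }
    res.by_save = (save_time_limit != job_ptr->time_limit);
    res.by_orig = (orig_time_limit != job_ptr->time_limit);
    return res;
}
```

With a job whose limit was 30 before the branch and an `orig_time_limit` of 60, the branch resets the limit to 60. The `save_time_limit` comparison then sees 30 versus 60 and fires, while the `orig_time_limit` comparison sees 60 versus 60 and does not, which is the case the message describes.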

Merge branch 'slurm-2.5' · 3f24195a

Morris Jette authored Mar 19, 2013

Conflicts:
	src/db_api/cluster_report_functions.c
	src/plugins/sched/backfill/backfill.c

3f24195a

Do not report error when job step terminates while sstat is running · 4cb6137c
Morris Jette authored Mar 19, 2013

4cb6137c

Record updated job time limit if modified by backfill · 46348f91

Don Lipari authored Mar 14, 2013

With this change, if the job's time limit is modified down
toward --time-min by the backfill scheduler, the job's
time limit is also updated in the database.

46348f91

18 Mar, 2013 1 commit
- Improvements for fault-tolerance work · 465fc898
  Morris Jette authored Mar 18, 2013
  
  465fc898
14 Mar, 2013 7 commits
- sreport - Fix by adding planned down time to utilization reports. · dced5e7f
  Danny Auble authored Mar 14, 2013
  
  dced5e7f
- Add some diagrams to the IBM PE documentation · 2868384d
  Morris Jette authored Mar 14, 2013
  
  2868384d
- Merge remote-tracking branch 'origin/slurm-2.5' · 363dfb95
  Danny Auble authored Mar 14, 2013
  
  363dfb95
- Accounting - more checks for strings with a possible `'` in them. · ff021de1
  Danny Auble authored Mar 14, 2013
  
  ff021de1
- CRAY - Fix SLURM_TASKS_PER_NODE to be set correctly. · 5c370edb
  Danny Auble authored Mar 11, 2013
  
  5c370edb
- Remove temporary testing logic for timer logging · feb46f2b
  Morris Jette authored Mar 14, 2013
  
  feb46f2b
- Change default log time to ISO 8601 (remove time zone) · 28beff27
  Morris Jette authored Mar 14, 2013
```
Add milliseconds to default log message header (both RFC 5424 and ISO 8601
time formats). Disable milliseconds logging using the configure
parameter "--disable-log-time-msec". Default time format changes to
ISO 8601 (without time zone information). Specify "--enable-rfc5424time"
to restore the time zone information.
```
  28beff27
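
The two time stamp layouts named in the 28beff27 message, ISO 8601 without zone information and RFC 5424 with a numeric UTC offset, both with milliseconds, can be illustrated with a short formatter. This is a sketch only: the function names `fmt_iso8601_msec` and `fmt_rfc5424_msec` are hypothetical, and it does not reproduce how the Slurm logging code builds its header. The target layout follows the sample stamp "2013-03-13T14:28:17.767-07:00" quoted in the 00c46ce5 message.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * ISO 8601 without zone information, e.g. "2013-03-13T14:28:17.767".
 * Returns 0 on success, -1 if buf is too small.
 */
static int fmt_iso8601_msec(char *buf, size_t len,
                            const struct tm *tm, int msec)
{
    size_t n;
    int m;

    assert(msec >= 0 && msec < 1000);
    n = strftime(buf, len, "%Y-%m-%dT%H:%M:%S", tm);
    if (n == 0)
        return -1;
    /* append the millisecond field after the seconds */
    m = snprintf(buf + n, len - n, ".%03d", msec);
    return (m < 0 || (size_t)m >= len - n) ? -1 : 0;
}

/*
 * RFC 5424 style: the same stamp followed by a numeric UTC offset,
 * e.g. "2013-03-13T14:28:17.767-07:00".
 * Returns 0 on success, -1 if buf is too small.
 */
static int fmt_rfc5424_msec(char *buf, size_t len, const struct tm *tm,
                            int msec, long utc_off_sec)
{
    long abs_off = labs(utc_off_sec);
    size_t n;
    int m;

    if (fmt_iso8601_msec(buf, len, tm, msec) < 0)
        return -1;
    n = strlen(buf);
    m = snprintf(buf + n, len - n, "%c%02ld:%02ld",
                 utc_off_sec < 0 ? '-' : '+',
                 abs_off / 3600, (abs_off % 3600) / 60);
    return (m < 0 || (size_t)m >= len - n) ? -1 : 0;
}
```

In a real logger the broken-down time and the millisecond count would typically come from `gettimeofday()` and `localtime_r()`. The sketch takes the UTC offset as an explicit argument so that it does not depend on the local time zone.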
13 Mar, 2013 7 commits
- Add "--enable-rfc5424time-secs" configure parameter · 00c46ce5
  Morris Jette authored Mar 13, 2013
```
Add milliseconds to default log message header with the (default)
RFC5424 time format. Disable milliseconds logging using the configure
parameter "--enable-rfc5424time-secs". Sample time stamp format is as
follows: "2013-03-13T14:28:17.767-07:00".
```
  00c46ce5
- Add msec resolution to RFC5424 time stamps in the logs · a36b680f
  David Bigagli authored Mar 13, 2013
  
  a36b680f
- Merge branch 'slurm-2.5' · 169cfd68
  Morris Jette authored Mar 13, 2013
```
Conflicts:
	doc/man/man1/sbatch.1
```
  169cfd68
- Improve error checking for step allocation with min and max node count · 7223d0d2
  Morris Jette authored Mar 13, 2013
  
  7223d0d2
- Correction to error returned by step request error for too many CPUs · 36df0bbf
  Morris Jette authored Mar 13, 2013
```
If a step requests more CPUs than are possible in the specified node count
of the job allocation, then return ESLURM_TOO_MANY_REQUESTED_CPUS rather
than returning ESLURM_NODES_BUSY and retrying.
```
  36df0bbf
- Add comments to describe function arguments · 4a37e469
  Morris Jette authored Mar 13, 2013
  
  4a37e469
- Minor documentation update for select plugin · a71f95b2
  Danny Auble authored Mar 13, 2013
  
  a71f95b2
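
The distinction drawn in the 36df0bbf message above is between a request that can never fit and one that merely has to wait. The sketch below illustrates it. The enum, the function `check_step_cpus`, and the assumption of a uniform CPU count per node are all hypothetical simplifications. The result names echo the error codes in the commit message, but their values are placeholders, and the real step allocation logic is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result codes; values are placeholders, not Slurm's. */
enum step_rc {
    STEP_OK = 0,
    STEP_NODES_BUSY,                /* may succeed later: caller can retry */
    STEP_TOO_MANY_REQUESTED_CPUS    /* can never succeed: do not retry */
};

/*
 * Hypothetical check, assuming every node in the allocation has the same
 * number of CPUs. idle_cpus is the number of CPUs currently free within
 * the requested node count.
 */
static enum step_rc check_step_cpus(uint32_t cpus_requested,
                                    uint32_t node_cnt,
                                    uint32_t cpus_per_node,
                                    uint32_t idle_cpus)
{
    /* 64-bit product so a large node count cannot overflow */
    uint64_t capacity = (uint64_t)node_cnt * cpus_per_node;

    assert(idle_cpus <= capacity);
    if (cpus_requested > capacity)
        return STEP_TOO_MANY_REQUESTED_CPUS;
    if (cpus_requested > idle_cpus)
        return STEP_NODES_BUSY;
    return STEP_OK;
}
```

Checking capacity before availability is what separates the two outcomes: a request for 33 CPUs on two 16-CPU nodes is rejected outright, whereas a request for 32 CPUs on busy nodes is reported as busy and can be retried.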