Commits · 49770e20b6c18e4aedd3fe2567505bbcc8247451 · Manuel G. Marciani / ces_slurm_simulator

03 Mar, 2015 1 commit

Abort I/O for debugged app launch fail · 49770e20

Morris Jette authored Mar 02, 2015

For a job running under a debugger, if the exec of the task fails, then
cancel its I/O and abort immediately rather than waiting 60 seconds for
the I/O timeout.

49770e20

02 Mar, 2015 4 commits
- Change the level of debug messages. · 971d0021
  David Bigagli authored Mar 02, 2015
  
  971d0021
- Correct the initialization of QOS MinCPUs per job limit. · 862cc80b
  David Bigagli authored Mar 02, 2015
  
  862cc80b
- minor doc update · 94149da1
  Danny Auble authored Mar 02, 2015
  
  94149da1
- update meetings · c70d5091
  Danny Auble authored Mar 02, 2015
  
  c70d5091
27 Feb, 2015 5 commits
- Update sched plugin web description · eee7bf80
  Nicolas Joly authored Feb 27, 2015
```
Add missing arguments to slurm_sched_p_newalloc/slurm_sched_p_freealloc
documentation.
```
  eee7bf80
- Use consistent case style for job accounting fields description. · a5179e9a
  Nicolas Joly authored Feb 27, 2015
  
  a5179e9a
- Small typo in sacct man page. · 18a82a34
  Nicolas Joly authored Feb 27, 2015
  
  18a82a34
- Cosmetic mods, no change in logic · 60841159
  Morris Jette authored Feb 27, 2015
  
  60841159
- Fix job getting EligibleTime set before meeting dependency requirements. · ab773f65
  Brian Christiansen authored Feb 27, 2015
```
Bug 1476
```
  ab773f65
26 Feb, 2015 2 commits
- Account all CPUs to the batch steps. · cc8c2e3e
  David Bigagli authored Feb 26, 2015
  
  cc8c2e3e
- task/affinity clean up · 7b313990
  Morris Jette authored Feb 25, 2015
```
Improved logging and some code restructuring. No change in logic.
```
  7b313990
25 Feb, 2015 4 commits
- Revert "Remove unused variable." · 663ec8f2
  David Bigagli authored Feb 25, 2015
```
This reverts commit e24a418b.
```
  663ec8f2
- Remove unused variable. · e24a418b
  David Bigagli authored Feb 25, 2015
  
  e24a418b
- Add job_submit build instructions · ee90e55a
  Morris Jette authored Feb 25, 2015
  
  ee90e55a
- select/alps - Reverse .my.cnf search order · 96363d42
  Morris Jette authored Feb 25, 2015
```
This is a variation on commit 5391b8cc
Check $HOME/.my.cnf last rather than first to follow more standard search order
```
  96363d42
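The search order described in this commit can be sketched as follows. This is only an illustration, not the select/alps code: the system-wide paths are copied from the array quoted in commit 5391b8cc, while `build_search_order`, `MAX_CANDIDATES`, and `MAX_PATH_LEN` are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CANDIDATES 8	/* hypothetical limit for this sketch */
#define MAX_PATH_LEN 256	/* hypothetical limit for this sketch */

/* System-wide locations, as quoted in commit 5391b8cc. */
static const char *system_conf_paths[] = {
	"/etc/my.cnf", "/etc/opt/cray/MySQL/my.cnf",
	"/etc/mysql/my.cnf", NULL
};

/*
 * Hypothetical helper: fill out[] with candidate files in the order
 * they should be tried. System-wide files come first and the per-user
 * $HOME/.my.cnf comes last. Returns the number of candidates written.
 */
int build_search_order(const char *home, char out[][MAX_PATH_LEN], int max)
{
	int i, n = 0;

	assert(out != NULL && max > 0);
	for (i = 0; system_conf_paths[i] && n < max; i++)
		snprintf(out[n++], MAX_PATH_LEN, "%s", system_conf_paths[i]);
	if (home && n < max)
		snprintf(out[n++], MAX_PATH_LEN, "%s/.my.cnf", home);
	return n;
}
```

A caller would pass the value of `getenv("HOME")` as `home` and try each returned path in turn, stopping at the first readable file.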
24 Feb, 2015 5 commits
- Fix sprio showing wrong priority for job arrays until priority is recalculated. · 423029d8
  Brian Christiansen authored Feb 24, 2015
```
Bug 1469
```
  423029d8
- cray/basil, read mysql creds from /root/.my.cnf · 5391b8cc
  Nina Suvanphim authored Feb 24, 2015
```
The /root/.my.cnf would typically contain the login credentials for
root.  If those are needed for Slurm, then it should be checking
that directory.

(In reply to Nina Suvanphim from comment #0)
...
> const char *default_conf_paths[] = {
> "/root/.my.cnf", <<<<<<<<<<<<<<<<<------- add this line
> "/etc/my.cnf", "/etc/opt/cray/MySQL/my.cnf",
> "/etc/mysql/my.cnf", NULL };

I'll also note that typically the $HOME/.my.cnf file would be
checked last rather than first.
```
  5391b8cc
- Add some SUG photo links · 3d67a89a
  Morris Jette authored Feb 24, 2015
  
  3d67a89a
- Fix code for Apple computers, where SOL_TCP is not defined · ac0343be
  Danny Auble authored Feb 24, 2015
  
  ac0343be
- Fix wrong variables used in the wrapper functions needed for systems that · 8d0c9901
  Danny Auble authored Feb 24, 2015
```
don't support strong_alias
```
  8d0c9901
20 Feb, 2015 1 commit

Fix to GRES NoConsume logic · 33c48ac5

Dorian Krause authored Feb 20, 2015

We came across the following error message in the slurmctld logs when
using non-consumable resources:

error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count is 0

The error comes from _job_dealloc():

node_gres_data=0x7f8a18000b70, node_offset=0, gres_name=0x1999e00
"potion", job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:3980
(job_gres_list=0x199b7c0, node_gres_list=0x199bc38, node_offset=0,
job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:4190
job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test", remove_all=true)
    at select_linear.c:2091
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1, req_nodes=1,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40) at
select_linear.c:3176
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
exc_core_bitmap=0x0) at select_linear.c:3390
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
exc_core_bitmap=0x0) at node_select.c:588
avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1, req_nodes=1,
exc_core_bitmap=0x0)
    at backfill.c:367

The cause of this problem is that _node_state_dup() in gres.c does not
duplicate the no_consume flag.
The cr_ptr passed to _rm_job_from_nodes() is created with _dup_cr()
which calls _node_state_dup().

Below is a simple patch to fix the problem. A "future-proof" alternative
might be to memcpy() from gres_ptr to new_gres and
only handle pointers separately.

33c48ac5
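The "future-proof" alternative mentioned in this commit message (memcpy() the whole record, then handle pointers separately) might look roughly like the sketch below. The struct is a hypothetical, heavily reduced stand-in for the per-node GRES state in gres.c, and `node_state_dup` here is not the real `_node_state_dup()`; only the `no_consume` flag and the copy strategy come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the per-node GRES state. */
struct node_gres_state {
	uint64_t gres_cnt_avail;
	uint64_t gres_cnt_alloc;
	bool no_consume;	/* the flag the original dup forgot */
	char *name;		/* heap pointer: must not be shared */
};

/*
 * Duplicate src. memcpy() carries over every scalar field, including
 * any added later, so only pointer members need explicit handling.
 * Returns NULL on allocation failure.
 */
struct node_gres_state *node_state_dup(const struct node_gres_state *src)
{
	struct node_gres_state *dst;

	assert(src != NULL);
	dst = malloc(sizeof(*dst));
	if (!dst)
		return NULL;
	memcpy(dst, src, sizeof(*dst));

	/* Pointer members get their own deep copy. */
	dst->name = NULL;
	if (src->name) {
		size_t len = strlen(src->name) + 1;

		dst->name = malloc(len);
		if (!dst->name) {
			free(dst);
			return NULL;
		}
		memcpy(dst->name, src->name, len);
	}
	return dst;
}
```

The trade-off is that a newly added pointer field would be copied shallowly by default, so each pointer member still has to be listed by hand after the memcpy().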

19 Feb, 2015 3 commits
- Load lua-5.2 library if using lua5.2 for lua job submit plugin. · 408c108e
  Brian Christiansen authored Feb 19, 2015
```
Bug 1471
```
  408c108e
- Remove vestigial/wrong documentation · 2e7bed24
  Morris Jette authored Feb 19, 2015
```
"If  you  specify a maximum node count and the host list contains more
nodes, the extra node names will be silently ignored."
Not so.
```
  2e7bed24
- MySQL - Fix potential issue when PrivateData=Usage and a normal user · 9a03f2a5
  Danny Auble authored Feb 18, 2015
```
runs certain sreport reports.
```
  9a03f2a5
18 Feb, 2015 2 commits
- Add SLURM_JOB_GPUS to Prolog · 2e95c20b
  Morris Jette authored Feb 17, 2015
```
Add SLURM_JOB_GPUS environment variable to those available in Prolog.
Also add list of environment variables available in the various
prologs and epilogs on the web page.
bug 1458
```
  2e95c20b
- Print FAIR_TREE in "scontrol show config" output for PriorityFlags. · 27eef95d
  Brian Christiansen authored Feb 17, 2015
  
  27eef95d
17 Feb, 2015 4 commits
- BGQ - Close very small window where a step could have been removed before the · c169c935
  Danny Auble authored Feb 17, 2015
```
runjob happened, and the step was part of an array.  This is an addition to
commit 49e0f5f2
```
  c169c935
- BGQ - Fix issue with job arrays not being handled correctly · 49e0f5f2
  Danny Auble authored Feb 17, 2015
```
in the runjob_mux plugin.
```
  49e0f5f2
- Update NEWS · 6984348d
  Brian Christiansen authored Feb 17, 2015
```
Bug 1461
Commit: 2e2d924e
```
  6984348d
- Prevent slurmdbd abort if node DOWN with NULL reason · 2e2d924e
  Morris Jette authored Feb 17, 2015
```
See bug 1461
```
  2e2d924e
13 Feb, 2015 2 commits

Fix squeue. · c13e8540
David Bigagli authored Feb 13, 2015

c13e8540

Avoid triggering accounting if node state unchanged · 23f84ace

Morris Jette authored Feb 12, 2015

If a call was made to change a node's state to the same state it
was already in and set its reason to the same value it already
had, then an accounting record was generated. If a script, say
NodeHealthCheck, is repeatedly setting a node state (say DRAIN),
it could generate a huge number of redundant accounting records.
This eliminates these redundant records.
Related to bug 1437

23f84ace
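The check described in this commit can be sketched as a comparison made before any accounting record is written. All names below (`node_record`, `node_state`, `node_update_is_redundant`) are hypothetical and not taken from the Slurm source; treating a NULL reason and an empty reason as equal is also an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node states for this sketch. */
enum node_state { STATE_IDLE, STATE_DRAIN, STATE_DOWN };

/* Hypothetical, reduced node record. */
struct node_record {
	enum node_state state;
	const char *reason;	/* may be NULL */
};

/* Assumption: a NULL reason and an empty reason count as equal. */
static bool same_reason(const char *a, const char *b)
{
	return strcmp(a ? a : "", b ? b : "") == 0;
}

/*
 * Return true if the requested update changes neither the state nor
 * the reason, in which case no accounting record is needed.
 */
bool node_update_is_redundant(const struct node_record *node,
			      enum node_state new_state,
			      const char *new_reason)
{
	assert(node != NULL);
	return node->state == new_state &&
	       same_reason(node->reason, new_reason);
}
```

With a check of this kind, a health-check script that sets DRAIN with the same reason every few minutes would produce one record for the first call and none afterwards.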

12 Feb, 2015 4 commits
- Start v14.11.5 NEWS file · 4531ab3f
  Morris Jette authored Feb 12, 2015
  
  4531ab3f
- Update META for v14.11.4 tag · 1b2c8e18
  Morris Jette authored Feb 12, 2015
  
  1b2c8e18
- Fix perlapi tests for libslurm perl module. · ea7a0c7c
  Brian Christiansen authored Feb 12, 2015
  
  ea7a0c7c
- Fix issue with "sreport cluster AccountUtilizationByUser" when using PrivateData=users. · 37b56085
  Brian Christiansen authored Feb 12, 2015
```
Bug 1446
```
  37b56085
11 Feb, 2015 1 commit
- MySQL - If a node state and reason are the same on a node state change · 1685ba56
  Danny Auble authored Feb 11, 2015
```
don't insert a new row in the event table.
```
  1685ba56
10 Feb, 2015 2 commits

Additional fix to 50e0c84f. · 50b43afd
Brian Christiansen authored Feb 09, 2015
```
uid's are 0 when associations are loaded.
```
50b43afd

Backfill scheduler bug on job's partition change · a0d12d0c

Morris Jette authored Feb 09, 2015

The backfill scheduler builds a queue of eligible job/partition
information and then proceeds to determine when and where those
jobs will start. The backfill scheduler can be configured to
periodically release locks in order to let other operations
take place. If the partition(s) associated with one of those
jobs changes during one of those periods, the job will still
be considered for scheduling in the old partition until the
backfill scheduler starts over with a new job/partition list.
This change to the backfill scheduler validates each job's
partition from the list against current information
(considering any partition changes).
See bug 1436

a0d12d0c
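The validation described in this commit amounts to re-checking each queued job/partition pair against the job's current partitions before acting on it. The sketch below uses hypothetical, simplified types (`part`, `job`, `queue_entry`) and a hypothetical helper name; it is not the backfill plugin's actual data model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARTS 4	/* hypothetical limit for this sketch */

/* Hypothetical, reduced partition and job records. */
struct part {
	const char *name;
};

struct job {
	unsigned int job_id;
	const struct part *parts[MAX_PARTS];	/* current partitions */
	int part_cnt;
};

/* One job/partition pair captured when the queue was built. */
struct queue_entry {
	const struct job *job;
	const struct part *part;
};

/*
 * Return true if the partition recorded in the queue entry is still
 * one of the job's partitions. Entries that fail this test would be
 * skipped rather than scheduled in a stale partition.
 */
bool queue_entry_still_valid(const struct queue_entry *e)
{
	int i;

	assert(e != NULL && e->job != NULL);
	for (i = 0; i < e->job->part_cnt; i++) {
		if (e->job->parts[i] == e->part)
			return true;
	}
	return false;
}
```

A scheduler loop built on this idea would call the check each time it resumes after releasing and reacquiring its locks, since that is the window in which a job's partition can change.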