1. 27 Feb, 2015 2 commits
    • Ensure prolog runs on job requeue · 42dc54ea
      Morris Jette authored
      Set the delay time for job requeue to the job credential lifetime (1200
      seconds by default). This ensures that the prolog runs on every node when
      a job is requeued, at the cost of slower launch for requeued jobs.
      Without this change, if a job is restarted within 1200 seconds, the nodes
      previously used would not run the prolog again, since the job ID is
      still seen as active (from the previous execution). It is also advisable
      to set the value of DEFAULT_EXPIRATION_WINDOW in src/common/slurm_cred.c
      to the lowest reasonable value. We need to add a new configuration
      parameter so this is easily changed in the future.
    • Display job's estimated NodeCount based off of partition's configured resources · ce32018a
      Brian Christiansen authored
      Display the job's estimated NodeCount based on the partition's configured
      resources rather than the whole system's.
      
      Bug 1478
  2. 26 Feb, 2015 2 commits
  3. 25 Feb, 2015 1 commit
  4. 24 Feb, 2015 4 commits
  5. 20 Feb, 2015 2 commits
    • Morris Jette authored · e7c61bdd
    • Fix to GRES NoConsume logic · 33c48ac5
      Dorian Krause authored
      We came across the following error message in the slurmctld logs when
      using non-consumable resources:
      
      error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count
      is 0
      
      The error comes from _job_dealloc(). The backtrace (leading function
      names were lost in the log) is:
      
        ... node_gres_data=0x7f8a18000b70, node_offset=0,
            gres_name=0x1999e00 "potion", job_id=46,
            node_name=0x1987ab0 "node1") at gres.c:3980
        ... (job_gres_list=0x199b7c0, node_gres_list=0x199bc38,
            node_offset=0, job_id=46,
            node_name=0x1987ab0 "node1") at gres.c:4190
        ... job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test",
            remove_all=true) at select_linear.c:2091
        ... bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1,
            req_nodes=1, preemptee_candidates=0x0,
            preemptee_job_list=0x7f8a2f910c40) at select_linear.c:3176
        ... bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1,
            mode=2, preemptee_candidates=0x0,
            preemptee_job_list=0x7f8a2f910c40,
            exc_core_bitmap=0x0) at select_linear.c:3390
        ... bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1,
            mode=2, preemptee_candidates=0x0,
            preemptee_job_list=0x7f8a2f910c40,
            exc_core_bitmap=0x0) at node_select.c:588
        ... avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1,
            req_nodes=1, exc_core_bitmap=0x0) at backfill.c:367
      
      The cause of this problem is that _node_state_dup() in gres.c does not
      duplicate the no_consume flag.
      The cr_ptr passed to _rm_job_from_nodes() is created with _dup_cr()
      which calls _node_state_dup().
      
      Below is a simple patch to fix the problem. A "future-proof" alternative
      might be to memcpy() from gres_ptr to new_gres and
      only handle pointers separately.
  6. 19 Feb, 2015 2 commits
  7. 18 Feb, 2015 5 commits
  8. 17 Feb, 2015 2 commits
  9. 14 Feb, 2015 1 commit
  10. 13 Feb, 2015 1 commit
  11. 12 Feb, 2015 4 commits
  12. 11 Feb, 2015 2 commits
  13. 10 Feb, 2015 2 commits
  14. 09 Feb, 2015 4 commits
  15. 06 Feb, 2015 2 commits
  16. 05 Feb, 2015 1 commit
  17. 04 Feb, 2015 3 commits