Commits · 630afd8f599106f1ca22d0195724b1843a84c704 · Manuel G. Marciani / ces_slurm_simulator

22 Sep, 2016 6 commits
- Fix variable used in sizeof function · 630afd8f
  Morris Jette authored Sep 22, 2016
  
  630afd8f
- [PATCH 2/2] drop call to PMI2_Job_GetId from testpmixring.c · cb572e38
  Adam Moody authored Sep 22, 2016
  
  cb572e38
- [PATCH 1/2] define PMIX_Ring behavior in singleton mode · a4155bbd
  Adam Moody authored Sep 22, 2016
  
  a4155bbd
- Fix typos in docs and "sacct --help". · fbac3e4b
  Gennaro Oliva authored Sep 22, 2016
  
  fbac3e4b
- Updates to KNL web page · a36b4035
  Morris Jette authored Sep 21, 2016
  
  a36b4035
- Fix squeue filter by job license when a job has requested more than 1 · cbd1ffad
  Alejandro Sanchez authored Sep 21, 2016
```
license of a certain type.
```
  cbd1ffad
21 Sep, 2016 5 commits

Docs - fix two typos. · f835d8b6
Tim Wickberg authored Sep 21, 2016

f835d8b6

Increase default CapmcTimeout from 10 to 60 seconds · 1fe5c7cc

Morris Jette authored Sep 21, 2016

node_features/knl_cray plugin: Increase default CapmcTimeout parameter from
    10 to 60 seconds.
bug 3100

1fe5c7cc

capmc_suspend/resume fail rather than operate on individual nodes · f07482fd

Morris Jette authored Sep 20, 2016

capmc_suspend/resume - If a request to modify NUMA or MCDRAM state on a set of
    nodes or to reboot a set of nodes fails, then just requeue the job and abort the
    entire operation rather than trying to operate on individual nodes.
bug 3100

f07482fd

Allow clearing of node PowerUp state flag · 61b14031
Morris Jette authored Sep 20, 2016
```
Allow a node's PowerUp state flag to be cleared using update_node RPC.
bug 3100
```
61b14031

Pass SLURM_JOB_ID to ResumeProgram · 45ca1dd4

Morris Jette authored Sep 20, 2016

When powering up a node to change its state (e.g. KNL NUMA or MCDRAM mode)
    then pass to the ResumeProgram the job ID assigned to the nodes in the
    SLURM_JOB_ID environment variable.
bug 3100

45ca1dd4
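
A ResumeProgram could pick the job ID up from its environment roughly as follows. This is a minimal sketch in C: only the `SLURM_JOB_ID` variable name comes from the commit message; the helper name `job_id_from_env` and the choice to return -1 when the variable is missing or malformed are assumptions made for illustration.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: read the job ID that slurmctld is described as
 * passing to the ResumeProgram in the SLURM_JOB_ID environment variable.
 * Returns the job ID, or -1 if the variable is unset, empty, or not a
 * non-negative decimal number.
 */
long job_id_from_env(void)
{
    const char *val = getenv("SLURM_JOB_ID");
    char *end = NULL;
    long id;

    if (val == NULL || *val == '\0')
        return -1;

    id = strtol(val, &end, 10);
    if (*end != '\0' || id < 0)
        return -1;

    return id;
}
```

A program using this helper would likely treat -1 as "node is being resumed for a reason other than a job's feature request", though that interpretation is also an assumption.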

20 Sep, 2016 1 commit

Remove error message on valid state · 85c136bc

Morris Jette authored Sep 19, 2016

Don't log error for job end_time being zero if node health check is still
    running.
bug 3053

85c136bc

19 Sep, 2016 1 commit
- Add FAQ describing how to colorize squeue output · 31c87fce
  Damien François authored Sep 19, 2016
  
  31c87fce
17 Sep, 2016 1 commit

Restore ability to manually power down nodes · da722a89

Morris Jette authored Sep 16, 2016

Restore ability to manually power down nodes, broken in 15.08.12
in commit b4904661

The patch introduced in commit b4904661 (not powering down dead node) has a bad side effect. Adding the "(node_ptr->last_idle != 0)" condition prevents nodes from being powered down with the following command:

scontrol update nodename=nX state=power_down

because the state update function relies on zeroing the "last_idle" variable when a power_down is requested (see src/slurmctld/node_mgr.c, line 1589).

Reverting this commit should solve the problem, but I'll let you decide.

Didier GAZEN

da722a89
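
The interaction described in the message above can be modelled in a few lines of C. This is a toy sketch, not the Slurm source: the struct, the function names, and the idle-time comparison are assumptions. Only the `last_idle` field, the fact that a manual power_down request zeroes it, and the `last_idle != 0` condition come from the commit message.

```c
#include <stdbool.h>
#include <time.h>

/* Toy stand-in for the node record; only last_idle matters here. */
struct toy_power_node {
    time_t last_idle;
};

/* A manual "state=power_down" request is described as zeroing last_idle. */
void toy_request_power_down(struct toy_power_node *n)
{
    n->last_idle = 0;
}

/* Assumed idle test WITH the extra condition added by commit b4904661. */
bool toy_eligible_with_condition(const struct toy_power_node *n,
                                 time_t now, time_t idle_time)
{
    return (n->last_idle != 0) && (n->last_idle < now - idle_time);
}

/* The same assumed test after reverting that condition. */
bool toy_eligible_without_condition(const struct toy_power_node *n,
                                    time_t now, time_t idle_time)
{
    return n->last_idle < now - idle_time;
}
```

In this model a node whose `last_idle` was just zeroed by a manual request never passes the first test, but always passes the second once `now - idle_time` is positive, which matches the reported symptom.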

16 Sep, 2016 2 commits
- Update HDF5 macro used for configure to new version · fded45ed
  Gennaro Oliva authored Sep 16, 2016
```
bug 3093
```
  fded45ed
- Update KNL modes for out-of-band reboot · 3a465f80
  Morris Jette authored Sep 15, 2016
```
node_features/knl_cray: If a node is rebooted outside of Slurm's direction,
    update its active features with current MCDRAM and NUMA mode information.
bug 3071
```
  3a465f80
15 Sep, 2016 3 commits
- node_features/knl_cray: Fix MCDRAM state source · bbaa48b2
  Morris Jette authored Sep 15, 2016
```
Fix race condition that could result in MCDRAM state information coming
  from capmc rather than cnselect (used state for next boot rather than
  latest boot).
bug 3080
```
  bbaa48b2
- Fix slurm.conf MAX_TRES flag entry. · 28dc29f5
  Tim Wickberg authored Sep 15, 2016
  
  28dc29f5
- Docs - assorted spelling corrections. · 09d20dc3
  Nicolas Joly authored Sep 15, 2016
  
  09d20dc3
14 Sep, 2016 3 commits
- Silence srun warning message when intentionally lowering task-per-node count for a step. · daacf5af
  Alejandro Sanchez authored Sep 14, 2016
```
No functional change, just silencing the warning message in this instance.

Bug 3079.
```
  daacf5af
- Docs - use 'megabytes' instead of 'MegaBytes'. · 9f2d00c5
  Tim Wickberg authored Sep 14, 2016
```
Fix some whitespace on line endings while here.
```
  9f2d00c5
- Docs - mention SLURM_MEM_PER_CPU/NODE environment variables and allowed suffixes. · f62d3e70
  Alejandro Sanchez authored Sep 14, 2016
```
Bug 3073.
```
  f62d3e70
12 Sep, 2016 7 commits
- Merge branch 'slurm-16.05' of https://github.com/SchedMD/slurm into slurm-16.05 · e145d604
  Morris Jette authored Sep 12, 2016
  
  e145d604
- Updates to cpu and mem binding for man page · 1feb808e
  Morris Jette authored Sep 12, 2016
  
  1feb808e
- Update --mem_bind option description · e2f63dc3
  Morris Jette authored Sep 12, 2016
```
bug 3065
```
  e2f63dc3
- SLUG Agenda - shuffle around abstracts to match new order. · 9f12f6f4
  Tim Wickberg authored Sep 12, 2016
  
  9f12f6f4
- Swap Slurm Overview and Slurm 16.05 Overview in agenda. · d16c62ed
  Tim Wickberg authored Sep 12, 2016
  
  d16c62ed
- Update SLUG16 agenda. · 02edfef8
  Tim Wickberg authored Sep 12, 2016
```
Add Slurm overview by Alex on first day, move KNL to second morning,
shift roadmap to right after lunch.
```
  02edfef8
- Expand description of NUMA memory binding and KNL MCDRAM · bd64287e
  Morris Jette authored Sep 12, 2016
```
bug 3065
```
  bd64287e
09 Sep, 2016 7 commits

Cap job termination message staggering at 5 seconds · 3e2251cb

Morris Jette authored Sep 09, 2016

Previous cap was 2 sec (default TCP timeout) times the node count
  and divided by 1000. A 9000 node job would have the messages
  spread out over 18 seconds. This change caps the spread at
  5 seconds and assumes the normal TCP logic can handle the rest.
bug 3044

3e2251cb
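
The arithmetic in the message above can be checked with a small C sketch. The constant and function names are hypothetical; only the 2 second timeout, the division by 1000, and the 5 second cap are taken from the commit message.

```c
#include <assert.h>

/* Illustrative constants, not Slurm identifiers. */
#define TOY_TCP_TIMEOUT_MSEC 2000L  /* "2 sec (default TCP timeout)" */
#define TOY_MAX_SPREAD_MSEC  5000L  /* new cap of 5 seconds */

/* Previous spread: TCP timeout times the node count, divided by 1000. */
long toy_old_spread_msec(long node_count)
{
    assert(node_count >= 0);
    return TOY_TCP_TIMEOUT_MSEC * node_count / 1000;
}

/* New spread: the same value, capped at 5 seconds. */
long toy_new_spread_msec(long node_count)
{
    long spread = toy_old_spread_msec(node_count);

    return (spread > TOY_MAX_SPREAD_MSEC) ? TOY_MAX_SPREAD_MSEC : spread;
}
```

For a 9000 node job the old formula gives 18000 ms (the 18 seconds quoted above) and the capped version gives 5000 ms; jobs of 2500 nodes or fewer are unaffected by the cap under this model.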

Don't generate srun hostlist for log for huge task count · 175e9dea

Morris Jette authored Sep 09, 2016

If the overhead of determining the hostlist for a given task list
is too high, then report a hostlist of "Unknown" instead. If the
overhead is too high, then srun will become unresponsive and communications
will time out or fail.
bug 3044

175e9dea

Cosmetic change, no change in logic · 0d8d1b34
Morris Jette authored Sep 09, 2016

0d8d1b34

Modify srun task completion handling · 166e3bec

Morris Jette authored Sep 09, 2016

Modify srun task completion handling to only build the task/node string for
    logging purposes if it is needed. Modified for performance purposes.
bug 3044

166e3bec
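
The general pattern behind this change, skipping string construction when the result would never be logged, might look like the sketch below. Everything here is an assumption for illustration: the level values, the function name, and the output format are hypothetical, and the commit message does not say which condition srun actually tests.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical numeric levels; higher means more verbose. */
enum { TOY_LEVEL_INFO = 3, TOY_LEVEL_DEBUG = 5 };

/*
 * Build a "tasks A-B on NODE" string only if the active logging level
 * would actually print it. Returns 1 if the string was built, 0 if the
 * work was skipped (buf is then left as an empty string).
 */
int toy_build_task_str_if_needed(int active_level, char *buf, size_t len,
                                 int first_task, int last_task,
                                 const char *node)
{
    if (active_level < TOY_LEVEL_DEBUG) {
        if (len > 0)
            buf[0] = '\0';
        return 0;
    }

    snprintf(buf, len, "tasks %d-%d on %s", first_task, last_task, node);
    return 1;
}
```

With very large task counts the formatting step is the expensive part, so guarding it behind a cheap integer comparison is where the saving would come from.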

Add function to report current logging level · 469c63fc
Morris Jette authored Sep 09, 2016
```
Add get_log_level() function to return the highest LOG_LEVEL_* used
for any logging mechanism.
```
469c63fc
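
One way to read "the highest LOG_LEVEL_* used for any logging mechanism" is as a maximum over the per-destination levels. The sketch below assumes three destinations (stderr, syslog, log file) and uses hypothetical `TOY_LEVEL_*` values in place of the real `LOG_LEVEL_*` constants; it is not the body of the actual `get_log_level()` function.

```c
/* Hypothetical stand-ins for LOG_LEVEL_* values; higher is more verbose. */
typedef enum {
    TOY_LEVEL_QUIET = 0,
    TOY_LEVEL_ERROR,
    TOY_LEVEL_INFO,
    TOY_LEVEL_DEBUG
} toy_level_t;

/* Assumed set of logging mechanisms, each with its own level. */
struct toy_log_opts {
    toy_level_t stderr_level;
    toy_level_t syslog_level;
    toy_level_t logfile_level;
};

/* Return the most verbose level configured for any mechanism. */
toy_level_t toy_get_log_level(const struct toy_log_opts *opts)
{
    toy_level_t max = opts->stderr_level;

    if (opts->syslog_level > max)
        max = opts->syslog_level;
    if (opts->logfile_level > max)
        max = opts->logfile_level;

    return max;
}
```

A caller can compare the result against a threshold to decide whether an expensive log argument is worth computing at all.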
Revert "Fix issue filtering licenses for output with squeue." · 9c4eabed
Tim Wickberg authored Sep 09, 2016
```
This reverts commit 1ec2a4ae.
```
9c4eabed
Fix issue filtering licenses for output with squeue. · 1ec2a4ae
Alejandro Sanchez authored Sep 09, 2016
```
Bug 3063.
```
1ec2a4ae

08 Sep, 2016 2 commits

Restructure srun task_exit logic · 6b6d4e1a

Morris Jette authored Sep 08, 2016

Restructure srun command locking for task_exit processing logic for improved
  parallelism. This change decreases the amount of time consumed by serial
  logic by 2 orders of magnitude.
bug 3044

6b6d4e1a

Correct comment · e3d6dc68
Morris Jette authored Sep 08, 2016

e3d6dc68

07 Sep, 2016 2 commits

Preserve node "RESERVATION" state · 5eee1d28

Morris Jette authored Sep 07, 2016

Preserve node "RESERVATION" state when one of multiple overlapping
    reservations ends. Previous logic would clear the node's
    RESERVATION state flag when any one of the reservations on the
    node ended rather than keeping the node in RESERVATION state
    until the last reservation ended.
bug 3057

5eee1d28
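
The difference between the previous and the corrected behaviour can be illustrated with a toy model in C. The per-node counter used here is a simplification chosen for the example, not how Slurm tracks reservations, and all names are hypothetical.

```c
#include <stdbool.h>

/* Toy node: number of active reservations plus the RESERVATION flag. */
struct toy_resv_node {
    int  resv_count;
    bool reserved;
};

void toy_resv_start(struct toy_resv_node *n)
{
    n->resv_count++;
    n->reserved = true;
}

/* Previous behaviour: any reservation ending clears the flag. */
void toy_resv_end_old(struct toy_resv_node *n)
{
    if (n->resv_count > 0)
        n->resv_count--;
    n->reserved = false;
}

/* Corrected behaviour: the flag stays set until the last one ends. */
void toy_resv_end_new(struct toy_resv_node *n)
{
    if (n->resv_count > 0)
        n->resv_count--;
    n->reserved = (n->resv_count > 0);
}
```

With two overlapping reservations, ending one through `toy_resv_end_old` leaves the node unflagged while a reservation is still active; `toy_resv_end_new` keeps the flag until the count reaches zero.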

Decrease frequency of reservation purge logic · 062e1ca6
Morris Jette authored Sep 07, 2016
```
The logic is now heavier weight, so increase interval between tests
  from 2 to 5 seconds
```
062e1ca6