Commits · 42dc54eac21c1f591c2d18da65e298199a11f752 · Manuel G. Marciani / ces_slurm_simulator

27 Feb, 2015 6 commits

Ensure prolog runs on job requeue · 42dc54ea

Morris Jette authored Feb 27, 2015

Set the delay time for job requeue to the job credential lifetime (1200
seconds by default). This ensures that the prolog runs on every node when a
job is requeued. (This change will slow down the launch of requeued jobs.)
Without this change, if a job is restarted within 1200 seconds, the nodes
previously used would not run the prolog again, since the job ID is
still seen as active (from the previous execution). It is also advisable
to set the value of DEFAULT_EXPIRATION_WINDOW in src/common/slurm_cred.c
to the lowest reasonable value. We need to add a new configuration parameter
so this is easily changed in the future.
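
The timing rule described above can be sketched in a few lines of C. This is only an illustration, not the code from this commit: `CRED_LIFETIME_SECS`, `requeue_begin_time()` and `requeue_delay_elapsed()` are hypothetical names, and the 1200 second value is simply the default quoted in the message.

```c
#include <time.h>

/* Assumption: mirrors the default credential lifetime quoted in the
 * commit message; the real value lives in src/common/slurm_cred.c. */
#define CRED_LIFETIME_SECS 1200

/* Hypothetical helper: earliest time a requeued job may start again,
 * so that the old credential has expired on every node. */
static time_t requeue_begin_time(time_t requeue_time, int cred_lifetime_secs)
{
	if (cred_lifetime_secs < 0)
		cred_lifetime_secs = 0;
	return requeue_time + cred_lifetime_secs;
}

/* Hypothetical helper: nonzero once the requeue delay has elapsed. */
static int requeue_delay_elapsed(time_t requeue_time, time_t now,
				 int cred_lifetime_secs)
{
	return now >= requeue_begin_time(requeue_time, cred_lifetime_secs);
}
```

A scheduler loop would skip the job until `requeue_delay_elapsed()` returns nonzero, which is why requeued jobs launch more slowly with this change.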

42dc54ea

Merge branch 'slurm-14.11' · cd663c21
Morris Jette authored Feb 27, 2015
```
Conflicts:
	src/slurmctld/job_mgr.c
```
cd663c21
Display job's estimated NodeCount based off of partition's configured... · ce32018a
Brian Christiansen authored Feb 27, 2015
```
Display job's estimated NodeCount based off of partition's configured resources rather than the whole system's.

Bug 1478
```
ce32018a
Cosmetic mods, no change in logic · b99fee15
Morris Jette authored Feb 27, 2015

b99fee15
power/cray - Log cluster-wide power totals · 9e567ecc
Morris Jette authored Feb 27, 2015
```
This provides a better global view of what the limits and caps are.
```
9e567ecc

power/cray developments · 3ff19460

Morris Jette authored Feb 26, 2015

Remove time from "capmc get_node_energy_counter" call. If no recent
  data is available, no data is being returned, so just get latest
  information.
Initialize a variable to avoid xfree of uninitialized variable.
Correct joule to watt calculation (">" changed to "<")
Read configuration once when slurmctld starts rather than twice
Compute a node's power consumption with more precision based upon
  time to the microsecond
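
As a rough illustration of the joule-to-watt conversion with microsecond timestamps mentioned above, the sketch below divides the change in an energy counter by the elapsed time. The `energy_sample` struct and `average_watts()` function are hypothetical and are not taken from the power/cray plugin; the guard conditions are an assumption about sensible behavior, not a description of the corrected comparison.

```c
#include <stdint.h>

/* Hypothetical sample: cumulative energy in joules plus the time the
 * counter was read, in microseconds. */
struct energy_sample {
	uint64_t joules;
	uint64_t time_usec;
};

/* Average power in watts between two samples. Returns 0 when the
 * samples are out of order or the counter went backwards, so the
 * unsigned subtractions below can never wrap. */
static uint32_t average_watts(const struct energy_sample *prev,
			      const struct energy_sample *curr)
{
	uint64_t delta_joules, delta_usec;

	if (curr->time_usec <= prev->time_usec || curr->joules < prev->joules)
		return 0;
	delta_joules = curr->joules - prev->joules;
	delta_usec = curr->time_usec - prev->time_usec;
	/* watts = joules / seconds = joules * 1e6 / microseconds */
	return (uint32_t)((delta_joules * 1000000ULL) / delta_usec);
}
```

For example, 600 joules consumed over 2.5 seconds gives 240 watts. Multiplying before dividing keeps sub-second precision without floating point, at the cost of overflowing for extremely large joule deltas.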

3ff19460

26 Feb, 2015 9 commits
- Fixes to JSON configure path logic · eb9f22ca
  Morris Jette authored Feb 26, 2015
  
  eb9f22ca
- Update Cray build/install instructions · ea51d8b8
  Morris Jette authored Feb 26, 2015
```
Add links to burst buffer and power management pages.
Add JSON-C build/installation instructions.
```
  ea51d8b8
- Account all CPUs to the batch steps. · cc8c2e3e
  David Bigagli authored Feb 26, 2015
  
  cc8c2e3e
- Add link to LLNL SCR (Scalable Checkpoint Restart) · 725a7cb0
  Morris Jette authored Feb 26, 2015
  
  725a7cb0
- task/affinity - fix memory binding for cpusets · 701e5b33
  Morris Jette authored Feb 25, 2015
```
Previously, there was no binding of tasks to the appropriate NUMA node.
Based upon work by Josko Plazonic <plazonic@princeton.edu>.
```
  701e5b33
- task/affinity clean up · 3eff4e87
  Morris Jette authored Feb 25, 2015
```
Improved logging and some code restructuring. No change in logic.
```
  3eff4e87
- Revert "Remove unused variable." · bdd3a651
  David Bigagli authored Feb 25, 2015
```
This reverts commit e24a418b.
```
  bdd3a651
- Remove unused variable. · edd5ad26
  David Bigagli authored Feb 25, 2015
  
  edd5ad26
- task/affinity clean up · 7b313990
  Morris Jette authored Feb 25, 2015
```
Improved logging and some code restructuring. No change in logic.
```
  7b313990
25 Feb, 2015 7 commits
- Apply email notifications to entire job arrays · 9fa4909d
  Morris Jette authored Feb 25, 2015
```
Mail notifications on job BEGIN, END and FAIL now apply to a job array as a
whole rather than generating individual email messages for each task in the
job array.
```
  9fa4909d
- Revert "Remove unused variable." · 663ec8f2
  David Bigagli authored Feb 25, 2015
```
This reverts commit e24a418b.
```
  663ec8f2
- Remove unused variable. · e24a418b
  David Bigagli authored Feb 25, 2015
  
  e24a418b
- Merge branch 'slurm-14.11' · 7e046346
  Morris Jette authored Feb 25, 2015
  
  7e046346
- Add job_submit build instructions · ee90e55a
  Morris Jette authored Feb 25, 2015
  
  ee90e55a
- select/alps - Reverse .my.cnf search order · 96363d42
  Morris Jette authored Feb 25, 2015
```
This is a variation on commit 5391b8cc
Check $HOME/.my.cnf last rather than first to follow more standard search order
```
  96363d42
- topology/hypercube: Cosmetic changes for clean build, no change in logic · b626d838
  Morris Jette authored Feb 24, 2015
  
  b626d838
24 Feb, 2015 10 commits

Fix sprio showing wrong priority for job arrays until priority is recalculated. · 423029d8
Brian Christiansen authored Feb 24, 2015
```
Bug 1469
```
423029d8
Add missing topology/hypercube plugin for SGI · fa6de30d
Michael A. Raymond authored Feb 24, 2015

fa6de30d
Merge branch 'slurm-14.11' · 128148c1
Morris Jette authored Feb 24, 2015

128148c1

cray/basil, read mysql creds from /root/.my.cnf · 5391b8cc

Nina Suvanphim authored Feb 24, 2015

The /root/.my.cnf would typically contain the login credentials for
root.  If those are needed for Slurm, then it should be checking
that directory.

(In reply to Nina Suvanphim from comment #0)
...
> const char *default_conf_paths[] = {
> "/root/.my.cnf", <<<<<<<<<<<<<<<<<------- add this line
> "/etc/my.cnf", "/etc/opt/cray/MySQL/my.cnf",
> "/etc/mysql/my.cnf", NULL };

I'll also note that typically the $HOME/.my.cnf file would be
checked last rather than first.
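
A minimal sketch of the search order suggested in that last remark: system-wide files first, the per-user `$HOME/.my.cnf` last. The system paths are the ones quoted above; `find_my_cnf()` and its `home` parameter are hypothetical and not the plugin's actual function. A caller would typically pass `getenv("HOME")`.

```c
#include <stdio.h>
#include <unistd.h>

/* System-wide locations quoted in the discussion above. */
static const char *system_conf_paths[] = {
	"/etc/my.cnf", "/etc/opt/cray/MySQL/my.cnf", "/etc/mysql/my.cnf", NULL
};

/* Hypothetical helper: return the first readable option file, checking
 * the system-wide paths first and "<home>/.my.cnf" last. "buf" is
 * scratch space for the per-user path. Returns NULL if none is readable. */
static const char *find_my_cnf(const char *home, char *buf, size_t buf_len)
{
	int i, n;

	for (i = 0; system_conf_paths[i]; i++) {
		if (access(system_conf_paths[i], R_OK) == 0)
			return system_conf_paths[i];
	}
	if (home && home[0]) {
		n = snprintf(buf, buf_len, "%s/.my.cnf", home);
		if (n > 0 && (size_t)n < buf_len && access(buf, R_OK) == 0)
			return buf;
	}
	return NULL;
}
```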

5391b8cc

power/cray development · d32dac43
Morris Jette authored Feb 24, 2015
```
Fix some logic related to power distribution across nodes
```
d32dac43
Merge branch 'slurm-14.11' · 5a133bcb
Morris Jette authored Feb 24, 2015

5a133bcb
Add some SUG photo links · 3d67a89a
Morris Jette authored Feb 24, 2015

3d67a89a
Fix code for Apple computers, where SOL_TCP is not defined · ac0343be
Danny Auble authored Feb 24, 2015

ac0343be
Fix wrong variables used in the wrapper functions needed for systems that · 8d0c9901
Danny Auble authored Feb 24, 2015
```
don't support strong_alias
```
8d0c9901

power/cray development · acdec1f5

Morris Jette authored Feb 23, 2015

Update power management web page: Add notes about powering nodes down/up
Prevent underflow in power distribution logic
Add logic to identify nodes in "ready" state. Only ready nodes can have
  their power caps modified
Don't change power cap if node not in ready state
Various improvements to logging
Refactor code to eliminate duplicate/repeated building of full NID list
Plug some memory leaks
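
The commit does not say where the underflow occurred, so the following is only a generic sketch of how unsigned wattage arithmetic can be guarded. `sub_saturating()` and `spare_watts()` are hypothetical helpers, not functions from the plugin.

```c
#include <stdint.h>

/* Subtract without wrapping: an unsigned "a - b" with b > a would
 * otherwise yield a huge positive value. */
static uint32_t sub_saturating(uint32_t a, uint32_t b)
{
	return (a > b) ? (a - b) : 0;
}

/* Hypothetical example: watts left to distribute after reserving a
 * minimum for every node. The product is computed in 64 bits so it
 * cannot overflow before the comparison. */
static uint32_t spare_watts(uint32_t total_cap, uint32_t node_cnt,
			    uint32_t min_watts_per_node)
{
	uint64_t reserved = (uint64_t)node_cnt * min_watts_per_node;

	if (reserved >= total_cap)
		return 0;
	return total_cap - (uint32_t)reserved;
}
```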

acdec1f5

23 Feb, 2015 1 commit

Fix test for scontrol change · 9cb22140

Morris Jette authored Feb 23, 2015

Modify test 12.7 so that we specify a reason when setting a node DOWN.
A recent change to the Slurm code now requires a reason.

9cb22140

21 Feb, 2015 1 commit
- power/cray: Read initial caps from capmc · 58da1582
  Morris Jette authored Feb 20, 2015
  
  58da1582
20 Feb, 2015 5 commits

scontrol: Require Reason when setting node DOWN · e7c61bdd
Morris Jette authored Feb 20, 2015

e7c61bdd

power/cray work · 82de9635

Morris Jette authored Feb 20, 2015

Correct capmc arguments to set power cap.
Convert "capmc get_node_energy_counter" to use a hostlist expression rather
   than listing every node in a comma-separated list.
Log commands and args run by the plugin via the power_run_script()
   function in src/plugins/power/common/power_common.c.
Use hostlist to build condensed nid list for power cap set/clear functions.
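
To illustrate what condensing a nid list means, here is a standalone sketch that turns a sorted list of node IDs into a ranged string. The commit itself uses Slurm's hostlist code; `condense_nids()` is a hypothetical substitute written for this example, and the exact output format (`1-3,7,9-10`) is an assumption.

```c
#include <stdio.h>

/* Hypothetical helper: write a ranged form of "nids" (sorted ascending,
 * no duplicates) into "buf", e.g. {1,2,3,7,9,10} becomes "1-3,7,9-10".
 * Returns 0 on success, -1 if the buffer is too small. */
static int condense_nids(const int *nids, int cnt, char *buf, size_t buf_len)
{
	size_t used = 0;
	int i = 0;

	if (buf_len == 0)
		return -1;
	buf[0] = '\0';
	while (i < cnt) {
		int start = nids[i], end = nids[i], n;

		/* Extend the run while the next ID is consecutive. */
		while (i + 1 < cnt && nids[i + 1] == end + 1)
			end = nids[++i];
		i++;
		if (start == end)
			n = snprintf(buf + used, buf_len - used, "%s%d",
				     used ? "," : "", start);
		else
			n = snprintf(buf + used, buf_len - used, "%s%d-%d",
				     used ? "," : "", start, end);
		if (n < 0 || (size_t)n >= buf_len - used)
			return -1;
		used += (size_t)n;
	}
	return 0;
}
```

The ranged form keeps the command line short on large systems, where a comma-separated list of every node can grow to thousands of entries.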

82de9635

Merge branch 'slurm-14.11' · b8fbbf2b
Morris Jette authored Feb 20, 2015

b8fbbf2b

Fix to GRES NoConsume logic · 33c48ac5

Dorian Krause authored Feb 20, 2015

We came across the following error message in the slurmctld logs when
using non-consumable resources:

error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count
is 0

The error comes from _job_dealloc():

node_gres_data=0x7f8a18000b70, node_offset=0, gres_name=0x1999e00
"potion", job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:3980
(job_gres_list=0x199b7c0, node_gres_list=0x199bc38, node_offset=0,
job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:4190
job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test", remove_all=true)
    at select_linear.c:2091
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1, req_nodes=1,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40) at
select_linear.c:3176
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
exc_core_bitmap=0x0) at select_linear.c:3390
bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
exc_core_bitmap=0x0) at node_select.c:588
avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1, req_nodes=1,
exc_core_bitmap=0x0)
    at backfill.c:367

The cause of this problem is that _node_state_dup() in gres.c does not
duplicate the no_consume flag.
The cr_ptr passed to _rm_job_from_nodes() is created with _dup_cr()
which calls _node_state_dup().

Below is a simple patch to fix the problem. A "future-proof" alternative
might be to memcpy() from gres_ptr to new_gres and
only handle pointers separately.
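
The "future-proof" alternative could look roughly like the sketch below: copy the whole record with memcpy() so that newly added scalar fields such as the no_consume flag are never forgotten, then replace each owned pointer with a deep copy. The `node_gres_state` struct and its fields are hypothetical stand-ins, not the real definitions from gres.c.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-node GRES record. */
struct node_gres_state {
	uint32_t gres_cnt_avail;
	uint32_t gres_cnt_alloc;
	int no_consume;      /* the flag the original dup omitted */
	uint32_t *topo_cnt;  /* owned array, must be deep-copied */
	size_t topo_len;
};

/* Duplicate a record: memcpy() copies every scalar field, then owned
 * pointers are handled separately. Returns NULL on allocation failure. */
static struct node_gres_state *
node_gres_state_dup(const struct node_gres_state *src)
{
	struct node_gres_state *dst;

	if (!src)
		return NULL;
	dst = malloc(sizeof(*dst));
	if (!dst)
		return NULL;
	memcpy(dst, src, sizeof(*dst));
	dst->topo_cnt = NULL;	/* never share the source's array */
	if (src->topo_cnt && src->topo_len) {
		dst->topo_cnt = malloc(src->topo_len * sizeof(*dst->topo_cnt));
		if (!dst->topo_cnt) {
			free(dst);
			return NULL;
		}
		memcpy(dst->topo_cnt, src->topo_cnt,
		       src->topo_len * sizeof(*dst->topo_cnt));
	}
	return dst;
}
```

The trade-off is that every pointer member must be remembered instead of every scalar; forgetting one results in a shared pointer and a likely double free, so this approach suits structs with few pointer members.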

33c48ac5

power/cray - compute energy consumption via capmc · d500de54
Morris Jette authored Feb 19, 2015

d500de54

19 Feb, 2015 1 commit
- Load lua-5.2 library if using lua5.2 for lua job submit plugin. · 408c108e
  Brian Christiansen authored Feb 19, 2015
```
Bug 1471
```
  408c108e