Commits · 56be32a376bd0d62074d611ff6b5d9558d6cb7db · Manuel G. Marciani / ces_slurm_simulator

13 Aug, 2013 6 commits

Add contributor to team web page · 56be32a3
Morris Jette authored Aug 13, 2013


sched/wiki2 - Correct CPU load reported to Moab · 18ac1981

Michael Gutteridge authored Aug 13, 2013

I'm running Slurm 2.6.0 and MWM 7.2.4 in our test cluster at the moment. I happened to notice that node load reporting wasn't consistent: periodically you'd see a "sane" load reported in Moab, but most of the time the reported load was zero despite an accurate CPULoad value reported by "scontrol show node".

Finally got to digging into this. It appears that the only time load was being reported properly was in the Moab scheduling cycle directly after slurmctld did a node ping. In subsequent scheduling cycles the load (again, as reported by Moab) was back to zero.

The node ping is significant as that is the only time the node is updated. Since the wiki2 interface only reports records that change, and the load record isn't changed, it isn't reported in the queries after the node ping.

Judging from this behavior, I'm guessing that Moab does not store the load value- every time it queries resources in Slurm it sets the node's load back to zero.

I've altered src/plugins/sched/wiki2/get_nodes.c slightly: basically, I moved the section that reports CPULOAD above the check for updated info (update_time > last_node_update).

So I don't know if this is the appropriate way to fix it. The wiki specification that Adaptive has published doesn't seem to indicate how this should function. Either MWM should assume the last value reported is still accurate or Slurm needs to report it for every wiki GETNODES command.

Anyway, the patch is attached, it seems to be working for me, and I've rolled it into our debian build directory. YMMV.

Michael
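
The reordering described in this message can be sketched roughly as follows. This is a simplified illustration, not the code in get_nodes.c: the `node_rec` struct, the `dump_node` function, and the output format are hypothetical, and only the `update_time > last_node_update` comparison comes from the message itself. The point is that CPULOAD is written before the "nothing changed" early return, so it appears in every reply.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical, simplified node record; not Slurm's real structure. */
struct node_rec {
    const char *name;
    const char *state;
    double cpu_load;
};

/*
 * Format one node for a GETNODES-style reply (illustrative format only).
 * Returns the number of characters written, or -1 if buf is too small.
 */
int dump_node(const struct node_rec *n, time_t last_node_update,
              time_t update_time, char *buf, size_t len)
{
    /* CPULOAD is reported on every query, changed or not. */
    int used = snprintf(buf, len, "%s:CPULOAD=%.2f;", n->name, n->cpu_load);
    if (used < 0 || (size_t)used >= len)
        return -1;

    /* The caller already has everything else: stop here. */
    if (update_time > last_node_update)
        return used;

    /* Record changed since the caller's last query: add the rest. */
    int more = snprintf(buf + used, len - (size_t)used, "STATE=%s;", n->state);
    if (more < 0 || (size_t)more >= len - (size_t)used)
        return -1;
    return used + more;
}
```

With this ordering, a query made after the node ping still carries the load value, so a scheduler that does not cache it between queries never falls back to zero.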


Merge branch 'slurm-2.5' into slurm-2.6 · 5f3d85ce
Morris Jette authored Aug 13, 2013


select/cons_res - Avoid extraneous "oversubscribe" error messages · 302d8b3f

jette authored Aug 13, 2013

This problem was reported by Harvard University and could be
reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id".
With other job types, the error message could be logged many times
for each job. This change logs the error once per job and only if
the job request does not include the -O/--overcommit option.
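
A minimal sketch of the "log once per job, and not at all with -O/--overcommit" behavior described above. The `job_req` struct, the `note_oversubscribe` function, and the message text are hypothetical and chosen for illustration; they are not taken from select/cons_res.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-job state; not Slurm's real job record. */
struct job_req {
    unsigned job_id;
    bool overcommit;      /* true when -O/--overcommit was requested */
    bool oversub_logged;  /* true once the error has been logged */
};

/*
 * Log the oversubscribe error at most once per job, and never for
 * jobs that asked to overcommit. Returns true if a line was logged.
 */
bool note_oversubscribe(struct job_req *job, FILE *log)
{
    if (job->overcommit || job->oversub_logged)
        return false;

    /* Illustrative message text only. */
    fprintf(log, "error: job %u: more tasks than CPUs (oversubscribe)\n",
            job->job_id);
    job->oversub_logged = true;
    return true;
}
```

Keeping the flag on the job record means repeated passes over the same job stay quiet after the first message.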


Minor web page updates · 56076ef8
Morris Jette authored Aug 12, 2013

MYSQL - fix issue when rolling up usage and events happened when a cluster · 9c09a71b
Danny Auble authored Aug 12, 2013
```
was down (slurmctld not running) during that time period.
```

09 Aug, 2013 2 commits
- PGSQL - Notes about Postgres functionality being removed in the next · f0c534b7
  Danny Auble authored Aug 09, 2013
```
version of Slurm.
```
- Minor documentation update · c48a00ce
  Danny Auble authored Aug 09, 2013
  
08 Aug, 2013 3 commits
- Alphabetize debug_flags2str · b7b16680
  Danny Auble authored Aug 08, 2013
  
- Correct documentation about how 'Elapsed' time fields are printed with · 5a41b17e
  Danny Auble authored Aug 08, 2013
```
sacct.
```
- Make note about Ubuntu systems and the autocomplete script. · e1fadb20
  Danny Auble authored Aug 08, 2013
  
07 Aug, 2013 3 commits
- Remove documentation about "*cpu" portion of gres specification · c183c5be
  Morris Jette authored Aug 07, 2013
```
Remove documentation about "*cpu" portion of gres specification
in the man pages for salloc, sbatch, and srun. Support for this
specification was never implemented nor does the GRES data structure
include a field for it.
```
- Correction to cons_res web page · 0e9f8388
  Morris Jette authored Aug 07, 2013
  
- sview - Add missing debug_flag options. · 4f33c49e
  Danny Auble authored Aug 06, 2013
  
06 Aug, 2013 7 commits
- Better debug · f136ef4d
  Danny Auble authored Aug 06, 2013
  
- run multifactor add_usage even when no delta happens. · 934e9fe9
  Danny Auble authored Aug 06, 2013
  
- Make it so if a QOS has a usage_factor of 0 CPURunMins is still handled · 1f8ca376
  Danny Auble authored Aug 06, 2013
```
as the job completes.
```
- Handle complete removal of CPURunMins time at the end of the job instead · a96b5ae7
  Danny Auble authored Aug 06, 2013
```
of at multifactor poll.
```
- Rename decay_factor to real_decay, no code change · d3f64132
  Danny Auble authored Aug 06, 2013
  
- Minor tweaks for poe tests · a6d45691
  Morris Jette authored Aug 06, 2013
```
Need higher memory limits due to pmdv12 size
pmdv12 fails to recognize immediate application exit, hangs with defunct process
```
- Node issues of MPI launched tasks with accounting and task binding · be8047d1
  Morris Jette authored Aug 06, 2013
  
05 Aug, 2013 4 commits
- Note about acct_gather.conf file and non loaded plugins · 6dc38232
  Danny Auble authored Aug 05, 2013
  
- Update acct_gather.conf docs about options and plugins · cd097593
  Danny Auble authored Aug 05, 2013
  
- Update SUG13 info · 5cecdb88
  Morris Jette authored Aug 05, 2013
  
- Add Tianhe-2 job launch timing info · ad611499
  Morris Jette authored Aug 05, 2013
  
01 Aug, 2013 3 commits
- Minor fix to not have run together words in help · cde3c22f
  Rod Schultz authored Aug 01, 2013
  
- Fix long line in man page · 95161b34
  Danny Auble authored Aug 01, 2013
  
- If cannot collect energy values send message to the controller · 4e18e004
  David Bigagli authored Jul 31, 2013
```
to drain the node and log error slurmd log file.
```
31 Jul, 2013 6 commits
- For better usability use info() instead of debug(). · 44a12ad7
  David Bigagli authored Jul 31, 2013
  
- Update news web page for new releases to come · 42d48574
  Morris Jette authored Jul 31, 2013
  
- Write only one header in the csv file. · 55d220ce
  David Bigagli authored Jul 31, 2013
  
- Update the NEWS file. · a3620580
  David Bigagli authored Jul 30, 2013
  
- Print the header in the csv file only once, set the debug messages · 9a4be9e7
  David Bigagli authored Jul 30, 2013
```
at debug() level, make the argument check case insensitive,
avoid printing duplicate \n.
```
- Corrected the AcctGatherProfileType documentation. · e9147886
  David Bigagli authored Jul 30, 2013
  
30 Jul, 2013 1 commit
- IPMI - Fix Math bug getting new wattage. · b950ad0f
  Thomas Cadeau authored Jul 30, 2013
  
29 Jul, 2013 1 commit
- Update Slurm contributor list · e1d0c3af
  Morris Jette authored Jul 29, 2013
  
26 Jul, 2013 4 commits
- Updated the NEWS file. · 216d8141
  David Bigagli authored Jul 26, 2013
  
- Corrected the hdf5 profile user guide and acct_gather.conf docs. · 69ec80d4
  David Bigagli authored Jul 26, 2013
  
- Note nature of past two commits · b036f491
  Morris Jette authored Jul 25, 2013
  
- Set srun default task count properly when --ntasks-per-node specified · 9ac92bce
  Morris Jette authored Jul 25, 2013
```
Similar problem to that described in bug 343 for sbatch
There are many differences in the salloc/srun/sbatch argument
processing that should probably be made more uniform, but no time
to do so now
```