- 16 Aug, 2013 2 commits
-
Danny Auble authored
Conflicts:
	src/common/stepd_api.c
-
Danny Auble authored
-
- 15 Aug, 2013 11 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Conflicts:
	src/common/read_config.c
	src/plugins/accounting_storage/pgsql/accounting_storage_pgsql.c
	src/plugins/jobcomp/pgsql/jobcomp_pgsql.c
	src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Danny Auble authored
could end up before the job started. Bug 371
-
Danny Auble authored
-
Morris Jette authored
This function can now be called to test for processes which are dumping in order to avoid sending them a SIGKILL until dump completes. Change in logic required for job_container/cray.
-
Morris Jette authored
Handle invalid process ID better (probably never happens). Re-arrange logic to eliminate duplicate file close on error.
-
Danny Auble authored
-
- 14 Aug, 2013 16 commits
-
Danny Auble authored
-
Danny Auble authored
plugins more maintainable.
-
https://github.com/SchedMD/slurm
jette authored
-
jette authored
We now reject jobs with an invalid accounting frequency at submit time rather than launch time, so the error is slightly different and the test needs to change for that.
-
Morris Jette authored
-
Morris Jette authored
This avoids waiting for the job's initiation to fail.
-
Morris Jette authored
Only cancel the job.
-
Danny Auble authored
-
Morris Jette authored
Fix job state recovery logic in which a job's accounting frequency was not set. This would result in a value of 65534 seconds being used (the equivalent of NO_VAL in uint16_t), which could result in the job being requeued or aborted.
-
Morris Jette authored
-
Danny Auble authored
-
jette authored
-
David Bigagli authored
-
Morris Jette authored
Problem reported by BYU. slurm.conf included a file one byte in length. Logic created a buffer one byte long and used fgets() to read the file. fgets() reads one byte less than the buffer size to include a trailing '\0', so it fails to read the file.
-
Danny Auble authored
Basically the system size has to be set up before you call the priority/multifactor plugin. If a job is finishing while the slurmctld is starting then it would fatal on the init if it wasn't set up.
-
Danny Auble authored
-
- 13 Aug, 2013 11 commits
-
-
Morris Jette authored
-
Morris Jette authored
core reservations and reservation prolog/epilog
-
John Thiltges authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Michael Gutteridge authored
I'm running Slurm 2.6.0 and MWM 7.2.4 in our test cluster at the moment. I happened to notice that node load reporting wasn't consistent: periodically you'd see a "sane" load reported in Moab, but most of the time the reported load was zero despite an accurate CPULoad value reported by "scontrol show node".

Finally got to digging into this. It appears that the only time load was being reported properly was in the Moab scheduling cycle directly after slurmctld did a node ping. In subsequent scheduling cycles the load (again, as reported by Moab) was back to zero. The node ping is significant as that is the only time the node is updated: since the wiki2 interface only reports records that change, and the load record isn't changed, it isn't reported in the queries after the node ping. Judging from this behavior, I'm guessing that Moab does not store the load value; every time it queries resources in Slurm it sets the node's load back to zero.

I've altered src/plugins/sched/wiki2/get_nodes.c slightly: basically moved the section that reports CPULOAD above the check for updated info (update_time > last_node_update). So I don't know if this is the appropriate way to fix it. The wiki specification that Adaptive has published doesn't seem to indicate how this should function. Either MWM should assume the last value reported is still accurate or Slurm needs to report it for every wiki GETNODES command.

Anyway, the patch is attached, it seems to be working for me, and I've rolled it into our debian build directory. YMMV.

Michael
-
jette authored
I don't see how this could happen, but it might explain something reported by Harvard University. In any case, this could prevent an infinite loop if the task distribution function is passed a job allocation with zero nodes.
-
Morris Jette authored
-
jette authored
This problem was reported by Harvard University and could be reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id". With other job types, the error message could be logged many times for each job. This change logs the error once per job and only if the job request does not include the -O/--overcommit option.
-
Danny Auble authored
went from the old single table per enterprise style to that of separate tables per clusters. (2.0 -> 2.*). If people are still running <2.2 they really need to upgrade (long before this), and can get the translations by upgrading to >=2.1.0 before they install this version.
-