Commits · 163d95478505d69bf0a84dd3c8838cbe041d120c · Manuel G. Marciani / ces_slurm_simulator

14 Jan, 2013 5 commits

Prevent srun abort on task launch failure · 163d9547

Hongjia Cao authored Jan 14, 2013

On job step launch failure, the function
"slurm_step_launch_wait_finish()" will be called twice in launch/slurm,
which causes srun to be aborted:

srun: error: Task launch for 22495.0 failed on node cn6: Job credential
expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
cn5
cn4
cn7
srun: error: Timed out waiting for job step to complete
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
Aborted (core dumped)

The attached patch (version 2.5.1) fixes the abort, but the messages
"
Job step aborted: Waiting up to 2 seconds for job step to finish.
Timed out waiting for job step to complete
"
will still be printed twice.
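
The failure mode above (a "wait for finish" routine that runs twice and touches already-released state on the second pass) can be illustrated with a small model. This is only a sketch in Python, not Slurm's C code: the class `StepLaunchState`, its fields, and the guard flag are hypothetical names chosen for illustration, and the commit message does not say how the actual patch prevents the second call from aborting. Unlike the patch described above, this guard also suppresses the repeated message.

```python
class StepLaunchState:
    """Toy stand-in for a job step's launch state (hypothetical, not Slurm's struct)."""

    def __init__(self, num_tasks):
        # One flag per task; plays the role of a bitmap of exited tasks.
        self.tasks_exited = [False] * num_tasks
        self.finished = False
        self.messages = []

    def wait_finish(self):
        """Finish the step and release its state. Safe to call more than once."""
        if self.finished:
            # A repeated call is a no-op instead of reading released state.
            return False
        pending = [i for i, done in enumerate(self.tasks_exited) if not done]
        if pending:
            self.messages.append("Timed out waiting for job step to complete")
        self.tasks_exited = None  # analogous to freeing the bitmap
        self.finished = True
        return True
```

With this guard, calling `wait_finish()` twice on the same object returns `True` the first time and `False` the second, and the released `tasks_exited` field is never read again.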


Add debugging hint to MPI guide for MPICH2 · dd8c22c7
Morris Jette authored Jan 14, 2013

Fix bug in accounting_storage/pgsql · 667cbf15
Yair Yarom authored Jan 14, 2013

Add more notes discouraging use of PGSQL plugin · 08cfbf0a
Morris Jette authored Jan 14, 2013

Revision of gres topology bug fix · e9c216c4
Morris Jette authored Jan 14, 2013


11 Jan, 2013 6 commits
- Merge branch 'slurm-2.5' of https://github.com/SchedMD/slurm into slurm-2.5 · 7f02e865
  jette authored Jan 11, 2013
  
- Modify test for change in sbcast credential validation · 67977529
  jette authored Jan 11, 2013
```
User root or SlurmUser don't need valid sbcast credential
```
- Add log message describing possible failure mode · 59a5be56
  Morris Jette authored Jan 11, 2013
  
- Add small sleep to test for unsynchronized clocks across a cluster · 557b2607
  jette authored Jan 11, 2013
  
- Increase time in a test for slow poe launch · 9c328fad
  jette authored Jan 11, 2013
  
- Modify switch/nrt logic to permit build without libnrt.so library. · d534d48b
  Morris Jette authored Jan 10, 2013
  
10 Jan, 2013 7 commits
- Increase test delay for slow poe task launch · d8837a0a
  jette authored Jan 10, 2013
  
- Debugged version of job_submit/all_partitions · 27c74430
  Morris Jette authored Jan 10, 2013
  
- Add job_submit/all_partitions plugin · f07f173b
  jette authored Jan 10, 2013
  
- Fix recently introduced bug that prevented scontrol from working with command line input · a6151e5d
  jette authored Jan 09, 2013
  
- Clarify documentation with respect to CPU binding · 54ef6172
  Morris Jette authored Jan 09, 2013
  
- Fix logic to optimize GRES topology with respect to allocated CPUs. · a6bfae71
  Morris Jette authored Jan 09, 2013
  
- BGQ - minor fix to show correct step node count. · cf922f10
  Danny Auble authored Jan 09, 2013
  
09 Jan, 2013 6 commits
- Fixed bad variables · 2d8a8e4b
  Danny Auble authored Jan 09, 2013
  
- Added salloc to test8.11 for testing. · 8ca946b1
  Nathan Yee authored Jan 09, 2013
  
- Update contributor agreement page to advise SchedMD contact · 9c0f8832
  Morris Jette authored Jan 09, 2013
  
- Add contributor agreements to web pages · 15c4f769
  Morris Jette authored Jan 09, 2013
  
- Print new-line after scontrol EOF · 3f52e3ee
  David Bigagli authored Jan 09, 2013
  
- Add missing "safe" flag from print of AccountStorageEnforce option. · 720153b7
  Danny Auble authored Jan 09, 2013
  
08 Jan, 2013 6 commits

BLUEGENE - fix for QOS/Association node limits. · b50e2269
Danny Auble authored Jan 08, 2013

In select/serial enforce both node and core count for reservation createio · f00745ea
jette authored Jan 08, 2013

Disable a test for select/serial plugin · 9bc0cf0b
jette authored Jan 08, 2013

Modify test for change in how squeue reports node state alloc or idle · 82e0bb50
Morris Jette authored Jan 08, 2013


Report node state as MAINT only if not allocated jobs · 2af5ce33

Rod Schultz authored Jan 08, 2013

One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the allocated state while scontrol shows it to be in the maintenance state.

This can happen when a node is not completely allocated (select/cons_res, a partition which is not Shared=EXCLUSIVE, jobs allocated without --exclusive, or jobs that are allocated only some of the CPUs on a node).

In scontrol, the execution paths leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact test whether allocated_cpus is less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while similar paths in sinfo do not.

I think this is probably a bug, since the mixed state is defined, and I think it is desirable that both commands return the same result.

The problem can be fixed with two logic changes (in multiple places):

1) node_state_string and node_state_string_compact have to check for mixed as well as allocated before returning the MAINT state. This means that the reported state for the node with the allocated job will be MIXED.

2) sinfo must also check whether allocated_cpus is less than total_cpus and set the state to MIXED before calling either node_state_string or node_state_string_compact.

The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
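
The two logic changes described above can be modeled roughly as follows. This is a sketch in Python under stated assumptions, not the patch itself: `BaseState`, `adjust_for_partial_allocation`, `state_string`, and `reported_state` are hypothetical names, and the real node_state_string in Slurm operates on a C state word with flag bits rather than on an enum plus a boolean.

```python
from enum import Enum


class BaseState(Enum):
    """Simplified node base states (illustrative subset only)."""
    IDLE = "IDLE"
    ALLOCATED = "ALLOCATED"
    MIXED = "MIXED"


def adjust_for_partial_allocation(base, allocated_cpus, total_cpus):
    """Change 2: report MIXED when only some of the node's CPUs are allocated."""
    if base is BaseState.ALLOCATED and allocated_cpus < total_cpus:
        return BaseState.MIXED
    return base


def state_string(base, maint):
    """Change 1: return MAINT only if the node is neither allocated nor mixed."""
    if maint and base not in (BaseState.ALLOCATED, BaseState.MIXED):
        return "MAINT"
    return base.value


def reported_state(base, allocated_cpus, total_cpus, maint):
    """Apply the CPU check first, then build the state string."""
    adjusted = adjust_for_partial_allocation(base, allocated_cpus, total_cpus)
    return state_string(adjusted, maint)
```

Under this model, a node in a maintenance reservation with 4 of 16 CPUs still allocated is reported as MIXED, a fully allocated node as ALLOCATED, and a node with no jobs as MAINT. Running both commands through the same two steps is what would make their output agree.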


Fix advanced reservation recovery logic when upgrading from version 2.4. · 604b869e
Morris Jette authored Jan 08, 2013


03 Jan, 2013 8 commits
- Start news for V2.5.2 · 64048706
  Morris Jette authored Jan 03, 2013
  
- Update META for v2.5.1 tag · 00f20189
  Morris Jette authored Jan 03, 2013
  
- Disable job --exclusive option with select/serial plugin · 54a3a7db
  Morris Jette authored Jan 03, 2013
  
- Modify test for changed scontrol output and disable with select/serial · 108edd31
  Morris Jette authored Jan 03, 2013
  
- Correction to commit 9eb37af0 · da312784
  Morris Jette authored Jan 03, 2013
```
Command line argument would not be processed, but scontrol would
exit immediately
```
- Exit scontrol command on stdin EOF · 9eb37af0
  Morris Jette authored Jan 03, 2013
  
- Fix typo · e3cba0ed
  Morris Jette authored Jan 03, 2013
  
- Correct core reservation logic for use with select/serial plugin. · 0a06566a
  Morris Jette authored Jan 02, 2013
  
28 Dec, 2012 2 commits
- Support multiple versions of account gather RPCs · 5f087561
  Morris Jette authored Dec 28, 2012
  
- Support version 2.5 slurmctld with version 2.4 slurmd daemons. · 844f70a2
  Morris Jette authored Dec 28, 2012
  