Commits · dd7fd98a2d0b284356d20d90523e99471b2c2cfd · Manuel G. Marciani / ces_slurm_simulator

16 Jan, 2013 13 commits
- Add links to LBL Node Health Check program in FAQ and Download web pages · dd7fd98a
  Morris Jette authored Jan 16, 2013
  
  dd7fd98a
- Fix for scheduling batch jobs in multiple partitions · 04fbf26a
  Morris Jette authored Jan 16, 2013
```
Without this change, a high priority batch job may not start at submit
time. In addition, a pending job with multiple partitions may be cancelled
when the scheduler runs if any of its partitions cannot be used by
the job.
```
  04fbf26a
- In sacctmgr, purge local qos list after update · a5310890
  David Bigagli authored Jan 16, 2013
  
  a5310890
- Remove commit 0ca48b70 · 111fd8b2
  Morris Jette authored Jan 16, 2013
```
The original work this was based upon has been replaced with new
logic.
```
  111fd8b2
- Fix for job request with multiple partitions and node features · d84ad203
  Morris Jette authored Jan 16, 2013
```
Without this patch, if the first listed partition lacks nodes
with required features the job would be rejected.
```
  d84ad203
- Revert commit 3d69c635 · 0b9c3c58
  Morris Jette authored Jan 16, 2013
```
While this validates the job at submit time, it results in redundant
looping when scheduling jobs. Working on an alternate patch now.
```
  0b9c3c58
- Missed goto for last patch · 0ca48b70
  Danny Auble authored Jan 16, 2013
  
  0ca48b70
- Check all requested partitions to make sure at least one is valid on job submission · 3d69c635
  Danny Auble authored Jan 16, 2013
  
  3d69c635
- Initialize core_map structure for select/linear, needed for GRES binding · 8880019b
  Morris Jette authored Jan 15, 2013
  
  8880019b
- Merge branch 'slurm-2.5' of https://github.com/SchedMD/slurm into slurm-2.5 · 0db4991d
  Morris Jette authored Jan 15, 2013
  
  0db4991d
- Correction to GRES element selection logic · d5d43fd3
  Morris Jette authored Jan 15, 2013
  
  d5d43fd3
- Correct gres logic to handle difference in core/cpu count · 903b5654
  Morris Jette authored Jan 15, 2013
```
The gres_plugin_job_test() function was returning a count of cores available
to a job, but the select plugins were treating this as a CPU count.
This change converts the core count into a CPU count as needed in
the select plugin and changes the comments related to the function
gres_plugin_job_test().
```
  903b5654
- Fix issue where sjstat and sjobexitmod were installed in 2 different RPMs · 8b91fd63
  Danny Auble authored Jan 15, 2013
  
  8b91fd63
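
The core/CPU correction described in 903b5654 can be pictured with a minimal sketch. It is written in Python for illustration and is not Slurm's C code. The function names and the `cpus_per_core` parameter are hypothetical, and the sketch assumes a node where each core exposes a fixed number of CPUs (hardware threads), so a core count has to be scaled before it is compared with a CPU count.

```python
def cores_to_cpus(core_count: int, cpus_per_core: int) -> int:
    """Convert a count of usable cores into a count of usable CPUs.

    Hypothetical helper: assumes every core exposes `cpus_per_core` CPUs.
    """
    if core_count < 0 or cpus_per_core < 1:
        raise ValueError("core_count must be >= 0 and cpus_per_core >= 1")
    return core_count * cpus_per_core


def usable_cpus(avail_cpus: int, gres_cores: int, cpus_per_core: int) -> int:
    """Cap the CPUs a job may use by what its generic resources allow.

    `gres_cores` is a core count, so it is converted to CPUs before the
    comparison; comparing it directly against `avail_cpus` would mix units.
    """
    return min(avail_cpus, cores_to_cpus(gres_cores, cpus_per_core))


if __name__ == "__main__":
    # 16 CPUs free, GRES usable from 4 cores, 2 CPUs per core.
    print(usable_cpus(avail_cpus=16, gres_cores=4, cpus_per_core=2))
```

With two CPUs per core, treating the core count as a CPU count would under-report the usable CPUs by half in this example.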
15 Jan, 2013 1 commit

- QoS limits enforcement: correct a bug with 0-valued per user used limits · 4136520d
  Matthieu Hautreux authored Jan 15, 2013

QoS limits enforcement on the controller side is based on a list of used_limits
per user. When a user has not yet been added to the list, which is common when the
controller is restarted and the user has no running jobs, the current logic
skips some of the "per user limits" checks and lets the submission succeed.
However, if one of these limits is zero-valued, the check should
fail: no job should be submitted at all, as it would
necessarily cross the limit.

This patch ensures that even when a user is not yet present in the per user
used_limits list, the 0-valued limits are correctly treated.

4136520d
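
The behaviour described above can be modelled with a small sketch. This is a simplified Python model and not the controller's code: the function names, the single `max_submit_jobs` limit, and the dictionary standing in for the per-user used_limits list are all assumptions made for illustration.

```python
from typing import Dict, Optional


def may_submit_old(user: str, max_submit_jobs: Optional[int],
                   used_limits: Dict[str, int]) -> bool:
    """Model of the described bug: an untracked user skips the check."""
    if max_submit_jobs is None:      # no limit configured
        return True
    used = used_limits.get(user)
    if used is None:                 # user not in the list: check skipped
        return True
    return used < max_submit_jobs


def may_submit_fixed(user: str, max_submit_jobs: Optional[int],
                     used_limits: Dict[str, int]) -> bool:
    """Model of the fix: an untracked user counts as having used nothing,
    so a zero-valued limit still rejects the submission."""
    if max_submit_jobs is None:      # no limit configured
        return True
    used = used_limits.get(user, 0)
    return used < max_submit_jobs


if __name__ == "__main__":
    used = {"alice": 2}              # "bob" has no entry yet
    print(may_submit_old("bob", 0, used), may_submit_fixed("bob", 0, used))
```

In this model, treating a missing entry as zero usage is enough for the zero-valued limit to be enforced; the actual patch may handle the case differently.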

14 Jan, 2013 6 commits

- Handle srun task launch failure without duplicate error messages or abort. · 6e83dc4c
  jette authored Jan 14, 2013
  
  6e83dc4c

- Prevent srun abort on task launch failure · 163d9547
  Hongjia Cao authored Jan 14, 2013

On job step launch failure, the function
"slurm_step_launch_wait_finish()" is called twice in launch/slurm,
which causes srun to abort:

srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
cn5
cn4
cn7
srun: error: Timed out waiting for job step to complete
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
Aborted (core dumped)

The attached patch (version 2.5.1) fixes the abort, but the messages
"
Job step aborted: Waiting up to 2 seconds for job step to finish.
Timed out waiting for job step to complete
"
will still be printed twice.

163d9547
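
A routine that releases state on its first call, as the assertion failure above suggests, is often made safe against a second call with a guard flag. The sketch below shows that general pattern in Python. It is not what the patch does, which the commit message does not spell out, and the `StepContext` class and its attributes are hypothetical.

```python
from typing import Iterable, List, Optional, Set


class StepContext:
    """Hypothetical stand-in for a job step launch context."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.pending: Optional[Set[str]] = set(nodes)  # released on first wait
        self.finished = False
        self.messages: List[str] = []

    def wait_finish(self) -> bool:
        """Wait for the step to finish; a repeated call is a no-op.

        Returns True if this call did the work, False if it was skipped.
        """
        if self.finished:
            return False
        self.messages.append("Job step aborted: waiting for job step to finish.")
        self.pending = None   # state is released exactly once
        self.finished = True
        return True


if __name__ == "__main__":
    ctx = StepContext(["cn4", "cn5", "cn7"])
    print(ctx.wait_finish(), ctx.wait_finish(), len(ctx.messages))
```

With the guard in place, the second call neither touches the released state nor repeats the message.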

- Add debugging hint to MPI guide for MPICH2 · dd8c22c7
  Morris Jette authored Jan 14, 2013
  
  dd8c22c7
- Fix bug in accounting_storage/pgsql · 667cbf15
  Yair Yarom authored Jan 14, 2013
  
  667cbf15
- Add more notes discouraging use of PGSQL plugin · 08cfbf0a
  Morris Jette authored Jan 14, 2013
  
  08cfbf0a
- Revision of gres topology bug fix · e9c216c4
  Morris Jette authored Jan 14, 2013
  
  e9c216c4

11 Jan, 2013 6 commits
- Merge branch 'slurm-2.5' of https://github.com/SchedMD/slurm into slurm-2.5 · 7f02e865
  jette authored Jan 11, 2013
  
  7f02e865
- Modify test for change in sbcast credential validation · 67977529
  jette authored Jan 11, 2013
```
Users root and SlurmUser don't need a valid sbcast credential
```
  67977529
- Add log message describing possible failure mode · 59a5be56
  Morris Jette authored Jan 11, 2013
  
  59a5be56
- Add small sleep to test for unsynchronized clocks across a cluster · 557b2607
  jette authored Jan 11, 2013
  
  557b2607
- Increase time in a test for slow poe launch · 9c328fad
  jette authored Jan 11, 2013
  
  9c328fad
- Modify switch/nrt logic to permit build without libnrt.so library. · d534d48b
  Morris Jette authored Jan 10, 2013
  
  d534d48b
10 Jan, 2013 7 commits
- Increase test delay for slow poe task launch · d8837a0a
  jette authored Jan 10, 2013
  
  d8837a0a
- Debugged version of job_submit/all_partitions · 27c74430
  Morris Jette authored Jan 10, 2013
  
  27c74430
- Add job_submit/all_partitions plugin · f07f173b
  jette authored Jan 10, 2013
  
  f07f173b
- Fix recently introduced bug that prevented scontrol from working with command line input · a6151e5d
  jette authored Jan 09, 2013
  
  a6151e5d
- Clarify documentation with respect to CPU binding · 54ef6172
  Morris Jette authored Jan 09, 2013
  
  54ef6172
- Fix logic to optimize GRES topology with respect to allocated CPUs. · a6bfae71
  Morris Jette authored Jan 09, 2013
  
  a6bfae71
- BGQ - minor fix to show correct step node count. · cf922f10
  Danny Auble authored Jan 09, 2013
  
  cf922f10
09 Jan, 2013 6 commits
- Fixed bad variables · 2d8a8e4b
  Danny Auble authored Jan 09, 2013
  
  2d8a8e4b
- Added salloc to test8.11 for testing. · 8ca946b1
  Nathan Yee authored Jan 09, 2013
  
  8ca946b1
- Update contributor agreement page to advise SchedMD contact · 9c0f8832
  Morris Jette authored Jan 09, 2013
  
  9c0f8832
- Add contributor agreements to web pages · 15c4f769
  Morris Jette authored Jan 09, 2013
  
  15c4f769
- Print new-line after scontrol EOF · 3f52e3ee
  David Bigagli authored Jan 09, 2013
  
  3f52e3ee
- Add missing "safe" flag from print of AccountStorageEnforce option. · 720153b7
  Danny Auble authored Jan 09, 2013
  
  720153b7
08 Jan, 2013 1 commit
- BLUEGENE - fix for QOS/Association node limits. · b50e2269
  Danny Auble authored Jan 08, 2013
  
  b50e2269