Commits · 6596f17e439df3606897e43c4f99786de3db9e29 · Manuel G. Marciani / ces_slurm_simulator

15 Jan, 2013 2 commits
- Merge priority/multifactor2 plugin into priority/multifactor · 6596f17e
  David Bigagli authored Jan 14, 2013
```
Add PriorityFlags value of "TICKET_BASED".
```
  6596f17e
- Merge branch 'slurm-2.5' · a68ebb39
  Morris Jette authored Jan 14, 2013
  
  a68ebb39
14 Jan, 2013 12 commits

Handle srun task launch failure without duplicate error messages or abort. · 6e83dc4c
jette authored Jan 14, 2013

6e83dc4c

Prevent srun abort on task launch failure · 163d9547

Hongjia Cao authored Jan 14, 2013

On job step launch failure, the function
"slurm_step_launch_wait_finish()" will be called twice in launch/slurm,
which causes srun to be aborted:

srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
srun: error: Application launch failed: Job credential expired
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
cn5
cn4
cn7
srun: error: Timed out waiting for job step to complete
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
Aborted (core dumped)

The attached patch (for version 2.5.1) fixes the abort, but the following
messages will still be printed twice:

Job step aborted: Waiting up to 2 seconds for job step to finish.
Timed out waiting for job step to complete

163d9547
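
The failure mode described in this commit (a cleanup routine called twice, with the second call touching state the first call already released) is commonly avoided with a "finished" guard. The following is a minimal sketch of that general technique only. It is not the patch from this commit, and since the commit notes that the messages are still printed twice, the actual patch presumably works differently. The class `StepLaunchState` and its members are hypothetical names introduced for illustration.

```python
class StepLaunchState:
    """Hypothetical stand-in for per-step launch state (not a Slurm type)."""

    def __init__(self):
        self.finished = False
        self.log = []

    def wait_finish(self):
        # Second and later calls return early, so anything released by
        # the first call is never touched again.
        if self.finished:
            return
        self.finished = True
        self.log.append(
            "Job step aborted: Waiting up to 2 seconds for job step to finish."
        )
        # A real implementation would wait for the tasks here and then
        # release the per-step resources.
```

With a guard like this, calling `wait_finish()` twice on the same object logs the message once and leaves the state untouched on the second call.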

Added new SelectTypeParameter value of "CR_ALLOCATE_FULL_SOCKET". · cdf679d0
Morris Jette authored Jan 14, 2013

cdf679d0
Remove "Related Software" from Slurm web header, just use "Download" · 474ddcce
Morris Jette authored Jan 14, 2013

474ddcce
select/cons_res plugin: CPU allocation logic fix · 1ef41ac9
Morris Jette authored Jan 14, 2013
```
Correction to CPU allocation count logic for cores without hyperthreading.
```
1ef41ac9

Add SLURM_SRUN_REDUCE_TASK_EXIT_MSG environment variable · 96986199

Hongjia Cao authored Jan 14, 2013

When a job launched directly with srun ends abnormally, each node emits
a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
SIGNAL 9 ***), a task-exit message (srun: error: task[0-1]: Terminated),
or both. For large-scale jobs these messages become tedious and bury
the other error messages. The two attached patches (for slurm-2.5.1)
introduce two environment variables to control the output of such
messages:

SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
print the step-killed-message;

SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
exit messages with the same exit code will be printed only once.

96986199
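
The behaviour described for SLURM_SRUN_REDUCE_TASK_EXIT_MSG can be sketched as follows. This is an illustration of the stated rule (successive messages with the same exit code are printed once), not the code from the patch. The event shape `(message, exit_code)`, both function names, and the integer parsing of the variable are assumptions made for this example.

```python
from itertools import groupby


def reduce_enabled(env):
    """Return True when SLURM_SRUN_REDUCE_TASK_EXIT_MSG is set and non-zero.

    Assumption: the value is parsed as an integer, and an unparsable
    value is treated as "not enabled".
    """
    value = env.get("SLURM_SRUN_REDUCE_TASK_EXIT_MSG")
    if value is None:
        return False
    try:
        return int(value) != 0
    except ValueError:
        return False


def reduce_task_exit_messages(events, env):
    """Return the messages to print for (message, exit_code) events.

    `events` is a list in arrival order (hypothetical shape).
    """
    if not reduce_enabled(env):
        return [message for message, _ in events]
    # groupby merges only adjacent items with equal keys, which matches
    # "successive" messages; the first message of each run is kept.
    return [next(run)[0] for _, run in groupby(events, key=lambda e: e[1])]
```

Note that only adjacent messages are merged in this sketch: a message with the same exit code that arrives after a different exit code is printed again.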

Add SLURM_STEP_KILLED_MSG_NODE_ID environment variable · 232ab305

Hongjia Cao authored Jan 14, 2013

When a job launched directly with srun ends abnormally, each node emits
a step-killed message (slurmd[cn123]: *** 1234.0 KILLED AT ... WITH
SIGNAL 9 ***), a task-exit message (srun: error: task[0-1]: Terminated),
or both. For large-scale jobs these messages become tedious and bury
the other error messages. The two attached patches (for slurm-2.5.1)
introduce two environment variables to control the output of such
messages:

SLURM_STEP_KILLED_MSG_NODE_ID: if set, only the specified node will
print the step-killed-message;

SLURM_SRUN_REDUCE_TASK_EXIT_MSG: if set and non-zero, successive task
exit messages with the same exit code will be printed only once.

232ab305
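
The rule described for SLURM_STEP_KILLED_MSG_NODE_ID (if set, only the specified node prints the step-killed message) can be sketched as a simple predicate. This is an illustration of the stated behaviour, not the code from the patch. The function name, the assumption that the node ID is a numeric index, and the handling of unparsable values are all hypothetical.

```python
def should_print_step_killed(node_id, env):
    """Decide whether the node with index `node_id` prints the message.

    Assumptions: the variable holds an integer node index, and an
    unparsable value suppresses the message on every node.
    """
    wanted = env.get("SLURM_STEP_KILLED_MSG_NODE_ID")
    if wanted is None:
        # Variable unset: every node prints, as before the patch.
        return True
    try:
        return node_id == int(wanted)
    except ValueError:
        return False
```

For example, with the variable set to `0`, only the node with index 0 would print the message under these assumptions, and every other node would stay silent.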

Merge branch 'slurm-2.5' · fef33d8d
Morris Jette authored Jan 14, 2013

fef33d8d
Add debugging hint to MPI guide for MPICH2 · dd8c22c7
Morris Jette authored Jan 14, 2013

dd8c22c7
Fix bug in accounting_storage/pgsql · 667cbf15
Yair Yarom authored Jan 14, 2013

667cbf15
Add more notes discouraging use of PGSQL plugin · 08cfbf0a
Morris Jette authored Jan 14, 2013

08cfbf0a
Revision of gres topology bug fix · e9c216c4
Morris Jette authored Jan 14, 2013

e9c216c4

11 Jan, 2013 10 commits
- Merge branch 'slurm-2.5' of https://github.com/SchedMD/slurm into slurm-2.5 · 7f02e865
  jette authored Jan 11, 2013
  
  7f02e865
- Modify test for change in sbcast credential validation · 67977529
  jette authored Jan 11, 2013
```
User root or SlurmUser doesn't need a valid sbcast credential
```
  67977529
- Merge branch 'slurm-2.5' · 862cdd39
  Morris Jette authored Jan 11, 2013
  
  862cdd39
- Add log message describing possible failure mode · 59a5be56
  Morris Jette authored Jan 11, 2013
  
  59a5be56
- Do not require sbcast credential check if RPC comes from root · a82fdd81
  Morris Jette authored Jan 11, 2013
```
This can be useful for testing purposes
```
  a82fdd81
- Merge branch 'slurm-2.5' · e666076c
  Morris Jette authored Jan 11, 2013
  
  e666076c
- Add small sleep to test for unsynchronized clocks across a cluster · 557b2607
  jette authored Jan 11, 2013
  
  557b2607
- Increase time in a test for slow poe launch · 9c328fad
  jette authored Jan 11, 2013
  
  9c328fad
- Merge branch 'slurm-2.5' · 1d49940d
  Morris Jette authored Jan 10, 2013
  
  1d49940d
- Modify switch/nrt logic to permit build without libnrt.so library. · d534d48b
  Morris Jette authored Jan 10, 2013
  
  d534d48b
10 Jan, 2013 15 commits
- Increase test delay for slow poe task launch · d8837a0a
  jette authored Jan 10, 2013
  
  d8837a0a
- Add AlpsEngine cray.conf parameter information to Cray web page · dd142548
  Morris Jette authored Jan 10, 2013
  
  dd142548
- Remove trailing space · 93a017e2
  Morris Jette authored Jan 10, 2013
  
  93a017e2
- Cray - Add new cray.conf parameter of "AlpsEngine" · 35a61b44
  Morris Jette authored Jan 10, 2013
```
Used to specify the communication protocol to be used for ALPS/BASIL.
```
  35a61b44
- Merge branch 'slurm-2.5' · a58d08b0
  Morris Jette authored Jan 10, 2013
  
  a58d08b0
- Debugged version of job_submit/all_partitions · 27c74430
  Morris Jette authored Jan 10, 2013
  
  27c74430
- Add job_submit/all_partitions plugin · f07f173b
  jette authored Jan 10, 2013
  
  f07f173b
- Note about full deprecation of sacct --dump and --fdump · 23ea70f2
  Danny Auble authored Jan 10, 2013
  
  23ea70f2
- Fix recently introduced bug that prevented scontrol from working with command line input · 037acd7d
  jette authored Jan 09, 2013
  
  037acd7d
- Fix recently introduced bug that prevented scontrol from working with command line input · a6151e5d
  jette authored Jan 09, 2013
  
  a6151e5d
- Merge branch 'slurm-2.5' · dea974a5
  Morris Jette authored Jan 09, 2013
  
  dea974a5
- Clarify documentation with respect to CPU binding · 54ef6172
  Morris Jette authored Jan 09, 2013
  
  54ef6172
- Fix logic to optimize GRES topology with respect to allocated CPUs. · a6bfae71
  Morris Jette authored Jan 09, 2013
  
  a6bfae71
- Merge remote-tracking branch 'origin/slurm-2.5' · fbd601b4
  Danny Auble authored Jan 09, 2013
  
  fbd601b4
- BGQ - minor fix to show correct step node count. · cf922f10
  Danny Auble authored Jan 09, 2013
  
  cf922f10
09 Jan, 2013 1 commit
- Merge remote-tracking branch 'origin/slurm-2.5' · da2e1dd0
  Danny Auble authored Jan 09, 2013
  
  da2e1dd0