Commits · 11e2759f5c390e6d8c0448d6916104c28b4c3344 · Manuel G. Marciani / ces_slurm_simulator

11 Jul, 2012 1 commit
- BLUEGENE - remove race condition where if a block is removed while waiting · 11e2759f
  Danny Auble authored Jul 11, 2012
```
for a job to finish on it the number of unused cpus wasn't updated
correctly.
```
  11e2759f
09 Jul, 2012 1 commit
- Fix bug in task layout with select/cons_res plugin and --ntasks-per-node · f9f087f2
  Martin Perry authored Jul 09, 2012
```
See Bugzilla #73 for more complete description of the problem.
Patch by Martin Perry, Bull.
```
  f9f087f2
06 Jul, 2012 1 commit
- Fix for incorrect partition point for job · dd1d573f
  Carles Fenoy authored Jul 05, 2012

If a job is submitted to more than one partition, its partition pointer can
be set to an invalid value. This can result in the count of CPUs allocated
on a node being wrong, resulting in over- or under-allocation of its CPUs.
Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1.
I was able to reproduce the described behavior by submitting jobs to 2 partitions.
The job is allocated in one partition, but in the schedule function the job's partition is then changed to the one where it was NOT allocated. As a result, the resources cannot be freed at the end of the job.

I've solved this by moving the IS_JOB_PENDING test a few lines up in the schedule function (in job_scheduler.c).

This is the code from the git HEAD (line 801). As this file has changed a lot since 2.4.x I have not made a patch, but I'm describing the solution here.
I've moved the if (!IS_JOB_PENDING) check to just after the 2nd line (part_ptr...). This prevents the job's partition from being changed if the job is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue; /* started in other partition */

Hope this is enough information to solve it.

I've just realized (while writing this mail) that my solution has a memory leak as job_queue_rec is not freed.

Regards,
Carles Fenoy

dd1d573f
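
The reordering described in the dd1d573f message, together with the memory leak its author points out, can be illustrated with a minimal sketch. This is not the actual Slurm code: the structures below are hypothetical, simplified stand-ins, a plain `pending` flag replaces `IS_JOB_PENDING()`, and `free()` replaces `xfree()`. The sketch only shows the ordering: release the queue record first, then skip non-pending jobs before touching their partition pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real Slurm structures. */
struct part_record { const char *name; };
struct job_record {
    bool pending;                 /* stand-in for IS_JOB_PENDING() */
    struct part_record *part_ptr; /* partition the job is bound to */
};
struct job_queue_rec {
    struct job_record *job_ptr;
    struct part_record *part_ptr;
};

/*
 * Walk a queue of (job, partition) records and return how many records
 * were considered for scheduling. Every record is freed, including the
 * ones skipped because the job already started in another partition.
 */
static int process_queue(struct job_queue_rec **queue, int count)
{
    int considered = 0;

    assert(queue != NULL);
    for (int i = 0; i < count; i++) {
        struct job_queue_rec *rec = queue[i];
        struct job_record *job_ptr = rec->job_ptr;
        struct part_record *part_ptr = rec->part_ptr;

        free(rec);      /* freed before any early continue: no leak */
        queue[i] = NULL;

        if (!job_ptr->pending)
            continue;   /* started in other partition: keep its part_ptr */

        job_ptr->part_ptr = part_ptr;
        considered++;
        job_ptr->pending = false; /* assumption: the job starts here */
    }
    return considered;
}
```

With one job queued for two partitions, the second record is skipped and the job keeps the partition pointer of the partition it actually started in.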

04 Jul, 2012 1 commit
- Tweak test for down node · 4dc4fe90
  Morris Jette authored Jul 03, 2012
  
  4dc4fe90
03 Jul, 2012 4 commits
- BLUEGENE - Correct potential deadlock issue when hardware goes bad and · f0949d91
  Danny Auble authored Jul 03, 2012
```
there are jobs running on that hardware.
```
  f0949d91
- Add gres count value check (>0 && <NO_VAL, 0xfffffffe) · 88ad2c61
  Morris Jette authored Jul 03, 2012
  
  88ad2c61
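
The range check named in the 88ad2c61 title (`>0 && <NO_VAL`, with `NO_VAL` given as `0xfffffffe`) could look roughly like the sketch below. The function name `parse_gres_count` and the string-parsing step are assumptions made for illustration; only the bounds come from the commit title.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_VAL 0xfffffffeU /* value quoted in the commit title */

/*
 * Hypothetical helper: parse a decimal count and accept it only if
 * 0 < count < NO_VAL. On success the value is stored in *count.
 */
static bool parse_gres_count(const char *text, uint32_t *count)
{
    char *end = NULL;
    unsigned long long value;

    assert(text != NULL && count != NULL);
    errno = 0;
    value = strtoull(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0')
        return false;   /* not a clean decimal number */
    if (value == 0 || value >= NO_VAL)
        return false;   /* outside the (0, NO_VAL) range */
    *count = (uint32_t)value;
    return true;
}
```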
- Clarify time limit handling in man page. · d37cab14
  Lipari, Don authored Jul 03, 2012
  
  d37cab14
- Fix typo in bluegene web page · 00b78dfa
  Tim Wickberg authored Jul 03, 2012
  
  00b78dfa
02 Jul, 2012 5 commits
- Update META for tag 2.4.1 · c8651870
  Danny Auble authored Jul 02, 2012
  
  c8651870
- fix to make 2.4.0 work to 2.4.1 state · 219aa3e8
  Danny Auble authored Jul 02, 2012
  
  219aa3e8
- Fix bug for job state change from 2.3 -> 2.4 job state can now be preserved · 3bc86988
  Carles Fenoy authored Jul 02, 2012
```
correctly when transitioning.  This also applies for 2.4.0 -> 2.4.1, no
state will be lost. (Thanks to Carles Fenoy)
```
  3bc86988
- Note maximum gres count is 4G · f35ad166
  Morris Jette authored Jul 02, 2012
  
  f35ad166
- Note maximum gres count supported · 9410e98e
  Morris Jette authored Jul 02, 2012
  
  9410e98e
29 Jun, 2012 2 commits
- Document that gang scheduled jobs all must fit into memory · 8bad9a3c
  Morris Jette authored Jun 29, 2012
  
  8bad9a3c
- fix mpi formatting problem in slurm.conf man page · f259fca4
  Morris Jette authored Jun 28, 2012
  
  f259fca4
28 Jun, 2012 2 commits
- Changes for 2.4 tag · 94ea2e84
  Danny Auble authored Jun 28, 2012
  
  94ea2e84
- Fix typos intialize->initialize from Janne Blomqvist · 5173c388
  Janne Blomqvist authored Jun 28, 2012
```
janne.blomqvist@aalto.fi
```
  5173c388
27 Jun, 2012 3 commits
- Fix for setting reason field for user/system hold · a5431885
  Mark Nelson authored Jun 27, 2012
  
  a5431885
- Note how --distribution=arbitrary does not control task layout at job level · 24e9ee76
  Morris Jette authored Jun 27, 2012
  
  24e9ee76
- Fix for step arbitrary allocation with hostlist from job's env vars · 07407fc3
  Morris Jette authored Jun 26, 2012
  
  07407fc3
26 Jun, 2012 10 commits
- Added logic for a Natural Sort · 369437f1
  Danny Auble authored Jun 26, 2012
```
(via code from Martin Pool <mbp sourcefrog net>)
so we can get a correct alphanumeric sort of hostnames.
```
  369437f1
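
To illustrate why a natural sort matters for hostnames, here is a simplified comparison function. It is not the Martin Pool code referenced in 369437f1, only a sketch of the general idea under simple assumptions: runs of digits are compared by numeric value (leading zeros ignored), everything else byte by byte. The names `natural_cmp` and `natural_cmp_qsort` are made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Compare two strings so that embedded numbers sort by value. */
static int natural_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a && *b) {
        if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
            size_t la = 0, lb = 0;

            /* Leading zeros do not change the numeric value. */
            while (*a == '0')
                a++;
            while (*b == '0')
                b++;
            while (isdigit((unsigned char)a[la]))
                la++;
            while (isdigit((unsigned char)b[lb]))
                lb++;
            if (la != lb)           /* more digits: larger number */
                return la < lb ? -1 : 1;
            for (size_t i = 0; i < la; i++) {
                if (a[i] != b[i])
                    return a[i] < b[i] ? -1 : 1;
            }
            a += la;
            b += lb;
        } else {
            if (*a != *b)
                return (unsigned char)*a < (unsigned char)*b ? -1 : 1;
            a++;
            b++;
        }
    }
    if (*a)
        return 1;   /* a is longer */
    if (*b)
        return -1;  /* b is longer */
    return 0;
}

/* Adapter so an array of C strings can be passed to qsort(). */
static int natural_cmp_qsort(const void *pa, const void *pb)
{
    return natural_cmp(*(const char *const *)pa,
                       *(const char *const *)pb);
}
```

A plain `strcmp()` ordering would place `tux10` before `tux2`; with this comparison, `qsort()` orders `tux1`, `tux2`, `tux10`. Note that the sketch treats names differing only in leading zeros (such as `node010` and `node10`) as equal.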
- Correct plugin race condition introduced in · 7bed576b
  Morris Jette authored Jun 26, 2012
```
https://github.com/SchedMD/slurm/commit/e7c17c70a899fb98c9054272ee078f67b2f7e4fc
```
  7bed576b
- make sure sview sorts nodes in partition list. · 10d43dfa
  Danny Auble authored Jun 26, 2012
  
  10d43dfa
- Backport expect test for select/serial · caa000de
  Morris Jette authored Jun 26, 2012
  
  caa000de
- update date · 562fb157
  Danny Auble authored Jun 26, 2012
  
  562fb157
- BGQ - change linking from libslurm.o to libslurmhelper.la to avoid warning. · d83bf5f8
  Danny Auble authored Jun 26, 2012
  
  d83bf5f8
- BGQ - Modified documents to explain new plugin_flags needed in · f8ae9a15
  Danny Auble authored Jun 26, 2012
```
bg.properties in order for the runjob_mux to run correctly.

Signed-off-by: Danny Auble <da@schedmd.com>
```
  f8ae9a15
- replace signed with unsigned counter to avoid warning when compiling with · f7a14fa5
  Danny Auble authored Jun 26, 2012
```
c++
```
  f7a14fa5
- If preempted job should have a grace time and preempt mode is not cancel · 501688ed
  Danny Auble authored Jun 26, 2012
```
but job is going to be canceled because it is interactive or other reason
it now receives the grace time.
```
  501688ed
- Put nodes names in alphabetic order in node table. · 37587b6b
  Morris Jette authored Jun 26, 2012
  
  37587b6b
25 Jun, 2012 6 commits
- BLUEGENE - fix issue if a cable was in an error state make it so we can · 337db3f1
  Danny Auble authored Jun 25, 2012
```
check if a block is still makable if the cable wasn't in error.
```
  337db3f1
- BLUEGENE - remove xassert if num_unused_cpus isn't correct · 013a496b
  Danny Auble authored Jun 25, 2012
  
  013a496b
- BLUEGENE - fix possible race condition if cleaning up a block and the · 66c0a2b3
  Danny Auble authored Jun 25, 2012
```
removal of the job on the block failed.
```
  66c0a2b3
- same as last patch for alloc_cpus -> cpus_alloc · 9955b7df
  Danny Auble authored Jun 25, 2012
  
  9955b7df
- Fix bug when querying accounting looking for a job node size. · bbb4e741
  Danny Auble authored Jun 25, 2012
  
  bbb4e741
- Add FAQ about job start time estimate · 31dcc0c2
  Rod Schultz authored Jun 24, 2012
  
  31dcc0c2
22 Jun, 2012 4 commits
- remove NEWS item missed from commit · 86196b70
  Danny Auble authored Jun 22, 2012
```
29d79ef8
```
  86196b70
- BLUEGENE - alter node count correctly if not given but task count is. · a92947d6
  Danny Auble authored Jun 22, 2012
  
  a92947d6
- BLUEGENE - fix race condition where if a nodeboard/card goes down at the · ea8ca91d
  Danny Auble authored Jun 22, 2012
```
same time a block is destroyed and that block just happens to be the
smallest overlapping block over the bad hardware.
```
  ea8ca91d
- Move logic to always give the first · c79cd503
  Danny Auble authored Jun 22, 2012
  
  c79cd503