- 12 Jul, 2012 12 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
where other blocks on an overlapping midplane are running jobs.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 11 Jul, 2012 11 commits
-
-
Danny Auble authored
-
Danny Auble authored
hardware is marked bad, remove the larger block and create a block over just the bad hardware, making the other hardware available to run on.
-
Morris Jette authored
-
Danny Auble authored
allocation.
-
Danny Auble authored
-
Danny Auble authored
for a job to finish on it, the number of unused CPUs wasn't updated correctly.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
same type and network ID. Add logic to match adapter name also. This is needed due to the additional IP_ONLY adapter named virbr0 as used for virtualization.
-
Morris Jette authored
-
Morris Jette authored
-
- 10 Jul, 2012 4 commits
-
-
Danny Auble authored
-
Morris Jette authored
When using the jobcomp/script interface, we have noticed the NODECNT environment variable is off by one when logging completed jobs in the NODE_FAIL state (though the NODELIST is correct). This appears to be because, in many places, job_completion_logger() is called after deallocate_nodes(), which appears to decrement job->node_cnt for DOWN nodes.

If job_completion_logger() only called the job completion plugin, then I would guess it might be safe to move this call ahead of deallocate_nodes(). However, it seems like job_completion_logger() also does a bunch of accounting work, so perhaps that would need to be split out first.

Also, there is the possibility that this is working as designed, though if so a well-placed comment in the code would be appreciated. If the decreased node count is intended, though, should the DOWN nodes also be removed from the job's NODELIST? - Mark Grondona
-
Morris Jette authored
-
Morris Jette authored
-
- 09 Jul, 2012 1 commit
-
-
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- 07 Jul, 2012 1 commit
-
-
Morris Jette authored
Change the --network option. Rather than just putting the adapter name as a token in the option, specify it with the keyword "devname=".
-
- 06 Jul, 2012 4 commits
-
-
Morris Jette authored
The document still needs work, but it is a decent start.
-
Morris Jette authored
This move reduces the risk of srun failing horribly due to code that is inconsistent with the plugins if srun is running during a SLURM upgrade, especially a major upgrade in which the plugin function arguments can change.
-
Morris Jette authored
Conflicts: src/slurmctld/job_scheduler.c
-
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can result in a bad count of CPUs allocated on a node, causing over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the described behavior by submitting jobs to two partitions. The job is then allocated in one partition, but in the schedule function the job's partition is changed to the one it was NOT allocated in, so the resources cannot be freed at the end of the job.

I've solved this by moving the IS_PENDING test up a few lines in the schedule function (job_scheduler.c). Below is the code from the git HEAD (line 801). As this file has changed a lot since 2.4.x I have not made a patch, but I'm describing the solution here: I've moved the if (!IS_JOB_PENDING) test to just after the second line (part_ptr = ...). This prevents the job's partition from being changed if it has already started in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue;  /* started in other partition */

Hope this is enough information to solve it. I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed.

Regards,
Carles Fenoy
-
- 05 Jul, 2012 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
This code change is completely different from IBM's example code, but eliminates memory leaks that exist in IBM's sample code.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 04 Jul, 2012 1 commit
-
-
Morris Jette authored
Conflicts: NEWS
-