- 12 Jul, 2012 12 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
where other blocks on an overlapping midplane are running jobs.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 11 Jul, 2012 11 commits
-
-
Danny Auble authored
-
Danny Auble authored
hardware is marked bad, remove the larger block and create a block over just the bad hardware, making the other hardware available to run on.
-
Morris Jette authored
-
Danny Auble authored
allocation.
-
Danny Auble authored
-
Danny Auble authored
for a job to finish on it, the number of unused CPUs wasn't updated correctly.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
same type and network ID. Add logic to match adapter name also. This is needed due to the additional IP_ONLY adapter named virbr0 as used for virtualization.
-
Morris Jette authored
-
Morris Jette authored
-
- 10 Jul, 2012 4 commits
-
-
Danny Auble authored
-
Morris Jette authored
When using the jobcomp/script interface, we have noticed the NODECNT environment variable is off by one when logging completed jobs in the NODE_FAIL state (though the NODELIST is correct). This appears to be because, in many places, job_completion_logger() is called after deallocate_nodes(), which appears to decrement job->node_cnt for DOWN nodes.

If job_completion_logger() only called the job completion plugin, then I would guess it might be safe to move this call ahead of deallocate_nodes(). However, it seems like job_completion_logger() also does a bunch of accounting work, so perhaps that would need to be split out first.

Also, there is the possibility that this is working as designed, though if so a well-placed comment in the code would be appreciated. If the decreased node count is intended, though, should the DOWN nodes also be removed from the job's NODELIST? - Mark Grondona
-
Morris Jette authored
-
Morris Jette authored
-
- 09 Jul, 2012 1 commit
-
-
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- 07 Jul, 2012 1 commit
-
-
Morris Jette authored
Change the --network option. Rather than just putting the adapter name as a token in the option, specify it with the keyword "devname=".
-
- 06 Jul, 2012 4 commits
-
-
Morris Jette authored
The document still needs work, but it is a decent start.
-
Morris Jette authored
This move reduces the risk of srun failing horribly due to code that is inconsistent with the plugins if srun is running during a SLURM upgrade, especially a major upgrade in which the plugin function arguments can change.
-
Morris Jette authored
Conflicts: src/slurmctld/job_scheduler.c
-
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can result in a bad count of CPUs allocated on a node, causing over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the described behavior by submitting jobs to two partitions. The job is then allocated in one partition, but in the schedule function the job's partition is changed to the one it was NOT allocated in, so the resources cannot be freed at the end of the job.

I've solved this by moving the IS_PENDING test up a few lines in the schedule function (job_scheduler.c). Below is the code from the git HEAD (line 801). As this file has changed a lot since 2.4.x I have not made a patch, but I'm describing the solution here: I've moved the if (!IS_JOB_PENDING) test to just after the second line (part_ptr = ...). This prevents the job's partition from being changed if it has already started in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue;  /* started in other partition */

Hope this is enough information to solve it. I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed.

Regards,
Carles Fenoy
-
- 05 Jul, 2012 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
This code change is completely different from IBM's example code, but eliminates memory leaks that exist in IBM's sample code.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 04 Jul, 2012 1 commit
-
-
Morris Jette authored
Conflicts: NEWS
-