- 24 Jul, 2012 1 commit
-
Danny Auble authored
-
- 23 Jul, 2012 1 commit
-
Morris Jette authored
Cray and BlueGene - Do not treat a lack of usable front-end nodes as a fatal error when the slurmctld daemon starts. Also preserve the correct front-end node for jobs when there is more than one front-end node and the slurmctld daemon restarts.
-
- 19 Jul, 2012 8 commits
-
Danny Auble authored
-
Danny Auble authored
while it is attempting to free underlying hardware is marked in error making small blocks overlapping with the freeing block. This only applies to dynamic layout mode.
-
Bill Brophy authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Francois Diakhate authored
-
Alejandro Lucero Palau authored
-
- 17 Jul, 2012 3 commits
-
Morris Jette authored
This corresponds to commit dd2dce54 from Mark Grondona's work in squeue, but applied to the sview command.
-
Morris Jette authored
Slurm 2.4 minor fixes
-
Morris Jette authored
-
- 16 Jul, 2012 1 commit
-
Morris Jette authored
This addresses trouble ticket 85
-
- 13 Jul, 2012 9 commits
-
Danny Auble authored
runjob_mux
-
Danny Auble authored
is always set when sending or receiving a message.
-
Tim Wickberg authored
-
Mark A. Grondona authored
Set SLURM_CONF in the default prolog/epilog environment instead of only in the spank prolog/epilog environment. This change fixes a potential hang during spank prolog/epilog execution caused by the possibility of memory allocation after fork(2) and before exec(2) when invoking slurmstepd spank prolog|epilog. It also has the benefit that SLURM commands used in prolog and epilog scripts will use the correct slurm.conf file.
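A minimal sketch of the pattern this describes, with made-up paths and without Slurm's real launch code: the child's environment, including SLURM_CONF, is assembled before fork(2), so nothing between fork(2) and exec(2) has to allocate memory.

    /* Illustrative only; paths and names are assumptions, not from the commit. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static int run_prolog(const char *script, const char *conf)
    {
            char env_conf[4096];

            /* Prepare argv/envp before fork(2); no allocation needed later. */
            snprintf(env_conf, sizeof(env_conf), "SLURM_CONF=%s", conf);
            char *child_argv[] = { (char *) script, NULL };
            char *child_envp[] = { env_conf, NULL };

            pid_t pid = fork();
            if (pid < 0)
                    return -1;
            if (pid == 0) {
                    /* Child: only async-signal-safe calls until exec(2). */
                    execve(script, child_argv, child_envp);
                    _exit(127);     /* exec failed */
            }

            int status = 0;
            if (waitpid(pid, &status, 0) < 0)
                    return -1;
            return status;
    }

    int main(void)
    {
            int rc = run_prolog("/etc/slurm/prolog", "/etc/slurm/slurm.conf");
            printf("prolog exit status: %d\n", rc);
            return 0;
    }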
-
Mark A. Grondona authored
If exec_wait_child_wait_for_parent() fails for any reason, it is safer to abort immediately rather than proceed to execute the user's job.
-
Mark A. Grondona authored
On a failure of fork(2), slurmstepd would print an error and exit, possibly leaving previously forked children waiting. Ensure a better cleanup by killing all active children on fork failure before exiting slurmstepd.
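A hedged sketch of that cleanup, using placeholder task logic rather than slurmstepd's real launch path: if fork(2) fails partway through, the children forked so far are killed and reaped before exiting.

    /* Illustrative only; NTASKS and the pause() children stand in for real tasks. */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            enum { NTASKS = 4 };
            pid_t pids[NTASKS];
            int forked = 0;

            for (int i = 0; i < NTASKS; i++) {
                    pid_t pid = fork();
                    if (pid == 0) {         /* child: placeholder for a task */
                            pause();
                            _exit(0);
                    }
                    if (pid < 0) {
                            perror("fork");
                            /* Kill and reap every child forked so far instead
                             * of exiting and leaving them waiting forever. */
                            for (int j = 0; j < forked; j++)
                                    kill(pids[j], SIGKILL);
                            for (int j = 0; j < forked; j++)
                                    waitpid(pids[j], NULL, 0);
                            exit(1);
                    }
                    pids[forked++] = pid;
            }

            /* Normal path: tear down the placeholder children. */
            for (int i = 0; i < forked; i++)
                    kill(pids[i], SIGTERM);
            for (int i = 0; i < forked; i++)
                    waitpid(pids[i], NULL, 0);
            return 0;
    }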
-
Mark A. Grondona authored
Close the read end of the pipe slurmstepd uses to notify children it is time to call exec(2) in order to save one file descriptor per task. (Previously, the read side of the pipe wasn't closed until exec_wait_info was destroyed)
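A sketch of the notify-to-exec pipe with invented names (not the exec_wait_* code itself); it also illustrates the earlier point that a child which fails to hear from its parent should abort rather than run the user's job:

    /* Illustrative only. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            int fds[2];             /* fds[0] = read end, fds[1] = write end */
            pid_t pid;
            char go;

            if (pipe(fds) < 0) {
                    perror("pipe");
                    return 1;
            }

            pid = fork();
            if (pid < 0) {
                    perror("fork");
                    return 1;
            }

            if (pid == 0) {
                    /* Child: block until the parent says it is time to exec(2). */
                    close(fds[1]);
                    if (read(fds[0], &go, 1) != 1)
                            _exit(1);       /* abort; don't run the job anyway */
                    close(fds[0]);
                    execlp("true", "true", (char *) NULL);
                    _exit(127);
            }

            /* Parent: close the read end right after fork(2), keeping only the
             * write end, i.e. one descriptor per task instead of two. */
            close(fds[0]);
            if (write(fds[1], "x", 1) != 1)
                    perror("write");
            close(fds[1]);
            waitpid(pid, NULL, 0);
            return 0;
    }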
-
Mark A. Grondona authored
For some reason squeue was treating completing jobs the same as pending jobs, and reported the number of nodes as the maximum of the requested node list, requested node count, or CPUs (divided into nodes?). This is in contrast to the squeue man page, which explicitly states that the number of nodes reported for completing jobs should be only the nodes that are still allocated to the job. This patch removes the special handling of completing jobs in src/squeue/print.c:_get_node_cnt(), so that the squeue output for completing jobs matches the documentation. A comment is also added so that developers looking at the code understand what is going on.
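A toy version of that reporting rule, with a hypothetical job struct rather than squeue's real types: only pending jobs fall back to an estimate, while completing (and running) jobs report the nodes still allocated to them.

    /* Hypothetical types; the real logic lives in src/squeue/print.c. */
    #include <stdio.h>

    enum job_state { JOB_PENDING, JOB_RUNNING, JOB_COMPLETING };

    struct job {
            enum job_state state;
            int alloc_node_cnt;     /* nodes currently allocated to the job */
            int req_node_cnt;       /* node count requested at submission   */
    };

    static int node_cnt(const struct job *job)
    {
            if (job->state == JOB_PENDING)
                    return job->req_node_cnt;   /* estimate for pending jobs */
            return job->alloc_node_cnt;         /* completing: still-allocated nodes */
    }

    int main(void)
    {
            struct job completing = { JOB_COMPLETING, 2, 16 };
            printf("completing job reports %d node(s)\n", node_cnt(&completing));
            return 0;
    }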
-
Morris Jette authored
-
- 12 Jul, 2012 10 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
than 1 midplane but not the entire allocation.
-
Danny Auble authored
multi midplane block allocation.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
where other blocks on an overlapping midplane are running jobs.
-
Morris Jette authored
-
- 11 Jul, 2012 4 commits
-
Danny Auble authored
hardware is marked bad, remove the larger block and create a block over just the bad hardware, making the other hardware available to run on.
-
Danny Auble authored
allocation.
-
Danny Auble authored
-
Danny Auble authored
for a job to finish on it the number of unused cpus wasn't updated correctly.
-
- 09 Jul, 2012 1 commit
-
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- 06 Jul, 2012 1 commit
-
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can result in the count of CPUs allocated on a node being bad, resulting in over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the described behavior by submitting jobs to 2 partitions. This causes the job to be allocated in one partition, but in the schedule function the partition of the job is changed to the non-allocated one, so the resources cannot be freed at the end of the job.

I've solved this by changing the IS_PENDING test a few lines above in the schedule function (job_scheduler.c). This is the code from the git HEAD (line 801). As this file has changed a lot from 2.4.x I have not made a patch, but I'm describing the solution here. I've moved the if (!IS_JOB_PENDING) check to after the second line (part_ptr ...). This prevents the partition of the job from being changed if it is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
            continue;  /* started in other partition */

Hope this is enough information to solve it. I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed.

Regards,
Carles Fenoy
-
- 04 Jul, 2012 1 commit
-
Morris Jette authored
-