Commits · 3912e694629b1c850fe96ffe47af9448a6a5a27d · Manuel G. Marciani / ces_slurm_simulator

10 Aug, 2012 1 commit

Improve reason information for job not schedulable due to busy nodes · 3912e694

Morris Jette authored Aug 10, 2012

Return ESLURM_NODES_BUSY rather than ESLURM_NODE_NOT_AVAIL error on job
submit when required nodes are up, but completing a job or in exclusive
job allocation.

3912e694
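
The distinction this commit draws can be sketched as follows. This is a minimal illustration, not Slurm's code: the `node_view` struct, the function name, and the `SKETCH_` enum values are all hypothetical stand-ins (the real error codes are defined in Slurm's own headers), and letting an unavailable node take precedence over a busy one is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in codes for illustration only; the real values are defined by Slurm. */
enum submit_error {
    SKETCH_SUBMIT_OK = 0,
    SKETCH_ESLURM_NODE_NOT_AVAIL,
    SKETCH_ESLURM_NODES_BUSY
};

/* Hypothetical, simplified view of one required node. */
struct node_view {
    bool up;         /* node is usable (not down or drained) */
    bool completing; /* node is still completing a previous job */
    bool exclusive;  /* node is held by an exclusive job allocation */
};

/*
 * Pick the reason reported at submit time for a set of required nodes.
 * A node that is up but completing or exclusively allocated is "busy";
 * only a node that is not up at all is "not available".
 * Assumption: an unavailable node takes precedence over a busy one.
 */
enum submit_error reason_for_required_nodes(const struct node_view *nodes,
                                            int count)
{
    assert(nodes != NULL || count == 0);

    enum submit_error reason = SKETCH_SUBMIT_OK;
    for (int i = 0; i < count; i++) {
        if (!nodes[i].up)
            return SKETCH_ESLURM_NODE_NOT_AVAIL;
        if (nodes[i].completing || nodes[i].exclusive)
            reason = SKETCH_ESLURM_NODES_BUSY;
    }
    return reason;
}
```

With this shape, a job whose required nodes are merely occupied is told to wait rather than being told the nodes cannot be used.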

09 Aug, 2012 2 commits

Fix sbcast's credential to last till the end of a job instead of the · 069eead2

Matthieu Hautreux authored Aug 09, 2012

previous 20 minute time limit. The previous behavior would fail for
large files 20 minutes into the transfer.

069eead2

Fix open file descriptor leak · 723f10a0

Morris Jette authored Aug 09, 2012

Close the batch job's environment file when it contains no data to avoid
leaking file descriptors.

723f10a0
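
The class of bug fixed here, an early return that skips `close()`, can be sketched as below. This is an illustrative helper under stated assumptions, not the Slurm function: the name `load_env_file` and its signature are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a whole file into a malloc'd buffer.
 * Returns NULL (with *len_out == 0) if the file cannot be read or is empty.
 * Every return path after a successful open() closes the descriptor.
 */
char *load_env_file(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);
    *len_out = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd); /* the "no data" path must not skip this */
        return NULL;
    }

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size);
    if (buf == NULL) {
        close(fd);
        return NULL;
    }

    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, buf + done, size - done);
        if (n <= 0)
            break; /* error or unexpected end of file */
        done += (size_t)n;
    }
    close(fd);

    if (done != size) {
        free(buf);
        return NULL;
    }
    *len_out = size;
    return buf;
}
```

In a long-running daemon that handles many batch jobs, one descriptor leaked per empty file eventually exhausts the process's descriptor limit, which is why the empty-file branch matters.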

08 Aug, 2012 2 commits
- BLUEGENE - updated documentation. · e4785c96
  Danny Auble authored Aug 08, 2012
  
  e4785c96
- BGQ - fix smap to set the correct default MloaderImage · 93993d17
  Danny Auble authored Aug 08, 2012
  
  93993d17
07 Aug, 2012 1 commit
- Fix salloc --gid to work correctly. Reported by Brian Gilmer · 57ad4e5f
  Brian Gilmer authored Aug 07, 2012
  
  57ad4e5f
06 Aug, 2012 1 commit
- BGQ - fixed srun when only requesting a task count and not a node count · a95c7b51
  Danny Auble authored Aug 06, 2012
  
  to operate the same way salloc or sbatch did and assign a task per cpu
  by default instead of task per node.
  
  a95c7b51
03 Aug, 2012 1 commit
- cgroups - fix if initial directory is non-existent SLURM creates it · 86a38b40
  Danny Auble authored Aug 03, 2012
  
  correctly. Previously, errno wasn't being checked correctly.
  
  86a38b40
01 Aug, 2012 4 commits
- Accounting - Fix so complete 32 bit numbers can be put in for a priority. · e3d2b258
  Danny Auble authored Aug 01, 2012
  
  e3d2b258
- Start NEWS for version 2.4.3 · a4c02041
  Morris Jette authored Aug 01, 2012
  
  a4c02041
- BGQ - update documentation about runjob_mux_refresh_config which works · 5a110989
  Danny Auble authored Aug 01, 2012
  
  correctly as of IBM driver V1R1M1 efix 008.
  
  5a110989
- FRONTEND - Made error warning more apparent if a frontend node isn't · 4f24ccb5
  Danny Auble authored Aug 01, 2012
  
  configured correctly.
  
  4f24ccb5
31 Jul, 2012 7 commits

Fixed sacct --state=S query to return information about suspended jobs · f8ff6b38

Danny Auble authored Jul 31, 2012

current or in the past.

f8ff6b38

BLUEGENE - correct start time setup when no jobs are blocking the way · 50f698d9

Mark Nelson authored Jul 31, 2012

from Mark Nelson

50f698d9

Use mount and umount syscalls when handling cgroup namespaces. · 485c80bc

Janne Blomqvist authored Jul 31, 2012

Using the syscalls directly rather than calling bin/(u)mount via
system() avoids a few fork + exec calls, and provides better error
handling if something goes wrong.

Users of this functionality are also updated to use slurm_strerror in
order to provide a more informative error message.

The mount and umount syscalls are Linux-specific, but so are cgroups
so no portability is lost.

485c80bc
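
The approach described in this commit message, calling `mount(2)` and `umount(2)` directly instead of shelling out through `system()`, might look roughly like this. The wrapper names and the choice of mount flags are assumptions made for illustration, and the standard `strerror` stands in for the `slurm_strerror` call that the commit mentions.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical wrapper: mount a cgroup hierarchy at `target` with the given
 * comma-separated subsystem list (for example "cpuset" or "freezer").
 * Returns 0 on success, -1 with errno preserved on failure.
 */
int cgroup_mount(const char *target, const char *subsystems)
{
    assert(target != NULL && subsystems != NULL);

    /* Flags chosen for illustration; the source does not state which are used. */
    unsigned long flags = MS_NOSUID | MS_NOEXEC | MS_NODEV;

    if (mount("cgroup", target, "cgroup", flags, subsystems) != 0) {
        int saved = errno; /* fprintf may clobber errno */
        fprintf(stderr, "mount(%s, %s) failed: %s\n",
                target, subsystems, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}

/* Hypothetical wrapper: unmount the hierarchy mounted at `target`. */
int cgroup_umount(const char *target)
{
    assert(target != NULL);

    if (umount(target) != 0) {
        int saved = errno;
        fprintf(stderr, "umount(%s) failed: %s\n", target, strerror(saved));
        errno = saved;
        return -1;
    }
    return 0;
}
```

Compared with `system("/bin/mount ...")`, the caller gets the precise `errno` from the kernel instead of a shell exit status, and no child process is forked.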

remove last patch to give author credit · 557c52d1
Danny Auble authored Jul 31, 2012

557c52d1

Use mount and umount syscalls when handling cgroup namespaces. · c3889ec4

Danny Auble authored Jul 31, 2012

Using the syscalls directly rather than calling bin/(u)mount via
system() avoids a few fork + exec calls, and provides better error
handling if something goes wrong.

Users of this functionality are also updated to use slurm_strerror in
order to provide a more informative error message.

The mount and umount syscalls are Linux-specific, but so are cgroups
so no portability is lost.

c3889ec4

Use mount and umount syscalls when handling cgroup namespaces. · b4c1d3d7

Danny Auble authored Jul 31, 2012

Using the syscalls directly rather than calling bin/(u)mount via
system() avoids a few fork + exec calls, and provides better error
handling if something goes wrong.

Users of this functionality are also updated to use slurm_strerror in
order to provide a more informative error message.

The mount and umount syscalls are Linux-specific, but so are cgroups
so no portability is lost.

b4c1d3d7

BGQ - added version string to the load of the runjob_mux plugin to verify · 610cfe65

Danny Auble authored Jul 31, 2012

the current plugin has been loaded when using runjob_mux_refresh_config

610cfe65

26 Jul, 2012 1 commit

Correct parsing of srun/sbatch input/output/error file names starting with "none" · 4234e00a

Morris Jette authored Jul 26, 2012

Correct parsing of srun/sbatch input/output/error file names so that only
the name "none" is mapped to /dev/null and not any file name starting
with "none" (e.g. "none.o"). This fixes bug #98.

4234e00a
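
The difference between matching the whole name and matching only a prefix can be sketched as below. The function name `resolve_io_name` is hypothetical, and whether the real comparison is case-sensitive is not stated in the commit message; a case-sensitive `strcmp` is used here as an assumption.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: map a user-supplied I/O file name to the path to open.
 * Only the exact name "none" is redirected to /dev/null.
 */
const char *resolve_io_name(const char *name)
{
    assert(name != NULL);

    /*
     * A prefix test such as strncmp(name, "none", 4) == 0 would also
     * match "none.o" or "nonesuch.log"; a whole-string comparison does not.
     */
    if (strcmp(name, "none") == 0)
        return "/dev/null";

    return name;
}
```

With this check, `resolve_io_name("none")` yields `/dev/null`, while `resolve_io_name("none.o")` returns the name unchanged.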

24 Jul, 2012 1 commit

Gres: Fix for tracking allocated resources when one item and associated file · 102258a2

Morris Jette authored Jul 24, 2012

Gres: If a gres has a count of one and an associated file then when doing
a reconfiguration, the node's bitmap was not cleared resulting in an
underflow upon job termination or removal from scheduling matrix by the
backfill scheduler.

102258a2

23 Jul, 2012 1 commit

Cray and BlueGene: Correct logic for front-end node allocation tracking · ca95f242

Morris Jette authored Jul 23, 2012

Cray and BlueGene - Do not treat lack of usable front-end nodes when
slurmctld daemon starts as a fatal error. Also preserve correct front-end
node for jobs when there is more than one front-end node and the slurmctld
daemon restarts.

ca95f242

19 Jul, 2012 2 commits
- BLUEGENE - Fix for handling blocks when a larger block will not free and · 1b2b3c85
  Danny Auble authored Jul 19, 2012
  
  while it is attempting to free underlying hardware is marked in error
  making small blocks overlapping with the freeing block.  This only
  applies to dynamic layout mode.
  
  1b2b3c85
- Reset backfilled job counter only when explicitly cleared using scontrol. · b4202119
  Alejandro Lucero Palau authored Jul 19, 2012
  
  b4202119
13 Jul, 2012 2 commits
- Fix initialization of protocol_version for some messages to make sure it · b34e5c28
  Danny Auble authored Jul 13, 2012
  
  is always set when sending or receiving a message.
  
  b34e5c28
- BGL - Fix for syncing users on block from Tim Wickberg · 865bec2a
  Tim Wickberg authored Jul 13, 2012
  
  865bec2a
12 Jul, 2012 4 commits
- BGQ - Make it possible for a multi midplane allocation to run on more · 010570f4
  Danny Auble authored Jul 12, 2012
  
  than 1 midplane but not the entire allocation.
  
  010570f4
- BGQ - correct logic to place multiple (< 1 midplane) steps inside a · 5ed86088
  Danny Auble authored Jul 12, 2012
  
  multi midplane block allocation.
  
  5ed86088
- BGQ - correctly remove running jobs when freeing a shared block. · a1f9b6a7
  Danny Auble authored Jul 12, 2012
  
  a1f9b6a7
- BLUEGENE - Handle job completion correctly if an admin removes a block · 5430c095
  Danny Auble authored Jul 12, 2012
  
  where other blocks on an overlapping midplane are running jobs.
  
  5430c095
11 Jul, 2012 3 commits
- BLUEGENE - If a large block (> 1 midplane) is in error and underlying · 0c371d36
  Danny Auble authored Jul 11, 2012
  
  hardware is marked bad remove the larger block and create a block over
  just the bad hardware making the other hardware available to run on.
  
  0c371d36
- BGQ - make sure we have a valid block when creating or finishing a step · 4731a11b
  Danny Auble authored Jul 11, 2012
  
  allocation.
  
  4731a11b
- BLUEGENE - remove race condition where if a block is removed while waiting · 11e2759f
  Danny Auble authored Jul 11, 2012
  
  for a job to finish on it the number of unused cpus wasn't updated
  correctly.
  
  11e2759f
09 Jul, 2012 1 commit
- Fix bug in task layout with select/cons_res plugin and --ntasks-per-node · f9f087f2
  Martin Perry authored Jul 09, 2012
  
  See Bugzilla #73 for more complete description of the problem.
  Patch by Martin Perry, Bull.
  
  f9f087f2
06 Jul, 2012 1 commit

Fix for incorrect partition pointer for job · dd1d573f

Carles Fenoy authored Jul 05, 2012

If a job is submitted to more than one partition, its partition pointer can
be set to an invalid value. This can result in the count of CPUs allocated
on a node being bad, resulting in over- or under-allocation of its CPUs.
Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1.
I was able to reproduce the described behavior by submitting jobs to 2 partitions.
The job is allocated in one partition, but in the schedule function the job's partition is changed to the NON-allocated one. As a result, the resources cannot be freed at the end of the job.

I've solved this by moving the IS_PENDING test a few lines up in the schedule function (in job_scheduler.c).

This is the code from the git HEAD (Line 801). As this file has changed a lot since 2.4.x I have not made a patch, but I'm describing the solution here.
I've moved the if(!IS_JOB_PENDING) to after the 2nd line (part_ptr...). This prevents the job's partition from being changed if it is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue; /* started in other partition */

Hope this is enough information to solve it.

I've just realized (while writing this mail) that my solution has a memory leak as job_queue_rec is not freed.

Regards,
Carles Fenoy

dd1d573f
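
The reordering described in the mail above, combined with freeing the queue record before any `continue` so the leak the author mentions does not occur, can be sketched in a self-contained form. Everything here is a simplified stand-in: the structs, the `pending` flag, and `schedule_sketch` are hypothetical, plain `free` replaces `xfree`, and the fix actually committed upstream may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-ins for the scheduler's data structures. */
struct part { const char *name; };
struct job { bool pending; struct part *part_ptr; };
struct job_queue_rec { struct job *job_ptr; struct part *part_ptr; };

/*
 * Walk the queue once. A job submitted to several partitions appears once
 * per partition. Returns the number of jobs started. Each record is freed
 * and its queue slot set to NULL.
 */
int schedule_sketch(struct job_queue_rec **queue, int count)
{
    assert(queue != NULL || count == 0);

    int started = 0;
    for (int i = 0; i < count; i++) {
        struct job_queue_rec *rec = queue[i];
        struct job *job_ptr = rec->job_ptr;
        struct part *part_ptr = rec->part_ptr;

        /* Free the record before any early exit so it cannot leak. */
        free(rec);
        queue[i] = NULL;

        /* Test first: do not touch part_ptr of a job that already started. */
        if (!job_ptr->pending)
            continue; /* started in other partition */

        job_ptr->part_ptr = part_ptr;
        job_ptr->pending = false; /* stand-in for a successful allocation */
        started++;
    }
    return started;
}
```

If one job is queued for partitions "a" and "b" and starts in "a", the second record is skipped and the job keeps pointing at "a", so its resources are later released against the partition that actually allocated them.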

03 Jul, 2012 1 commit
- BLUEGENE - Correct potential deadlock issue when hardware goes bad and · f0949d91
  Danny Auble authored Jul 03, 2012
  
  there are jobs running on that hardware.
  
  f0949d91
02 Jul, 2012 1 commit
- Fix bug for job state change from 2.3 -> 2.4 job state can now be preserved · 3bc86988
  Carles Fenoy authored Jul 02, 2012
  
  correctly when transitioning.  This also applies for 2.4.0 -> 2.4.1, no
  state will be lost. (Thanks to Carles Fenoy)
  
  3bc86988
28 Jun, 2012 1 commit
- Changes for 2.4 tag · 94ea2e84
  Danny Auble authored Jun 28, 2012
  
  94ea2e84
26 Jun, 2012 2 commits
- BGQ - change linking from libslurm.o to libslurmhelper.la to avoid warning. · d83bf5f8
  Danny Auble authored Jun 26, 2012
  
  d83bf5f8
- BGQ - Modified documents to explain new plugin_flags needed in · f8ae9a15
  Danny Auble authored Jun 26, 2012
  
  bg.properties in order for the runjob_mux to run correctly.
  
  Signed-off-by: Danny Auble <da@schedmd.com>
  
  f8ae9a15