- 19 Jul, 2012 4 commits
-
Morris Jette authored
-
Morris Jette authored
-
Francois Diakhate authored
-
Alejandro Lucero Palau authored
-
- 17 Jul, 2012 3 commits
-
Morris Jette authored
This corresponds to commit dd2dce54 from Mark Grondona's work in squeue, but applied to the sview command.
-
Morris Jette authored
Slurm 2.4 minor fixes
-
Morris Jette authored
-
- 16 Jul, 2012 1 commit
-
Morris Jette authored
This addresses trouble ticket 85
-
- 13 Jul, 2012 9 commits
-
Danny Auble authored
runjob_mux
-
Danny Auble authored
is always set when sending or receiving a message.
-
Tim Wickberg authored
-
Mark A. Grondona authored
Set SLURM_CONF in the default prolog/epilog environment instead of only in the spank prolog/epilog environment. This change fixes a potential hang during spank prolog/epilog execution caused by the possibility of memory allocation after fork(2) and before exec(2) when invoking slurmstepd spank prolog|epilog. It also has the benefit that SLURM commands used in prolog and epilog scripts will use the correct slurm.conf file.
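A minimal C sketch of the constraint described here (illustrative only, not the Slurm sources; the config path and the /bin/true stand-in are assumptions): the environment, including SLURM_CONF, is prepared before fork(2), so the child can go straight to exec without calling setenv(3) or anything else that may allocate memory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    int main(void)
    {
        /* Safe: set up the environment before fork(2), while no other
         * thread can be holding the allocator lock the child inherits. */
        if (setenv("SLURM_CONF", "/etc/slurm/slurm.conf", 1) < 0) {
            perror("setenv");
            return 1;
        }

        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* Child: only async-signal-safe calls between fork and exec;
             * a setenv(3) here could deadlock inside malloc. */
            char *argv[] = { "/bin/true", NULL };  /* stand-in for a prolog */
            execve(argv[0], argv, environ);
            _exit(127);                            /* exec failed */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }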
-
Mark A. Grondona authored
If exec_wait_child_wait_for_parent() fails for any reason, it is safer to abort immediately rather than proceed to execute the user's job.
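A hedged sketch of this fail-safe pattern (the wait_for_parent helper is a hypothetical stand-in, not the actual exec_wait_* code): if the synchronization read fails, the child exits immediately instead of falling through to exec.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Illustrative stand-in for exec_wait_child_wait_for_parent():
     * block until the parent writes one byte on the sync pipe. */
    static int wait_for_parent(int sync_fd)
    {
        char c;
        return (read(sync_fd, &c, 1) == 1) ? 0 : -1;
    }

    int main(void)
    {
        int fds[2];
        if (pipe(fds) < 0)
            return 1;

        pid_t pid = fork();
        if (pid == 0) {
            close(fds[1]);
            if (wait_for_parent(fds[0]) < 0)
                _exit(1);  /* abort; never fall through to the user's job */
            char *argv[] = { "/bin/echo", "task runs only on success", NULL };
            execv(argv[0], argv);
            _exit(127);
        }
        close(fds[0]);
        write(fds[1], "x", 1);  /* release the child */
        close(fds[1]);
        waitpid(pid, NULL, 0);
        return 0;
    }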
-
Mark A. Grondona authored
On a failure of fork(2), slurmstepd would print an error and exit, possibly leaving previously forked children waiting. Ensure a better cleanup by killing all active children on fork failure before exiting slurmstepd.
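A sketch of the cleanup described (illustrative, not slurmstepd code): on a fork(2) failure partway through task launch, kill and reap every child already forked before exiting, rather than leaving them blocked.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NTASKS 4

    int main(void)
    {
        pid_t pids[NTASKS];
        int started = 0;

        for (int i = 0; i < NTASKS; i++) {
            pids[i] = fork();
            if (pids[i] == 0) {
                pause();            /* child waits indefinitely for a signal */
                _exit(0);
            }
            if (pids[i] < 0) {
                perror("fork");
                /* Don't leave earlier children waiting: kill and reap
                 * them all before giving up. */
                for (int j = 0; j < started; j++)
                    kill(pids[j], SIGKILL);
                while (wait(NULL) > 0)
                    ;
                exit(1);
            }
            started++;
        }

        /* Demo teardown; normally the children would be released to exec. */
        for (int j = 0; j < started; j++)
            kill(pids[j], SIGKILL);
        while (wait(NULL) > 0)
            ;
        return 0;
    }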
-
Mark A. Grondona authored
Close the read end of the pipe slurmstepd uses to notify children that it is time to call exec(2), in order to save one file descriptor per task. (Previously, the read side of the pipe wasn't closed until the exec_wait_info was destroyed.)
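A sketch of the descriptor accounting (illustrative; not the exec_wait_info code): with N tasks, closing each pipe's read end in the parent right after fork(2) leaves the parent holding N descriptors (the write ends) instead of 2N.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NTASKS 4

    int main(void)
    {
        int notify_fd[NTASKS];          /* parent keeps one fd per task */

        for (int i = 0; i < NTASKS; i++) {
            int fds[2];
            if (pipe(fds) < 0)
                return 1;

            pid_t pid = fork();
            if (pid == 0) {
                char c;
                close(fds[1]);          /* child keeps only the read end */
                read(fds[0], &c, 1);    /* wait for the go-ahead */
                _exit(0);               /* exec(2) would happen here */
            }
            /* Parent: close the read end now rather than at teardown,
             * so only the write end is held per task. */
            close(fds[0]);
            notify_fd[i] = fds[1];
        }

        for (int i = 0; i < NTASKS; i++) {  /* tell every task to proceed */
            write(notify_fd[i], "x", 1);
            close(notify_fd[i]);
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }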
-
Mark A. Grondona authored
For some reason squeue was treating completing jobs the same as pending jobs, reporting the number of nodes as the maximum of the requested nodelist size, requested node count, or CPUs (divided into nodes?). This contradicts the squeue man page, which explicitly states that the number of nodes reported for completing jobs should include only the nodes still allocated to the job. This patch removes the special handling of completing jobs in src/squeue/print.c:_get_node_cnt(), so that the squeue output for completing jobs matches the documentation. A comment is also added so that developers looking at the code understand what is going on.
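A hedged sketch of the resulting logic (hypothetical types and field names, not the print.c sources): completing jobs report only the nodes still allocated, while pending jobs keep the estimate.

    #include <stdbool.h>
    #include <stdio.h>

    struct job_sketch {
        bool completing;        /* job is in the completing state */
        int  alloc_node_cnt;    /* nodes still allocated to the job */
        int  est_node_cnt;      /* estimate used for pending jobs */
    };

    static int get_node_cnt(const struct job_sketch *job)
    {
        /* Per the squeue man page: completing jobs report only the
         * nodes still allocated, with no pending-style estimation. */
        if (job->completing)
            return job->alloc_node_cnt;
        return job->est_node_cnt;
    }

    int main(void)
    {
        struct job_sketch completing = { true, 3, 128 };
        struct job_sketch pending    = { false, 0, 128 };

        printf("completing: %d nodes\n", get_node_cnt(&completing));
        printf("pending:    %d nodes\n", get_node_cnt(&pending));
        return 0;
    }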
-
Morris Jette authored
-
- 12 Jul, 2012 10 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
than 1 midplane but not the entire allocation.
-
Danny Auble authored
multi midplane block allocation.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
where other blocks on an overlapping midplane are running jobs.
-
Morris Jette authored
-
- 11 Jul, 2012 4 commits
-
Danny Auble authored
hardware is marked bad, remove the larger block and create a block over just the bad hardware, making the other hardware available to run on.
-
Danny Auble authored
allocation.
-
Danny Auble authored
-
Danny Auble authored
for a job to finish on it, the number of unused CPUs wasn't updated correctly.
-
- 09 Jul, 2012 1 commit
-
Martin Perry authored
See Bugzilla #73 for a more complete description of the problem. Patch by Martin Perry, Bull.
-
- 06 Jul, 2012 1 commit
-
Carles Fenoy authored
If a job is submitted to more than one partition, its partition pointer can be set to an invalid value. This can result in the count of CPUs allocated on a node being bad, resulting in over- or under-allocation of its CPUs. Patch by Carles Fenoy, BSC.

Hi all,

After a tough day I've finally found the problem and a solution for 2.4.1. I was able to reproduce the explained behavior by submitting jobs to two partitions. The job is allocated in one partition, but in the schedule function the job's partition is changed to the non-allocated one, so the resources cannot be freed at the end of the job.

I've solved this by moving the IS_PENDING test a few lines up in the schedule function (job_scheduler.c). This is the code from the git HEAD (line 801); as this file has changed a lot since 2.4.x I have not made a patch, but am describing the solution here. I've moved the if (!IS_JOB_PENDING) check to just after the second line (part_ptr = ...), which prevents the job's partition from being changed if it is already starting in another partition.

    job_ptr = job_queue_rec->job_ptr;
    part_ptr = job_queue_rec->part_ptr;
    job_ptr->part_ptr = part_ptr;
    xfree(job_queue_rec);
    if (!IS_JOB_PENDING(job_ptr))
        continue;   /* started in other partition */

Hope this is enough information to solve it. I've just realized (while writing this mail) that my solution has a memory leak, as job_queue_rec is not freed.

Regards,
Carles Fenoy
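A hedged sketch of the reordering described above, with the leak the author mentions also addressed by freeing the queue record before the early skip (minimal hypothetical types; not the job_scheduler.c sources):

    #include <stdbool.h>
    #include <stdlib.h>

    struct part_record { int id; };
    struct job_record  { bool pending; struct part_record *part_ptr; };
    struct job_queue_rec {
        struct job_record  *job_ptr;
        struct part_record *part_ptr;
    };

    /* One iteration of the scheduling loop: take ownership of the
     * queue record, skip jobs already started elsewhere, and only
     * then retarget the job's partition pointer. */
    static void schedule_one(struct job_queue_rec *rec)
    {
        struct job_record  *job_ptr  = rec->job_ptr;
        struct part_record *part_ptr = rec->part_ptr;

        free(rec);                     /* freed on every path: no leak */
        if (!job_ptr->pending)
            return;                    /* started in another partition */
        job_ptr->part_ptr = part_ptr;  /* safe: job is still pending */
        /* ... attempt to start job_ptr in part_ptr ... */
    }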
-
- 04 Jul, 2012 1 commit
-
Morris Jette authored
-
- 03 Jul, 2012 4 commits
-
Danny Auble authored
there are jobs running on that hardware.
-
Morris Jette authored
-
Lipari, Don authored
-
Tim Wickberg authored
-
- 02 Jul, 2012 2 commits
-
Danny Auble authored
-
Danny Auble authored
-