- 12 Apr, 2013 5 commits
-
-
Danny Auble authored
-
Thomas Cadeau authored
Sometimes, typically with several jobs on the same node or with many sstat calls for a job, the pipe is not ready to be read. In that case, the function reading the pipe returns an error and the consumed-energy values are set to NO_VAL. From that point on, the values are never read again because the process "knows" there is no value to read. Thus a single error means NO_VAL is saved in the database and no consumed-energy information is stored. The attached patch avoids this: on the first read of the pipe, if the pipe doesn't exist, the function retries "NBFIRSTREAD = 3" times with a one-second wait between attempts. Then, during the job run and for the final read, if the pipe doesn't exist, the values are simply not updated. On the first read, the pipe is read only if the writer thread is running. If sstat fails to read the pipe, the value is not updated and the last value is printed. If there is a problem during the last read: when there were earlier sstat calls, a value exists but any change between the last sstat and the end of the step is missed; otherwise, the value is just "0" (never updated from the beginning).
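A minimal sketch, in plain C, of the first-read retry scheme described above; NBFIRSTREAD comes from the patch description, while the helper names, the pipe handling, and the exact value format are illustrative assumptions rather than the plugin's actual code:
----------------
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define NBFIRSTREAD 3   /* attempts allowed for the very first pipe read */

/* Try to read one energy value from the named pipe; hypothetical helper. */
static int _read_energy_pipe(const char *path, uint32_t *energy)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;                        /* pipe missing or not ready */
    ssize_t n = read(fd, energy, sizeof(*energy));
    close(fd);
    return (n == (ssize_t) sizeof(*energy)) ? 0 : -1;
}

/* First read: retry a few times instead of permanently storing NO_VAL. */
static int _first_energy_read(const char *path, uint32_t *energy)
{
    for (int i = 0; i < NBFIRSTREAD; i++) {
        if (_read_energy_pipe(path, energy) == 0)
            return 0;                     /* value obtained */
        sleep(1);                         /* wait one second and retry */
    }
    return -1;    /* give up, but leave the previous value untouched */
}
----------------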
-
Danny Auble authored
-
Danny Auble authored
plugins. For those doing development who want to use this, follow the model set forth in the acct_gather_energy_ipmi plugin.
-
Morris Jette authored
We're in the process of setting up a few GPU nodes in our cluster, and want to use Gres to control access to them. Currently, we have activated one node with 2 GPUs. The gres.conf file on that node reads
----------------
Name=gpu Count=2 File=/dev/nvidia[0-1]
Name=localtmp Count=1800
----------------
(the localtmp is just counting access to local tmp disk.) Nodes without GPUs have gres.conf files like this:
----------------
Name=gpu Count=0
Name=localtmp Count=90
----------------
slurm.conf contains the following:
----------------
GresTypes=gpu,localtmp
Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
[...]
Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
----------------
Submitting a job with sbatch --gres:1 ... sets the CUDA_VISIBLE_DEVICES for the job. However, the values seem a bit strange:
- If we submit one job with --gres:1, CUDA_VISIBLE_DEVICES gets the value 0.
- If we submit two jobs with --gres:1 at the same time, CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for the other.
- If we submit one job with --gres:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540
-
- 11 Apr, 2013 6 commits
-
-
Danny Auble authored
Conflicts: NEWS
-
Danny Auble authored
APRUN_DEFAULT_MEMORY env var for aprun. In this scenario the option will not be displayed when used with --launch-cmd.
-
Morris Jette authored
-
Danny Auble authored
per cpu.
-
Morris Jette authored
-
Danny Auble authored
per cpu.
-
- 10 Apr, 2013 10 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
If task count specified, but no tasks-per-node, then set the tasks per node in the BASIL reservation request.
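A minimal sketch, with hypothetical names, of deriving a tasks-per-node value for the BASIL reservation when only the total task count is known:
----------------
#include <stdint.h>

/* Ceiling division of the task count over the node count; illustrative only. */
static uint32_t _default_tasks_per_node(uint32_t ntasks, uint32_t nnodes)
{
    if (nnodes == 0)
        return ntasks;                    /* nothing to divide by */
    return (ntasks + nnodes - 1) / nnodes;
}
----------------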
-
Danny Auble authored
Conflicts: src/plugins/select/cray/cray_config.c
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
as the hosts given.
-
Danny Auble authored
-
- 09 Apr, 2013 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
isn't a sub-block job.
-
Danny Auble authored
the XML.
-
Danny Auble authored
-
Morris Jette authored
Fix for bug 258
-
- 08 Apr, 2013 4 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Nathan Yee authored
FrontEndNode: allow/deny groups/users
Node: Power consumption
-
Morris Jette authored
-
- 06 Apr, 2013 2 commits
-
-
jette authored
(at initiating pending job steps), interrupt driven rather than retry based.
-
Morris Jette authored
Fix sched/backfill logic to initiate jobs whose maximum time limit is over the partition limit but whose minimum time limit permits them to start. Related to bug 251
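A minimal sketch of the time-limit test implied above, using hypothetical field names: a job whose maximum time limit exceeds the partition limit may still be backfilled if its minimum time limit fits:
----------------
#include <stdbool.h>
#include <stdint.h>

static bool _time_limit_ok(uint32_t time_limit, uint32_t time_min,
                           uint32_t part_max_time)
{
    if (time_limit <= part_max_time)
        return true;            /* fits as requested */
    if (time_min && time_min <= part_max_time)
        return true;            /* can start with a reduced time limit */
    return false;
}
----------------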
-
- 05 Apr, 2013 5 commits
-
-
Morris Jette authored
Conflicts: src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
This is the only remaining use of the function. Also added a boards parameter to the function and removed some unused parameters.
-
Morris Jette authored
Mostly to split long lines
-
- 04 Apr, 2013 2 commits
-
-
Stephen Trofinoff authored
I am sending the latest update of my NPPCU-support patch for Slurm 2.5.0. As before, this patch is applied over my basic BASIL 1.3 support patch. The reason for this latest version is that it came to my attention that certain jobs that should have been rejected by Slurm were allowed through. I then further noticed that this would cause the backfill algorithm to slow down dramatically (often not being able to process any other jobs).

The cause of the problem was that when I introduced the functionality into Slurm to properly set the "nppcu" (number of processors per compute unit) attribute in the XML reservation request to ALPS, I didn't also adjust the tests earlier in the code that eliminate nodes from consideration that do not have sufficient resources. In other words, jobs that would exceed the absolute total number of processors on the node would be rejected as always (this is good). Jobs that required no more than the reduced number of "visible" processors on the node were allocated and worked fine (this is good). Unfortunately, jobs that needed a number of processors somewhere in between these limits (let's call them the soft and hard limits) were allowed through by Slurm. Making matters worse, when Slurm would subsequently try to request the ALPS reservation, ALPS would correctly reject it but Slurm would keep trying--this would then kill the backfilling.

In my opinion, these jobs should have been rejected from the outset by Slurm as they are asking for more processors per node than can be supplied. If the user wants this number of processors they should specify "--ntasks-per-core=..." (in our case "2", as that is the full number of hardware threads per core). Obviously, this problem only appeared when I used CR_ONE_TASK_PER_CORE in slurm.conf, as I had modified the code to set nppcu to 1 when Slurm was configured with that option and the user didn't explicitly specify a different value. The patch appears to be working well for us now and so I am submitting it to you for your review.
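A minimal sketch, with hypothetical names, of the node-elimination test described above: with nppcu = 1 (the CR_ONE_TASK_PER_CORE case) only one processor per compute unit is "visible", so a request that falls between the visible count (soft limit) and the hardware total (hard limit) should be rejected up front rather than passed on to ALPS:
----------------
#include <stdbool.h>
#include <stdint.h>

static bool _node_can_satisfy(uint32_t cpus_requested, uint32_t total_cpus,
                              uint32_t threads_per_core, uint32_t nppcu)
{
    uint32_t visible_cpus = total_cpus;

    if (nppcu && threads_per_core)
        visible_cpus = total_cpus / threads_per_core * nppcu;

    /* Reject the "in between" case instead of letting ALPS refuse it later. */
    return cpus_requested <= visible_cpus;
}
----------------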
-
Morris Jette authored
-