- 17 Apr, 2013 2 commits
  - Danny Auble authored
  - Morris Jette authored
- 16 Apr, 2013 20 commits
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
    jobs if ntasks-per-node was used but no node count given.
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Martin Perry authored
  - Morris Jette authored
  - Morris Jette authored
- 15 Apr, 2013 2 commits
  - Danny Auble authored
    file doesn't exist.
  - Thomas Cadeau authored
- 13 Apr, 2013 4 commits
  - jette authored
  - Morris Jette authored
  - Morris Jette authored
  - Morris Jette authored
    Accidentally removed in commit 3b4d338d
- 12 Apr, 2013 12 commits
  - Morris Jette authored
  - Morris Jette authored
    Execute autogen.sh to rebuild src/plugins/mpi/pmi2/Makefile.in.
    Cast int to uint32 for un/pack32 function calls.
    Split a long line.
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Danny Auble authored
  - Thomas Cadeau authored
    Sometimes, generally when several jobs run on the same node or when sstat is called many times for a job, the pipe is not ready to be read. In that case the function reading the pipe returns an error and the consumed-energy values are set to NO_VAL. From that point on the values are never read again, because the process "knows" there is no value to read, so a single error means NO_VAL is saved in the database and no consumed-energy information is stored.
    To avoid this, the attached patch changes the read logic. On the first read of the pipe, if the pipe does not exist, the function retries NBFIRSTREAD (3) times with a one-second wait between attempts. During the job run and for the final read, if the pipe does not exist, the values are simply not updated. The first read is only attempted if the writer thread is running. If sstat fails to read the pipe, the value is not updated and the last value is printed. If there is a problem during the final read: if sstat had been called, the value exists but all changes between the last sstat and the end of the step are missed; if not, the value is just "0" (never updated since the beginning).
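    A minimal sketch of the first-read retry described above, assuming a hypothetical helper name (_first_read_energy) and return convention; only NBFIRSTREAD (3 attempts) and the one-second wait come from the commit message, the rest is illustrative and not the actual acct_gather_energy_ipmi code.

    ```c
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define NBFIRSTREAD 3   /* attempts allowed on the very first read */

    /* Hypothetical helper: try to open and read the energy pipe, retrying
     * on the first read instead of immediately falling back to NO_VAL. */
    static int _first_read_energy(const char *pipe_path, uint32_t *joules)
    {
        int fd = -1;

        for (int i = 0; i < NBFIRSTREAD; i++) {
            fd = open(pipe_path, O_RDONLY);
            if (fd >= 0)
                break;
            sleep(1);   /* pipe not created yet: wait and retry */
        }
        if (fd < 0)
            return -1;  /* give up, but keep the previous value rather than NO_VAL */

        ssize_t n = read(fd, joules, sizeof(*joules));
        close(fd);
        return (n == (ssize_t) sizeof(*joules)) ? 0 : -1;
    }
    ```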
  - Danny Auble authored
  - Danny Auble authored
    plugins. Those doing development who want to use this should follow the model set forth in the acct_gather_energy_ipmi plugin.
  - Hongjia Cao authored
    Retransmit KVS messages to avoid failures due to asynchronous execution of slurmstepds. Simple implementation of name publish/unpublish/lookup.
  - Hongjia Cao authored
  - Morris Jette authored
    We're in the process of setting up a few GPU nodes in our cluster, and want to use Gres to control access to them. Currently, we have activated one node with 2 GPUs. The gres.conf file on that node reads:
        Name=gpu Count=2 File=/dev/nvidia[0-1]
        Name=localtmp Count=1800
    (the localtmp is just counting access to local tmp disk.) Nodes without GPUs have gres.conf files like this:
        Name=gpu Count=0
        Name=localtmp Count=90
    slurm.conf contains the following:
        GresTypes=gpu,localtmp
        Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
        [...]
        Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
    Submitting a job with sbatch --gres:1 ... sets the CUDA_VISIBLE_DEVICES for the job. However, the values seem a bit strange:
    - If we submit one job with --gres:1, CUDA_VISIBLE_DEVICES gets the value 0.
    - If we submit two jobs with --gres:1 at the same time, CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for the other.
    - If we submit one job with --gres:2, CUDA_VISIBLE_DEVICES gets the value 0,1633906540