  1. 02 May, 2013 2 commits
  2. 01 May, 2013 6 commits
  3. 30 Apr, 2013 3 commits
    • Change maximum delay for state save from 2 secs to 5 secs. · 5a2a76ff
      Morris Jette authored
      Make timeout configurable at build time by defining SAVE_MAX_WAIT.
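      A minimal sketch of the build-time pattern described above, assuming
      SAVE_MAX_WAIT is a guarded preprocessor default in the state-save code
      (the surrounding logic is not shown):

      ----------------
      /* Default maximum delay (in seconds) before state save; override at
       * build time, e.g. CFLAGS="-DSAVE_MAX_WAIT=10". */
      #ifndef SAVE_MAX_WAIT
      #define SAVE_MAX_WAIT 5
      #endif
      ----------------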
    • added script to help manage native and symmetric MPI runs within SLURM · fdf56162
      Olli-Pekka Lehto authored
      Dear all,
      
      As a quick fix, I have put together this script to help manage native and symmetric MPI runs within SLURM. It's a bit bare-bones currently, but I needed to get it working quickly :)
      
      It does not provide tight integration between the scheduler and the MPI daemons, and it requires a slot on the host even when running fully on the MIC, so it is far from an optimal solution, but it could serve as a stopgap.
      
      It's inspired by the TACC Stampede documentation. They seem to have a similar script in place.
      
      It's fairly simple: you provide the name of the MIC binary (with -m) and of the host binary (with -c). The host MPI/OpenMP parameters are given as usual, and the Xeon Phi side parameters are given as environment variables (MIC_PPN, MIC_OMP_NUM_THREADS). Currently it supports only one card per host, but extending it should be simple enough.
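
      As a rough illustration of the usage described above (the script name,
      binary names, and values below are placeholders, not the actual ones):

      ----------------
      # Xeon Phi side settings go in environment variables
      export MIC_PPN=4                # MPI ranks per MIC card
      export MIC_OMP_NUM_THREADS=30   # OpenMP threads per MIC rank
      # Host side MPI/OpenMP parameters are given as usual; -c names the
      # host binary and -m the MIC binary
      ./mpirun-mic -c ./hello.host -m ./hello.mic
      ----------------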
      
      Here are a couple of links to documentation:
      
      Our prototype cluster documentation:
      https://confluence.csc.fi/display/HPCproto/HPC+Prototypes#HPCPrototypes-XeonPhiDevelopment
      Presentation at the PRACE Spring School in Umeå earlier this week:
      https://www.hpc2n.umu.se/sites/default/files/1.03%20CSC%20Cluster%20Introduction.pdf
      
      Feel free to include this in the contribs directory. It might need a bit of cleanup, though, and I don't know when I will have time to do that.
      
      I have also added support for the TotalView debugger (provided it is installed and configured properly for Xeon Phi usage).
      
      Future ideas:
      
      For the native MIC client, I've been testing it out a bit and looking at ways to minimize the changes needed for support. The two major challenges seem to be in scheduling and affinity:
      
      I think it might be necessary to put it into a specific topology plugin, like the one for BG/Q, but it looks like a lot of work to do that.
      
      Best regards,
      Olli-Pekka
    • Accounting - make average by task not cpu. · 81ccec93
      Danny Auble authored
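      For illustration of the averaging distinction (numbers made up): a 4-task
      step running on 8 CPUs that used 4000 MB of memory in total would report
      an average of 1000 MB when averaging by task, versus 500 MB when
      averaging by CPU.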
  4. 29 Apr, 2013 3 commits
  5. 26 Apr, 2013 3 commits
  6. 25 Apr, 2013 2 commits
  7. 24 Apr, 2013 1 commit
  8. 23 Apr, 2013 3 commits
  9. 19 Apr, 2013 3 commits
  10. 18 Apr, 2013 1 commit
  11. 17 Apr, 2013 3 commits
  12. 16 Apr, 2013 2 commits
  13. 12 Apr, 2013 3 commits
    • Danny Auble authored · ca3c2fa1
    • Replaced ipmi.conf with generic acct_gather.conf file for all acct_gather plugins. · c1793844
      Danny Auble authored
      For those doing development: to use this, follow the model set forth in
      the acct_gather_energy_ipmi plugin.
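      A minimal sketch of what an acct_gather.conf might contain for the IPMI
      energy plugin, assuming the sampling-frequency option carried over from
      the old ipmi.conf (the authoritative keyword list is in the
      acct_gather.conf documentation):

      ----------------
      # Hypothetical example; each acct_gather plugin defines its own keywords.
      EnergyIPMIFrequency=10
      ----------------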
    • gres/gpu - Fix for gres.conf file with multiple files on a single line · ee6a7066
      Morris Jette authored
      We're in the process of setting up a few GPU nodes in our cluster, and
      want to use Gres to control access to them.
      
      Currently, we have activated one node with 2 GPUs.  The gres.conf file
      on that node reads
      
      ----------------
      
      Name=gpu Count=2 File=/dev/nvidia[0-1]
      Name=localtmp Count=1800
      ----------------
      
      (The localtmp gres just counts access to local tmp disk.)  Nodes without
      GPUs have gres.conf files like this:
      
      ----------------
      
      Name=gpu Count=0
      Name=localtmp Count=90
      ----------------
      
      slurm.conf contains the following:
      
      GresTypes=gpu,localtmp
      Nodename=DEFAULT Sockets=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=62976 Gres=localtmp:90 State=unknown
      [...]
      Nodename=c19-[1-16] NodeHostname=compute-19-[1-16] Weight=15848 CoresPerSocket=4 Gres=localtmp:1800,gpu:2 Feature=rack19,intel,ib
      
      Submitting a job with sbatch --gres=gpu:1 ... sets the CUDA_VISIBLE_DEVICES
      environment variable for the job.  However, the values seem a bit strange:
      
      - If we submit one job with --gres=gpu:1, CUDA_VISIBLE_DEVICES gets the value 0.
      
      - If we submit two jobs with --gres=gpu:1 at the same time,
        CUDA_VISIBLE_DEVICES gets the value 0 for one job, and 1633906540 for
        the other.
      
      - If we submit one job with --gres=gpu:2, CUDA_VISIBLE_DEVICES gets the
        value 0,1633906540
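      For reference, the bracketed File= expression above expands to one device
      file per GPU; writing one file per line is a presumably equivalent
      spelling (sketch only, assuming per-line counts accumulate) and should be
      unaffected by the single-line bug addressed here:

      ----------------
      Name=gpu File=/dev/nvidia0
      Name=gpu File=/dev/nvidia1
      Name=localtmp Count=1800
      ----------------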
  14. 11 Apr, 2013 3 commits
  15. 10 Apr, 2013 2 commits