1. 03 Jun, 2013 2 commits
    • Start NEWS for v2.5.8 · c795724d
      Morris Jette authored
    • restore max_nodes of desc to NO_VAL when checkpointing job · f82e0fb8
      Hongjia Cao authored
      We're having some trouble getting our Slurm jobs to restart
      successfully after a checkpoint.  For this test, I'm using sbatch and
      a simple, single-threaded executable.  Slurm is 2.5.4, BLCR is 0.8.5.
      I'm submitting the job using sbatch:
      
      $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh
      
      I am able to create the checkpoint and vacate the node:
      
      $ scontrol checkpoint create 137
      .... time passes ....
      $ scontrol vacate 137
      
      At that point, I see the checkpoint file from BLCR in the current
      directory and the checkpoint file from Slurm
      in /var/spool/slurm-llnl/checkpoint.  However, when I attempt to
      restart the job:
      
      $ scontrol checkpoint restart 137
      scontrol_checkpoint error: Node count specification invalid
      
      In slurmctld's log (at level 7) I see:
      
      [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
      [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
      [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
      [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
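      
      The fix named in the subject line addresses exactly this failure: when the
      saved job description is rebuilt at restart, a max_nodes of 0 must be
      restored to NO_VAL ("not specified") so that _job_create() accepts the
      request. A minimal sketch of the idea, not the actual patch (the helper
      name below is hypothetical; job_desc_msg_t, slurm_init_job_desc_msg()
      and NO_VAL come from <slurm/slurm.h>):
      
      #include <slurm/slurm.h>
      
      /* Hypothetical helper: rebuild a job request from saved checkpoint state. */
      void build_restart_desc(job_desc_msg_t *desc, uint32_t saved_max_nodes)
      {
              slurm_init_job_desc_msg(desc);
              if (saved_max_nodes == 0)
                      desc->max_nodes = NO_VAL;  /* treat 0 as "no limit was given" */
              else
                      desc->max_nodes = saved_max_nodes;
      }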
  2. 31 May, 2013 1 commit
  3. 30 May, 2013 1 commit
  4. 29 May, 2013 1 commit
  5. 24 May, 2013 2 commits
  6. 23 May, 2013 6 commits
  7. 22 May, 2013 2 commits
  8. 21 May, 2013 1 commit
  9. 18 May, 2013 1 commit
  10. 16 May, 2013 2 commits
  11. 14 May, 2013 1 commit
  12. 13 May, 2013 1 commit
  13. 11 May, 2013 1 commit
    • Added MaxCPUsPerNode partition configuration parameter. · e33c5d57
      Morris Jette authored
      This can be especially useful for scheduling GPUs. For example, a node can be
      associated with two Slurm partitions (e.g. "cpu" and "gpu"), and the
      "cpu" partition/queue could be limited to only a subset of the node's CPUs,
      ensuring that one or more CPUs remain available to jobs in the "gpu"
      partition/queue.
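      
      A slurm.conf sketch of that layout (the node name, CPU count, and GPU count
      below are invented for illustration; MaxCPUsPerNode is the new parameter):
      
      # 16-CPU node with two GPUs, shared by both partitions
      GresTypes=gpu
      NodeName=tux1 CPUs=16 Gres=gpu:2
      # Cap the "cpu" partition so two CPUs always remain free for "gpu" jobs
      PartitionName=cpu Nodes=tux1 MaxCPUsPerNode=14 Default=YES
      PartitionName=gpu Nodes=tux1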
  14. 10 May, 2013 1 commit
  15. 08 May, 2013 2 commits
  16. 02 May, 2013 3 commits
  17. 01 May, 2013 6 commits
  18. 30 Apr, 2013 3 commits
    • Change maximum delay for state save from 2 secs to 5 secs. · 5a2a76ff
      Morris Jette authored
      Make timeout configurable at build time by defining SAVE_MAX_WAIT.
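      
      A minimal sketch of the pattern described above (only the macro name
      SAVE_MAX_WAIT and the new 5-second default come from the commit; the
      surrounding context is illustrative):
      
      /* Maximum time, in seconds, that writing of state save files may be
       * deferred.  Override at build time if needed, e.g.
       *   CFLAGS="-DSAVE_MAX_WAIT=10" ./configure ...
       */
      #ifndef SAVE_MAX_WAIT
      #define SAVE_MAX_WAIT 5
      #endif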
    • added script to help manage native and symmetric MPI runs within SLURM · fdf56162
      Olli-Pekka Lehto authored
      Dear all,
      
      As a quick fix, I have put together this script to help manage native and symmetric MPI runs within SLURM. It's a bit bare-bones currently, but I needed to get it working quickly :)
      
      It does not provide tight integration between the scheduler and the MPI daemons, and it requires a slot on the host even when running fully on the MIC, so it's far from an optimal solution, but it could serve as a stopgap.
      
      It's inspired by the TACC Stampede documentation. They seem to have a similar script in place.
      
      It's fairly simple: you provide the names of the MIC binary (with -m) and the host binary (with -c). The host MPI/OpenMP parameters are given as usual, and the Xeon Phi side parameters as environment variables (MIC_PPN, MIC_OMP_NUM_THREADS); a sketch of an invocation follows at the end of this entry. Currently it supports only one card per host, but extending it should be simple enough.
      
      Here are a couple of links to documentation:
      
      Our prototype cluster documentation:
      https://confluence.csc.fi/display/HPCproto/HPC+Prototypes#HPCPrototypes-XeonPhiDevelopment
      Presentation at the PRACE Spring School in Umeå earlier this week:
      https://www.hpc2n.umu.se/sites/default/files/1.03%20CSC%20Cluster%20Introduction.pdf
      
      Feel free to include this in the contribs directory. It might need a bit of cleanup, though, and I don't know when I will have the time to do that.
      
      I have also added support for the TotalView debugger (provided it is installed and configured properly for Xeon Phi usage).
      
      Future ideas:
      
      For the native MIC client, I've been testing it out a bit and looking at ways to minimize the changes needed for support. The two major challenges seem to be scheduling and affinity.
      
      I think it might be necessary to put it into a specific topology plugin, like the one for BG/Q, but it looks like a lot of work to do that.
      
      Best regards,
      Olli-Pekka
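      
      A sketch of what an invocation might look like inside a batch script or
      allocation, based only on the options and environment variables described
      above (the script name mpirun-mic and the binary names are placeholders):
      
      $ export MIC_PPN=4                 # MPI ranks per Xeon Phi card
      $ export MIC_OMP_NUM_THREADS=60    # OpenMP threads per rank on the card
      $ mpirun-mic -c ./app.host -m ./app.mic   # host-side MPI/OpenMP options given as usual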
    • Accounting - make average by task not cpu. · 81ccec93
      Danny Auble authored
  19. 29 Apr, 2013 3 commits