1. 14 Mar, 2012 1 commit
    • Add Cray BASIL/XML logging options · 0a2b9b0f
      Morris Jette authored
      Cray - Enable logging of BASIL communications with environment variables.
      Set XML_LOG to enable logging. Set XML_LOG_LOC to the path of the log
      file, or to "SLURM" to write to SlurmctldLogFile; if it is unset, the log
      goes to "slurm_basil_xml.log".
      Based on work by Steve Tronfinoff, CSCS.
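
The selection rules described in this commit message can be sketched as a small lookup. This is only an illustration, not the code from the commit: the function name `basil_xml_log_path` and its `slurmctld_log_file` parameter are hypothetical, and "set" is assumed to mean "present in the environment".

```c
#include <stdlib.h>
#include <string.h>

#define DEFAULT_XML_LOG "slurm_basil_xml.log"

/*
 * Return the log destination for BASIL XML traffic, or NULL when logging
 * is disabled. slurmctld_log_file stands in for the configured
 * SlurmctldLogFile value (hypothetical parameter).
 */
const char *basil_xml_log_path(const char *slurmctld_log_file)
{
    const char *loc;

    if (getenv("XML_LOG") == NULL)
        return NULL;                    /* logging not requested */

    loc = getenv("XML_LOG_LOC");
    if (loc == NULL)
        return DEFAULT_XML_LOG;         /* unset: default file name */
    if (strcmp(loc, "SLURM") == 0)
        return slurmctld_log_file;      /* share the slurmctld log */
    return loc;                         /* explicit path */
}
```

With `XML_LOG=1` and `XML_LOG_LOC=/tmp/basil.log` in the environment, the sketch returns `/tmp/basil.log`.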
  2. 13 Mar, 2012 5 commits
  3. 12 Mar, 2012 1 commit
  4. 09 Mar, 2012 1 commit
  5. 07 Mar, 2012 1 commit
  6. 06 Mar, 2012 2 commits
  7. 02 Mar, 2012 1 commit
  8. 29 Feb, 2012 1 commit
  9. 28 Feb, 2012 1 commit
  10. 24 Feb, 2012 5 commits
  11. 23 Feb, 2012 1 commit
  12. 20 Feb, 2012 1 commit
  13. 17 Feb, 2012 1 commit
  14. 16 Feb, 2012 1 commit
  15. 11 Feb, 2012 1 commit
  16. 06 Feb, 2012 4 commits
    • BGQ - Fix for handling a mix of steps running at the same time · 5cb21068
      Danny Auble authored
      BGQ - Fix for handling a mix of steps running at the same time, some of
      which are full-allocation jobs and others that are smaller.
    • BLUEGENE - Better handling of blocks in error state · 915881ab
      Danny Auble authored
      BLUEGENE - Better handling of blocks that go into an error state or
      deallocate while jobs are running on them.
    • NEWS for last BGQ comment · 278179d3
      Danny Auble authored
    • The openpty(3) call used by slurmstepd to allocate a pseudo-terminal · 2a1c08b0
      Danny Auble authored
      The openpty(3) call used by slurmstepd to allocate a pseudo-terminal
      is a convenience function in BSD and glibc that internally does
      the equivalent of the following (in pseudocode):

          int masterfd = open("/dev/ptmx", flags);
          grantpt(masterfd);
          unlockpt(masterfd);
          char *slave = ptsname(masterfd);   /* path of the slave device */
          int slavefd = open(slave, O_RDWR|O_NOCTTY);
      
      On Linux, with some combinations of glibc and kernel (in this
      case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing
      in slurmstepd with EPERM, because the allocated pty was owned by
      root instead of by the user running the SLURM job.
      
      From the POSIX description of grantpt:
      
       "The grantpt() function shall change the mode and ownership of the
        slave pseudo-terminal device... The user ID of the slave shall
        be set to the real UID of the calling process..."
      
       http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html
      
      This means that for POSIX-compliance, the real user id of slurmstepd
      must be the user executing the SLURM job at the time openpty(3) is
      called. Unfortunately, the real user id of slurmstepd at this
      point is still root, and only the effective uid is set to the user.
      
      This patch is a work-around that uses the (non-portable) setresuid(2)
      system call to reset the real and effective uids of the slurmstepd
      process to the job user, but keep the saved uid of root. Then after
      the openpty(3) call, the previous credentials are reestablished
      using the same call.
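
The switch-and-restore pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch itself: `run_with_real_uid` and `report_real_uid` are hypothetical names, and the callback stands in for the openpty(3) call so the sketch does not depend on a pty being available.

```c
#define _GNU_SOURCE             /* for setresuid(2) and getresuid(2) */
#include <sys/types.h>
#include <unistd.h>

/*
 * Run fn(arg) with the real and effective uid set to job_uid while the
 * saved uid is left untouched, then restore the previous real and
 * effective uids. Returns fn's result, or -1 if a uid switch fails.
 */
int run_with_real_uid(uid_t job_uid, int (*fn)(void *), void *arg)
{
    uid_t ruid, euid, suid;
    int rc;

    if (getresuid(&ruid, &euid, &suid) < 0)
        return -1;

    /* -1 leaves the saved uid (root in slurmstepd) unchanged. */
    if (setresuid(job_uid, job_uid, (uid_t) -1) < 0)
        return -1;

    rc = fn(arg);               /* e.g. a wrapper around openpty(3) */

    /* Permitted because the old real uid is still held as the saved uid. */
    if (setresuid(ruid, euid, (uid_t) -1) < 0)
        return -1;

    return rc;
}

/* Stand-in callback: records the real uid seen while switched. */
int report_real_uid(void *arg)
{
    *(uid_t *) arg = getuid();
    return 0;
}
```

Keeping root as the saved uid is what makes the second setresuid(2) call legal: an unprivileged process may set its real uid only to its current real, effective, or saved uid.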
  17. 04 Feb, 2012 1 commit
    • Fix for srun with --exclude and --nodes · a79386fd
      Morris Jette authored
      Fix for srun running within an existing allocation with the --exclude
      option and a --nodes count small enough to remove more nodes.
      
          > salloc -N 8
          salloc: Granted job allocation 1000008
          > srun -N 2 -n 2 --exclude=tux3 hostname
          srun: error: Unable to create job step: Requested node configuration is not available
      
      Patch from Phil Eckert, LLNL.
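
The behaviour expected in the example above (8 allocated nodes, one excluded, 2 requested) amounts to a count of the nodes left after exclusion. The sketch below only illustrates that check; it is not the patch, and the function name and the flag-array representation are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether a step asking for min_nodes can still be placed once
 * the excluded nodes are removed from the allocation. allocated[i] and
 * excluded[i] are flags for node i (hypothetical representation).
 */
bool step_fits_after_exclude(const bool *allocated, const bool *excluded,
                             size_t node_cnt, size_t min_nodes)
{
    size_t usable = 0;

    for (size_t i = 0; i < node_cnt; i++) {
        if (allocated[i] && !excluded[i])
            usable++;               /* node stays available to the step */
    }
    return usable >= min_nodes;
}
```

For the case shown, seven nodes remain usable, so a request for two nodes should be accepted rather than rejected.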
  18. 03 Feb, 2012 1 commit
    • Fix for srun with --exclude and --nodes · a4551158
      Morris Jette authored
      Fix for srun running within an existing allocation with the --exclude
      option and a --nodes count small enough to remove more nodes.
      
          > salloc -N 8
          salloc: Granted job allocation 1000008
          > srun -N 2 -n 2 --exclude=tux3 hostname
          srun: error: Unable to create job step: Requested node configuration is not available
      
      Patch from Phil Eckert, LLNL.
  19. 02 Feb, 2012 3 commits
    • Fix bug in step task distribution · 11db9adb
      Morris Jette authored
      Fix bug in step task distribution when nodes are not configured in numeric
      order. Patch from Hongjia Cao, NUDT.
    • Fix bug in step task distribution · fac3586b
      Morris Jette authored
      Fix bug in step task distribution when nodes are not configured in numeric
      order. Patch from Hongjia Cao, NUDT.
    • Transfer GPU file information to slurmstepd · bccf0f85
      Morris Jette authored
      Add logic to cache GPU file information (bitmap index mapping to device
      file number) in the slurmd daemon and transfer that information to the
      slurmstepd whenever a job step is initiated. This is needed to set the
      appropriate CUDA_VISIBLE_DEVICES environment variable value when the
      devices are not in strict numeric order (e.g. some GPUs are skipped).
      Based upon work by Nicolas Bigaouette.
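
To show why the index-to-device-file mapping matters, here is a minimal sketch that builds a CUDA_VISIBLE_DEVICES value from such a mapping. It is an illustration only: the function name, the array-based representation, and the assumption that the variable lists device file numbers are not taken from the commit.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Build a comma-separated device list for the allocated GPUs.
 * dev_file_num[i] is the device file number behind bitmap index i
 * (e.g. 2 for /dev/nvidia2); allocated[i] is nonzero if index i belongs
 * to the step. Returns 0 on success, -1 if buf is too small.
 */
int build_cuda_visible_devices(const int *dev_file_num, const int *allocated,
                               size_t gpu_cnt, char *buf, size_t buf_len)
{
    size_t used = 0;

    if (buf_len == 0)
        return -1;
    buf[0] = '\0';

    for (size_t i = 0; i < gpu_cnt; i++) {
        int n;

        if (!allocated[i])
            continue;
        n = snprintf(buf + used, buf_len - used, "%s%d",
                     used ? "," : "", dev_file_num[i]);
        if (n < 0 || (size_t) n >= buf_len - used)
            return -1;              /* output would be truncated */
        used += (size_t) n;
    }
    return 0;
}
```

With device files numbered 0, 2, and 3 and the first and third bitmap indexes allocated, the result is `0,3`, whereas using the bitmap indexes directly would give `0,2`.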
  20. 01 Feb, 2012 2 commits
    • Fix job requeue bug · c0a7a7a4
      Morris Jette authored
      Fix bug when a requeued batch job is scheduled to run on a different node
      zero but attempts job launch on the old node zero, causing the fatal error
      "Invalid host_index -1 for job #"
    • Avoid slurmctld abort due to bad pointer · 43936335
      Morris Jette authored
      Avoid slurmctld abort due to bad pointer when setting an advanced
      reservation MAINT flag if it contains no nodes (only licenses).
  21. 31 Jan, 2012 4 commits
  22. 28 Jan, 2012 1 commit