1. 13 Mar, 2012 1 commit
  2. 12 Mar, 2012 1 commit
  3. 02 Mar, 2012 2 commits
    • cray/srun wrapper, don't use aprun -q by default · ea9adc17
      Morris Jette authored
      In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
      option is used.
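The rule in this commit message can be sketched as a small argument builder. This is only an illustration in C, not the wrapper's actual code; `build_aprun_args`, its parameters, and the fixed-capacity array are hypothetical names chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill argv with the start of an aprun command line.
 * "-q" is appended only when the caller passed --quiet to srun.
 * Returns the number of entries written (never more than cap).
 */
size_t build_aprun_args(bool srun_quiet, const char **argv, size_t cap)
{
    size_t n = 0;

    if (n < cap)
        argv[n++] = "aprun";
    if (srun_quiet && n < cap)
        argv[n++] = "-q";   /* previously added unconditionally */
    return n;
}
```

With `srun_quiet` false the array holds only `aprun`; with it true, `-q` follows.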
    • Fix for possible SEGV · ed56303c
      Morris Jette authored
      Here's what seems to have happened:
      
      - A job was pending, waiting for resources.
      - slurm.conf was changed to remove some nodes, and a scontrol reconfigure was done.
      - As a result of the reconfigure, the pending job became non-runnable, due to "Requested node configuration is not available". The scheduler set the job state to JOB_FAILED and called delete_job_details.
      - scontrol reconfigure was done again.
      - read_slurm_conf called _restore_job_dependencies.
      - _restore_job_dependencies called build_feature_list for each job in the job list.
      - When build_feature_list tried to reference the now deleted job details for the failed job, it got a segmentation fault.
      
      The problem was reported by a customer on Slurm 2.2.7.  I have not been able to reproduce it on 2.4.0-pre3, although the relevant code looks the same. There may be a timing window. The attached patch attempts to fix the problem by adding a check to _restore_job_dependencies.  If the job state is JOB_FAILED, the job is skipped.
      
      Regards,
      Martin
      
      This is an alternative solution to bug316980fix.patch
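The check described in the quoted report (skip jobs in state JOB_FAILED before touching their details) might look roughly like the sketch below. The types and the function `restore_job_features` are simplified stand-ins invented for this example, not Slurm's real structures, and the commit itself is stated to be an alternative solution whose exact form is not shown here.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-ins for Slurm's job record types. */
enum sketch_job_state { SKETCH_JOB_PENDING, SKETCH_JOB_RUNNING, SKETCH_JOB_FAILED };

struct sketch_job_details {
    const char *features;           /* feature expression, may be NULL */
};

struct sketch_job {
    enum sketch_job_state state;
    struct sketch_job_details *details; /* NULL once the details are deleted */
};

/*
 * Walk the job list and "rebuild" feature data for each job.
 * Failed jobs are skipped, so their deleted details are never dereferenced.
 * Returns the number of jobs whose features were processed.
 */
int restore_job_features(const struct sketch_job *jobs, size_t njobs)
{
    int rebuilt = 0;

    for (size_t i = 0; i < njobs; i++) {
        if (jobs[i].state == SKETCH_JOB_FAILED)
            continue;               /* details may already be gone */
        if (jobs[i].details->features != NULL)
            rebuilt++;
    }
    return rebuilt;
}
```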
  4. 29 Feb, 2012 1 commit
  5. 28 Feb, 2012 5 commits
  6. 27 Feb, 2012 1 commit
    • Reduce gres error logging · 670be35a
      Morris Jette authored
      Only report "gres/<name> lacks File parameter" if some nodes define
      File AND this node does not AND (new part here) the GRES count on
      this node is non-zero.
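The three-part condition above reduces to a single boolean predicate. A minimal sketch, with a hypothetical function name and parameters that are not taken from the Slurm source:

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: should "gres/<name> lacks File parameter"
 * be logged for this node?
 */
bool should_log_missing_file(bool some_nodes_define_file,
                             bool this_node_defines_file,
                             unsigned long gres_count_on_node)
{
    return some_nodes_define_file
        && !this_node_defines_file
        && gres_count_on_node != 0;   /* the newly added condition */
}
```

A node with a GRES count of zero no longer triggers the message, even if other nodes define File.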
  7. 24 Feb, 2012 9 commits
  8. 23 Feb, 2012 1 commit
  9. 22 Feb, 2012 4 commits
  10. 21 Feb, 2012 1 commit
  11. 20 Feb, 2012 2 commits
  12. 06 Feb, 2012 3 commits
    • Danny Auble authored
      e3269071
    • The openpty(3) call used by slurmstepd to allocate a pseudo-terminal · 2a1c08b0
      Danny Auble authored
      The openpty(3) call used by slurmstepd to allocate a pseudo-terminal
      is a convenience function in BSD and glibc that internally calls
      the equivalent of
      
          int masterfd = open("/dev/ptmx", flags);
          grantpt (masterfd);
          unlockpt (masterfd);
          char *slave = ptsname (masterfd);
          int slavefd = open (slave, O_RDWR|O_NOCTTY);
      
      (in pseudocode)
      
      On Linux, with some combinations of glibc/kernel (in this
      case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing
      in slurmstepd with EPERM, because the allocated pty was getting
      root ownership instead of the user running the slurm job.
      
      From the POSIX description of grantpt:
      
       "The grantpt() function shall change the mode and ownership of the
        slave pseudo-terminal device... The user ID of the slave shall
        be set to the real UID of the calling process..."
      
       http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html
      
      This means that for POSIX-compliance, the real user id of slurmstepd
      must be the user executing the SLURM job at the time openpty(3) is
      called. Unfortunately, the real user id of slurmstepd at this
      point is still root, and only the effective uid is set to the user.
      
      This patch is a work-around that uses the (non-portable) setresuid(2)
      system call to reset the real and effective uids of the slurmstepd
      process to the job user, but keep the saved uid of root. Then after
      the openpty(3) call, the previous credentials are reestablished
      using the same call.
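The work-around described above follows a switch, call, restore pattern. The sketch below shows that pattern with a generic callback in place of openpty(3). `run_with_real_uid`, `uid_scoped_fn`, and `record_real_uid` are hypothetical names for this example, and passing `(uid_t)-1` to leave the saved uid untouched is an assumption about one way to "keep the saved uid of root", not necessarily what the patch does.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* setresuid(2) is a non-portable extension */
#endif
#include <sys/types.h>
#include <unistd.h>

/* Repeated from <unistd.h> so the sketch compiles even if _GNU_SOURCE
 * was defined too late to take effect. */
int setresuid(uid_t ruid, uid_t euid, uid_t suid);

typedef int (*uid_scoped_fn)(void *arg);

/*
 * Run fn(arg) with both the real and effective uid set to job_uid,
 * leaving the saved uid unchanged so the old identity can be restored.
 * Returns fn's result, or -1 if switching or restoring credentials fails.
 */
int run_with_real_uid(uid_t job_uid, uid_scoped_fn fn, void *arg)
{
    uid_t old_ruid = getuid();
    uid_t old_euid = geteuid();
    int rc;

    if (setresuid(job_uid, job_uid, (uid_t)-1) < 0)
        return -1;

    rc = fn(arg);               /* e.g. a wrapper around openpty(3) */

    /* The saved uid was never changed, so only real/effective go back. */
    if (setresuid(old_ruid, old_euid, (uid_t)-1) < 0)
        return -1;              /* a caller should treat this as fatal */
    return rc;
}

/* Example callback: record the real uid seen inside the scoped call. */
int record_real_uid(void *arg)
{
    *(uid_t *)arg = getuid();
    return 0;
}
```

In the scenario from the commit message, the callback would wrap the pty allocation, so that grantpt(3) sees the job user as the real uid.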
    • Danny Auble authored
      1b1e6196
  13. 03 Feb, 2012 1 commit
    • Fix for srun with --exclude and --nodes · a4551158
      Morris Jette authored
      Fix for srun running within an existing allocation with the --exclude
      option and a --nodes count small enough to remove more nodes.
      
          > salloc -N 8
          salloc: Granted job allocation 1000008
          > srun -N 2 -n 2 --exclude=tux3 hostname
          srun: error: Unable to create job step: Requested node configuration is not available
      
      Patch from Phil Eckert, LLNL.
  14. 02 Feb, 2012 1 commit
  15. 01 Feb, 2012 5 commits
  16. 31 Jan, 2012 2 commits