1. 24 Apr, 2012 1 commit
  2. 23 Apr, 2012 2 commits
  3. 20 Apr, 2012 1 commit
  4. 17 Apr, 2012 1 commit
  5. 12 Apr, 2012 1 commit
  6. 10 Apr, 2012 3 commits
  7. 03 Apr, 2012 1 commit
    • Limit depth of circular job dependency check · 0caecbc5
      Morris Jette authored
      Add support for a new SchedulerParameters option, max_depend_depth, defining
      the maximum number of jobs to test for circular dependencies (i.e. job A
      waits for job B to start and job B waits for job A to start). The default
      value is 10 jobs.
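      For example, the limit could be made explicit in slurm.conf with a line
      like the following (a sketch only; 10 matches the default, and any other
      SchedulerParameters options would be comma-separated on the same line):

          SchedulerParameters=max_depend_depth=10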
  8. 30 Mar, 2012 1 commit
  9. 29 Mar, 2012 1 commit
    • Fix in select/cons_res+topology+job with node range count · f64b29a2
      Morris Jette authored
      The problem was conflicting logic in the select/cons_res plugin. Some of
      the code was trying to give the job the maximum node count in the range,
      while other logic was trying to minimize how far the job is spread across
      multiple switches. As you note, this problem only happens when a range of
      node counts is specified and both the select/cons_res plugin and the
      topology/tree plugin are in use, and even then it is not easy to reproduce
      (all of the details you provided are quoted below; a simplified sketch of
      the switch-selection issue follows the quoted report).
      
      Quoting Martin.Perry@Bull.com:
      
      > Certain combinations of topology configuration and srun -N option produce
      > spurious job rejection with "Requested node configuration is not
      > available" with select/cons_res. The following example illustrates the
      > problem.
      >
      > [sulu] (slurm) etc> cat slurm.conf
      > ...
      > TopologyPlugin=topology/tree
      > SelectType=select/cons_res
      > SelectTypeParameters=CR_Core
      > ...
      >
      > [sulu] (slurm) etc> cat topology.conf
      > SwitchName=s1 Nodes=xna[13-26]
      > SwitchName=s2 Nodes=xna[41-45]
      > SwitchName=s3 Switches=s[1-2]
      >
      > [sulu] (slurm) etc> sinfo
      > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      > ...
      > jkob         up   infinite      4   idle xna[14,19-20,41]
      > ...
      >
      > [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
      > srun: Force Terminated job 79
      > srun: error: Unable to allocate resources: Requested node configuration is
      > not available
      >
      > The problem does not occur with select/linear, or topology/none, or if -N
      > is omitted, or for certain other values for -N (for example, -N 4-4 and -N
      > 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
      > src/plugins/select/cons_res/job_test.c. The srun man page states that when
      > -N is used, "the job will be allocated as many nodes as possible within
      > the range specified and without delaying the initiation of the job."
      > Consistent with this description, the requested number of nodes in the
      > above example is 4 (req_nodes=4).  However, the code that selects the
      > best-fit topology switches appears to make the selection based on the
      > minimum required number of nodes (min_nodes=2). It therefore selects
      > switch s1.  s1 has only 3 nodes from partition jkob. Since this is fewer
      > than req_nodes the job is rejected with the "node configuration" error.
      >
      > I'm not sure where the code is going wrong.  It could be in the
      > calculation of the number of needed nodes in function _enough_nodes.  Or
      > it could be in the code that initializes/updates req_nodes or rem_nodes. I
      > don't feel confident that I understand the logic well enough to propose a
      > fix without introducing a regression.
      >
      > Regards,
      > Martin
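      A hypothetical, much-simplified sketch of the selection issue described
      above (illustration only, not the actual _eval_nodes_topo code): a
      best-fit rule that targets min_nodes picks switch s1 (only 3 idle nodes
      in partition jkob) even though req_nodes=4, while targeting req_nodes
      would pick the higher-level switch s3.

          /* Hypothetical illustration, not SLURM source: pick the best-fit
           * switch as the smallest switch with enough idle nodes.          */
          #include <stdio.h>

          struct sw { const char *name; int idle_nodes; };

          static int pick_switch(const struct sw *sw, int cnt, int needed)
          {
              int best = -1;
              for (int i = 0; i < cnt; i++) {
                  if (sw[i].idle_nodes >= needed &&
                      (best < 0 || sw[i].idle_nodes < sw[best].idle_nodes))
                      best = i;
              }
              return best;
          }

          int main(void)
          {
              /* Idle nodes per switch in partition jkob from the report:
               * s1 -> xna[14,19-20], s2 -> xna[41], s3 -> all four nodes. */
              struct sw sw[] = { { "s1", 3 }, { "s2", 1 }, { "s3", 4 } };
              int min_nodes = 2, req_nodes = 4;

              printf("target min_nodes=%d -> %s (too small for req_nodes)\n",
                     min_nodes, sw[pick_switch(sw, 3, min_nodes)].name);
              printf("target req_nodes=%d -> %s (job fits)\n",
                     req_nodes, sw[pick_switch(sw, 3, req_nodes)].name);
              return 0;
          }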
  10. 27 Mar, 2012 2 commits
  11. 26 Mar, 2012 1 commit
  12. 21 Mar, 2012 2 commits
  13. 20 Mar, 2012 1 commit
  14. 16 Mar, 2012 7 commits
  15. 14 Mar, 2012 1 commit
    • Set Cray srun default job name · 0b24e690
      Morris Jette authored
      Cray - For the srun wrapper, when creating a job allocation, set the default
      job name to the executable file's name, ignoring leading directory names in
      the path.
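      A minimal sketch of the naming rule (an illustrative helper, not the
      wrapper's actual code):

          #include <string.h>

          /* "/home/alan/a.out" -> "a.out"; a bare name is returned as is. */
          static const char *default_job_name(const char *exec_path)
          {
              const char *slash = strrchr(exec_path, '/');
              return slash ? slash + 1 : exec_path;
          }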
  16. 13 Mar, 2012 3 commits
  17. 12 Mar, 2012 1 commit
  18. 02 Mar, 2012 1 commit
  19. 29 Feb, 2012 1 commit
  20. 28 Feb, 2012 1 commit
  21. 24 Feb, 2012 4 commits
  22. 23 Feb, 2012 1 commit
  23. 20 Feb, 2012 1 commit
  24. 06 Feb, 2012 1 commit
    • The openpty(3) call used by slurmstepd to allocate a pseudo-terminal · 2a1c08b0
      Danny Auble authored
      is a convenience function in BSD and glibc that internally calls
      the equivalent of
      
          int masterfd = open("/dev/ptmx", flags);
          grantpt (masterfd);
          unlockpt (masterfd);
          int slavefd = open (slave, O_RDRW|O_NOCTTY);
      
      (in pseudocode)
      
      On Linux, with some combinations of glibc and kernel (in this
      case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing
      in slurmstepd with EPERM, because the allocated pty was getting
      root ownership instead of being owned by the user running the SLURM job.
      
      From the POSIX description of grantpt:
      
       "The grantpt() function shall change the mode and ownership of the
        slave pseudo-terminal device... The user ID of the slave shall
        be set to the real UID of the calling process..."
      
       http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html
      
      This means that for POSIX compliance, the real user id of slurmstepd
      must be that of the user executing the SLURM job at the time openpty(3)
      is called. Unfortunately, the real user id of slurmstepd at this
      point is still root, and only the effective uid is set to the user.
      
      This patch is a work-around that uses the (non-portable) setresuid(2)
      system call to reset the real and effective uids of the slurmstepd
      process to the job user while keeping the saved uid of root. After
      the openpty(3) call, the previous credentials are reestablished
      using the same call.
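      A minimal sketch of that sequence (an assumed helper, not the actual
      slurmstepd code), with the saved uid left as root throughout:

          #define _GNU_SOURCE
          #include <pty.h>        /* openpty(); link with -lutil */
          #include <unistd.h>     /* setresuid()                 */

          static int alloc_pty_as_user(uid_t job_uid, int *master, int *slave)
          {
              /* Make the job user the real and effective uid so grantpt()
               * inside openpty() gives it ownership of the slave pty;
               * keep root as the saved uid so we can switch back.        */
              if (setresuid(job_uid, job_uid, 0) < 0)
                  return -1;

              int rc = openpty(master, slave, NULL, NULL, NULL);

              /* Reestablish the previous credentials: real uid root,
               * effective uid still the job user.                        */
              if (setresuid(0, job_uid, 0) < 0)
                  return -1;

              return rc;
          }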