1. 11 Jan, 2011 5 commits
  2. 10 Jan, 2011 6 commits
    • Moe Jette authored · 31776ea0
    • -- Add scancel --reservation option to cancel all jobs associated with a
         specific reservation. · 2aba39da
      Moe Jette authored
    • svn merge -r 21695:22013 https://eris.llnl.gov/svn/slurm/branches/v2.3-frontend · 1c335b73
      Moe Jette authored
       -- Added support for more than one front-end node to run slurmd on
          architectures where the slurmd does not execute on the compute nodes
          (e.g. BlueGene). New configuration parameters FrontendNode and FrontendAddr
          added. See "man slurm.conf" for more information.
    • Moe Jette authored · fec68ac8
    • Patch from Gerrit: 01_salloc-Bug-Fix-nested-terminal-foreground-process.diff · 69e1d108
      Moe Jette authored
      
      salloc: notify terminal foreground process
      
      This fixes another bug observed in salloc child process cleanup. I found
      that some shells, e.g. zsh, do not forward all signals to their children.
      
      The patch fixes the problem that
       * command_pid is still active but does not equal tpgid,
       * tpgid is not the same as salloc's process group,
       * tpgid is very unlikely to come from another process, since we block
         the suspend/TSTP signal,
       * signalling command_pid does not automatically imply that the active
         terminal foreground process is also signalled,
       * hence send a HUP to signify "death of controlling process".
      
      This setup fixed the problem on zsh. I then went and tested a more complex setup:
      
      Before:
      -------
      palu2:0 ~>ps  f -o pid,pgid,tpgid,ppid,stat,tty,cmd
        PID  PGID TPGID  PPID STAT TT       CMD
      21117 21117 21597 21116 Ss   pts/9    -bash
      21260 21260 21597 21117 Sl   pts/9     \_ ./slurm_build/git/src/salloc/salloc -v --time=00:01:00 -N17 zsh
      21266 21266 21597 21260 S    pts/9         \_ zsh
      21323 21323 21597 21266 S    pts/9             \_ /bin/bash
      21397 21397 21597 21323 S    pts/9                 \_ -bin/tcsh
      21526 21526 21597 21397 S    pts/9                     \_ /bin/sh
      21597 21597 21597 21526 S+   pts/9                         \_ aprun -N1 -n17 sleep 12345
      21601 21597 21597 21597 S+   pts/9                             \_ aprun -N1 -n17 sleep 12345
      
      After the timeout:
      ------------------
      palu2:0 ~>ps  f -o pid,pgid,tpgid,ppid,stat,tty,cmd
        PID  PGID TPGID  PPID STAT TT       CMD
      21323 21323 21117     1 S    pts/9    /bin/bash
      21397 21397 21117 21323 S    pts/9     \_ -bin/tcsh
      21526 21526 21117 21397 S    pts/9         \_ /bin/sh
      
      ==> The 'dangerous' aprun terminal foreground process group 21597 has been removed, while the child
          subprocess groups 21323, 21397, and 21526 now exist as orphaned groups, to be cleaned up by init.
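
      The conditions listed in the commit message above can be illustrated with a
      small C sketch. This is not the code from the patch itself: the function
      names should_signal_foreground and notify_foreground are hypothetical, and
      the guard against process group IDs <= 1 is an added assumption. It only
      shows the idea of looking up the terminal foreground process group (tpgid)
      with tcgetpgrp() and sending it a HUP when it is neither the launched
      command nor salloc's own process group.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if the terminal foreground process group
 * should receive its own signal, 0 otherwise.
 */
int should_signal_foreground(pid_t tpgid, pid_t command_pid, pid_t own_pgid)
{
    /* tcgetpgrp() returns -1 on error; never signal group 0 or 1 either
     * (assumption added for safety: kill(-1, ...) would broadcast). */
    if (tpgid <= 1)
        return 0;
    /* The command itself is signalled separately. */
    if (tpgid == command_pid)
        return 0;
    /* Do not send a HUP to our own process group. */
    if (tpgid == own_pgid)
        return 0;
    return 1;
}

/*
 * Hypothetical helper: send SIGHUP ("death of controlling process") to the
 * foreground process group of the terminal open on tty_fd, if appropriate.
 * Returns 0 if nothing needed to be sent, otherwise the result of kill().
 */
int notify_foreground(int tty_fd, pid_t command_pid)
{
    pid_t tpgid = tcgetpgrp(tty_fd);

    if (!should_signal_foreground(tpgid, command_pid, getpgrp()))
        return 0;
    /* A negative pid makes kill() address the whole process group. */
    return kill(-tpgid, SIGHUP);
}
```

      With the values from the "Before" listing (tpgid 21597, command zsh with
      PID 21266, salloc in process group 21260), should_signal_foreground()
      returns 1, so the aprun foreground group would receive the HUP.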
  3. 08 Jan, 2011 1 commit
  4. 07 Jan, 2011 10 commits
  5. 06 Jan, 2011 6 commits
  6. 05 Jan, 2011 3 commits
  7. 04 Jan, 2011 7 commits
  8. 03 Jan, 2011 2 commits