- 15 Jan, 2011 1 commit
-
-
Danny Auble authored
-
- 14 Jan, 2011 9 commits
-
-
Don Lipari authored
slurm.conf man page
-
Danny Auble authored
-
Danny Auble authored
BLUEGENE - fixed race condition with preemption where, under certain timing, the slurmctld could lock up when preempting jobs to run others.
-
Moe Jette authored
This is needed by select/bluegene to synchronize block state for completing jobs.
-
Moe Jette authored
-
Moe Jette authored
with respect to BlueGene systems. Formerly logged bogus inconsistencies.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Fixed issue where QOS priority wasn't re-normalized until a slurmctld restart when a QOS priority was changed.
-
- 13 Jan, 2011 3 commits
-
-
Danny Auble authored
A QOS with UsageFactor set to 0 now causes jobs running under that QOS to add no time to fairshare or association/QOS limits.
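As a rough illustration of the UsageFactor behaviour described in this commit, the sketch below scales a job's raw usage by its QOS usage factor, so a factor of 0 charges nothing. The function name `charged_usage` and the formula (elapsed seconds times CPUs times factor) are assumptions made for illustration, not the actual slurmctld accounting code.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a SLURM function): usage charged against
 * fairshare and association/QOS limits for one job run.
 * Assumes raw usage is elapsed seconds multiplied by allocated CPUs.
 */
double charged_usage(uint32_t elapsed_secs, uint32_t cpus, double usage_factor)
{
    /* With usage_factor == 0 the product is 0, so the job adds no time. */
    return (double)elapsed_secs * (double)cpus * usage_factor;
}
```

Under these assumptions, a one-hour job on 4 CPUs is charged 14400 CPU-seconds at a factor of 1.0 and nothing at a factor of 0.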
-
Moe Jette authored
-
Don Lipari authored
-
- 12 Jan, 2011 4 commits
-
-
Moe Jette authored
job. Formerly would display information about one job, but update next selected job.
-
Don Lipari authored
-
Joseph P. Donaghy authored
-
Moe Jette authored
until the job is released. Patch from Rod Schultz, Bull.
-
- 11 Jan, 2011 11 commits
-
-
Moe Jette authored
attempts from 60 seconds to 29 seconds. This should eliminate a possible synchronization problem with gang scheduling that could result in job step creation requests only occurring when a job is suspended.
-
Danny Auble authored
BLUEGENE - better checking of small blocks in dynamic mode to determine whether a full midplane job could run.
-
Moe Jette authored
running to be more clear and only print when --verbose option is used.
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
its partition is configured "Shared=EXCLUSIVE" (which is redundant).
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
after a job completed.
-
Moe Jette authored
-
- 10 Jan, 2011 5 commits
-
-
Moe Jette authored
size get queued).
-
Moe Jette authored
specific reservation.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
salloc: notify terminal foreground process

This fixes another bug observed in salloc child process cleanup. I found that some shells, e.g. zsh, do not forward all signals to their children. The patch fixes the problem that
 * command_pid is still active but does not equal tpgid,
 * tpgid is not the same as salloc's process group,
 * tpgid is very unlikely to come from another process, since we block the suspend/TSTP signal,
 * signalling command_pid does not automatically imply that the active terminal foreground process is also signalled,
 * hence send a HUP to signify "death of controlling process".

This setup fixed the problem on zsh. I then went and tested a more complex setup:

Before:
-------
palu2:0 ~>ps f -o pid,pgid,tpgid,ppid,stat,tty,cmd
  PID  PGID TPGID  PPID STAT TT       CMD
21117 21117 21597 21116 Ss   pts/9    -bash
21260 21260 21597 21117 Sl   pts/9     \_ ./slurm_build/git/src/salloc/salloc -v --time=00:01:00 -N17 zsh
21266 21266 21597 21260 S    pts/9         \_ zsh
21323 21323 21597 21266 S    pts/9             \_ /bin/bash
21397 21397 21597 21323 S    pts/9                 \_ -bin/tcsh
21526 21526 21597 21397 S    pts/9                     \_ /bin/sh
21597 21597 21597 21526 S+   pts/9                         \_ aprun -N1 -n17 sleep 12345
21601 21597 21597 21597 S+   pts/9                             \_ aprun -N1 -n17 sleep 12345

After the timeout:
------------------
palu2:0 ~>ps f -o pid,pgid,tpgid,ppid,stat,tty,cmd
  PID  PGID TPGID  PPID STAT TT       CMD
21323 21323 21117     1 S    pts/9    /bin/bash
21397 21397 21117 21323 S    pts/9     \_ -bin/tcsh
21526 21526 21117 21397 S    pts/9         \_ /bin/sh

==> The 'dangerous' aprun terminal foreground process group 21597 has been removed, while the child subprocess groups 21323, 21397, and 21526 now exist as orphaned groups, to be cleaned up by init.

01_salloc-Bug-Fix-nested-terminal-foreground-process.diff
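The check described in the salloc commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch: the function names `needs_foreground_hup` and `notify_foreground` are hypothetical, and only the standard POSIX calls `tcgetpgrp()`, `getpgrp()`, and `kill()` are used.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 when the terminal foreground process
 * group (tpgid) is neither the spawned command nor salloc's own group,
 * so signalling command_pid alone would not reach it.
 */
int needs_foreground_hup(pid_t tpgid, pid_t command_pid, pid_t own_pgid)
{
    if (tpgid <= 0)
        return 0;               /* no foreground group could be determined */
    if (tpgid == command_pid)
        return 0;               /* the command itself is in the foreground */
    if (tpgid == own_pgid)
        return 0;               /* never signal our own process group */
    return 1;
}

/*
 * Hypothetical helper: send SIGHUP ("death of controlling process")
 * to the foreground process group of the terminal open on tty_fd.
 * Returns 0 if nothing needed to be sent, otherwise the result of kill().
 */
int notify_foreground(int tty_fd, pid_t command_pid)
{
    pid_t tpgid = tcgetpgrp(tty_fd);

    if (!needs_foreground_hup(tpgid, command_pid, getpgrp()))
        return 0;
    /* A negative pid addresses every process in the group tpgid. */
    return kill(-tpgid, SIGHUP);
}
```

With the process IDs from the "Before" listing (foreground group 21597, command zsh at 21266, salloc in group 21260), `needs_foreground_hup` returns 1, which matches the case where the aprun group receives the HUP.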
-
- 07 Jan, 2011 2 commits
- 06 Jan, 2011 3 commits
- 03 Jan, 2011 2 commits
-
-
Moe Jette authored
-
Danny Auble authored
-