- 14 Mar, 2012 1 commit
-
-
Morris Jette authored
Cray - Enable logging of BASIL communications with environment variables. Set XML_LOG to enable logging. Set XML_LOG_LOC to specify path to log file or "SLURM" to write to SlurmctldLogFile or unset for "slurm_basil_xml.log". Based on work by Steve Tronfinoff, CSCS.
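The selection rules in this message can be sketched as a small lookup. This is an illustration only, not the Cray plugin code: the function name `basil_log_target` and its arguments are hypothetical, and treating any set value of XML_LOG (even an empty one) as "enabled" is an assumption.

```python
DEFAULT_LOG_NAME = "slurm_basil_xml.log"  # default name quoted in the commit message


def basil_log_target(env, slurmctld_log_file):
    """Return the log destination implied by XML_LOG / XML_LOG_LOC, or None.

    env is any mapping of environment variables (for example os.environ).
    slurmctld_log_file stands in for the configured SlurmctldLogFile path.
    """
    # Assumption: the mere presence of XML_LOG enables logging.
    if "XML_LOG" not in env:
        return None
    location = env.get("XML_LOG_LOC")
    if location is None:
        return DEFAULT_LOG_NAME      # XML_LOG_LOC unset: use the default file
    if location == "SLURM":
        return slurmctld_log_file    # special value: write to SlurmctldLogFile
    return location                  # anything else is taken as a path
```

For example, `basil_log_target({"XML_LOG": "1", "XML_LOG_LOC": "SLURM"}, "/var/log/slurmctld.log")` picks the slurmctld log path passed in.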
-
- 13 Mar, 2012 5 commits
-
-
Morris Jette authored
Permit the srun and salloc commands to be executed in the background on Cray systems.
-
Morris Jette authored
Permit the srun and salloc commands to be executed in the background on Cray systems.
-
Morris Jette authored
Add new job state reason of "FrontEndDown" which applies only to Cray and IBM BlueGene systems.
-
Danny Auble authored
-
Danny Auble authored
-
- 12 Mar, 2012 1 commit
-
-
Danny Auble authored
the queue when trying to place a larger than midplane job.
-
- 09 Mar, 2012 1 commit
-
-
Danny Auble authored
-
- 07 Mar, 2012 1 commit
-
-
Danny Auble authored
an admin updates the node to idle/resume the compute nodes will go instantly to idle instead of idle* which means no response.
-
- 06 Mar, 2012 2 commits
-
-
Danny Auble authored
gone. Previously it had a timelimit which has proven to not be the right thing.
-
Danny Auble authored
-
- 02 Mar, 2012 1 commit
-
-
Morris Jette authored
In cray/srun wrapper, only include aprun "-q" option when srun "--quiet" option is used.
-
- 29 Feb, 2012 1 commit
-
-
Morris Jette authored
-
- 28 Feb, 2012 1 commit
-
-
Morris Jette authored
-
- 24 Feb, 2012 5 commits
-
-
Morris Jette authored
Change default SchedulerParameters max_switch_wait field value from 60 to 300 seconds.
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
-
- 23 Feb, 2012 1 commit
-
-
Danny Auble authored
-
- 20 Feb, 2012 1 commit
-
-
jette authored
Patch from Aleksej Saushev.
-
- 17 Feb, 2012 1 commit
-
-
Danny Auble authored
CnodeCount/CnodeErrCount to point out that there are cnodes in an error state on the block. Draining the block and having it reboot when all jobs are gone will clear up the cnodes in Software Failure.
-
- 16 Feb, 2012 1 commit
-
-
Danny Auble authored
for a long time after the SLURM job has been flushed from the system we don't have to worry about rebooting the block to sync the system.
-
- 11 Feb, 2012 1 commit
-
-
Danny Auble authored
blocks.
-
- 06 Feb, 2012 4 commits
-
-
Danny Auble authored
are full allocation jobs, and others that are smaller.
-
Danny Auble authored
while jobs are running on them.
-
Danny Auble authored
-
Danny Auble authored
is a convenience function in BSD and glibc that internally calls the equivalent of (in pseudocode):

    int masterfd = open("/dev/ptmx", flags);
    grantpt(masterfd);
    unlockpt(masterfd);
    int slavefd = open(slave, O_RDWR|O_NOCTTY);

On Linux, with some combinations of glibc/kernel (in this case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing in slurmstepd with EPERM, because the allocated pty was getting root ownership instead of that of the user running the slurm job.

From the POSIX description of grantpt: "The grantpt() function shall change the mode and ownership of the slave pseudo-terminal device... The user ID of the slave shall be set to the real UID of the calling process..." http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html

This means that for POSIX compliance, the real user id of slurmstepd must be the user executing the SLURM job at the time openpty(3) is called. Unfortunately, the real user id of slurmstepd at this point is still root, and only the effective uid is set to the user.

This patch is a work-around that uses the (non-portable) setresuid(2) system call to reset the real and effective uids of the slurmstepd process to the job user, but keep the saved uid of root. Then, after the openpty(3) call, the previous credentials are reestablished using the same call.
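The credential swap described above can be sketched as follows. This is a minimal Python illustration of the pattern, not the slurmstepd patch itself (which is C): the function name `open_pty_as_user` and the injectable `open_pty` parameter are hypothetical, and the sketch assumes a Unix system where `os.getresuid` and `os.setresuid` are available.

```python
import os


def open_pty_as_user(job_uid, open_pty=os.openpty):
    """Call open_pty() while the real and effective UIDs are job_uid.

    The saved UID is kept as it was (root, in the daemon case), which is
    what allows the original credentials to be restored afterwards.
    """
    real_uid, effective_uid, saved_uid = os.getresuid()
    # Switch real and effective UID to the job user; keep the saved UID.
    os.setresuid(job_uid, job_uid, saved_uid)
    try:
        # grantpt() assigns slave ownership from the real UID at this point.
        return open_pty()
    finally:
        # Reestablish the previous credentials with the same call.
        os.setresuid(real_uid, effective_uid, saved_uid)
```

The `open_pty` argument defaults to `os.openpty`, which returns a `(master_fd, slave_fd)` pair; making it a parameter is only a convenience for trying the pattern where no pty device is available.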
-
- 04 Feb, 2012 1 commit
-
-
Morris Jette authored
Fix for srun running within an existing allocation with the --exclude option and a --nodes count small enough to remove more nodes.
> salloc -N 8
salloc: Granted job allocation 1000008
> srun -N 2 -n 2 --exclude=tux3 hostname
srun: error: Unable to create job step: Requested node configuration is not available
Patch from Phil Eckert, LLNL.
-
- 03 Feb, 2012 1 commit
-
-
Morris Jette authored
Fix for srun running within an existing allocation with the --exclude option and a --nodes count small enough to remove more nodes.
> salloc -N 8
salloc: Granted job allocation 1000008
> srun -N 2 -n 2 --exclude=tux3 hostname
srun: error: Unable to create job step: Requested node configuration is not available
Patch from Phil Eckert, LLNL.
-
- 02 Feb, 2012 3 commits
-
-
Morris Jette authored
Fix bug in step task distribution when nodes are not configured in numeric order. Patch from Hongjia Cao, NUDT.
-
Morris Jette authored
Fix bug in step task distribution when nodes are not configured in numeric order. Patch from Hongjia Cao, NUDT.
-
Morris Jette authored
Add logic to cache GPU file information (bitmap index mapping to device file number) in the slurmd daemon and transfer that information to the slurmstepd whenever a job step is initiated. This is needed to set the appropriate CUDA_VISIBLE_DEVICES environment variable value when the devices are not in strict numeric order (e.g. some GPUs are skipped). Based upon work by Nicolas Bigaouette.
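To illustrate why the index-to-device-number mapping matters, here is a minimal sketch. It is not the slurmd/slurmstepd code: the function names, the `/dev/nvidiaN` naming, and the rule of taking the trailing digits of each device file name are assumptions made for the example.

```python
import re


def device_numbers(device_files):
    """Return a list mapping bitmap index -> device file number.

    Assumes each device file name ends in its number, e.g. /dev/nvidia2.
    """
    numbers = []
    for path in device_files:
        match = re.search(r"(\d+)$", path)
        if match is None:
            raise ValueError("no trailing device number in %r" % path)
        numbers.append(int(match.group(1)))
    return numbers


def cuda_visible_devices(index_to_number, allocated_indices):
    """Build a CUDA_VISIBLE_DEVICES value from allocated bitmap indices."""
    return ",".join(str(index_to_number[i]) for i in sorted(allocated_indices))
```

With device files `/dev/nvidia0`, `/dev/nvidia2`, and `/dev/nvidia3` (GPU 1 skipped), bitmap indices 1 and 2 correspond to device numbers 2 and 3, so using the bitmap indices directly would name the wrong devices.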
-
- 01 Feb, 2012 2 commits
-
-
Morris Jette authored
Fix bug when requeued batch job is scheduled to run on a different node zero, but attempts job launch on old node zero, causing fatal error "Invalid host_index -1 for job #".
-
Morris Jette authored
Avoid slurmctld abort due to bad pointer when setting an advanced reservation MAINT flag if it contains no nodes (only licenses).
-
- 31 Jan, 2012 4 commits
-
-
Danny Auble authored
blocks are in an error state.
-
Morris Jette authored
-
Morris Jette authored
-
Danny Auble authored
to give a correct priority on the first decay cycle after a restart of the slurmctld. Patch from Martin Perry, Bull.
-
- 28 Jan, 2012 1 commit
-
-
Danny Auble authored
-