Commits · 5a3b5858086ab2a68dd0276793a8641ffbe0f7fe · Manuel G. Marciani / ces_slurm_simulator

14 Mar, 2012 1 commit

Add Cray BASIL/XML logging options · 0a2b9b0f

Morris Jette authored Mar 13, 2012

Cray - Enable logging of BASIL communications with environment variables.
Set XML_LOG to enable logging. Set XML_LOG_LOC to specify path to log file
or "SLURM" to write to SlurmctldLogFile or unset for "slurm_basil_xml.log".
Based on work by Steve Tronfinoff, CSCS.

0a2b9b0f
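
The destination rules in this commit message can be sketched as below. This is a minimal illustration, not the plugin's code: the function name `basil_log_target` and its `slurmctld_log_file` parameter are hypothetical, and treating an empty `XML_LOG_LOC` like an unset one is an assumption. Only the environment variable names, the "SLURM" keyword, and the default file name come from the message.

```python
import os

DEFAULT_XML_LOG = "slurm_basil_xml.log"  # default file name from the commit message


def basil_log_target(env, slurmctld_log_file):
    """Return the file BASIL/XML traffic would be logged to, or None.

    Hypothetical helper that only mirrors the rules in the commit message.
    """
    if "XML_LOG" not in env:      # logging stays off unless XML_LOG is set
        return None
    loc = env.get("XML_LOG_LOC")
    if not loc:                   # unset (or empty, an assumption here)
        return DEFAULT_XML_LOG
    if loc == "SLURM":            # keyword: write to SlurmctldLogFile instead
        return slurmctld_log_file
    return loc                    # anything else is taken as a path


if __name__ == "__main__":
    # The SlurmctldLogFile path below is a placeholder, not a Slurm default.
    print(basil_log_target(dict(os.environ), "/var/log/slurmctld.log"))
```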

13 Mar, 2012 5 commits
- Enable Cray configure option of "--enable-salloc-background" · dee4155c
  Morris Jette authored Mar 13, 2012
```
permit the srun and salloc commands to be executed in the background
on Cray systems
```
  dee4155c
- Enable Cray configure option of "--enable-salloc-background" · bd4aff44
  Morris Jette authored Mar 13, 2012
```
permit the srun and salloc commands to be executed in the background
on Cray systems
```
  bd4aff44
- Add job reason of "FrontEndDown" · c6d9a826
  Morris Jette authored Mar 13, 2012
```
Add new job state reason of "FrontEndDown" which applies only to Cray and
IBM BlueGene systems.
```
  c6d9a826
- CRAY - ignore all interactive nodes and jobs on interactive nodes. · 6f487fde
  Danny Auble authored Mar 12, 2012
  
  6f487fde
- CRAY - ignore all interactive nodes and jobs on interactive nodes. · 8f12be5d
  Danny Auble authored Mar 12, 2012
  
  8f12be5d
12 Mar, 2012 1 commit
- BLUEGENE - fix issue where if a small block was in error it could hold up · 1306cbe3
  Danny Auble authored Mar 12, 2012
```
the queue when trying to place a larger than midplane job.
```
  1306cbe3
09 Mar, 2012 1 commit
- NEWS for last patch · f40d25ee
  Danny Auble authored Mar 09, 2012
  
  f40d25ee
07 Mar, 2012 1 commit

FRONTEND - fix issue where if a compute node was in a down state and · caadbfcb

Danny Auble authored Mar 06, 2012

an admin updates the node to idle/resume, the compute nodes will go
instantly to idle instead of idle* (which means no response).

caadbfcb

06 Mar, 2012 2 commits
- BLUEGENE - make it so the epilog runs until slurmctld tells it the job is · 9c461154
  Danny Auble authored Mar 06, 2012
```
gone.  Previously it had a time limit, which has proven not to be the right
thing.
```
  9c461154
- BGQ - catch errors from the kill option of the runjob client. · 2a56fd6d
  Danny Auble authored Mar 06, 2012
  
  2a56fd6d
02 Mar, 2012 1 commit
- cray/srun wrapper, don't use aprun -q by default · ea9adc17
  Morris Jette authored Mar 02, 2012
```
In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
option is used.
```
  ea9adc17
29 Feb, 2012 1 commit
- Fix bug in cray/srun wrapper stdin/out/err file handling. · 2ca7a0fc
  Morris Jette authored Feb 29, 2012
  
  2ca7a0fc
28 Feb, 2012 1 commit
- Note recent SLURM changes. · 38619c30
  Morris Jette authored Feb 28, 2012
  
  38619c30
24 Feb, 2012 5 commits
- Change default SchedulerParameters max_switch_wait · 46d63854
  Morris Jette authored Feb 24, 2012
```
Change default SchedulerParameters max_switch_wait field value from 60 to
300 seconds.
```
  46d63854
- Add missing read lock to slurmctld/agent.c · 0a06f4e6
  Morris Jette authored Feb 24, 2012
  
  0a06f4e6
- Correct "scontrol show daemons" if multiple ControlMachine hosts configured · 10916457
  Morris Jette authored Feb 24, 2012
  
  10916457
- Fixed extremely hard to reproduce threading issue in assoc_mgr. · b4e5051b
  Danny Auble authored Feb 24, 2012
  
  b4e5051b
- Update NEWS for recent patches · 6da55b36
  Morris Jette authored Feb 23, 2012
  
  6da55b36
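
The `max_switch_wait` default change listed above can be pictured with a small parser. This is a sketch under stated assumptions: the function name is hypothetical and the parsing is not taken from slurmctld; it relies only on `SchedulerParameters` being a comma-separated list of `key=value` items and on the new 300-second default named in the commit.

```python
DEFAULT_MAX_SWITCH_WAIT = 300  # seconds; new default per the commit (previously 60)


def max_switch_wait(scheduler_parameters):
    """Return max_switch_wait from a SchedulerParameters-style string.

    Falls back to the default when the field is absent.
    """
    for item in scheduler_parameters.split(","):
        key, sep, value = item.strip().partition("=")
        if key == "max_switch_wait" and sep:
            return int(value)
    return DEFAULT_MAX_SWITCH_WAIT
```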
23 Feb, 2012 1 commit
- Fix smap regression to display nodes that are drained or down correctly. · 3f467a75
  Danny Auble authored Feb 22, 2012
  
  3f467a75
20 Feb, 2012 1 commit
- Modify linking to include "-ldl" only when needed · d1adfe62
  jette authored Feb 19, 2012
```
Patch from Aleksej Saushev.
```
  d1adfe62
17 Feb, 2012 1 commit

BGQ - In scontrol/sview node counts are now displayed with · a78fcbe2

Danny Auble authored Feb 17, 2012

CnodeCount/CnodeErrCount to point out that there are cnodes in an error state
on the block. Draining the block and having it reboot when all jobs are
gone will clear up the cnodes in Software Failure.

a78fcbe2

16 Feb, 2012 1 commit

BGQ - fixed sync issue where if a job finishes in SLURM but not in mmcs · 2d8cdc98

Danny Auble authored Feb 16, 2012

for a long time after the SLURM job has been flushed from the system
we don't have to worry about rebooting the block to sync the system.

2d8cdc98

11 Feb, 2012 1 commit
- BGQ - fix for core dump after running multiple sub-block jobs on static · 5d2b961d
  Danny Auble authored Feb 10, 2012
```
blocks.
```
  5d2b961d
06 Feb, 2012 4 commits

BGQ - fix for handling mix of steps running at same time some of which · 5cb21068

Danny Auble authored Feb 06, 2012

are full allocation jobs, and others that are smaller.

5cb21068

BLUEGENE - Better handling blocks that go into error state or deallocate · 915881ab

Danny Auble authored Feb 06, 2012

while jobs are running on them.

915881ab

NEWS for last BGQ comment · 278179d3

Danny Auble authored Feb 06, 2012

278179d3

The openpty(3) call used by slurmstepd to allocate a pseudo-terminal · 2a1c08b0

Danny Auble authored Feb 02, 2012

is a convenience function in BSD and glibc that internally calls
the equivalent of

    int masterfd = open("/dev/ptmx", flags);
    grantpt(masterfd);
    unlockpt(masterfd);
    char *slave = ptsname(masterfd);
    int slavefd = open(slave, O_RDWR|O_NOCTTY);

(in pseudocode)

On Linux, with some combinations of glibc/kernel (in this
case glibc-2.14/Linux-3.1), the equivalent of grantpt(3) was failing
in slurmstepd with EPERM, because the allocated pty was getting
root ownership instead of the user running the slurm job.

From the POSIX description of grantpt:

 "The grantpt() function shall change the mode and ownership of the
  slave pseudo-terminal device... The user ID of the slave shall
  be set to the real UID of the calling process..."

 http://pubs.opengroup.org/onlinepubs/007904875/functions/grantpt.html

This means that for POSIX-compliance, the real user id of slurmstepd
must be the user executing the SLURM job at the time openpty(3) is
called. Unfortunately, the real user id of slurmstepd at this
point is still root, and only the effective uid is set to the user.

This patch is a work-around that uses the (non-portable) setresuid(2)
system call to reset the real and effective uids of the slurmstepd
process to the job user, but keep the saved uid of root. Then after
the openpty(3) call, the previous credentials are reestablished
using the same call.

2a1c08b0
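
The credential swap described in this commit can be sketched in Python, whose `os.setresuid` wraps the same system call. This is an illustration of the pattern only: slurmstepd is written in C, and the helper name `call_with_real_uid` is hypothetical.

```python
import os


def call_with_real_uid(job_uid, func):
    """Run func() with the real and effective uid set to job_uid, then restore.

    Sketch of the work-around in the commit message, not slurmstepd's code.
    """
    saved = os.getresuid()              # (real, effective, saved) before the swap
    # Passing -1 leaves the saved uid untouched, so a saved uid of root
    # remains available for restoring the original credentials afterwards.
    os.setresuid(job_uid, job_uid, -1)
    try:
        return func()
    finally:
        os.setresuid(*saved)            # reestablish the previous credentials
```

A call such as `call_with_real_uid(job_uid, os.openpty)` would then allocate the pseudo-terminal while the real uid is the job user. An unprivileged process can only pass its own uid.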

04 Feb, 2012 1 commit

Fix for srun with --exclude and --nodes · a79386fd

Morris Jette authored Feb 03, 2012

Fix for srun running within an existing allocation with the --exclude
option and a --nodes count small enough to remove more nodes.

    > salloc -N 8
    salloc: Granted job allocation 1000008
    > srun -N 2 -n 2 --exclude=tux3 hostname
    srun: error: Unable to create job step: Requested node configuration is not available

Patch from Phil Eckert, LLNL.

a79386fd
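
The behavior expected after this fix can be illustrated with a toy node picker. It is a sketch, not Slurm's step allocation logic: the function name is hypothetical, and the node names `tux0` to `tux7` are assumed (the transcript only names `tux3`).

```python
def pick_step_nodes(allocated, excluded, node_count):
    """Choose node_count nodes for a job step from an existing allocation.

    Illustrative only: excluded nodes are dropped first, and the request
    fails only if too few usable nodes remain.
    """
    skip = set(excluded)
    usable = [node for node in allocated if node not in skip]
    if len(usable) < node_count:
        raise ValueError("Requested node configuration is not available")
    return usable[:node_count]


if __name__ == "__main__":
    allocation = [f"tux{i}" for i in range(8)]        # stands in for salloc -N 8
    print(pick_step_nodes(allocation, ["tux3"], 2))   # prints ['tux0', 'tux1']
```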

03 Feb, 2012 1 commit

Fix for srun with --exclude and --nodes · a4551158

Morris Jette authored Feb 03, 2012

Fix for srun running within an existing allocation with the --exclude
option and a --nodes count small enough to remove more nodes.

    > salloc -N 8
    salloc: Granted job allocation 1000008
    > srun -N 2 -n 2 --exclude=tux3 hostname
    srun: error: Unable to create job step: Requested node configuration is not available

Patch from Phil Eckert, LLNL.

a4551158

02 Feb, 2012 3 commits

Fix bug in step task distribution · 11db9adb

Morris Jette authored Feb 02, 2012

Fix bug in step task distribution when nodes are not configured in numeric
order. Patch from Hongjia Cao, NUDT.

11db9adb

Fix bug in step task distribution · fac3586b

Morris Jette authored Feb 02, 2012

Fix bug in step task distribution when nodes are not configured in numeric
order. Patch from Hongjia Cao, NUDT.

fac3586b

Transfer GPU file information to slurmstepd · bccf0f85

Morris Jette authored Feb 01, 2012

Add logic to cache GPU file information (bitmap index mapping to device
file number) in the slurmd daemon and transfer that information to the
slurmstepd whenever a job step is initiated. This is needed to set the
appropriate CUDA_VISIBLE_DEVICES environment variable value when the
devices are not in strict numeric order (e.g. some GPUs are skipped).
Based upon work by Nicolas Bigaouette.

bccf0f85
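
The index-to-device mapping this commit caches can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the list-based cache are hypothetical, and the `/dev/nvidia*` device naming in the comments is assumed rather than taken from the commit.

```python
def cuda_visible_devices(allocated_indexes, device_numbers):
    """Build a CUDA_VISIBLE_DEVICES value from allocated bitmap indexes.

    device_numbers[i] is the device file number behind bitmap index i,
    e.g. [0, 2, 3] when /dev/nvidia1 is skipped (device paths assumed).
    """
    return ",".join(str(device_numbers[i]) for i in sorted(allocated_indexes))


if __name__ == "__main__":
    cached = [0, 2, 3]                           # bitmap index -> device file number
    print(cuda_visible_devices({1, 2}, cached))  # prints 2,3
```

Without the cached mapping, bitmap indexes 1 and 2 would be exported as `1,2`, which would point at the wrong devices on a node where a GPU number is skipped.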

01 Feb, 2012 2 commits

Fix job requeue bug · c0a7a7a4

Morris Jette authored Feb 01, 2012

Fix bug where a requeued batch job is scheduled to run on a different node
zero but attempts job launch on the old node zero, causing the fatal error
"Invalid host_index -1 for job #".

c0a7a7a4

Avoid slurmctld abort due to bad pointer · 43936335

Morris Jette authored Jan 31, 2012

Avoid slurmctld abort due to bad pointer when setting an advanced
reservation MAINT flag if it contains no nodes (only licenses).

43936335

31 Jan, 2012 4 commits
- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all · 1e40f647
  Danny Auble authored Jan 31, 2012
```
blocks are in an error state.
```
  1e40f647
- Start news for v2.4.0-pre4 · 9a48840d
  Morris Jette authored Jan 31, 2012
  
  9a48840d
- Note nature of latest change · 7189ecaa
  Morris Jette authored Jan 31, 2012
  
  7189ecaa
- Fix to the multifactor priority plugin to calculate effective usage earlier · 7d9e3ed2
  Danny Auble authored Jan 31, 2012
```
to give a correct priority on the first decay cycle after a restart of the
slurmctld. Patch from Martin Perry, Bull.
```
  7d9e3ed2
28 Jan, 2012 1 commit
- BGQ - Fix issue where a system with missing cables could cause core dump. · 85d4f920
  Danny Auble authored Jan 27, 2012
  
  85d4f920