This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins. 

* Changes in SLURM 0.2.12
=========================
 -- Fixes for reported problems:
   - Fix for "waitpid: No child processes" when using TotalView (slurm/217).
   - Implemented temporary workaround for slurm/223: "Munge decode failed: 
     Munged communication error." 
   - Temporary fix for slurm/222: "elan3_create(0): Invalid argument."
 -- Fixed memory leaks in slurmctld (mostly due to reconfigure).
 -- More squeue/sinfo interface changes (see squeue(1), sinfo(1)).
 -- Sinfo now accepts list of node states to -t,--state option.
 -- Node "reason" field now available via sinfo command (see sinfo(1)).
 -- Wrapper source for srun (srun.wrapper.c) now installed and available
    for TotalView support.
 -- Improved retry logic in user commands for periods when slurmctld
    primary is down and backup has not yet taken over.

* Changes in SLURM 0.2.11
=========================
 -- Changes in srun:
   - Fixed bug in signal handling that occasionally resulted in orphaned
     jobs when using Ctrl-C.
   - Return non-zero exit code when remote tasks are killed by a signal.
   - SIGALRM is now blocked by default.
 -- Added ``reason'' string for down, drained, or draining nodes. 
 -- Added -V,--version option to squeue and sinfo.
 -- Improved some error messages from user utilities.

* Changes in SLURM 0.2.10
=========================
 -- New slurm.conf configuration parameters:
   - WaitTime:    Default for srun -w,--wait parameter.
   - MaxJobCount: Maximum number of jobs SLURM can handle at one time.
   - MinJobAge:   Minimum time since completing before job is purged from 
                  slurmctld memory.
 -- Block user defined signals USR1 and USR2 in slurmd session manager.
 -- More squeue cleanup.
 -- Support for passing options to sinfo via environment variables.
 -- Added option to scontrol to find intersection of completing jobs and nodes.
 -- Added fix in auth_munge to prevent "Munged communication error" message.

* Changes in SLURM 0.2.9
========================
 -- Fixes for reported problems:
   - Argument to srun `-n' option was taken as octal if preceded with a `0'.
 -- New format for Elan hosts config file (/etc/elanhosts; see README).
 -- Various fixes for managing COMPLETING jobs.
 -- Support for passing options to squeue via environment variables 
    (see squeue(1))
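
The octal problem fixed above for srun `-n' is a common pitfall when a
numeric argument is parsed with automatic base detection, as C's
strtol() does with base 0. The sketch below only illustrates that
pitfall. It is not srun's code, and both function names are
hypothetical.

```python
def parse_count_auto_base(text):
    """Mimic automatic base detection (like C strtol with base 0).

    Hypothetical helper: a leading "0x" selects hex and a leading "0"
    selects octal, which is the surprising behavior.
    """
    if text.startswith(("0x", "0X")):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


def parse_count_decimal(text):
    """Always parse as base 10, so a leading zero has no effect."""
    return int(text, 10)


# "010" is read as octal 8 with automatic base detection,
# but as decimal 10 when the base is fixed.
auto_value = parse_count_auto_base("010")
decimal_value = parse_count_decimal("010")
```

Fixing the base at 10 is the simplest way to make a task count such as
"010" mean ten rather than eight.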

* Changes in SLURM 0.2.8
=========================
 -- Fix for bug in slurmd that could make debug messages appear in job output.
 -- Fix for bug in slurmctld retry count computation.
 -- Srun now times out slow launch threads.
 -- "Time Used" output in squeue now includes seconds.

* Changes in SLURM 0.2.7
=========================
 -- Fix for bug in Elan module that results in slurmd hang.
 -- Added completing job state to default list of states to print with squeue.

* Changes in SLURM 0.2.6
=========================
 -- More fixes for handling cleanup of slow terminating jobs.
 -- Fixed bug in srun that might leave nodes allocated after a Ctrl-C.

* Changes in SLURM 0.2.5
=========================
 -- Various fixes for cleanup of slow terminating or unkillable jobs.
 -- Fixed some small memory leaks in communications code.
 -- Added hack for synchronized exit of jobs on large node count.
 -- Long lists of nodes are no longer truncated in sinfo.
 -- Print more descriptive error message when tasks exit with nonzero status.
 -- Fixed bug in srun where unsuccessful launch attempts weren't detected.
 -- Elan network error resolver thread now runs from elan module in slurmd.
 -- Slurmctld uses consecutive Elan context and program description numbers
    instead of choosing them randomly.

* Changes in SLURM 0.2.4
==========================
 -- Fix for file descriptor leak in slurmctld.
 -- auth_munge plugin now prints credential info on decode failure.
 -- Minor changes to scancel interface.
 -- Filename format option "%J" now works again for srun --output and --error.
 
* Changes in SLURM 0.2.3
==========================
 -- Fix bug in srun when using per-task files for stderr.
 -- Better error reporting on failure to open per-task input/output files.
 -- Update auth_munge plugin for munge 0.1.
 -- Minor changes to squeue interface.
 -- New srun option `--hold' to submit job in "held" state.

* Changes in SLURM 0.2.2
==========================
 -- Fixes for reported problems:
   - Execution of script allocate mode fails in some cases. (gnats:161)
   - Errors using per-task input files with Elan support. (gnats:162)
   - srun doesn't handle all environment variables properly. (gnats:164)
 -- Parallel job is now terminated if a task is killed by a signal.
 -- Exit status of srun is set based on exit codes of tasks.
 -- Redesign of sinfo interface and options.
 -- Shutdown of slurmctld no longer propagates shutdown to all nodes.

* Changes in SLURM 0.2.1
===========================
 -- Fix bug where reconfigure request to slurmctld killed the daemon.

* Changes in SLURM 0.2.0
============================

 -- SlurmdTimeout of 0 means never set a non-responding node to DOWN.
 -- New srun option, -u,--unbuffered, for unbuffered stdout.
 -- Enhancements for sinfo.
   - Non-responding nodes show "*" character appended instead of "NoResp+".
   - Node states show abbreviated variant by default.
 -- Enhancements for scontrol.
   - Added "ping" command to show current state of SLURM controllers.
   - Job dump in scontrol shows user name as well as UID. 
   - Node state of DRAIN is appropriately mapped to DRAINING or DRAINED.
 -- Fix for bug where request for task count greater than partition limit
    was queued anyway.
 -- Fix for bugs in job end time handling.
 -- Modifications for error free builds on 64 bit architectures.
 -- Job cancel immediately deallocates nodes instead of waiting on srun.
 -- Attempt to create slurmd spool if it does not exist.
 -- Fixed signal handling bug in srun allocate mode.
 -- Earlier error detection in slurmd startup.
 -- "fatal: _shm_unlock: Numerical result out of range" bug fixed in slurmd.
 -- Config file parsing is now case insensitive.
 -- SLURM_NODELIST environment variable now set in allocate mode.
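
As a rough illustration of case-insensitive config parsing, the sketch
below normalizes keys to lower case while reading "Key=Value" lines. It
is a minimal example and not SLURM's parser. The function name and the
"#" comment handling are assumptions, and the value shown for
SlurmdTimeout is only a sample.

```python
def parse_config(text):
    """Parse "Key=Value" lines into a dict with lower-cased keys.

    Hypothetical helper: text after "#" is treated as a comment, and
    lines without "=" are skipped.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Lower-casing the key makes lookups independent of spelling.
        settings[key.strip().lower()] = value.strip()
    return settings


sample = "SLURMDTIMEOUT=0   # sample value only\n"
config = parse_config(sample)
timeout = config["slurmdtimeout"]
```

With keys normalized this way, "SlurmdTimeout", "slurmdtimeout", and
"SLURMDTIMEOUT" all refer to the same setting.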

* Changes in SLURM 0.2.0-pre2
=============================

 -- Fix for reconfigure when public/private key path is changed.
 -- Shared memory fixes in slurmd. 
   - fix for infinite semaphore incrementation bug.
 -- Semaphore fixes in slurmctld.
 -- Slurmctld now remembers which nodes have registered after recover.
 -- Fixed reattach bug when tasks have exited.
 -- Change directory to /tmp in slurmd if daemonizing.
 -- Logfiles are reopened on reconfigure.

$Id$