    --no-allocate option is used.
 -- Re-enabled "chkconfig --add" in default RPMs.
 -- Backup controller now sets proper PID in slurmctld.pid file.
 -- Backup controller restores QSW state each time it assumes control.
 -- Backup controller purges old job records before assuming control
    to avoid resurrecting defunct jobs.
 -- Kill jobs on non-responding DRAINING nodes and make their state
    DRAINED.
 -- Save state upon completion of a job's last EPILOG_COMPLETION to 
    reduce possibility of inconsistent job and node records when the 
    controller is transitioning between primary and backup. 
 -- Change logging level of detailed communication errors to not print 
    them unless detailed debugging is requested.
 -- Increase number of concurrent controller server threads from 20 
    to 50 and restructure code to handle backlogs more efficiently.
 -- Partition state at controller startup is based upon slurm.conf 
    rather than previously saved state. Additional improvements to 
    avoid inconsistent job/node/partition states at restart. Job state 
    information is used to arbitrate conflicts.
 -- Orphaned file descriptors eliminated.

* Changes in SLURM 0.2.16
=========================
 -- Fixes for reported problems:
   - slurm/265: Early termination of srun could cause job to remain in queue.
   - slurm/268: Slurmctld could deadlock if there was a delay in the 
     termination of a large node-count job. An EPILOG_COMPLETE RPC was 
     added so that slurmd could notify slurmctld whenever the job 
     termination was completed.
   - slurm/270: Segfault in sinfo if a configured node lacked a partition.
   - slurm/278: Exit code in scontrol did not indicate failure.
 -- Fixed bug in slurmd that caused the daemon to occasionally kill itself.
 -- Fixed bug in srun when running with --no-allocate and >1 process per node.
 -- Small fixes and updates for srun manual.

* Changes in SLURM 0.2.15
=========================
 -- Fixes for reported problems:
   - slurm/265: Job was orphaned when allocation response message could 
     not be sent. Job is now killed on allocation response message transmit 
     failure and socket error details are logged.
   - Fix for slurm/267: "Job epilog may run multiple times."
 -- Squeue job TIMELIMIT format changed from "h:mm" to "d:h:mm:ss".
 -- DPCS initiated jobs have steps execute properly without explicit 
    specification of node count.
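
As a rough illustration of the squeue TIMELIMIT format change listed above, the Python sketch below renders one duration in both the old "h:mm" and the new "d:h:mm:ss" styles. The function names and the exact field widths (no zero padding on days and hours) are assumptions made for this example and are not taken from the squeue source.

```python
def format_timelimit(total_seconds: int) -> str:
    """Render a duration as d:h:mm:ss (field widths are an assumption)."""
    if total_seconds < 0:
        raise ValueError("duration must be non-negative")
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return f"{days}:{hours}:{minutes:02d}:{seconds:02d}"


def format_timelimit_old(total_seconds: int) -> str:
    """Render the same duration in the older h:mm style, dropping seconds."""
    minutes = total_seconds // 60
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}"
```

With the old style a limit longer than a day shows up as a large hour count, while the new style splits out days and keeps the seconds.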

* Changes in SLURM 0.2.14
=========================
 -- Fixes for reported problems:
   - slurm/194: "srun doesn't handle most options when run under an allocation."
   - slurm/244: "REQ: squeue shows requested size of pending jobs."
 -- SLURM_NODELIST environment variable now exported to all jobs, not
    only batch jobs.
 -- Nodelist displayed in squeue for completing jobs is now restricted to 
    completing nodes.
 -- Node "reason" field properly displayed in sinfo even with filtering. 
 -- ``slurm_tv_clean'' daemon now supports a log file.
 -- Batch jobs are now re-queued on launch failure.
 -- Controller confirms job scripts for batch jobs are still running on 
    node zero at node registration.
 -- Default RPMs no longer stop/start SLURM daemons on upgrade or install.

* Changes in SLURM 0.2.13
=========================
 -- Fixes for reported problems:
   - Fixed bug in slurmctld where "drained" nodes would go back into
     the "idle" state under some conditions (slurm/228).
   - Added possible fix for slurm/229: "slurmd occasionally fails
     to reap all children."
 -- Fixed memory leak in auth_munge plugin.
 -- Added fix to slurmctld to allow arbitrarily large job specifications
    to be saved and recovered in the state file.
 -- Allow "updates" in the configuration file of previously defined
    node state and reason. 
 -- On "forceful termination" of a running job step, srun now exits
    unconditionally, instead of waiting for all I/O.
 -- Slurmctld now uses pidfile to kill old daemon when a new one is started.
 -- Addition of new daemon "slurm_tv_clean" used to clean up jobs orphaned
    due to use of the TotalView parallel debugger.

* Changes in SLURM 0.2.12
=========================
 -- Fixes for reported problems:
   - Fix for "waitpid: No child processes" when using TotalView (slurm/217).
   - Implemented temporary workaround for slurm/223: "Munge decode failed: 
     Munged communication error." 
   - Temporary fix for slurm/222: "elan3_create(0): Invalid argument."
 -- Fixed memory leaks in slurmctld (mostly due to reconfigure).
 -- More squeue/sinfo interface changes (see squeue(1), sinfo(1)).
 -- Sinfo now accepts list of node states to -t,--state option.
 -- Node "reason" field now available via sinfo command (see sinfo(1)).
 -- Wrapper source for srun (srun.wrapper.c) now installed and available
    for TotalView support.
 -- Improved retry logic in user commands for periods when slurmctld
    primary is down and backup has not yet taken over.

* Changes in SLURM 0.2.11
=========================
 -- Changes in srun:
   - Fixed bug in signal handling that occasionally resulted in orphaned 
     jobs when using Ctrl-C.
   - Return non-zero exit code when remote tasks are killed by a signal.
   - SIGALRM is now blocked by default.
 -- Added ``reason'' string for down, drained, or draining nodes. 
 -- Added -V,--version option to squeue and sinfo.
 -- Improved some error messages from user utilities.

* Changes in SLURM 0.2.10
=========================
 -- New slurm.conf configuration parameters:
   - WaitTime:    Default for srun -w,--wait parameter.
   - MaxJobCount: Maximum number of jobs SLURM can handle at one time.
   - MinJobAge:   Minimum time since completing before job is purged from 
                  slurmctld memory.
 -- Block user defined signals USR1 and USR2 in slurmd session manager.
 -- More squeue cleanup.
 -- Support for passing options to sinfo via environment variables.
 -- Added option to scontrol to find intersection of completing jobs and nodes.
 -- Added fix in auth_munge to prevent "Munged communication error" message.
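
The three slurm.conf parameters introduced in this release could be read with something like the Python sketch below. It is only an illustration: the parser handles one Key=Value pair per line, the lower-casing of keys follows the case-insensitive parsing noted under 0.2.0, and the numeric values in SAMPLE are hypothetical, not recommended or default settings.

```python
def parse_conf(text: str) -> dict:
    """Parse simple Key=Value lines, ignoring '#' comments and blank lines."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comment
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()  # keys compared case-insensitively
    return settings


# Hypothetical values, chosen only for illustration.
SAMPLE = """
WaitTime=30        # default for srun -w,--wait
MaxJobCount=2000   # most jobs handled at one time
MinJobAge=300      # time a completed job is kept before purge
"""

conf = parse_conf(SAMPLE)
```

Real slurm.conf files may carry several Key=Value pairs on one line (node and partition definitions, for example), which this sketch does not attempt to handle.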

* Changes in SLURM 0.2.9
========================
 -- Fixes for reported problems:
   - Argument to srun `-n' option was taken as octal if preceded with a `0'.
 -- New format for Elan hosts config file (/etc/elanhosts; see README).
 -- Various fixes for managing COMPLETING jobs.
 -- Support for passing options to squeue via environment variables 
    (see squeue(1)).
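
The octal problem with the srun `-n' argument is typical of number parsing that auto-detects the base, as C's strtol does when called with base 0. Whether that was the actual cause in srun is not stated here, so the Python sketch below is only an assumption-laden illustration: parse_c_style mimics the leading-zero behavior and parse_task_count shows a fixed base-10 parse. Both function names are hypothetical.

```python
def parse_task_count(text: str) -> int:
    """Parse a task count as plain decimal, so "010" means ten, not eight."""
    value = int(text, 10)  # explicit base 10: a leading zero is not special
    if value <= 0:
        raise ValueError(f"task count must be positive, got {text!r}")
    return value


def parse_c_style(text: str) -> int:
    """Mimic C-style base detection, where a leading "0" selects octal."""
    if len(text) > 1 and text.startswith("0") and text.isdigit():
        return int(text, 8)
    return int(text, 10)
```

Under the C-style rule a request for "010" tasks silently becomes 8, while the fixed-base parse yields the 10 the user asked for.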

* Changes in SLURM 0.2.8
=========================
 -- Fix for bug in slurmd that could make debug messages appear in job output.
 -- Fix for bug in slurmctld retry count computation.
 -- Srun now times out slow launch threads.
 -- "Time Used" output in squeue now includes seconds.

* Changes in SLURM 0.2.7
=========================
 -- Fix for bug in Elan module that results in slurmd hang.
 -- Added completing job state to default list of states to print with squeue.

* Changes in SLURM 0.2.6
=========================
 -- More fixes for handling cleanup of slow terminating jobs.
 -- Fixed bug in srun that might leave nodes allocated after a Ctrl-C.

* Changes in SLURM 0.2.5
=========================
 -- Various fixes for cleanup of slow terminating or unkillable jobs.
 -- Fixed some small memory leaks in communications code.
 -- Added hack for synchronized exit of jobs on large node count.
 -- Long lists of nodes are no longer truncated in sinfo.
 -- Print more descriptive error message when tasks exit with nonzero status.
 -- Fixed bug in srun where unsuccessful launch attempts weren't detected.
 -- Elan network error resolver thread now runs from elan module in slurmd.
 -- Slurmctld uses consecutive Elan context and program description numbers
    instead of choosing them randomly.

* Changes in SLURM 0.2.4
==========================
 -- Fix for file descriptor leak in slurmctld.
 -- auth_munge plugin now prints credential info on decode failure.
 -- Minor changes to scancel interface.
 -- Filename format option "%J" now works again for srun --output and --error.
 
* Changes in SLURM 0.2.3
==========================
 -- Fix bug in srun when using per-task files for stderr.
 -- Better error reporting on failure to open per-task input/output files.
 -- Update auth_munge plugin for munge 0.1.
 -- Minor changes to squeue interface.
 -- New srun option `--hold' to submit job in "held" state.

* Changes in SLURM 0.2.2
==========================
 -- Fixes for reported problems:
   - Execution of script allocate mode fails in some cases. (gnats:161)
   - Errors using per-task input files with Elan support. (gnats:162)
   - srun doesn't handle all environment variables properly. (gnats:164)
 -- Parallel job is now terminated if a task is killed by a signal.
 -- Exit status of srun is set based on exit codes of tasks.
 -- Redesign of sinfo interface and options.
 -- Shutdown of slurmctld no longer propagates shutdown to all nodes.

* Changes in SLURM 0.2.1
===========================
 -- Fix bug where reconfigure request to slurmctld killed the daemon.

* Changes in SLURM 0.2.0
============================

 -- SlurmdTimeout of 0 means never set a non-responding node to DOWN.
 -- New srun option, -u,--unbuffered, for unbuffered stdout.
 -- Enhancements for sinfo.
   - Non-responding nodes show "*" character appended instead of "NoResp+".
   - Node states show abbreviated variant by default.
 -- Enhancements for scontrol.
   - Added "ping" command to show current state of SLURM controllers.
   - Job dump in scontrol shows user name as well as UID. 
   - Node state of DRAIN is appropriately mapped to DRAINING or DRAINED.
 -- Fix for bug where request for task count greater than partition limit
    was queued anyway.
 -- Fix for bugs in job end time handling.
 -- Modifications for error free builds on 64 bit architectures.
 -- Job cancel immediately deallocates nodes instead of waiting on srun.
 -- Attempt to create slurmd spool if it does not exist.
 -- Fixed signal handling bug in srun allocate mode.
 -- Earlier error detection in slurmd startup.
 -- "fatal: _shm_unlock: Numerical result out of range" bug fixed in slurmd.
 -- Config file parsing is now case insensitive.
 -- SLURM_NODELIST environment variable now set in allocate mode.

* Changes in SLURM 0.2.0-pre2
=============================
 -- Fix for reconfigure when public/private key path is changed.
 -- Shared memory fixes in slurmd. 
   - Fix for infinite semaphore incrementation bug.
 -- Semaphore fixes in slurmctld.
 -- Slurmctld now remembers which nodes have registered after recover.
 -- Fixed reattach bug when tasks have exited.
 -- Change directory to /tmp in slurmd if daemonizing.
 -- Logfiles are reopened on reconfigure.