-- Fix bug in sinfo to properly process specified state filter including
"*" suffix for non-responding nodes.
-- Create StateSaveLocation directory if changed via slurmctld reconfig
-- Fixes for reported problems:
- slurm/381: Hold jobs requesting more resources than partition limit.
- slurm/387: Jobs lost and nodes DOWN on slurmctld restart.
-- Add support for getting node's real memory size on AIX.
-- Sinfo sorts partitions in slurm.conf order; new sort option ("#P").
-- Document how to gracefully change plugin values.
-- Slurmctld does not attempt to recover jobs when the switch plugin
value changes (decision reached when any job's switch state recovery
fails).
-- Node does not transition from COMPLETING to DOWN state due to
not responding. Wait for tasks to complete or admin to set DOWN.
-- Always chmod SlurmdSpoolDir to 755 (a umask of 007 was resulting
-- Return errors when trying to change configuration parameters
AuthType, SchedulerType, and SwitchType via "scontrol reconfig"
or SIGHUP. Document how to safely change these parameters.
-- Plugin-specific error number definitions and descriptive strings
moved from common into plugin modules.
-- Documentation for writing scheduler, switch, and job completion
logging plugins added.
-- Added job and node state descriptions to the squeue and sinfo man pages.
-- Backup slurmctld to generate core file on SIGABRT.
-- Backup slurmctld to re-read slurm.conf on SIGHUP.
-- Added -q,--quit-on-interrupt option to srun.
-- Elan switch plugin now starts neterr resolver thread on all Elan3
systems (QsNet and QsNetII).
-- Added some missing read locks for references to slurmctld's
configuration data structure
-- Modify processing of queued slurmctld message traffic to get better
throughput (resulted in job inactivity limit being reached improperly
when hundreds of jobs running simultaneously)
===============================
-- Fixes for reported problems:
- slurm/372: job state descriptions added to squeue man page
-- Switch plugin added. Add "SwitchType=switch/elan" to slurm.conf for
systems with Quadrics Elan3 or Elan4 switches.
-- Don't treat DOWN nodes with too few CPUs as a fatal error on Elan
-- Major re-write of html documents
-- Updates to node pinging for large numbers of unresponsive nodes
-- Explicitly set default action for SIGTERM (action on Thunder was
to ignore SIGTERM)
-- Sinfo "--exact" option only applies to fields actually displayed
-- Partition processor count not correctly computed for heterogeneous
clusters with FastSchedule=0 configuration
-- Only return DOWN nodes to service if the reason for them being in
that state is non-responsiveness and "ReturnToService=1" configuration
-- Partition processor count now correctly computed for heterogeneous
clusters with FastSchedule configured off
-- New macros and function to export SLURM version number
* Changes in SLURM 0.3.0.0-pre5
===============================
-- Fixes for reported problems:
- slurm/346: Support multiple colon-separated PluginDir values
-- Fix node state transition: DOWN to DRAINED (instead of DRAINING)
-- Fix a couple of minor slurmctld memory leaks
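
The slurm/346 entry above (multiple colon-separated PluginDir values) can be pictured with a small sketch. This is not SLURM's implementation, which is written in C. The function names, the first-match search order, and the handling of empty components are all assumptions made for illustration.

```python
import os


def split_plugin_dirs(plugin_dir):
    """Split a colon-separated PluginDir value into a list of directories.

    Assumption: empty components (leading, trailing, or doubled colons)
    are simply dropped.
    """
    return [d for d in plugin_dir.split(":") if d]


def find_plugin(plugin_dir, filename):
    """Return the first existing path for filename, or None.

    Assumption: directories are searched in the order they are listed.
    """
    for directory in split_plugin_dirs(plugin_dir):
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

For example, split_plugin_dirs("/usr/lib/slurm:/opt/slurm/lib") yields the two directories in order, and find_plugin returns None when no listed directory contains the requested file.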
* Changes in SLURM 0.3.0.0-pre4
===============================
-- Fix bug where early launch failures (such as invalid UID/GID) resulted
in jobs not terminating properly.
-- Initial support for BNR committed (not yet functional).
-- QsNet: SLURM now uses /etc/elanhosts exclusively for converting
hostnames to ElanIDs.
* Changes in SLURM 0.3.0.0-pre3
===============================
-- Fixes for reported problems:
- slurm/328: Slurmd was restarting with a new shared memory segment and
losing track of jobs
- slurm/329: Job processing may be left running when one task dies
- slurm/333: Slurmd fails to launch a job and deletes a step, due to
a race condition in shared memory management
- slurm/334: Slurmd was getting a segv due to a race condition in shared
memory management
- slurm/342: Properly handle nodes being removed from configuration
even when there are partitions, nodes, or job steps still associated
with them
-- Srun properly terminates jobs/steps upon node failure (used to hang
waiting for I/O completion)
-- Job time limits enforced even if InactiveLimit configured as zero
-- Support the sending of an arbitrary signal to a batch script (but not
the processes in its job steps)
-- Re-read slurm configuration file whenever changed, needed by users
of SLURM APIs
-- Scancel was generating an assert failure
-- Slurmctld sends a launch response message upon scheduling of a queued
job (for immediate srun response)
-- Maui scheduler plugin added
-- Backfill scheduler plugin added
-- Batch scripts can now have arguments that are propagated
-- MPICH support added (via patch, not in SLURM CVS)
-- New SLURM environment variables added: SLURM_CPUS_ON_NODE and
SLURM_LAUNCH_NODE_IPADDR. These provide support needed for LAM/MPI
(version 7.0.4+)
-- The TMPDIR directory is created as needed before job launch
-- Do not create duplicate SLURM environment variables with the same name
-- Ensure proper enforcement of node sharing by job
-- Treat lack of SpoolDir or StateSaveDir as a fatal error
-- Quickstart.html guide expanded
-- Increase maximum job steps per node from 16 to 64
-- Delete correct shared memory segment on slurmd -c (clean start)
* Changes in SLURM 0.3.0.0-pre2
===============================
-- Fixes for reported problems:
- slurm/326: Properly clean-up jobs terminating on non-responding nodes
-- Move all configuration data structure into common/read_config, scontrol
now always shows default values if not specified in slurm.conf file
-- Remove the unused "Prioritize" configuration parameter
* Changes in SLURM 0.3.0.0-pre1
===============================
- slurm/252: "jobs left orphaned when using TotalView:" SLURM controller
now pings srun and kills defunct jobs.
- slurm/253: "srun fails to accept new IO connection."
- slurm/317: "Lack of default partition in config file causes errors."
- slurm/319: Socket errors on multiple simultaneous job launches fixed
- slurm/321: slurmd shared memory synchronization error.
-- Removed slurm_tv_clean daemon which has been obsoleted by slurm/252 fix.
-- New scontrol command ``delete'' and RPC added to delete a partition
-- Squeue can now print and sort by group id/name
-- Scancel has new option -q,--quiet to not report an error if a job
-- Add the excluded node list to job information reported.
-- RPC version mis-match now properly handled
-- New job completion plugin interface added for logging completed jobs.
-- Fixed lost digit in scontrol job priority specification.
-- Remove restriction in the number of consecutive node sets (no longer
needed after DPCS upgrade)
-- Incomplete state save write now properly handled.
-- Modified slurmd setrlimit error for greater clarity.
-- Slurmctld performs load-leveling across shared nodes.
-- New user function added slurm_get_end_time for user jobs.
-- Always compile srun with stabs debug section when TotalView support
is requested.
* Changes in SLURM 0.2.21
=========================
-- Fixes for reported problems:
- slurm/253: Try using different port if connect() fails (was rarely
failing when an existing defunct connection was in TIME_WAIT state)
- slurm/300: Possibly killing wrong job on slurmd restart
- slurm/312: Freeing non-allocated memory and killing slurmd
-- Assorted changes to support RedHat Enterprise Linux 3.0 and IA64
-- Initial Elan4 and libelanctrl support (--with-elan).
-- Slurmctld was sometimes inappropriately setting a job's priority
to 1 when a node was down (even if up nodes could be used for the
job when a running job completes)
-- Convert all user commands from use of popt library to getopt_long()
-- If TotalView support is requested, srun exports "totalview_jobid"
variable for `%J' expansion in TV bulk launch string.
-- Fix several locking bugs in slurmd IO layer.
-- Throttle back repetitious error messages in slurmd to avoid filling
* Changes in SLURM 0.2.20
=========================
-- Fixes for reported problems:
- slurm/298: Elan initialization error (Invalid vp 2147483674).
- slurm/299: srun fails to exit with multiple ^C's.
-- Temporarily prevent DPCS from allocating jobs with more than eight
sets of consecutive nodes. This was likely causing user applications
to fail with libelan errors. This will be removed after DPCS is updated.
-- Fix bug in popt use, was failing in some versions of Linux.
-- Resend KILL_JOB messages as needed to clear COMPLETING jobs.
-- Install dummy SIGCHLD handler in slurmd to fix problem on NPTL systems
where slurmd was not notified of terminated tasks.
* Changes in SLURM 0.2.19
=========================
-- Memory corruption bug fixed, it was causing slurmctld to seg-fault
* Changes in SLURM 0.2.18
=========================
-- Fixes for reported problems:
- slurm/287: slurm protocol timeouts when using TotalView.
- slurm/291: srun fails using ``-n 1'' under multi-node allocation.
- slurm/294: srun IO buffer reports ENOSPC.
-- Memory corruption bug fixed, it was causing slurmctld to seg-fault
-- Non-responding nodes now go from DRAINING to DRAINED state when
jobs complete
-- Do not schedule pending jobs while any job is actively COMPLETING
unless the submitted job specifically identifies its nodes (like DPCS)
-- Reset priority of jobs with priority==1 when a non-responding node
starts to respond again
-- Ignore jobs with priority==1 when establishing new baseline upon
slurmctld restart
-- Make slurmctld/message retry be timer based rather than queue based
for better scalability
-- Slurmctld logging is more concise, using hostlists more
-- srun --no-allocate uses special job_id range to avoid conflicts
or premature job termination (purging by slurmctld)
-- New --jobid=id option in srun to initiate job step under an existing
allocation.
-- Support in srun for TotalView bulk launch.
* Changes in SLURM 0.2.17
=========================
-- Fixes for reported problems:
- slurm/279: Hold jobs that can't execute due to DOWN or DRAINED
nodes and release when nodes are returned to service.
- slurm/285: "srun killed due to SIGPIPE"
-- Support for running job steps on nodes relative to current
allocation via srun -r, --relative=n option.
-- SIGKILL no longer broadcasted to job via srun on task failure unless
--no-allocate option is used.
-- Re-enabled "chkconfig --add" in default RPMs.
-- Backup controller setting proper PID into slurmctld.pid file.
-- Backup controller restores QSW state each time it assumes control
-- Backup controller purges old job records before assuming control
to avoid resurrecting defunct jobs.
-- Kill jobs on non-responding DRAINING nodes and make their state
DRAINED.
-- Save state upon completion of a job's last EPILOG_COMPLETION to
reduce possibility of inconsistent job and node records when the
controller is transitioning between primary and backup.
-- Change logging level of detailed communication errors to not print
them unless detailed debugging is requested.
-- Increase number of concurrent controller server threads from 20
to 50 and restructure code to handle backlogs more efficiently.
-- Partition state at controller startup is based upon slurm.conf
rather than previously saved state. Additional improvements to
avoid inconsistent job/node/partition states at restart. Job state
information is used to arbitrate conflicts.
-- Orphaned file descriptors eliminated.
* Changes in SLURM 0.2.16
=========================
-- Fixes for reported problems:
- slurm/265: Early termination of srun could cause job to remain in queue.
- slurm/268: Slurmctld could deadlock if there was a delay in the
termination of a large node-count job. An EPILOG_COMPLETE RPC was
added so that slurmd could notify slurmctld whenever the job
termination was completed.
- slurm/270: Segfault in sinfo if a configured node lacked a partition.
- slurm/278: Exit code in scontrol did not indicate failure.
-- Fixed bug in slurmd that caused the daemon to occasionally kill itself.
-- Fixed bug in srun when running with --no-allocate and >1 process per node.
-- Small fixes and updates for srun manual.
* Changes in SLURM 0.2.15
=========================
-- Fixes for reported problems:
- slurm/265: Job was orphaned when allocation response message could
not be sent. Job is now killed on allocation response message transmit
failure and socket error details are logged.
- Fix for slurm/267: "Job epilog may run multiple times."
-- Squeue job TIMELIMIT format changed from "h:mm" to "d:h:mm:ss".
-- DPCS initiated jobs have steps execute properly without explicit
specification of node count.
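
As a rough illustration of the new squeue TIMELIMIT layout ("d:h:mm:ss") noted above, the sketch below formats a duration in that shape. The function name is hypothetical, the input unit (seconds) is an assumption, and the entry does not say whether squeue pads or omits leading fields, so the padding here is a guess.

```python
def format_time_limit(total_seconds):
    """Format a duration as d:h:mm:ss.

    Assumptions: the input is a non-negative number of seconds, days and
    hours are unpadded, minutes and seconds are zero-padded to two digits.
    """
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}:{hours}:{minutes:02d}:{seconds:02d}"
```

Under these assumptions, format_time_limit(93784) returns "1:2:03:04" (one day, two hours, three minutes, four seconds).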
* Changes in SLURM 0.2.14
=========================
-- Fixes for reported problems:
- slurm/194: "srun doesn't handle most options when run under an allocation."
- slurm/244: "REQ: squeue shows requested size of pending jobs."
-- SLURM_NODELIST environment variable now exported to all jobs, not
only batch jobs.
-- Nodelist displayed in squeue for completing jobs is now restricted to
completing nodes.
-- Node "reason" field properly displayed in sinfo even with filtering.
-- ``slurm_tv_clean'' daemon now supports a log file.
-- Batch jobs are now re-queued on launch failure.
-- Controller confirms job scripts for batch jobs are still running on
node zero at node registration.
-- Default RPMs no longer stop/start SLURM daemons on upgrade or install.
* Changes in SLURM 0.2.13
=========================
-- Fixes for reported problems:
- Fixed bug in slurmctld where "drained" nodes would go back into
the "idle" state under some conditions (slurm/228).
- Added possible fix for slurm/229: "slurmd occasionally fails
to reap all children."
-- Fixed memory leak in auth_munge plugin.
-- Added fix to slurmctld to allow arbitrarily large job specifications
to be saved and recovered in the state file.
-- Allow "updates" in the configuration file of previously defined
node state and reason.
-- On "forceful termination" of a running job step, srun now exits
unconditionally, instead of waiting for all I/O.
-- Slurmctld now uses pidfile to kill old daemon when a new one is started.
-- Addition of new daemon "slurm_tv_clean" used to clean up jobs orphaned
due to use of the TotalView parallel debugger.
* Changes in SLURM 0.2.12
=========================
-- Fixes for reported problems:
- Fix for "waitpid: No child processes" when using TotalView (slurm/217).
- Implemented temporary workaround for slurm/223: "Munge decode failed:
Munged communication error."
- Temporary fix for slurm/222: "elan3_create(0): Invalid argument."
-- Fixed memory leaks in slurmctld (mostly due to reconfigure).
-- More squeue/sinfo interface changes (see squeue(1), sinfo(1)).
-- Sinfo now accepts list of node states to -t,--state option.
-- Node "reason" field now available via sinfo command (see sinfo(1)).
-- Wrapper source for srun (srun.wrapper.c) now installed and available
for TotalView support.
-- Improved retry logic in user commands for periods when slurmctld
primary is down and backup has not yet taken over.
* Changes in SLURM 0.2.11
=========================
-- Changes in srun:
- Fixed bug in signal handling that occasionally resulted in orphaned
jobs when using Ctrl-C.
- Return non-zero exit code when remote tasks are killed by a signal.
- SIGALRM is now blocked by default.
-- Added ``reason'' string for down, drained, or draining nodes.
-- Added -V,--version option to squeue and sinfo.
-- Improved some error messages from user utilities.
* Changes in SLURM 0.2.10
=========================
-- New slurm.conf configuration parameters:
- WaitTime: Default for srun -w,--wait parameter.
- MaxJobCount: Maximum number of jobs SLURM can handle at one time.
- MinJobAge: Minimum time since completing before job is purged from
slurmctld memory.
-- Block user defined signals USR1 and USR2 in slurmd session manager.
-- More squeue cleanup.
-- Support for passing options to sinfo via environment variables.
-- Added option to scontrol to find intersection of completing jobs and nodes.
-- Added fix in auth_munge to prevent "Munged communication error" message.
* Changes in SLURM 0.2.9
========================
-- Fixes for reported problems:
- Argument to srun `-n' option was taken as octal if preceded with a `0'.
-- New format for Elan hosts config file (/etc/elanhosts. See README)
-- Various fixes for managing COMPLETING jobs.
-- Support for passing options to squeue via environment variables
(see squeue(1))
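
The srun `-n' octal problem listed above is the kind of bug that typically appears when a numeric string is converted with automatic base detection. The entry does not say how srun parsed the value, so the sketch below only assumes that cause. It mimics C-style base detection in Python and contrasts it with a fixed base-10 conversion. Both function names are hypothetical.

```python
def parse_count_auto_base(text):
    """Mimic C-style automatic base detection (as with strtol and base 0).

    A leading "0x" selects hexadecimal and a leading "0" selects octal.
    """
    s = text.strip().lower()
    if s.startswith("0x"):
        return int(s, 16)
    if len(s) > 1 and s.startswith("0"):
        return int(s, 8)
    return int(s, 10)


def parse_count_decimal(text):
    """Always interpret the argument as decimal, ignoring leading zeros."""
    return int(text.strip(), 10)
```

With automatic detection, "010" is read as 8 tasks. With the fixed decimal conversion it is read as 10, which is what a user typing `-n 010' most likely intended.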
* Changes in SLURM 0.2.8
=========================
-- Fix for bug in slurmd that could make debug messages appear in job output.
-- Fix for bug in slurmctld retry count computation.
-- Srun now times out slow launch threads.
-- "Time Used" output in squeue now includes seconds.
* Changes in SLURM 0.2.7
========================
-- Fix for bug in Elan module that results in slurmd hang.
-- Added completing job state to default list of states to print with squeue.
* Changes in SLURM 0.2.6
=========================
-- More fixes for handling cleanup of slow terminating jobs.
-- Fixed bug in srun that might leave nodes allocated after a Ctrl-C.
* Changes in SLURM 0.2.5
=========================
-- Various fixes for cleanup of slow terminating or unkillable jobs.
-- Fixed some small memory leaks in communications code.
-- Added hack for synchronized exit of jobs on large node count.
-- Long lists of nodes are no longer truncated in sinfo.
-- Print more descriptive error message when tasks exit with nonzero status.
-- Fixed bug in srun where unsuccessful launch attempts weren't detected.
-- Elan network error resolver thread now runs from elan module in slurmd.
-- Slurmctld uses consecutive Elan context and program description numbers
instead of choosing them randomly.
* Changes in SLURM 0.2.4
==========================
-- Fix for file descriptor leak in slurmctld.
-- auth_munge plugin now prints credential info on decode failure.
-- Minor changes to scancel interface.
-- Filename format option "%J" now works again for srun --output and --error.
* Changes in SLURM 0.2.3
==========================
-- Fix bug in srun when using per-task files for stderr.
-- Better error reporting on failure to open per-task input/output files.
-- Update auth_munge plugin for munge 0.1.
-- Minor changes to squeue interface.
-- New srun option `--hold' to submit job in "held" state.
* Changes in SLURM 0.2.2
==========================
-- Fixes for reported problems:
- Execution of script allocate mode fails in some cases. (gnats:161)
- Errors using per-task input files with Elan support. (gnats:162)
- srun doesn't handle all environment variables properly. (gnats:164)
-- Parallel job is now terminated if a task is killed by a signal.
-- Exit status of srun is set based on exit codes of tasks.
-- Redesign of sinfo interface and options.
-- Shutdown of slurmctld no longer propagates shutdown to all nodes.
* Changes in SLURM 0.2.1
===========================
-- Fix bug where reconfigure request to slurmctld killed the daemon.
* Changes in SLURM 0.2.0
============================
-- SlurmdTimeout of 0 means never set a non-responding node to DOWN.
-- New srun option, -u,--unbuffered, for unbuffered stdout.
-- Enhancements for sinfo
- Non-responding nodes show "*" character appended instead of "NoResp+".
- Node states show abbreviated variant by default
-- Enhancements for scontrol.
- Added "ping" command to show current state of SLURM controllers.
- Job dump in scontrol shows user name as well as UID.
- Node state of DRAIN is appropriately mapped to DRAINING or DRAINED.
-- Fix for bug where request for task count greater than partition limit
was queued anyway.
-- Fix for bugs in job end time handling.
-- Modifications for error free builds on 64 bit architectures.
-- Job cancel immediately deallocates nodes instead of waiting on srun.
-- Attempt to create slurmd spool if it does not exist.
-- Fixed signal handling bug in srun allocate mode.
-- Earlier error detection in slurmd startup.
-- "fatal: _shm_unlock: Numerical result out of range" bug fixed in slurmd.
-- Config file parsing is now case insensitive.
-- SLURM_NODELIST environment variable now set in allocate mode.
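
The sinfo change above, where non-responding nodes show a "*" appended to the state, can be sketched as a pair of helpers: one that renders a state and one that reads a state filter carrying the same suffix. These are hypothetical functions for illustration, not sinfo's code. Upper-casing the state name is an assumption.

```python
def format_node_state(state, responding):
    """Render a node state, appending "*" when the node is not responding."""
    return state if responding else state + "*"


def parse_state_filter(spec):
    """Split a state filter into (state_name, wants_non_responding).

    Assumption: a trailing "*" means "match only non-responding nodes",
    and state names are compared in upper case.
    """
    spec = spec.strip()
    if spec.endswith("*"):
        return spec[:-1].upper(), True
    return spec.upper(), False
```

For instance, format_node_state("IDLE", False) gives "IDLE*", and parse_state_filter("idle*") gives ("IDLE", True).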
* Changes in SLURM 0.2.0-pre2
=============================
-- Fix for reconfigure when public/private key path is changed.
-- Shared memory fixes in slurmd.
- fix for infinite semaphore incrementation bug.
-- Semaphore fixes in slurmctld.
-- Slurmctld now remembers which nodes have registered after recover.
-- Fixed reattach bug when tasks have exited.
-- Change directory to /tmp in slurmd if daemonizing.
-- Logfiles are reopened on reconfigure.