This file describes changes in recent versions of SLURM. It primarily
documents those changes that are of interest to users and admins.

* Changes in SLURM 0.4.0-pre6
=============================
 -- Add new job reason value "JobHeld" for jobs with priority==0

* Changes in SLURM 0.4.0-pre5
=============================
 -- select/bluegene plugin confirms db.properties file in $sysconfdir
    and copies it to StateSaveLocation (slurmctld's working directory)
 -- select/bluegene plugin confirms environment variables required for
    DB2 interaction are set (execute "db2profile" script before slurmctld)
 -- slurmd to always give jobs KillWait time between SIGTERM and SIGKILL
    at termination
 -- set job's start_time and end_time = now rather than leaving zero if
    they fail to execute
 -- modify srun to forward SIGTERM
 -- enable select/bluegene testing for DOWN nodes and switches
 -- select/bluegene plugin to delete orphan jobs, free BGLblocks and set
    owner as jobs terminate/start

* Changes in SLURM 0.4.0-pre4
=============================
 -- Fixes for reported problems:
    - slurm/512: Let job steps run on DRAINING nodes
    - slurm/513: Gracefully deal with UIDs missing from passwd file
 -- Add support for MPICH-GM (from takao.hatazaki@hp.com)
 -- Add support for NodeHostname in node configuration
 -- Make "scontrol show daemons" function properly on front-end system
    (e.g. Blue Gene)
 -- Fix srun bug when --input, --output and --error are all "none"
 -- Don't schedule jobs for user root if partition is DOWN
 -- Modify select/bluegene to honor job's required node list
 -- Modify user name logic to explicitly set UID=0 to "root"; Suse Linux
    was not handling multiple users with UID=0 well.
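
The two user-name items in 0.4.0-pre4 (UIDs missing from the passwd file,
and UID=0 always reported as "root") can be illustrated with a minimal
sketch. The helper name uid_to_name and the numeric fallback are
assumptions made for this example; this is not SLURM's actual code.

```c
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of SLURM): write a printable user name
 * for uid into buf. UID 0 is always reported as "root", and a UID with
 * no passwd entry falls back to its numeric value (an assumed policy).
 */
void uid_to_name(uid_t uid, char *buf, size_t len)
{
    struct passwd *pw;

    if (len == 0)
        return;

    if (uid == 0) {
        /* Several accounts may share UID 0; do not ask passwd which. */
        snprintf(buf, len, "%s", "root");
        return;
    }

    pw = getpwuid(uid);
    if (pw != NULL && pw->pw_name != NULL)
        snprintf(buf, len, "%s", pw->pw_name);
    else
        /* UID missing from passwd: report the number instead of failing. */
        snprintf(buf, len, "%lu", (unsigned long) uid);
}
```

Short-circuiting UID 0 before calling getpwuid() keeps the result
independent of the order of entries in the passwd database.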

* Changes in SLURM 0.4.0-pre3
=============================
 -- Send SIGTERM to batch script before SIGKILL for mpirun cleanup on
    Blue Gene/L
 -- Create new allocation as needed for debugger in case old allocation
    has been purged
 -- Add Blue Gene User Guide to html documents
 -- Fix srun bug that could cause seg fault with --no-shell option if not
    running under a debugger
 -- Propagate job's task count (if set) for batch job via SLURM_NPROCS.
 -- Add new job parameters for Blue Gene: geometry, rotate, mode (virtual
    or co-processor), communications type (mesh or torus), and partition
    ID.
 -- Exercise a bunch of new switch plugin functions for Federation switch
    support.
 -- Fix bug in scheduling jobs when a processor count is specified and
    FastSchedule=0 and the cluster is heterogeneous.

* Changes in SLURM 0.4.0-pre2
=============================
 -- NOTE: "startclean" when transitioning from version 0.4.0-pre1,
    JOBS ARE LOST
 -- Fixes for reported problems:
    - slurm/477: Signal of batch job script (scancel -b) fixed
    - slurm/481: Permit clearing of AllowGroups field for a partition
    - slurm/482: Adjust Elan base context number to match RMS range
    - slurm/489: Job completion logger was writing NULL to text file
 -- Preserve job's requested processor count info after job is initiated
    (for viewing by squeue and scontrol)
 -- srun cancels created job if job step creation fails
 -- Added a lot of Blue Gene/L support logic: slurmd executes on a single
    node to front-end the 512-CPU base-partitions (Blue Gene/L's nodes)
 -- Add node selection plugin infrastructure, relocate existing logic to
    select/linear, add configuration parameter SelectType
 -- Modify node hashing algorithm for better performance on Blue Gene/L
 -- Add ability to specify node ranges for 3-D rectangular prism

* Changes in SLURM 0.4.0-pre1
=============================
 -- NOTE: "startclean" when transitioning from version 0.3, JOBS ARE LOST
 -- Added support for job account information (arbitrary string)
 -- Added support for job dependencies (start job X after job Y completes)
 -- Added support for configuration parameter CheckpointType
 -- Added new job state "CANCELLED"
 -- Don't strip binaries, breaks parallel debuggers
 -- Fix bug in Munge authentication retry logic
 -- Change srun handling of interrupts to work properly with TotalView
 -- Added "reason" field to job info showing why a job is waiting to run

* Changes in SLURM 0.3.7
========================
 -- Fixes required for TotalView operability under RHEL3.0
    (Reported by Dong Ahn)
    - Do not create detached threads when running under parallel debugger.
    - Handle EINTR from sigwait().

* Changes in SLURM 0.3.6
========================
 -- Fixes for reported problems:
    - slurm/459: Properly support partition's "Shared=force" configuration.
 -- Resync node state to DRAINED or DRAINING on restart in case job and
    node state recovered are out of sync.
 -- Added jobcomp/script plugin (execute script on job completion, from
    Nathan Huff, North Dakota State University).
 -- Added new error code ESLURM_FRAGMENTED for immediate resource
    allocation requests which are refused due to completing job (formerly
    returned ESLURM_NOT_TOP_PRIORITY)
 -- Modified job completion logging plugin calling sequence.
 -- Added much of the infrastructure required for system checkpoint
    (APIs, RPCs, and NULL plugin)

* Changes in SLURM 0.3.5
========================
 -- Fix "SLURM_RLIMIT_* not found in environment" error message when
    distributing large rlimit to jobs.
 -- Add support for slurm_spawn() and associated APIs (needed for IBM
    SP systems).
 -- Fix bug in update of node state to DRAINING/DRAINED when update
    request occurs prior to initial node registration.
 -- Fix bug in purging of batch jobs (active batch jobs were being
    improperly purged starting in version 0.3.0).
 -- When updating a node state to DRAINING/DRAINED a Reason must be
    provided. The user name and a timestamp will automatically be
    appended to that Reason.
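
The last 0.3.5 item says that a user name and a timestamp are appended to
the Reason given when draining a node. A minimal sketch of that kind of
annotation follows. The helper name append_reason_stamp, the
"[user@timestamp]" layout, and the use of UTC are all assumptions made
for illustration; the source does not state the format SLURM uses.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not SLURM's API): build "reason [user@timestamp]"
 * in buf. The bracketed layout and the UTC ISO-8601 style timestamp are
 * assumptions for this sketch. Returns 0 on success, -1 on error or if
 * the result does not fit in buf.
 */
int append_reason_stamp(char *buf, size_t len, const char *reason,
                        const char *user, time_t when)
{
    char stamp[32];
    struct tm *tm_utc;
    int n;

    tm_utc = gmtime(&when);
    if (tm_utc == NULL)
        return -1;
    if (strftime(stamp, sizeof(stamp), "%Y-%m-%dT%H:%M:%S", tm_utc) == 0)
        return -1;

    n = snprintf(buf, len, "%s [%s@%s]", reason, user, stamp);
    if (n < 0 || (size_t) n >= len)
        return -1;      /* output error or truncated result */
    return 0;
}
```

Recording who drained a node and when, next to the free-form reason,
lets an administrator reading the node state later trace the decision
without consulting the controller logs.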

* Changes in SLURM 0.3.4
========================
 -- Fixes for reported problems:
    - slurm/404: Explicitly set pthread stack size to 1MB for srun
 -- Allow srun to respond to ctrl-c and kill queued job while waiting for
    allocation from controller.

* Changes in SLURM 0.3.3
========================
 -- Fix slurmctld handling of heterogeneous processor count on elan
    switch (was setting DRAINED nodes in state DRAINING).
 -- Fix sinfo -R, --list-reasons to list all relevant node states.
 -- Fix slurmctld to honor srun's node configuration specifications
    with FastSchedule==0 configuration.
 -- Added srun option --debugger-test to confirm that slurm's debugger
    infrastructure is operational.
 -- Removed debugging hacks for srun.wrapper.c. Temporarily use RPM's
    debugedit utility if available for similar effect.

* Changes in SLURM 0.3.2
========================
 -- The srun command wakes immediately upon resource allocation (via
    new RPC) rather than polling.
 -- SLURM daemons log current version number at startup.
 -- If slurmd can't respond to ping (e.g. paging is keeping it from
    responding in a timely fashion) then send a registration RPC to
    slurmctld.
 -- Fix slurmd -M option to call mlockall() after daemonizing.
 -- Add "slurm_" prefix to slurm's hostlist_ function man pages.
 -- More AIX support added.
 -- Change get info calls from using show_all to more general show_flags
    with #define for SHOW_ALL flag.

* Changes in SLURM 0.3.1
========================
 -- Set SLURM_TASKS_PER_NODE env var for batch jobs (and LAM/MPI).
 -- Fix for slurmd spinning when stdin buffers full (gnats:434)
 -- Change some slurmctld malloc sizes to reduce demand for realloc calls,
    improves performance and eliminates realloc failure on RH EL3 under
    extremely heavy workload apparently due to memory fragmentation.
 -- Fix scheduling logic for heterogeneous processor count.
 -- Modify security_2_2 test to function with release 0.3
 -- Fix broken rpm build when libslurm not already installed.
 -- New slurmd option -M to mlock() slurmd process into memory.
 -- New srun option --no-shell causes srun to exit instead of spawning
    shell when using --allocate, -A.
 -- Modify srun --uid=user and --gid=group options to maintain invoking
    user's credentials until after nodes have been allocated to requested
    user/group (allows root to run jobs and allocate nodes for other
    users in a RootOnly partition).
 -- Fix node processing if state change requested via scontrol prior
    to initial node registration.

* Changes in SLURM 0.3.0
========================
 -- Support for AIX added (a few bugs do remain).
 -- Fix memory leak in slurmctld, slurm_cred_create().
 -- On ELF systems, export BNR_* functions from SLURM API.
 -- Add support for "hidden" partitions (applies to their nodes, jobs,
    and job steps as well). APIs and commands modified to optionally
    display hidden partitions.
 -- Modify partition's group_allow test to be based upon the user of
    the allocation rather than the user making the allocation request
    (user root for LCRM batch jobs).
 -- Restructure plugin directory structure.
 -- New --core=type option in srun for lightweight corefile support.
    (requires liblwcf).
 -- Let user root and SlurmUser exceed any partition limits.
 -- Srun treats "--time=0" as a request for an infinite time limit.

* Changes in SLURM 0.3.0.0-pre10
================================
 -- Fix bugs in support of slurmctld "-f" option (specify different
    slurm.conf pathname).
 -- Remove slurmd "-f" option.
 -- Several documentation changes for slurm administrators.
 -- On ELF systems, export only slurm_* functions from slurm API and
    ensure plugins use only slurm_ prefixed functions (created aliases
    where necessary).
 -- New srun option -Q, --quiet to suppress informational messages.
 -- Fix bug in slurmctld's building of nodelist for job (failed if
    more than one numeric field in node name).
 -- Change "scontrol completing" and "sinfo" to use job's node bitmap to
    identify nodes associated with that particular job that are still
    processing job completion. This will work properly for shared nodes.
 -- Set SLURM_DISTRIBUTION environment variable for user tasks.
 -- Fix for file descriptor leak in slurmd.
 -- Propagate stacksize limit to jobs along with other resource limits
    that were previously ignored.

* Changes in SLURM 0.3.0.0-pre9
===============================
 -- Restructure how slurmctld state saves are performed for better
    scalability.
 -- New sinfo option "--list-reason" or "-R". Displays down or drained
    nodes along with their REASON field.

* Changes in SLURM 0.3.0.0-pre8
===============================
 -- Queue outgoing message traffic rather than immediately spawning
    pthreads (under heavy load this resulted in hundreds of pthreads
    using more memory than was available).
 -- Restructure slurmctld message agent for higher throughput.
 -- Add new sinfo options --responding and --dead (i.e. non-responding)
    for filtering node states.
 -- Fix bug in sinfo to properly process specified state filter including
    "*" suffix for non-responding nodes.
 -- Create StateSaveLocation directory if changed via slurmctld reconfig

* Changes in SLURM 0.3.0.0-pre7
===============================
 -- Fixes for reported problems:
    - slurm/381: Hold jobs requesting more resources than partition limit.
    - slurm/387: Jobs lost and nodes DOWN on slurmctld restart.
 -- Add support for getting node's real memory size on AIX.
 -- Sinfo sort partitions in slurm.conf order, new sort option ("#P").
 -- Document how to gracefully change plugin values.
 -- Slurmctld does not attempt to recover jobs when the switch plugin
    value changes (decision reached when any job's switch state recovery
    fails).
 -- Node does not transition from COMPLETING to DOWN state due to
    not responding. Wait for tasks to complete or admin to set DOWN.
 -- Always chmod SlurmdSpoolDir to 755 (a umask of 007 was resulting
    in batch jobs failing).
 -- Return errors when trying to change configuration parameters
    AuthType, SchedulerType, and SwitchType via "scontrol reconfig"
    or SIGHUP. Document how to safely change these parameters.
 -- Plugin-specific error number definitions and descriptive strings
    moved from common into plugin modules.
 -- Documentation for writing scheduler, switch, and job completion
    logging plugins added.
 -- Added job and node state descriptions to the squeue and sinfo
    man pages.
 -- Backup slurmctld to generate core file on SIGABRT.
 -- Backup slurmctld to re-read slurm.conf on SIGHUP.
 -- Added -q,--quit-on-interrupt option to srun.
 -- Elan switch plugin now starts neterr resolver thread on all Elan3
    systems (QsNet and QsNetII).
 -- Added some missing read locks for references for slurmctld's
    configuration data structure
 -- Modify processing of queued slurmctld message traffic to get better
    throughput (resulted in job inactivity limit being reached improperly
    when hundreds of jobs running simultaneously)

* Changes in SLURM 0.3.0.0-pre6
===============================
 -- Fixes for reported problems:
    - slurm/372: job state descriptions added to squeue man page
 -- Switch plugin added. Add "SwitchType=switch/elan" to slurm.conf for
    systems with Quadrics Elan3 or Elan4 switches.
 -- Don't treat DOWN nodes with too few CPUs as a fatal error on Elan
 -- Major re-write of html documents
 -- Updates to node pinging for large numbers of unresponsive nodes
 -- Explicitly set default action for SIGTERM (action on Thunder was
    to ignore SIGTERM)
 -- Sinfo "--exact" option only applies to fields actually displayed
 -- Partition processor count not correctly computed for heterogeneous
    clusters with FastSchedule=0 configuration
 -- Only return DOWN nodes to service if the reason for them being in
    that state is non-responsiveness and "ReturnToService=1" configuration
 -- Partition processor count now correctly computed for heterogeneous
    clusters with FastSchedule configured off
 -- New macros and function to export SLURM version number

* Changes in SLURM 0.3.0.0-pre5
===============================
 -- Fixes for reported problems:
    - slurm/346: Support multiple colon-separated PluginDir values
 -- Fix node state transition: DOWN to DRAINED (instead of DRAINING)
 -- Fix a couple of minor slurmctld memory leaks

* Changes in SLURM 0.3.0.0-pre4
===============================
 -- Fix bug where early launch failures (such as invalid UID/GID) resulted
    in jobs not terminating properly.
 -- Initial support for BNR committed (not yet functional).
 -- QsNet: SLURM now uses /etc/elanhosts exclusively for converting
    hostnames to ElanIDs.

* Changes in SLURM 0.3.0.0-pre3
===============================
 -- Fixes for reported problems:
    - slurm/328: Slurmd was restarting with a new shared memory segment
      and losing track of jobs
    - slurm/329: Job processing may be left running when one task dies
    - slurm/333: Slurmd fails to launch a job and deletes a step, due
      to a race condition in shared memory management
    - slurm/334: Slurmd was getting a segv due to a race condition in
      shared memory management
    - slurm/342: Properly handle nodes being removed from configuration
      even when there are partitions, nodes, or job steps still
      associated with them
 -- Srun properly terminates jobs/steps upon node failure (used to hang
    waiting for I/O completion)
 -- Job time limits enforced even if InactiveLimit configured as zero
 -- Support the sending of an arbitrary signal to a batch script (but not
    the processes in its job steps)
 -- Re-read slurm configuration file whenever changed, needed by users
    of SLURM APIs
 -- Scancel was generating an assert failure
 -- Slurmctld sends a launch response message upon scheduling of a queued
    job (for immediate srun response)
 -- Maui scheduler plugin added
 -- Backfill scheduler plugin added
 -- Batch scripts can now have arguments that are propagated
 -- MPICH support added (via patch, not in SLURM CVS)
 -- New SLURM environment variables added SLURM_CPUS_ON_NODE and
    SLURM_LAUNCH_NODE_IPADDR, these provide support needed for LAM/MPI
    (version 7.0.4+)
 -- The TMPDIR directory is created as needed before job launch
 -- Do not create duplicate SLURM environment variables with the same name
 -- Ensure proper enforcement of node sharing by job
 -- Treat lack of SpoolDir or StateSaveDir as a fatal error
 -- Quickstart.html guide expanded
 -- Increase maximum job steps per node from 16 to 64
 -- Delete correct shared memory segment on slurmd -c (clean start)

* Changes in SLURM 0.3.0.0-pre2
===============================
 -- Fixes for reported problems:
    - slurm/326: Properly clean-up jobs terminating on non-responding
      nodes
 -- Move all configuration data structure into common/read_config,
    scontrol now always shows default values if not specified in
    slurm.conf file
 -- Remove the unused "Prioritize" configuration parameter

* Changes in SLURM 0.3.0.0-pre1
===============================
 -- Fixes for reported problems:
    - slurm/252: "jobs left orphaned when using TotalView:" SLURM
      controller now pings srun and kills defunct jobs.
    - slurm/253: "srun fails to accept new IO connection."
    - slurm/317: "Lack of default partition in config file causes errors."
    - slurm/319: Socket errors on multiple simultaneous job launches fixed
    - slurm/321: slurmd shared memory synchronization error.
 -- Removed slurm_tv_clean daemon which has been obsoleted by slurm/252
    fix.
 -- New scontrol command ``delete'' and RPC added to delete a partition
 -- Squeue can now print and sort by group id/name
 -- Scancel has new option -q,--quiet to not report an error if a job
    is already complete
 -- Add the excluded node list to job information reported.
 -- RPC version mismatch now properly handled
 -- New job completion plugin interface added for logging completed jobs.
 -- Fixed lost digit in scontrol job priority specification.
 -- Remove restriction in the number of consecutive node sets (no longer
    needed after DPCS upgrade)
 -- Incomplete state save write now properly handled.
 -- Modified slurmd setrlimit error for greater clarity.
 -- Slurmctld performs load-leveling across shared nodes.
 -- New user function added slurm_get_end_time for user jobs.
 -- Always compile srun with stabs debug section when TotalView support
    is requested.

* Changes in SLURM 0.2.21
=========================
 -- Fixes for reported problems:
    - slurm/253: Try using different port if connect() fails (was rarely
      failing when an existing defunct connection was in TIME_WAIT state)
    - slurm/300: Possibly killing wrong job on slurmd restart
    - slurm/312: Freeing non-allocated memory and killing slurmd
 -- Assorted changes to support RedHat Enterprise Linux 3.0 and IA64
 -- Initial Elan4 and libelanctrl support (--with-elan).
 -- Slurmctld was sometimes inappropriately setting a job's priority
    to 1 when a node was down (even if up nodes could be used for the
    job when a running job completes)
 -- Convert all user commands from use of popt library to getopt_long()
 -- If TotalView support is requested, srun exports "totalview_jobid"
    variable for `%J' expansion in TV bulk launch string.
 -- Fix several locking bugs in slurmd IO layer.
 -- Throttle back repetitious error messages in slurmd to avoid filling
    log files.

* Changes in SLURM 0.2.20
=========================
 -- Fixes for reported problems:
    - slurm/298: Elan initialization error (Invalid vp 2147483674).
    - slurm/299: srun fails to exit with multiple ^C's.
 -- Temporarily prevent DPCS from allocating jobs with more than eight
    sets of consecutive nodes. This was likely causing user applications
    to fail with libelan errors. This will be removed after DPCS is
    updated.
 -- Fix bug in popt use, was failing in some versions of Linux.
 -- Resend KILL_JOB messages as needed to clear COMPLETING jobs.
 -- Install dummy SIGCHLD handler in slurmd to fix problem on NPTL
    systems where slurmd was not notified of terminated tasks.

* Changes in SLURM 0.2.19
=========================
 -- Memory corruption bug fixed, it was causing slurmctld to seg-fault

* Changes in SLURM 0.2.18
=========================
 -- Fixes for reported problems:
    - slurm/287: slurm protocol timeouts when using TotalView.
    - slurm/291: srun fails using ``-n 1'' under multi-node allocation.
    - slurm/294: srun IO buffer reports ENOSPC.
 -- Memory corruption bug fixed, it was causing slurmctld to seg-fault
 -- Non-responding nodes now go from DRAINING to DRAINED state when
    jobs complete
 -- Do not schedule pending jobs while any job is actively COMPLETING
    unless the submitted job specifically identifies its nodes (like DPCS)
 -- Reset priority of jobs with priority==1 when a non-responding node
    starts to respond again
 -- Ignore jobs with priority==1 when establishing new baseline upon
    slurmctld restart
 -- Make slurmctld/message retry be timer based rather than queue based
    for better scalability
 -- Slurmctld logging is more concise, using hostlists more
 -- srun --no-allocate uses a special job_id range to avoid conflicts
    or premature job termination (purging by slurmctld)
 -- New --jobid=id option in srun to initiate job step under an existing
    allocation.
 -- Support in srun for TotalView bulk launch.

* Changes in SLURM 0.2.17
=========================
 -- Fixes for reported problems:
    - slurm/279: Hold jobs that can't execute due to DOWN or DRAINED
      nodes and release when nodes are returned to service.
    - slurm/285: "srun killed due to SIGPIPE"
 -- Support for running job steps on nodes relative to current
    allocation via srun -r, --relative=n option.
 -- SIGKILL no longer broadcast to job via srun on task failure unless
    --no-allocate option is used.
 -- Re-enabled "chkconfig --add" in default RPMs.
 -- Backup controller setting proper PID into slurmctld.pid file.
 -- Backup controller restores QSW state each time it assumes control
 -- Backup controller purges old job records before assuming control to
    avoid resurrecting defunct jobs.
 -- Kill jobs on non-responding DRAINING nodes and make their state
    DRAINED.
 -- Save state upon completion of a job's last EPILOG_COMPLETION to
    reduce possibility of inconsistent job and node records when the
    controller is transitioning between primary and backup.
 -- Change logging level of detailed communication errors to not print
    them unless detailed debugging is requested.
 -- Increase number of concurrent controller server threads from 20
    to 50 and restructure code to handle backlogs more efficiently.
 -- Partition state at controller startup is based upon slurm.conf
    rather than previously saved state. Additional improvements to
    avoid inconsistent job/node/partition states at restart. Job state
    information is used to arbitrate conflicts.
 -- Orphaned file descriptors eliminated.

* Changes in SLURM 0.2.16
=========================
 -- Fixes for reported problems:
    - slurm/265: Early termination of srun could cause job to remain in
      queue.
    - slurm/268: Slurmctld could deadlock if there was a delay in the
      termination of a large node-count job. An EPILOG_COMPLETE RPC was
      added so that slurmd could notify slurmctld whenever the job
      termination was completed.
    - slurm/270: Segfault in sinfo if a configured node lacked a
      partition.
    - slurm/278: Exit code in scontrol did not indicate failure.
 -- Fixed bug in slurmd that caused the daemon to occasionally kill
    itself.
 -- Fixed bug in srun when running with --no-allocate and >1 process
    per node.
 -- Small fixes and updates for srun manual.

* Changes in SLURM 0.2.15
=========================
 -- Fixes for reported problems:
    - slurm/265: Job was orphaned when allocation response message could
      not be sent. Job is now killed on allocation response message
      transmit failure and socket error details are logged.
    - Fix for slurm/267: "Job epilog may run multiple times."
 -- Squeue job TIMELIMIT format changed from "h:mm" to "d:h:mm:ss".
 -- DPCS initiated jobs have steps execute properly without explicit
    specification of node count.

* Changes in SLURM 0.2.14
=========================
 -- Fixes for reported problems:
    - slurm/194: "srun doesn't handle most options when run under an
      allocation."
    - slurm/244: "REQ: squeue shows requested size of pending jobs."
 -- SLURM_NODELIST environment variable now exported to all jobs, not
    only batch jobs.
 -- Nodelist displayed in squeue for completing jobs is now restricted to
    completing nodes.
 -- Node "reason" field properly displayed in sinfo even with filtering.
 -- ``slurm_tv_clean'' daemon now supports a log file.
 -- Batch jobs are now re-queued on launch failure.
 -- Controller confirms job scripts for batch jobs are still running on
    node zero at node registration.
 -- Default RPMs no longer stop/start SLURM daemons on upgrade or install.

* Changes in SLURM 0.2.13
=========================
 -- Fixes for reported problems:
    - Fixed bug in slurmctld where "drained" nodes would go back into
      the "idle" state under some conditions (slurm/228).
    - Added possible fix for slurm/229: "slurmd occasionally fails to
      reap all children."
 -- Fixed memory leak in auth_munge plugin.
 -- Added fix to slurmctld to allow arbitrarily large job specifications
    to be saved and recovered in the state file.
 -- Allow "updates" in the configuration file of previously defined
    node state and reason.
 -- On "forceful termination" of a running job step, srun now exits
    unconditionally, instead of waiting for all I/O.
 -- Slurmctld now uses pidfile to kill old daemon when a new one is
    started.
 -- Addition of new daemon "slurm_tv_clean" used to clean up jobs
    orphaned due to use of the TotalView parallel debugger.

* Changes in SLURM 0.2.12
=========================
 -- Fixes for reported problems:
    - Fix for "waitpid: No child processes" when using TotalView
      (slurm/217).
    - Implemented temporary workaround for slurm/223: "Munge decode
      failed: Munged communication error."
    - Temporary fix for slurm/222: "elan3_create(0): Invalid argument."
 -- Fixed memory leaks in slurmctld (mostly due to reconfigure).
 -- More squeue/sinfo interface changes (see squeue(1), sinfo(1)).
 -- Sinfo now accepts list of node states to -t,--state option.
 -- Node "reason" field now available via sinfo command (see sinfo(1)).
 -- Wrapper source for srun (srun.wrapper.c) now installed and available
    for TotalView support.
 -- Improved retry logic in user commands for periods when slurmctld
    primary is down and backup has not yet taken over.

* Changes in SLURM 0.2.11
=========================
 -- Changes in srun:
    - Fixed bug in signal handling that occasionally resulted in
      orphaned jobs when using Ctrl-C.
    - Return non-zero exit code when remote tasks are killed by a signal.
    - SIGALRM is now blocked by default.
 -- Added ``reason'' string for down, drained, or draining nodes.
 -- Added -V,--version option to squeue and sinfo.
 -- Improved some error messages from user utilities.

* Changes in SLURM 0.2.10
=========================
 -- New slurm.conf configuration parameters:
    - WaitTime: Default for srun -w,--wait parameter.
    - MaxJobCount: Maximum number of jobs SLURM can handle at one time.
    - MinJobAge: Minimum time since completing before job is purged from
      slurmctld memory.
 -- Block user defined signals USR1 and USR2 in slurmd session manager.
 -- More squeue cleanup.
 -- Support for passing options to sinfo via environment variables.
 -- Added option to scontrol to find intersection of completing jobs and
    nodes.
 -- Added fix in auth_munge to prevent "Munged communication error"
    message.

* Changes in SLURM 0.2.9
========================
 -- Fixes for reported problems:
    - Argument to srun `-n' option was taken as octal if preceded with
      a `0'.
 -- New format for Elan hosts config file (/etc/elanhosts; see README)
 -- Various fixes for managing COMPLETING jobs.
 -- Support for passing options to squeue via environment variables
    (see squeue(1))

* Changes in SLURM 0.2.8
=========================
 -- Fix for bug in slurmd that could make debug messages appear in job
    output.
 -- Fix for bug in slurmctld retry count computation.
 -- Srun now times out slow launch threads.
 -- "Time Used" output in squeue now includes seconds.

* Changes in SLURM 0.2.7
=========================
 -- Fix for bug in Elan module that results in slurmd hang.
 -- Added completing job state to default list of states to print with
    squeue.

* Changes in SLURM 0.2.6
=========================
 -- More fixes for handling cleanup of slow terminating jobs.
 -- Fixed bug in srun that might leave nodes allocated after a Ctrl-C.

* Changes in SLURM 0.2.5
=========================
 -- Various fixes for cleanup of slow terminating or unkillable jobs.
 -- Fixed some small memory leaks in communications code.
 -- Added hack for synchronized exit of jobs on large node count.
 -- Long lists of nodes are no longer truncated in sinfo.
 -- Print more descriptive error message when tasks exit with nonzero
    status.
 -- Fixed bug in srun where unsuccessful launch attempts weren't detected.
 -- Elan network error resolver thread now runs from elan module in slurmd.
 -- Slurmctld uses consecutive Elan context and program description
    numbers instead of choosing them randomly.

* Changes in SLURM 0.2.4
==========================
 -- Fix for file descriptor leak in slurmctld.
 -- auth_munge plugin now prints credential info on decode failure.
 -- Minor changes to scancel interface.
 -- Filename format option "%J" now works again for srun --output and
    --error.

* Changes in SLURM 0.2.3
==========================
 -- Fix bug in srun when using per-task files for stderr.
 -- Better error reporting on failure to open per-task input/output files.
 -- Update auth_munge plugin for munge 0.1.
 -- Minor changes to squeue interface.
 -- New srun option `--hold' to submit job in "held" state.

* Changes in SLURM 0.2.2
==========================
 -- Fixes for reported problems:
    - Execution of script allocate mode fails in some cases. (gnats:161)
    - Errors using per-task input files with Elan support. (gnats:162)
    - srun doesn't handle all environment variables properly. (gnats:164)
 -- Parallel job is now terminated if a task is killed by a signal.
 -- Exit status of srun is set based on exit codes of tasks.
 -- Redesign of sinfo interface and options.
 -- Shutdown of slurmctld no longer propagates shutdown to all nodes.

* Changes in SLURM 0.2.1
===========================
 -- Fix bug where reconfigure request to slurmctld killed the daemon.

* Changes in SLURM 0.2.0
============================
 -- SlurmdTimeout of 0 means never set a non-responding node to DOWN.
 -- New srun option, -u,--unbuffered, for unbuffered stdout.
 -- Enhancements for sinfo
    - Non-responding nodes show "*" character appended instead of
      "NoResp+".
    - Node states show abbreviated variant by default
 -- Enhancements for scontrol.
    - Added "ping" command to show current state of SLURM controllers.
    - Job dump in scontrol shows user name as well as UID.
    - Node state of DRAIN is appropriately mapped to DRAINING or DRAINED.
 -- Fix for bug where request for task count greater than partition limit
    was queued anyway.
 -- Fix for bugs in job end time handling.
 -- Modifications for error free builds on 64 bit architectures.
 -- Job cancel immediately deallocates nodes instead of waiting on srun.
 -- Attempt to create slurmd spool if it does not exist.
 -- Fixed signal handling bug in srun allocate mode.
 -- Earlier error detection in slurmd startup.
 -- "fatal: _shm_unlock: Numerical result out of range" bug fixed in
    slurmd.
 -- Config file parsing is now case insensitive.
 -- SLURM_NODELIST environment variable now set in allocate mode.

* Changes in SLURM 0.2.0-pre2
=============================
 -- Fix for reconfigure when public/private key path is changed.
 -- Shared memory fixes in slurmd.
    - fix for infinite semaphore incrementation bug.
 -- Semaphore fixes in slurmctld.
 -- Slurmctld now remembers which nodes have registered after recover.
 -- Fixed reattach bug when tasks have exited.
 -- Change directory to /tmp in slurmd if daemonizing.
 -- Logfiles are reopened on reconfigure.

$Id$