-- Added jobacct_gather/cgroup plugin. It is not advised to use this in
production as it isn't currently complete and doesn't provide an equivalent
substitution for jobacct_gather/linux yet. Work by Martin Perry, Bull.
* Changes in SLURM 2.4.0.pre4
=============================
-- Add logic to cache GPU file information (bitmap index mapping to device
file number) in the slurmd daemon and transfer that information to the
slurmstepd whenever a job step is initiated. This is needed to set the
appropriate CUDA_VISIBLE_DEVICES environment variable value when the
devices are not in strict numeric order (e.g. some GPUs are skipped).
Based upon work by Nicolas Bigaouette.
-- BGQ - Remove ability to make a sub-block with a geometry with one or more
of its dimensions of length 3. There is a limitation in the IBM I/O
subsystem that is problematic with multiple sub-blocks with a dimension
of length 3, so we disallow their creation. This means that if you ask
the system for an allocation of 12 c-nodes you will be given 16. If this
is ever fixed in BGQ you can remove this patch.
-- BLUEGENE - Better handling blocks that go into error state or deallocate
while jobs are running on them.
-- BGQ - fix for handling mix of steps running at same time some of which
are full allocation jobs, and others that are smaller.
-- BGQ - fix for core dump after running multiple sub-block jobs on static
blocks.
-- BGQ - fixed sync issue where if a job finishes in SLURM but not in mmcs
for a long time after the SLURM job has been flushed from the system
we don't have to worry about rebooting the block to sync the system.
-- BGQ - In scontrol/sview node counts are now displayed with
CnodeCount/CnodeErrCount to point out that there are cnodes in an error
state on the block. Draining the block and having it reboot when all jobs
are gone will clear up the cnodes in Software Failure.
-- Change default SchedulerParameters max_switch_wait field value from 60 to
300 seconds.
-- BGQ - catch errors from the kill option of the runjob client.
-- BLUEGENE - make it so the epilog runs until slurmctld tells it the job is
gone. Previously it had a timelimit which has proven to not be the right
thing.
-- FRONTEND - fix issue where if a compute node was in a down state and
an admin updates the node to idle/resume the compute nodes will go
instantly to idle instead of idle* which means no response.
-- Fix regression in 2.4.0.pre3 where number of submitted jobs limit wasn't
being honored for QOS.
-- Cray - Enable logging of BASIL communications with environment variables.
Set XML_LOG to enable logging. Set XML_LOG_LOC to specify path to log file
or "SLURM" to write to SlurmctldLogFile or unset for "slurm_basil_xml.log".
Patch from Steve Trofinoff, CSCS.
-- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
mark front end node down.
-- FRONTEND - don't down a front end node if you have an epilog error.
-- BLUEGENE - if a job has an epilog error don't down the midplane it was
running on.
-- BGQ - added new DebugFlag (NoRealTime) for only printing debug from
state change while the realtime server is running.
-- Fix multi-cluster mode with sview starting on a non-bluegene cluster going
to a bluegene cluster.
-- BLUEGENE - ability to show Rack Midplane name of midplanes in sview and
scontrol.
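The GPU file caching entry in this release can be illustrated with a small sketch. It is not the slurmd code: the function names and the example device paths are assumptions made for illustration. It only shows why a bitmap index has to be translated to a device file number when some device numbers are skipped.

```python
import re


def build_gpu_index_map(device_files):
    """Map bitmap index (position in the configured list) to device file number.

    Hypothetical helper: the device number is taken from the trailing digits
    of each device file name.
    """
    index_to_dev = []
    for path in device_files:
        match = re.search(r"(\d+)$", path)
        if match is None:
            raise ValueError(f"no trailing device number in {path!r}")
        index_to_dev.append(int(match.group(1)))
    return index_to_dev


def cuda_visible_devices(index_to_dev, allocated_indexes):
    """Build a CUDA_VISIBLE_DEVICES value from the allocated bitmap indexes."""
    return ",".join(str(index_to_dev[i]) for i in sorted(allocated_indexes))


# Device 1 is skipped, so bitmap indexes 1 and 2 refer to devices 2 and 3.
gpu_map = build_gpu_index_map(["/dev/nvidia0", "/dev/nvidia2", "/dev/nvidia3"])
print(cuda_visible_devices(gpu_map, {1, 2}))
```

Using the bitmap indexes directly would have produced "1,2" here, which names a device that does not exist on the node.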
* Changes in SLURM 2.4.0.pre3
=============================
-- Let a job be submitted even if it exceeds a QOS limit. Job will be left
in a pending state until the QOS limit or job parameters change. Patch by
Phil Eckert, LLNL.
-- Add sacct support for the option "--name". Work by Yuri D'Elia, Center for
Biomedicine, EURAC Research, Italy.
-- Add an srun shepherd process to cancel a job and/or step if the srun process
is killed abnormally (e.g. SIGKILL).
-- BGQ - handle deadlock issue when a nodeboard goes into an error state.
-- BGQ - more thorough handling of blocks with multiple jobs running on them.
-- Fix man2html process to compile in the build directory instead of the
source dir.
-- Behavior of srun --multi-prog modified so that any program arguments
specified on the command line will be appended to the program arguments
specified in the program configuration file.
-- Add new command, sdiag, which reports a variety of job scheduling
statistics. Based upon work by Alejandro Lucero Palau, BSC.
-- BLUEGENE - Added DefaultConnType to the bluegene.conf file. This makes it
so you can specify any connection type you would like (TORUS or MESH) as
the default in dynamic mode. Previously it always defaulted to TORUS.
-- Made squeue -n and -w options more consistent with salloc, sbatch, srun,
and scancel. Patch by Don Lipari, LLNL.
-- Have sacctmgr remove user records when no associations exist for that user.
-- Several header file changes for clean build with NetBSD. Patches from
Aleksej Saushev.
-- Fix for possible deadlock in accounting logic: Avoid calling
jobacct_gather_g_getinfo() until there is data to read from the socket.
-- Fix race condition that could generate "job_cnt_comp underflow" errors on
front-end architectures.
-- BGQ - Fix issue where a system with missing cables could cause core dump.
* Changes in SLURM 2.4.0.pre2
=============================
-- CRAY - Add support for GPU memory allocation using SLURM GRES (Generic
RESource) support. Work by Steve Trofinoff, CSCS.
-- Add support for job allocations with multiple job constraint counts. For
example: salloc -C "[rack1*2&rack2*4]" ... will allocate the job 2 nodes
from rack1 and 4 nodes from rack2. Support for only a single constraint
name has been added to job step support.
-- BGQ - Remove old method for marking cnodes down.
-- BGQ - Remove BGP images from view in sview.
-- BGQ - print out failed cnodes in scontrol show nodes.
-- BGQ - Add srun option of "--runjob-opts" to pass options to the runjob
command.
-- FRONTEND - handle step launch failure better.
-- BGQ - Added a mutex to protect the now changing ba_system pointers.
-- BGQ - added new functionality for sub-block allocations - no preemption
for this yet though.
-- Add --name option to squeue to filter output by job name. Patch from Yuri
D'Elia.
-- BGQ - Added linking to runjob client library which gives support to totalview
to use srun instead of runjob.
-- Add numeric range checks to scontrol update options. Patch from Phil
Eckert, LLNL.
-- Add ReconfigFlags configuration option to control actions of "scontrol
reconfig". Patch from Don Albert, Bull.
-- BGQ - handle reboots with multiple jobs running on a block.
-- BGQ - Add message handler thread to forward signals to runjob process.
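The constraint count syntax shown in this release (salloc -C "[rack1*2&rack2*4]") can be read as a list of name and node count pairs. The parser below is only a sketch of that reading, with a hypothetical function name. It is not the parsing code used by SLURM and it accepts nothing beyond the bracketed form shown in the entry.

```python
import re


def parse_constraint_counts(spec):
    """Parse "[name*count&name*count]" into a list of (name, count) pairs."""
    if not (spec.startswith("[") and spec.endswith("]")):
        raise ValueError("expected a bracketed constraint list")
    pairs = []
    for term in spec[1:-1].split("&"):
        match = re.fullmatch(r"(\w+)\*(\d+)", term)
        if match is None:
            raise ValueError(f"bad constraint term {term!r}")
        pairs.append((match.group(1), int(match.group(2))))
    return pairs


pairs = parse_constraint_counts("[rack1*2&rack2*4]")
print(pairs)
print(sum(count for _, count in pairs))  # total nodes requested
```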
* Changes in SLURM 2.4.0.pre1
=============================
-- BGQ - use the ba_geo_tables to figure out the blocks instead of the old
algorithm. This improves timing in the worst cases and simplifies the code
greatly.
-- BLUEGENE - Change to output tools labels from BP to Midplane
(i.e. BP List -> MidplaneList).
-- BLUEGENE - read MPs and BPs from the bluegene.conf
-- Modify srun's SIGINT handling logic timer (two SIGINTs within one second) to
be based on a microsecond rather than a second timer.
-- Modify advance reservation to accept multiple specific block sizes rather
than a single node count.
-- Permit administrator to change a job's QOS to any value without validating
the job's owner has permission to use that QOS. Based upon patch by Phil
Eckert (LLNL).
-- Add trigger flag for a permanent trigger. The trigger will NOT be purged
after an event occurs, but only when explicitly deleted.
-- Interpret a reservation with Nodes=ALL and a Partition specification as
reserving all nodes within the specified partition rather than all nodes
on the system. Based upon patch by Phil Eckert (LLNL).
-- Add the ability to reboot all compute nodes after they become idle. The
RebootProgram configuration parameter must be set and an authorized user
must execute the command "scontrol reboot_nodes". Patch from Andriy
Grytsenko (Massive Solutions Limited).
-- Modify slurmdbd.conf parsing to accept DebugLevel strings (quiet, fatal,
info, etc.) in addition to numeric values. The parsing of slurm.conf was
modified in the same fashion for SlurmctldDebug and SlurmdDebug values.
The output of sview and "scontrol show config" was also modified to report
those values as strings rather than numeric values.
-- Changed default value of StateSaveLocation configuration parameter from
-- Prevent associations from being deleted if it has any jobs in running,
pending or suspended state. Previous code prevented this only for running
jobs.
-- If a job can not run due to QOS or association limits, then do not cancel
the job, but leave it pending in a system held state (priority = 1). The
job will run when its limits or the QOS/association limits change. Based
upon a patch by Phil Eckert (LLNL).
-- BGQ - Added logic to keep track of cnodes in an error state inside of a
booted block.
-- Added the ability to update a node's NodeAddr and NodeHostName with
scontrol. Also enable setting a node's state to "future" using scontrol.
-- Add a node state flag of CLOUD and save/restore NodeAddr and NodeHostName
information for nodes with a flag of CLOUD.
-- Cray: Add support for job reservations with node IDs that are not in
numeric order. Fix for Bugzilla #5.
-- Fix association limit support for jobs queued for multiple partitions.
-- BLUEGENE - fix issue for sub-midplane systems to create a full system
block correctly.
-- BLUEGENE - Added option to the bluegene.conf to tell you are running on
a sub midplane system.
-- Added the UserID of the user issuing the RPC to the job_submit/lua
functions.
-- Fixed issue where if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND slurm would be a little overzealous
in treating a missing GID or UID as a fatal error.
-- If job time limit exceeds partition maximum, but job's minimum time limit
does not, set job's time limit to partition maximum at allocation time.
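The change to srun's SIGINT timer in this release is easier to see with numbers. The sketch below is an illustration only: the function name and the one second window constant are assumptions, and the real srun comparison is not reproduced. It shows how two interrupts 1.8 seconds apart can look adjacent when only whole seconds are compared.

```python
WINDOW_USEC = 1_000_000  # one second, expressed in microseconds


def is_second_interrupt(prev_usec, now_usec, window_usec=WINDOW_USEC):
    """True if a SIGINT at now_usec follows the previous one within the window."""
    if prev_usec is None:
        return False
    return (now_usec - prev_usec) <= window_usec


# Interrupts at 0.1 s and 1.9 s are 1.8 s apart.
first, second = 100_000, 1_900_000
whole_second_gap = second // 1_000_000 - first // 1_000_000

print(whole_second_gap)                    # gap seen by a whole-second timer
print(is_second_interrupt(first, second))  # decision with microsecond timestamps
```

With whole seconds the gap is reported as 1, the same value as for two interrupts that really are one second apart. The microsecond comparison distinguishes the two cases.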
* Changes in SLURM 2.3.6
========================
-- Fix DefMemPerCPU for partition definitions.
-- Fix to create a reservation with licenses and no nodes.
-- Fix issue with assoc_mgr if a bad state file is given and the database
isn't up at the time the slurmctld starts, not running the
priority/multifactor plugin, and then the database is started up later.
-- Gres: If a gres has a count of one and an associated file then when doing
a reconfiguration, the node's bitmap was not cleared resulting in an
underflow upon job termination or removal from scheduling matrix by the
backfill scheduler.
-- Fix race condition in job dependency logic which can result in invalid
memory reference.
* Changes in SLURM 2.3.5
========================
-- Improve support for overlapping advanced reservations. Patch from
Bill Brophy, Bull.
-- Modify Makefiles for support of Debian hardening flags. Patch from
Simon Ruderich.
-- CRAY: Fix support for configuration with SlurmdTimeout=0 (never mark
node that is DOWN in ALPS as DOWN in SLURM).
-- Fixed the setting of SLURM_SUBMIT_DIR for jobs submitted by Moab (BZ#1467).
Patch by Don Lipari, LLNL.
-- Correction to init.d/slurmdbd exit code for status option. Patch by Bill
Brophy, Bull.
-- When the optional max_time is not specified for --switches=count, the site
max (SchedulerParameters=max_switch_wait=seconds) is used for the job.
Based on patch from Rod Schultz.
-- Fix bug in select/cons_res plugin when used with topology/tree and a node
range count in job allocation request.
-- Fixed moab_2_slurmdb.pl script to correctly work for end records.
-- Add support for new SchedulerParameters of max_depend_depth defining the
maximum number of jobs to test for circular dependencies (i.e. job A waits
for job B to start and job B waits for job A to start). Default value is
10 jobs.
-- Fix potential race condition if MinJobAge is very low (i.e. 1) and using
slurmdbd accounting and running large amounts of jobs (>50 sec). Job
information could be corrupted before it had a chance to reach the DBD.
-- Fix state restore of job limit set from admin value for min_cpus.
-- Fix clearing of limit values if an admin removes the limit for max cpus
and time limit where it was previously set by an admin.
-- Fix issue where log message is more than 256 chars and then has a format.
-- Fix sched/wiki2 to support job account name, gres, partition name, wckey,
or working directory that contains "#" (a job record separator). Also fix
for wckey or working directory that contains a double quote '\"'.
-- CRAY - fix for handling memory requests from user for an allocation.
-- Add support for switches parameter to the job_submit/lua plugin. Work by
Par Andersson, NSC.
-- Fix to job preemption logic to preempt multiple jobs at the same time.
-- Fix minor issue where uid and gid were switched in sview for submitting
batch jobs.
-- Fix possible illegal memory reference in slurmctld for job step with
relative option. Work by Matthieu Hautreux (CEA).
-- Reset priority of system held jobs when dependency is satisfied. Work by
Don Lipari, LLNL.
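The max_depend_depth entry in this release bounds how many jobs are tested when looking for circular dependencies. The following sketch shows one way such a bounded test could work. The function name, the dictionary representation of dependencies, and the traversal order are all assumptions, not the slurmctld implementation.

```python
def has_circular_dependency(job_id, depends_on, max_depend_depth=10):
    """Return True if job_id is found among its own (transitive) dependencies.

    depends_on maps a job ID to the list of job IDs it waits for.
    At most max_depend_depth jobs are tested before giving up.
    """
    to_visit = list(depends_on.get(job_id, []))
    seen = set()
    tested = 0
    while to_visit and tested < max_depend_depth:
        current = to_visit.pop()
        if current == job_id:
            return True
        if current in seen:
            continue
        seen.add(current)
        tested += 1
        to_visit.extend(depends_on.get(current, []))
    return False


# Job 1 waits for job 2 and job 2 waits for job 1.
print(has_circular_dependency(1, {1: [2], 2: [1]}))
# A three-job cycle is missed when only one job may be tested.
print(has_circular_dependency(1, {1: [2], 2: [3], 3: [1]}, max_depend_depth=1))
```

The second call illustrates the trade-off behind the limit: a small value keeps the test cheap but can miss a longer cycle.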
* Changes in SLURM 2.3.4
========================
-- Set DEFAULT flag in partition structure when slurmctld reads the
configuration file. Patch from Rémi Palancher.
-- Fix for possible deadlock in accounting logic: Avoid calling
jobacct_gather_g_getinfo() until there is data to read from the socket.
-- Fix typo in accounting when using reservations. Patch from Alejandro
Lucero Palau.
-- Fix to the multifactor priority plugin to calculate effective usage earlier
to give a correct priority on the first decay cycle after a restart of the
slurmctld. Patch from Martin Perry, Bull.
-- Permit user root to run a job step for any job as any user. Patch from
Didier Gazen, Laboratoire d'Aerologie.
-- BLUEGENE - fix for not allowing jobs if all midplanes are drained and all
blocks are in an error state.
-- Avoid slurmctld abort due to bad pointer when setting an advanced
reservation MAINT flag if it contains no nodes (only licenses).
-- Fix bug when requeued batch job is scheduled to run on a different node
zero, but attempts job launch on old node zero.
-- Fix bug in step task distribution when nodes are not configured in numeric
order. Patch from Hongjia Cao, NUDT.
-- Fix for srun allocating running within existing allocation with --exclude
option and --nnodes count small enough to remove more nodes. Patch from
Phil Eckert, LLNL.
-- Work around to handle certain combinations of glibc/kernel
(i.e. glibc-2.14/Linux-3.1) to correctly open the pty of the slurmstepd
as the job user. Patch from Mark Grondona, LLNL.
-- Modify linking to include "-ldl" only when needed. Patch from Aleksej
Saushev.
-- Fix smap regression to display nodes that are drained or down correctly.
-- Several bug fixes and performance improvements related to batch
scripts containing very large numbers of arguments. Patches from Par
Andersson, NSC.
-- Fixed extremely hard to reproduce threading issue in assoc_mgr.
-- Correct "scontrol show daemons" output if there is more than one
ControlMachine configured.
-- Add node read lock where needed in slurmctld/agent code.
-- Added test for LUA library named "liblua5.1.so.0" in addition to
"liblua5.1.so" as needed by Debian. Patch by Remi Palancher.
-- Added partition default_time field to job_submit LUA plugin. Patch by
Remi Palancher.
-- Fix bug in cray/srun wrapper stdin/out/err file handling.
-- In cray/srun wrapper, only include aprun "-q" option when srun "--quiet"
option is used.
-- BLUEGENE - fix issue where if a small block was in error it could hold up
the queue when trying to place a larger than midplane job.
-- CRAY - ignore all interactive nodes and jobs on interactive nodes.
-- Add new job state reason of "FrontEndDown" which applies only to Cray and
IBM BlueGene systems.
-- Cray - Enable configure option of "--enable-salloc-background" to permit
the srun and salloc commands to be executed in the background. This does
NOT remove the ALPS limitation that only one job reservation can be created
for each Linux session ID.
-- Cray - For srun wrapper when creating a job allocation, set the default job
name to the executable file's name.
-- FRONTEND - if a front end unexpectedly reboots kill all jobs but don't
mark front end node down.
-- FRONTEND - don't down a front end node if you have an epilog error.
-- Cray - fix for if a frontend slurmd was started after the slurmctld had
already pinged it on startup the unresponding flag would be removed from
the frontend node.
-- Cray - Fix issue on smap not displaying grid correctly.
* Changes in SLURM 2.3.3
========================
-- Fix task/cgroup plugin error when used with GRES. Patch by Alexander
Bersenev (Institute of Mathematics and Mechanics, Russia).
-- Permit pending job exceeding a partition limit to run if its QOS flag is
modified to permit the partition limit to be exceeded. Patch from Bill
Brophy, Bull.
-- sacct search for jobs using filtering was ignoring wckey filter.
-- Fixed issue with QOS preemption when adding new QOS.
-- Fixed issue with comment field being used in a job finishing before it
starts in accounting.
-- Add slashes in front of derived exit code when modifying a job.
-- Handle numeric suffix of "T" for terabyte units. Patch from John Thiltges,
University of Nebraska-Lincoln.
-- Prevent resetting a held job's priority when updating other job parameters.
Patch from Alejandro Lucero Palau, BSC.
-- Improve logic to import a user's environment. Needed with --get-user-env
option used with Moab. Patch from Mark Grondona, LLNL.
-- Fix bug in sview layout if node count less than configured grid_x_width.
-- Modify PAM module to prefer to use SLURM library with same major release
number that it was built with.
-- Permit gres count configuration of zero.
-- Fix race condition where sbcast command can result in deadlock of slurmd
daemon. Patch by Don Albert, Bull.
-- Fix bug in srun --multi-prog configuration file to avoid printing duplicate
record error when "*" is used at the end of the file for the task ID.
-- Let operators see reservation data even if "PrivateData=reservations" flag
is set in slurm.conf. Patch from Don Albert, Bull.
-- Added new sbatch option "--export-file" as needed for latest version of
Moab. Patch from Phil Eckert, LLNL.
-- Fix for sacct printing CPUTime(RAW) where the value is greater than a 32 bit
number.
-- Fix bug in --switch option with topology resulting in bad switch count use.
Patch from Alejandro Lucero Palau (Barcelona Supercomputer Center).
-- Fix PrivateFlags bug when using Priority Multifactor plugin. If using sprio
all jobs would be returned even if the flag was set.
Patch from Bill Brophy, Bull.
-- Fix for possible invalid memory reference in slurmctld in job dependency
logic. Patch from Carles Fenoy (Barcelona Supercomputer Center).
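The "T" suffix entry in this release can be sketched as a small unit parser. Everything specific here is an assumption made for illustration: the function name, megabytes as the base unit, the set of accepted suffixes, and the use of 1024 as the multiplier. The changelog does not state which options accept the suffix or how the value is scaled.

```python
# Assumed multipliers, relative to one megabyte.
_MULTIPLIERS_MB = {"M": 1, "G": 1024, "T": 1024 * 1024}


def to_megabytes(text):
    """Convert a string such as "512", "4G" or "2T" to a megabyte count."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in _MULTIPLIERS_MB:
        return int(text[:-1]) * _MULTIPLIERS_MB[suffix]
    return int(text)  # no suffix: already in the base unit


print(to_megabytes("2T"))
print(to_megabytes("4G"))
```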
* Changes in SLURM 2.3.2
========================
-- Add configure option of "--without-rpath" which builds SLURM tools without
the rpath option, which will work if Munge and BlueGene libraries are in
the default library search path and make system updates easier.
-- Fixed issue where if a job ended with ESLURMD_UID_NOT_FOUND and
ESLURMD_GID_NOT_FOUND slurm would be a little overzealous
in treating a missing GID or UID as a fatal error.
-- Backfill scheduling - Add SchedulerParameters configuration parameter of
"bf_res" to control the resolution in the backfill scheduler's data about
when jobs begin and end. Default value is 60 seconds (used to be 1 second).
-- Cray - Remove the "family" specification from the GPU reservation request.
-- Updated set_oomadj.c, replacing deprecated oom_adj reference with
oom_score_adj
-- Fix resource allocation bug, generic resources allocation was ignoring the
job's ntasks_per_node and cpus_per_task parameters. Patch from Carles
Fenoy, BSC.
-- Avoid orphan job step if slurmctld is down when a job step completes.
-- Fix Lua link order, patch from Pär Andersson, NSC.
-- Set SLURM_CPUS_PER_TASK=1 when user specifies --cpus-per-task=1.
-- Fix for fatal error managing GRES. Patch by Carles Fenoy, BSC.
-- Fixed race condition when using the DBD in accounting where if a job
wasn't started at the time the eligible message was sent but started
before the db_index was returned information like start time would be lost.
-- Fix issue in accounting where normalized shares could be updated
incorrectly when getting fairshare from the parent.
-- Fixed if not enforcing associations but want QOS support for a default
qos on the cluster to fill that in correctly.
-- Fix in select/cons_res for "fatal: cons_res: sync loop not progressing"
with some configurations and job option combinations.
-- BLUEGENE - Fixed issue with handling HTC modes and rebooting.
-- Do not remove the backup slurmctld's pid file when it assumes control, only
when it actually shuts down. Patch from Andriy Grytsenko (Massive Solutions
Limited).
-- Avoid clearing a job's reason from JobHeldAdmin or JobHeldUser when it is
otherwise updated using scontrol or sview commands. Patch based upon work
by Phil Eckert (LLNL).
-- BLUEGENE - Fix for if changing the defined blocks in the bluegene.conf and
jobs happen to be running on blocks not in the new config.
-- Many cosmetic modifications to eliminate warning message from GCC version
4.6 compiler.
-- Fix for sview reservation tab when finding correct reservation.
-- Fix for handling QOS limits per user on a reconfig of the slurmctld.
-- Do not treat the absence of a gres.conf file as a fatal error on systems
configured with GRES, but set GRES counts to zero.
-- BLUEGENE - Update correctly the state in the reason of a block if an
admin sets the state to error.
-- BLUEGENE - handle reason of blocks in error more correctly between
restarts of the slurmctld.
-- BLUEGENE - Fix minor potential memory leak when setting block error reason.
-- BLUEGENE - Fix if running in Static/Overlap mode and full system block
is in an error state, won't deny jobs.
-- Fix for accounting where your cluster isn't numbered in counting order
(i.e. 1-9,0 instead of 0-9). The bug would cause 'sacct -N nodename' to
not give correct results on these systems.
-- Fix to GRES allocation logic when resources are associated with specific
CPUs on a node. Patch from Steve Trofinoff, CSCS.
-- Fix bugs in sched/backfill with respect to QOS reservation support and job
time limits. Patch from Alejandro Lucero Palau (Barcelona Supercomputer
Center).
-- BGQ - fix to set up corner correctly for sub block jobs.
-- Major re-write of the CPU Management User and Administrator Guide (web
page) by Martin Perry, Bull.
-- BLUEGENE - If removing blocks from system that once existed cleanup of old
block happens correctly now.
-- Prevent slurmctld crashing with configuration of MaxMemPerCPU=0.
-- Prevent job hold by operator or account coordinator of his own job from
being an Administrator Hold rather than User Hold by default.
-- Cray - Fix for srun.pl parsing to avoid adding spaces between option and
argument (e.g. "-N2" parsed properly without changing to "-N 2").
-- Major updates to cgroup support by Mark Grondona (LLNL) and Matthieu
Hautreux (CEA) and Sam Lang. Fixes timing problems with respect to the
task_epilog. Allows cgroup mount point to be configurable. Added new
configuration parameters MaxRAMPercent and MaxSwapPercent. Allow cgroup
configuration parameters that are percentages to be floating point.
-- Fixed issue where sview wasn't displaying correct nice value for jobs.
-- Fixed issue where sview wasn't displaying correct min memory per node/cpu
value for jobs.
-- Disable some SelectTypeParameters for select/linear that aren't compatible.
-- Move slurm_select_init to proper place to avoid loading multiple select
plugins in the slurmd.
-- BGQ - Include runjob_plugin.so in the bluegene rpm.
-- Report correct job "Reason" if needed nodes are DOWN, DRAINED, or
NOT_RESPONDING, "Resources" rather than "PartitionNodeLimit".
-- BLUEGENE - Fixed issues with running on a sub-midplane system.
-- Added some missing calls to allow older versions of SLURM to talk to newer.
-- Do not attempt to run HealthCheckProgram on powered down nodes. Patch from
Ramiro Alba, Centre Tecnològic de Transferència de Calor, Spain.
* Changes in SLURM 2.3.0-2
==========================
-- Fix issue where if a job was pending and the slurmctld was restarted a
variable wasn't initialized in the job structure making it so that job
wouldn't run.
* Changes in SLURM 2.3.0
========================
-- BLUEGENE - make sure we only set the jobinfo_select start_loc on a job
when we are on a small block, not a regular one.
-- BGQ - fix issue where not copying the correct amount of memory.
-- BLUEGENE - fix clean start if jobs were running when the slurmctld was
shutdown and then the system size changed. This would probably only happen
if you were emulating a system.
-- Fix sview for calling a cray system from a non-cray system to get the
correct geometry of the system.
-- BLUEGENE - fix to correctly import previous version of block state file.
-- BLUEGENE - handle loading better when doing a clean start with static
blocks.
-- Add sinfo format and sort option "%n" for NodeHostName and "%o" for
NodeAddr.
-- If a job is deferred due to partition limits, then re-test those limits
after a partition is modified. Patch from Don Lipari.
-- Fix bug which would crash slurmctld if job's owner (not root) tries to clear
a job's licenses by setting value to "".
-- Cosmetic fix for printing out debug info in the priority plugin.
-- In sview when switching from a bluegene machine to a regular linux cluster
and vice versa the node->base partition lists will be displayed if setup
in your .slurm/sviewrc file.
-- BLUEGENE - Fix for creating full system static block on a BGQ system.
-- BLUEGENE - Fix deadlock issue if toggling between Dynamic and Static block
allocation with jobs running on blocks that don't exist in the static
setup.
-- BLUEGENE - Modify code to only give HTC states to BGP systems and not
allow them on Q systems.
-- BLUEGENE - Make it possible for an admin to define multiple dimension
conn_types in a block definition.
-- BGQ - Alter tools to output multiple dimensional conn_type.
-- With sched/wiki or sched/wiki2 (Maui or Moab scheduler), ensure that a
requeued job's priority is reset to zero.
-- BLUEGENE - fix to run steps correctly in a BGL/P emulated system.
-- Fixed issue where if there was a network issue between the slurmctld and
the DBD where both remained up but were disconnected the slurmctld would
get registered again with the DBD.
-- Fixed issue where if the DBD connection from the ctld goes away because of
a POLLERR the dbd_fail callback is called.
-- BLUEGENE - Fix to smap command-line mode display.
-- Change in GRES behavior for job steps: A job step's default generic
resource allocation will be set to that of the job. If a job step's --gres
value is set to "none" then none of the generic resources which have been
allocated to the job will be allocated to the job step.
-- Add srun environment value of SLURM_STEP_GRES to set default --gres value
for a job step.
-- Require SchedulerTimeSlice configuration parameter to be at least 5 seconds
to avoid thrashing slurmd daemon.
-- Cray - Fix to make nodes state in accounting consistent with state set by
ALPS.
-- Cray - A node DOWN to ALPS will be marked DOWN to SLURM only after reaching
SlurmdTimeout. In the interim, the node state will be NO_RESPOND. This
change makes SLURM handling of the node DOWN state more consistent with
ALPS. This change affects only Cray systems.
-- Cray - Fix to work with 4.0.* instead of just 4.0.0
-- Cray - Modify srun/aprun wrapper to map --exclusive to -F exclusive and
--share to -F share. Note this does not consider the partition's Shared
configuration, so it is an imperfect mapping of options.
-- BLUEGENE - Added notice in the print config to tell if you are emulated
or not.
-- BLUEGENE - Fix job step scalability issue with large task count.
-- BGQ - Improved c-node selection when asked for a sub-block job that
cannot fit into the available shape.
-- BLUEGENE - Modify "scontrol show step" to show I/O nodes (BGL and BGP) or
c-nodes (BGQ) allocated to each step. Change field name from "Nodes=" to
"BP_List=".
-- Code cleanup on step request to get the correct select_jobinfo.
-- Memory leak fixed for rolling up accounting with down clusters.
-- BGQ - fix issue where if first job step is the entire block and then the
next parallel step is run on a sub block, SLURM won't over subscribe cnodes.
-- Treat duplicate switch name in topology.conf as fatal error. Patch from Rod
Schultz, Bull
-- Minor update to documentation describing the AllowGroups option for a
partition in the slurm.conf.
-- Fix problem with _job_create() when not using qos's. It makes
_job_create() consistent with similar logic in select_nodes().
-- GrpCPURunMins in a QOS fleshed out.
-- Fix for squeue -t "CONFIGURING" to actually work.
-- CRAY - Add cray.conf parameter of SyncTimeout, maximum time to defer job
scheduling if SLURM node or job state are out of synchronization with ALPS.
-- If salloc was run as interactive, with job control, reset the foreground
process group of the terminal to the process group of the parent pid before
exiting. Patch from Don Albert, Bull.
-- BGQ - set up the corner of a sub block correctly based on a relative
position in the block instead of absolute.
-- BGQ - make sure the recently added select_jobinfo of a step launch request
isn't sent to the slurmd where environment variables would be overwritten
incorrectly.
-- NOTE THERE HAVE BEEN NEW FIELDS ADDED TO THE JOB AND PARTITION STATE SAVE
FILES AND RPCS. PENDING AND RUNNING JOBS WILL BE LOST WHEN UPGRADING FROM
EARLIER VERSION 2.3 PRE-RELEASES AND RPCS WILL NOT WORK WITH EARLIER
VERSIONS.
-- select/cray: Add support for Accelerator information including model and
memory options.
-- Cray systems: Add support to suspend/resume salloc command to ensure that
aprun does not get initiated when the job is suspended. Processes suspended
and resumed are determined by using process group ID and parent process ID,
so some processes may be missed. Since salloc runs as a normal user, its
ability to identify processes associated with a job is limited.
-- Cray systems: Modify smap and sview to display all nodes even if multiple
nodes exist at each coordinate.
-- Improve efficiency of select/linear plugin with topology/tree plugin
configured, Patch by Andriy Grytsenko (Massive Solutions Limited).
-- For front-end architectures on which job steps are run (emulated Cray and
BlueGene systems only), fix bug that would free memory still in use.
-- Add squeue support to display a job's license information. Patch by Andy
Roosen (University of Delaware).
-- Add flag to the select APIs for job suspend/resume indicating if the action
is for gang scheduling or an explicit job suspend/resume by the user. Only
an explicit job suspend/resume will reset the job's priority and make
resources exclusively held by the job available to other jobs.
-- Fix possible invalid memory reference in sched/backfill. Patch by Andriy
Grytsenko (Massive Solutions Limited).
-- Add select_jobinfo to the task launch RPC. Based upon patch by Andriy
Grytsenko (Massive Solutions Limited).
-- Add DefMemPerCPU/Node and MaxMemPerCPU/Node to partition configuration.
This improves flexibility when gang scheduling only specific partitions.
-- Added new enums to print out when a job is held by a QOS instead of an
association limit.
-- Enhancements to sched/backfill performance with select/cons_res plugin.
Patch from Bjørn-Helge Mevik, University of Oslo.
-- Correct job run time reported by smap for suspended jobs.
-- Improve job preemption logic to avoid preempting more jobs than needed.
-- Add contribs/arrayrun tool providing support for job arrays. Contributed by
Bjørn-Helge Mevik, University of Oslo. NOTE: Not currently packaged as RPM
and manual file editing is required.
-- When suspending a job, wait 2 seconds instead of 1 second between sending
SIGTSTP and SIGSTOP. Some MPI implementations were not stopping within the
1 second delay.
-- Add support for managing devices based upon Linux cgroup container. Based
upon patch by Yiannis Georgiou, Bull.
-- Fix memory buffering bug if an AllowGroups parameter of a partition has 100
or more users. Patch by Andriy Grytsenko (Massive Solutions Limited).
-- Fix bug in generic resource tracking of gres associated with specific CPUs.
Resources were being over-allocated.
-- On systems with front-end nodes (IBM BlueGene and Cray) limit batch jobs to
only one CPU of these shared resources.
-- Set SLURM_MEM_PER_CPU or SLURM_MEM_PER_NODE environment variables for both
interactive (salloc) and batch jobs if the job has a memory limit. For Cray
systems also set CRAY_AUTO_APRUN_OPTIONS environment variable with the
memory limit.
-- Fix bug in select/cons_res task distribution logic when tasks-per-node=0.
Patch from Rod Schultz, Bull.
-- Restore node configuration information (CPUs, memory, etc.) for powered
down nodes when the slurmctld daemon restarts rather than waiting for the
node to be restored to service and getting the information from the node
(NOTE: Only relevant if FastSchedule=0).
-- For Cray systems with the srun2aprun wrapper, rebuild the srun man page
identifying the srun options which are valid on that system.
-- BlueGene: Permit users to specify a separate connection type for each
dimension (e.g. "--conn-type=torus,mesh,torus").
-- Add the ability for a user to limit the number of leaf switches in a job's
allocation using the --switch option of salloc, sbatch and srun. There is
also a new SchedulerParameters value of max_switch_wait, which a SLURM
administrator can use to set a maximum job delay and prevent a user job
from blocking lower priority jobs for too long. Based on work by Rod
Schultz, Bull.
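
As an illustration of the job suspend entry above (SIGTSTP followed by SIGSTOP
after a delay), here is a minimal Python sketch. It is not SLURM's C
implementation: the function name suspend_pid and the injectable kill/sleep
parameters are assumptions made so the sequence can be exercised without
signalling a real process.

```python
import os
import signal
import time


def suspend_pid(pid, delay=2.0, kill=os.kill, sleep=time.sleep):
    """Send SIGTSTP, wait `delay` seconds, then send SIGSTOP.

    SIGTSTP can be caught, which gives the process (for example an MPI
    runtime) a chance to stop cleanly. SIGSTOP cannot be caught and
    forces the stop once the grace delay has elapsed.
    """
    kill(pid, signal.SIGTSTP)
    sleep(delay)
    kill(pid, signal.SIGSTOP)
```

A longer delay trades slower suspension for a better chance that slow
signal handlers finish before the unconditional stop arrives.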
* Changes in SLURM 2.3.0.pre6
=============================
-- NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE CONFIGURATION RESPONSE RPC
AS SHOWN BY "SCONTROL SHOW CONFIG". THIS FUNCTION WILL ONLY WORK WHEN THE
SERVER AND CLIENT ARE BOTH RUNNING SLURM VERSION 2.3.0.pre6
-- Modify job expansion logic to support licenses, generic resources, and
currently running job steps.
-- Added an rpath if using the --with-munge option of configure.
-- Add support for multiple sets of DEFAULT node, partition, and frontend
specifications in slurm.conf so that default values can be changed multiple
times as the configuration file is read.
-- BLUEGENE - Improved logic to place small blocks in free space before freeing
larger blocks.
-- Add optional argument to srun's --kill-on-bad-exit so that user can set
its value to zero and override a SLURM configuration parameter of
KillOnBadExit.
-- Fix bug in GraceTime support for preempted jobs that prevented proper
operation when more than one job was being preempted. Based on patch from
Bill Brophy, Bull.
-- Fix for running sview from a non-bluegene cluster to a bluegene cluster.
Regression from pre5.
-- If job's TMPDIR environment is not set or is not usable, reset to "/tmp".
Patch from Andriy Grytsenko (Massive Solutions Limited).
-- Remove logic for defunct RPC: DBD_GET_JOBS.
-- Propagate DebugFlag changes by scontrol to the plugins.
-- Improve accuracy of REQUEST_JOB_WILL_RUN start time with respect to higher
priority pending jobs.
-- Add -R/--reservation option to squeue command as a job filter.
-- Add scancel support for --clusters option.
-- Note that scontrol and sprio can only support a single cluster at one time.
-- Add support to salloc for a new environment variable SALLOC_KILL_CMD.
-- Add scontrol ability to increment or decrement a job or step time limit.
-- Add support for SLURM_TIME_FORMAT environment variable to control time
stamp output format. Work by Gerrit Renker, CSCS.
-- Fix error handling in mvapich plugin that could cause srun to enter an
infinite loop under rare circumstances.
-- Add support for multiple task plugins. Patch from Andriy Grytsenko (Massive
Solutions Limited).
-- Addition of per-user node/cpu limits for QOS's. Patch from Aaron Knister,
UMBC.
-- Fix logic for multiple job resize operations.
-- BLUEGENE - many fixes to make things work correctly on a L/P system.
-- Fix bug in layout of job step with --nodelist option plus node count. Old
code could allocate too few nodes.
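
The TMPDIR entry above can be sketched as follows. This is a hedged
illustration in Python, not SLURM's code: the function name effective_tmpdir
is hypothetical, and treating "usable" as "an existing directory that is
writable and searchable" is an assumption, since the entry does not define it.

```python
import os


def effective_tmpdir(env):
    """Return the TMPDIR a job should use, falling back to "/tmp".

    Assumption: a directory counts as usable when it exists and the
    caller has write and search permission on it.
    """
    tmp = env.get("TMPDIR")
    if not tmp:
        return "/tmp"
    if not os.path.isdir(tmp) or not os.access(tmp, os.W_OK | os.X_OK):
        return "/tmp"
    return tmp
```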
* Changes in SLURM 2.3.0.pre5
=============================
-- NOTE: THERE HAS BEEN A NEW FIELD ADDED TO THE JOB STATE FILE. UPGRADES FROM
VERSION 2.3.0-PRE4 WILL RESULT IN LOST JOBS UNLESS THE "orig_dependency"
FIELD IS REMOVED FROM JOB STATE SAVE/RESTORE LOGIC. ON CRAY SYSTEMS A NEW
"confirm_cookie" FIELD WAS ADDED AND HAS THE SAME EFFECT OF DISABLING JOB
STATE RESTORE.
-- BLUEGENE - Improve speed of start up when removing blocks at the beginning.
-- Correct init.d/slurm status to have non-zero exit code if ANY Slurm
daemon that should be running on the node is not running. Patch from Rod
Schultz, Bull.
-- Improve accuracy of response to "srun --test-only jobid=#".
-- Fix bug in front-end configurations which reports job_cnt_comp underflow
errors after slurmctld restarts.
-- Eliminate "error from _trigger_slurmctld_event in backup.c" due to lack of
event triggers.
-- Fix logic in BackupController to properly recover front-end node state and
avoid purging active jobs.
-- Added man pages to html pages and the new cpu_management.html page.
Submitted by Martin Perry / Rod Schultz, Bull.
-- Job dependency information will only show the currently active dependencies
rather than the original dependencies. From Dan Rusak, Bull.
-- Add RPCs to get the SPANK environment variables from the slurmctld daemon.
Patch from Andrej N. Gritsenko.
-- Updated plugins/task/cgroup/task_cgroup_cpuset.c to support newer
HWLOC_API_VERSION.
-- Do not build select/bluegene plugin if C++ compiler is not installed.
-- Add new configure option --with-srun2aprun to build an srun command
which is a wrapper over Cray's aprun command and supports many srun
options. Without this option, the srun command will advise the user
to use the aprun command.
-- Change container ID supported by proctrack plugin from 32-bit to 64-bit.
-- Added contribs/cray/libalps_test_programs.tar.gz with tools to validate
SLURM's logic used to support Cray systems.
-- Create RPM for srun command that is a wrapper for the Cray/ALPS aprun
command. Dependent upon .rpmmacros parameter of "%_with_srun2aprun".
-- Add configuration parameter MaxStepCount to limit effect of bad batch
scripts.
-- Fix for handling a 2.3 system talking to a 2.2 slurmctld.
-- Add contribs/lua/job_submit.license.lua script. Update job_submit and Lua
related documentation.
-- Test if _make_batch_script() is called with a NULL script.
-- Increase hostlist support from 24k to 64k nodes.
-- Renamed the Accounting Storage database's "DerivedExitString" job field to
"Comment". Provided backward compatible support for "DerivedExitString" in
the sacctmgr tool.
-- Added the ability to save the job's comment field to the Accounting
Storage db (to the formerly named, "DerivedExitString" job field). This
behavior is enabled by a new slurm.conf parameter:
AccountingStoreJobComment.
-- Fix srun to handle signals correctly when waiting for a step creation.
-- Preserve the last job ID across slurmctld daemon restarts even if the job
state file can not be fully recovered.
-- Made the hostlist functions able to handle dimensions of arbitrary size,
regardless of the number of dimensions in the cluster.
* Changes in SLURM 2.3.0.pre4
=============================
-- Add GraceTime to Partition and QOS data structures. Preempted jobs will be
given this time interval before termination. Work by Bill Brophy, Bull.
-- Add the ability for scontrol and sview to modify slurmctld DebugFlags
values.
-- Various Cray-specific patches:
- Fix a bug in distinguishing XT from XE.
- Avoids problems with empty nodenames on Cray.
- Check whether ALPS is hanging on to nodes, which happens if ALPS has not
yet cleaned up the node partition.
- Stops select/cray from clobbering node_ptr->reason.
- Perform 'safe' release of ALPS reservations using inventory and apkill.
- Compile-time sanity check for the apbasil and apkill files.
- Changes error handling in do_basil_release() (called by
select_g_job_fini()).
- Warn that salloc --no-shell option is not supported on Cray systems.
-- Add a reservation flag of "License_Only". If set, then jobs using the
reservation may use the licenses associated with it plus any compute nodes.
Otherwise the job is limited to the compute nodes associated with the
reservation.
-- Change slurm.conf node configuration parameter from "Procs" to "CPUs".
Both parameters will be supported for now.
-- BLUEGENE - fix for when user requests only midplane names with no count at
job submission time to process the node count correctly.
-- Fix job step resource allocation problem when both node and tasks counts
are specified. New logic selects nodes with larger CPU counts as needed.
-- BGQ - make it so srun wraps runjob (still under construction, but works
for most cases)
-- Permit a job's QOS and Comment field to both change in a single RPC. This
was previously disabled since Moab stored the QOS within the Comment field.
-- Add support for jobs to expand in size. Submit additional batch job with
the option "--dependency=expand:<jobid>". See web page "faq.html#job_size"
for details. Restrictions to be removed in the future.
-- Added --with-alps-emulation to configure, and also an optional cray.conf
to setup alps location and database information.
-- Modify PMI data types from 16-bits to 32-bits in order to support MPICH2
jobs with more than 65,536 tasks. Patch from Hongjia Cao, NUDT.
-- Set slurmd's soft process CPU limit equal to its hard limit and notify the
user if the limit is not infinite.
-- Added proctrack/cgroup and task/cgroup plugins from Matthieu Hautreux, CEA.
-- Fix slurmctld restart logic that could leave nodes in UNKNOWN state for a
longer time than necessary after restart.
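
To illustrate why the PMI entry above widens its data types, the following
Python sketch shows what happens to a count stored in a 16-bit versus a 32-bit
unsigned field. The helper name store_unsigned is hypothetical and the
wraparound is modelled with a bit mask; it does not reproduce the PMI code
itself.

```python
def store_unsigned(value, bits):
    """Model storing `value` in an unsigned field `bits` wide.

    Anything above the field's range wraps around, which is what a
    C uint16_t or uint32_t assignment does.
    """
    return value & ((1 << bits) - 1)


tasks = 70000                            # more than a 16-bit field can hold
as_uint16 = store_unsigned(tasks, 16)    # wraps: the count is corrupted
as_uint32 = store_unsigned(tasks, 32)    # preserved
```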
* Changes in SLURM 2.3.0.pre3
=============================
-- BGQ - Appears to work correctly in emulation mode, no sub blocks just yet.
-- Minor typos fixed
-- Various bug fixes for Cray systems.
-- Fix bug that when setting a compute node to idle state, it was failing to
set the system's up_node_bitmap.
-- BLUEGENE - code reorder
-- BLUEGENE - Now only one select plugin for all Bluegene systems.
-- Modify srun to set the SLURM_JOB_NAME environment variable when srun is
used to create a new job allocation. Not set when srun is used to create a
job step within an existing job allocation.
-- Modify init.d/slurm script to start multiple slurmd daemons per compute
node if so configured. Patch from Matthieu Hautreux, CEA.
-- Change license data structure counters from uint16_t to uint32_t to support
larger license counts.
* Changes in SLURM 2.3.0.pre2
=============================
-- Log a job's requeue or cancellation due to preemption to that job's stderr:
"*** JOB 65547 CANCELLED AT 2011-01-21T12:59:33 DUE TO PREEMPTION ***".
-- Added new job termination state of JOB_PREEMPTED, "PR" or "PREEMPTED" to
indicate job termination was due to preemption.
-- Optimize advanced reservations resource selection for computer topology.
The logic has been added to select/linear and select/cons_res, but will
not be enabled until the other select plugins are modified.
-- Disable deletion of partitions that have unfinished jobs (pending,
running or suspended states). Patch from Martin Perry, BULL.
-- In sview, disable the sorting of node records by name at startup for
clusters over 1000 nodes. Users can enable this by selecting the "Name"
tab. This change dramatically improves scalability of sview.
-- Report error when trying to change a node's state from scontrol for Cray
systems.
-- Do not attempt to read the batch script for non-batch jobs. This patch
eliminates some inappropriate error messages.
-- Preserve NodeHostName when reordering nodes due to system topology.
-- On Cray/ALPS systems do node inventory before scheduling jobs.
-- Disable some salloc options on Cray systems.
-- Disable scontrol's wait_job command on Cray systems.
-- Disable srun command on native Cray/ALPS systems.
-- Updated configure option "--enable-cray-emulation" (still under
development) to emulate a Cray XT/XE system, and auto-detect a real
Cray XT/XE system (removed the no longer needed --enable-cray configure
option). Building on native Cray systems requires the
cray-MySQL-devel-enterprise rpm and expat XML parser library/headers.
* Changes in SLURM 2.3.0.pre1
=============================
-- When a slurmctld closes its connection to the database, its registered
host and port are now removed.
-- Added slurmdbd.conf flag TrackSlurmctldDown which, if set, will mark idle
resources as down on a cluster when a slurmctld disconnects or is no
longer reachable.
-- Added support for more than one front-end node to run slurmd on
architectures where the slurmd does not execute on the compute nodes
(e.g. BlueGene). New configuration parameters FrontendNode and FrontendAddr
added. See "man slurm.conf" for more information.
-- With the scontrol show job command when using the --details option, show
a batch job's script.
-- Add ability to create reservations or partitions and submit batch jobs
using sview. Also add the ability to delete reservations and partitions.
-- Added new configuration parameter MaxJobId. Once reached, restart job ID
values at FirstJobId.
-- When restarting slurmctld with priority/basic, increment all job priorities
so the highest job priority becomes TOP_PRIORITY.
-- Prevent background salloc disconnecting terminal at termination. Patch by
Don Albert, Bull.
-- Fixed issue where preempt mode is skipped when creating a QOS. Patch by
Bill Brophy, Bull.
-- Fixed documentation (html) for PriorityUsageResetPeriod to match that in the
man pages. Patch by Nancy Kritkausky, Bull.
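
The MaxJobId entry above describes job IDs wrapping back to FirstJobId. A
minimal Python sketch of that wraparound follows. The function name
next_job_id is hypothetical, and whether MaxJobId itself is ever assigned
(treated as inclusive here) is an assumption; a real scheduler would also have
to skip IDs still held by active jobs, which this sketch ignores.

```python
def next_job_id(current, first_job_id, max_job_id):
    """Return the job ID that follows `current`.

    Assumption: max_job_id is the last ID that may be assigned; the
    sequence then restarts at first_job_id.
    """
    candidate = current + 1
    if candidate > max_job_id:
        return first_job_id
    return candidate
```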
* Changes in SLURM 2.2.7
========================
-- Eliminate zombie process created if salloc exits with stopped child
process. Patch from Gerrit Renker, CSCS.
-- With default configuration on non-Cray systems, enable salloc to be
spawned as a background process. Based upon work by Don Albert (Bull) and
Gerrit Renker (CSCS).
-- Fixed regression from 2.2.4 in accounting where an inherited limit
would not be set correctly in the added child association.
-- Fixed issue with accounting when asking for jobs with a hostlist.
-- Avoid clearing a node's Arch, OS, BootTime and SlurmdStartTime when
"scontrol reconfig" is run. Patch from Martin Perry, Bull.
* Changes in SLURM 2.2.6
========================
-- Fix displaying of account coordinators with sacctmgr. Possibility to show
deleted accounts. Only a cosmetic issue, since the accounts are already
deleted, and have no associations.
-- Prevent opaque ncurses WINDOW struct on OS X 10.6.
-- Fix issue with accounting when using PrivateData=jobs... users would not be
able to view their own jobs unless they were admin or coordinators which is
obviously wrong.
-- Fix bug in node stat if slurmctld is restarted while nodes are in the
process of being powered up. Patch from Andriy Grytsenko.
-- Change maximum batch script size from 128k to 4M.
-- Get slurmd -f option working. Patch from Andriy Grytsenko.
-- Fix for linking problem on OSX. Patches from Jon Bringhurst (LANL) and
Tyler Strickland.
-- Reset a job's priority to zero (suspended) when Moab requeues the job.
Patch from Par Andersson, NSC.
-- When enforcing accounting, fix polling for unknown uids for users after
the slurmctld started. Previously one would have to issue a reconfigure
to the slurmctld to have it look for new uids.
-- BLUEGENE - Fix issue where, if a block went into an error state, accounting
wasn't updated correctly when the block was resumed.
-- Synchronize power-save module better with scheduler. Patch from
Andriy Grytsenko (Massive Solutions Limited).
-- Avoid SEGV in association logic with user=NULL. Patch from
Andriy Grytsenko (Massive Solutions Limited).
-- Fixed issue in accounting where it was possible for a new
association/wckey to be set incorrectly as a default if the new object
was added after an original default object already existed. Previously
the slurmctld would need to be restarted to fix the issue.
-- Updated the Normalized Usage section in priority_multifactor.shtml.
-- Disable use of SQUEUE_FORMAT env var if squeue -l, -o, or -s option is
used. Patch from Aaron Knister (UMBC).
* Changes in SLURM 2.2.5
========================
-- Correct init.d/slurm status to have non-zero exit code if ANY Slurm
daemon that should be running on the node is not running. Patch from Rod
Schultz, Bull.
-- Improve accuracy of response to "srun --test-only jobid=#".
-- Correct logic to properly support --ntasks-per-node option in the
select/cons_res plugin. Patch from Rod Schultz, Bull.
-- Fix bug in select/cons_res with respect to generic resource (gres)
scheduling which prevented some jobs from starting as soon as possible.
-- Fix memory leak in select/cons_res when backfill scheduling generic
resources (gres).
-- Fix for when configuring a node with more resources than in real life
and using task/affinity.
-- Fix so slurmctld will pack correctly 2.1 step information. (Only needed if
a 2.1 client is talking to a 2.2 slurmctld.)
-- Set powered down node's state to IDLE+POWER after slurmctld restart instead
of leaving in UNKNOWN+POWER. Patch from Andrej Gritsenko.
-- Fix bug where srun's executable is not on its current search path, but
can be found in the user's default search path. Modify slurmstepd to find
the executable. Patch from Andrej Gritsenko.
-- Make sview display correct cpu count for steps.
-- BLUEGENE - when running in overlap mode make sure to check the connection
type so you can create overlapping blocks on the exact same nodes with
different connection types (i.e. one torus, one mesh).
-- Fix memory leak if MPI ports are reserved (for OpenMPI) and srun's
--resv-ports option is used.
-- Fix some anomalies in select/cons_res task layout when using the
--cpus-per-task option. Patch from Martin Perry, Bull.
-- Improve backfill scheduling logic when job specifies --ntasks-per-node and
--mem-per-cpu options on a heterogeneous cluster. Patch from Bjorn-Helge
Mevik, University of Oslo.
-- Print warning message if srun specifies --cpus-per-task larger than used
to create job allocation.
-- Fix issue when changing a user's name in accounting: if using wckeys, the
change would execute correctly, but a bad memcopy would core the DBD. No
information would be lost or corrupted, but you would need to restart the
DBD.
* Changes in SLURM 2.2.4
========================
-- For batch jobs for which the Prolog fails, substitute the job ID for any
"%j" in the job's output or error file specification.
-- Add licenses field to the sview reservation information.
-- BLUEGENE - Fix for handling extremely overloaded system on Dynamic system
dealing with starting jobs on overlapping blocks. Previous fallout
was job would be requeued. (happens very rarely)
-- In accounting_storage/filetxt plugin, substitute spaces within job names,
step names, and account names with an underscore to ensure proper parsing.
-- When building contribs/perlapi ignore both INSTALL_BASE and PERL_MM_OPT.
Use PREFIX instead to avoid build errors from multiple installation
specifications.
-- Add job_submit/cnode plugin to support resource reservations of less than
a full midplane on BlueGene computers. Treat cnodes as licenses which can
be reserved and are consumed by jobs. This reservation mechanism for less
than an entire midplane is still under development.
-- Clear a job's "reason" field when a held job is released.
-- When releasing a held job, calculate a new priority for it rather than
just setting the priority to 1.
-- Fix for sview started on a non-bluegene system to pick colors correctly
when talking to a real bluegene system.
-- Improve sched/backfill's expected start time calculation.
-- Prevent abort of sacctmgr for dump command with invalid (or no) filename.
-- Improve handling of job updates when using limits in accounting, and
updating jobs as a non-admin user.
-- Fix for "squeue --states=all" option. Bug would show no jobs.
-- Schedule jobs with reservations before those without reservations.
-- Fix squeue/scancel to query correctly against accounts of different case.
-- Abort an srun command when its associated job gets aborted due to a
dependency that can not be satisfied.
-- In jobcomp plugins, report start time of zero if pending job is cancelled.
Previously may report expected start time.
-- Fixed sacctmgr man to state correct variables.
-- Select nodes based upon their Weight when job allocation requests include
a constraint field with a count (e.g. "srun --constraint=gpu*2 -N4 a.out").
-- Add support for user names that are entirely numeric and do not treat them
as UID values. Patch from Dennis Leepow.
-- Patch to un/pack double values properly if negative value. Patch from
Dennis Leepow
-- Do not reset a job's priority when requeued or suspended.
-- Fix problem that could let new jobs start on a node in DRAINED state.
-- Fix cosmetic sacctmgr issue where if the user you are trying to add
doesn't exist in the /etc/passwd file and the account you are trying
to add them to doesn't exist it would print (null) instead of the bad
account name.
-- Fix associations/qos so that when adding back a previously deleted object,
the object will be cleared of all old limits.
-- BLUEGENE - Added back a lock when creating dynamic blocks to be more thread
safe on larger systems with heavy load.
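
The accounting_storage/filetxt entry above replaces spaces in names with
underscores so that records parse correctly. The Python sketch below shows the
idea under stated assumptions: the three-field, space-separated record layout
and the names sanitize_field and format_record are hypothetical and do not
reproduce the plugin's actual file format.

```python
def sanitize_field(value):
    """Replace spaces so the value stays a single space-delimited token."""
    return value.replace(" ", "_")


def format_record(job_id, job_name, account):
    # Hypothetical record layout: three fields separated by single spaces.
    return " ".join(
        [str(job_id), sanitize_field(job_name), sanitize_field(account)]
    )


record = format_record(42, "my test job", "physics dept")
fields = record.split(" ")  # still exactly three fields after sanitizing
```

Without the substitution, splitting the same record on spaces would yield six
tokens and a parser could not tell where the job name ends.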
* Changes in SLURM 2.2.3
========================
-- Update srun, salloc, and sbatch man page description of --distribution
option. Patches from Rod Schultz, Bull.
-- Applied patch from Martin Perry to fix "Incorrect results for task/affinity
block second distribution and cpus-per-task > 1" bug.
-- Avoid setting a job's eligible time while held (priority == 0).
-- Substantial performance improvement to backfill scheduling. Patch from
Bjorn-Helge Mevik, University of Oslo.
-- Make timeout for communications to the slurmctld be based upon the
MessageTimeout configuration parameter rather than always 3 seconds.
Patch from Matthieu Hautreux, CEA.
-- Add new scontrol option of "show aliases" to report every NodeName that is
associated with a given NodeHostName when running multiple slurmd daemons
per compute node (typically used for testing purposes). Patch from
Matthieu Hautreux, CEA.
-- Fix for handling job names with a "'" in the name within MySQL accounting.
Patch from Gerrit Renker, CSCS.
-- Modify condition under which salloc execution is delayed until moved to the
foreground. Patch from Gerrit Renker, CSCS.
Job control for interactive salloc sessions: only if ...
a) input is from a terminal (stdin has valid termios attributes),
b) controlling terminal exists (non-negative tpgid),
c) salloc is not run in allocation-only (--no-shell) mode,
d) salloc runs in its own process group (true in interactive
shells that support job control),
e) salloc has been configured at compile-time to support background
execution and is not currently in the background process group.
-- Abort salloc if no controlling terminal and --no-shell option is not used
("setsid salloc ..." is disabled). Patch from Gerrit Renker, CSCS.
-- Fix to gang scheduling logic which could cause jobs to not be suspended
or resumed when appropriate.
-- Applied patch from Martin Perry to fix "Slurmd abort when using task