This file describes changes in recent versions of Slurm. It primarily
documents those changes that are of interest to users and administrators.

* Changes in Slurm 18.08.6
==========================
 -- Added parsing of -H flag with scancel.
 -- Fix slurmsmwd build on 32-bit systems.
 -- acct_gather_filesystem/lustre - add support for Lustre 2.12 client.
 -- Fix per-partition TRES factors/priority
 -- Fix per-partition NICE priority
 -- Fix partition access check validation for multi-partition job submissions.
 -- Prevent segfault on empty response in 'scontrol show dwstat'.
 -- node_features/knl_cray plugin - Preserve node's active features if it has
    already booted when slurmctld daemon is reconfigured.
 -- Detect missing burst buffer script and reject job.
 -- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart
    to prevent underflow.
 -- Avoid errors from packing accounting_storage_mysql.so when RPM is built
    without mysql support.
 -- Remove deprecated -t option from slurmctld --help.
 -- acct_gather_filesystem/lustre - fix stats gathering.
 -- Enforce documented default usage start and end times when querying jobs from
    the database.
 -- Fix issues when querying running jobs from the database.
 -- Deny sacct request where start time is later than the end time requested.
 -- Fix sacct verbose output about time and states queried.
 -- burst_buffer/cray - allow 'scancel --hurry <jobid>' to tear down a burst
    buffer that is currently staging data out.
 -- X11 forwarding - allow setup if the DISPLAY environment variable lacks
    a screen number. (Permit both "localhost:10.0" and "localhost:10".)
 -- docs - change HTML title to include the page title or man page name.
 -- X11 forwarding - fix an unnecessary error message when using the
    local_xauthority X11Parameters option.
 -- Add use_raw_hostname to X11Parameters.
 -- Fix smail so it passes job arrays to seff correctly.
 -- Don't check InactiveLimit for salloc --no-shell jobs.
 -- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch.
 -- Remove drain state when node doesn't reboot by ResumeTimeout.
 -- Fix considering "resuming" nodes in scheduling.
 -- Do not kill suspended jobs due to exceeding time limit.
 -- Add NoAddrCache CommunicationParameter.
 -- Don't ping powering up cloud nodes.
 -- Add cloud_dns SlurmctldParameter.
 -- Consider --sbindir configure option as the default path to find slurmstepd.
 -- Fix node state printing of DRAINED$
 -- Fix spamming dbd of down/drained nodes in maintenance reservation.
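
The X11 forwarding entry above permits a DISPLAY value with or without a
screen number. A minimal Python sketch of such a tolerant parser follows. The
name parse_display is hypothetical, the default of screen 0 is an assumption
taken from the usual X11 convention, and this is not Slurm's own code:

```python
import re

# Hypothetical helper, not part of Slurm: split a DISPLAY value into
# (host, display number, screen number). The ".screen" suffix is optional.
_DISPLAY_RE = re.compile(
    r"^(?P<host>[^:]*):(?P<display>\d+)(?:\.(?P<screen>\d+))?$"
)


def parse_display(value):
    match = _DISPLAY_RE.match(value)
    if match is None:
        raise ValueError(f"unparsable DISPLAY value: {value!r}")
    screen = match.group("screen")
    # Assumption: a missing screen number means screen 0.
    return (
        match.group("host"),
        int(match.group("display")),
        int(screen) if screen else 0,
    )
```

With this sketch, "localhost:10.0" and "localhost:10" parse to the same
host, display and screen.
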
* Changes in Slurm 18.08.5-2
============================
 -- Fix Perl build for 32-bit systems.

* Changes in Slurm 18.08.5
==========================
 -- Backfill - If a job has a time_limit, better estimate the end time of the
    job if OverTimeLimit is Unlimited.
 -- Fix "sacctmgr show events event=cluster"
 -- Fix sacctmgr show runawayjobs from sibling cluster
 -- Avoid bit offset of -1 in call to bit_nclear().
 -- Ensure that "hbm" is a configured GresType on knl systems.
 -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow gres
    other than knl.
 -- cons_res: Prevent overflow on multiply.
 -- Better debug for bad values in gres.conf.
 -- Fix double accounting of energy at end of job.
 -- Read gres.conf for cloud nodes on slurmctld.
 -- Don't assume the first node of a job is the batch host when purging jobs
    from a node.
 -- Better debugging when a job doesn't have a job_resrcs ptr.
 -- Store ave watts in energy plugins.
 -- Add XCC plugin for reading Lenovo Power.
 -- Fix minor memory leak when scheduling rebootable nodes.
 -- Fix debug2 prefix for sched log.
 -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job.
 -- sbatch - search current working directory first for job script.
 -- Make it so held jobs reset the AccrueTime and do not count against any
    AccrueTime limits.
 -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the
    job sorting algorithm for scheduling heterogeneous jobs.
 -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures.
 -- Fix segfault with job arrays using X11 forwarding.
 -- Revert regression caused by e0ee1c7054 which caused negative values and
    values starting with a decimal to be invalid for PriorityWeightTRES and
    TRESBillingWeight.
 -- Fix possibility to update a job's reservation to none.
 -- Suppress connection errors to primary slurmdbd when backup dbd is active.
 -- Suppress connection errors to primary db when backup db kicks in
 -- Add missing fields for sacct --completion when using jobcomp/filetxt.
 -- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields
    when JobAcctGatherType=jobacct_gather/cgroup.
 -- Fix srun printing invalid option message twice.
 -- Remove unused -b flag from getopt call in sbatch.
 -- Disable reporting of node TRES in sreport.
 -- Re-enable features combined by OR within parentheses for non-knl setups.
 -- Prevent sending duplicate requests to reboot a node before ResumeTimeout.
 -- Down nodes that don't reboot by ResumeTimeout.
 -- Update seff to reflect API change from rss_max to tres_usage_in_max.
 -- Add missing TRES constants to perl API.
 -- Fix issue where sacct would return incorrect array tasks when querying
    specific tasks.
 -- Add missing variables to slurmdb_stats_t in the perlapi.
 -- Fix nodes not getting reboot RPC when job requires reboot of nodes.
 -- Fix failure to update the partition list of a job.
 -- Use slurm.conf gres ids instead of gres.conf names to get a gres type name.
 -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc.
    CVE-2019-6438.
* Changes in Slurm 18.08.4
==========================
 -- burst_buffer/cray - avoid launching a job that would be immediately
    cancelled due to a DataWarp failure.
 -- Fix message sent to user to display preempted instead of time limit when
    a job is preempted.
 -- Fix memory leak when a failure happens processing a node's gres config.
 -- Improve error message when failures happen processing a node's gres config.
 -- When building rpms ignore redundant standard rpaths and insecure relative
    rpaths, for RHEL based distros which use "check-rpaths" tool.
 -- Don't skip jobs in scontrol hold.
 -- Avoid locking the job_list when unneeded.
 -- Allow --cpu-bind=verbose to be used with SLURM_HINT environment variable.
 -- Make it so fixing runaway jobs will not alter the same job requeued
    when not runaway.
 -- Avoid checking state when searching for runaway jobs.
 -- Remove redundant check for end time of job when searching for runaway jobs.
 -- Make sure that we properly check for runawayjobs where another job might
    have the same id (for example, if a job was requeued) by also checking the
    submit time.
 -- Add scontrol update job ResetAccrueTime to clear a job's time
    previously accrued for priority.
 -- cons_res: Delay exiting cr_job_test until after cores/cpus are calculated
    and distributed.
 -- Fix bug where binary in cwd would trump binary in PATH with test_exec.
 -- Fix check to test printf("%s\n", NULL); to not require
    -Wno-format-truncation CFLAG.
 -- Fix JobAcctGatherParams=UsePss to report the correct usage.
 -- Fix minor memory leak in pmix plugin.
 -- Fix minor memory leak in slurmctld when reading configuration.
 -- Handle return codes correctly from pthread_* functions.
 -- Fix minor memory leak when a slurmd is unable to contact a slurmctld
    when trying to register.
 -- Fix sreport sizesbyaccount report when using Flatview and accounts.
 -- Fix incorrect shift when dealing with node weights and scheduling.
 -- libslurm/perl - Fix segfault caused by incorrect hv_to_slurm_ctl_conf.
 -- Add qos and assoc options to confirmation dialogs.
 -- Handle updating identical license or partition information correctly.
 -- Make sure accounts and QOS' are all lower case to match documentation
    when read in from the slurm.conf file.
 -- Don't consider partitions without enough nodes in reservation,
    main scheduler.
 -- Set SLURM_NTASKS correctly if having to determine from other options.
 -- Removed GCP scripts from contribs. Now located at:
    https://github.com/SchedMD/slurm-gcp.
 -- Don't check existence of srun --prolog or --epilog executables when set to
    "none" and SLURM_TEST_EXEC is used.
 -- Add "P" suffix support to job and step tres specifications.
 -- When doing a reconfigure handle QOS' GrpJobsAccrue correctly.
 -- Remove unneeded extra parentheses from sh5util.
 -- Fix jobacct_gather/cgroup to work correctly when more than one task is
    started on a node.
 -- If requesting --ntasks-per-node with no tasks set tasks correctly.
 -- Accept modifiers for TRES originally added in 6f0342e0358.
 -- Don't remove reservation on slurmctld restart if nodes are removed from
    configuration.
 -- Fix bad xfree in task/cgroup.
 -- Fix removing counters if a job array isn't subject to limits and is
    canceled while pending.
 -- Make sure SLURM_NTASKS_PER_NODE is set correctly when env is overwritten
    by the command line.
 -- Clean up step on a failed node correctly.
 -- mpi/pmix: Fixed the logging of collective state.
 -- mpi/pmix: Make multi-slurmd work correctly when using ring communication.
 -- mpi/pmix: Fix double invocation of the PMIx lib fence callback.
 -- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll.
 -- Fix race condition in route/topology when the slurmctld is reconfigured.
 -- In route/topology validate the slurmctld doesn't try to initialize the
    node system.
 -- Fix issue when requesting invalid gres.
 -- Validate job_ptr in backfill before restoring preempt state.
 -- Fix issue when job's environment is minimal and only contains variables
    Slurm is going to replace internally.
 -- When handling runaway jobs remove all usage before rollup to remove any
    time that wasn't existent instead of just updating lines that have time
    with a lesser time.
 -- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET in the
    environment if the corresponding command line options are used.
 -- slurmd - fix handling of the -f flag to specify alternate config file
    locations.
 -- Fix scheduling logic to avoid using nodes that require a reboot for KNL
    node change when possible.
 -- Fix scheduling logic bug. There should have been a test for _not_
    NODE_SET_REBOOT to continue.
 -- Fix a scheduling logic bug with respect to XOR operation support when there
    are down nodes.
 -- If there is a constraint construct of the form "[...&...]"
    then an error is generated if more than one of those specifications
    contains KNL NUMA or MCDRAM modes.
 -- Fix stepd segfault race if slurmctld hasn't registered with the launching
    slurmd yet delivering its TRES list.
 -- Add SchedulerParameters option of bf_ignore_newly_avail_nodes to avoid
    scheduling lower priority jobs on resources that become available during
    the backfill scheduling cycle when bf_continue is enabled.
 -- Decrement message_connections in stepd code on error path correctly.
 -- Decrease an error message to be debug.
 -- Fix missing suffixes in squeue.
 -- pam_slurm_adopt - send an error message to the user if no Slurm jobs
    can be located on the node.
 -- Run SlurmctldPrimaryOffProg when the primary slurmctld process shuts down.
 -- job_submit/lua: Add several slurmctld return codes.
 -- job_submit/lua: Add user/group info to jobs.
 -- Fix formatting issues when printing uint64_t.
 -- Bump RLIMIT_NOFILE for daemons in systemd services.
 -- Expand %x in job name in 'scontrol show job'.
 -- salloc/sbatch/srun - print warning if mutually exclusive options of --mem
    and --mem-per-cpu are both set.
* Changes in Slurm 18.08.3
==========================
 -- Fix regression in 18.08.1 that caused dbd messages to not be queued up
 -- Fix regression in 18.08.1 that can cause a slurmctld crash when splitting
    job array elements.
* Changes in Slurm 18.08.2
==========================
 -- Correctly initialize variable in env_array_user_default().
 -- Remove race condition when signaling starting step.
 -- Fix issue where 17.11 jobs using GRES didn't initialize new 18.08
    structures after unpack.
 -- Stop removing nodes once the minimum CPU or node count for the job is
    reached in the cons_res plugin.
 -- Process any changes to MinJobAge and SlurmdTimeout in the slurmctld when
    it is reconfigured to determine changes in its background timers.
 -- Use previous SlurmdTimeout in the slurmctld after a reconfigure to
    determine the time a node has been down.
 -- Fix multi-cluster srun between clusters with different SelectType plugins.
 -- Fix removing job licenses on reconfig/restart when configured license
    counts are 0.
 -- If a job requested multiple licenses and one license was removed then on
    a reconfigure/restart all of the licenses -- including the valid ones
    would be removed.
 -- Fix issue where job's license string wasn't updated after a restart when
    licenses were removed or added.
 -- Add allow_zero_lic to SchedulerParameters.
 -- Avoid scheduling tasks in excess of ArrayTaskThrottle when canceling tasks
    of an array.
 -- Fix jobs that request memory per node and task count that can't be
    scheduled right away.
 -- Avoid infinite loop with jobacct_gather/linux when pids wrap around
    /proc/sys/kernel/pid_max.
 -- Fix --parsable2 output for sacct and sstat commands to remove a stray
    trailing delimiter.
 -- When modifying a user's name in sacctmgr enforce PreserveCaseUser.
 -- When adding a coordinator or user that was once deleted enforce
    PreserveCaseUser.
 -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than
    a job's --mem-per-cpu and also -c is greater than 1.
 -- Set AccrueTime correctly when MaxJobsAccrue is disabled and BeginTime has
    not been established.
 -- Correctly account for job arrays for new {Max/Grp}JobsAccrue limits.
* Changes in Slurm 18.08.1
==========================
 -- Remove commented-out parts of man pages related to cons_tres work in 19.05,
    as these were showing up on the web version due to a syntax error.
 -- Prevent slurmctld performance issues in main background loop if multiple
    backup controllers are unavailable.
 -- Add missing user read association lock in burst_buffer/cray during init().
 -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
 -- Fix creation of step hwloc xml file after cpuset cgroup has been
    created.
 -- Add userspace as a valid default governor.
 -- Add timers to group_cache_lookup so that, if lookups are slow, use of
    LaunchParameters=send_gids is advised.
 -- Fix SLURM_STEP_GRES=none to work correctly.
 -- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg.
 -- Fix potential double free when a failure happens when unpacking a
    node_registration_status_msg.
 -- Fix sacctmgr show runaways.
 -- Removed non-POSIX append operator from configure script for non-bash
    support.
 -- Fix sacct to not print huge reserve times when the job was never eligible.
 -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
    burst buffer.
 -- burst_buffer/cray - Update burst buffers when an association or qos
    is removed from the system.
 -- Remove documentation for deprecated Cray/ALPS systems. Please switch to
    Native Cray mode instead.
 -- Completely copy features when copying the list in the slurmctld.
 -- PMIX - Fix issue with packing processes when using an arbitrary task
    distribution.
 -- Fix hostlists to be able to handle nodenames with '-' in them surrounded
    by integers.
 -- Correct the job CPU count allocated.
 -- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit.
 -- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill
    (See RELEASE_NOTES).
 -- Prevent abort when using Cray node features plugin on non-knl.
 -- Add ability to reboot down nodes with scontrol reboot_nodes.
 -- Protect against sending to the slurmdbd if the connection has gone away.
 -- Fix invalid read when not using backup slurmctlds.
 -- Prevent acct coordinators from changing default acct on add user.
 -- Don't allow scontrol top to modify job priorities when priority == 1.
 -- slurmsmwd - change parsing code to handle systems with the svid or inst
    fields set in xtconsumer output.
 -- Fix infinite loop in slurmctld if GRES is specified without a count.
 -- sacct: Print error when unknown arguments are found.
 -- Fix checking missing return codes when unpacking structures.
 -- Fix slurm.spec-legacy including slurmsmwd
 -- More explicit error message when cgroup oom-kill events detected.
 -- When updating an association and are unable to find parent association
    initialize old fairshare association pointer correctly.
 -- Wrap slurm_cond_signal() calls with mutexes where needed.
 -- Fix timeout with resends in slurm_send_only_node_msg.
 -- Fix pam_slurm_adopt to honor action_adopt_failure.
 -- Have the slurmd recreate the hwloc xml file for the full system on restart.
 -- sdiag - correct the units for the gettimeofday() stat to microseconds.
 -- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName.
 -- smail - use SLURM_CLUSTER_NAME environment variable.
 -- job_submit/lua - expose argc/argv options through lua interface.
 -- slurmdbd - prevent false-positive warning about innodb settings having
    been set too low if they're actually set over 2GB.
* Changes in Slurm 18.08.0
==========================
 -- Fix segfault on job arrays when starting controller without dbd up.
 -- Fix pmi2 to build with gcc 8.0+.
 -- Remove the development snapshot of select/cons_tres plugin.
 -- Fix slurmd -C to not print benign error from xcpuinfo.
 -- Fix potential double locks in the assoc_mgr.
 -- Fix sacct truncate flag behavior. Truncated pending jobs will always
    return a start and end time set to the window end time, so elapsed
    time is 0.
 -- Fix extern step hanging forever when canceled right after creation.
 -- sdiag - add slurmctld agent count.
 -- Remove requirement to have cgroup_allowed_devices_file.conf in order to
    constrain devices. By default all devices are allowed, and GRES that are
    associated with a device file but are not requested are restricted.
 -- Fix proper alignment of clauses when determining if more nodes are needed
    for an allocation.
 -- Fix race condition when canceling a federation job that just started
    running.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
    when using --hint=nomultithread.
 -- Fix left over socket file when step is ending and using pmi2 with
    %n or %h in the spool dir.
 -- Don't remove hwloc full system xml file when shutting down the slurmd.
 -- Fix segfault that could happen with a het job when it was canceled while
    starting.
 -- Fix scan-build false-positive warning about invalid memory access in the
    _ping_controller() function.
 -- Add control_inx value to trigger_info_msg_t to permit future work in the
    trigger management code to distinguish which of multiple backup controllers
    has changed state.
* Changes in Slurm 18.08.0rc1
==============================
 -- Add TimelimitRaw sacct output field to display timelimit numbers.
 -- Fix job array preemption during backfill scheduling.
 -- Fix scontrol -o show assoc output.
 -- Add support for sacct --whole-hetjob=[yes|no] option.
 -- Make salloc handle node requests the same as sbatch.
 -- Add shutdown_on_reboot SlurmdParameter to control whether the slurmd will
    shut itself down or not when a reboot request is received.
 -- Add cancel_reboot scontrol option to cancel pending reboot of nodes.
 -- Make Users case insensitive in the database based on
    Parameters=PreserveCaseUser in the slurmdbd.conf.
 -- Improve scheduling when dealing with node_features that could have a
    boot delay.
 -- Fix issue where a failed step launch left a bunch of '(null)' strings
    in the step record for usage.
 -- Changed the default AuthType for slurmdbd to auth/munge.
 -- Make it so libpmi.so doesn't link to libslurm.so.$apiversion.
 -- Added 'remote-fs.target' to After directive of slurmd.service file.
 -- Fix filetxt plugin to handle it when you aren't running a jobacct_gather
    plugin.
 -- Remove drain on node when reboot nextstate used.
 -- Speed up pack of job's qos.
 -- Fix race condition when trying to update reservation in the database.
 -- For the PrologFlags slurm.conf option, make NoHold mutually exclusive with
    Contain and/or X11 options.
 -- Revise the handling of SlurmctldSyslogLevel and SlurmdSyslogLevel options
    in slurm.conf and DebugLevelSyslog in slurmdbd.conf.
 -- Gate reading the cgroup.conf file.
 -- Gate reading the acct_gather_* plugins.
 -- Add sacctmgr options to prevent/manage job queue stuffing:
    - GrpJobsAccrue=<max_jobs>
      Maximum number of pending jobs in aggregate able to accrue age priority
      for this association and all associations which are children of this
      association. To clear a previously set value use the modify command with
      a new value of -1.
    - MaxJobsAccrue=<max_jobs>
      Maximum number of pending jobs able to accrue age priority at any given
      time for the given association. This is overridden if set directly on a
      user. Default is the cluster's limit. To clear a previously set value use
      the modify command with a new value of -1.
    - MinPrioThreshold
      Minimum priority required to reserve resources when scheduling.
* Changes in Slurm 18.08.0pre2
==============================
 -- Remove support for "ChosLoc" configuration parameter.
 -- Configuration parameters "ControlMachine", "ControlAddr", "BackupController"
    and "BackupAddr" replaced by an ordered list of "SlurmctldHost" records
    with the optional address appended to the name enclosed in parentheses.
    For example: "SlurmctldHost=head(12.34.56.78)". An arbitrary number of
    backup servers can be configured.
 -- When a pending job's state includes "UnavailableNodes" do not include the
    nodes in FUTURE state.
 -- Remove --immediate option from sbatch.
 -- Add infrastructure for per-job and per-step TRES parameters: tres-per-job,
    tres-per-node, tres-per-socket, tres-per-task, cpus-per-tres, mem-per-tres,
    tres-bind and tres-freq. These new parameters are not currently used, but
    have been added to the appropriate RPCs.
 -- Add DefCpuPerGpu and DefMemPerGpu to global and per-partition configuration
    parameters. Shown in scontrol/sview as "JobDefaults=...". NOTE: These
    options are for future use and currently have no effect.
 -- Always set the correct status on job update in mysql.
 -- Add ValidateMode configuration parameter to knl_cray.conf for static
    MCDRAM/NUMA configurations.
 -- Fix security issue in accounting_storage/mysql plugin by always escaping
    strings within the slurmdbd. CVE-2018-7033.
 -- Disable local PTY output processing when using 'srun --unbuffered'. This
    prevents the PTY subsystem from inserting extraneous \r characters into
    the output stream.
 -- Change the column name for the %U (User ID) field in squeue to 'UID'.
 -- CRAY - Add CheckGhalQuiesce to the CommunicationParameters.
 -- When a process is core dumping, avoid terminating other processes in that
    task group. This fixes a problem with writing out incomplete OpenMP core
    files.
 -- CPU frequency management enhancements: If scaling_available_frequencies
    file is not available, then derive values from scaling_min_freq and
    scaling_max_freq values. If cpuinfo_cur_freq file is not available then
    try to use scaling_cur_freq.
 -- Add pending jobs count to sdiag output.
 -- Fix update job function. There were some inconsistencies in the behavior
    that caused time limits to be modified when swapping QOS, bad permissions
    check for a coordinator and AllowQOS and DenyQOS were not enforced on
    job update.
 -- Add configuration parameters SlurmctldPrimaryOnProg and
    SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld
    daemon becomes the primary server or goes from primary to backup mode.
 -- Add configuration parameter SlurmctldAddr for use with virtual IP to manage
    backup slurmctld daemons.
 -- Explicitly shutdown the slurmd process when instructed to reboot.
 -- Add ability to create/update partition with TRESBillingWeights through
    scontrol.
 -- Calculate TRES billing values at submission so that billing limits can be
    enforced at submission with QOS DenyOnLimit.
 -- Add node_features plugin function "node_features_p_reboot_weight()" to
    return the node weight to be used for a compute node that requires reboot
    for use (e.g. to change the NUMA mode of a KNL node).
 -- Add NodeRebootWeight parameter to knl.conf configuration file.
 -- Fix insecure handling of job requested gid field. CVE-2018-10995.
 -- Fix srun to return highest signal of any task.
 -- Completely remove "gres" field from step record. Use "tres_per_node",
    "tres_per_socket", etc.
 -- Add "Links" parameter to gres.conf configuration file.
 -- Force slurm_mktime() to set tm_isdst to -1 so anyone using the function
    doesn't forget to set it.
 -- burst_buffer.conf - Add SetExecHost flag to enable burst buffer access
    from the login node for interactive jobs.
 -- Append ", with requeued tasks" to job array "end" emails if any tasks in the
    array were requeued. This is a hint to use "sacct --duplicates" to see the
    whole picture of the array job.
 -- Add ResumeFailProgram slurm.conf option to specify a program that is called
    when a node fails to respond by ResumeTimeout.
 -- Add new job pending reason of "ReqNodeNotAvail, reserved for maintenance".
 -- Remove AdminComment += syntax from 'scontrol update job'.
 -- sched/backfill: Reset job time limit if needed for deadline scheduling.
 -- For heterogeneous job component with required nodes, explicitly exclude
    those nodes from all other job components.
 -- Add name of partition used to output of srun --test-only (valuable
    for jobs submitted to multiple partitions).
 -- If MailProg is not configured and "/bin/mail" (the default) does not exist,
    but "/usr/bin/mail" does exist then use "/usr/bin/mail" as a default value.
 -- sdiag output now reports outgoing slurmctld message queue contents.
 -- Fix issue in performance when reading slurm conf having nodes with features.
 -- Make it so the slurmdbd's pid file gets created before initializing
    the database.
 -- Improve escaping special characters on user commands when specifying paths.
 -- Fix directory names with special char '\' that are not handled correctly.
 -- Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable
    filtering of CPUs with respect to generic resource locality. This option is
    currently required to use more CPUs than are bound to a GRES (i.e. if a GPU
    is bound to the CPUs on one socket, but resources on more than one socket
    are required to run the job). This option may permit a job to be allocated
    resources sooner than otherwise possible, but may result in lower job
    performance.
 -- SlurmDBD - Print warning if MySQL/MariaDB internal tuning is not at least
    half of the recommended values.
 -- Move libpmi from src/api to contribs/pmi.
 -- Add ability to specify a node reason when rebooting nodes with "scontrol
    reboot".
 -- Add nextstate option to "scontrol reboot" to dictate state of node after
    reboot.
 -- Consider "resuming" (nextstate=resume) nodes as available in backfill
    future scheduling and don't replace "resuming" nodes in reservations.
 -- Add the use of an xml file to help performance when using hwloc.
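
The SlurmctldHost entry above describes an ordered list of records, each a
name with an optional address in parentheses. A minimal Python sketch of
reading such records follows. The function names are hypothetical, the
fallback of using the name as the address is an assumption, and this is not
Slurm's own parser:

```python
import re

# Hypothetical helpers, not part of Slurm: parse SlurmctldHost records of
# the form "name" or "name(address)".
_HOST_RE = re.compile(r"^(?P<name>[^()\s]+)(?:\((?P<addr>[^()\s]+)\))?$")


def parse_slurmctld_host(value):
    match = _HOST_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unparsable SlurmctldHost value: {value!r}")
    name = match.group("name")
    # Assumption: with no explicit address, the name doubles as the address.
    return name, match.group("addr") or name


def parse_slurmctld_hosts(lines):
    """Return (name, address) pairs in file order; other keys are skipped."""
    hosts = []
    for line in lines:
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "slurmctldhost":
            hosts.append(parse_slurmctld_host(value))
    return hosts
```

The order of the returned list matters because the records are ordered: the
first entry would be the primary and the rest backups.
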
* Changes in Slurm 18.08.0pre1
==============================
 -- Add new burst buffer state of "teardown-fail" to indicate the burst buffer
    teardown operation is failing on specific buffers. This changes the numeric
    value of the BB_STATE_COMPLETE type. Any Slurm version 17.02 or 17.11 tool
    used to report burst buffer state information will report a state of "66"
    rather than "complete" for burst buffers which have been deleted, but still
    exist in the slurmctld daemon's tables (a very short-lived situation).
 -- Multiple backup slurmctld daemons can be configured:
    * Specify "BackupController#=<hostname>" and "BackupAddr#=<address>" to
      identify up to 9 backup servers.
    * Output format of "scontrol ping" and the daemon status at the end of
      "scontrol status" is modified to report up status of the primary and all
      backup servers.
    * "scontrol takeover [#]" command can now identify the SlurmctldHost
      index number. Default value is "1" (the first backup configured
      SlurmctldHost).
 -- Enable jobs with zero node count for creation and/or deletion of persistent
    burst buffers.
    * The partition default MinNodes configuration parameter is now 0
      (previously 1 node).
    * Zero size jobs disabled for job arrays and heterogeneous jobs, but
      supported for salloc, sbatch and srun commands.
 -- Add "scontrol show dwstat" command to display Cray burst buffer status.
 -- Add "GetSysStatus" option to burst_buffer.conf file. For burst_buffer/cray
    this would indicate the location of the "dwstat" command.
 -- Add node and partition configuration options of "CpuBind" to control default
    task binding. Modify the scontrol to report and modify these parameters.
 -- Add "NumaCpuBind" option to knl.conf file to automatically change a node's
    CpuBind parameter based upon changes to a node's NUMA mode.
 -- Add sbatch "--batch" option to identify features required on batch node.
    For example "sbatch --batch=haswell ...".
 -- Add "BatchFeatures" field to output of "scontrol show job".
 -- Add support for "--bb" option to sbatch command.
 -- Add new SystemComment field to job data structure and database. Currently
    used for Burst Buffer error logs.
 -- Expand reservation "flags" field from 32 to 64 bits.
 -- Add job state flag of "SIGNALING" to avoid race condition with multiple
    SIGSTOP/SIGCONT signals for the same job being active at the same time.
 -- Properly handle srun --will-run option when there are jobs in COMPLETING
    state.
 -- Properly report who is signaling a step.
 -- Don't combine updated reservation records in sreport's reservation report.
 -- node_features plugin - Add support for XOR & XAND of job constraints (node
    feature specifications).
 -- Add support for parentheses in a job's constraint specification to group
    like options together. For example
    --constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that
    four nodes with the features "knl", "snc4" and "flat" plus one node with
    the feature "haswell" are required.
 -- Improvements to how srun searches for the executable when using cwd.
 -- Now programs can be checked before execution if test_exec is set when using
    multi-prog option.
 -- Report NodeFeatures plugin configuration with scontrol and sview commands.
 -- Add acct_gather_profile/influxdb plugin.
 -- Add new job state of SO/STAGE_OUT indicating that burst buffer stage-out
    operation is in progress.
 -- Correct SLURM_NTASKS and SLURM_NPROCS environment variable for heterogeneous
    job step. Report values representing full allocation.
 -- Expand advanced reservation feature specification to support parentheses and
    counts of nodes with specified features. Nodes with the feature currently
    active will be preferred.
 -- Defer job signaling until prolog is completed
 -- Have the primary slurmctld wait until the backup has completely shutdown
    before taking control.
 -- Fix issue where unpacking job state after TRES count changed could lead to
    invalid reads.
 -- Heterogeneous job steps allocations supported with
    * Open MPI (with Slurm's PMI2 and PMIx plugins) and
    * Intel MPI (with Slurm's PMI2 plugin)
 -- Remove redundant function arguments from task plugins:
    * Remove "job_id" field from task_p_slurmd_batch_request() function.
    * Remove "job_id" field from task_p_slurmd_launch_request() function.
    * Remove "job_id" field from task_p_slurmd_reserve_resources() function.
 -- Change function name from node_features_p_changible_feature() to
    node_features_p_changeable_feature() in node_features plugin.
 -- Add Slurm configuration file check logic using "slurmctld -t" command.
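Illustration of the parenthesized constraint syntax listed above. This is a
minimal sketch in Python, not Slurm's parser (which is written in C): the
function name parse_constraint is hypothetical, and the sketch only handles
"&"-joined terms with optional "*count" suffixes inside one pair of brackets.

```python
def parse_constraint(expr):
    """Split a bracketed constraint into (features, node_count) pairs.

    Hypothetical helper for illustration only; it ignores XOR/XAND and
    any syntax beyond "&", "*count" and one level of parentheses.
    """
    body = expr.strip()
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]

    # Split on "&" only when outside parentheses, so a grouped feature
    # list such as (knl&snc4&flat) stays together as one term.
    terms, depth, current = [], 0, ""
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "&" and depth == 0:
            terms.append(current)
            current = ""
        else:
            current += ch
    terms.append(current)

    result = []
    for term in terms:
        # "name*count"; a missing count is treated as 1 in this sketch.
        name, _, count = term.partition("*")
        features = name.strip("()").split("&")
        result.append((features, int(count) if count else 1))
    return result


groups = parse_constraint("[(knl&snc4&flat)*4&haswell*1]")
for features, count in groups:
    print(count, "node(s) with features", features)
```

The example string is the one from the entry above: four nodes with "knl",
"snc4" and "flat", plus one node with "haswell".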
* Changes in Slurm 17.11.14
===========================

* Changes in Slurm 17.11.13-2
=============================
 -- Fix Perl build for 32-bit systems.
* Changes in Slurm 17.11.13
===========================
 -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc.
    CVE-2019-6438.
* Changes in Slurm 17.11.12
===========================
 -- Fix regression in 17.11.10 that caused dbd messages to not be queued up
    when the dbd was down.
* Changes in Slurm 17.11.11
===========================
 -- Correctly initialize variable in env_array_user_default().
 -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than
    a job's --mem-per-cpu and also -c is greater than 1.
* Changes in Slurm 17.11.10
===========================
 -- Move priority_sort_part_tier from slurmctld to libslurm to make it possible
    to run the regression tests 24.* without changing that code since it links
    directly to the priority plugin where that function isn't defined.
 -- Fix issue where job time limits can increase to max walltime when updating
    a job with scontrol.
 -- Fix invalid protocol_version manipulation on big endian platforms causing
    srun and sattach to fail.
 -- Fix for QOS, Reservation and Alias env variables in srun.
 -- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached
 -- When allowing heterogeneous steps make sure we copy all the options to
    avoid copying strings that may be overwritten.
 -- Print correctly when sh5util finds an empty file.
 -- Fix sh5util to not seg fault on exit.
 -- Fix sh5util to check correctly for H5free_memory.
 -- Adjust OOM monitoring function in task/cgroup to prevent problems in
    regression suite from leaked file descriptors.
 -- Fix issue with gres when defined with a type and no count
    (e.g. gres=gpu/tesla) it would get a count of 0.
 -- Allow sstat to talk to slurmd's that are new in protocol version.
 -- Permit database names over 33 characters in accounting_storage/mysql.
 -- Fix negative values when profiling.
 -- Fix srun segfault caused by invalid memory reads on the env.
 -- Fix segfault on job arrays when starting controller without dbd up.
 -- Fix pmi2 to build with gcc 8.0+.
 -- Fix proper alignment of clauses when determining if more nodes are needed
    for an allocation.
 -- Fix race condition when canceling a federation job that just started
    running.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing
    when using --hint=nomultithread.
 -- Fix left over socket file when step is ending and using pmi2 with
    %n or %h in the spool dir.
 -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
 -- Fix sacct to not print huge reserve times when the job was never eligible.
 -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a
    burst buffer.
 -- burst_buffer/cray - Update burst buffers when an association or qos
    is removed from the system.
 -- If failed over to a backup controller, ensure the agent thread is launched
    to handle deferred tasks.
 -- Fix correct job CPU count allocated.
 -- Protect against sending to the slurmdbd if the connection has gone away.
 -- Fix checking missing return codes when unpacking structures.
 -- Fix slurm.spec-legacy including slurmsmwd
 -- More explicit error message when cgroup oom-kill events detected.
 -- When updating an association and are unable to find parent association
    initialize old fairshare association pointer correctly.
 -- Wrap slurm_cond_signal() calls with mutexes where needed.
 -- Fix correct timeout with resends in slurm_send_only_node_msg.
 -- Fix pam_slurm_adopt to honor action_adopt_failure.
 -- job_submit/lua - expose argc/argv options through lua interface.

* Changes in Slurm 17.11.9-2
============================
 -- Fix printing of node state "drain + reboot" (and other node state flags).
 -- Fix invalid read (segfault) when sorting multi-partition jobs.
 -- Move several new error() messages to debug() to keep them out of users'
    srun output.
* Changes in Slurm 17.11.9
==========================
 -- Fix segfault in slurmctld when a job's node bitmap is NULL during a
    scheduling cycle.  Primarily caused by EnforcePartLimits=ALL.
 -- Remove erroneous unlock in acct_gather_energy/ipmi.
 -- Enable support for hwloc version 2.0.1.
 -- Fix 'srun -q' (--qos) option handling.
 -- Fix socket communication issue that can lead to lost task completion
    messages, which will cause a permanently stuck srun process.
 -- Handle creation of TMPDIR if environment variable is set or changed in
    a task prolog script.
 -- Avoid node layout fragmentation if running with a fixed CPU count but
    without Sockets and CoresPerSocket defined.
 -- burst_buffer/cray - Fix datawarp swap default pool overriding jobdw.
 -- Fix incorrect job priority assignment for multi-partition job with
    different PriorityTier settings on the partitions.
 -- Fix sinfo to print correct node state.
* Changes in Slurm 17.11.8
==========================
 -- Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path.
 -- Do not allocate nodes that were marked down due to the node not responding
    by ResumeTimeout.
 -- task/cray plugin - search for "mems" cgroup information in the file
    "cpuset.mems" then fall back to the file "mems".
 -- Fix ipmi profile debug uninitialized variable.
 -- Improve detection of Lua package on older RHEL distributions.
 -- PMIx: fixed the direct connect inline msg sending.
 -- MYSQL: Fix issue not handling all fields when loading an archive dump.
 -- Allow a job_submit plugin to change the admin_comment field during
    job_submit_plugin_modify().
 -- job_submit/lua - fix access into reservation table.
 -- MySQL - Prevent deadlock caused by archive logic locking reads.
 -- Don't enforce MaxQueryTimeRange when requesting specific jobs.
 -- Modify --test-only logic to properly support jobs submitted to more than
    one partition.
 -- Prevent slurmctld from aborting when attempting to set non-existing
 -- Add new job dependency type of "afterburstbuffer". The pending job will be
    delayed until the first job completes execution and its burst buffer
    stage-out is completed.
 -- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd
    and avoid the race condition that calling task before proctrack can
    introduce.
 -- Prevent reboot of a busy KNL node when requesting inactive features.
 -- Revert to previous behavior when requesting memory per cpu/node introduced
    in 17.11.7.
 -- Fix to reinitialize previously adjusted job members to their original value
    when validating the job memory in multi-partition requests.
 -- Fix _step_signal() from always returning SLURM_SUCCESS.
 -- Combine active and available node feature change logs on one line rather
    than one line per node for performance reasons.
 -- Prevent occasionally leaking freezer cgroups.
 -- Fix potential segfault when closing the mpi/pmi2 plugin.
 -- Fix issues with --exclusive=[user|mcs] to work correctly
    with preemption or when job requests a specific list of hosts.
 -- Make code compile with hdf5 1.10.2+
 -- mpi/pmix: Fixed the collectives canceling.
 -- SlurmDBD: improve error message handling on archive load failure.
 -- Fix incorrect locking when deleting reservations.
 -- Fix incorrect locking when setting up the power save module.
 -- Fix setting format output length for squeue when showing array jobs.
 -- Add xstrstr function.
 -- Fix printing out of --hint options in sbatch, salloc --help.
 -- Prevent possible divide by zero in _validate_time_limit().
 -- Add Delegate=yes to the slurmd.service file to prevent systemd from
    interfering with the jobs' cgroup hierarchies.
 -- Change the backlog argument to the listen() syscall within srun to 4096
    to match elsewhere in the code, and avoid communication problems at scale.
* Changes in Slurm 17.11.7
==========================
 -- Fix for possible slurmctld daemon abort with NULL pointer.
 -- Fix different issues when requesting memory per cpu/node.
 -- PMIx - override default paths at configure time if --with-pmix is used.
 -- Have sprio display jobs before eligible time when
    PriorityFlags=ACCRUE_ALWAYS is set.
 -- Make sure locks are always in place when calling _post_qos_list().
 -- Notify srun and ctld when unkillable stepd exits.
 -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in
    the jobacct_gather fini() interfaces introduced in 17.11.6.
 -- Fix slurmstepd deadlock in PMIx startup.
 -- task/cgroup - fix invalid free() if the hwloc library does not return a
    string as expected.
 -- Fix insecure handling of job requested gid field. CVE-2018-10995.
 -- Add --without x11 option to rpmbuild in slurm.spec.
* Changes in Slurm 17.11.6
==========================
 -- CRAY - Add slurmsmwd to the contribs/cray dir.
 -- sview - fix crash when closing any search dialog.
 -- Fix initialization of variable in stepd when using native x11.
 -- Fix reading slurm_io_init_msg to handle partial messages.
 -- Fix scontrol create res segfault when wrong user/account parameters given.
 -- Fix documentation for sacct on parameter -X (--allocations)
 -- Change TRES Weights debug messages to debug3.
 -- FreeBSD - assorted fixes to restore build.
 -- Fix for not tracking environment variables from unrelated different jobs.
 -- PMIX - Added the direct connect authentication.
    When upgrading this may cause issues with jobs using pmix starting on mixed
    slurmstepd versions where some are less than 17.11.6.
 -- Prevent the backup slurmctld from losing the active/available node
    features list on takeover.
 -- Add documentation for fix IDLE*+POWER due to capmc stuck in Cray systems.
 -- Fix missing mutex unlock when prolog is failing on a node, leading to a
    hung slurmd.
 -- Fix locking around Cray CCM prolog/epilog.
 -- Add missing fed_mgr read locks.
 -- Fix issue incorrectly setting a job time_start to 0 while requeueing.
 -- smail - remove stray '-s' from mail subject line.
 -- srun - prevent segfault if ClusterName setting is unset but
    SLURM_WORKING_CLUSTER environment variable is defined.
 -- In configurator.html web pages change default configuration from
    task/none to task/affinity plugin and from select/linear plugin to
    select/cons_res plus CR_Core.
 -- Allow jobs to run beyond a FLEX reservation end time.
 -- Fix problem with job state_reason being wrongly set to Reservation.
 -- Prevent bit_ffs() from returning a value out of bitmap range.
 -- Improve performance of 'squeue -u' when PrivateData=jobs is enabled.
 -- Make UnavailableNodes value in job reason be correct for each job.
 -- Fix 'squeue -o %s' on Cray systems.
 -- Fix incorrect error thrown when cancelling part of a job array.
 -- Fix error code and scheduling problem for --exclusive=[user|mcs].
 -- Fix build when lz4 is in a non-standard location.
 -- Be able to force power_down of cloud node even if in power_save state.
 -- Allow cloud nodes to be recognized in Slurm when booted out of band.
 -- Fix race condition in _pack_job_gres() when it is called multiple times.
 -- Increase duration of "sleep" command used to keep extern step alive.
 -- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to
    deadlock in glibc.
 -- Fix total TRES Billing on partitions.
 -- Don't tear down a BB if a node fails and --no-kill or resize of a job
    happens.
 -- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to
    deadlock in glibc.
 -- Fix fatal in controller when loading completed trigger
 -- Ignore reservation overlap at submission time.
 -- GRES type model and QOS limits documentation added
 -- slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set.
 -- PMIx - move two error messages on retry to debug level, and only display
    the error after the retry count has been exceeded.
 -- Increase number of tries when sending responses to srun.
 -- Fix checkpointing requeued/completing jobs in a bad state which caused a
    segfault on restart.
 -- Fix srun on ppc64 platforms.
 -- Prevent slurmd from starting steps if the Prolog returns an error when using
    PrologFlags=alloc.
 -- priority/multifactor - prevent segfault running sprio if a partition has
    just been deleted and PriorityFlags=CALCULATE_RUNNING is turned on.
 -- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code value.
 -- job_submit/lua - print an error if the script calls log.user in
    job_modify() instead of returning it to the next submitted job erroneously.
 -- select/linear - handle job resize correctly.
 -- select/cons_res - improve handling of --cores-per-socket requests.
* Changes in Slurm 17.11.5
==========================
 -- Fix cloud nodes getting stuck in DOWN+POWER_UP+NO_RESPOND state after not
    responding by ResumeTimeout.
 -- Add job's array_task_cnt and user_name along with partitions
    [max|def]_mem_per_[cpu|node], max_cpus_per_node, and max_share with the
    SHARED_FORCE definition to the job_submit/lua plugin.
 -- srun - fix for SLURM_JOB_NUM_NODES env variable assignment.
 -- sacctmgr - fix runaway jobs identification.
 -- Fix to always set the correct status on job update in mysql.
 -- Fix issue if running with an association manager cache (slurmdbd was down
    when slurmctld was started) you could lose QOS usage information.
 -- CRAY - Fix spec file to work correctly.
 -- Set scontrol exit code to 1 if attempting to update a node state to DRAIN
    or DOWN without specifying a reason.
 -- Fix race condition when running with an association manager cache
    (slurmdbd was down when slurmctld was started).
 -- Print out missing SLURM_PERSIST_INIT slurmdbd message type.
 -- Fix two build errors related to use of the O_CLOEXEC flag with older glibc.
 -- Add Google Cloud Platform integration scripts into contribs directory.
 -- Fix minor potential memory leak in backfill plugin.
 -- Add missing node flags (maint/power/etc) to node states.
 -- Fix issue where job time limits may end up at 1 minute when using the
    NoReserve flag on their QOS.
 -- Fix security issue in accounting_storage/mysql plugin by always escaping
    strings within the slurmdbd. CVE-2018-7033.
 -- Soften messages about best_fit topology to debug2 to avoid alarm.
 -- Fix issue in sreport reservation utilization report to handle more
    allocated time than 100% (Flex reservations).
 -- When a job is requesting a Flex reservation prefer the reservation's nodes
    over any other nodes.
* Changes in Slurm 17.11.4
==========================
 -- Add fatal_abort() function to be able to get core dumps if we hit an
    "impossible" edge case.
 -- Link slurmd against all libraries that slurmstepd links to.
 -- Fix limits enforce order when they're set at partition and other levels.
 -- Add slurm_load_single_node() function to the Perl API.
 -- slurm.spec - change dependency for --with lua to use pkgconfig.
 -- Fix small memory leaks in node_features plugins on reconfigure.
 -- slurmdbd - only permit requests to update resources from operators or
    administrators.
 -- Fix handling of partial writes in io_init_msg_write_to_fd() which can
    lead to job step launch failure under higher cluster loads.
 -- MYSQL - Fix to handle quotes in a given work_dir of a job.
 -- sbcast - fix a race condition that leads to "Unspecified error".
 -- Log that support for the ChosLoc configuration parameter will end in Slurm
    version 18.08.
 -- Fix backfill performance issue where bf_min_prio_reserve was not respected.
 -- Fix MaxQueryTimeRange checks.
 -- Print MaxQueryTimeRange in "sacctmgr show config".
 -- Correctly check return codes when creating a step to check if needing to
    wait to retry or not.
 -- Fix issue where a job could be denied by Reason=MaxMemPerLimit when not
    requesting any tasks.
 -- In perl tools, fix for regexp that caused extra incorrectly shown results.
 -- Add some extra locks in fed_mgr to be extra safe.
 -- Minor memory leak fixes in the fed_mgr on slurmctld shutdown.
 -- Make sreport job reports also report duplicate jobs correctly.
 -- Fix issues restoring certain Partition configuration elements, especially
    when ReconfigFlags=KeepPartInfo is enabled.
 -- Don't add TRES whose value is NO_VAL64 when building string line.
 -- Fix removing array jobs from hash in slurmctld.
 -- Print out missing user messages from jobsubmit plugin when srun/salloc are
    waiting for an allocation.
 -- Handle --clusters=all as case insensitive.
 -- Only check requested clusters in federation when using --test-only
    submission option.
 -- In the federation, make it so you can cancel stranded sibling jobs.
 -- Silence an error from PSS memory stat collection process.
 -- Requeue jobs allocated to nodes requested to DRAIN or FAIL if nodes are
    POWER_SAVE or POWER_UP, preventing jobs from starting on NHC-failed nodes.
 -- Make MAINT and OVERLAP reservation flags order agnostic on overlap test.
 -- Preserve node features when slurmctld daemon is reconfigured including active
    and available KNL features.
 -- Prevent creation of multiple io_timeout threads within srun, which can
    lead to fatal() messages when those unexpected and additional mutexes are
    destroyed when srun shuts down.
 -- burst_buffer/cray - Prevent use of "#DW create_persistent" and
    "#DW destroy_persistent" directives available in Cray CLE6.0UP06. This
    will be supported in Slurm version 18.08. Use "#BB" directives until then.
 -- Fix task/cgroup affinity to behave correctly.
 -- FreeBSD - fix build on systems built with WITHOUT_KERBEROS.
 -- Fix to restore pn_min_memory calculated result to correctly enforce
    MaxMemPerCPU setting on a partition when the job uses --mem.
 -- slurmdbd - prevent infinite loop if a QOS is set to preempt itself.
 -- Fix issue with log rotation for slurmstepd processes.
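Illustration of the partial-write handling mentioned for
io_init_msg_write_to_fd() above. This is a minimal sketch of the general
"write until everything is sent" technique in Python, not the Slurm code
(which is C): the helper name write_all is hypothetical.

```python
import os


def write_all(fd, data):
    """Write all of `data` to `fd`, looping over short writes.

    A single os.write() may accept fewer bytes than requested, so the
    caller must keep writing the unsent remainder until none is left.
    Hypothetical helper for illustration only.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write() returns the number of bytes actually written,
        # which can be less than len(view) - total.
        total += os.write(fd, view[total:])
    return total


# Usage: send a small message through a pipe and read it back.
read_fd, write_fd = os.pipe()
sent = write_all(write_fd, b"hello slurm")
os.close(write_fd)
print(sent, os.read(read_fd, 64))
os.close(read_fd)
```

Treating one write call as if it always sends the full buffer is the kind of
assumption that tends to fail only under load, which matches the symptom
described in the entry.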
* Changes in Slurm 17.11.3-2
============================
 -- Revert node_features changes in 17.11.3 that lead to various segfaults on
    slurmctld startup.
* Changes in Slurm 17.11.3
==========================
 -- Send SIG_UME correctly to a step.
 -- Sort sreport's reservation report by cluster, time_start, resv_name instead
    of cluster, resv_name, time_start.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating
    the node reboot is cancelled while the reboot is in progress.
 -- Scheduling fix for changing node features without any NodeFeatures plugins.
 -- Improve logic when summarizing job arrays mail notifications.
 -- Add scontrol -F/--future option to display nodes in FUTURE state.
 -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE.
 -- When a job array is preempting make it so tasks in the array don't wait
    to preempt other possible jobs.
 -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free
    in slurmstepd.
 -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld
    is reconfigured.
 -- node_feature/knl_cray - Fix memory leak that can occur during normal
    operation.
 -- Fix srun environment variables for --prolog script.
 -- Fix job array dependency with "aftercorr" option and some task arrays in
    the first job fail. This fix lets all task array elements that can run
    proceed rather than stopping all subsequent task array elements.
 -- Fix potential deadlock in the slurmctld when using list_for_each.
 -- Fix for possible memory corruption in srun when running heterogeneous job
    steps.
 -- Fix output file containing "%t" (task ID) for heterogeneous job step to
    be based upon global task ID rather than task ID for that component of the
    heterogeneous job step.
 -- MYSQL - Fix potential abort when attempting to make an account a parent of
    itself.
 -- Fix potentially uninitialized variable in slurmctld.
 -- MYSQL - Fix issue for multi-dimensional machines when using sacct to
    find jobs that ran on specific nodes.
 -- Reject --acctg-freq at submit if invalid.
 -- Added info string on sh5util when deleting an empty file.
 -- Correct dragonfly topology support when job allocation specifies desired
    switch count.
 -- Fix minor memory leak on an sbcast error path.
 -- Fix issues when starting the backup slurmdbd.
 -- Revert uid check when requesting a jobid from a pid.
 -- task/cgroup - add support to detect OOM_KILL cgroup events.
 -- Fix whole node allocation cpu counts when --hint=nomultithread.
 -- Allow execution of task prolog/epilog when uid has access
    rights by a secondary group id.
 -- Validate command existence on the srun *[pro|epi]log options
    if LaunchParameters test_exec is set.
 -- Fix potential memory leak if clean starting and the TRES didn't change
    from when last started.
 -- Fix for association MaxWall enforcement when none is given at submission.
 -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld.
 -- burst_buffer/cray: Attempts by job to create persistent burst buffer when
    one already exists owned by a different user will be logged and the job
    held.
 -- CRAY - Remove race in the core_spec where we add the slurmstepd to the
    job container where if the step was canceled would also cancel the stepd
    erroneously.
 -- Make sure the slurmstepd blocks signals like SIGTERM correctly.
 -- SPANK - When slurm_spank_init_post_opt() fails return error correctly.
 -- When revoking a sibling job in the federation we want to send a start
    message before purging the job record to get the uid of the revoked job.
 -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss
    was the only correct capitalization; UsePSS or usepss were silently
    ignored.
 -- Prevent pthread_atfork handlers from being added unnecessarily after
    'scontrol reconfigure', which can eventually lead to a crash if too
    many handlers have been registered.
 -- Better debug messages when MaxSubmitJobs is hit.
 -- Docs - update squeue man page to describe all possible job states.
 -- Prevent orphaned step_extern steps when a job is cancelled while the
    prolog is still running.
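Illustration of the case-insensitive JobAcctGatherParams matching described
above. This is a minimal sketch in Python, not the Slurm implementation: the
helper name has_option is hypothetical, and the parameter strings are sample
values chosen for the example.

```python
def has_option(params, name):
    """Return True if `name` appears in a comma-separated option string,
    comparing without regard to case. Hypothetical helper for illustration.
    """
    wanted = name.casefold()
    return any(tok.strip().casefold() == wanted for tok in params.split(","))


# With case-insensitive matching, every spelling selects the same option.
for value in ("UsePss", "UsePSS", "usepss"):
    print(value, has_option(value, "UsePss"))
```

An exact, case-sensitive comparison would accept only the first spelling and
silently skip the other two, which is the behavior the entry says was fixed.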
* Changes in Slurm 17.11.2
==========================
 -- jobcomp/elasticsearch - append Content-Type to the HTTP header.
 -- MYSQL - Fix potential abort of slurmdbd when job has no TRES.
 -- Add advanced reservation flag of "REPLACE_DOWN" to replace DOWN or DRAINED
    nodes.
 -- slurm.spec-legacy - add missing libslurmfull.so to slurm.files.
 -- Fix squeue job ID filtering for pending job array records.
 -- Fix potential deadlock in _run_prog() in power save code.
 -- MYSQL - Add dynamic_offset in the database to force range for auto
    increment ids for the tres_table.
 -- MYSQL - Fix fallout from MySQL auto increment bug, see RELEASE_NOTES,
    only affects current 17.11 users tracking licenses or GRES in the database.