This file describes changes in recent versions of Slurm. It primarily documents those changes that are of interest to users and administrators.

* Changes in Slurm 18.08.8
==========================
 -- Update "xauth list" to use the same 10000ms timeout as the other xauth commands.
 -- Fix issue in gres code to handle a gres cnt of 0.
 -- Don't purge jobs if backfill is running.
 -- Verify job is pending when adding/removing accrual time.
 -- Don't abort when the job doesn't have an association that was removed before the job was able to make it to the database.
 -- Set state_reason if select_nodes() fails job for QOS or Account.
 -- Avoid seg_fault on referencing association without a valid_qos bitmap.
 -- If Association/QOS is removed on a pending job set that job as ineligible.
 -- When changing a job's account/qos always make sure you remove the old limits.
 -- Don't reset a FAIL_QOS or FAIL_ACCOUNT job reason until the qos or account changed.
 -- Restore "sreport -T ALL" functionality.
 -- Correctly typecast signals being sent through the api.
 -- Properly initialize structures throughout Slurm.
 -- Sync "numtask" squeue format option for jobs and steps to "numtasks".
 -- Fix sacct -PD to avoid CA before start jobs.
 -- Fix potential deadlock with backup slurmctld.
 -- Fixed issue with jobs not appearing in sacct after dependency satisfied.
 -- Fix showing non-eligible jobs when asking with -j and not -s.
 -- Fix issue with backfill scheduler scheduling tasks of an array when not the head job.

* Changes in Slurm 18.08.7
==========================
 -- Set debug statement to debug2 to avoid benign error messages.
 -- Add SchedulerParameters option of bf_hetjob_immediate to attempt to start a heterogeneous job as soon as all of its components are determined able to do so.
 -- Fix underflow causing decay thread to exit.
 -- Fix main scheduler not considering hetjobs when building the job queue.
 -- Fix regression for sacct to display old jobs without a start time.
 -- Fix setting correct number of gres topology bits.
 -- Update hetjobs pending state reason when appropriate.
 -- Fix accounting_storage/filetxt's understanding of TRES.
 -- Set Accrue time when not enforcing limits.
 -- Fix srun segfault when requesting a hetjob with test_exec or bcast options.
 -- Hide multipart priorities log message behind Priority debug flag.
 -- sched/backfill - Make hetjobs sensitive to bf_max_job_start.
 -- Fix slurmctld segfault due to job's partition pointer NULL dereference.
 -- Fix issue with OR'ed job dependencies.
 -- Add new job's bit_flags of INVALID_DEPEND to prevent rebuilding a job's dependency string when it has at least one invalid and purged dependency.
 -- Promote federation unsynced siblings log message from debug to info.
 -- burst_buffer/cray - fix slurmctld SIGABRT due to illegal read/writes.
 -- burst_buffer/cray - fix memory leak due to unfreed job script content.
 -- node_features/knl_cray - fix script_argv use-after-free.
 -- burst_buffer/cray - fix script_argv use-after-free.
 -- Fix invalid reads of size 1 due to non null-terminated string reads.
 -- Add extra debug2 logs to identify why BadConstraints reason is set.

* Changes in Slurm 18.08.6-2
============================
 -- Remove deadlock situation when logging and --enable-debug is used.
 -- Fix RPM packaging for accounting_storage/mysql.

* Changes in Slurm 18.08.6
==========================
 -- Added parsing of -H flag with scancel.
 -- Fix slurmsmwd build on 32-bit systems.
 -- acct_gather_filesystem/lustre - add support for Lustre 2.12 client.
 -- Fix per-partition TRES factors/priority.
 -- Fix per-partition NICE priority.
 -- Fix partition access check validation for multi-partition job submissions.
 -- Prevent segfault on empty response in 'scontrol show dwstat'.
 -- node_features/knl_cray plugin - Preserve node's active features if it has already booted when slurmctld daemon is reconfigured.
 -- Detect missing burst buffer script and reject job.
 -- GRES: Properly reset the topo_gres_cnt_alloc counter on slurmctld restart to prevent underflow.
 -- Avoid errors from packing accounting_storage_mysql.so when RPM is built without mysql support.
 -- Remove deprecated -t option from slurmctld --help.
 -- acct_gather_filesystem/lustre - fix stats gathering.
 -- Enforce documented default usage start and end times when querying jobs from the database.
 -- Fix issues when querying running jobs from the database.
 -- Deny sacct request where start time is later than the end time requested.
 -- Fix sacct verbose about time and states queried.
 -- burst_buffer/cray - allow 'scancel --hurry ' to tear down a burst buffer that is currently staging data out.
 -- X11 forwarding - allow setup if the DISPLAY environment variable lacks a screen number. (Permit both "localhost:10.0" and "localhost:10".)
 -- docs - change HTML title to include the page title or man page name.
 -- X11 forwarding - fix an unnecessary error message when using the local_xauthority X11Parameters option.
 -- Add use_raw_hostname to X11Parameters.
 -- Fix smail so it passes job arrays to seff correctly.
 -- Don't check InactiveLimit for salloc --no-shell jobs.
 -- Add SALLOC_GRES and SBATCH_GRES as input to salloc/sbatch.
 -- Remove drain state when node doesn't reboot by ResumeTimeout.
 -- Fix considering "resuming" nodes in scheduling.
 -- Do not kill suspended jobs due to exceeding time limit.
 -- Add NoAddrCache CommunicationParameter.
 -- Don't ping powering up cloud nodes.
 -- Add cloud_dns SlurmctldParameter.
 -- Consider --sbindir configure option as the default path to find slurmstepd.
 -- Fix node state printing of DRAINED$.
 -- Fix spamming dbd of down/drained nodes in maintenance reservation.
 -- Avoid buffer overflow in time_str2secs.
 -- Calculate suspended time for suspended steps.
 -- Add null check for step_ptr->step_node_bitmap in _pick_step_nodes.
 -- Fix multi-cluster srun issue after 'scontrol reconfigure' was called.
 -- Fix accessing response_cluster_rec outside of write locks.
 -- Fix Lua user messages not showing up on rejected submissions.
 -- Fix printing multi-line error messages on rejected submissions.

* Changes in Slurm 18.08.5-2
============================
 -- Fix Perl build for 32-bit systems.

* Changes in Slurm 18.08.5
==========================
 -- Backfill - If a job has a time_limit guess the end time of a job better if OverTimeLimit is Unlimited.
 -- Fix "sacctmgr show events event=cluster".
 -- Fix sacctmgr show runawayjobs from sibling cluster.
 -- Avoid bit offset of -1 in call to bit_nclear().
 -- Ensure that "hbm" is a configured GresType on knl systems.
 -- Fix NodeFeaturesPlugins=node_features/knl_generic to allow gres other than knl.
 -- cons_res: Prevent overflow on multiply.
 -- Better debug for bad values in gres.conf.
 -- Fix double accounting of energy at end of job.
 -- Read gres.conf for cloud nodes on slurmctld.
 -- Don't assume the first node of a job is the batch host when purging jobs from a node.
 -- Better debugging when a job doesn't have a job_resrcs ptr.
 -- Store ave watts in energy plugins.
 -- Add XCC plugin for reading Lenovo Power.
 -- Fix minor memory leak when scheduling rebootable nodes.
 -- Fix debug2 prefix for sched log.
 -- Fix printing correct SLURM_JOB_ACCOUNT_PACK_GROUP_* in env for a Het Job.
 -- sbatch - search current working directory first for job script.
 -- Make it so held jobs reset the AccrueTime and do not count against any AccrueTime limits.
 -- Add SchedulerParameters option of bf_hetjob_prio=[min|avg|max] to alter the job sorting algorithm for scheduling heterogeneous jobs.
 -- Fix initialization of assoc_mgr_locks and slurmctld_locks lock structures.
 -- Fix segfault with job arrays using X11 forwarding.
 -- Revert regression caused by e0ee1c7054 which caused negative values and values starting with a decimal to be invalid for PriorityWeightTRES and TRESBillingWeight.
 -- Fix possibility to update a job's reservation to none.
 -- Suppress connection errors to primary slurmdbd when backup dbd is active.
 -- Suppress connection errors to primary db when backup db kicks in.
 -- Add missing fields for sacct --completion when using jobcomp/filetxt.
 -- Fix incorrect values set for UserCPU, SystemCPU, and TotalCPU sacct fields when JobAcctGatherType=jobacct_gather/cgroup.
 -- Fix srun printing the invalid option msg twice.
 -- Remove unused -b flag from getopt call in sbatch.
 -- Disable reporting of node TRES in sreport.
 -- Re-enable features combined by OR within parentheses for non-knl setups.
 -- Prevent sending duplicate requests to reboot a node before ResumeTimeout.
 -- Down nodes that don't reboot by ResumeTimeout.
 -- Update seff to reflect API change from rss_max to tres_usage_in_max.
 -- Add missing TRES constants from perl API.
 -- Fix issue where sacct would return incorrect array tasks when querying specific tasks.
 -- Add missing variables to slurmdb_stats_t in the perlapi.
 -- Fix nodes not getting reboot RPC when job requires reboot of nodes.
 -- Fix failure to update the partition list of a job.
 -- Use slurm.conf gres ids instead of gres.conf names to get a gres type name.
 -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. CVE-2019-6438.

* Changes in Slurm 18.08.4
==========================
 -- burst_buffer/cray - avoid launching a job that would be immediately cancelled due to a DataWarp failure.
 -- Fix message sent to user to display preempted instead of time limit when a job is preempted.
 -- Fix memory leak when a failure happens processing a node's gres config.
 -- Improve error message when failures happen processing a node's gres config.
 -- When building rpms ignore redundant standard rpaths and insecure relative rpaths, for RHEL based distros which use "check-rpaths" tool.
 -- Don't skip jobs in scontrol hold.
 -- Avoid locking the job_list when unneeded.
 -- Allow --cpu-bind=verbose to be used with SLURM_HINT environment variable.
 -- Make it so fixing runaway jobs will not alter the same job requeued when not runaway.
 -- Avoid checking state when searching for runaway jobs.
 -- Remove redundant check for end time of job when searching for runaway jobs.
 -- Make sure that we properly check for runawayjobs where another job might have the same id (for example, if a job was requeued) by also checking the submit time.
 -- Add scontrol update job ResetAccrueTime to clear a job's time previously accrued for priority.
 -- cons_res: Delay exiting cr_job_test until after cores/cpus are calculated and distributed.
 -- Fix bug where binary in cwd would trump binary in PATH with test_exec.
 -- Fix check to test printf("%s\n", NULL); to not require -Wno-format-truncation CFLAG.
 -- Fix JobAcctGatherParams=UsePss to report the correct usage.
 -- Fix minor memory leak in pmix plugin.
 -- Fix minor memory leak in slurmctld when reading configuration.
 -- Handle return codes correctly from pthread_* functions.
 -- Fix minor memory leak when a slurmd is unable to contact a slurmctld when trying to register.
 -- Fix sreport sizesbyaccount report when using Flatview and accounts.
 -- Fix incorrect shift when dealing with node weights and scheduling.
 -- libslurm/perl - Fix segfault caused by incorrect hv_to_slurm_ctl_conf.
 -- Add qos and assoc options to confirmation dialogs.
 -- Handle updating identical license or partition information correctly.
 -- Make sure accounts and QOS' are all lower case to match documentation when read in from the slurm.conf file.
 -- Don't consider partitions without enough nodes in reservation, main scheduler.
 -- Set SLURM_NTASKS correctly if having to determine from other options.
 -- Removed GCP scripts from contribs. Now located at: https://github.com/SchedMD/slurm-gcp.
 -- Don't check existence of srun --prolog or --epilog executables when set to "none" and SLURM_TEST_EXEC is used.
 -- Add "P" suffix support to job and step tres specifications.
 -- When doing a reconfigure handle QOS' GrpJobsAccrue correctly.
 -- Remove unneeded extra parentheses from sh5util.
 -- Fix jobacct_gather/cgroup to work correctly when more than one task is started on a node.
 -- If requesting --ntasks-per-node with no tasks set tasks correctly.
 -- Accept modifiers for TRES originally added in 6f0342e0358.
 -- Don't remove reservation on slurmctld restart if nodes are removed from configuration.
 -- Fix bad xfree in task/cgroup.
 -- Fix removing counters if a job array isn't subject to limits and is canceled while pending.
 -- Make sure SLURM_NTASKS_PER_NODE is set correctly when env is overwritten by the command line.
 -- Clean up step on a failed node correctly.
 -- mpi/pmix: Fixed the logging of collective state.
 -- mpi/pmix: Make multi-slurmd work correctly when using ring communication.
 -- mpi/pmix: Fix double invocation of the PMIx lib fence callback.
 -- mpi/pmix: Remove unneeded libpmix callback drop in tree-based coll.
 -- Fix race condition in route/topology when the slurmctld is reconfigured.
 -- In route/topology validate the slurmctld doesn't try to initialize the node system.
 -- Fix issue when requesting invalid gres.
 -- Validate job_ptr in backfill before restoring preempt state.
 -- Fix issue when job's environment is minimal and only contains variables Slurm is going to replace internally.
 -- When handling runaway jobs remove all usage before rollup to remove any time that wasn't existent instead of just updating lines that have time with a lesser time.
 -- salloc - set SLURM_NTASKS_PER_CORE and SLURM_NTASKS_PER_SOCKET in the environment if the corresponding command line options are used.
 -- slurmd - fix handling of the -f flag to specify alternate config file locations.
 -- Fix scheduling logic to avoid using nodes that require a reboot for KNL node change when possible.
 -- Fix scheduling logic bug. There should have been a test for _not_ NODE_SET_REBOOT to continue.
 -- Fix a scheduling logic bug with respect to XOR operation support when there are down nodes.
 -- If there is a constraint construct of the form "[...&...]" then an error is generated if more than one of those specifications contains KNL NUMA or MCDRAM modes.
 -- Fix stepd segfault race if slurmctld hasn't registered with the launching slurmd yet delivering its TRES list.
 -- Add SchedulerParameters option of bf_ignore_newly_avail_nodes to avoid scheduling lower priority jobs on resources that become available during the backfill scheduling cycle when bf_continue is enabled.
 -- Decrement message_connections in stepd code on error path correctly.
 -- Decrease an error message to be debug.
 -- Fix missing suffixes in squeue.
 -- pam_slurm_adopt - send an error message to the user if no Slurm jobs can be located on the node.
 -- Run SlurmctldPrimaryOffProg when the primary slurmctld process shuts down.
 -- job_submit/lua: Add several slurmctld return codes.
 -- job_submit/lua: Add user/group info to jobs.
 -- Fix formatting issues when printing uint64_t.
 -- Bump RLIMIT_NOFILE for daemons in systemd services.
 -- Expand %x in job name in 'scontrol show job'.
 -- salloc/sbatch/srun - print warning if mutually exclusive options of --mem and --mem-per-cpu are both set.

* Changes in Slurm 18.08.3
==========================
 -- Fix regression in 18.08.1 that caused dbd messages to not be queued up when the dbd was down.
 -- Fix regression in 18.08.1 that can cause a slurmctld crash when splitting job array elements.

* Changes in Slurm 18.08.2
==========================
 -- Correctly initialize variable in env_array_user_default().
 -- Remove race condition when signaling starting step.
 -- Fix issue where 17.11 jobs using GRES didn't initialize new 18.08 structures after unpack.
 -- Stop removing nodes once the minimum CPU or node count for the job is reached in the cons_res plugin.
 -- Process any changes to MinJobAge and SlurmdTimeout in the slurmctld when it is reconfigured to determine changes in its background timers.
 -- Use previous SlurmdTimeout in the slurmctld after a reconfigure to determine the time a node has been down.
 -- Fix multi-cluster srun between clusters with different SelectType plugins.
 -- Fix removing job licenses on reconfig/restart when configured license counts are 0.
 -- If a job requested multiple licenses and one license was removed then on a reconfigure/restart all of the licenses, including the valid ones, would be removed.
 -- Fix issue where job's license string wasn't updated after a restart when licenses were removed or added.
 -- Add allow_zero_lic to SchedulerParameters.
 -- Avoid scheduling tasks in excess of ArrayTaskThrottle when canceling tasks of an array.
 -- Fix jobs that request memory per node and task count that can't be scheduled right away.
 -- Avoid infinite loop with jobacct_gather/linux when pids wrap around /proc/sys/kernel/pid_max.
 -- Fix --parsable2 output for sacct and sstat commands to remove a stray trailing delimiter.
 -- When modifying a user's name in sacctmgr enforce PreserveCaseUser.
 -- When adding a coordinator or user that was once deleted enforce PreserveCaseUser.
 -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than a job's --mem-per-cpu and also -c is greater than 1.
 -- Set AccrueTime correctly when MaxJobsAccrue is disabled and BeginTime has not been established.
 -- Correctly account for job arrays for new {Max/Grp}JobsAccrue limits.

* Changes in Slurm 18.08.1
==========================
 -- Remove commented-out parts of man pages related to cons_tres work in 19.05, as these were showing up on the web version due to a syntax error.
 -- Prevent slurmctld performance issues in main background loop if multiple backup controllers are unavailable.
 -- Add missing user read association lock in burst_buffer/cray during init().
 -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'.
 -- Fix creation of step hwloc xml file for after cpuset cgroup has been created.
 -- Add userspace as a valid default governor.
 -- Add timers to group_cache_lookup so if going slow advise LaunchParameters=send_gids.
 -- Fix SLURM_STEP_GRES=none to work correctly.
 -- Fix potential memory leak when a failure happens unpacking a ctld_multi_msg.
 -- Fix potential double free when a failure happens when unpacking a node_registration_status_msg.
 -- Fix sacctmgr show runaways.
 -- Removed non-POSIX append operator from configure script for non-bash support.
 -- Fix sacct to not print huge reserve times when the job was never eligible.
 -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a burst buffer.
 -- burst_buffer/cray - Update burst buffers when an association or qos is removed from the system.
 -- Remove documentation for deprecated Cray/ALPS systems. Please switch to Native Cray mode instead.
 -- Completely copy features when copying the list in the slurmctld.
 -- PMIX - Fix issue with packing processes when using an arbitrary task distribution.
 -- Fix hostlists to be able to handle nodenames with '-' in them surrounded by integers.
 -- Fix correct job CPU count allocated.
 -- Fix sacctmgr setting GrpJobs limit when setting GrpJobsAccrue limit.
 -- Change the defaults to MemLimitEnforce=no and NoOverMemoryKill (See RELEASE_NOTES).
 -- Prevent abort when using Cray node features plugin on non-knl.
 -- Add ability to reboot down nodes with scontrol reboot_nodes.
 -- Protect against sending to the slurmdbd if the connection has gone away.
 -- Fix invalid read when not using backup slurmctlds.
 -- Prevent acct coordinators from changing default acct on add user.
 -- Don't allow scontrol top to modify job priorities when priority == 1.
 -- slurmsmwd - change parsing code to handle systems with the svid or inst fields set in xtconsumer output.
 -- Fix infinite loop in slurmctld if GRES is specified without a count.
 -- sacct: Print error when unknown arguments are found.
 -- Fix checking missing return codes when unpacking structures.
 -- Fix slurm.spec-legacy including slurmsmwd.
 -- More explicit error message when cgroup oom-kill events detected.
 -- When updating an association and are unable to find parent association initialize old fairshare association pointer correctly.
 -- Wrap slurm_cond_signal() calls with mutexes where needed.
 -- Fix correct timeout with resends in slurm_send_only_node_msg.
 -- Fix pam_slurm_adopt to honor action_adopt_failure.
 -- Have the slurmd recreate the hwloc xml file for the full system on restart.
 -- sdiag - correct the units for the gettimeofday() stat to microseconds.
 -- Set SLURM_CLUSTER_NAME environment variable in MailProg to the ClusterName.
 -- smail - use SLURM_CLUSTER_NAME environment variable.
 -- job_submit/lua - expose argc/argv options through lua interface.
 -- slurmdbd - prevent false-positive warning about innodb settings having been set too low if they're actually set over 2GB.

* Changes in Slurm 18.08.0
==========================
 -- Fix segfault on job arrays when starting controller without dbd up.
 -- Fix pmi2 to build with gcc 8.0+.
 -- Remove the development snapshot of select/cons_tres plugin.
 -- Fix slurmd -C to not print benign error from xcpuinfo.
 -- Fix potential double locks in the assoc_mgr.
 -- Fix sacct truncate flag behavior. Truncated pending jobs will always return a start and end time set to the window end time, so elapsed time is 0.
 -- Fix extern step hanging forever when canceled right after creation.
 -- sdiag - add slurmctld agent count.
 -- Remove requirement to have cgroup_allowed_devices_file.conf in order to constrain devices.
    By default all devices are allowed and GRES, that are associated with a device file, that are not requested are restricted.
 -- Fix proper alignment of clauses when determining if more nodes are needed for an allocation.
 -- Fix race condition when canceling a federation job that just started running.
 -- Prevent extra resources from being allocated when combining certain flags.
 -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing when using --hint=nomultithread.
 -- Fix left over socket file when step is ending and using pmi2 with %n or %h in the spool dir.
 -- Don't remove hwloc full system xml file when shutting down the slurmd.
 -- Fix segfault that could happen with a het job when it was canceled while starting.
 -- Fix scan-build false-positive warning about invalid memory access in the _ping_controller() function.
 -- Add control_inx value to trigger_info_msg_t to permit future work in the trigger management code to distinguish which of multiple backup controllers has changed state.

* Changes in Slurm 18.08.0rc1
==============================
 -- Add TimelimitRaw sacct output field to display timelimit numbers.
 -- Fix job array preemption during backfill scheduling.
 -- Fix scontrol -o show assoc output.
 -- Add support for sacct --whole-hetjob=[yes|no] option.
 -- Make salloc handle node requests the same as sbatch.
 -- Add shutdown_on_reboot SlurmdParameter to control whether the Slurmd will shut itself down or not when a reboot request is received.
 -- Add cancel_reboot scontrol option to cancel pending reboot of nodes.
 -- Make Users case insensitive in the database based on Parameters=PreserveCaseUser in the slurmdbd.conf.
 -- Improve scheduling when dealing with node_features that could have a boot delay.
 -- Fix issue if a step launch fails we don't get a bunch of '(null)' strings in the step record for usage.
 -- Changed the default AuthType for slurmdbd to auth/munge.
 -- Make it so libpmi.so doesn't link to libslurm.so.$apiversion.
 -- Added 'remote-fs.target' to After directive of slurmd.service file.
 -- Fix filetxt plugin to handle it when you aren't running a jobacct_gather plugin.
 -- Remove drain on node when reboot nextstate used.
 -- Speed up pack of job's qos.
 -- Fix race condition when trying to update reservation in the database.
 -- For the PrologFlags slurm.conf option, make NoHold mutually exclusive with Contain and/or X11 options.
 -- Revise the handling of SlurmctldSyslogLevel and SlurmdSyslogLevel options in slurm.conf and DebugLevelSyslog in slurmdbd.conf.
 -- Gate reading the cgroup.conf file.
 -- Gate reading the acct_gather_* plugins.
 -- Add sacctmgr options to prevent/manage job queue stuffing:
    - GrpJobsAccrue= Maximum number of pending jobs in aggregate able to accrue age priority for this association and all associations which are children of this association. To clear a previously set value use the modify command with a new value of -1.
    - MaxJobsAccrue= Maximum number of pending jobs able to accrue age priority at any given time for the given association. This is overridden if set directly on a user. Default is the cluster's limit. To clear a previously set value use the modify command with a new value of -1.
    - MinPrioThreshold Minimum priority required to reserve resources when scheduling.

* Changes in Slurm 18.08.0pre2
==============================
 -- Remove support for "ChosLoc" configuration parameter.
 -- Configuration parameters "ControlMachine", "ControlAddr", "BackupController" and "BackupAddr" replaced by an ordered list of "SlurmctldHost" records with the optional address appended to the name enclosed in parentheses. For example: "SlurmctldHost=head(12.34.56.78)". An arbitrary number of backup servers can be configured.
 -- When a pending job's state includes "UnavailableNodes" do not include the nodes in FUTURE state.
 -- Remove --immediate option from sbatch.
 -- Add infrastructure for per-job and per-step TRES parameters: tres-per-job, tres-per-node, tres-per-socket, tres-per-task, cpus-per-tres, mem-per-tres, tres-bind and tres-freq. These new parameters are not currently used, but have been added to the appropriate RPCs.
 -- Add DefCpuPerGpu and DefMemPerGpu to global and per-partition configuration parameters. Shown in scontrol/sview as "JobDefaults=...". NOTE: These options are for future use and currently have no effect.
 -- Fix for setting always the correct status on job update in mysql.
 -- Add ValidateMode configuration parameter to knl_cray.conf for static MCDRAM/NUMA configurations.
 -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033.
 -- Disable local PTY output processing when using 'srun --unbuffered'. This prevents the PTY subsystem from inserting extraneous \r characters into the output stream.
 -- Change the column name for the %U (User ID) field in squeue to 'UID'.
 -- CRAY - Add CheckGhalQuiesce to the CommunicationParameters.
 -- When a process is core dumping, avoid terminating other processes in that task group. This fixes a problem with writing out incomplete OpenMP core files.
 -- CPU frequency management enhancements: If scaling_available_frequencies file is not available, then derive values from scaling_min_freq and scaling_max_freq values. If cpuinfo_cur_freq file is not available then try to use scaling_cur_freq.
 -- Add pending jobs count to sdiag output.
 -- Fix update job function. There were some inconsistencies in the behavior that caused time limits to be modified when swapping QOS, bad permissions check for a coordinator and AllowQOS and DenyQOS were not enforced on job update.
 -- Add configuration parameters SlurmctldPrimaryOnProg and SlurmctldPrimaryOffProg, which define programs to execute when a slurmctld daemon becomes the primary server or goes from primary to backup mode.
 -- Add configuration parameter SlurmctldAddr for use with virtual IP to manage backup slurmctld daemons.
 -- Explicitly shutdown the slurmd process when instructed to reboot.
 -- Add ability to create/update partition with TRESBillingWeights through scontrol.
 -- Calculate TRES billing values at submission so that billing limits can be enforced at submission with QOS DenyOnLimit.
 -- Add node_features plugin function "node_features_p_reboot_weight()" to return the node weight to be used for a compute node that requires reboot for use (e.g. to change the NUMA mode of a KNL node).
 -- Add NodeRebootWeight parameter to knl.conf configuration file.
 -- Fix insecure handling of job requested gid field. CVE-2018-10995.
 -- Fix srun to return highest signal of any task.
 -- Completely remove "gres" field from step record. Use "tres_per_node", "tres_per_socket", etc.
 -- Add "Links" parameter to gres.conf configuration file.
 -- Force slurm_mktime() to set tm_isdst to -1 so anyone using the function doesn't forget to set it.
 -- burst_buffer.conf - Add SetExecHost flag to enable burst buffer access from the login node for interactive jobs.
 -- Append ", with requeued tasks" to job array "end" emails if any tasks in the array were requeued. This is a hint to use "sacct --duplicates" to see the whole picture of the array job.
 -- Add ResumeFailProgram slurm.conf option to specify a program that is called when a node fails to respond by ResumeTimeout.
 -- Add new job pending reason of "ReqNodeNotAvail, reserved for maintenance".
 -- Remove AdminComment += syntax from 'scontrol update job'.
 -- sched/backfill: Reset job time limit if needed for deadline scheduling.
 -- For heterogeneous job component with required nodes, explicitly exclude those nodes from all other job components.
 -- Add name of partition used to srun --test-only output (valuable for jobs submitted to multiple partitions).
 -- If MailProg is not configured and "/bin/mail" (the default) does not exist, but "/usr/bin/mail" does exist then use "/usr/bin/mail" as a default value.
 -- sdiag output now reports outgoing slurmctld message queue contents.
 -- Fix issue in performance when reading slurm conf having nodes with features.
 -- Make it so the slurmdbd's pid file gets created before initing the database.
 -- Improve escaping special characters on user commands when specifying paths.
 -- Fix directory names with special char '\' that are not handled correctly.
 -- Add salloc/sbatch/srun option of --gres-flags=disable-binding to disable filtering of CPUs with respect to generic resource locality. This option is currently required to use more CPUs than are bound to a GRES (i.e. if a GPU is bound to the CPUs on one socket, but resources on more than one socket are required to run the job). This option may permit a job to be allocated resources sooner than otherwise possible, but may result in lower job performance.
 -- SlurmDBD - Print warning if MySQL/MariaDB internal tuning is not at least half of the recommended values.
 -- Move libpmi from src/api to contribs/pmi.
 -- Add ability to specify a node reason when rebooting nodes with "scontrol reboot".
 -- Add nextstate option to "scontrol reboot" to dictate state of node after reboot.
 -- Consider "resuming" (nextstate=resume) nodes as available in backfill future scheduling and don't replace "resuming" nodes in reservations.
 -- Add the use of an xml file to help performance when using hwloc.

* Changes in Slurm 18.08.0pre1
==============================
 -- Add new burst buffer state of "teardown-fail" to indicate the burst buffer teardown operation is failing on specific buffers. This changes the numeric value of the BB_STATE_COMPLETE type.
    Any Slurm version 17.02 or 17.11 tool used to report burst buffer state information will report a state of "66" rather than "complete" for burst buffers which have been deleted, but still exist in the slurmctld daemon's tables (a very short-lived situation).
 -- Multiple backup slurmctld daemons can be configured:
    * Specify "BackupController#=" and "BackupAddr#=" to identify up to 9 backup servers.
    * Output format of "scontrol ping" and the daemon status at the end of "scontrol status" is modified to report up status of the primary and all backup servers.
    * "scontrol takeover [#]" command can now identify the SlurmctldHost index number. Default value is "1" (the first backup configured SlurmctldHost).
 -- Enable jobs with zero node count for creation and/or deletion of persistent burst buffers.
    * The partition default MinNodes configuration parameter is now 0 (previously 1 node).
    * Zero size jobs disabled for job arrays and heterogeneous jobs, but supported for salloc, sbatch and srun commands.
 -- Add "scontrol show dwstat" command to display Cray burst buffer status.
 -- Add "GetSysStatus" option to burst_buffer.conf file. For burst_buffer/cray this would indicate the location of the "dwstat" command.
 -- Add node and partition configuration options of "CpuBind" to control default task binding. Modify the scontrol to report and modify these parameters.
 -- Add "NumaCpuBind" option to knl.conf file to automatically change a node's CpuBind parameter based upon changes to a node's NUMA mode.
 -- Add sbatch "--batch" option to identify features required on batch node. For example "sbatch --batch=haswell ...".
 -- Add "BatchFeatures" field to output of "scontrol show job".
 -- Add support for "--bb" option to sbatch command.
 -- Add new SystemComment field to job data structure and database. Currently used for Burst Buffer error logs.
 -- Expand reservation "flags" field from 32 to 64 bits.
 -- Add job state flag of "SIGNALING" to avoid race condition with multiple SIGSTOP/SIGCONT signals for the same job being active at the same time.
 -- Properly handle srun --will-run option when there are jobs in COMPLETING state.
 -- Properly report who is signaling a step.
 -- Don't combine updated reservation records in sreport's reservation report.
-- node_features plugin - Add support for XOR & XAND of job constraints (node feature specifications). -- Add support for parentheses in a job's constraint specification to group like options together. For example --constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that four nodes with the features "knl", "snc4" and "flat" plus one node with the feature "haswell" are required. -- Improvements to how srun searches for the executable when using cwd. -- Now programs can be checked before execution if test_exec is set when using multi-prog option. -- Report NodeFeatures plugin configuration with scontrol and sview commands. -- Add acct_gather_profile/influxdb plugin. -- Add new job state of SO/STAGE_OUT indicating that burst buffer stage-out operation is in progress. -- Correct SLURM_NTASKS and SLURM_NPROCS environment variable for heterogeneous job step. Report values representing full allocation. -- Expand advanced reservation feature specification to support parentheses and counts of nodes with specified features. Nodes with the feature currently active will be preferred. -- Defer job signaling until prolog is completed. -- Have the primary slurmctld wait until the backup has completely shut down before taking control. -- Fix issue where unpacking job state after TRES count changed could lead to invalid reads. -- Heterogeneous job steps allocations supported with * Open MPI (with Slurm's PMI2 and PMIx plugins) and * Intel MPI (with Slurm's PMI2 plugin) -- Remove redundant function arguments from task plugins: * Remove "job_id" field from task_p_slurmd_batch_request() function. * Remove "job_id" field from task_p_slurmd_launch_request() function. * Remove "job_id" field from task_p_slurmd_reserve_resources() function. -- Change function name from node_features_p_changible_feature() to node_features_p_changeable_feature() in node_features plugin. -- Add Slurm configuration file check logic using "slurmctld -t" command.
* Changes in Slurm 17.11.14 =========================== * Changes in Slurm 17.11.13-2 ============================= -- Fix Perl build for 32-bit systems. * Changes in Slurm 17.11.13 =========================== -- Add mitigation for a potential heap overflow on 32-bit systems in xmalloc. CVE-2019-6438. * Changes in Slurm 17.11.12 =========================== -- Fix regression in 17.11.10 that caused dbd messages to not be queued up when the dbd was down. * Changes in Slurm 17.11.11 =========================== -- Correctly initialize variable in env_array_user_default(). -- Correctly handle scenarios where a partition's MaxMemPerCPU is less than a job's --mem-per-cpu and also -c is greater than 1. * Changes in Slurm 17.11.10 =========================== -- Move priority_sort_part_tier from slurmctld to libslurm to make it possible to run the regression tests 24.* without changing that code since it links directly to the priority plugin where that function isn't defined. -- Fix issue where job time limits can increase to max walltime when updating a job with scontrol. -- Fix invalid protocol_version manipulation on big endian platforms causing srun and sattach to fail. -- Fix for QOS, Reservation and Alias env variables in srun. -- mpi/pmi2 - Backport 6a702158b49c4 from 18.08 to avoid dangerous detached thread. -- When allowing heterogeneous steps make sure we copy all the options to avoid copying strings that may be overwritten. -- Print correctly when sh5util finds an empty file. -- Fix sh5util to not seg fault on exit. -- Fix sh5util to check correctly for H5free_memory. -- Adjust OOM monitoring function in task/cgroup to prevent problems in regression suite from leaked file descriptors. -- Fix issue with gres when defined with a type and no count (i.e. gres=gpu/tesla) it would get a count of 0. -- Allow sstat to talk to slurmd's that are new in protocol version. -- Permit database names over 33 characters in accounting_storage/mysql.
-- Fix negative values when profiling. -- Fix srun segfault caused by invalid memory reads on the env. -- Fix segfault on job arrays when starting controller without dbd up. -- Fix pmi2 to build with gcc 8.0+. -- Fix proper alignment of clauses when determining if more nodes are needed for an allocation. -- Fix race condition when canceling a federation job that just started running. -- Prevent extra resources from being allocated when combining certain flags. -- Fix problem in task/affinity plugin that can lead to slurmd fatal()'ing when using --hint=nomultithread. -- Fix left over socket file when step is ending and using pmi2 with %n or %h in the spool dir. -- Fix incorrect spacing for PartitionName lines in 'scontrol write config'. -- Fix sacct to not print huge reserve times when the job was never eligible. -- burst_buffer/cray - Add missing locks around assoc_mgr when timing out a burst buffer. -- burst_buffer/cray - Update burst buffers when an association or qos is removed from the system. -- If failed over to a backup controller, ensure the agent thread is launched to handle deferred tasks. -- Fix correct job CPU count allocated. -- Protect against sending to the slurmdbd if the connection has gone away. -- Fix checking missing return codes when unpacking structures. -- Fix slurm.spec-legacy including slurmsmwd -- More explicit error message when cgroup oom-kill events detected. -- When updating an association and are unable to find parent association initialize old fairshare association pointer correctly. -- Wrap slurm_cond_signal() calls with mutexes where needed. -- Fix correct timeout with resends in slurm_send_only_node_msg. -- Fix pam_slurm_adopt to honor action_adopt_failure. -- job_submit/lua - expose argc/argv options through lua interface. * Changes in Slurm 17.11.9-2 ============================ -- Fix printing of node state "drain + reboot" (and other node state flags). -- Fix invalid read (segfault) when sorting multi-partition jobs. 
-- Move several new error() messages to debug() to keep them out of users' srun output. * Changes in Slurm 17.11.9 ========================== -- Fix segfault in slurmctld when a job's node bitmap is NULL during a scheduling cycle. Primarily caused by EnforcePartLimits=ALL. -- Remove erroneous unlock in acct_gather_energy/ipmi. -- Enable support for hwloc version 2.0.1. -- Fix 'srun -q' (--qos) option handling. -- Fix socket communication issue that can lead to lost task completion messages, which will cause a permanently stuck srun process. -- Handle creation of TMPDIR if environment variable is set or changed in a task prolog script. -- Avoid node layout fragmentation if running with a fixed CPU count but without Sockets and CoresPerSocket defined. -- burst_buffer/cray - Fix datawarp swap default pool overriding jobdw. -- Fix incorrect job priority assignment for multi-partition job with different PriorityTier settings on the partitions. -- Fix sinfo to print correct node state. * Changes in Slurm 17.11.8 ========================== -- Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path. -- Do not allocate nodes that were marked down due to the node not responding by ResumeTimeout. -- task/cray plugin - search for "mems" cgroup information in the file "cpuset.mems" then fall back to the file "mems". -- Fix ipmi profile debug uninitialized variable. -- Improve detection of Lua package on older RHEL distributions. -- PMIx: fixed the direct connect inline msg sending. -- MYSQL: Fix issue not handling all fields when loading an archive dump. -- Allow a job_submit plugin to change the admin_comment field during job_submit_plugin_modify(). -- job_submit/lua - fix access into reservation table. -- MySQL - Prevent deadlock caused by archive logic locking reads. -- Don't enforce MaxQueryTimeRange when requesting specific jobs. -- Modify --test-only logic to properly support jobs submitted to more than one partition.
-- Prevent slurmctld from aborting when attempting to set non-existing qos as def_qos_id. -- Add new job dependency type of "afterburstbuffer". The pending job will be delayed until the first job completes execution and its burst buffer stage-out is completed. -- Reorder proctrack/task plugin load in the slurmstepd to match that of slurmd and avoid race condition calling task before proctrack can introduce. -- Prevent reboot of a busy KNL node when requesting inactive features. -- Revert to previous behavior when requesting memory per cpu/node introduced in 17.11.7. -- Fix to reinitialize previously adjusted job members to their original value when validating the job memory in multi-partition requests. -- Fix _step_signal() from always returning SLURM_SUCCESS. -- Combine active and available node feature change logs on one line rather than one line per node for performance reasons. -- Prevent occasionally leaking freezer cgroups. -- Fix potential segfault when closing the mpi/pmi2 plugin. -- Fix issues with --exclusive=[user|mcs] to work correctly with preemption or when job requests a specific list of hosts. -- Make code compile with hdf5 1.10.2+ -- mpi/pmix: Fixed the collectives canceling. -- SlurmDBD: improve error message handling on archive load failure. -- Fix incorrect locking when deleting reservations. -- Fix incorrect locking when setting up the power save module. -- Fix setting format output length for squeue when showing array jobs. -- Add xstrstr function. -- Fix printing out of --hint options in sbatch, salloc --help. -- Prevent possible divide by zero in _validate_time_limit(). -- Add Delegate=yes to the slurmd.service file to prevent systemd from interfering with the jobs' cgroup hierarchies. -- Change the backlog argument to the listen() syscall within srun to 4096 to match elsewhere in the code, and avoid communication problems at scale.
* Changes in Slurm 17.11.7 ========================== -- Fix for possible slurmctld daemon abort with NULL pointer. -- Fix different issues when requesting memory per cpu/node. -- PMIx - override default paths at configure time if --with-pmix is used. -- Have sprio display jobs before eligible time when PriorityFlags=ACCRUE_ALWAYS is set. -- Make sure locks are always in place when calling _post_qos_list(). -- Notify srun and ctld when unkillable stepd exits. -- Fix slurmstepd deadlock in stepd cleanup caused by race condition in the jobacct_gather fini() interfaces introduced in 17.11.6. -- Fix slurmstepd deadlock in PMIx startup. -- task/cgroup - fix invalid free() if the hwloc library does not return a string as expected. -- Fix insecure handling of job requested gid field. CVE-2018-10995. -- Add --without x11 option to rpmbuild in slurm.spec. * Changes in Slurm 17.11.6 ========================== -- CRAY - Add slurmsmwd to the contribs/cray dir. -- sview - fix crash when closing any search dialog. -- Fix initialization of variable in stepd when using native x11. -- Fix reading slurm_io_init_msg to handle partial messages. -- Fix scontrol create res segfault when wrong user/account parameters given. -- Fix documentation for sacct on parameter -X (--allocations) -- Change TRES Weights debug messages to debug3. -- FreeBSD - assorted fixes to restore build. -- Fix for not tracking environment variables from unrelated different jobs. -- PMIX - Added the direct connect authentication. When upgrading this may cause issues with jobs using pmix starting on mixed slurmstepd versions where some are less than 17.11.6. -- Prevent the backup slurmctld from losing the active/available node features list on takeover. -- Add documentation for fix IDLE*+POWER due to capmc stuck in Cray systems. -- Fix missing mutex unlock when prolog is failing on a node, leading to a hung slurmd. -- Fix locking around Cray CCM prolog/epilog. -- Add missing fed_mgr read locks. 
-- Fix issue incorrectly setting a job time_start to 0 while requeueing. -- smail - remove stray '-s' from mail subject line. -- srun - prevent segfault if ClusterName setting is unset but SLURM_WORKING_CLUSTER environment variable is defined. -- In configurator.html web pages change default configuration from task/none to task/affinity plugin and from select/linear plugin to select/cons_res plus CR_Core. -- Allow jobs to run beyond a FLEX reservation end time. -- Fix problem with job state_reason being wrongly set to Reservation. -- Prevent bit_ffs() from returning value out of bitmap range. -- Improve performance of 'squeue -u' when PrivateData=jobs is enabled. -- Make UnavailableNodes value in job reason be correct for each job. -- Fix 'squeue -o %s' on Cray systems. -- Fix incorrect error thrown when cancelling part of a job array. -- Fix error code and scheduling problem for --exclusive=[user|mcs]. -- Fix build when lz4 is in a non-standard location. -- Be able to force power_down of cloud node even if in power_save state. -- Allow cloud nodes to be recognized in Slurm when booted out of band. -- Fix race condition in _pack_job_gres() when it is called multiple times. -- Increase duration of "sleep" command used to keep extern step alive. -- Remove unsafe usage of pthread_cancel in slurmstepd that can lead to deadlock in glibc. -- Fix total TRES Billing on partitions. -- Don't tear down a BB if a node fails and --no-kill or resize of a job happens. -- Remove unsafe usage of pthread_cancel in pmix plugin that can lead to deadlock in glibc. -- Fix fatal in controller when loading completed trigger -- Ignore reservation overlap at submission time. -- GRES type model and QOS limits documentation added -- slurmd - fix ABRT on SIGINT after reconfigure with MemSpecLimit set. -- PMIx - move two error messages on retry to debug level, and only display the error after the retry count has been exceeded. -- Increase number of tries when sending responses to srun.
-- Fix checkpointing requeued/completing jobs in a bad state which caused a segfault on restart. -- Fix srun on ppc64 platforms. -- Prevent slurmd from starting steps if the Prolog returns an error when using PrologFlags=alloc. -- priority/multifactor - prevent segfault running sprio if a partition has just been deleted and PriorityFlags=CALCULATE_RUNNING is turned on. -- job_submit/lua - add ESLURM_INVALID_TIME_LIMIT return code value. -- job_submit/lua - print an error if the script calls log.user in job_modify() instead of returning it to the next submitted job erroneously. -- select/linear - handle job resize correctly. -- select/cons_res - improve handling of --cores-per-socket requests. * Changes in Slurm 17.11.5 ========================== -- Fix cloud nodes getting stuck in DOWN+POWER_UP+NO_RESPOND state after not responding by ResumeTimeout. -- Add job's array_task_cnt and user_name along with partitions [max|def]_mem_per_[cpu|node], max_cpus_per_node, and max_share with the SHARED_FORCE definition to the job_submit/lua plugin. -- srun - fix for SLURM_JOB_NUM_NODES env variable assignment. -- sacctmgr - fix runaway jobs identification. -- Fix to always set the correct status on job update in mysql. -- Fix issue if running with an association manager cache (slurmdbd was down when slurmctld was started) you could lose QOS usage information. -- CRAY - Fix spec file to work correctly. -- Set scontrol exit code to 1 if attempting to update a node state to DRAIN or DOWN without specifying a reason. -- Fix race condition when running with an association manager cache (slurmdbd was down when slurmctld was started). -- Print out missing SLURM_PERSIST_INIT slurmdbd message type. -- Fix two build errors related to use of the O_CLOEXEC flag with older glibc. -- Add Google Cloud Platform integration scripts into contribs directory. -- Fix minor potential memory leak in backfill plugin. -- Add missing node flags (maint/power/etc) to node states.
-- Fix issue where job time limits may end up at 1 minute when using the NoReserve flag on their QOS. -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033. -- Soften messages about best_fit topology to debug2 to avoid alarm. -- Fix issue in sreport reservation utilization report to handle more allocated time than 100% (Flex reservations). -- When a job is requesting a Flex reservation prefer the reservation's nodes over any other nodes. * Changes in Slurm 17.11.4 ========================== -- Add fatal_abort() function to be able to get core dumps if we hit an "impossible" edge case. -- Link slurmd against all libraries that slurmstepd links to. -- Fix limits enforce order when they're set at partition and other levels. -- Add slurm_load_single_node() function to the Perl API. -- slurm.spec - change dependency for --with lua to use pkgconfig. -- Fix small memory leaks in node_features plugins on reconfigure. -- slurmdbd - only permit requests to update resources from operators or administrators. -- Fix handling of partial writes in io_init_msg_write_to_fd() which can lead to job step launch failure under higher cluster loads. -- MYSQL - Fix to handle quotes in a given work_dir of a job. -- sbcast - fix a race condition that leads to "Unspecified error". -- Log that support for the ChosLoc configuration parameter will end in Slurm version 18.08. -- Fix backfill performance issue where bf_min_prio_reserve was not respected. -- Fix MaxQueryTimeRange checks. -- Print MaxQueryTimeRange in "sacctmgr show config". -- Correctly check return codes when creating a step to check if needing to wait to retry or not. -- Fix issue where a job could be denied by Reason=MaxMemPerLimit when not requesting any tasks. -- In perl tools, fix for regexp that caused extra incorrectly shown results. -- Add some extra locks in fed_mgr to be extra safe. -- Minor memory leak fixes in the fed_mgr on slurmctld shutdown. 
-- Make sreport job reports also report duplicate jobs correctly. -- Fix issues restoring certain Partition configuration elements, especially when ReconfigFlags=KeepPartInfo is enabled. -- Don't add TRES whose value is NO_VAL64 when building string line. -- Fix removing array jobs from hash in slurmctld. -- Print out missing user messages from jobsubmit plugin when srun/salloc are waiting for an allocation. -- Handle --clusters=all as case insensitive. -- Only check requested clusters in federation when using --test-only submission option. -- In the federation, make it so you can cancel stranded sibling jobs. -- Silence an error from PSS memory stat collection process. -- Requeue jobs allocated to nodes requested to DRAIN or FAIL if nodes are POWER_SAVE or POWER_UP, preventing jobs from starting on NHC-failed nodes. -- Make MAINT and OVERLAP reservation flags order agnostic on overlap test. -- Preserve node features when slurmctld daemons reconfigured including active and available KNL features. -- Prevent creation of multiple io_timeout threads within srun, which can lead to fatal() messages when those unexpected and additional mutexes are destroyed when srun shuts down. -- burst_buffer/cray - Prevent use of "#DW create_persistent" and "#DW destroy_persistent" directives available in Cray CLE6.0UP06. This will be supported in Slurm version 18.08. Use "#BB" directives until then. -- Fix task/cgroup affinity to behave correctly. -- FreeBSD - fix build on systems built with WITHOUT_KERBEROS. -- Fix to restore pn_min_memory calculated result to correctly enforce MaxMemPerCPU setting on a partition when the job uses --mem. -- slurmdbd - prevent infinite loop if a QOS is set to preempt itself. -- Fix issue with log rotation for slurmstepd processes. * Changes in Slurm 17.11.3-2 ========================== -- Revert node_features changes in 17.11.3 that lead to various segfaults on slurmctld startup.
* Changes in Slurm 17.11.3 ========================== -- Send SIG_UME correctly to a step. -- Sort sreport's reservation report by cluster, time_start, resv_name instead of cluster, resv_name, time_start. -- Avoid setting node in COMPLETING state indefinitely if the job initiating the node reboot is cancelled while the reboot is in progress. -- Scheduling fix for changing node features without any NodeFeatures plugins. -- Improve logic when summarizing job arrays mail notifications. -- Add scontrol -F/--future option to display nodes in FUTURE state. -- Fix REASONABLE_BUF_SIZE to actually be 3/4 of MAX_BUF_SIZE. -- When a job array is preempting make it so tasks in the array don't wait to preempt other possible jobs. -- Change free_buffer to FREE_NULL_BUFFER to prevent possible double free in slurmstepd. -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld reconfigured. -- node_feature/knl_cray - Fix memory leak that can occur during normal operation. -- Fix srun environment variables for --prolog script. -- Fix job array dependency with "aftercorr" option and some task arrays in the first job fail. This fix lets all task array elements that can run proceed rather than stopping all subsequent task array elements. -- Fix potential deadlock in the slurmctld when using list_for_each. -- Fix for possible memory corruption in srun when running heterogeneous job steps. -- Fix output file containing "%t" (task ID) for heterogeneous job step to be based upon global task ID rather than task ID for that component of the heterogeneous job step. -- MYSQL - Fix potential abort when attempting to make an account a parent of itself. -- Fix potentially uninitialized variable in slurmctld.
-- MYSQL - Fix issue for multi-dimensional machines when using sacct to find jobs that ran on specific nodes. -- Reject --acctg-freq at submit if invalid. -- Added info string on sh5util when deleting an empty file. -- Correct dragonfly topology support when job allocation specifies desired switch count. -- Fix minor memory leak on an sbcast error path. -- Fix issues when starting the backup slurmdbd. -- Revert uid check when requesting a jobid from a pid. -- task/cgroup - add support to detect OOM_KILL cgroup events. -- Fix whole node allocation cpu counts when --hint=nomultithread. -- Allow execution of task prolog/epilog when uid has access rights by a secondary group id. -- Validate command existence on the srun *[pro|epi]log options if LaunchParameter test_exec is set. -- Fix potential memory leak if clean starting and the TRES didn't change from when last started. -- Fix for association MaxWall enforcement when none is given at submission. -- Add a job's allocated licenses to the [Pro|Epi]logSlurmctld. -- burst_buffer/cray: Attempts by job to create persistent burst buffer when one already exists owned by a different user will be logged and the job held. -- CRAY - Remove race in the core_spec where we add the slurmstepd to the job container where if the step was canceled would also cancel the stepd erroneously. -- Make sure the slurmstepd blocks signals like SIGTERM correctly. -- SPANK - When slurm_spank_init_post_opt() fails return error correctly. -- When revoking a sibling job in the federation we want to send a start message before purging the job record to get the uid of the revoked job. -- Make JobAcctGatherParams options case-insensitive. Previously, UsePss was the only correct capitalization; UsePSS or usepss were silently ignored. -- Prevent pthread_atfork handlers from being added unnecessarily after 'scontrol reconfigure', which can eventually lead to a crash if too many handlers have been registered.
-- Better debug messages when MaxSubmitJobs is hit. -- Docs - update squeue man page to describe all possible job states. -- Prevent orphaned step_extern steps when a job is cancelled while the prolog is still running. * Changes in Slurm 17.11.2 ========================== -- jobcomp/elasticsearch - append Content-Type to the HTTP header. -- MYSQL - Fix potential abort of slurmdbd when job has no TRES. -- Add advanced reservation flag of "REPLACE_DOWN" to replace DOWN or DRAINED nodes. -- slurm.spec-legacy - add missing libslurmfull.so to slurm.files. -- Fix squeue job ID filtering for pending job array records. -- Fix potential deadlock in _run_prog() in power save code. -- MYSQL - Add dynamic_offset in the database to force range for auto increment ids for the tres_table. -- MYSQL - Fix fallout from MySQL auto increment bug, see RELEASE_NOTES, only affects current 17.11 users tracking licenses or GRES in the database. -- Refactor logging logic to avoid possible memory corruption on non-x86 architectures. -- Fix memory leak when getting jobs from the slurmdbd. -- Fix incorrect logic behind MemorySwappiness, and only set the value when specified in the configuration. * Changes in Slurm 17.11.1-2 ============================ -- MYSQL - Make index for pack_job_id * Changes in Slurm 17.11.1 ========================== -- Fix --with-shared-libslurm option to work correctly. -- Make it so only daemons log errors on configuration option duplicates. -- Fix for ConstrainDevices=yes to work correctly. -- Fix to purge old jobs using burst buffer if slurmctld daemon restarted after the job's burst buffer work was already completed. -- Make logging prefix for slurmstepd to happen as soon as possible. -- mpi/pmix: Fix the job registration for the PMIx v2.1. -- Fix uid check for signaling a step with anything but SIGKILL. -- Return ESLURM_TRANSITION_STATE_NO_UPDATE instead of EAGAIN when trying to signal a step that is still running a prolog. 
-- Update Cray slurm_playbook.yaml with latest recommended version. -- Only say a prolog is done running after the extern step is launched. -- Wait to start a batch step until the prolog and extern step are fully ran/launched. Only matters if running with PrologFlags=[contain|alloc]. -- Truncate a range for SlurmctldPort to FD_SETSIZE elements and throw an error, otherwise network traffic may be lost due to poll() not detecting traffic. -- Fix for srun --pack-group option that can reuse/corrupt memory. -- Fix handling ultra long hostlists in a hostfile. -- X11: fix xauth regex to handle '-' in hostnames again. -- Fix potential node reboot timeout problem for "scontrol reboot" command. -- Add ability for squeue to sort jobs by submit time. -- CRAY - Switch to standard pid files on Cray systems. -- Update jobcomp records on duplicate inserts. -- If unrecognized configuration file option found then print an appropriate fatal error message rather than relying upon random errno value. -- Initialize job_desc_msg_t's instead of just memset'ing them. -- Fix divide by zero when job requests no tasks and more memory than MaxMemPer{CPU|NODE}. -- Avoid changing Slurm internal errno on syslog() failures. -- BB - Only launch dependent jobs after the burst buffer is staged-out completely instead of right after the parent job finishes. -- node_features/knl_generic - If plugin can not fully load then do not spawn a background pthread (which will fail with invalid memory reference). -- Don't set the next jobid to give out to the highest jobid in the system on controller startup. Just use the checkpointed next use jobid. -- Docs - add Slurm/PMIx and OpenMPI build notes to the mpi_guide page. -- Add lustre_no_flush option to LaunchParameters for Native Cray systems. -- Fix rpmbuild issue with rpm 4.13+ / Fedora 25+. -- sacct - fix the display for the NNodes field when using the --units option. -- Prevent possible double-xfree on a buffer in stepd_completion. 
-- Fix for record job state on successful allocation but failed reply message. -- Fill in the user_name field for batch jobs if not sent by the slurmctld (which is the default behavior if LaunchParameters=send_gids is not enabled). This prevents job launch problems for sites using UsePAM=1. -- Handle syncing federated jobs that ran on non-origin clusters and were cancelled while the origin cluster was down. -- Fix accessing variable outside of lock. -- slurm.spec: move libpmi to a separate package to solve a conflict with the version provided by PMIx. This will require a separate change to PMIx as well. -- X11 forwarding: change xauth handling to use hostname/unix:display format, rather than localhost:display. -- mpi/pmix - Fix warning if not compiling with debug. * Changes in Slurm 17.11.0 ========================== -- Fix documentation for MaxQueryTimeRange option in slurmdbd.conf. -- Avoid srun abort trying to run on heterogeneous job component that has ended. -- Add SLURM_PACK_JOB_ID,SLURM_PACK_JOB_OFFSET to PrologSlurmctld and EpilogSlurmctld environment. -- Treat ":" in #SBATCH arguments as fatal error. The "#SBATCH packjob" syntax must be used instead. -- job_submit/lua plugin: expose pack_job fields to get. -- Prevent scheduling deadlock with multiple components of heterogeneous job in different partitions (i.e. one heterogeneous job component is higher priority in one partition and another component is lower priority in a different partition). -- Fix for heterogeneous job starvation bug. -- Fix some slurmctld memory leaks. -- Add SLURM_PACK_JOB_NODELIST to PrologSlurmctld and EpilogSlurmctld environment. -- If PrologSlurmctld fails for pack job leader then requeue or kill all components of the job. -- Fix for multiple --pack-group srun arguments given out of order. -- Update slurm.conf(5) man page with updated example logrotate script. -- Add SchedulerParameters=whole_pack configuration parameter.
If set, then hold, release and cancel operations on any component of a heterogeneous job will be applied to all components. -- Handle FQDNs in xauth cookies for x11 display forwarding properly. -- For heterogeneous job steps, the srun --open-mode option default value will be set to "append". -- Pack job scheduling list not being cleared between runs of the backfill scheduler resulted in various anomalies. -- Fix backward compatibility for pmix version < 1.1.5. -- Fix use-after-free that can lead to slurmstepd segfaulting when setting ulimit values. -- Add heterogeneous job start data to sdiag output. -- X11 forwarding - handle systems with X11UseLocalhost=no set in sshd_config. -- Fix potential issue with missing symbols in gres plugins. -- Ignore querying clusters in federation that are down from status commands. -- Base federated jobs off of origin job and not the local cluster in API. -- Remove erroneous double '-' on rpath for libslurmfull. -- Remove version from libslurmfull and move it to $LIBDIR/slurm since the ABI could change from one version to the other. -- Fix unused wall time for reservations. -- Convert old reservation records to insert unused wall into the rows. -- slurm.spec: further restructuring and improvements. -- Allow nodes state to be updated between FAIL and DRAIN. -- x11 forwarding: handle build with alternate location for libssh2. * Changes in Slurm 17.11.0rc3 ============================== -- Fix extern step to wait until launched before allowing job to start. -- Add missing locks around figuring out TRES when clean starting the slurmctld. -- Cray modulefile: avoid removing /usr/bin from path on module unload. -- Make reoccurring reservations show up in the database. -- Adjust related resources (cpus, tasks, gres, mem, etc.) when updating NumNodes with scontrol. -- Don't initialize MPI plugins for batch or extern steps. -- slurm.spec - do not install a slurm.conf file under /etc/ld.so.conf.d.
-- X11 forwarding - fix keepalive message generation code. -- If heterogeneous job step is unable to acquire MPI reserved ports then avoid referencing NULL pointer. Retry assigning ports ONLY for non-heterogeneous job steps. -- If any acct_gather_*_init fails fatal instead of error and keep going. -- launch/slurm plugin - Avoid using global variable for heterogeneous job steps, which could corrupt memory. * Changes in Slurm 17.11.0rc2 ============================== -- Prevent slurmctld abort with NodeFeatures=knl_cray and non-KNL nodes lacking any configured features. -- The --cpu_bind and --mem_bind options have been renamed to --cpu-bind and --mem-bind for consistency with the rest of Slurm's options. Both old and new syntaxes are supported for now. -- Add slurmdb_connection_commit to the slurmdb api to commit when needed. -- Add the federation api's to the slurmdb.h file. -- Add job functions to the db_api. -- Fix sacct to always use the db_api instead of sometimes calling functions directly. -- Fix sacctmgr to always use the db_api instead of sometimes calling functions directly. -- Fix sreport to always use the db_api instead of sometimes calling functions directly. -- Make global uid to the db_api to minimize calls to getuid(). -- Add support for HWLOC version 2.0. -- Added more validation logic for updates to node features. -- Added node_features_p_node_update_valid() function to node_features plugin. -- If a job is held due to bad constraints and a node's features change then test the job again to see if can run with the new features. -- Added node_features_p_changible_feature() function to node_features plugin. -- Avoid rebooting a node if a job's requested feature is not under the control of the node_features plugin and is not currently active. -- node_features/knl_generic plugin: Do not clear a node's non-KNL features specified in slurm.conf. 
 -- Added SchedulerParameters configuration option "disable_hetero_steps" to disable job steps that span multiple components of a heterogeneous job. Disabled by default except with mpi/none plugin. This limitation to be removed in Slurm version 18.08.

* Changes in Slurm 17.11.0rc1
==============================
 -- Added the following jobcomp/script environment variables: CLUSTER, DEPENDENCY, DERIVED_EC, EXITCODE, GROUPNAME, QOS, RESERVATION, USERNAME. The format of LIMIT (job time limit) has been modified to D-HH:MM:SS.
 -- Fix QOS usage factor applying to individual TRES run minute usage.
 -- Print numbers using exponential format if required to fit in allocated field width. The sacctmgr and sshare commands are impacted.
 -- Make it so a backup DBD doesn't attempt to create database tables and relies on the primary to do so.
 -- By default have Slurm dynamically link to libslurm.so instead of static linking. If static linking is desired configure with --without-shared-libslurm.
 -- Change --workdir in sbatch to be --chdir as in all other commands (salloc, srun).
 -- Add WorkDir to the job record in the database.
 -- Make the UsageFactor of a QOS work when a qos has the nodecay flag.
 -- Add MaxQueryTimeRange option to slurmdbd.conf to limit accounting query ranges when fetching job records.
 -- Add LaunchParameters=batch_step_set_cpu_freq to allow the setting of the cpu frequency on the batch step.
 -- CRAY - Fix statically linked applications to CRAY's PMI.
 -- Fix - Raise an error back to the user when trying to update currently unsupported core-based reservations.
 -- Do not print TmpDisk space as part of 'slurmd -C' line.
 -- Fix to test MaxMemPerCPU/Node partition limits when scheduling, previously only checked on submit.
 -- Work for heterogeneous job support (complete solution in v17.11):
    * Set SLURM_PROCID environment variable to reflect global task rank (needed by MPI).
    * Set SLURM_NTASKS environment variable to reflect global task count (needed by MPI).
    * In srun, if only some steps are allocated and one step allocation fails, then delete all allocated steps.
    * Get SPANK plugins working with heterogeneous jobs. The spank_init_post_opt() function is executed once per job component.
    * Modify sbcast command and srun's --bcast option to support heterogeneous jobs.
    * Set more environment variables for MPI: SLURM_GTIDS and SLURM_NODEID.
    * Prevent a heterogeneous job allocation from including the same nodes in multiple components (required by MPI jobs spanning components).
    * Modify step create logic so that all components of a heterogeneous job launched by a single srun command have the same step ID value.
 -- Modify output of "--mpi=list" to avoid duplicates for version numbers in mpi/pmix plugin names.
 -- Allow nodes to be rebooted while in a maintenance reservation.
 -- Show nodes as down even when nodes are in a maintenance reservation.
 -- Harden the slurmctld HA stack to mitigate certain split-brain issues.
 -- Work for heterogeneous job support (complete solution in v17.11):
    * Add burst buffer support.
    * Remove srun's --mpi-combine option (always combined).
    * Add SchedulerParameters configuration option "enable_hetero_steps" to enable job steps that span multiple components of a heterogeneous job. Disabled by default as most MPI implementations and Slurm configurations are not currently supported. Limitation to be removed in Slurm version 18.08.
    * Synchronize application launch across multiple components with debugger.
    * Modify slurm_kill_job_step() to cancel all components of a heterogeneous job step (used by MPI).
    * Set SLURM_JOB_NUM_NODES environment variable as needed by MVAPICH.
    * Base time limit upon the time that the latest job component is available (after all nodes in all components booted and ready for use).
 -- Add cluster name to smail tool email header.
 -- Speedup arbitrary distribution algorithm.
 -- Modify "srun --mpi=list" output to match valid option input by removing the "mpi/" prefix on each line of output.
 -- Automatically set the reservation's partition for the job if not the cluster default.
 -- mpi/pmi2 plugin - vestigial pointer could be referenced at shutdown with invalid memory reference resulting.
 -- Fix _is_gres_cnt_zero() to return false for improper input string.
 -- Cleanup all pthread_create calls and replace with new slurm_thread_create macro.
 -- Removed obsolete MPI plugins. Remaining options are openmpi, pmi2, pmix.
 -- Removed obsolete checkpoint/poe plugin.
 -- Process spank environment variable options before processing spank command line options. Spank plugins should be able to handle option callbacks being called multiple times.
 -- Add support for specialized cores with task/affinity plugin (previously only supported with task/cgroup plugin).
 -- Add "TaskPluginParam=SlurmdOffSpec" option that will prevent the Slurm compute node daemons (slurmd and slurmstepd) from executing on specialized cores.
 -- CRAY - Make native mode default, use --disable-native-cray to use ALPS instead of native Slurm.
 -- Add ability to prevent suspension of some count of nodes in a specified range using the SuspendExcNodes configuration parameter.
 -- Add SLURM_WCKEY to PrologSlurmctld and EpilogSlurmctld environment.
 -- Return user response string in response to successful job allocation request not only on failure. Set in LUA using function 'slurm.user_msg("STRING")'.
 -- Add 'scontrol write batch_script ' command to retrieve the batch script for a given job.
 -- Remove option to display the batch script as part of 'scontrol show job'.
 -- On native Cray system the configured RebootProgram is executed on the head node by the slurmctld daemon rather than by the slurmd daemons on the compute nodes. The "capmc_resume" program from "contribs/cray" can be used.
 -- Modify "scontrol top" command to accept a comma separated list of job IDs as an argument rather than a single job ID.
 -- Add MemorySwappiness value to cgroup.conf.
 -- Add new "billing" TRES which allows jobs to be limited based on the job's billable TRES calculated by the job's partition's TRESBillingWeights.
 -- sbatch - force line-buffered output so 'sbatch -W' returns the jobid over a piped output immediately.
 -- Regular user use of "scontrol top" command is now disabled. Use the configuration parameter "SchedulerParameters=enable_user_top" to enable that functionality. The configuration parameter "SchedulerParameters=disable_user_top" will be silently ignored.
 -- Add -TALL to sreport.
 -- Removed unused SlurmdPlugstack option and associated framework.
 -- Correct logic for line continuation in srun --multi-prog file.
 -- Add DBD Agent queue size to sdiag output.
 -- Add running job count to sdiag output.
 -- Print unix timestamps next to ASCII timestamps in sdiag output.
 -- In a job allocation spanning KNL and non-KNL nodes and requiring a reboot, do not attempt to set default NUMA or MCDRAM modes on non-KNL nodes.
 -- Change default to let pending jobs run outside of reservation after reservation is gone to put jobs in held state. Added NO_HOLD_JOBS_AFTER_END reservation flag to use old default.
 -- When creating a reservation, validate the CoreCnt specification matches the number of nodes listed.
 -- When creating a reservation, correct logic to ignore job allocations on request.
 -- Deprecate BLCR plugin, and do not build by default.
 -- Change sreport report titles from "Use" to "Usage".

* Changes in Slurm 17.11.0pre2
==============================
 -- Initial work for heterogeneous job support (complete solution in v17.11):
    * Modified salloc, sbatch and srun commands to parse command line, job script and environment variables to recognize requests for heterogeneous jobs.
      Same commands also modified to set environment variables describing each component of the heterogeneous job.
    * Modified job allocate, batch job submit and job "will-run" requests to pass a list of job specifications and get a list of responses.
    * Modify slurmctld daemon to process a heterogeneous job request and create multiple job records as needed.
    * Added new fields to job record: pack_job_id, pack_job_offset and pack_job_set (set of job IDs). Added to slurmctld state save/restore logic and job information reported.
    * Display new job fields in "scontrol show job" output.
    * Modify squeue command to display heterogeneous job records using "#+#" format. The squeue --job=# output lists all components of a heterogeneous job.
    * Modify scancel logic to cancel all components of a heterogeneous job with a single request/RPC.
    * Configuration parameter DebugFlags value of "HeteroJobs" added.
    * Job requeue and suspend/resume modified to operate on all components of a heterogeneous job with a single request/RPC.
    * New web page added to describe heterogeneous jobs.
    * Descriptions of new API added to man pages.
    * Modified email notifications to only operate on the first job component.
    * Purge heterogeneous job records at the same time and not by individual components.
    * Modified logic for heterogeneous jobs submitted to multiple clusters ("--clusters=...") so the job will be routed to the cluster that is expected to start all components earliest.
    * Modified srun to create multiple job steps for heterogeneous job allocations.
    * Modified launch plugin to accept a pointer to job step options structure rather than work from a single/common data structure.
 -- Improve backfill scheduling algorithm with respect to starting jobs as soon as possible while avoiding advanced reservations.
 -- Add URG as an option to 'scancel --signal'.
 -- Check if the buffer returned from slurm_persist_msg_pack() isn't NULL.
 -- Modify all daemons to re-open log files on receipt of SIGUSR2 signal.
    This is much faster than using SIGHUP to re-read the configuration file and rebuild various tables.
 -- Add PrivateData=events configuration parameter.
 -- Work for heterogeneous job support (complete solution in v17.11):
    * Add pointer to job option structure to job_step_create_allocation() function used by srun.
    * Parallelize task launch for heterogeneous job allocations (initial work).
    * Make packjobid, packjoboffset, and packjobidset fields available in squeue output.
    * Modify smap command to display heterogeneous job records using "#+#" format.
    * Add srun --pack-group and --mpi-combine options to control job step launch behaviour (not fully implemented).
    * Add pack job component ID to srun --label output (e.g. "P0 1:" for job component 0 and task 1).
    * jobcomp/elasticsearch: Add pack_job_id and pack_job_offset fields.
    * sview: Modified to display pack job information.
    * Major re-write of task state container logic to support a list of containers rather than one container per srun command.
    * Add some regression tests.
    * Add srun pack job environment variables when performing job allocation.
 -- Set Reason=dependency over Reason=JobArrayTaskLimit for pending jobs.
 -- Add slurm.conf configuration parameters SlurmctldSyslogDebug and SlurmdSyslogDebug to control which messages from the slurmctld and slurmd daemons get written to syslog.
 -- Add slurmdbd.conf configuration parameter DebugLevelSyslog to control which messages from the slurmdbd daemon get written to syslog.
 -- Fix handling of GroupUpdateForce option.
 -- Work for heterogeneous job support (complete solution in v17.11):
    * Add support to sched/backfill for concurrent allocation of all pack job components including support of --time-min option.
    * Defer initiation of a heterogeneous job until all components can be started at the same time, taking into consideration association and QOS limits for the job as a whole.
    * Perform limit check on heterogeneous job as a whole at submit time to reject jobs that will never be able to run.
    * Add pack_job_id and pack_job_offset to accounting database.
    * Modified sacct to accept pack job ID specification using "#+#" notation.
    * Modified sstat to accept pack job ID specification using "#+#" notation.
 -- Clear a job's "wait reason" value of "BeginTime" after that time has passed. Previously a reason of "BeginTime" could be reported long after the job's requested begin time had passed.
 -- Split group_info in slurm_ctl_conf_t into group_force and group_time.
 -- Work for heterogeneous job support (complete solution in v17.11):
    * Fix I/O race condition on step termination for srun launching multiple pack job groups.
    * If prolog is running when attempting to signal a step, then return EAGAIN and retry rather than simply returning SLURM_ERROR and aborting.
    * Modify launch/slurm plugin to signal all components of a pack job rather than just the one (modify to use a list of step context records).
    * Add logic to support srun --mpi-combine option.
    * Set up debugger data structures.
    * Disable cancellation of individual component while the job is pending.
    * Modify scontrol job hold/release and update to operate with heterogeneous job id specification (e.g. "scontrol hold 123+4").
    * If srun lacks application specification for some component, the next one specified will be used for earlier components.

* Changes in Slurm 17.11.0pre1
==============================
 -- Interpret all format options in output/error file to log prolog errors. Prior logic only supported "%j" (job ID) option.
 -- Add the configure option --with-shared-libslurm which will link to libslurm.so instead of libslurm.o thus reducing the footprint of all the binaries.
 -- In switch plugin, added plugin_id symbol to plugins and wrapped switch_jobinfo_t with dynamic_plugin_data_t in interface calls in order to pass switch information between clusters with different switch types.
 -- Switch naming of acct_gather_infiniband to acct_gather_interconnect.
 -- Make it so you can "stack" the interconnect plugins.
 -- Add a last_sched_eval timestamp to record when a job was last evaluated by the main scheduler or backfill.
 -- Add scancel "--hurry" option to avoid staging out any burst buffer data.
 -- Simplify the sched plugin interface.
 -- Add new advanced reservation flags of "weekday" (repeat on each weekday; Monday through Friday) and "weekend" (repeat on each weekend day; Saturday and Sunday).
 -- Add new advanced reservation flag of "flex", which permits jobs requesting the reservation to begin prior to the reservation's start time and use resources inside or outside of the reservation. A typical use case is to prevent jobs not explicitly requesting the reservation from using those reserved resources rather than forcing jobs requesting the reservation to use those resources in the time frame reserved.
 -- Add NoDecay flag to QOS.
 -- Node "OS" field expanded from "sysname" to "sysname release version" (e.g. change from "Linux" to "Linux 4.8.0-28-generic #28-Ubuntu SMP Sat Feb 8 09:15:00 UTC 2017").
 -- jobcomp/elasticsearch - Add "job_name" and "wc_key" fields to stored information.
 -- jobcomp/filetxt - Add ArrayJobId, ArrayTaskId, ReservationName, Gres, Account, QOS, WcKey, Cluster, SubmitTime, EligibleTime, DerivedExitCode and ExitCode.
 -- scontrol modified to report core IDs for reservation containing individual cores.
 -- MYSQL - Get rid of table join during rollup which speeds up the process dramatically on large job/step tables.
 -- Add ability to define features on clusters for directing federated jobs to different clusters.
 -- Add new RPC to process multiple federation RPCs in a single communication.
 -- Modify slurm_load_jobs() function to load job information from all clusters in a federation.
 -- Add squeue --local and --sibling options to modify filtering of jobs on federated clusters.
 -- Add SchedulerParameters option of bf_max_job_user_part to specify the maximum number of jobs per user for any single partition. This differs from bf_max_job_user in that a separate counter is applied to each partition rather than having a single counter per user applied to all partitions.
 -- Modify backfill logic so that bf_max_job_user, bf_max_job_part and bf_max_job_user_part options can all be used independently of each other.
 -- Add sprio -p/--partition option to filter jobs by partition name.
 -- Add partition name to job priority factor response message.
 -- Add sprio --local and --sibling options for use in federation of clusters.
 -- Add sprio "%c" format to print cluster name in federation mode.
 -- Modify sinfo logic to provide a unified view of all nodes and partitions in a federation, add --local option to only report local state information even in a cluster, print cluster name with "%V" format option, and optionally sort by cluster name.
 -- If a task in a parallel job fails and it was launched with the --kill-on-bad-exit option then terminate the remaining tasks using the SIGCONT, SIGTERM and SIGKILL signals rather than just sending SIGKILL.
 -- Include submit_time when doing the sort for job scheduling.
 -- Modify sacct to report all jobs in federation by default. Also add --local option.
 -- Modify sacct to accept "--cluster all" option (in addition to the old "--cluster -1", which is still accepted).
 -- Modify sreport to report all jobs in federation by default. Also add --local option.
 -- sched/backfill: Improve assoc_limit_stop configuration parameter support.
 -- KNL features: Always keep active and available features in the same order: first site-specific features, next MCDRAM modes, last NUMA modes.
 -- Changed default ProctrackType to cgroup.
 -- Add "cluster_name" field to node_info_t and partition_info_t data structure. It is filled in only when the cluster is part of a federation and SHOW_FEDERATION flag used.
 -- Functions slurm_load_node() and slurm_load_partitions() modified to show all nodes/partitions in a federation when the SHOW_FEDERATION flag is used.
 -- Add federated views to sview.
 -- Add --federation option to sacct, scontrol, sinfo, sprio, squeue, sreport to show a federated view. Will show local view by default.
 -- Add FederationParameters=fed_display slurm.conf option to configure status commands to display a federated view by default if the cluster is a member of a federation.
 -- Log the down nodes whenever slurmctld restarts.
 -- Report that "CPUs" plus "Boards" in node configuration is invalid only if the CPUs value is not equal to the total thread count.
 -- Extend the output of the seff utility to also include the job's wall-clock time.
 -- Add bf_max_time to SchedulerParameters.
 -- Add bf_max_job_assoc to SchedulerParameters.
 -- Add new SchedulerParameters option bf_window_linear to control the rate at which the backfill test window expands. This can be used on a system with a modest number of running jobs (hundreds of jobs) to help prevent expected start times of pending jobs from being pushed forward in time. On systems with large numbers of running jobs, performance of the backfill scheduler will suffer and fewer jobs will be evaluated.
 -- Improve scheduling logic with respect to license use and node reboots.
 -- CRAY - Alter algorithm to come up with the SLURM_ID_HASH.
 -- Implement federated scheduling and federated status outputs.
 -- The '-q' option to srun has changed from being the short form of '--quit-on-interrupt' to '--qos'.
 -- Change sched_min_interval default from 0 to 2 microseconds.

* Changes in Slurm 17.02.12
==========================
 -- Fix segfault in slurmdbd hourly rollup when having a job outside a reservation, with no end_time set, from an assoc that's in a reservation.

* Changes in Slurm 17.02.11
==========================
 -- Fix insecure handling of user_name and gid fields. CVE-2018-10995.

* Changes in Slurm 17.02.10
==========================
 -- Fix updating of requested TRES memory.
 -- Cray modulefile: avoid removing /usr/bin from path on module unload.
 -- Fix issue when resetting the partition pointers on nodes.
 -- Show reason field in 'sinfo -R' when a node is marked as failed.
 -- Fix potential of slurmstepd segfaulting when the extern step fails to start.
 -- Allow nodes state to be updated between FAIL and DRAIN.
 -- Avoid registering a job's credential multiple times.
 -- Fix sbatch --wait to stop waiting after job is gone from memory.
 -- Fix memory leak of MailDomain configuration string when slurmctld daemon is reconfigured.
 -- Fix to properly remove extern steps from the starting_steps list.
 -- Fix Slurm to work correctly with HDF5 1.10+.
 -- Add support in salloc/srun --bb option for "access_mode" in addition to "access" for consistency with DW options.
 -- Fix potential deadlock in _run_prog() in power save code.
 -- MYSQL - Add dynamic_offset in the database to force range for auto increment ids for the tres_table.
 -- Avoid setting node in COMPLETING state indefinitely if the job initiating the node reboot is cancelled while the reboot is in progress.
 -- node_feature/knl_cray - Fix memory leaks that occur when slurmctld reconfigured.
 -- node_feature/knl_cray - Fix memory leak that can occur during normal operation.
 -- Fix job array dependency with "aftercorr" option when some array tasks in the first job fail. This fix lets all task array elements that can run proceed rather than stopping all subsequent task array elements.
 -- Fix whole node allocation cpu counts when --hint=nomultithread.
 -- NRT - Fix issue when running on a HFI (p775) system with multiple protocols.
 -- Fix uninitialized variables when unpacking slurmdb_archive_cond_t.
 -- Fix security issue in accounting_storage/mysql plugin by always escaping strings within the slurmdbd. CVE-2018-7033.

* Changes in Slurm 17.02.9
==========================
 -- When resuming powered down nodes, mark DOWN nodes right after ResumeTimeout has been reached (previous logic would wait about one minute longer).
 -- Fix sreport not showing full column name for TRES Count.
 -- Fix slurmdb_reservations_get() giving wrong usage data when jobs spanned a reservation that was modified.
 -- Fix sreport reservation utilization report showing bad data.
 -- Show all TRES' on a reservation in sreport reservation utilization report by default.
 -- Fix sacctmgr show reservation handling "end" parameter.
 -- Work around issue with sysmacros.h and gcc7 / glibc 2.25.
 -- Fix layouts code to only allow setting a boolean.
 -- Fix sbatch --wait to keep waiting even if a message timeout occurs.
 -- CRAY - If configured with NodeFeatures=knl_cray and there are non-KNL nodes which include no features the slurmctld will abort without this patch when attempting strtok_r(NULL).
 -- Fix regression in 17.02.7 which would run the spank_task_privileged as part of the slurmstepd instead of its child process.
 -- Fix security issue in Prolog and Epilog by always prepending SPANK_ to all user-set environment variables. CVE-2017-15566.

* Changes in Slurm 17.02.8
==========================
 -- Add 'slurmdbd:' to the accounting plugin to notify message is from dbd instead of local.
 -- mpi/mvapich - Buffer being only partially cleared. No failures observed.
 -- Fix for job --switch option on dragonfly network.
 -- In salloc with --uid option, drop supplementary groups before changing UID.
 -- jobcomp/elasticsearch - strip any trailing slashes from JobCompLoc.
 -- jobcomp/elasticsearch - fix memory leak when transferring generated buffer.
 -- Prevent slurmstepd ABRT when parsing gres.conf CPUs.
 -- Fix sbatch --signal to signal all MPI ranks in a step instead of just those on node 0.
 -- When scheduling a job, check multiple partition limits that were previously only checked on submit.
 -- Cray: Avoid running application/step Node Health Check on the external job step.
 -- Optimization enhancements for partition based job preemption.
 -- Address some build warnings from GCC 7.1, and one possible memory leak if /proc is inaccessible.
 -- If creating/altering a core based reservation with scontrol/sview on a remote cluster correctly determine the select type.
 -- Fix autoconf test for libcurl when clang is used.
 -- Fix default location for cgroup_allowed_devices_file.conf to use correct default path.
 -- Document NewName option to sacctmgr.
 -- Reject a second PMI2_Init call within a single step to prevent slurmstepd from hanging.
 -- Handle old 32bit values stored in the database for requested memory correctly in sacct.
 -- Fix memory leaks in the task/cgroup plugin when constraining devices.
 -- Make extremely verbose info messages debug2 messages in the task/cgroup plugin when constraining devices.
 -- Fix issue that would deny the stepd access to /dev/null where GRES has a 'type' but no file defined.
 -- Fix issue where the slurmstepd would fatal on job launch if you have no gres listed in your slurm.conf but some in gres.conf.
 -- Fix validating time spec to correctly validate various time formats.
 -- Make scontrol work correctly with job update timelimit [+|-]=.
 -- Reduce the visibility of a number of warnings in _part_access_check.
 -- Prevent segfault in sacctmgr if no association name is specified for an update command.
 -- burst_buffer/cray plugin modified to work with changes in Cray UP05 software release.
 -- Fix job reasons for jobs that are violating assoc MaxTRESPerNode limits.
 -- Fix segfault when unpacking a 16.05 slurm_cred in a 17.02 daemon.
 -- Fix setting TRES limits with case insensitive TRES names.
 -- Add alias for xstrncmp() -- slurm_xstrncmp().
 -- Fix sorting of case insensitive strings when using xstrcasecmp().
 -- Gracefully handle race condition when reading /proc as process exits.
 -- Avoid error on Cray duplicate setup of core specialization.
 -- Skip over undefined (hidden in Slurm) nodes in pbsnodes.
 -- Add empty hashes in perl api's slurm_load_node() for hidden nodes.
 -- CRAY - Add rpath logic to work for the alpscomm libs.
 -- Fixes for administrator extended TimeLimit (job reason & time limit reset).
 -- Fix gres selection on systems running select/linear.
 -- sview: Added window decorator for maximize,minimize,close buttons for all systems.
 -- squeue: interpret negative length format specifiers as a request to delimit values with spaces.
 -- Fix the torque pbsnodes wrapper script to parse a gres field with a type set correctly.

* Changes in Slurm 17.02.7
==========================
 -- Fix deadlock if requesting to create more than 10000 reservations.
 -- Fix potential memory leak when creating partition name.
 -- Execute the HealthCheckProgram once when the slurmd daemon starts rather than executing repeatedly until an exit code of 0 is returned.
 -- Set job/step start and end times to 0 when using --truncate and start > end.
 -- Make srun --pty option ignore EINTR allowing windows to resize.
 -- When resuming node only send one message to the slurmdbd.
 -- Modify srun --pty option to use configured SrunPortRange range.
 -- Fix issue with whole gres not being printed out with Slurm tools.
 -- Fix issue where multiple jobs from an array are prevented from starting.
 -- Fix for possible slurmctld abort with use of salloc/sbatch/srun --gres-flags=enforce-binding option.
 -- Fix race condition when using jobacct_gather/cgroup where the memory of the step wasn't always gathered correctly.
 -- Better debug when slurmdbd queue is filling up in the slurmctld.
 -- Fixed truncation on scontrol show config output.
 -- Serialize updates from the dbd to the slurmctld.
 -- Fix memory leak in slurmctld when agent queue to the DBD has filled up.
 -- CRAY - Throttle step creation if trying to create too many steps at once.
 -- If failing after switch_g_job_init happened make sure switch_g_job_fini is called.
 -- Fix minor memory leak if launch fails in the slurmstepd.
 -- Fix issue where UnkillableStepProgram if step was in an ending state.
 -- Fix bug when tracking multiple simultaneous spawned ping cycles.
 -- jobcomp/elasticsearch plugin now saves state of pending requests on slurmctld daemon shutdown so they can be recovered on restart.
 -- Fix issue when using an alternate munge key when communicating on a persistent connection.
 -- Document inconsistent behavior of GroupUpdateForce option.
 -- Fix bug in selection of GRES bound to specific CPUs where the GRES count is 2 or more. Previous logic could allocate CPUs not available to the job.
 -- Increase buffer to handle long /proc//stat output so that Slurm can read correct RSS value and take action on jobs using more memory than requested.
 -- Fix srun jobs that can run immediately to run in the highest priority partition when multiple partitions are listed. scontrol show jobs can potentially show the partition list in priority order.
 -- Fix starting controller if StateSaveLocation path didn't exist.
 -- Fix inherited association 'max' TRES limits combining multiple limits in the tree.
 -- Sort TRES id's on limits when getting them from the database.
 -- Fix issue with pmi[2|x] when TreeWidth=1.
 -- Correct buffer size used in determining specialized cores to avoid possible truncation of core specification and not reserving the specified cores.
 -- Close race condition on Slurm structures when setting DebugFlags.
 -- Make it so the cray/switch plugin grabs new DebugFlags on a reconfigure.
 -- Fix incorrect lock levels when creating or updating a reservation.
 -- Fix overlapping reservation resize.
 -- Add logic to help support Dell KNL systems where syscfg is different than the normal Intel syscfg.
 -- CRAY - Fix BB to handle type= correctly, regression in 17.02.6.

* Changes in Slurm 17.02.6
==========================
 -- Fix configurator.easy.html to output the SelectTypeParameters line.
 -- If a job requests a specific memory requirement then gets something else from the slurmctld make sure the step allocation is made aware of it.
 -- Fix missing initialization in slurmd.
 -- Fix potential degradation when running HTC (> 100 jobs a sec) like workflows through the slurmd.
 -- Fix race condition which could leave a stepd hung on shutdown.
 -- CRAY - Add configuration for ATP to the ansible play script.
 -- Fix potential to corrupt DBD message.
 -- burst_buffer logic modified to support sizes in both SI and IEC size units (e.g. M/MiB for powers of 1024, MB for powers of 1000).

* Changes in Slurm 17.02.5
==========================
 -- Prevent segfault if a job was blocked from running by a QOS that is then deleted.
 -- Improve selection of jobs to preempt when there are multiple partitions with jobs subject to preemption.
 -- Only set kmem limit when ConstrainKmemSpace=yes is set in cgroup.conf.
 -- Fix bug in task/affinity that could result in slurmd fatal error.
 -- Increase number of jobs that are tracked in the slurmd as finishing at one time.
 -- Note when a job finishes in the slurmd to avoid a race when launching a batch job takes longer than it takes to finish.
 -- Improve slurmd startup on large systems (> 10000 nodes).
 -- Add LaunchParameters option of cray_net_exclusive to control whether all jobs on the cluster have exclusive access to their assigned nodes.
 -- Make sure srun inside an allocation gets --ntasks-per-[core|socket] set correctly.
 -- Only make the extern step at job creation.
 -- Fix for job step task layout with --cpus-per-task option.
 -- Fix --ntasks-per-core option/environment variable parsing to set the requested value, instead of always setting one (srun).
 -- Correct error message when ClusterName in configuration files does not match the name in the slurmctld daemon's state save file.
 -- Better checking when a job is finishing to avoid underflow on jobs submitted to a QOS/association.
 -- Handle partition QOS submit limits correctly when a job is submitted to more than 1 partition or when the partition is changed with scontrol.
 -- Performance boost for when Slurm is dealing with credentials.
 -- Fix race condition which could leave a stepd hung on shutdown.
 -- Add lua support for opensuse.

* Changes in Slurm 17.02.4
==========================
 -- Do not attempt to schedule jobs after changing the power cap if there are already many active threads.
 -- Job expansion example in FAQ enhanced to demonstrate operation in heterogeneous environments.
 -- Prevent scontrol crash when operating on array and no-array jobs at once.
 -- knl_cray plugin: Log incomplete capmc output for a node.
 -- knl_cray plugin: Change capmc parsing of mcdram_pct from string to number.
 -- Remove log files from test20.12.
 -- When rebooting a node and using the PrologFlags=alloc make sure the prolog is run after the reboot.
 -- node_features/knl_generic - If a node is rebooted for a pending job, but fails to enter the desired NUMA and/or MCDRAM mode then drain the node and requeue the job.
 -- node_features/knl_generic disable mode change unless RebootProgram configured.
 -- Add new burst_buffer function bb_g_job_revoke_alloc() to be executed if there was a failure after the initial resource allocation. Does not release previously allocated resources.
 -- Test if the node_bitmap on a job is NULL when testing if the job's nodes are ready. This will be NULL if a job was revoked while beginning.
 -- Fix incorrect lock levels when testing when job will run or updating a job.
 -- Add missing locks to job_submit/pbs plugin when updating a job's dependencies.
 -- Add support for lua5.3.
 -- Add min_memory_per_node|cpu to the job_submit/lua plugin to deal with lua not being able to deal with pn_min_memory being a uint64_t. Scripts are urged to change to these new variables to avoid issues.
    If not set the variables will be 'nil'.
 -- Calculate priority correctly when 'nice' is given.
 -- Fix minor typos in the documentation.
 -- node_features/knl_cray: Preserve non-KNL active features if slurmctld
    reconfigured while node boot in progress.
 -- node_features/knl_generic: Do not repeatedly log errors when trying to read
    KNL modes if not KNL system.
 -- Add missing QOS read lock to backfill scheduler.
 -- When doing a dlopen on liblua only attempt the version compiled against.
 -- Fix null-dereference in sreport cluster utilization when configured with
    memory-leak-debug.
 -- Fix Partition info in 'scontrol show node'. Previously duplicate partition
    names, or Partitions the node did not belong to could be displayed.
 -- Fix it so the backup slurmdbd will take control correctly.
 -- Fix unsafe use of MAX() macro, which could result in problems cleaning up
    accounting plugins in slurmd, or repeat job cancellation attempts in
    scancel.
 -- Fix 'scontrol update reservation duration=unlimited' to set the duration
    to 365-days (as is done elsewhere), rather than 49710 days.
 -- Check if variable given to scontrol show job is a valid jobid.
 -- Fix WithSubAccounts option to not include WithDeleted unless requested.
 -- Prevent a job tested on multiple partitions from being marked
    WHOLE_NODE_USER.
 -- Prevent a race between completing jobs on a user-exclusive node from
    leaving the node owned.
 -- When scheduling take the nodes in completing jobs out of the mix to reduce
    fragmentation. SchedulerParameters=reduce_completing_frag
 -- For jobs submitted to multiple partitions, report the job's earliest start
    time for any partition.
 -- Backfill partitions that use QOS Grp limits to "float" better.
 -- node_features/knl_cray: don't clear configured GRES from non-KNL node.
 -- sacctmgr - prevent segfault in command when a request is denied due
    to insufficient privileges.
 -- Add warning about libcurl-devel not being installed during configure.
 -- Streamline job purge by handling file deletion on a separate thread.
 -- Always set RLIMIT_CORE to the maximum permitted for slurmd, to ensure
    core files are created even on non-developer builds.
 -- Fix --ntasks-per-core option/environment variable parsing to set
    the requested value, instead of always setting one.
 -- If trying to cancel a step that hasn't started yet for some reason return
    a good return code.
 -- Fix issue with sacctmgr show where user=''.

* Changes in Slurm 17.02.3
==========================
 -- Increase --cpu_bind and --mem_bind field length limits.
 -- Fix segfault when using AdminComment field with job arrays.
 -- Clear Dependency field when all dependencies are satisfied.
 -- Add --array-unique to squeue which will display one unique pending job
    array element per line.
 -- Reset backfill timers correctly without skipping over them in certain
    circumstances.
 -- When running the "scontrol top" command, make sure that all of the user's
    jobs have a priority that is lower than the selected job. Previous logic
    would permit other jobs with equal priority (no jobs with higher priority).
 -- Fix perl api so we always get an allocation when calling Slurm::new().
 -- Fix issue with cleaning up cpuset and devices cgroups when multiple steps
    end at the same time.
 -- Document that PriorityFlags option of DEPTH_OBLIVIOUS precludes the use of
    FAIR_TREE.
 -- Fix issue where a Slurm daemon/command may abort if an invalid message
    comes in.
 -- Make it impossible to use CR_CPU* along with CR_ONE_TASK_PER_CORE. The
    options are mutually exclusive.
 -- ALPS - Fix scheduling when ALPS doesn't agree with Slurm on what nodes
    are free.
 -- When removing a partition make sure it isn't part of a reservation.
 -- Fix seg fault if attempting to load a non-existent burstbuffer plugin.
 -- Fix to backfill scheduling with respect to QOS and association limits. Jobs
    submitted to multiple partitions are most likely to be affected.
 -- sched/backfill: Improve assoc_limit_stop configuration parameter support.
 -- CRAY - Add ansible play and README.
 -- sched/backfill: Fix bug related to advanced reservations and the need to
    reboot nodes to change KNL mode.
 -- Preempt plugins - fix check for 'preempt_youngest_first' option.
 -- Preempt plugins - fix incorrect casts in preempt_youngest_first mode.
 -- Preempt/job_prio - fix incorrect casts in sort function.
 -- Fix to make task/affinity work with ldoms where there are more than 64
    cpus on the node.
 -- When using node_features/knl_generic make it so the slurmd doesn't segfault
    when shutting down.
 -- Fix potential double-xfree() when using job arrays that can lead to
    slurmctld crashing.
 -- Fix priority/multifactor priorities on a slurmctld restart if not using
    accounting_storage/[mysql|slurmdbd].
 -- Fix NULL dereference reported by CLANG.
 -- Update proctrack documentation to strongly encourage use of
    proctrack/cgroup.
 -- Fix potential memory leak if job fails to begin after nodes have been
    selected for a job.
 -- Handle a job that made it out of the select plugin without a job_resrcs
    pointer.
 -- Fix potential race condition when persistent connections are being closed
    at shutdown.
 -- Fix incorrect lock levels when submitting a batch job or updating a job
    in general.
 -- CRAY - Move delay waiting for job cleanup to after we check once.
 -- MYSQL - Fix memory leak when loading archived jobs into the database.
 -- Fix potential race condition when starting the priority/multifactor
    plugin's decay thread.
 -- Sanity check to make sure we have started a job in acct_policy.c before we
    clear it as started.
 -- Allow reboot program to use arguments.
 -- Message Aggr - Remove race condition on slurmd shutdown with respect to
    destroying a mutex.
 -- Fix updating job priority on multiple partitions to be correct.
 -- Don't remove admin comment when updating a job.
 -- Return error when bad separator is given for scontrol update job licenses.

* Changes in Slurm 17.02.2
==========================
 -- Update hyperlink to LBNL Node Health Check program.
 -- burst_buffer/cray - Add support for line continuation.
 -- If a job is cancelled by the user while its allocated nodes are being
    reconfigured (i.e. the capmc_resume program is rebooting nodes for the job)
    and the node reconfiguration fails (i.e. the reboot fails), then don't
    requeue the job but leave it in a cancelled state.
 -- capmc_resume (Cray resume node script) - Do not disable changing a node's
    active features if SyscfgPath is configured in the knl.conf file.
 -- Improve the srun documentation for the --resv-ports option.
 -- burst_buffer/cray - Fix parsing for discontinuous allocated nodes. A job
    allocation of "20,22" must be expressed as "20\n22".
 -- Fix rare segfault when shutting down slurmctld and still sending data to
    the database.
 -- Fix gres output of a job if it is updated while pending to be displayed
    correctly with Slurm tools.
 -- Fix pam_slurm_adopt.
 -- Fix missing unlock when job_list doesn't exist when starting priority/
    multifactor.
 -- Fix segfault if slurmctld is shutting down and the slurmdbd plugin was
    in the middle of setting db_indexes.
 -- Add ESLURM_JOB_SETTING_DB_INX to errno to note when a job can't be updated
    because the dbd is setting a db_index.
 -- Fix possible double insertion into database when a job is updated at the
    moment the dbd is assigning a db_index.
 -- Fix memory error when updating a job's licenses.
 -- Fix seff to work correctly with non-standard perl installs.
 -- Export missing slurmdbd_defs_[init|fini] needed for libslurmdb.so to work.
 -- Fix sacct returning far more than requested when querying against a job
    array task id.
 -- Fix double read lock of tres when updating gres or licenses on a job.
 -- Make sure locks are always in place when calling
    assoc_mgr_make_tres_str_from_array.
 -- Prevent slurmctld SEGV when creating reservation with duplicated name.
 -- Consider QOS flags Partition[Min|Max]Nodes when doing backfill.
 -- Fix slurmdbd_defs.c to not have half symbols go to libslurm.so and the
    other half go to libslurmdb.so.
 -- Fix 'scontrol show jobs' to remove an errant newline when 'Switches' is
    printed.
 -- Better code for handling memory required by a task on a heterogeneous
    system.
 -- Fix regression in 17.02.0 with respect to GrpTresMins on a QOS or
    Association.
 -- Cleanup to make "make dist" work.
 -- Schedule interactive jobs quicker.
 -- Perl API - correct value of MEM_PER_CPU constant to correctly handle
    memory values.
 -- Fix 'flags' variable to be 32 bit from the old 16 bit value in the perl api.
 -- Export sched_nodes for a job in the perl api.
 -- Improve error output when updating a reservation that has already started.
 -- Fix --ntasks-per-node issue with srun so DenyOnLimit would work correctly.
 -- node_features/knl_cray plugin - Fix memory leak.
 -- Fix wrong cpu_per_task count issue on heterogeneous system when dealing
    with steps.
 -- Fix double free issue when removing usage from an association with
    sacctmgr.
 -- Fix issue with SPANK plugins attempting to set null values as environment
    variables, which leads to the command segfaulting on newer glibc versions.
 -- Fix race condition on slurmctld startup when plugins have not gone through
    init() ahead of the rpc_manager processing incoming messages.
 -- job_submit/lua - expose admin_comment field.
 -- Allow AdminComment field to be set by the job_submit plugin.
 -- Allow AdminComment field to be changed by any Administrator.
 -- Fix key words in jobcomp select.
 -- MYSQL - Streamline job flush sql when doing a clean start on the slurmctld.
 -- Fix potential infinite loop when talking to the DBD when shutting down
    the slurmctld.
 -- Fix MCS filter.
 -- Make it so pmix can be included in the plugin rpm without having to
    specify --with-pmix.
 -- MYSQL - Fix initial load when not using the DBD.
 -- Fix scontrol top to not make jobs priority 0 (held).
 -- Downgrade info message about exceeding partition time limit to a debug2.

* Changes in Slurm 17.02.1-2
============================
 -- Replace clock_gettime with time(NULL) for very old systems without the call.

* Changes in Slurm 17.02.1
==========================
 -- Modify pam module to work when configured NodeName and NodeHostname differ.
 -- Update to sbatch/srun man pages to explain the "filename pattern" more
    clearly.
 -- Add %x to sbatch/srun filename pattern to represent the job name.
 -- job_submit/lua - Add job "bitflags" field.
 -- Update slurm.spec file to note obsolete RPMs.
 -- Fix deadlock scenario when dumping configuration in the slurmctld.
 -- Remove unneeded job lock when running assoc_mgr cache. This lock could
    cause potential deadlock when/if TRES changed in the database and the
    slurmctld wasn't made aware of the change. This would be very rare.
 -- Fix missing locks in gres logic to avoid potential memory race.
 -- If gres is NULL on a job don't try to process it when returning detailed
    information about a job to scontrol.
 -- Fix print of consumed energy in sstat when no energy is being collected.
 -- Print formatted tres string when creating/updating a reservation.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Prevent manipulation of the cpu frequency and governor for batch or
    extern steps. This addresses an issue where the batch step would
    inadvertently set the cpu frequency maximum to the minimum value
    supported on the node.
 -- Convert a slurmctld power management data structure from array to list in
    order to eliminate the possibility of zombie child suspend/resume
    processes.
 -- Burst_buffer/cray - Prevent slurmctld daemon abort if "paths" operation
    fails. Now job will be held. Update job update time when held.
 -- Fix issues with QOS flags Partition[Min|Max]Nodes to work correctly.
 -- Refactor slurmctld agent logic to eliminate some pthreads.
 -- Added "SyscfgTimeout" parameter to knl.conf configuration file.
 -- Fix for CPU binding for job steps run under a batch job.

* Changes in Slurm 17.02.0
==========================
 -- job_submit/lua - Make "immediate" parameter available.
 -- Fix srun I/O race condition to eliminate an error message that might be
    generated if the application exits with outstanding stdin.
 -- Fix regression when purging/archiving jobs/events.
 -- Add new job state JOB_OOM indicating Out Of Memory condition as detected
    by task/cgroup plugin.
 -- If a QOS has been added to the system, re-evaluate Deny/AllowQOS on
    partitions.
 -- Deny job with duplicate GRES requested.
 -- Fix loading super old assoc_mgr usage without segfaulting.
 -- CRAY systems: Restore TaskPlugins order of task/cray before task/cgroup.
 -- Task/cray: Treat missing "mems" cgroup with "debug" messages rather than
    "error" messages. The file may be missing at step termination due to a
    change in how cgroups are released at job/step end.
 -- Fix for job constraint specification with counts, --ntasks-per-node value,
    and no node count.
 -- Fix ordering of step task allocation to fill in a socket before going into
    another one.
 -- Fix configure to not require C++.
 -- job_submit/lua - Remove access to slurmctld internal reservation fields of
    job_pend_cnt and job_run_cnt.
 -- Prevent job_time_limit enforcement from blocking other internal operations
    if a large number of jobs need to be cancelled.
 -- Add 'preempt_youngest_order' option to preempt/partition_prio plugin.
 -- Fix controller being able to talk to a pre-released DBD.
 -- Added ability to override the invoking uid for "scontrol update job"
    by specifying "--uid=|-u ".
 -- Changed file broadcast "offset" from 32 to 64 bits in order to support files
    over 2 GB.
 -- slurm.spec - do not install init scripts alongside systemd service files.

* Changes in Slurm 17.02.0rc1
==============================
 -- Add port info to 'sinfo' and 'scontrol show node'.
 -- Fix errant definition of USE_64BIT_BITSTR which can lead to core dumps.
 -- Move BatchScript to end of each job's information when using
    "scontrol -dd show job" to make it more readable.
 -- Add SchedulerParameters configuration parameter of "default_gbytes", which
    treats numeric only (no suffix) value for memory and tmp disk space as being
    in units of Gigabytes. Mostly for compatibility with LSF.
 -- Fix race condition in srun/sattach logic which would prevent srun from
    terminating.
 -- Bitstring operations are now 64bit instead of 32bit.
 -- Replace hweight() function in bitstring with faster version.
 -- scancel would treat a non-numeric argument as the name of jobs to be
    cancelled (a non-documented feature). Cancelling jobs by name now requires
    the "--jobname=" command line argument.
 -- scancel modified to note that no jobs satisfy the filter options when the
    --verbose option is used along with one or more job filters (e.g. "--qos=").
 -- Change _pack_cred to use pack_bit_str_hex instead of pack_bit_fmt for
    better scalability and performance.
 -- Add BootTime configuration parameter to knl.conf file to optimize resource
    allocations with respect to required node reboots.
 -- Add node_features_p_boot_time() to node_features plugin to optimize
    scheduling with respect to node reboots.
 -- Avoid allocating resources to a job in the event that its run time plus
    boot time (if needed) extend into an advanced reservation.
 -- Burst_buffer/cray - Avoid stage-out operation if job never started.
 -- node_features/knl_cray - Add capability to detect Uncorrectable Memory
    Errors (UME) and if detected then log the event in all job and step stderr
    with a message of the form:
    error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
    Similar logic added to node_features/knl_generic in version 17.02.0pre4.
 -- If job is allocated nodes which are powered down, then reset job start
    time when the nodes are ready and do not charge the job for power up time.
 -- Add the ability to purge transactions from the database.
 -- Add support for requeue'ing of federated jobs (BETA).
 -- Add support for interactive federated jobs (BETA).
 -- Add the ability to purge rolled up usage from the database.
 -- Properly set SLURM_JOB_GPUS environment variable for Prolog.

* Changes in Slurm 17.02.0pre4
==============================
 -- Add support for per-partition OverTimeLimit configuration.
 -- Add --mem_bind option of "sort" to run zonesort on KNL nodes at step start.
 -- Add LaunchParameters=mem_sort option to configure running of zonesort
    by default at step startup.
 -- Add "FreeSpace" information for each pool to the "scontrol show burstbuffer"
    output. Required changes to the burst_buffer_info_t data structure.
 -- Add new node state flag of NODE_STATE_REBOOT for node reboots triggered by
    "scontrol reboot" commands. Previous logic re-used NODE_STATE_MAINT flag,
    which could lead to inconsistencies. Add "ASAP" option to "scontrol reboot"
    command that will drain a node in order to reboot it as soon as possible,
    then return it to service.
 -- Allow unit conversion routine to convert 1024M to 1G.
 -- switch/cray plugin - change legacy spool directory location.
 -- Add new PriorityFlags option of INCR_ONLY, which prevents a job's priority
    from being decremented.
 -- Make it so we don't purge job start messages until after we purge step
    messages. Hopefully this will reduce the number of messages lost when
    filling up memory when the database/DBD is down.
 -- Added SchedulerParameters option of "bf_job_part_count_reserve". Jobs below
    the specified threshold will not have resources reserved for them.
 -- If GRES are configured with file IDs, then "scontrol -d show node" will
    not only identify the count of currently allocated GRES, but their specific
    index numbers (e.g. "GresUsed=gpu:alpha:2(IDX:0,2),gpu:beta:0(IDX:N/A)").
    Ditto for job information with "scontrol -d show job".
 -- Add new mcs/account plugin.
 -- Add "GresEnforceBind=Yes" to "scontrol show job" output if so configured.
 -- Add support for SALLOC_CONSTRAINT, SBATCH_CONSTRAINT and SLURM_CONSTRAINT
    environment variables to set default constraints for salloc, sbatch and
    srun commands respectively.
 -- Provide limited support for the MemSpecLimit configuration parameter
    without the task/cgroup plugin.
 -- node_features/knl_generic - Add capability to detect Uncorrectable Memory
    Errors (UME) and if detected then log the event in all job and step stderr
    with a message of the form:
    error: *** STEP 1.2 ON tux1 UNCORRECTABLE MEMORY ERROR AT 2016-12-14T09:09:37 ***
 -- Add SLURM_JOB_GID to TaskProlog environment.
 -- burst_buffer/cray - Remove leading zeros from node ID lists passed to
    dw_wlm_cli program.
 -- Add "Partitions" field to "scontrol show node" output.
 -- Remove sched/wiki and sched/wiki2 plugins and associated code.
 -- Remove SchedulerRootFilter option and slurm_get_root_filter() API call.
 -- Add SchedulerParameters option of spec_cores_first to select specialized
    cores from the lowest rather than highest number cores and sockets.
 -- Add PrologFlags option of Serial to disable concurrent launch of
    Prolog and Epilog scripts.
 -- Fix security issue caused by insecure file path handling triggered by the
    failure of a Prolog script. To exploit this a user needs to anticipate or
    cause the Prolog to fail for their job. CVE-2016-10030.

* Changes in Slurm 17.02.0pre3
==============================
 -- Add srun host & PID to job step data structures.
 -- Avoid creating duplicate pending step records for the same srun command.
 -- Rewrite srun's logic for pending steps for better efficiency (fewer RPCs).
 -- Added new SchedulerParameters options step_retry_count and step_retry_time
    to control scheduling behaviour of job steps waiting for resources.
 -- Optimize resource allocation logic for --spread-job job option.
 -- Modify cpu_bind and mem_bind map and mask options to accept a repetition
    count to better support large task count.
    For example:
    "mask_mem:0x0f*2,0xf0*2" is equivalent to "mask_mem:0x0f,0x0f,0xf0,0xf0".
 -- Add support for --mem_bind=prefer option to prefer, but not restrict memory
    use to the identified NUMA node.
 -- Add mechanism to constrain kernel memory allocation using cgroups. New
    cgroup.conf parameters added: ConstrainKmemSpace, MaxKmemPercent, and
    MinKmemSpace.
 -- Correct invocation of man2html, which previously could cause FreeBSD builds
    to hang.
 -- MYSQL - Unconditionally remove 'ignore' clause from 'alter ignore'.
 -- Modify service files to not start Slurm daemons until after Munge has been
    started.
    NOTE: If you are not using Munge, but are using the "service" scripts to
    start Slurm daemons, then you will need to remove this check from the
    etc/slurm*service scripts.
 -- Do not process SALLOC_HINT, SBATCH_HINT or SLURM_HINT environment variables
    if any of the following salloc, sbatch or srun command line options are
    specified: -B, --cpu_bind, --hint, --ntasks-per-core, or --threads-per-core.
 -- burst_buffer/cray: Accept new jobs on backup slurmctld daemon without access
    to dw_wlm_cli command. No burst buffer actions will take place.
 -- Do not include SLURM_JOB_DERIVED_EC, SLURM_JOB_EXIT_CODE, or
    SLURM_JOB_EXIT_CODE in PrologSlurmctld environment (not available yet).
 -- Cray - set task plugin to fatal() if task/cgroup is not loaded after
    task/cray in the TaskPlugin settings.
 -- Remove separate slurm_blcr package. If Slurm is built with BLCR support,
    the files will now be part of the main Slurm packages.
 -- Replace sjstat, seff and sjobexit RPM packages with a single "contribs"
    package.
 -- Remove long since defunct slurmdb-direct scripts.
 -- Add SbcastParameters configuration option to control default file
    destination directory and compression algorithm.
 -- Add new SchedulerParameter (max_array_tasks) to limit the maximum number of
    tasks in a job array independently from the maximum task ID (MaxArraySize).
 -- Fix issue where number of nodes is not properly allocated when sbatch and
    salloc are requested with -n tasks < hosts from -w hostlist or from -N.
 -- Add infrastructure for submitting federated jobs.

* Changes in Slurm 17.02.0pre2
==============================
 -- Add new RPC (REQUEST_EVENT_LOG) so that slurmd and slurmstepd can log events
    through the slurmctld daemon.
 -- Remove sbatch --bb option. That option was never supported.
 -- Automatically clean up task/cgroup cpuset and devices cgroups after steps
    are completed.
 -- Add federation read/write locks.
 -- Limit job purge run time to 1 second at a time.
 -- The database index for jobs is now 64 bits. If you happen to be close to
    4 billion jobs in your database you will want to update your slurmctld at
    the same time as your slurmdbd to prevent roll over of this variable as
    it is 32 bit in previous versions of Slurm.
 -- Optionally lock slurmstepd in memory for performance reasons and to avoid
    possible SIGBUS if the daemon is paged out at the time of a Slurm upgrade
    (changing plugins). Controlled via new LaunchParameters options of
    slurmstepd_memlock and slurmstepd_memlock_all.
 -- Add event trigger on burst buffer errors (see strigger man page,
    --burst_buffer option).
 -- Add job AdminComment field which can only be set by a Slurm administrator.
 -- Add salloc, sbatch and srun option of --delay-boot=