that leaves dlopen locks in a bad state after a fork).
-- Added MPICH1_P4 patch to launch tasks using srun rather than rsh and
automatically generate mpirun's machinefile based upon the job's
allocation. See "etc/mpich1.slurm.patch".
-- BLUEGENE - fix for overlap mode to mark all other base partitions as used
when creating a new block from the file to ensure we only use the base
partitions we are asking for.
-- Fix in proctrack/sgi_job plugin that could cause slurmstepd to segfault,
preventing timely clean-up of batch jobs in some cases.
* Changes in SLURM 1.2.7
========================
-- BLUEGENE - code to make it so you can make a 36x36x36 system.
The wiring should be correct for a system with x-dim of 1,2,4,5,8,13
in emulation mode. It will work with any real system no matter the size.
-- Major re-write of jobcomp/script plugin: fix memory leak and
general code clean-up.
-- Add ability to change MaxNodes and ExcNodeList for pending job
using scontrol.
-- Purge zombie processes spawned via event triggers.
-- Add support for power saving mode (experimental code to reduce voltage
and frequency on nodes that stay in the IDLE state, for more information
see http://www.llnl.gov/linux/slurm/power_save.html). None of this
code is enabled by default.
* Changes in SLURM 1.2.6
========================
-- Fix MPIRUN_PORT env variable in mvapich plugin
-- Disable setting triggers by other than user SlurmUser unless SlurmUser
is root for improved security.
* Changes in SLURM 1.2.5
========================
-- Fix nodelist truncation in "scontrol show jobs" output
-- In mpi/mpichgm, fix potential problem formatting GMPI_PORT, from
Ernest Artiaga, BSC.
-- In sched/wiki2 - Report job's account, from Ernest Artiaga, BSC.
-- Add sbatch option "--ntasks-per-node".
* Changes in SLURM 1.2.4
========================
-- In select/cons_res - fix for function argument type mis-match in getting
CPU count for a job, from Ernest Artiaga, BSC.
-- In sched/wiki2 - Report job's tasks_per_node requirement.
-- In forward logic, fix to check whether the forwarding node receives a
connection but never gets the message from the sender (network issue or
something similar). Also make sure that, if we get something back, we
account for everything we sent out before we call it good.
-- Another fix to make sure steps with requested nodes have correct cpus
accounted for and a fix to make sure the user can't allocate more
cpus than they have requested.
* Changes in SLURM 1.2.3
========================
-- Cpuset logic added to task/affinity, from Don Albert (Bull) and
Moe Jette (LLNL). The /dev/cpuset file system must be mounted and
set "TaskPluginParam=cpusets" in slurm.conf to enable.
-- In sched/wiki2, fix possible overflow in job's nodelist, from
Ernest Artiaga, BSC.
-- Defer creation of new job steps until a suspended job is resumed.
-- In select/linear - fix for potential stack corruption bug.
* Changes in SLURM 1.2.2
========================
-- Added new command "strigger" for event trigger management, a new
capability. See "man strigger" for details.
-- srun --get-user-env now sends su's stderr to /dev/null
-- Fix in node_scheduling logic with multiple node_sets, from
Ernest Artiaga, BSC.
-- In select/cons_res, fix for function argument type mis-match in getting
CPU count for a job.
-- MPICHGM support bug fixes from Ernest Artiaga, BSC.
-- Support longer hostlist strings, from Ernest Artiaga, BSC.
-- Srun to use env vars for SLURM_PROLOG, SLURM_EPILOG, SLURM_TASK_PROLOG,
and SLURM_TASK_EPILOG. patch.1.2.0-pre11.070201.envproepilog from
Dan Palermo, HP.
-- Documentation update. patch.1.2.0-pre11.070201.mchtml from Dan Palermo, HP.
-- Set SLURM_DIST_CYCLIC = 1 (needed for HP MPI, slurm.hp.env.patch).
* Changes in SLURM 1.2.0-pre15
==============================
-- Fix for another spot where the backup controller calls switch/federation
code before switch/federation is initialized.
* Changes in SLURM 1.2.0-pre14
==============================
-- In sched/wiki2, clear required nodes list when a job is requeued.
Note that the required node list is set to every node used when
a job is started via sched/wiki2.
-- BLUEGENE - Added display of deallocating blocks to smap and other tools.
-- Make slurmctld's working directory be same as SlurmctldLogFile (if any),
otherwise StateSaveDir (which is likely a shared directory, possibly
making core file identification more difficult).
-- Fix bug in switch/federation that results in the backup controller
aborting if it receives an epilog-complete message.
* Changes in SLURM 1.2.0-pre13
==============================
-- Fix for --get-user-env.
* Changes in SLURM 1.2.0-pre12
==============================
-- BLUEGENE - Added correct node info for sinfo and sview for viewing
allocated nodes in a partition.
-- BLUEGENE - Added state save on slurmctld shutdown of blocks in an error
state on real systems and total block config on emulation systems.
-- Major update to Slurm's PMI internal logic for better scalability.
Communications now supported directly between application tasks via
Slurm's PMI library. Srun sends a single message to one task on each node
and that task forwards key-pairs to other tasks on that node. The old
code sent key-pairs directly to each task.
NOTE: PMI applications must re-link with this new library.
-- For multi-core support: Fix task distribution bug and add automated
tests, patch.1.2.0-pre11.070111.plane from Dan Palermo (HP).
* Changes in SLURM 1.2.0-pre11
==============================
-- Add multi-core options to slurm_step_launch API.
-- Add man pages for slurm_step_launch() and related functions.
-- Jobacct plugin only looks at the proctrack list instead of the entire
list of processes running on the node, cutting down a lot of unnecessary
file opens in Linux and cutting the time to query the procs by
more than half.
-- Multi-core bug fix, mask re-use with multiple job steps,
patch.1.2.0-pre10.061214.affinity_stepid from Dan Palermo (HP).
-- Modify jobacct/linux plugin to completely eliminate open /proc files.
-- Added slurm_sched_plugin_reconfig() function to re-read config files.
-- BLUEGENE - --reboot option to srun, salloc, and sbatch actually works.
-- Modified step context and step launch APIs.
* Changes in SLURM 1.2.0-pre10
==============================
-- Fix for sinfo node state counts by state (%A and %F output options).
-- Add ability to change a node's features via "scontrol update". NOTE:
Update slurm.conf also to preserve changes over slurmctld restart or
reconfig.
NOTE: Job and node state information can not be preserved from earlier
versions.
-- Added new slurm.conf parameter TaskPluginParam.
-- Fix for job requeue and credential revoke logic from Hongjia Cao (NUDT).
-- Fix for incorrectly generated masks for task/affinity plugin,
patch.1.2.0-pre9.061207.bitfmthex from Dan Palermo (HP).
-- Make mask_cpu options of srun and slaunch commands not require prefix
of "0x". patch.1.2.0-pre9.061208.srun_maskparse from Dan Palermo (HP).
-- Add -c support to the -B automatic mask generation for multi-core
support, patch.1.2.0-pre9.061208.mcore_cpuspertask from Dan Palermo (HP).
-- Fix bug in MASK_CPU calculation,
patch.1.2.0-pre9.061211.avail_cpuspertask from Dan Palermo (HP).
-- BLUEGENE - Added --reboot option to srun, salloc, and sbatch commands.
-- Add "scontrol listpids [JOBID[.STEPID]]" support.
-- Multi-core support patches, fixed SEGV and clean up output for large
task counts, patch.1.2.0-pre9.061212.cpubind_verbose from Dan Palermo (HP).
-- Make sure jobacct plugin files are closed before exec of user tasks to
prevent problems with job checkpoint/restart (based on work by
Hongjia Cao, NUDT).
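
The relaxed mask_cpu parsing noted in the 1.2.0-pre10 entries above (a mask is accepted with or without a leading "0x") can be illustrated with a small Python sketch. This is not srun's actual C parser; the function names parse_cpu_mask and cpus_in_mask are hypothetical and exist only for illustration.

```python
def parse_cpu_mask(text):
    """Parse a hexadecimal CPU mask, with or without a leading "0x".

    Hypothetical helper: it mirrors the behavior described in the
    changelog entry, not the real srun implementation.
    """
    cleaned = text.strip().lower()
    if cleaned.startswith("0x"):
        cleaned = cleaned[2:]          # the prefix is optional, so drop it
    if not cleaned:
        raise ValueError("empty CPU mask")
    return int(cleaned, 16)            # raises ValueError on non-hex input


def cpus_in_mask(mask):
    """Return the CPU indices whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]
```

Under these assumptions, "0xf" and "f" parse to the same value, and a mask of 5 selects CPUs 0 and 2.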
* Changes in SLURM 1.2.0-pre9
=============================
-- Fix for select/cons_res state preservation over slurmctld restart,
patch.1.2.0-pre7.061130.cr_state from Dan Palermo.
-- Validate product of socket*core*thread count on node registration rather
than individual values. Correct values will need to be specified in slurm.conf
with FastSchedule=1 for correct multi-core scheduling behavior.
* Changes in SLURM 1.2.0-pre8
=============================
-- Modify job state "reason" field to report why a job failed (previously
reported only reason waiting to run). Requires cold-start of
slurmctld (-c option).
-- For sched/wiki2 job state request, return REJMESSAGE= with reason for
a job's failure.
-- New FastSchedule configuration parameter option "2" means to base
scheduling decisions upon the node's configuration as specified in
slurm.conf and ignore the node's actual hardware configuration. This
can be useful for testing.
-- Add sinfo output format option "%C" for CPUs (active/idle/other/total).
Based upon work by Anne-Marie Wunderlin (BULL).
-- Assorted multi-core bug fixes (patch1.2.0-pre7.061128.mcorefixes).
-- Report SelectTypeParameters from "scontrol show config".
-- Build sched/wiki plugin for Maui Scheduler (based upon new sched/wiki2
code for Moab Scheduler).
-- BLUEGENE - changed way of keeping track of smaller partitions using
ionode range instead of quarter nodecard notation.
(i.e. bgl000[0-3] instead of bgl000.0.0)
-- Patch from Hongjia Cao (EINPROGRESS error message change)
-- Fix for correct requid for jobacct plugin
-- Added subsec timing display for sacct
* Changes in SLURM 1.2.0-pre7
=============================
-- BLUEGENE - added configurable images for bluegene block creation.
-- Support processor, core, and physical IDs that are not in numeric
order (in slurmd logic gathering node state information, based on patch
by Don Albert, Bull).
-- Fixed bug with aix not looking in the correct dir for the proctrack
include files
-- Removed global_srun.* from common and merged it into srun proper
-- Added bluegene section to troubleshooting guide (web page).
-- NOTE: Requires cold-start when moving from 1.2.0-pre6, save state
info for jobs changed.
-- BLUEGENE - Changed logic for wiring bgl blocks to be more maintainable.
(Haven't tested on large system yet, works on 2 base partition system)
-- Do not read the select/cons_res state save file if slurmctld is
cold-started (with the "-c" option).
* Changes in SLURM 1.2.0-pre6
=============================
-- Maintain actual job step run time with suspend/resume use.
-- Allow slurm.conf options to appear multiple times. SLURM will use the
last instance of any particular option.
-- Add version number to node state save file. Will not recover node
state information on restart from older version.
-- Add logic to save/restore multi-core state information.
-- Updated multi-core logic to use types uint16_t and uint32_t instead
of just type int.
-- Race condition for forwarding logic fix from Hongjia Cao
-- Add support for Portable Linux Processor Affinity (PLPA, see
http://www.open-mpi.org/software/plpa).
-- When a job epilog completes on all non-DOWN nodes, immediately purge
its job steps that lack switch windows. Needed for LSF operation.
Based upon slurm.hp.node_fail.patch.
-- Modify srun to ignore entries on --nodelist for job step creation
if their count exceeds the task count. Based on slurm.hp.srun.patch.
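
The 1.2.0-pre6 rule that a slurm.conf option may appear multiple times, with the last instance taking effect, can be sketched in Python as below. This is only an illustration, not SLURM's parser: the function name parse_options is hypothetical, and the sketch assumes one Key=Value pair per line and "#" comments.

```python
def parse_options(lines):
    """Collect Key=Value options; a repeated key keeps its last value.

    Hypothetical, simplified parser (one option per line is an
    assumption; real slurm.conf lines can be richer than this).
    """
    options = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()  # later instance overwrites earlier
    return options
```

With this sketch, a file that sets MessageTimeout=5 and later MessageTimeout=10 yields "10" for that option.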
* Changes in SLURM 1.2.0-pre5
=============================
-- Patch from HP patch.1.2.0.pre4.061017.crcore_hints, supports cores as
consumable resource.
-- Added node_inx to job_step_info_t to get the node indices for mapping out
steps in a job by nodes.
-- sview grid added
-- BLUEGENE node_inx added to blocks for reference.
-- Automatic CPU_MASK generation for task launch, new srun option -B.
-- Automatic logical to physical processor identification and mapping.
-- Added new srun options to --cpu_bind: sockets, cores, and threads
-- Updated select/cons_res to operate at socket granularity.
-- New srun task distribution options to -m: plane
-- Multi-core support in sinfo, squeue, and scontrol.
-- Memory can be treated as a consumable resource.
-- New srun options --ntasks-per-[node|socket|core].
* Changes in SLURM 1.2.0-pre3
=============================
-- Remove configuration parameter SchedulerAuth (defunct).
-- Add NextJobId to "scontrol show config" output.
-- Add new slurm.conf parameter MailProg.
-- New forwarding logic. New receive_msg functions depending on what you
are expecting to get back. The srun_node_id is no longer passed around in
a slurm_msg_t.
-- Remove sched/wiki plugin (use sched/wiki2 for now)
-- Disable pthread_create() for PMI_send when TotalView is running for
better performance.
-- Fixed certain tests in test suite to not run with bluegene or front-end
systems
-- Removed addresses from slurm_step_layout_t
-- Added new job field, "comment". Set by srun, salloc and sbatch. See
with "scontrol show job". Used in sched/wiki2.
-- Report a job's exit status in "scontrol show job".
-- In sched/wiki2: add support for JOBREQUEUE command.
* Changes in SLURM 1.2.0-pre2
=============================
-- Added function slurm_init_slurm_msg to be used to init any slurm_msg_t;
you no longer need to do any other type of initialization to the type.
-- Fixed task dist to work with hostfile and warn about asking for more tasks
than you have nodes for in arbitrary mode.
-- Added "account" field to job and step accounting information and sacct output.
-- Moved task layout to slurmctld instead of srun. Job step create returns
step_layout structure with hostnames and addresses that correspond
to those nodes.
-- Changed api slurm_lookup_allocation params:
resource_allocation_response_msg_t changed to job_alloc_info_response_msg_t.
This structure is only being renamed, so contents are the same.
-- alter resource_allocation_response_msg_t see slurm.h.in
-- remove old_job_alloc_msg_t and function slurm_confirm_alloc
-- Slurm configuration files now support an "Include" directive to
include other files inline.
-- BLUEGENE New --enable-bluegene-emulation configure parameter to allow
running system in bluegene emulation mode. Only
really useful for developers.
-- Added new tool sview, a GUI for displaying slurm info.
-- fixed bug in step layout to lay out tasks correctly
* Changes in SLURM 1.2.0-pre1
=============================
-- Fix bug that could run a job's prolog more than once
-- Permit batch jobs to be requeued, scontrol requeue <jobid>
-- Send overcommit flag from srun in RPCs and have slurmd set SLURM_OVERCOMMIT
flag at batch job launch time.
-- Added new configuration parameter MessageTimeout (replaces #define in
the code)
* Changes in SLURM 1.1.37
=========================
- In sched/wiki2: Add NAME to job record.
- Changed -w (--nodelist) option to only read in number of nodes specified
by -N option unless nprocs was set and in Arbitrary layout mode.
- Added some loops around pthread creates in case they fail and also fixed an
issue in srun to fail if the job has failed instead of waiting around for
threads that will never end.
- Added fork handlers in the slurmstepd
- In sched/wiki2: fix logic for restarting backup slurmctld.
- In sched/wiki2: if job has no time limit specified, return the partition's
time limit (which is the default for the job) rather than 365 days.
* Changes in SLURM 1.1.36
=========================
- Permit node state specification of DRAIN in slurm.conf.
- In jobcomp/script - fix bug that prevented UID and JOBID environment
variables from being set.
* Changes in SLURM 1.1.35
=========================
- In sched/wiki2: Add support for CMD=SIGNALJOB to accept option
of VALUE=SIGXXX in addition to VALUE=# and VALUE=XXX options.
- In sched/wiki2: Add support for CMD=MODIFYJOB to accept option of
DEPEND=afterany:<jobid>, specify jobid=0 to clear.
- Correct logic for job allocation with task count (srun -n ...) AND
FastSchedule=0 AND low CPUs count in Slurm's node configuration.
- Add new and undocumented scancel option, --ctld, to route signal
requests through slurmctld rather than directly to slurmd daemons.
Useful for testing purposes.
- Fixed issue with hostfile support not working in a job step.
- Set supplemental groups for SlurmUser in slurmctld daemon, from
Anne Marie Wunderlin, Bull.
- In jobcomp/script: Add ACCOUNT and PROCS (count) to environment
variables set. Fix bug that prevented UID and JOBID from being
overwritten.
* Changes in SLURM 1.1.34
=========================
- Insure that slurm_signal_job_step() is defined in srun for mvapich
and mpichgm error conditions.
- Modify /etc/init.d/slurm restart command to wait for daemon to terminate
before starting a new one
- Permit job steps to be started on draining nodes that have already
been allocated to that job.
- Prevent backup slurmctld from purging pending batch job scripts when a
SIGHUP is received.
- BLUEGENE - check to make sure set_block_user works when the block
is in a ready state.
- Fix to slurmstepd to not use local variables in a pthread create.
- In sched/wiki2 - add wiki.conf parameter HostFormat specifying
format of hostlists exchanged between Slurm and Moab (experimental).
- mpi/mvapich: Support Adam Moody's fast MPI initialization protocol
(MVAPICH protocol version 8).
* Changes in SLURM 1.1.33
=========================
- sched/wiki2 - Do not wait for job completion before permitting
additional jobs to be scheduled.
- Add srun SLURM_EXCLUSIVE environment variable support, from
Gilles Civario (Bull).
- sched/wiki2 - Report job's node sharing options.
- sched/wiki2 - If SchedulerPort is in use, retry opening it indefinitely.
- sched/wiki2 - Add support for changing the size of a pending job.
- BLUEGENE - Fix to correctly look at downed/drained nodes when picking
a block to run a job and not confuse it with another running job.
* Changes in SLURM 1.1.32
=========================
- If a job's stdout/err file names are unusable (bad path), use the
default names.
- sched/wiki2 - Fix logic to be compatible with select/cons_res plugin
for allocating individual processors within nodes.
- Fix job end time calculation when changed from an initial value of
INFINITE.
* Changes in SLURM 1.1.31
=========================
- Correctly identify a user's login shell when running "srun -b --uid"
as root. Use the --uid field for the /etc/passwd lookup instead of
getuid().
* Changes in SLURM 1.1.30
=========================
- Fix to make sure users don't include and exclude the same node in
their srun line.
- mpi/mvapich: Forcibly terminate job 60s after first MPI_Abort()
to avoid waiting indefinitely for hung processes.
- proctrack/sgi_job: Fix segv when destroying an active job container
with processes still running.
- Abort a job's stdout/err to srun if not processed within 5 minutes
(prevents node hanging in completing state if the srun is stopped).
* Changes in SLURM 1.1.29
=========================
- Fix bug which could leave orphan process put into background from
batch script.
* Changes in SLURM 1.1.28
=========================
- BLUEGENE - Fixed issue so that nodes returning to service outside of an
admin state are now updated in the bluegene plugin.
- Fix for --get-user-env parsing of non-printing characters in users' logins.
- Restore "squeue -n localhost" support.
- Report lack of PATH env var as verbose message, not error in srun.
* Changes in SLURM 1.1.27
=========================
- Fix possible race condition for two simultaneous "scontrol show config"
calls resulting in slurm_xfree() Error: from read_config.c:642
- BLUEGENE - Put back logic to make a block fail a boot 3 times before
cancelling a user's job.
- Fix problem using srun --exclude option for a job step.
- Fix problem generating slurmd error "Unrecognized request: 0" with
some compilers.
* Changes in SLURM 1.1.26
=========================
- In sched/wiki2, fixes for support of job features.
- In sched/wiki2, add "FLAGS=INTERACTIVE;" to GETJOBS response for
non-batch (not srun --batch) jobs.
* Changes in SLURM 1.1.25
=========================
- switch/elan: Fix for "Failed to initialise stats structure" from
libelan when ELAN_STATKEY > MAX_INT.
- Tune PMI support logic for better scalability and performance.
- Fix for running a task on each node of an allocation if not specified.
- In sched/wiki2, set TASKLIST for running jobs.
- In sched/wiki2, set STARTDATE for pending jobs with deferred start.
- Added srun --get-user-env option (for Moab scheduler).
* Changes in SLURM 1.1.24
=========================
- In sched/wiki2, add support for direct "srun --dependency=" use.
- mpi/mvapich: Add support for MVAPICH protocol version 6.
- In sched/wiki2, change "JOBMODIFY" command to "MODIFYJOB".
- In sched/wiki2, change "JOBREQUEUE" command to "REQUEUEJOB".
- For sched/wiki2, permit normal user to specify arbitrary job id.
- In sched/wiki2, set buffer pointer to NULL after free() to avoid
possible memory corruption.
- In sched/wiki2, report a job's exit code on completion.
- For AIX, fix mail for job event notification.
- Add documentation for propagation options in man srun and slurm.conf.
* Changes in SLURM 1.1.23
=========================
- Fix bug in non-blocking connect() code affecting AIX.
* Changes in SLURM 1.1.22
=========================
- Add squeue option to print a job step's task count (-o %A).
- Initialize forward_struct to avoid trying to free a bad pointer,
patch from Anton Blanchard (SAMBA).
- In sched/wiki2, fix fatal race condition on slurmctld startup.
- Fix for displaying launching verbose messages for each node under the
tree instead of just the head one.
- Fix job suspend bug, job accounting plugin would SEGV when given a
bad job ID.
* Changes in SLURM 1.1.21
=========================
- BLUEGENE - Wait on a fini to make sure all threads are finished before
cleaning up.
- BLUEGENE - replacements to not destroy lists but just empty it to avoid
losing the pointer to the list in the block allocator.
- BLUEGENE - added --enable-bluegene-emulation configure option to 1.1
- In sched/wiki2, enclose a job's COMMENT value in double quotes.
- In sched/wiki2, support newly defined SIGNALJOB command.
- In sched/wiki2, maintain open event socket, don't open and close
for each event.
- In sched/wiki2, fix for scalability problem starting large jobs.
- Fix logic to execute a batch job step (under an existing resource
allocation) as needed by LSF.
- Patches from Hongjia Cao (pmi finalize issues and type declaration)
- Delete pending job if its associated partition is deleted.
- fix for handling batch steps completing correctly and setting the
return code.
- Altered ncurses check to make sure programs can link before saying we
have a working curses lib and header.
- Fixed an init issue with forward_struct_init not being set correctly in
a few locations in the slurmd.
- Fix for user to use the NodeHostname (when specified in the slurm.conf file)
to start jobs on.
* Changes in SLURM 1.1.20
=========================
- Added new SPANK plugin hook slurm_spank_local_user_init() called
from srun after node allocation.
- Fixed bug with hostfile support not working on a direct srun
* Changes in SLURM 1.1.19
=========================
- BLUEGENE - make sure the order of blocks read in from the bluegene.conf
are created in that order (static mode).
- Fix logic in connect(), slurmctld fail-over was broken in v1.1.18.
- Fix logic to calculate the correct timeout for fan out.
* Changes in SLURM 1.1.18
=========================
- In sched/wiki2, add support for EHost and EHostBackup configuration
parameters in wiki.conf file
- In sched/wiki2, fix memory management bug for JOBWILLRUN command.
- In sched/wiki2, consider job Busy while in Completing state for
KillWait+10 seconds (used to be 30 seconds).
- BLUEGENE - Fixes to allow full block creation on the system and not to add
passthrough nodes to the allocation when creating a block.
- BLUEGENE - Fix deadlock issue with starting and failing jobs at the same
time
- Make connect() non-blocking and poll() with timeout to avoid huge
waits under some conditions.
- Set "ENVIRONMENT=BATCH" environment variable for "srun --batch" jobs only.
- Add logic to save/restore select/cons_res state information.
- BLUEGENE - make all sprintf's into snprintf's
- Fix for "srun -A" segfault on a node failure.
* Changes in SLURM 1.1.17
=========================
- BLUEGENE - fix to make dynamic partitioning not go create block where
there are nodes that are down or draining.
- Fix srun's default node count with an existing allocation when neither
SLURM_NNODES nor -N are set.
- Stop srun from setting SLURM_DISTRIBUTION under job steps when a
specific distribution was not explicitly requested by the user.
* Changes in SLURM 1.1.16
=========================
- BLUEGENE - fix to make prolog run 5 minutes longer to make sure we have
enough time to free the overlapping blocks when starting a new job on a
block.
- BLUEGENE - edit to the libsched_if.so to read env and look at
MPIRUN_PARTITION to see if we are in slurm or running mpirun natively.
- Plugins are now dlopened RTLD_LAZY instead of RTLD_NOW.
* Changes in SLURM 1.1.15
=========================
- BLUEGENE - fix to be able to create static partitions
- Fixed fanout timeout logic.
- Fix for slurmctld timeout on outgoing message (Hongjia Cao, NUDT.edu.cn).
* Changes in SLURM 1.1.14
=========================
- In sched/wiki2: report job/node id and state only if no changes since
time specified in request.
- In sched/wiki2: include a job's exit code in job state information.
- In sched/wiki2: add event notification logic on job submit and completion.
- In sched/wiki2: add support for JOBWILLRUN command type.
- In sched/wiki2: for job info, include required HOSTLIST if applicable.
- In sched/wiki2: for job info, replace PARTITIONMASK with RCLASS (report
partition name associated with a job, but no task count)
- In sched/wiki2: for job and node info, report all data if TS==0,
volatile data if TS<=update_time, state only if TS>update_time
- In sched/wiki2: add support for CMD=JOBSIGNAL ARG=jobid SIGNAL=name or #
- In sched/wiki2: add support for CMD=JOBMODIFY ARG=jobid [BANK=name]
[TIMELIMIT=minutes] [PARTITION=name]
- In sched/wiki2: add support for CMD=INITIALIZE ARG=[USEHOSTEXP=T|F]
[EPORT=#]; RESPONSE=EPORT=# USEHOSTEXP=T
- In sched/wiki2: fix memory leak.
- Fix sinfo node state filtering when asking for idle nodes that are also
draining.
- Add Fortran extension to slurm_get_rem_time() API.
- Fix bug when changing the time limit of a running job that has previously
been suspended (formerly failed to account for suspend time in setting
termination time).
- fix for step allocation to be able to specify only a few nodes in a
step and ask for more than specified.
- patch from Hongjia Cao for forwarding logic
- BLUEGENE - able to allocate specific nodes without locking up.
- BLUEGENE - better tracking of blocks that are created dynamically,
less hitting the db2.
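
The sched/wiki2 timestamp rule listed in the 1.1.14 entries above (all data if TS==0, volatile data if TS<=update_time, state only if TS>update_time) can be written out as a small decision function. This Python sketch is illustrative only; the function name detail_level and the returned labels are hypothetical and are not part of the plugin.

```python
def detail_level(ts, update_time):
    """Pick how much job/node data to report for a request timestamp.

    Hypothetical helper that encodes only the three cases named in the
    changelog entry; the label strings are made up for this sketch.
    """
    if ts == 0:
        return "all"        # TS==0: report everything
    if ts <= update_time:
        return "volatile"   # record changed at or after TS: volatile data
    return "state"          # nothing changed since TS: id and state only
```

The TS==0 check comes first so that a zero timestamp always yields a full report, whatever update_time is.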
* Changes in SLURM 1.1.13
=========================
- Fix hang in sched/wiki2 if Moab stops responding when
response is outgoing.
- BLUEGENE - fix to make sure the block is good to go when picking it
- BLUEGENE - add libsched_if.so so mpirun doesn't try to create a block
by itself.
- Enable specification of srun --jobid=# option with --batch (for user root).
- Verify that job actually starts when requested by sched/wiki2.
- Add new wiki.conf parameters: EPort and JobAggregationTime for event
notification logic (see wiki.conf man page for details)
* Changes in SLURM 1.1.12
=========================
- Sched/wiki2 to report a job's account as COMMENT response to GETJOBS
request.
- Add srun option "--comment" (maps to job account until slurm v1.2,
needed for Moab scheduler functionality).
- fixed some timeout issues in the controller hopefully stopping all the
issues with excessive timeouts.
- unit conversion (i.e. 1024 => 1k) only happens on bgl systems for node
count.
- Sched/wiki2 to report a job's COMPLETETIME and SUSPENDTIME in GETJOBS
response.
- Added support for Mellanox's version of mvapich-0.9.7.
* Changes in SLURM 1.1.11
=========================
- Update file headers adding permission to link with OpenSSL.
- Enable sched/wiki2 message authentication.
- Fix libpmi compilation issue.
- Remove "gcc-c++ python" from slurm.spec BuildRequires. It breaks
the AIX build, so we'll have to find another way to deal with that.
* Changes in SLURM 1.1.10
=========================
-- task distribution fix for steps that are smaller than job allocation.
-- BLUEGENE - fix to only send a success when block was created when trying
to allocate the block.
-- fix so if slurm_send_recv_node_msg fails on the send the auth_cred returned
by the resp is NULL.
-- Fix switch/federation plugin so backup controller can assume control
repeatedly without leaking or corrupting memory.
-- Add new error code (for Maui/Moab scheduler): ESLURM_JOB_HELD
-- Tweak slurmctld's node ping logic to better handle failed nodes with
hierarchical communications fail-over logic.
-- Add support for sched/wiki specific configuration file "wiki.conf".
-- Added sched/wiki2 plugin (new experimental wiki plugin).
* Changes in SLURM 1.1.9
========================
-- BLUEGENE - fix to handle a NO_VAL sent in as num procs in the job
description.
-- Fix bug in slurmstepd code for parsing --multi-prog command script.
Parser was failing for commands with no arguments.
-- Fix bug to check unsigned ints correctly in bitstring.c
-- Alter node count conversion to kilo to only convert numbers divisible by
1024 or 512
* Changes in SLURM 1.1.8
========================
-- Added bug fixes (fault-tolerance and memory leaks) from Hongjia Cao
<hjcao@nudt.edu.cn>
-- Fixed some potential BLUEGENE issues with the bridge log file not having
a mutex around the fclose and fopen.
-- BLUEGENE - srun -n procs now registers correctly
-- Fixed problem with reattach double allocating step_layout->tids
-- BLUEGENE - fix race condition where job is finished before it starts.
* Changes in SLURM 1.1.7
========================
-- BLUEGENE - fixed issue with doing an allocation for nodes since asking
for 32,128, or 512 all mean 1 to the controller.
-- Add "Include" directive to slurm.conf files. If "Include" is found
at the beginning of a line followed by whitespace and then
the full path to a file, that file is included inline with the current
slurm.conf file.
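
The "Include" rule described in the entry above (the word "Include" at the beginning of a line, then whitespace, then the full path to a file) can be sketched in Python as follows. This is an illustration, not SLURM's C implementation: the name expand_includes is hypothetical, and the nesting limit is an added safety assumption that the entry does not mention.

```python
import re

# "Include" at the start of a line, whitespace, then a path.
INCLUDE_RE = re.compile(r"^Include\s+(\S.*)$")


def expand_includes(path, depth=0, max_depth=10):
    """Return the lines of a config file with Include lines expanded inline.

    Hypothetical sketch; max_depth guards against include loops and is
    an assumption of this example, not documented SLURM behavior.
    """
    if depth > max_depth:
        raise RuntimeError("Include nesting too deep: " + path)
    expanded = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            match = INCLUDE_RE.match(line)
            if match:
                # Splice the included file in place of the directive.
                included = match.group(1).strip()
                expanded.extend(expand_includes(included, depth + 1, max_depth))
            else:
                expanded.append(line)
    return expanded
```

A line that merely mentions "Include" later in the text, or that is indented, is left untouched here, since the entry only describes the directive at the beginning of a line.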
* Changes in SLURM 1.1.6
========================
-- Improved task layout for relative positions
-- Fixed heterogeneous cpu overcommit issue
-- Fix bug where srun would hang if it ran on one node and that
node's slurmd died
-- Fix bug where srun task layout would be bad when min-max node range is
specified (e.g. "srun -N1-4 ...")
-- Made slurmctld_conf.node_prefix only be set on Bluegene systems.
-- Fixed a race condition in the controller to make it so a plugin thread
wouldn't be able to access the slurmctld_conf structure before it was
filled.
* Changes in SLURM 1.1.5
========================
-- Ignore partition's MaxNodes for SlurmUser and root.
-- Fix possible memory corruption with use of PMI_KVS_Create call.
-- Fix race condition when multiple PMI_KVS_Barrier calls.
-- Fix logic in which slurmctld outgoing RPC requests could get delayed.
-- Fix logic for laying out steps without a hostlist.
* Changes in SLURM 1.1.4
========================
-- Improve error handling in hierarchical communications logic.
* Changes in SLURM 1.1.3
========================
-- Fix big-endian bug in the bitstring code which plagued AIX.
-- Fix bug in handling srun's --multi-prog option, could go off end of buffer.
-- Added support for job step completion (and switch window release) on
subset of allocated nodes.
-- BLUEGENE - removed configure option --with-bg-link. The bridge is now
linked with dlopen, no longer needing fake database .so files on the
frontend node.
-- BLUEGENE - implemented use of rm_get_partition_info instead of
...partitions_info, which gives a much better design and improves stability.
-- Streamline PMI communications and increase timeouts for highly parallel
jobs. Improves scalability of PMI.
* Changes in SLURM 1.1.2
========================
-- Fix bug in jobcomp/filetxt plugin to report proper NodeCnt when a job
fails due to a node failure.
-- Fix Bluegene configure to work with the new 64bit libs.
-- Fix bug in controller that causes it to segfault when hit with a malformed
message.
-- For "srun --attach=X" to another user's job, report an error and exit (it
previously just hung).
-- BLUEGENE - fix for doing correct small block logic on user error.
-- BLUEGENE - Added support in slurmd to create a fake libdb2.so if it
doesn't exist so smap won't seg fault
-- BLUEGENE - "scontrol show job" reports "MaxProcs=None" and "Start=None"
if values are not specified at job submit time
-- Add retry logic for PMI communications, may be needed for highly parallel
jobs.
-- Fix bug in slurmd where variable is used in logging message after freed
(slurmstepd rank info).
-- Fix bug in "scontrol show daemons": with NodeName=localhost it now
correctly displays slurmd as running on that node.
-- Patch from HP for init nodes before init_bitmaps
-- ctrl-c killed sruns will result in job state as cancelled instead of
completed.
-- BLUEGENE - added configure option --with-bg-link to choose dynamic linking
or static linking with the bridgeapi.
* Changes in SLURM 1.1.1
========================
-- Fix bug in packing job suspend/resume RPC.
-- If a user breaks out of srun before the allocation takes place, mark the
job as CANCELLED rather than COMPLETED and change its start and end time
to that time.
-- Fix bug in PMI support that prevented use of second PMI_Barrier call.
This fix is needed for MVAPICH2 use.
-- Add "-V" options to slurmctld and slurmd to print version number and exit.
-- Fix scalability bug in sbcast.
-- Fix bug in cons_res allocation strategy.
-- Fix bug in forwarding with mpi
-- Fix bug in sacct forwarding with stat option
-- Added nodeid to sacct stat information
-- Cleaned up the way slurm_send_recv_node_msg works; it no longer clears errno
-- Fix error handling bug in the networking code that causes the slurmd to
xassert if the server is not running when the slurmd tries to register.
* Changes in SLURM 1.1.0
========================
-- Fix bug that could temporarily make nodes DOWN when they are really
responding.
-- Fix bug preventing backup slurmctld from responding to PING RPCs.
-- Set "CFLAGS=-DISO8601" before configuration to get ISO8601 format
times for all SLURM commands. NOTE: This may break Moab, Maui, and/or
LSF schedulers.
-- Fix for srun -n and -O options when paired with -b.
-- Added logic for fanout to failover to forward list if main node is
unreachable
-- sacct also now keeps track of submitted, started and ending times of jobs
-- reinit config file mutex at beginning of slurmstepd to avoid fork issues
* Changes in SLURM 1.1.0-pre8
=============================
-- Fix bug in enforcement of partition's MaxNodes limit.
-- BLUEGENE - added support for srun -w option; also fixed the geometry
option for srun.
* Changes in SLURM 1.1.0-pre7
=============================
-- Accounting now works on AIX systems; use jobacct/aix
-- Support large (over 2GB) files on 32-bit linux systems
-- changed all writes to safe_write in srun
-- added $float to globals.example in the testsuite
-- Set job's num_proc correctly for jobs that do not have exclusive use
of their allocated nodes.
-- Change in support for test suite: 'testsuite/expect/globals.example'
is now 'testsuite/expect/globals' and you can override variable
settings with a new file 'testsuite/expect/globals.local'.
-- Job suspend now sends SIGTSTP, sleep(1), sends SIGSTOP for better
MPI support.
-- Bluegene - before assigning a job to a block the plugin will check the bps
to make sure they aren't in error state.
-- Change time format in job completion logging (JobCompType=jobcomp/filetxt)
from "MM/DD HH:MM:SS" to "YYYY-MM-DDTHH:MM:SS", conforming with the ISO8601
standard format.
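
The job completion time format change noted in this release can be
illustrated with a short Python sketch. The constant and function names are
hypothetical, and the strftime patterns are simply a reading of the two
formats quoted in the entry, not code from the jobcomp/filetxt plugin.

```python
from datetime import datetime

OLD_FORMAT = "%m/%d %H:%M:%S"      # "MM/DD HH:MM:SS"
NEW_FORMAT = "%Y-%m-%dT%H:%M:%S"   # "YYYY-MM-DDTHH:MM:SS" (ISO8601)


def format_old(ts):
    """Render a timestamp the way the log looked before this change."""
    return ts.strftime(OLD_FORMAT)


def format_new(ts):
    """Render a timestamp in the ISO8601 form used after this change."""
    return ts.strftime(NEW_FORMAT)
```

Unlike the old form, the new form carries the year and sorts correctly as
plain text, which matters for tools that parse the completion log.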
* Changes in SLURM 1.1.0-pre6
=============================
-- Added logic to "stat" a running job with sacct option -S; use -j to
specify job.step
-- Removed jobacct/bluegene: there is no real need for it, since there does
not yet appear to be a way to gather the data.
-- Added support for mapping "%h" in configured SlurmdLog to the hostname.
-- Add PropagatePrioProcess to control propagation of a user's nice value
to spawned tasks (based upon work by Daniel Christians, HP).
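
The "%h" mapping for SlurmdLog mentioned in this release amounts to a simple
substitution. The Python sketch below is illustrative only; the function name
expand_slurmd_log is hypothetical, and whether slurmd uses the short or the
fully qualified hostname is an assumption left open here.

```python
import socket


def expand_slurmd_log(pattern, hostname=None):
    """Replace every "%h" in a configured log path with the hostname."""
    if hostname is None:
        # Assumption: whatever the local resolver reports as the hostname.
        hostname = socket.gethostname()
    return pattern.replace("%h", hostname)
```

With a hypothetical setting of "/var/log/slurmd.%h.log", each node would
write to its own file on a shared file system.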
* Changes in SLURM 1.1.0-pre5
=============================
-- Added step completion RPC logic
-- Vastly changed sacct and the jobacct plugin. Read documentation for full
details.
-- Added jobacct plugin for AIX and BlueGene; they currently don't work,
but infrastructure is in place.
-- Add support for srun option --ctrl-comm-ifhn to set PMI communications
address (Hongjia Cao, National University of Defense Technology).
-- Moved safe_read/write to slurm_protocol_defs.h removing multiple copies.
-- Remove vestigial functions slurm_allocate_resources_and_run() and
slurm_free_resource_allocation_and_run_response_msg().
-- Added support for different executable files and arguments by task based
upon a configuration file. See srun's --multi-prog option (based upon
work by Hongjia Cao, National University of Defense Technology).
-- Changed the way the forward logic waits for the fanout logic, mostly
eliminating scalability problems.
-- changed -l option in sacct to display different params see sacct/sacct.h
for details.
* Changes in SLURM 1.1.0-pre4
=============================
-- Bluegene specific - Added support to set bluegene block state to
free/error via scontrol update BlockName
-- Add needed symbol to select/bluegene in order to load plugin.
* Changes in SLURM 1.1.0-pre3
=============================
-- Added framework for XCPU job launch support.
-- New general configuration file parser and slurm.conf handling code.
Allows long lines to be continued on the next line by ending with a "\".
Whitespace is allowed between the key and "=", and between the "=" and
value.
WARNING: A NodeName may now occur only once in a slurm.conf file.
If you want to temporarily make nodes DOWN in the slurm.conf,
use the new DownNodes keyword (see "man slurm.conf").
-- Gracefully handle request to submit batch job from within an existing
batch job.
-- Warn user attempting to create a job allocation from within an existing job
allocation.
-- Add web page description for proctrack plugin.
-- Add new function slurm_get_rem_time() for job's time limit.
-- JobAcct plugin renamed from "log" to "linux" in preparation for support of
new system types.
WARNING: "JobAcctType=jobacct/log" is no longer supported.
-- Removed vestigial 'bg' names from bluegene plugin and smap
-- InactiveLimit parameter is not enforced for RootOnly partitions.
-- Update select/cons_res web page (Susanne Balle, HP,
cons_res_doc_patch_3_29_06).
-- Build a "slurmd.test" along with slurmd. slurmd.test has the path to
slurmstepd set allowing it to run unmodified out of the builddir for
testing (Mark Grondona).
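
The line-continuation and whitespace rules of the new slurm.conf parser
introduced in this release can be sketched as below. This is a simplified
Python illustration, not the parser itself: the function name
parse_conf_lines is hypothetical, and the sketch assumes one key per line,
whereas real slurm.conf lines may carry several key=value pairs.

```python
def parse_conf_lines(text):
    """Return (key, value) pairs from simple "key = value" config text."""
    pairs = []
    pending = ""
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.endswith("\\"):
            # A trailing backslash continues the entry on the next line.
            pending += line[:-1]
            continue
        line = (pending + line).strip()
        pending = ""
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: silently ignore lines without "="
        # Whitespace is allowed on either side of the "=".
        pairs.append((key.strip(), value.strip()))
    return pairs
```

For example, "TreeWidth = 50" and "TreeWidth=50" both yield the pair
("TreeWidth", "50") in this sketch.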
* Changes in SLURM 1.1.0-pre2
=============================
-- Added "bcast" command to transmit copies of a file to compute nodes
with message fanout.
-- Bluegene specific - Added support for overlapping partitions and
dynamic partitioning.
-- Bluegene specific - Added support for nodecard sized blocks.
-- Added logic to accept 1k for 1024 and so on for --nodes option of srun.
This logic is also used by display tools such as smap, sinfo, scontrol,
and squeue.
-- Added bluegene.conf man page.
-- Added support for memory affinity, see srun --mem_bind option.
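
The "1k for 1024" shorthand for --nodes noted in this release could be
parsed along the following lines. This Python sketch is illustrative only;
the function name parse_node_count is hypothetical, and accepting an
upper-case "K" is an assumption.

```python
def parse_node_count(text):
    """Convert a node count such as "512" or "1k" to an integer."""
    text = text.strip()
    if text[-1:] in ("k", "K"):
        # A trailing "k" multiplies the number by 1024.
        return int(text[:-1]) * 1024
    return int(text)
```

Under this reading, "srun --nodes=2k" would request 2048 nodes.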
* Changes in SLURM 1.1.0-pre1
=============================
-- New --enable-multiple-slurmd configure parameter to allow running
more than one copy of slurmd on a node at the same time. Only
really useful for developers.
-- New communication is now branched on all processes to slurmd's from
slurmctld and srun launch command. This is done with a tree type
algorithm. Spawn and batch mode work the same as before. New slurm.conf
variable TreeWidth=50 is default. This is the number of threads per
-- Configuration parameter HeartBeatInterval is deprecated. Now use half
of SlurmdTimeout and SlurmctldTimeout for communications to slurmd and
slurmctld daemons respectively.
-- Add hash tables for select/cons_res plugin (Susanne Balle, HP,
patch_02222006).
-- Remove some use of cr_enabled flag in slurmctld job record, use
new flag "test_only" in select_g_job_test() instead.
* Changes in SLURM 1.0.17
=========================
-- Set correct user groups for task epilogs.
-- Add more debugging for tracking slow slurmd job initiations
(slurm.hp.replaydebug.patch).
* Changes in SLURM 1.0.16
=========================
-- For "srun --attach=X" to another user's job, report an error and exit (it
previously just hung).
-- Make sure that "scancel -s KILL" terminates the job just like "scancel"
including deletion of all job steps (Chris Holmes, HP, slurm,patch).
-- Recognize ISO-8859 input to srun as a script (for non-English scripts).
-- switch/elan: Fix bug in propagation of ELAN_STATKEY environment variable.
-- Fix bug in slurmstepd IO code that can result in it spinning if a
certain error occurs.
-- Remove nodes from srun's required node list if their count exceeds
the number of requested tasks.
-- sched/backfill to schedule around jobs that are hung in a completing
state.
-- Avoid possibly re-running the epilog for a job on slurmctld restart or
reconfig by saving and restoring a hostlist of nodes still completing
the job.
* Changes in SLURM 1.0.15
=========================
-- In srun, reset stdin to blocking mode (if it was originally blocking before
we set it to O_NONBLOCK) on exit to avoid trouble with things like running
srun under a bash shell in an emacs *shell* buffer.
-- Fix srun race condition that occasionally causes segfaults at shutdown
-- Fix obscure locking issues in log.c code.
-- Explicitly close IO related sockets. If an srun gets "stuck", possibly
because of unkillable tasks in its job step, it will not hold many TCP
sockets in the CLOSE_WAIT state.
-- Increase the SLURM protocol timeout from 5 seconds to 10 seconds.
(In 1.2 there will be a slurm.conf parameter for this, rather than having
it hardcoded.)
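
The stdin handling described in the first entry of this release follows a
save-and-restore pattern, sketched here in Python for illustration (srun
itself is written in C). The function names are hypothetical; the fcntl
calls are the standard POSIX ones.

```python
import fcntl
import os


def set_nonblocking(fd):
    """Set O_NONBLOCK on fd and return the flags it had beforehand."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
    return flags


def restore_blocking(fd, original_flags):
    """Clear O_NONBLOCK on exit, but only if fd was blocking originally."""
    if not original_flags & os.O_NONBLOCK:
        current = fcntl.fcntl(fd, fcntl.F_GETFL)
        fcntl.fcntl(fd, fcntl.F_SETFL, current & ~os.O_NONBLOCK)
```

The restore matters because the flag lives on the open file description,
which the parent shell shares; leaving it set can confuse the shell after
the program exits.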
* Changes in SLURM 1.0.14
=========================
-- Fix for bad xfree() call in auth/munge which can raise an assert().
-- Fix installed fork handlers for the conf mutex for slurmd and slurmstepd.
* Changes in SLURM 1.0.13
=========================
-- Fix for AllowGroups option to work when the /etc/group file doesn't
contain all users in group by adding the uids of the names in /etc/passwd
that have a gid of that which we are looking for.
-- Fix bug in InactiveLimit support that can potentially purge active jobs.
NOTE: This is highly unlikely except on very large AIX clusters.
-- Fix bug for reinitializing the config_lock around the control_file in
slurm_protocol_api.c. The logic has changed in 1.1, so there is no need
to merge.
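
The AllowGroups fix in this release combines two sources of membership: the
names listed for the group in /etc/group, and the users in /etc/passwd whose
primary gid is that group. A rough Python sketch of the idea follows; the
function name and the tuple layout of passwd_entries are hypothetical, chosen
only for illustration.

```python
def group_member_uids(gid, group_members, passwd_entries):
    """Return the sorted uids of everyone who belongs to group gid.

    group_members:  user names listed for the group in /etc/group
    passwd_entries: iterable of (name, uid, primary_gid) from /etc/passwd
    """
    passwd_entries = list(passwd_entries)
    uid_by_name = {name: uid for name, uid, _ in passwd_entries}
    # Users explicitly listed in the group file.
    uids = {uid_by_name[n] for n in group_members if n in uid_by_name}
    # Users whose primary gid is the group, even if the group file omits them.
    uids.update(uid for _, uid, pgid in passwd_entries if pgid == gid)
    return sorted(uids)
```

On a live system the inputs could be taken from Python's grp.getgrnam() and
pwd.getpwall(), which expose the same fields.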
* Changes in SLURM 1.0.12
=========================
-- Report node state of DRAIN rather than DOWN if DOWN with DRAIN flag set.
-- Initialize job->mail_type to 0 (NONE) for job submission.
-- Fix for stalled task stdout/stderr when buffered I/O is used, and
a single line exceeds 4096 bytes.
-- Memory leak fixes for maui plugin (hjcao@nudt.edu.cn)
-- Fix for spinning srun when the terminal to which srun is talking
goes away.
-- Don't set avail_node_bitmap for DRAINED nodes on slurmctld reconfig
(can schedule a job on drained node after reconfig).
* Changes in SLURM 1.0.11
=========================
-- Fix for slurmstepd hang when launching a task. (Needed to install
list library's atfork handlers).
-- Fix memory leak on AIX (and possibly other architectures) due to
missing pthread_attr_destroy() calls.
-- Fix rare task standard I/O setup bug. When the bug hit, stdin, stdout,
or stderr could be an invalid file descriptor.
-- General slurmstepd file descriptor cleanup.
-- Fix memory leak in job accounting logic (Andy Riebs, HP, memory_leak.patch).
* Changes in SLURM 1.0.10
=========================
-- Fix for job accounting logic, submitted by Andy Riebs, to handle issues
with suspending jobs and such (patch file named requeue.patch).
-- Make select/cons_res interoperate with mpi/lam plugin for task counts.
-- Fix race condition where srun could seg-fault due to use of logging functions
within pthread after calling log_fini.
-- Code changes for clean build with gcc 2.96 (gcc_2_96.patch, Takao Hatazaki, HP).
-- Add CacheGroups configuration support in configurator.html (configurator.patch,
Takao Hatazaki, HP).
-- Fix bug preventing use of mpich-gm plugin (mpichgm.patch, Takao Hatazaki, HP).
* Changes in SLURM 1.0.9
========================
-- Fix job accounting logic to open new log file on slurmctld reconfig.
(Andy Riebs, slurm.hp.logfile.patch).
-- Fix bug which allows a user to run a batch script on a node not allocated
by the slurmctld.
-- Fix poe MP_HOSTFILE handling bug on AIX.
* Changes in SLURM 1.0.8
========================
-- Fix to communication between slurmd and slurmstepd to allow for partial
reads and writes on their communication pipes.
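
Handling partial reads and writes on a pipe, as in the fix above, usually
means looping until the full byte count has been transferred. The Python
sketch below shows the general technique only; the function names are
hypothetical and this is not the slurmd/slurmstepd code.

```python
import os


def read_exact(fd, nbytes):
    """Read exactly nbytes from fd, looping over short reads."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            raise EOFError("pipe closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def write_all(fd, data):
    """Write all of data to fd, looping over short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]
```

A single read() or write() on a pipe may transfer fewer bytes than
requested, so code that assumes one call moves a whole message can
desynchronize the two ends.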
* Changes in SLURM 1.0.7
========================
-- Change in how AuthType=auth/dummy is handled for security testing.
-- Fix for bluegene systems to allow full system partitions to stay booted
when other jobs are submitted to the queue.
* Changes in SLURM 1.0.6
========================
-- Prevent slurmstepd from crashing when srun attaches to batch job.
* Changes in SLURM 1.0.5
========================
-- Restructure logic for scheduling BlueGene small block jobs. Added
"test_only" flag to select_p_job_test() in select plugin.
-- Correct squeue "NODELIST" output for BlueGene small block jobs.
-- Fix possible deadlock situations on BlueGene plugin on errors.
* Changes in SLURM 1.0.4
========================
-- Release job allocation if step creation fails (especially for BlueGene).
-- Fix bug in select/bluegene warm start with changed bglblock layout.
-- Fix bug for queuing full-system BlueGene jobs.