Commits · 9b1dadea4eb823b5ef29d8b4ee56cb6b7c3be22f · Manuel G. Marciani / ces_slurm_simulator

03 Apr, 2014 1 commit

launch/poe - fix network value · 01fecf4d

Morris Jette authored Apr 02, 2014

If a job step's network value is set by poe, either by directly
executing poe or srun launching poe, that value was not being
propagated to the job step creation RPC and the network was not
being set up for the proper protocol (e.g. mpi, lapi, pami, etc.).
The previous logic would only work if the srun execute line
explicitly set the protocol using the --network option.

01fecf4d

02 Apr, 2014 1 commit
- Update NEWS and squeue man page. · 247c3ce0
  David Bigagli authored Apr 02, 2014
  
  247c3ce0
31 Mar, 2014 2 commits
- In the PMI implementation by default don't check for duplicate keys. · 88cafae9
  David Bigagli authored Mar 31, 2014
  
  88cafae9
- preempt/partition_prio fix · a0ba1865
  Marcin Stolarek authored Mar 31, 2014
```
Prevent preemption of jobs in partition where PreemptMode=off
```
  a0ba1865
28 Mar, 2014 3 commits
- Fix for building on older hwloc · 31028ebf
  Unknown authored Mar 28, 2014
```
Define hwloc_const_bitmap_t as typedef hwloc_const_cpuset_t
Build fails in task_cgroup_cpuset.c when using an older hwloc (v1.0.2)
due to missing definition of hwloc_const_bitmap_t.
```
  31028ebf
- Do not check cleaning on "pending" steps. · 25b346b2
  Danny Auble authored Mar 28, 2014
  
  25b346b2
- BGQ - fix check for jobinfo when it is NULL · 24f33bbe
  Danny Auble authored Mar 28, 2014
  
  24f33bbe
27 Mar, 2014 3 commits
- Revert "Add a new environment variable PMI2_CONNECT_TO_SERVER. If set in the MPI" · 96384ccb
  David Bigagli authored Mar 27, 2014
```
This reverts commit 084787c0.

Conflicts:

	NEWS
	contribs/pmi2/pmi2_api.c
	src/plugins/mpi/pmi2/mpi_pmi2.c
```
  96384ccb
- Add "Scheduling Configuration Guide" web page · 3f9382c2
  Morris Jette authored Mar 27, 2014
  
  3f9382c2
- add job std_in, std_out and std_err to perl api · 79bd54fe
  Franco Broi authored Mar 27, 2014
```
Add support for job std_in, std_out and std_err fields in Perl API.
```
  79bd54fe
26 Mar, 2014 2 commits
- Update META and NEWS for v14.03.0 · bd148fac
  Morris Jette authored Mar 26, 2014
  
  bd148fac
- Lock the /cgroup/freezer subsystem when creating files for tracking · bd05aaf2
  David Bigagli authored Mar 26, 2014
```
processes.
```
  bd05aaf2
25 Mar, 2014 2 commits
- expand hostlist expression support · 08a97f02
  Morris Jette authored Mar 25, 2014
```
Modify hostlist expressions to accept more than two numeric ranges
(e.g. "row[1-3]rack[0-8]slot[0-63]")
```
  08a97f02
- mysql - Fix invalid memory reference. · 00cabba3
  Danny Auble authored Mar 25, 2014
  
  00cabba3
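
The hostlist change in 08a97f02 above can be illustrated with a small sketch. This is not Slurm's C implementation in hostlist.c; it is a hypothetical Python helper (`expand_hostlist` is an invented name) that assumes each bracket holds comma-separated numbers or `lo-hi` ranges and that a leading zero in the lower bound requests zero padding.

```python
import itertools
import re


def expand_hostlist(expr):
    """Expand e.g. "row[1-3]rack[0-8]" into a list of host names (illustrative only)."""
    # re.split with a capturing group alternates: literal, ranges, literal, ...
    parts = re.split(r"\[([^\]]+)\]", expr)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])  # literal text between brackets
            continue
        values = []
        for item in part.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo  # a single number is a one-element range
            # Assumption: a lower bound like "01" asks for zero-padded output.
            width = len(lo) if len(lo) > 1 and lo.startswith("0") else 0
            values.extend(str(n).zfill(width) for n in range(int(lo), int(hi) + 1))
        choices.append(values)
    # The cross product handles any number of bracketed ranges.
    return ["".join(combo) for combo in itertools.product(*choices)]


hosts = expand_hostlist("row[1-3]rack[0-8]slot[0-63]")
print(len(hosts), hosts[0], hosts[-1])
```

Because the expansion is a cross product over every bracketed group, the example expression from the commit message yields 3 x 9 x 64 names.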
24 Mar, 2014 4 commits

Added sacctmgr mod qos set RawUsage=0 · f7fb80ec
Danny Auble authored Mar 24, 2014

f7fb80ec

Make AccountingStorageEnforce=all not include nojobs or nosteps. · acbaab41
Danny Auble authored Mar 24, 2014

acbaab41

Add job array hash table · ac7fabc6

Morris Jette authored Mar 24, 2014

Previous logic would typically do a list search to find job array elements.
This commit adds two hash tables for job arrays. The first is based upon
the "base" job ID which is common to all tasks. The second hash table
is based upon the sum of the "base" job ID plus the task ID in the array.
This will substantially improve performance for handling dependencies
with job arrays.

ac7fabc6
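
The two-table idea described in ac7fabc6 can be sketched as follows. This is a minimal Python illustration, not the slurmctld code: the class name, the bucket count, and the chained-bucket layout are assumptions. Only the choice of keys (the base job ID, and the sum of base job ID and task ID) comes from the commit message.

```python
from collections import defaultdict


class JobArrayIndex:
    """Two chained hash tables for job array records (illustrative sketch)."""

    def __init__(self, size=1024):
        self.size = size                   # hypothetical bucket count
        self.by_base = defaultdict(list)   # keyed on the base job ID
        self.by_task = defaultdict(list)   # keyed on base job ID + task ID

    def add(self, base_id, task_id, record):
        entry = (base_id, task_id, record)
        self.by_base[base_id % self.size].append(entry)
        self.by_task[(base_id + task_id) % self.size].append(entry)

    def find_task(self, base_id, task_id):
        # Different (base, task) pairs can share a sum, so compare both fields.
        for b, t, record in self.by_task.get((base_id + task_id) % self.size, []):
            if b == base_id and t == task_id:
                return record
        return None

    def find_array(self, base_id):
        # All tasks of one array share a bucket in the base-ID table.
        return [rec for b, _, rec in self.by_base.get(base_id % self.size, [])
                if b == base_id]


index = JobArrayIndex(size=8)
index.add(47, 0, "task 47_0")
index.add(47, 1, "task 47_1")
index.add(46, 2, "task 46_2")  # 46 + 2 collides with 47 + 1
print(index.find_task(47, 1), index.find_array(47))
```

Either lookup touches one bucket instead of walking the whole job list, which is where the dependency-handling speedup would come from.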

job array dependency recovery fix · fca71890

Morris Jette authored Mar 24, 2014

When slurmctld restarted, it would not recover dependencies on
job array elements and would just discard the dependency. This
corrects the parsing problem to recover the dependency. The old code
would print a message like this and discard it:
slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*

fca71890
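
To show the shape of the string that has to survive a restart, here is a hedged sketch of parsing a dependency such as `afterany:47_*`. The function name and the returned tuple layout are hypothetical, and the real parser in slurmctld is C code that handles more cases; this sketch only covers `type:job[:job...]` where each job is `N`, `N_M`, or `N_*`.

```python
def parse_dependency(spec):
    """Parse "type:job[:job...]" into (type, base_job_id, task) tuples.

    task is None for a plain job, "*" for every task of a job array,
    or an int for one array element. Illustrative sketch only.
    """
    dep_type, _, jobs = spec.partition(":")
    if not dep_type or not jobs:
        raise ValueError("invalid dependency: %r" % spec)
    parsed = []
    for job in jobs.split(":"):
        base, sep, task = job.partition("_")
        if not base.isdigit():
            raise ValueError("invalid job ID in dependency: %r" % job)
        if not sep:
            task_id = None          # not a job array element
        elif task == "*":
            task_id = "*"           # all tasks of the array
        elif task.isdigit():
            task_id = int(task)     # one specific task
        else:
            raise ValueError("invalid task ID in dependency: %r" % job)
        parsed.append((dep_type, int(base), task_id))
    return parsed


print(parse_dependency("afterany:47_*"))
```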

22 Mar, 2014 1 commit

Fix sview abort when adding/removing columns · fbfd0e4d

Morris Jette authored Mar 22, 2014

When adding or removing columns to most data types (jobs, partitions,
nodes, etc.) on some system types an abort is generated. This appears
to be because when columns displayed change, on some systems that
changes the address of "model", while on others the address does not
change (like my laptops). This fix explicitly sets the last_model to
NULL when the columns are changed rather than relying upon the data
structure's address to change.

fbfd0e4d

21 Mar, 2014 4 commits

NRT - Fix issue with 1 node jobs. It turns out the network does need to · 440932df

Danny Auble authored Mar 21, 2014

be set up for 1 node jobs. Here are some of the reasons from IBM...

1. PE expects it.
2. For failover, if there was some challenge or difficulty with the
shared-memory method of data transfer, the protocol stack might
want to go through the adapter instead.
3. For flexibility, the protocol stack might want to be able to transfer
data using some variable combination of shared memory and adapter-based
communication, and
4. Possibly most important, for overall performance, it might be that
bandwidth or efficiency (BW per CPU cycles) might be better using the
adapter resources. (An obvious case is for large messages, it might
require a lot fewer CPU cycles to program the DMA engines on the
adapter to move data between tasks, rather than depend on the CPU
to move the data with loads and stores, or page re-mapping -- and
a DMA engine might actually move the data more quickly, if it's well
integrated with the memory system, as it is in the P775 case.)

440932df

get implicit MPMD task count from config file · 718c8479

Morris Jette authored Mar 20, 2014

If srun is invoked with the --multi-prog option, but no task count, then use
the task count provided in the MPMD configuration file.

718c8479
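
One way to picture deriving a task count from an MPMD configuration file is sketched below. It assumes the documented srun format in which each line starts with task ranks (comma-separated values, `lo-hi` ranges, or `*` for all remaining tasks), and it assumes the implied count is the highest explicit rank plus one. That rule and the function name are assumptions for illustration, not a statement of what the commit does internally.

```python
def implicit_task_count(config_text):
    """Return highest explicit task rank + 1 from an MPMD config (sketch)."""
    highest = -1
    for line in config_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        ranks = line.split()[0]                # first field holds the task ranks
        if ranks == "*":
            continue                           # wildcard adds no explicit rank
        for item in ranks.split(","):
            lo, _, hi = item.partition("-")
            highest = max(highest, int(hi or lo))
    return highest + 1


example = """\
# ranks   program   arguments
0-1       ./master
2,4       ./worker  %t
*         ./idle
"""
print(implicit_task_count(example))
```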

Added scontrol errnumstr command · 04bd1b88
Morris Jette authored Mar 20, 2014

04bd1b88

Update squeue.1 man page describing the SPECIAL_EXIT state. · be3be5d3
David Bigagli authored Mar 20, 2014

be3be5d3

20 Mar, 2014 3 commits
- Change xmalloc()/xfree() to malloc()/free() in hostlist.c for better · b3946aa7
  Hongjia Cao authored Mar 20, 2014
```
performance.
```
  b3946aa7
- task/affinity - Protect against zero divide when simulating more hardware · 92b4de3c
  Danny Auble authored Mar 20, 2014
```
than you really have.
```
  92b4de3c
- sinfo - Make sure if partition name is long and the default the last char · c4bd5ba8
  Danny Auble authored Mar 20, 2014
```
doesn't get chopped off.
```
  c4bd5ba8
19 Mar, 2014 2 commits
- Move the comment from 2.6.7 to 2.6.8 · 9950679b
  David Bigagli authored Mar 19, 2014
  
  9950679b
- Fixed sacct.1 and srun.1 manual pages which contain a hyphen where · e1c8e670
  Gennaro Oliva authored Mar 19, 2014
```
    a minus sign for options was intended.
```
  e1c8e670
18 Mar, 2014 4 commits
- Update the NEWS file. · 8c06a54b
  David Bigagli authored Mar 18, 2014
  
  8c06a54b
- Free job_ptr->state_desc wherever state_reason is set. · c2ae6cfc
  Danny Auble authored Mar 17, 2014
  
  c2ae6cfc
- Update last_job_update when a job's state_reason was modified. · b0cc7126
  Danny Auble authored Mar 17, 2014
```
Some of these were resulting in the state of a job not being updated
correctly to tools like sview.
```
  b0cc7126
- Fix issue where jobs still pending after a reservation would remain · 77555c30
  Danny Auble authored Mar 17, 2014
```
in waiting reason ReqNodeNotAvail.
```
  77555c30
17 Mar, 2014 4 commits
- Update sacctmgr man page documenting how to modify account's QOS. · d07172f8
  David Bigagli authored Mar 17, 2014
  
  d07172f8
- Update the scancel man page. · c8f7a291
  David Bigagli authored Mar 17, 2014
  
  c8f7a291
- When recovering node state if the Slurm version is 2.6 or 2.5 set the · ae46953f
  David Bigagli authored Mar 17, 2014
```
protocol version to be SLURM_2_5_PROTOCOL_VERSION which is the minimum
supported version.
```
  ae46953f
- CRAY/ALPS - Add support for CLE52 · a45170c2
  Danny Auble authored Mar 17, 2014
  
  a45170c2
16 Mar, 2014 3 commits

Export "SLURM*" env vars if --export=NONE · 9b4f3634

Morris Jette authored Mar 16, 2014

Previously if the sbatch --export=NONE option was used then several
Slurm environment variables were not propagated from the sbatch
command (SLURM_SUBMIT_DIR, SLURM_SUBMIT_HOST, SLURM_JOB_NAME, etc.)

9b4f3634
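
The behavior described in 9b4f3634 amounts to a filter on the submission environment. The sketch below is a simplified model, not sbatch's code: `build_job_env` is a hypothetical helper, and it only distinguishes `ALL` from `NONE`, ignoring the other forms the --export option accepts.

```python
def build_job_env(submit_env, export="ALL"):
    """Model which variables reach the job for --export=ALL or NONE (sketch)."""
    if export == "ALL":
        return dict(submit_env)
    if export == "NONE":
        # User variables are dropped, but names starting with "SLURM" are kept.
        return {k: v for k, v in submit_env.items() if k.startswith("SLURM")}
    raise ValueError("this sketch only models ALL and NONE")


env = {
    "PATH": "/usr/bin",
    "SLURM_SUBMIT_DIR": "/home/user/run",
    "SLURM_JOB_NAME": "test",
}
print(sorted(build_job_env(env, export="NONE")))
```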

schedule enhancement for reservation · 08f0f57c

Morris Jette authored Mar 16, 2014

Scheduler enhancements for reservations: When a job needs to run in
a reservation, but cannot due to busy resources, then do not block all jobs
in that partition from being scheduled, but only the jobs in that
reservation.

08f0f57c
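
The scheduling change in 08f0f57c can be pictured with a toy scheduling pass. Everything here is an assumption made for illustration (the job dictionaries, the `can_start` callback, and the rule that a failed non-reservation job blocks its partition); the actual scheduler logic in slurmctld is considerably more involved.

```python
def schedule(jobs, can_start):
    """Toy scheduling pass over jobs in priority order (sketch).

    A job that cannot start blocks later jobs in the same reservation if it
    requested one; otherwise it blocks later jobs in the same partition.
    """
    blocked_partitions = set()
    blocked_reservations = set()
    started = []
    for job in jobs:
        resv = job.get("reservation")
        if resv is not None:
            if resv in blocked_reservations:
                continue
        elif job["partition"] in blocked_partitions:
            continue
        if can_start(job):
            started.append(job["id"])
        elif resv is not None:
            blocked_reservations.add(resv)   # only this reservation is held back
        else:
            blocked_partitions.add(job["partition"])
    return started


queue = [
    {"id": 1, "partition": "debug", "reservation": "maint"},
    {"id": 2, "partition": "debug", "reservation": "maint"},
    {"id": 3, "partition": "debug"},
]
# Pretend the reservation's nodes are busy, so only job 3 can start.
print(schedule(queue, lambda job: job.get("reservation") is None))
```

In this model job 1 failing no longer prevents job 3, which shares the partition but not the reservation, from being considered.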

Reset node's CpuLoad more frequently · fae55cbe

Morris Jette authored Mar 16, 2014

Reset a node's CpuLoad value at least once each SlurmdTimeout seconds.
Previously the value would not be reset unless communications with the
slurmd did not happen for at least 1/3 of the SlurmdTimeout value.
That means nodes that were actively running and terminating jobs would
not get the CpuLoad value reset in a timely fashion. Added a CpuLoad
reset timer to prevent this.

fae55cbe
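
The timing argument in fae55cbe can be made concrete with two small predicates. Both functions and their parameter names are hypothetical; they only restate the rules in the commit message (the earlier "silent for a third of SlurmdTimeout" condition and the added CpuLoad reset timer), not the slurmctld code.

```python
def old_needs_refresh(now, last_contact, slurmd_timeout):
    # Earlier rule: refresh only after the node has been silent for at
    # least one third of SlurmdTimeout.
    return now - last_contact >= slurmd_timeout / 3


def new_needs_refresh(now, last_contact, last_cpu_load_update, slurmd_timeout):
    # Added rule: also refresh once the CpuLoad value itself is older than
    # SlurmdTimeout, however recently the node communicated.
    return (old_needs_refresh(now, last_contact, slurmd_timeout)
            or now - last_cpu_load_update >= slurmd_timeout)


# A busy node: it talked to the controller 5 seconds ago, but its CpuLoad
# value is 400 seconds old with SlurmdTimeout=300.
print(old_needs_refresh(1000, 995, 300))
print(new_needs_refresh(1000, 995, 600, 300))
```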

15 Mar, 2014 1 commit

retry slurm.conf file · 42081d87

Morris Jette authored Mar 15, 2014

Add logic to sleep and retry if slurm.conf can't be read.
Without this, the slurmd daemons may die and when the SlurmdTimeout
is reached, the nodes will be marked DOWN and their jobs will be
killed.
In the long term, it would be good to exit only if the read fails
on program startup, and have the daemons keep running with the old configuration
on reconfiguration, but I don't have time to do that work now.

42081d87
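
The sleep-and-retry approach from 42081d87 follows a common pattern, sketched here in Python. The function name, the retry count, and the delay are assumptions chosen for the example; the commit message does not state the values Slurm uses.

```python
import time


def read_config_with_retry(path, retries=5, delay=1.0, sleep=time.sleep):
    """Read a config file, sleeping and retrying on failure (sketch)."""
    for attempt in range(retries):
        try:
            with open(path) as handle:
                return handle.read()
        except OSError:
            if attempt == retries - 1:
                raise            # give up only after the last attempt
            sleep(delay)         # e.g. wait for a shared file system to return
```

A caller that wants the daemon to survive a transient outage would pick `retries` and `delay` so that the total wait stays well below SlurmdTimeout.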