- 27 Mar, 2014 1 commit
-
-
Franco Broi authored
Add support for job std_in, std_out and std_err fields in Perl API.
-
- 26 Mar, 2014 2 commits
-
-
Morris Jette authored
-
David Bigagli authored
processes.
-
- 25 Mar, 2014 2 commits
-
-
Morris Jette authored
Modify hostlist expressions to accept more than two numeric ranges (e.g. "row[1-3]rack[0-8]slot[0-63]")
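The expansion of such an expression is purely combinatorial. Below is a minimal sketch, not Slurm's actual hostlist code, of how an expression with several bracketed numeric ranges can be expanded recursively; all function and variable names are illustrative only.
```c
#include <stdio.h>
#include <string.h>

/* Expand the first "[lo-hi]" range found in expr, recursing on the rest. */
static void expand(const char *expr, const char *prefix)
{
    const char *lb = strchr(expr, '[');
    if (lb == NULL) {                     /* no more ranges: emit the name */
        printf("%s%s\n", prefix, expr);
        return;
    }
    const char *rb = strchr(lb, ']');
    long lo, hi;
    if (rb == NULL || sscanf(lb + 1, "%ld-%ld", &lo, &hi) != 2) {
        fprintf(stderr, "malformed expression: %s\n", expr);
        return;
    }
    for (long i = lo; i <= hi; i++) {
        char next_prefix[256];
        snprintf(next_prefix, sizeof(next_prefix), "%s%.*s%ld",
                 prefix, (int)(lb - expr), expr, i);
        expand(rb + 1, next_prefix);      /* handle any remaining ranges */
    }
}

int main(void)
{
    expand("row[1-2]rack[0-1]slot[0-3]", "");   /* 2 * 2 * 4 = 16 names */
    return 0;
}
```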
-
Danny Auble authored
-
- 24 Mar, 2014 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
Previous logic would typically do a list search to find job array elements. This commit adds two hash tables for job arrays. The first is based upon the "base" job ID, which is common to all tasks. The second hash table is based upon the sum of the "base" job ID plus the task ID in the array. This will substantially improve performance for handling dependencies with job arrays.
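A rough sketch of the two-hash-table idea, using hypothetical structure and function names rather than slurmctld's actual code: one table is keyed on the base job ID shared by all tasks of an array, the other on the base ID plus the task ID, so a single array element can be found without scanning the job list.
```c
#include <stdio.h>
#include <stdlib.h>

#define HASH_BUCKETS 1021

struct job_record {
    unsigned int base_job_id;        /* "base" id shared by all array tasks */
    unsigned int task_id;            /* index of this element in the array  */
    struct job_record *next_base;    /* chain in the base-id table          */
    struct job_record *next_task;    /* chain in the base-plus-task table   */
};

static struct job_record *base_hash[HASH_BUCKETS]; /* keyed on base job id       */
static struct job_record *task_hash[HASH_BUCKETS]; /* keyed on base id + task id */

static void job_hash_add(struct job_record *job)
{
    unsigned int b = job->base_job_id % HASH_BUCKETS;
    unsigned int t = (job->base_job_id + job->task_id) % HASH_BUCKETS;
    job->next_base = base_hash[b];
    base_hash[b] = job;
    job->next_task = task_hash[t];
    task_hash[t] = job;
}

/* Look up one element of an array directly instead of scanning the job list. */
static struct job_record *find_array_task(unsigned int base, unsigned int task)
{
    unsigned int t = (base + task) % HASH_BUCKETS;
    for (struct job_record *j = task_hash[t]; j; j = j->next_task)
        if (j->base_job_id == base && j->task_id == task)
            return j;
    return NULL;
}

int main(void)
{
    for (unsigned int task = 0; task < 100; task++) {
        struct job_record *j = calloc(1, sizeof(*j));
        j->base_job_id = 51;
        j->task_id = task;
        job_hash_add(j);
    }
    struct job_record *hit = find_array_task(51, 42);
    if (hit)
        printf("found task %u of array job %u\n", hit->task_id, hit->base_job_id);
    return 0;
}
```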
-
Morris Jette authored
When slurmctld restarted, it would not recover dependencies on job array elements and would just discard the dependency. This corrects the parsing problem to recover the dependency. The old code would print a message like this and discard it: slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*
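For reference, the dependency string in that error has three parts. A small standalone sketch (illustrative only, not slurmctld's parser) of splitting "afterany:47_*" into the dependency type, the base job ID, and the all-tasks wildcard:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *spec = "afterany:47_*";

    char type[32];
    const char *colon = strchr(spec, ':');
    if (colon == NULL || (size_t)(colon - spec) >= sizeof(type)) {
        fprintf(stderr, "malformed dependency: %s\n", spec);
        return 1;
    }
    snprintf(type, sizeof(type), "%.*s", (int)(colon - spec), spec);

    char *end = NULL;
    unsigned long job_id = strtoul(colon + 1, &end, 10);

    /* "_*" after the job id means the dependency covers all array tasks,
     * so it must not be discarded just because it is not a plain job id. */
    int all_tasks = (end != NULL && strcmp(end, "_*") == 0);

    printf("type=%s job_id=%lu all_array_tasks=%d\n", type, job_id, all_tasks);
    return 0;
}
```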
-
- 22 Mar, 2014 1 commit
-
-
Morris Jette authored
When adding or removing columns for most data types (jobs, partitions, nodes, etc.), an abort is generated on some system types. This appears to be because when the displayed columns change, the address of "model" changes on some systems, while on others it does not (like my laptops). This fix explicitly sets last_model to NULL when the columns are changed rather than relying upon the data structure's address to change.
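A generic sketch of that kind of fix, with made-up names rather than sview's real GTK code: clear the cached pointer explicitly when the column layout changes instead of relying on the rebuilt object getting a new address.
```c
#include <stdio.h>
#include <stddef.h>

struct model { int n_columns; };

static struct model *last_model = NULL;   /* cached pointer to the current model */

static void columns_changed(void)
{
    /* The rebuilt model may land at the very same address, so do not rely on
     * a pointer comparison alone: clear the cache explicitly. */
    last_model = NULL;
}

static void display(struct model *m)
{
    if (m != last_model) {
        printf("rebuilding view for %d columns\n", m->n_columns);
        last_model = m;
    }
}

int main(void)
{
    struct model m = { 5 };
    display(&m);              /* builds the view */
    columns_changed();
    m.n_columns = 6;
    display(&m);              /* rebuilds even though the address is unchanged */
    return 0;
}
```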
-
- 21 Mar, 2014 4 commits
-
-
Danny Auble authored
be set up for 1 node jobs. Here are some of the reasons from IBM:
1. PE expects it.
2. For failover: if there was some challenge or difficulty with the shared-memory method of data transfer, the protocol stack might want to go through the adapter instead.
3. For flexibility: the protocol stack might want to be able to transfer data using some variable combination of shared memory and adapter-based communication.
4. Possibly most important, for overall performance: bandwidth or efficiency (BW per CPU cycle) might be better using the adapter resources. An obvious case is large messages, where it might require far fewer CPU cycles to program the DMA engines on the adapter to move data between tasks rather than depend on the CPU to move the data with loads and stores or page re-mapping -- and a DMA engine might actually move the data more quickly if it is well integrated with the memory system, as it is in the P775 case.
-
Morris Jette authored
If srun is invoked with the --multi-prog option but no task count, then use the task count provided in the MPMD configuration file.
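A small sketch, not srun's code, of how a task count could be derived from an MPMD configuration: scan the task-rank field of each line and use the highest rank plus one. The parsing is simplified and the example lines are made up.
```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Example contents of a --multi-prog configuration file. */
    const char *lines[] = {
        "0-3   ./worker",
        "4,5   ./io_server",
        "6     ./monitor",
    };
    long max_rank = -1;

    for (size_t i = 0; i < sizeof(lines) / sizeof(lines[0]); i++) {
        char ranks[64];
        if (sscanf(lines[i], "%63s", ranks) != 1)
            continue;
        /* The rank field may be a comma-separated list of numbers and ranges. */
        for (char *tok = strtok(ranks, ","); tok; tok = strtok(NULL, ",")) {
            long lo, hi;
            if (sscanf(tok, "%ld-%ld", &lo, &hi) == 2) {
                if (hi > max_rank) max_rank = hi;
            } else if (sscanf(tok, "%ld", &lo) == 1) {
                if (lo > max_rank) max_rank = lo;
            }
        }
    }
    printf("implied task count: %ld\n", max_rank + 1);   /* prints 7 */
    return 0;
}
```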
-
Morris Jette authored
-
David Bigagli authored
-
- 20 Mar, 2014 3 commits
-
-
Hongjia Cao authored
performance.
-
Danny Auble authored
than you really have.
-
Danny Auble authored
doesn't get chopped off.
-
- 19 Mar, 2014 2 commits
-
-
David Bigagli authored
-
Gennaro Oliva authored
a minus sign for options was intended.
-
- 18 Mar, 2014 4 commits
-
-
David Bigagli authored
-
Danny Auble authored
-
Danny Auble authored
Some of these were resulting in the state of a job not being updated correctly for tools like sview.
-
Danny Auble authored
in waiting reason ReqNodeNotAvail.
-
- 17 Mar, 2014 4 commits
-
-
David Bigagli authored
-
David Bigagli authored
-
David Bigagli authored
protocol version to be SLURM_2_5_PROTOCOL_VERSION, which is the minimum supported version.
-
Danny Auble authored
-
- 16 Mar, 2014 3 commits
-
-
Morris Jette authored
Previously if the sbatch --export=NONE option was used then several Slurm environment variables were not propagated from the sbatch command (SLURM_SUBMIT_DIR, SLURM_SUBMIT_HOST, SLURM_JOB_NAME, etc.)
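An illustrative sketch of the behavior described, assuming a Linux/glibc environment (clearenv is a glibc extension): the user's environment is dropped, but the job-related variables named in the commit are still set for the batch script. Everything other than the variable names is made up.
```c
#define _DEFAULT_SOURCE           /* for clearenv() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
    int export_user_env = 0;      /* stands in for --export=NONE */
    char cwd[PATH_MAX];
    char host[256];

    if (!export_user_env)
        clearenv();               /* drop the user's environment */

    /* These job-related variables are still propagated to the script. */
    if (getcwd(cwd, sizeof(cwd)))
        setenv("SLURM_SUBMIT_DIR", cwd, 1);
    if (gethostname(host, sizeof(host)) == 0)
        setenv("SLURM_SUBMIT_HOST", host, 1);
    setenv("SLURM_JOB_NAME", "example_job", 1);   /* job name is made up */

    printf("SLURM_SUBMIT_DIR=%s\n", getenv("SLURM_SUBMIT_DIR"));
    return 0;
}
```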
-
Morris Jette authored
Scheduler enhancements for reservations: When a job needs to run in a reservation but cannot due to busy resources, do not block all jobs in that partition from being scheduled; only block the jobs in that reservation.
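A condensed sketch of that change in scheduling behavior, using made-up structures rather than the real slurmctld code: when a reservation-bound job cannot start, only later jobs in that same reservation are skipped, while other jobs in the partition are still considered.
```c
#include <stdio.h>
#include <string.h>

struct job {
    int job_id;
    const char *partition;
    const char *reservation;   /* NULL if the job does not use one */
};

static int resources_busy(const struct job *j)
{
    return j->reservation != NULL;   /* pretend the reserved nodes are busy */
}

int main(void)
{
    struct job queue[] = {
        { 101, "batch", "maint" },
        { 102, "batch", NULL    },   /* must still be considered */
        { 103, "batch", "maint" },   /* skipped: its reservation is blocked */
    };
    const char *blocked_resv = NULL;

    for (size_t i = 0; i < sizeof(queue) / sizeof(queue[0]); i++) {
        struct job *j = &queue[i];
        if (j->reservation && blocked_resv &&
            strcmp(j->reservation, blocked_resv) == 0) {
            printf("job %d: skipped, reservation %s is blocked\n",
                   j->job_id, j->reservation);
            continue;
        }
        if (resources_busy(j)) {
            /* Old behavior: stop scheduling the whole partition here.
             * New behavior: only block later jobs in the same reservation. */
            blocked_resv = j->reservation;
            printf("job %d: cannot start yet\n", j->job_id);
            continue;
        }
        printf("job %d: scheduled\n", j->job_id);
    }
    return 0;
}
```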
-
Morris Jette authored
Reset a node's CpuLoad value at least once every SlurmdTimeout seconds. Previously the value would not be reset unless communications with the slurmd did not happen for at least 1/3 of the SlurmdTimeout value. That meant nodes that were actively running and terminating jobs would not get the CpuLoad value reset in a timely fashion. Added a CpuLoad reset timer to prevent this.
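A toy sketch of the timer logic, with illustrative names and values: clear the cached load whenever it is older than the timeout, regardless of how recently the node last communicated.
```c
#include <stdio.h>
#include <time.h>

#define SLURMD_TIMEOUT 300        /* seconds, stands in for SlurmdTimeout */

struct node {
    unsigned int cpu_load;        /* last reported load */
    time_t cpu_load_time;         /* when cpu_load was last updated or reset */
};

static void check_cpu_load(struct node *n, time_t now)
{
    if (now - n->cpu_load_time >= SLURMD_TIMEOUT) {
        n->cpu_load = 0;          /* value is stale, clear it */
        n->cpu_load_time = now;
    }
}

int main(void)
{
    struct node n = { .cpu_load = 250, .cpu_load_time = 0 };
    check_cpu_load(&n, 301);      /* more than SLURMD_TIMEOUT later: reset */
    printf("cpu_load=%u\n", n.cpu_load);
    return 0;
}
```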
-
- 15 Mar, 2014 3 commits
-
-
Morris Jette authored
Add logic to sleep and retry if slurm.conf can't be read. Without this, the slurmd daemons may die and, when the SlurmdTimeout is reached, the nodes will be marked DOWN and their jobs will be killed. In the long term, it would be good to exit only if the read fails on program startup and have the daemons keep running with the old configuration on reconfiguration, but I don't have time to do that work now.
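A minimal sketch of the sleep-and-retry idea, assuming a typical configuration path and an arbitrary retry interval:
```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    FILE *fp;

    /* Keep trying rather than exiting and letting the node be marked DOWN. */
    while ((fp = fopen("/etc/slurm/slurm.conf", "r")) == NULL) {
        perror("slurm.conf unreadable, retrying in 10 seconds");
        sleep(10);
    }
    /* parse the configuration here ... */
    fclose(fp);
    return 0;
}
```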
-
Morris Jette authored
Fix invalid memory reference if a script returns an error message for the user. Previous code failed to set a static variable to NULL, resulting in an xfree of memory previously freed elsewhere.
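A generic sketch of the underlying pattern, with made-up names: a static pointer is freed but left dangling, so a later free would hit memory already released elsewhere. Freeing through a macro that also clears the pointer (similar in spirit to Slurm's xfree) avoids the double free.
```c
#include <stdlib.h>
#include <string.h>

/* Free *p and set it to NULL so a second call is harmless. */
#define xfree(p) do { free(p); (p) = NULL; } while (0)

static char *err_msg = NULL;      /* static buffer reused across calls */

static void set_error(const char *text)
{
    xfree(err_msg);               /* safe even if err_msg was already freed */
    err_msg = strdup(text);
}

int main(void)
{
    set_error("first failure");
    set_error("second failure");
    xfree(err_msg);
    xfree(err_msg);               /* no-op instead of a double free */
    return 0;
}
```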
-
Morris Jette authored
Add support for job array options in the qsub command, in #PBS options for sbatch scripts and set the appropriate environment variables in the spank_pbs plugin (PBS_ARRAY_ID and PBS_ARRAY_INDEX). Note that Torque uses the "-t" option and PBS Pro uses the "-J" option.
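For illustration, the mapping might look roughly like the sketch below. The option names (-t, -J), --array, PBS_ARRAY_ID, and PBS_ARRAY_INDEX come from the commit message; deriving the PBS names from SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID is an assumption here, and the code is a stand-in rather than the real qsub wrapper or spank_pbs plugin.
```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Translate the Torque "-t" / PBS Pro "-J" request into the sbatch option. */
    const char *pbs_array_spec = "0-9";               /* e.g. from "-t 0-9" */
    char sbatch_opt[64];
    snprintf(sbatch_opt, sizeof(sbatch_opt), "--array=%s", pbs_array_spec);
    printf("sbatch %s job.sh\n", sbatch_opt);

    /* Inside the job, expose PBS-style names based on the Slurm ones
     * (assumed mapping, for illustration only). */
    const char *task = getenv("SLURM_ARRAY_TASK_ID");
    const char *job  = getenv("SLURM_ARRAY_JOB_ID");
    if (task)
        setenv("PBS_ARRAY_INDEX", task, 1);
    if (job)
        setenv("PBS_ARRAY_ID", job, 1);
    return 0;
}
```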
-
- 14 Mar, 2014 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
slurm.conf. Rebooting daemons after adding nodes to the slurm.conf is highly recommended.
-
- 13 Mar, 2014 3 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
Add a job flag to indicate when the EpilogSlurmctld is running and don't purge the job record until it completes. This lets the EpilogSlurmctld requeue the job and otherwise manage it. Bugs 635 and 636.
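A condensed sketch of that mechanism with an invented flag name and helper: the job record carries a flag while EpilogSlurmctld runs, and the purge check refuses to remove the record until the flag clears.
```c
#include <stdio.h>
#include <stdbool.h>

#define JOB_EPILOG_RUNNING 0x0001     /* hypothetical flag bit */

struct job_record {
    int job_id;
    unsigned int flags;
};

static bool purge_job_ok(const struct job_record *job)
{
    /* Keep the record while the slurmctld epilog may still requeue it. */
    return (job->flags & JOB_EPILOG_RUNNING) == 0;
}

int main(void)
{
    struct job_record job = { 636, 0 };

    job.flags |= JOB_EPILOG_RUNNING;      /* EpilogSlurmctld launched */
    printf("purge allowed: %d\n", purge_job_ok(&job));   /* 0 */

    job.flags &= ~JOB_EPILOG_RUNNING;     /* epilog completed */
    printf("purge allowed: %d\n", purge_job_ok(&job));   /* 1 */
    return 0;
}
```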
-