- 07 Jul, 2015 3 commits
-
-
Danny Auble authored
Conflicts: src/common/slurm_step_layout.c
-
Morris Jette authored
Correct task layout with CR_Pack_Node option and more than 1 CPU per task. Previous logic would place one task per CPU and launch too few tasks. bug 1781
-
Danny Auble authored
-
- 06 Jul, 2015 5 commits
-
-
Nathan Yee authored
-
Nathan Yee authored
-
Morris Jette authored
Backfill scheduler now considers OverTimeLimit and KillWait configuration parameters to estimate when running jobs will exit. Initially the job's end time is estimated based upon its time limit. After the time limit is reached, the end time estimate is based upon the OverTimeLimit and KillWait configuration parameters. bug 1774
-
Morris Jette authored
Backfill scheduler: The configured backfill_interval value (default 30 seconds) is now interpreted as a maximum run time for the backfill scheduler. Once reached, the scheduler will build a new job queue and start over, even if not all jobs have been tested. bug 1774
-
Jason Coverston authored
of the size returned. This check is redundant though since getgrouplist will return -1 if it tries to use more than ngroups_max. Perhaps we should just take the check out.
-
- 03 Jul, 2015 1 commit
-
-
Morris Jette authored
-
- 02 Jul, 2015 18 commits
-
-
Morris Jette authored
No change in logic
-
Morris Jette authored
Original patch from LLNL assumed Munge was installed, which would result in a build error if the Munge development package was not installed
-
Adam Moody authored
-
Adam Moody authored
-
Adam Moody authored
-
Adam Moody authored
-
Don Lipari authored
If a user runs sreport and specifies a time period that stretches into the future, this patch sets the end of the period to the current time.
-
Mark A. Grondona authored
In order to successfully remove the freezer cgroup at the end of a job step, the slurmstepd process itself must first be moved outside of the cgroup, or removal will always fail. This fix moves the slurmstepd back to the root cgroup just before the rmdir operations are attempted.
-
Morris Jette authored
-
Mark A. Grondona authored
If the job prolog is running we can't send a signal to job step tasks, so return SLURM_FAILURE instead of ESLURM_INVALID_JOB_ID. This should cause the caller to retry, instead of assuming the job step is not running on the node.
-
Mark A. Grondona authored
If a job step request comes in while the slurm prolog is running, slurmd will happily launch the job step. This means that a user could run code before the prolog is complete, which could cause strange errors or in some cases a security issue. Instead return an error (EINPROGRESS for now) and do not allow job steps to run during prolog.
-
Mark A. Grondona authored
Create a new function, _prolog_is_running(), to determine if a job prolog is currently running from within slurmd, and use this new function in place of list_find_first in _wait_for_job_running_prolog.
-
Mark A. Grondona authored
Some cgroup code would retry continuously if somehow a cgroup didn't exist when xcgroup_delete() was called (see for instance proctrack/cgroup's slurm_container_plugin_wait()). To fix this for all callers, return SUCCESS from xcgroup_delete() if rmdir(2) returns ENOENT, since the cgroup is already deleted.
-
Nathan Yee authored
Test of gres.conf configuration options, test 913
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Add association usage information to "scontrol show cache" command output.
-
- 01 Jul, 2015 8 commits
-
-
Dorian Krause authored
Dear all, we noticed that debugX() and error() calls in srun mess up the output if the --pty flag is used. The reason for this behavior is that srun sets the terminal in raw mode and disables output processing. This is fine for the output forwarded from the remote daemon, but not for local printf()s by the srun process. This problem can be circumvented by re-enabling OPOST after the cfmakeraw() call in srun(). A patch is pasted at the bottom of the mail. It works nicely on Linux, but I am not 100% sure it may not have some undesirable side effects on other platforms. Best regards, Dorian

Example with current HEAD:
[snip]
srun: jobid 10: nodes(1):`node1', cpu counts: 1(x1)
srun: launching 10.0 on host node1, 1 tasks: 0
srun: route default plugin loaded
srun: Node node1, 1 tasks started
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: node1: task 0: Exited with exit code 1

Example with the patch applied:
[snip]
srun: jobid 11: nodes(1):`node1', cpu counts: 1(x1)
srun: launching 11.0 on host node1, 1 tasks: 0
srun: route default plugin loaded
srun: Node node1, 1 tasks started
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: node1: task 0: Exited with exit code 1
-
Morris Jette authored
Add job, step and reservation TRES information to sview command
-
Morris Jette authored
-
Morris Jette authored
Prevent test failure if the compute node does not permit user control over CPU frequency (no "userspace" governor).
-
Morris Jette authored
-
Brian Christiansen authored
When submitting a job with srun -n#, the job may be allocated more CPUs than # because the job was given a whole core or socket (e.g. CR_CORE, CR_SOCKET). sacct showed only what the step used, not the allocation. This commit shows both the job and the step if job and step CPU counts differ.
-
Thomas Cadeau authored
man page.
-
Morris Jette authored
Major re-write of the sreport command to support --tres job option and permit users to select specific trackable resources to generate reports for. For most reports, each TRES is listed on a separate line of output with its name. The default TRES type is "cpu" to minimize changes to output.
-
- 30 Jun, 2015 5 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-