- 03 Jun, 2013 5 commits
-
-
Danny Auble authored
-
David Bigagli authored
-
Nathan Yee authored
test1.70: Validates that srun standard input and output work with binary files.
test1.71: Validates that srun exit code matches that of a test program.
-
Morris Jette authored
-
Hongjia Cao authored
We're having some trouble getting our Slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, BLCR is 0.8.5.

I'm submitting the job using sbatch:

  $ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

  $ scontrol checkpoint create 137
  .... time passes ....
  $ scontrol vacate 137

At that point, I see the checkpoint file from BLCR in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

  $ scontrol checkpoint restart 137
  scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

  [2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
  [2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
  [2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
  [2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 31 May, 2013 11 commits
-
-
jette authored
-
Danny Auble authored
-
Danny Auble authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Rename slurm_step_ctx_params_t field from "mem_per_cpu" to "pn_min_memory". A job step now accepts its memory specification on either a per-CPU or per-node basis.
-
Danny Auble authored
-
Martin Perry authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 30 May, 2013 17 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Yiannis Georgiou authored
-
jette authored
-
jette authored
-
Morris Jette authored
-
Morris Jette authored
Uninitialized variables resulted in the error "cons_res: sync loop not progressing, holding job #".
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Calls only took ~1500 usec to complete. Since this is out of band, this shouldn't be that big of a deal.
-
Danny Auble authored
-
David Bigagli authored
-
- 29 May, 2013 5 commits
-
-
Nathan Yee authored
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
Nathan Yee authored
-
- 28 May, 2013 2 commits
-
-
Morris Jette authored
If node_name2bitmap() is called with best_effort=false, then do not attempt to match names with NodeHostName. Without this change, a partition that contains a NodeHostName rather than a NodeName would be configured with the first one found. On a front-end system, this would result in the partition's node_bitmap being out of sync with the actual node positions.

To reproduce the problem, configure with --enable-multiple-slurmd, then in slurm.conf define something like this:

  NodeName=foo[1-8] NodeHostName=bar ...
  PartitionName=debug Nodes=bar,foo[1-8] ...
-
Danny Auble authored
-