- 23 Mar, 2016 16 commits
-
-
Tim Wickberg authored
1) Use mmap'd I/O.
2) Move buffer handling down to the individual compression routines. This will eventually allow them to read more than buffer_len at a time in order to emit a buffer_len message, and thus minimize the number of messages sent. (Current logic sends the same number of messages, but the payload for each can be compressed. I expect more performance gains are available by limiting the message count.)
3) Move buffer_len to a global, and change function signatures.
-
Danny Auble authored
-
Danny Auble authored
Move files from common to new dir location for bcast. This also sets up the 3 locations that need to be linked to compression libs.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
And switch CHUNK macro -> chunk within the function.
-
Tim Wickberg authored
Add orig_len to the data structure to avoid guessing the uncompressed size. Have zlib use this as well to avoid xrealloc() calls. (A future improvement to zlib would avoid use of the temporary buffer + memcpy calls.) Note that this handles each block independently. Stream mode would be better; switch to that in the future for additional performance gains.
-
Tim Wickberg authored
Also change sbcast info line to print type as int rather than as bool.
-
Tim Wickberg authored
Note one quirk - the SLURM_COMPRESS env var must be set to a type. Setting it to 'true' or '1' will not work, whereas previously that would have enabled zlib.
-
Tim Wickberg authored
If no type is given, the current default is zlib. 'lz4' and 'zlib' are the only two types currently supported. Print an error if an unknown type is given, but continue with compression disabled.
-
Tim Wickberg authored
Stub out lz4 functions, will need to be filled in later. sbcast/srun also need work to allow them to specify compression routine.
-
Brian Christiansen authored
-
Morris Jette authored
When a node's MCDRAM configuration changes, the amount of High Bandwidth Memory (HBM) changes also. HBM is treated like a GRES that can be allocated to jobs, so we update each node's GRES value when its MCDRAM configuration changes.
-
- 22 Mar, 2016 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Just in case some job fails to terminate as expected.
-
- 21 Mar, 2016 6 commits
-
-
Danny Auble authored
-
Danny Auble authored
# Conflicts:
#	src/plugins/burst_buffer/cray/burst_buffer_cray.c
-
Danny Auble authored
gang scheduling before doing code for gang scheduling.
-
Morris Jette authored
burst_buffer/cray: Set environment variables just before starting job rather than at job submission time to reflect persistent buffers created or modified while the job is pending. bug 2545
-
Danny Auble authored
buffer is found. Bug 2576. What happened was a function was doing a double read lock, which isn't awesome to begin with but not really horrible (if all you are doing is read locks anyway). The problem was that after the first lock was taken, a different thread went for a write lock, so when the second read lock came in it created a deadlock.
-
Tim Wickberg authored
Coverity 77851.
-
- 18 Mar, 2016 4 commits
-
-
Morris Jette authored
Jobs below the specified threshold will not have resources reserved for them. bug 2565
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
Avoid possibly aborting an srun that gets simultaneous SIGSTOP+SIGCONT while creating the job step. The result is that the signal handler gets an argument (the signal received) of zero. Here's a log:

Window 1:
$ srun hostname
srun: Job step creation temporarily disabled, retrying
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 18
srun: I Got signal 0
srun: Cancelled pending job step

Window 2:
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
$ kill -STOP 18696 ; kill -CONT 18696
....

bug 2494
-
- 17 Mar, 2016 5 commits
-
-
Morris Jette authored
Copy logic from select/cons_res to select/serial that is equivalent to commit ec50cb2f
-
Morris Jette authored
Change how a node's allocated CPU count is calculated to avoid double counting CPUs allocated to multiple jobs at the same time. Previous logic would sum the maximum number of CPUs allocated by each partition for any time slice, which could double count CPUs allocated to multiple jobs. New logic ORs bitmap of allocated CPUs for every partition and time slice, then counts the total for a given node. This avoids double counting CPUs allocated to multiple jobs, but does not remove from the count CPUs which have been allocated to jobs which might be suspended by the gang scheduler (either for time slicing or preemption).
-
Tim Wickberg authored
-
Tim Wickberg authored
Update NEWS as well.
-
Tim Wickberg authored
The uid is used as part of the hash function, so we must remove the old reference and recalculate if it may change; otherwise _delete_assoc_hash will not find it again when the association is removed, causing slurmctld to segfault. Bug 2560.
-
- 16 Mar, 2016 7 commits
-
-
Morris Jette authored
Add --gres-flags=enforce-binding option to the salloc, sbatch and srun commands. If set, the only CPUs available to the job will be those bound to the selected GRES (i.e. the CPUs identified in the gres.conf file will be strictly enforced rather than advisory). bug 1725
-
Tim Wickberg authored
-
Morris Jette authored
-
Morris Jette authored
Previous gang scheduling logic maintained information about resources originally allocated to the job and made scheduling decisions on that basis. bug 2494
-
Morris Jette authored
This will improve ability to diagnose problems if the srun is killed by a signal.
-
Morris Jette authored
Update gang scheduling table when job manually suspended or resumed. Prior logic could mess up job suspend/resume sequencing. bug 2494
-
Danny Auble authored
-