Commits · 0cda6dafcd27ebe2c55d2ab9cf7cf4d262d0fc46 · Manuel G. Marciani / ces_slurm_simulator

26 Mar, 2016 3 commits
- Harden test for node names without numeric suffix · 0cda6daf
  Morris Jette authored Mar 25, 2016
  
  0cda6daf
- Merge branch 'slurm-15.08' · af7a6fd3
  Morris Jette authored Mar 25, 2016
  
  af7a6fd3
- Revert commit efa83a02 · c1dde86c
  Morris Jette authored Mar 25, 2016
```
The previous commit obviously fixed a problem, but introduced a different
set of problems. This will be pursued later, perhaps in version 16.05.
```
  c1dde86c
25 Mar, 2016 6 commits
- Revert commit 6c14b969 · f5920b77
  Morris Jette authored Mar 25, 2016
```
With some configurations and systems, errors of the following sort were
occurring:
task/cgroup: task[1] infinite loop broken while trying to provision compute elements using block
task/cgroup: task[1] unable to set taskset '0x0'
```
  f5920b77
- Add "sacctmgr lost jobs" to report orphaned jobs on cluster. · 2dd920b9
  Nathan Yee authored Mar 25, 2016
```
Bug 1706
```
  2dd920b9
- Add OR constraint option test · b3dee298
  Morris Jette authored Mar 25, 2016
  
  b3dee298
- Add test of exclusive OR constraints · 59899e2d
  Nathan Yee authored Mar 25, 2016
```
bug 2070
```
  59899e2d
- Merge branch 'slurm-15.08' · bd39fef3
  Morris Jette authored Mar 25, 2016
  
  bd39fef3
- burst_buffer/cray - pre-run fail fix · 5a48207e
  Morris Jette authored Mar 25, 2016
```
burst_buffer/cray - If the pre-run operation fails then don't issue
    duplicate job cancel/requeue unless the job is still in run state. Prevents
    jobs hung in COMPLETING state.
bug 2587
```
  5a48207e
24 Mar, 2016 4 commits

Select/cray - Log NHC run time on "scontrol reconfig" · 58627d02

Morris Jette authored Mar 24, 2016

Running "scontrol reconfig" releases resources for jobs waiting for
  the completion of Node Health Check so that other jobs can run.
  Cray says to always wait for NHC to complete, but in extreme
  cases that can be 2 hours, during which the entire resource
  allocation for a job may be unusable. Per advice from NERSC,
  the logic to release resources is unchanged, but logging is
  added here.

58627d02

Remove Rgt from the association output of scontrol show assoc_mgr. Rgt · ddfd2781
Danny Auble authored Mar 24, 2016
```
isn't kept up to date in the cache.
```
ddfd2781
Fix incorrect info from commit ce6d3717 · 60979f52
Danny Auble authored Mar 24, 2016

60979f52
Add new expect get_curr_line_num so you can print out the script line · ce6d3717
Danny Auble authored Mar 24, 2016
```
as well.
```
ce6d3717

23 Mar, 2016 27 commits

Merge branch 'slurm-15.08' · 3028cfea
Morris Jette authored Mar 23, 2016
```
Conflicts:
	src/plugins/select/cons_res/job_test.c
```
3028cfea

gang scheduling bug fix · 5f1e78f6

Morris Jette authored Mar 23, 2016

Fix gang scheduling resource selection bug which could prevent multiple jobs
    from being allocated the same resources. Bug was introduced in 15.08.6,
    commit 44f491b8

5f1e78f6

Fix leak of out_buf when file length is zero. · 498624df
Tim Wickberg authored Mar 23, 2016

498624df

Cleanup Coverity errors from file_bcast work. · 54c9ac31

Tim Wickberg authored Mar 23, 2016

Also ensure empty (0-length) files are handled properly.
Remove a stray exit(1) call from _rpc_file_bcast() to avoid
slurmd exiting on malformed data.

54c9ac31

Propagate use of functions added in commits 2ed5c7fb and e7f12058 · 10824898
Danny Auble authored Mar 23, 2016

10824898

task/cgroup: Fix for task binding anomaly · efa83a02

Morris Jette authored Mar 23, 2016

Here's how to reproduce on smd-server with 2 sockets, 6 cores per
socket and 2 threads per core, just run the following command line
3 times in quick succession (all active at the same time):
srun --cpus-per-task=4 -m block sleep 30
What was happening is the first job would be allocated cores 0+1
The second job would be allocated cores 2+3
The third job would test use of cores 0-3 then exit because the
 job only needs 4 CPUs. The resulting core binding would include
 NO CPUs. The new logic tests that the core being considered for
 use actually has some resources available to the job before
 updating the counter which is being tested against the needed
 CPU counter.

efa83a02
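
The counting rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the task/cgroup source: the function `select_cores`, the `avail_cpus` array and the `selected` array are hypothetical names, and the real plugin works on hwloc objects and bitmaps. Note that this commit was reverted later in this log (c1dde86c), so the sketch only illustrates the idea stated in the message.

```c
#include <stddef.h>

/*
 * Hypothetical sketch, not the task/cgroup source.
 * avail_cpus[i] is the number of CPUs (threads) on core i usable by this job.
 * Marks the chosen cores in selected[] and returns the CPU count gathered.
 */
static unsigned select_cores(const unsigned *avail_cpus, size_t ncores,
                             unsigned needed_cpus, int *selected)
{
    unsigned have = 0;
    size_t i;

    for (i = 0; i < ncores; i++)
        selected[i] = 0;

    for (i = 0; i < ncores && have < needed_cpus; i++) {
        /* A core with nothing available must not advance the counter. */
        if (avail_cpus[i] == 0)
            continue;
        selected[i] = 1;
        have += avail_cpus[i];
    }
    return have;
}
```

With cores 0-3 already held by the first two jobs (two threads per core), a third job needing 4 CPUs would skip those cores and be bound to cores 4 and 5 instead of ending up with an empty binding.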

task/cgroup: Fix for task layout logic when resources are disabled. · 6c14b969

Morris Jette authored Mar 23, 2016

Specifically add the HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM flag when
loading configuration from HWLOC library. Previous logic in
task/cgroup did not do this, which was different behaviour from
how slurmd gets configuration information. Here's the HWLOC
documentation:
HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM
Detect the whole system, ignore reservations and offline settings.
Gather all resources, even if some were disabled by the administrator.
For instance, ignore Linux Cpusets and gather all processors and memory
nodes, and ignore the fact that some resources may be offline.

Without this flag, I was rarely observing a bad core count, which
resulted in the logic laying out tasks incorrectly and generating an error:
task/cgroup: task[0] infinite loop broken while trying to provision compute elements using cyclic

bug 2502

6c14b969

Continuation of commit 2ed5c7fb to work for accounts · e7f12058
Danny Auble authored Mar 23, 2016

e7f12058
Merge remote-tracking branch 'origin/slurm-15.08' · f1ef24e6
Danny Auble authored Mar 23, 2016

f1ef24e6

Revert "Fix expect tests that expect -lz for compilation." · 04c395f5

Tim Wickberg authored Mar 23, 2016

With bcast split into its own directory -lz should not be
required throughout.

This reverts commit e7981406.

04c395f5

Send file_size across as part of the RPC, will be needed for mmap. · ee826b96
Tim Wickberg authored Mar 23, 2016
```
Remove unused struct and macro from file_bcast.h.

Free file_bcast_info_t to prevent leak.
```
ee826b96
Cleanup new fields added to the assoc_mgr cache. · 7e2e8f88
Danny Auble authored Mar 22, 2016

7e2e8f88
Simplify test since all the lines were the same. · 2ab694cb
Danny Auble authored Mar 22, 2016

2ab694cb

Restructure file_bcast mechanism to fork only on first block. · 7bac612c

Tim Wickberg authored Mar 21, 2016

1) Add a new global file_bcast_list to store info on in-progress file
   transfers, cache FD there rather than reopening the file for every block.
2) Restructure security mechanisms. First block will fork() and open the
   file, and pass the FD back to the thread. Thread then registers this file
   transfer in the file_bcast_list. Split fork() stuff into
   _file_bcast_register_file to keep _rpc_file_bcast readable.
3) Successive blocks are handled within the thread. Security is handled by
   matching uid and file name to existing file transfer.

TODO:

1) Write transfer cleanup function to remove stalled transfers.
2) Use mmap for file output.
3) Allow for parallel block transfer. Current code assumes blocks will always
   arrive in order. Out of order blocks will result in corrupted output.
   (sbcast currently prevents this by requiring each message to be ack'd
   before continuing, but at a likely severe performance penalty.)
4) Add stats on receive side.

7bac612c
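
The uid and file name matching described in points 1 and 3 above might look something like the sketch below. Everything here is assumed for illustration: the names `transfer_register` and `transfer_lookup_fd`, the fixed-size table, and the size limits are hypothetical, and the actual `file_bcast_list` is a Slurm list whose layout is not shown in this log.

```c
#include <string.h>
#include <sys/types.h>

#define MAX_TRANSFERS 16   /* illustrative limit, not from Slurm */
#define MAX_FNAME 256      /* illustrative limit, not from Slurm */

/* Hypothetical record for one in-progress file transfer. */
struct transfer {
    int in_use;
    uid_t uid;
    char fname[MAX_FNAME];
    int fd;                /* descriptor opened once, on the first block */
};

/* Not thread-safe as written; real RPC threads would need a lock. */
static struct transfer transfers[MAX_TRANSFERS];

/* Record a transfer after the first block opened the file. 0 on success. */
static int transfer_register(uid_t uid, const char *fname, int fd)
{
    size_t len = strlen(fname);
    int i;

    if (len >= MAX_FNAME)
        return -1;
    for (i = 0; i < MAX_TRANSFERS; i++) {
        if (!transfers[i].in_use) {
            transfers[i].in_use = 1;
            transfers[i].uid = uid;
            memcpy(transfers[i].fname, fname, len + 1);
            transfers[i].fd = fd;
            return 0;
        }
    }
    return -1;             /* table full */
}

/* Later blocks: return the cached fd only if uid AND file name match. */
static int transfer_lookup_fd(uid_t uid, const char *fname)
{
    int i;

    for (i = 0; i < MAX_TRANSFERS; i++) {
        if (transfers[i].in_use && transfers[i].uid == uid &&
            strcmp(transfers[i].fname, fname) == 0)
            return transfers[i].fd;
    }
    return -1;
}
```

A block arriving from a different uid, or naming a file that was never registered, gets no descriptor back and would have to be rejected by the caller.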

Revert "Handle sbcast output within the RPC thread instead of fork()'ing." · 90206f27

Tim Wickberg authored Mar 21, 2016

This reverts commit 8c8c3407488fe3f0a552d2359ef5b487330ee8ba.

Thread-only isn't portable, need to use fork() on first block to ensure
file security and containers are handled correctly.

90206f27

Add missing closing quote to line. · ea0310e4
Tim Wickberg authored Mar 21, 2016

ea0310e4

Fix compression calculation again. · f8198ee7

Tim Wickberg authored Mar 15, 2016

Fix bad cast in 3a604563, and update pct to 64-bits to prevent
truncation of intermediate value (pct * 100).

f8198ee7
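
The truncation of the intermediate value can be reproduced with a small sketch. The function names `pct_narrow` and `pct_wide` are hypothetical and the actual expression in file_bcast.c is not shown in this log; the sketch only assumes a percentage computed as a value multiplied by 100 and then divided.

```c
#include <stdint.h>

/* 32-bit intermediate: (part * 100) wraps once part exceeds UINT32_MAX / 100. */
static uint32_t pct_narrow(uint32_t part, uint32_t total)
{
    if (total == 0)
        return 0;
    return (part * 100u) / total;
}

/* 64-bit intermediate: widen before multiplying so the product cannot wrap. */
static uint32_t pct_wide(uint32_t part, uint32_t total)
{
    if (total == 0)
        return 0;
    return (uint32_t)(((uint64_t)part * 100u) / total);
}
```

For 50,000,000 bytes out of 100,000,000, the 32-bit product wraps and `pct_narrow` yields 7, while `pct_wide` yields the expected 50.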

include .in after autogen.sh ran from mod made in commit 1b81a207f2a0 · a976468a
Danny Auble authored Mar 18, 2016

a976468a

Document compress option · 149dc5c9

Morris Jette authored Mar 17, 2016

Update srun and sbcast man pages and help messages for compression
  library argument.

149dc5c9

Handle sbcast output within the RPC thread instead of fork()'ing. · f3aa2b72
Tim Wickberg authored Mar 16, 2016
```
Relies on Linux-specific behavior of setfsuid/gid, so disable on other
platforms for now.
```
f3aa2b72
Fix linking issues to only have the compression libs link to file_bcast which... · 8f3ed2e3
Danny Auble authored Mar 16, 2016
```
Fix linking issues to only have the compression libs link to file_bcast, which in turn gets pulled in by anyone linking to the .la.
```
8f3ed2e3

Move compression stuff over to the file_bcast lib, so we only have to link... · 095bd8df

Danny Auble authored Mar 16, 2016

Move compression stuff over to the file_bcast lib, so we only have to link directly with that and not pull libs all over the place.

095bd8df

Continue refactoring file_bcast.c . · 5de45d1f

Tim Wickberg authored Mar 15, 2016

- Fix issue where max_out calculation for zlib was incorrect - use
  deflateBound to properly calculate required buffer size, 1024 is
  not sufficient padding for incompressible input.

- Work towards sending file offsets across the wire in preparation
  for mmap'd output for lz4.

5de45d1f

Set block_offset for older sbcast clients. · c6987901
Tim Wickberg authored Mar 15, 2016

c6987901

Change lz4 support to pack up to block_len per message. · ef026f34

Tim Wickberg authored Mar 15, 2016

Rather than compressing at most block_len into a message, compress up to
(10 * block_len) into a single message.

10x arbitrarily chosen to mitigate buffer issues when uncompressing in
slurmd.

Testing against a 100MB zero file, this reduces the messages required from
13 to 2.

ef026f34

Write out at offset given in RPC, don't use block index * block size. · ba92681c
Tim Wickberg authored Mar 14, 2016

ba92681c
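
Writing at a sender-supplied offset, rather than at block index times block size, could be sketched as below. The struct `bcast_block` and the function `write_block_at_offset` are hypothetical names for illustration and do not reflect the actual RPC message layout; only the POSIX `pwrite` call is real. The point is that once blocks can carry differing amounts of data, the index alone no longer identifies the destination position.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical block description; field names are illustrative only. */
struct bcast_block {
    uint64_t offset;     /* destination byte offset, supplied by the sender */
    uint32_t len;        /* number of payload bytes in this block */
    const char *data;    /* payload */
};

/* Write the whole block at its own offset. Returns 0 on success, -1 on error. */
static int write_block_at_offset(int fd, const struct bcast_block *b)
{
    size_t done = 0;

    while (done < b->len) {
        ssize_t n = pwrite(fd, b->data + done, b->len - done,
                           (off_t)(b->offset + done));
        if (n < 0) {
            if (errno == EINTR)
                continue;    /* interrupted, retry the same range */
            return -1;
        }
        if (n == 0)
            return -1;       /* no progress, give up */
        done += (size_t)n;
    }
    return 0;
}
```

Because each block names its own position, blocks of unequal length land in the right place even if they are written in a different order.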

Fix chunk_remaining calculation for zlib. · 778f81a7

Tim Wickberg authored Mar 15, 2016

Regression in last commit required zlib to send each chunk as a
separate block, rather than packing multiple chunks per block.

778f81a7