- 07 Jun, 2018 24 commits
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Unused, but missed during the prior cleanup since it is split across files.
-
Tim Wickberg authored
Both are always one (on non-BG).
-
Tim Wickberg authored
-
Tim Wickberg authored
Always one (non-BG), so just change these to the increment operator.
-
Tim Wickberg authored
-
Tim Wickberg authored
SELECT_NODEDATA_SUBGRP_SIZE is always zero on non-BG, so all calling paths into this can be removed.
-
Tim Wickberg authored
-
Tim Wickberg authored
Removing the CLUSTER_FLAG_BG block forces some code up a tab stop, making this look a bit uglier than intended. Also, remove the error_cpus values, as those only applied to BG.
-
Tim Wickberg authored
NODE_STATE_ERROR is only valid on a BlueGene, so make this a no-op for now.
-
Tim Wickberg authored
As done previously, this can be removed as it always returns zero:
select_g_select_nodeinfo_get(node_ptr->select_nodeinfo, SELECT_NODEDATA_SUBCNT, NODE_STATE_ERROR, &err_cpus);
Simplify the resulting code, and remove CLUSTER_FLAG_BG code as well.
-
Tim Wickberg authored
No functional changes, but do clean up some style in these updated blocks while copying.
-
Tim Wickberg authored
No changes yet.
-
Tim Wickberg authored
Helps avoid pointer soup like &((*msg)->node_scaling). No functional change.
-
Tim Wickberg authored
Mention in RELEASE_NOTES in the appropriate sections. Tidy up function declarations and formatting while here.
-
Tim Wickberg authored
This is always 0 on non-select/bluegene. Remove other BlueGene-specific output from this block while here. This is all leading up to removal of the node_scaling concept from Slurm's internals.
-
Tim Wickberg authored
-
Tim Wickberg authored
Make it easier to use with designated initializers in the future.
-
Brian Christiansen authored
If an array task exited with a RequeueExit* code, the array job wasn't being marked as requeued. Bug 5105
-
Isaac Hartung authored
if any task in the array was requeued. This is a hint to use "sacct --duplicates" to see the whole picture of the array job. Bug 5105
-
Michael Hinton authored
Bug 5163
-
Michael Hinton authored
Bug 5163
-
- 06 Jun, 2018 14 commits
-
Morris Jette authored
Coverity CID 186480
-
Morris Jette authored
Coverity CID 186420
-
Morris Jette authored
Bug was introduced in commit 074ef9a7. The bug could result in an infinite loop given some (many) job scripts.
-
Morris Jette authored
burst_buffer.conf - Add SetExecHost flag to enable burst buffer access from the login node for interactive jobs.
-
Morris Jette authored
Coverity CID 186419 and 196422
-
Alejandro Sanchez authored
And remove the initialization before all the calls to the function. This is non-functional; the motivation is preventive, so that if we ever use slurm_mktime() we know tm_isdst is consistently set to -1. Bug 5230.
-
Dominik Bartkiewicz authored
Bug 4887.
-
Dominik Bartkiewicz authored
Disable setting triggers from non-root/slurm_user by default. Bug 4887.
-
Dominik Bartkiewicz authored
-
Brian Christiansen authored
-
Morris Jette authored
Coverity CID 186420
-
Morris Jette authored
Coverity CID 186423
-
Brian Christiansen authored
which were marked down due to ResumeTimeout. If a cloud node was marked down due to not responding by ResumeTimeout, the code inadvertently added the node back to the avail_node_bitmap -- after being cleared by set_node_down_ptr(). The scheduler would then attempt to allocate the node again, which would cause a loop of hitting ResumeTimeout and allocating the downed node again. Bug 5264
-
Morris Jette authored
Coverity CID 186424
-
- 05 Jun, 2018 2 commits
-
Morris Jette authored
Without this change, slurmctld would abort when running test38.1:

slurmctld: error: /home/jette/Desktop/SLURM/slurm.git/src/slurmctld/job_mgr.c:11177: get_next_job_id(): Assertion (verify_lock(FED_LOCK, READ_LOCK)) failed.
Aborted (core dumped)
-
Morris Jette authored
slurmctld would abort if started while slurmdbd was down and slurmdbd was later started:

slurmctld: (node_scheduler.c:3182) job:14683 gres_req:NONE gres_alloc:
slurmctld: (node_scheduler.c:2868) job:14683 gres:NONE gres_alloc:
slurmctld: sched: Allocate JobID=14683 NodeList=nid00001 #CPUs=1 Partition=debug
slurmctld: error: slurmdbd: Sending PersistInit msg: Connection refused
slurmctld: error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
slurmctld: error: /home/jette/Desktop/SLURM/slurm.git/src/slurmctld/controller.c:2456: set_cluster_tres(): Assertion (verify_lock(NODE_LOCK, WRITE_LOCK)) failed.
==29635==
==29635== Process terminating with default action of signal 6 (SIGABRT): dumping core
==29635==    at 0x54980BB: raise (raise.c:51)
==29635==    by 0x5499F5C: abort (abort.c:90)
==29635==    by 0x4FCBFC0: __xassert_failed (xassert.c:57)
==29635==    by 0x131FC5: set_cluster_tres (controller.c:2456)
==29635==    by 0x1329B4: _assoc_cache_mgr (controller.c:3230)
==29635==    by 0x52497FB: start_thread (pthread_create.c:465)
==29635==    by 0x5575B5E: clone (clone.S:95)
-