Commits · ccf66079346cf74ff6942f7f57a69c35e41dd0ca · Manuel G. Marciani / ces_slurm_simulator

30 Sep, 2016 8 commits
- Srun pending steps, decrease retry frequency · ccf66079
  Morris Jette authored Sep 30, 2016
```
Previous logic would always retry in 60 to 69 secs (based upon srun PID).
  New logic will wait up to SlurmctldTimeout + 9 secs (minimum value 60
  seconds, maximum 309 seconds).
```
  ccf66079
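
The two delay ranges in ccf66079 can be illustrated with a small sketch. This is not the srun source: the function names are hypothetical, and it assumes the PID contributes a 0 to 9 second offset and that the SlurmctldTimeout value is clamped to 60..300 seconds, which is one reading consistent with the stated 60 and 309 second bounds.

```c
/* Hypothetical helper: spread retries by 0..9 seconds based on the PID. */
static unsigned int ex_pid_offset(unsigned int pid)
{
	return pid % 10;
}

/* Previous behaviour as described: always 60 to 69 seconds. */
unsigned int ex_old_retry_delay(unsigned int pid)
{
	return 60 + ex_pid_offset(pid);
}

/*
 * New behaviour under the stated assumption: the configured timeout is
 * clamped to 60..300 seconds before the PID offset is added, giving a
 * result between 60 and 309 seconds.
 */
unsigned int ex_new_retry_delay(unsigned int slurmctld_timeout,
				unsigned int pid)
{
	unsigned int base = slurmctld_timeout;

	if (base < 60)
		base = 60;
	if (base > 300)
		base = 300;
	return base + ex_pid_offset(pid);
}
```

With a 120 second timeout, for example, this sketch waits 120 to 129 seconds instead of 60 to 69, so a pending srun polls less often.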
- Rewrite srun pending step logic · 8f2ef7a5
  Morris Jette authored Sep 30, 2016
```
This change can reduce RPCs sent by srun by up to 50%.
```
  8f2ef7a5
- Merge branch 'slurm-16.05' · a9934d42
  Morris Jette authored Sep 29, 2016
  
  a9934d42
- Correction to commit 80fae1eb · 43a197cd
  Morris Jette authored Sep 29, 2016
  
  43a197cd
- Don't re-use a variable · 6ec9b698
  Morris Jette authored Sep 29, 2016
```
Counter became invalid due to re-use of a variable within a loop.
```
  6ec9b698
- Avoid creating duplicate pending step records · 3a51703e
  Morris Jette authored Sep 29, 2016
```
This can happen due to signal/retry logic. The result is vestigial
  srun ports that are no longer valid, and slurmctld sending messages
  to those invalid ports to notify them about available resources as
  other steps complete.
```
  3a51703e
- Add srun host & PID to job step data structures · 5a95e7ff
  Morris Jette authored Sep 29, 2016
  
  5a95e7ff
- Merge branch 'slurm-16.05' · 7bbdfc2a
  Morris Jette authored Sep 29, 2016
  
  7bbdfc2a
29 Sep, 2016 25 commits
- No change in logic · 80fae1eb
  Morris Jette authored Sep 29, 2016
```
Fix indent and add brackets
```
  80fae1eb
- Fix memory leak reported by CLANG · ade05a08
  Morris Jette authored Sep 29, 2016
  
  ade05a08
- Update NEWS for start of v17.02.0-pre3 · 513b23ba
  Morris Jette authored Sep 29, 2016
  
  513b23ba
- Update META for v17.02.0-pre2 tag · 0a684a1f
  Morris Jette authored Sep 29, 2016
  
  0a684a1f
- Merge branch 'slurm-16.05' · bb9895a8
  Morris Jette authored Sep 29, 2016
  
  bb9895a8
- Start NEWS for v16.05.6 release · 1a862096
  Morris Jette authored Sep 29, 2016
  
  1a862096
- Merge branch 'slurm-16.05' · def6cb25
  Morris Jette authored Sep 29, 2016
  
  def6cb25
- Update META for v16.05.5 tag · 50d1e5dd
  Morris Jette authored Sep 29, 2016
  
  50d1e5dd
- Fix display for nice and priority values in sprio/scontrol/squeue. · df70b651
  Alejandro Sanchez authored Sep 29, 2016
```
Also correct the value of NICE_OFFSET used within the perl API.

Bug 3098.
```
  df70b651
- Merge branch 'slurm-16.05' · bf47569f
  Morris Jette authored Sep 29, 2016
  
  bf47569f
- Add a couple of examples for adding user to account · f2757d9a
  Morris Jette authored Sep 29, 2016
  
  f2757d9a
- Merge branch 'slurm-16.05' · 2b61d656
  Tim Wickberg authored Sep 29, 2016
  
  2b61d656
- Docs - add Mellanox to contributing organizations list. · 71a39bf4
  Tim Wickberg authored Sep 29, 2016
  
  71a39bf4
- Docs - update SchedMD mailing address. · a561f386
  Tim Wickberg authored Sep 29, 2016
  
  a561f386
- mpi/pmix: provide temp directory for the application. · 901aeeb3
  Artem Polyakov authored Sep 04, 2016
```
Bug 3051.
```
  901aeeb3
- Merge branch 'slurm-16.05' · 579f7a69
  Tim Wickberg authored Sep 29, 2016
  
  579f7a69
- Ignore changes to GRES/QOS when the value is the same as before. · c11e092a
  Tim Wickberg authored Sep 29, 2016
```
Otherwise updates would be rejected for running jobs even if there
would be no impact. Most common when the job_submit plugin is overriding
QOS/GRES values on everything; without this change an update to "comment"
or other fields would fail with ESLURM_JOB_NOT_PENDING.

Bug 3117.
```
  c11e092a
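
The idea behind c11e092a can be sketched as a guard that treats an unchanged value as a no-op before the "job must be pending" check runs. Everything here is illustrative: `ex_update_gres`, `ex_same_str` and the `EX_*` return codes are hypothetical stand-ins, not the slurmctld implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { EX_OK = 0, EX_JOB_NOT_PENDING = 1 };

/* True when both strings are NULL or have identical contents. */
static bool ex_same_str(const char *a, const char *b)
{
	if (!a || !b)
		return a == b;
	return strcmp(a, b) == 0;
}

/*
 * Hypothetical update check: a request that repeats the current value
 * is accepted without doing anything, so a running job is only
 * rejected when the value would really change.
 */
int ex_update_gres(bool job_pending, const char *current,
		   const char *requested)
{
	if (requested == NULL || ex_same_str(current, requested))
		return EX_OK;			/* nothing to change */
	if (!job_pending)
		return EX_JOB_NOT_PENDING;	/* real change, job running */
	return EX_OK;	/* a real implementation would apply the change */
}
```

With this ordering, a job_submit plugin that rewrites GRES to the value the job already has no longer blocks an unrelated update on a running job.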
- Cray - add NHC_ABSOLUTELY_NO to SelectTypeParameters. · 7a379920
  Tim Wickberg authored Sep 29, 2016
```
Never run NHC, not even in the edge cases where NHC_NO would still
launch NHC afterwards.

Bug 3105.
```
  7a379920
- Docs - add Mellanox to contributing organizations list. · 620c084d
  Tim Wickberg authored Sep 29, 2016
  
  620c084d
- Docs - update SchedMD mailing address. · 1fe6cee5
  Tim Wickberg authored Sep 29, 2016
  
  1fe6cee5
- Prevent update to last_part_update by partition uid access updates when no change made. · b9af7b9c
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  b9af7b9c
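
A minimal sketch of the "only touch the timestamp when something changed" pattern from b9af7b9c follows. The types and names (`ex_part_t`, `ex_update_part_uids`, `ex_last_part_update`) are assumptions made for illustration; the real code walks Slurm's partition list with list_for_each.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define EX_MAX_UIDS 16

/* Hypothetical partition record holding its allowed-UID list. */
typedef struct {
	unsigned int uids[EX_MAX_UIDS];
	size_t uid_cnt;
} ex_part_t;

/* Stand-in for the global timestamp that the scheduler watches. */
time_t ex_last_part_update = 0;

/*
 * Store a new UID list and report whether it differs from the old one.
 * The timestamp is bumped only on a real change.
 */
bool ex_update_part_uids(ex_part_t *part, const unsigned int *uids,
			 size_t cnt, time_t now)
{
	if (cnt > EX_MAX_UIDS)
		return false;	/* sketch only: reject oversized lists */

	if ((part->uid_cnt == cnt) &&
	    ((cnt == 0) || !memcmp(part->uids, uids, cnt * sizeof(*uids))))
		return false;	/* nothing changed, leave timestamp alone */

	if (cnt > 0)
		memcpy(part->uids, uids, cnt * sizeof(*uids));
	part->uid_cnt = cnt;
	ex_last_part_update = now;
	return true;
}
```

A consumer that restarts its cycle whenever the timestamp moves, as the backfill scheduler is described as doing, would then restart only after a genuine access-list change.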
- Prevent update to last_part_update by partition uid access updates when no change made. · 9680c9c1
  Tim Wickberg authored Sep 29, 2016
```
Switch to list_for_each, and check if access list actually changed
after each update before updating last_part_update. This prevents
the backfill scheduler from resetting mid-cycle unnecessarily.

Bug 3123.
```
  9680c9c1
- Merge branch 'slurm-16.05' · 08d8265f
  Morris Jette authored Sep 28, 2016
  
  08d8265f
- Correct buffer allocation size · 7e08feae
  Josko Plazonic authored Sep 28, 2016
```
Buffer was being allocated based upon the size of the wrong structure.
  The buffer would have been larger than required, so this bug
  should not have caused any failures.
Bug 3127.
```
  7e08feae
- Add protocol version to slurmstepd startup message · d7c30f44
  Morris Jette authored Sep 28, 2016
```
Add protocol version to slurmd startup communications for slurmstepd to
    permit changes in the protocol.
```
  d7c30f44
28 Sep, 2016 7 commits

- Merge branch 'slurm-16.05' · e20c948d
  Morris Jette authored Sep 28, 2016
  
  e20c948d

- Fix test to work with DefMemNode configured · 494814e2
  Morris Jette authored Sep 28, 2016

  Added memory limits to the job and step. Without these, only one
  step may be able to run at a time, which breaks the test.

  494814e2

- Add task launch flag field · 3251406c
  Morris Jette authored Sep 28, 2016

  Add "flag" field to launch_tasks_request_msg. Remove the following fields
  (moved into flags): multi_prog, task_flags, user_managed_io, pty,
  buffered_stdio, and labelio. More flags to be added later.

  3251406c
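
Folding several boolean fields into one flags word, as 3251406c describes, usually looks like the sketch below. The `EX_LAUNCH_*` names, their bit values and the `ex_launch_msg_t` type are hypothetical and chosen only to mirror the field names in the commit message; they are not taken from the Slurm headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit assignments, one per former boolean field. */
#define EX_LAUNCH_MULTI_PROG      0x0001
#define EX_LAUNCH_USER_MANAGED_IO 0x0002
#define EX_LAUNCH_PTY             0x0004
#define EX_LAUNCH_BUFFERED_IO     0x0008
#define EX_LAUNCH_LABEL_IO        0x0010

/* Hypothetical message with a single flags word. */
typedef struct {
	uint32_t flags;
} ex_launch_msg_t;

/* Set or clear one flag without disturbing the others. */
void ex_set_flag(ex_launch_msg_t *msg, uint32_t flag, bool on)
{
	if (on)
		msg->flags |= flag;
	else
		msg->flags &= ~flag;
}

/* Test whether a flag is set. */
bool ex_has_flag(const ex_launch_msg_t *msg, uint32_t flag)
{
	return (msg->flags & flag) != 0;
}
```

One benefit of this layout is that a new option only needs a new bit, so the message format does not have to grow a field each time.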

- Fix clang reported issues · ab40ac9c
  Brian Christiansen authored Sep 28, 2016
  
  ab40ac9c

- Fix issues reported by coverity · 9164e65a
  Brian Christiansen authored Sep 28, 2016

*** CID 149289:  Null pointer dereferences  (REVERSE_INULL)
/slurm/src/slurmctld/step_mgr.c: 1993 in _step_dealloc_lps()
1987     	xassert(job_resrcs_ptr->cpus);
1988     	xassert(job_resrcs_ptr->cpus_used);
1989
1990     	if (step_ptr->step_layout == NULL)	/* batch step */
1991     		return;
1992
>>>     CID 149289:  Null pointer dereferences  (REVERSE_INULL)
>>>     Null-checking "job_resrcs_ptr" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
1993     	if (job_resrcs_ptr == NULL)
1994     		return;
1995     	i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
1996     	i_last  = bit_fls(job_resrcs_ptr->node_bitmap);
1997     	if (i_first == -1)	/* empty bitmap */
1998     		return;

** CID 149288:  Resource leaks  (RESOURCE_LEAK)
/slurm/src/api/reconfigure.c: 171 in _send_message_controller()

________________________________________________________________________________________________________
*** CID 149288:  Resource leaks  (RESOURCE_LEAK)
/slurm/src/api/reconfigure.c: 171 in _send_message_controller()
165     	}
166     	resp_msg = xmalloc(sizeof(slurm_msg_t));
167     	slurm_msg_t_init(resp_msg);
168
169     	if ((rc = slurm_receive_msg(fd, resp_msg, 0)) != 0) {
170     		slurm_shutdown_msg_conn(fd);
>>>     CID 149288:  Resource leaks  (RESOURCE_LEAK)
>>>     Variable "resp_msg" going out of scope leaks the storage it points to.
171     		return SLURMCTLD_COMMUNICATIONS_RECEIVE_ERROR;
172     	}
173
174     	if (slurm_shutdown_msg_conn(fd) != SLURM_SUCCESS)
175     		rc = SLURMCTLD_COMMUNICATIONS_SHUTDOWN_ERROR;
176     	else if (resp_msg->msg_type != RESPONSE_SLURM_RC)

** CID 149287:  Integer handling issues  (NEGATIVE_RETURNS)
/slurm/src/slurmdbd/backup.c: 65 in run_dbd_backup()

________________________________________________________________________________________________________
*** CID 149287:  Integer handling issues  (NEGATIVE_RETURNS)
/slurm/src/slurmdbd/backup.c: 65 in run_dbd_backup()
59     	primary_resumed = false;
60
61     	memset(&slurmdbd_conn, 0, sizeof(slurm_persist_conn_t));
62     	slurmdbd_conn.rem_host = slurmdbd_conf->dbd_addr;
63     	slurmdbd_conn.rem_port = slurmdbd_conf->dbd_port;
64     	slurmdbd_conn.cluster_name = "backup_slurmdbd";
>>>     CID 149287:  Integer handling issues  (NEGATIVE_RETURNS)
>>>     Assigning: "slurmdbd_conn.fd" = a negative value.
65     	slurmdbd_conn.fd = -1;
66     	slurmdbd_conn.shutdown = &shutdown_time;
67
68     	slurm_persist_conn_open_without_init(&slurmdbd_conn);
69
70     	/* repeatedly ping Primary */

** CID 149286:  Integer handling issues  (INCOMPATIBLE_CAST)

________________________________________________________________________________________________________
*** CID 149286:  Integer handling issues  (INCOMPATIBLE_CAST)
/slurm/src/common/slurmdbd_defs.c: 3096 in slurmdbd_unpack_init_msg()
3090
3091     	*msg = msg_ptr;
3092
3093     	/* We don't use rollback going forward and version was packed after it
3094     	 * unfortunately */
3095     	if (rpc_version < SLURM_17_02_PROTOCOL_VERSION)
>>>     CID 149286:  Integer handling issues  (INCOMPATIBLE_CAST)
>>>     Pointer "&tmp32" points to an object whose effective type is "unsigned int" (32 bits, unsigned) but is dereferenced as a narrower "unsigned short" (16 bits, unsigned).  This may lead to unexpected results depending on machine endianness.
3096     		safe_unpack16((uint16_t *)&tmp32, buffer);
3097     	safe_unpack16(&msg_ptr->version, buffer);
3098     	safe_unpackstr_xmalloc(&msg_ptr->cluster_name, &tmp32, buffer);
3099
3100     	/* We find out the version of the caller right here so use
3101     	   that as the rpc_version. */

** CID 149285:  Null pointer dereferences  (FORWARD_NULL)
/slurm/src/common/slurm_persist_conn.c: 588 in slurm_persist_conn_open()

________________________________________________________________________________________________________
*** CID 149285:  Null pointer dereferences  (FORWARD_NULL)
/slurm/src/common/slurm_persist_conn.c: 588 in slurm_persist_conn_open()
582     				error("%s: Failed to unpack persistent connection init resp message from %s:%d",
583     				      __func__,
584     				      persist_conn->rem_host,
585     				      persist_conn->rem_port);
586     			_close_fd(&persist_conn->fd);
587     		} else
>>>     CID 149285:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "resp".
588     			persist_conn->version = resp->ret_info;
589     	}
590
591     end_it:
592
593     	slurm_persist_free_rc_msg(resp);

** CID 149284:  Error handling issues  (CHECKED_RETURN)
/slurm/src/common/slurmdbd_defs.c: 1665 in _send_fini_msg()

________________________________________________________________________________________________________
*** CID 149284:  Error handling issues  (CHECKED_RETURN)
/slurm/src/common/slurmdbd_defs.c: 1665 in _send_fini_msg()
1659     	buffer = init_buf(1024);
1660     	pack16((uint16_t) DBD_FINI, buffer);
1661     	req.commit  = 0;
1662     	req.close_conn   = 1;
1663     	slurmdbd_pack_fini_msg(&req, SLURM_PROTOCOL_VERSION, buffer);
1664
>>>     CID 149284:  Error handling issues  (CHECKED_RETURN)
>>>     Calling "slurm_persist_send_msg" without checking return value (as is done elsewhere 5 out of 6 times).
1665     	slurm_persist_send_msg(slurmdbd_conn, buffer);
1666     	free_buf(buffer);
1667
1668     	return SLURM_SUCCESS;
1669     }
1670

9164e65a
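
As an illustration of how a finding like CID 149286 (INCOMPATIBLE_CAST) is commonly resolved, the sketch below reads the skipped 16-bit field into a correctly sized temporary instead of casting the address of a 32-bit variable. It is not the change made in 9164e65a: `ex_unpack16` and `ex_unpack_version` are hypothetical stand-ins for the real unpack helpers, and the big-endian wire format is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical unpack helper: read one big-endian 16-bit value from
 * buf at *offset and advance the offset. Returns 0 on success, -1 if
 * the buffer is too short.
 */
int ex_unpack16(uint16_t *out, const uint8_t *buf, size_t len,
		size_t *offset)
{
	if (*offset + 2 > len)
		return -1;
	*out = (uint16_t)((buf[*offset] << 8) | buf[*offset + 1]);
	*offset += 2;
	return 0;
}

/* Skip a legacy 16-bit field, then read the 16-bit version. */
int ex_unpack_version(const uint8_t *buf, size_t len, uint16_t *version)
{
	size_t offset = 0;
	uint16_t skipped;	/* 16-bit temporary, no pointer cast needed */

	if (ex_unpack16(&skipped, buf, len, &offset) != 0)
		return -1;
	return ex_unpack16(version, buf, len, &offset);
}
```

Because the temporary has the same width as the value being unpacked, the result no longer depends on which half of a wider object gets written on a given machine.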

- Merge branch 'slurm-16.05' · 8fcdceeb
  Morris Jette authored Sep 28, 2016
  
  8fcdceeb
- Remove use of CLUSTER_FLAG_BGL and CLUSTER_FLAG_BGP flags. · eec5bccb
  Tim Wickberg authored Sep 28, 2016
```
Leave macros defined for now to avoid bit reuse.
```
  eec5bccb