- 30 Sep, 2016 8 commits
-
-
Morris Jette authored
Previous logic always retried after 60 to 69 seconds (based upon the srun PID). New logic waits up to SlurmctldTimeout + 9 seconds (minimum value 60 seconds, maximum 309 seconds).
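One possible reading of the retry change described above, as a minimal sketch rather than the actual srun code. It assumes the base wait is SlurmctldTimeout clamped to the range 60 to 300 seconds, and that the PID contributes a 0 to 9 second offset, as the old "60 to 69 secs" behavior suggests. The function name `retry_wait_secs` and its parameters are hypothetical.

```c
#include <sys/types.h>

/*
 * Hypothetical helper, not taken from the Slurm source.
 * Assumption: base wait is SlurmctldTimeout clamped to [60, 300] seconds,
 * plus a per-process offset of 0-9 seconds derived from the PID so that
 * many srun processes do not all retry at the same instant.
 */
static int retry_wait_secs(int slurmctld_timeout, pid_t pid)
{
	int base = slurmctld_timeout;

	if (base < 60)
		base = 60;	/* never shorter than the old minimum */
	if (base > 300)
		base = 300;	/* 300 + 9 gives the stated 309 second maximum */

	return base + (int)(pid % 10);
}
```

Under these assumptions, a SlurmctldTimeout of 120 and a PID ending in 7 would give a wait of 127 seconds.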
-
Morris Jette authored
This change can reduce RPCs sent by srun by up to 50%.
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Counter became invalid due to re-use of a variable within a loop.
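The affected code is not shown here, so the following is only a generic illustration of this class of bug, not the Slurm fix itself. All names (`count_nonzero_buggy`, `count_nonzero_fixed`, `ROWS`, `COLS`) are hypothetical. The buggy variant re-uses one variable both as the running total and as per-row scratch space, so the total is overwritten on every pass of the outer loop.

```c
#define ROWS 3
#define COLS 4

/* Buggy: "cnt" is meant to be the total, but it is re-used (and reset)
 * as the per-row tally inside the loop, so only the last row survives. */
static int count_nonzero_buggy(int grid[ROWS][COLS])
{
	int cnt = 0;

	for (int r = 0; r < ROWS; r++) {
		cnt = 0;	/* re-use wipes the running total */
		for (int c = 0; c < COLS; c++) {
			if (grid[r][c])
				cnt++;
		}
	}
	return cnt;
}

/* Fixed: separate variables for the per-row tally and the total. */
static int count_nonzero_fixed(int grid[ROWS][COLS])
{
	int total = 0;

	for (int r = 0; r < ROWS; r++) {
		int row_cnt = 0;

		for (int c = 0; c < COLS; c++) {
			if (grid[r][c])
				row_cnt++;
		}
		total += row_cnt;
	}
	return total;
}
```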
-
Morris Jette authored
This can happen due to signal/retry logic. The result is vestigial srun ports that are no longer valid, with slurmctld sending messages to those invalid ports to notify them about available resources as other steps complete.
-
Morris Jette authored
-
Morris Jette authored
-
- 29 Sep, 2016 25 commits
-
-
Morris Jette authored
Fix indent and add brackets
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Alejandro Sanchez authored
Also correct the value of NICE_OFFSET used within the perl API. Bug 3098.
-
Morris Jette authored
-
Morris Jette authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Artem Polyakov authored
Bug 3051.
-
Tim Wickberg authored
-
Tim Wickberg authored
Otherwise updates would be rejected for running jobs even if there would be no impact. Most common when the job_submit plugin is overriding QOS/GRES values on everything; without this change an update to "comment" or other fields would fail with ESLURM_JOB_NOT_PENDING. Bug 3117.
-
Tim Wickberg authored
Never run NHC, including in an edge case where NHC_NO would still launch NHC afterward. Bug 3105.
-
Tim Wickberg authored
-
Tim Wickberg authored
-
Tim Wickberg authored
Switch to list_for_each, and check if access list actually changed after each update before updating last_prat_update. This prevents the backfill scheduler from resetting mid-cycle unnecessarily. Bug 3123.
-
Tim Wickberg authored
Switch to list_for_each, and check if access list actually changed after each update before updating last_prat_update. This prevents the backfill scheduler from resetting mid-cycle unnecessarily. Bug 3123.
-
Morris Jette authored
-
Josko Plazonic authored
Buffer was being allocated based upon the size of the wrong structure. The buffer would have been larger than required, so this bug should not have caused any failures. Bug 3127.
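A common way to avoid sizing a buffer from the wrong structure is to take `sizeof` of the dereferenced pointer rather than naming a type. The sketch below uses plain `calloc` and two hypothetical structures (`small_rec`, `large_rec`); it is not the code touched by this commit.

```c
#include <stdlib.h>

/* Hypothetical structures, for illustration only. */
struct small_rec {
	int id;
};

struct large_rec {
	int id;
	char name[64];
};

static struct small_rec *alloc_small_rec(void)
{
	/*
	 * Error-prone form: calloc(1, sizeof(struct large_rec)) would
	 * compile and work, but over-allocates, which matches the
	 * harmless-but-wrong behavior described above.
	 *
	 * sizeof(*rec) always follows the pointer's own type.
	 */
	struct small_rec *rec = calloc(1, sizeof(*rec));

	return rec;	/* caller must free(); NULL on allocation failure */
}
```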
-
Morris Jette authored
Add protocol version to slurmd startup communications for slurmstepd to permit changes in the protocol.
-
- 28 Sep, 2016 7 commits
-
-
Morris Jette authored
-
Morris Jette authored
Added memory limits to the job and step. Without these, only one step may be able to run at a time, which breaks the test.
-
Morris Jette authored
Add "flag" field to launch_tasks_request_msg. Remove the following fields (moved into flags): multi_prog, task_flags, user_managed_io, pty, buffered_stdio, and labelio. More flags to be added later.
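Folding several boolean fields into one flags word can be sketched as follows. The structure, macro names, and bit values here are hypothetical (prefixed `EX_` / `ex_`) and are not the definitions used in Slurm; only the list of folded fields comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, one bit per former boolean field. */
#define EX_LAUNCH_MULTI_PROG      0x00000001
#define EX_LAUNCH_USER_MANAGED_IO 0x00000002
#define EX_LAUNCH_PTY             0x00000004
#define EX_LAUNCH_BUFFERED_STDIO  0x00000008
#define EX_LAUNCH_LABELIO         0x00000010

/* Hypothetical stand-in for the launch request message. */
struct ex_launch_req {
	uint32_t flags;	/* replaces the individual boolean fields */
};

static void ex_flag_set(struct ex_launch_req *req, uint32_t flag)
{
	req->flags |= flag;
}

static void ex_flag_clear(struct ex_launch_req *req, uint32_t flag)
{
	req->flags &= ~flag;
}

static bool ex_flag_is_set(const struct ex_launch_req *req, uint32_t flag)
{
	return (req->flags & flag) != 0;
}
```

With this layout, adding a new option later only needs a new bit definition rather than a new field in the packed message.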
-
Brian Christiansen authored
-
Brian Christiansen authored
*** CID 149289: Null pointer dereferences (REVERSE_INULL)
/slurm/src/slurmctld/step_mgr.c: 1993 in _step_dealloc_lps()
1987     xassert(job_resrcs_ptr->cpus);
1988     xassert(job_resrcs_ptr->cpus_used);
1989
1990     if (step_ptr->step_layout == NULL) /* batch step */
1991         return;
1992
>>> CID 149289: Null pointer dereferences (REVERSE_INULL)
>>> Null-checking "job_resrcs_ptr" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
1993     if (job_resrcs_ptr == NULL)
1994         return;
1995     i_first = bit_ffs(job_resrcs_ptr->node_bitmap);
1996     i_last = bit_fls(job_resrcs_ptr->node_bitmap);
1997     if (i_first == -1) /* empty bitmap */
1998         return;

** CID 149288: Resource leaks (RESOURCE_LEAK)
/slurm/src/api/reconfigure.c: 171 in _send_message_controller()

________________________________________________________________________________________________________
*** CID 149288: Resource leaks (RESOURCE_LEAK)
/slurm/src/api/reconfigure.c: 171 in _send_message_controller()
165     }
166     resp_msg = xmalloc(sizeof(slurm_msg_t));
167     slurm_msg_t_init(resp_msg);
168
169     if ((rc = slurm_receive_msg(fd, resp_msg, 0)) != 0) {
170         slurm_shutdown_msg_conn(fd);
>>> CID 149288: Resource leaks (RESOURCE_LEAK)
>>> Variable "resp_msg" going out of scope leaks the storage it points to.
171         return SLURMCTLD_COMMUNICATIONS_RECEIVE_ERROR;
172     }
173
174     if (slurm_shutdown_msg_conn(fd) != SLURM_SUCCESS)
175         rc = SLURMCTLD_COMMUNICATIONS_SHUTDOWN_ERROR;
176     else if (resp_msg->msg_type != RESPONSE_SLURM_RC)

** CID 149287: Integer handling issues (NEGATIVE_RETURNS)
/slurm/src/slurmdbd/backup.c: 65 in run_dbd_backup()

________________________________________________________________________________________________________
*** CID 149287: Integer handling issues (NEGATIVE_RETURNS)
/slurm/src/slurmdbd/backup.c: 65 in run_dbd_backup()
59     primary_resumed = false;
60
61     memset(&slurmdbd_conn, 0, sizeof(slurm_persist_conn_t));
62     slurmdbd_conn.rem_host = slurmdbd_conf->dbd_addr;
63     slurmdbd_conn.rem_port = slurmdbd_conf->dbd_port;
64     slurmdbd_conn.cluster_name = "backup_slurmdbd";
>>> CID 149287: Integer handling issues (NEGATIVE_RETURNS)
>>> Assigning: "slurmdbd_conn.fd" = a negative value.
65     slurmdbd_conn.fd = -1;
66     slurmdbd_conn.shutdown = &shutdown_time;
67
68     slurm_persist_conn_open_without_init(&slurmdbd_conn);
69
70     /* repeatedly ping Primary */

** CID 149286: Integer handling issues (INCOMPATIBLE_CAST)

________________________________________________________________________________________________________
*** CID 149286: Integer handling issues (INCOMPATIBLE_CAST)
/slurm/src/common/slurmdbd_defs.c: 3096 in slurmdbd_unpack_init_msg()
3090
3091     *msg = msg_ptr;
3092
3093     /* We don't use rollback going forward and version was packed after it
3094      * unfortunately */
3095     if (rpc_version < SLURM_17_02_PROTOCOL_VERSION)
>>> CID 149286: Integer handling issues (INCOMPATIBLE_CAST)
>>> Pointer "&tmp32" points to an object whose effective type is "unsigned int" (32 bits, unsigned) but is dereferenced as a narrower "unsigned short" (16 bits, unsigned). This may lead to unexpected results depending on machine endianness.
3096         safe_unpack16((uint16_t *)&tmp32, buffer);
3097     safe_unpack16(&msg_ptr->version, buffer);
3098     safe_unpackstr_xmalloc(&msg_ptr->cluster_name, &tmp32, buffer);
3099
3100     /* We find out the version of the caller right here so use
3101        that as the rpc_version. */

** CID 149285: Null pointer dereferences (FORWARD_NULL)
/slurm/src/common/slurm_persist_conn.c: 588 in slurm_persist_conn_open()

________________________________________________________________________________________________________
*** CID 149285: Null pointer dereferences (FORWARD_NULL)
/slurm/src/common/slurm_persist_conn.c: 588 in slurm_persist_conn_open()
582             error("%s: Failed to unpack persistent connection init resp message from %s:%d",
583                   __func__,
584                   persist_conn->rem_host,
585                   persist_conn->rem_port);
586             _close_fd(&persist_conn->fd);
587         } else
>>> CID 149285: Null pointer dereferences (FORWARD_NULL)
>>> Dereferencing null pointer "resp".
588             persist_conn->version = resp->ret_info;
589     }
590
591 end_it:
592
593     slurm_persist_free_rc_msg(resp);

** CID 149284: Error handling issues (CHECKED_RETURN)
/slurm/src/common/slurmdbd_defs.c: 1665 in _send_fini_msg()

________________________________________________________________________________________________________
*** CID 149284: Error handling issues (CHECKED_RETURN)
/slurm/src/common/slurmdbd_defs.c: 1665 in _send_fini_msg()
1659     buffer = init_buf(1024);
1660     pack16((uint16_t) DBD_FINI, buffer);
1661     req.commit = 0;
1662     req.close_conn = 1;
1663     slurmdbd_pack_fini_msg(&req, SLURM_PROTOCOL_VERSION, buffer);
1664
>>> CID 149284: Error handling issues (CHECKED_RETURN)
>>> Calling "slurm_persist_send_msg" without checking return value (as is done elsewhere 5 out of 6 times).
1665     slurm_persist_send_msg(slurmdbd_conn, buffer);
1666     free_buf(buffer);
1667
1668     return SLURM_SUCCESS;
1669 }
1670
-
Morris Jette authored
-
Tim Wickberg authored
Leave macros defined for now to avoid bit reuse.
-