slurmstepd: avoid coredump in case of NULL job
We build slurm with --enable-memory-leak-debug and twice encountered the
same core dump when user 'root' was trying to run jobs during a maintenance
session. The root user is not in the accounting database, which explains
the errors seen below. The gdb session shows that in this invocation the
job pointer was NULL.

palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core
...
Modify: 2011-04-04 19:34:44.000000000 +0200

slurmctld.log
[2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
[2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
[2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
[2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
[2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
[2011-04-04T19:34:44] completing job 3254
[2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
[2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
[2011-04-04T19:34:44] requeue batch job 3254
[2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285

(gdb) core-file palu7-slurmstepd-6602.core
[New Thread 6604]
Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
Program terminated with signal 11, Segmentation fault.
#0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
413             jobacct_gather_g_destroy(job->jobacct);
(gdb) print job
$1 = (slurmd_job_t *) 0x0
(gdb) list
408
409     #ifdef MEMORY_LEAK_DEBUG
410     static void
411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
412     {
413             jobacct_gather_g_destroy(job->jobacct);
414             if (!job->batch)
415                     job_destroy(job);
416             /*
417              * The message cannot be freed until the jobstep is complete
(gdb) print msg
$2 = (slurm_msg_t *) 0x916008
(gdb) print rc
$3 = -1
(gdb)

The patch tests for a NULL job argument before the calls that need to
dereference the job pointer.
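
The guard described above can be illustrated with a minimal, standalone
sketch. This is not the actual patch: demo_job_t, demo_msg_t,
demo_job_destroy() and demo_step_cleanup() are hypothetical stand-ins for
the slurm types and functions, and the real cleanup does more work. The
sketch only shows the idea of skipping every dereference of job when the
pointer is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for slurmd_job_t and slurm_msg_t. */
typedef struct {
	void *jobacct;   /* accounting data owned by the job */
	bool  batch;     /* batch jobs are destroyed elsewhere */
} demo_job_t;

typedef struct {
	int id;
} demo_msg_t;

static int jobs_destroyed = 0;

/* Like the real destructor, this callee requires a valid pointer. */
static void demo_job_destroy(demo_job_t *job)
{
	assert(job != NULL);
	free(job->jobacct);
	free(job);
	jobs_destroyed++;
}

/*
 * Cleanup that tolerates job == NULL (e.g. when job setup failed because
 * the user was not found). Returns true if job-specific cleanup ran.
 */
static bool demo_step_cleanup(demo_job_t *job, demo_msg_t *msg, int rc)
{
	bool cleaned = false;

	if (job != NULL) {
		/* Everything that dereferences job lives inside the guard. */
		free(job->jobacct);
		job->jobacct = NULL;
		if (!job->batch)
			demo_job_destroy(job);
		cleaned = true;
	}

	/* Cleanup that does not depend on job would continue here. */
	(void) msg;
	(void) rc;
	return cleaned;
}
```

Called with the values from the gdb session (job == NULL, rc == -1), the
sketch returns without touching job instead of faulting.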