10 Apr, 2011 (4 commits)
    • slurmstepd: avoid coredump in case of NULL job · e0d92b8a
      Moe Jette authored
      We build slurm with --enable-memory-leak-debug and twice encountered the same core
      dump when user 'root' tried to run jobs during a maintenance session.
      
      The root user is not in the accounting database, which explains the errors seen
      below. The gdb session shows that in this invocation the job pointer passed to
      _step_cleanup() was NULL.
      
      palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core 
      ...
      Modify: 2011-04-04 19:34:44.000000000 +0200
      
      slurmctld.log
      [2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
      [2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
      [2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
      [2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
      [2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
      [2011-04-04T19:34:44] completing job 3254
      [2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
      [2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
      [2011-04-04T19:34:44] requeue batch job 3254
      [2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285
      
      (gdb) core-file palu7-slurmstepd-6602.core 
      [New Thread 6604]
      Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
      Program terminated with signal 11, Segmentation fault.
      #0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
      413             jobacct_gather_g_destroy(job->jobacct);
      (gdb) print job
      $1 = (slurmd_job_t *) 0x0
      (gdb) list
      408
      409     #ifdef MEMORY_LEAK_DEBUG
      410     static void
      411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      412     {
      413             jobacct_gather_g_destroy(job->jobacct);
      414             if (!job->batch)
      415                     job_destroy(job);
      416             /*
      417              * The message cannot be freed until the jobstep is complete
      (gdb) print msg
      $2 = (slurm_msg_t *) 0x916008
      (gdb) print rc
      $3 = -1
      (gdb) 
      
      The patch tests for a NULL job argument before the calls that need to dereference the job pointer.
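      A minimal sketch of that guard, applied to the _step_cleanup() listing shown in
      the gdb session above (the actual patch may arrange the checks differently):

      static void
      _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      {
              /* job may be NULL, e.g. when step setup failed before a
               * slurmd_job_t was created (rc == -1 in the core dump above) */
              if (job) {
                      jobacct_gather_g_destroy(job->jobacct);
                      if (!job->batch)
                              job_destroy(job);
              }
              /* ... remainder of the cleanup (msg handling) unchanged ... */
      }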
    • select/cray: zero reservation ID is not an error · 03f984aa
      Moe Jette authored
      This avoids meaningless error messages that warn about a zero reservation ID:
      
       [2011-04-07T15:31:26] _slurm_rpc_submit_batch_job JobId=2870 usec=33390
                             ... a minute later the user decides to scancel the queued job:
       [2011-04-07T15:32:34] error: JobId=2870 has invalid (ZERO) resId
       [2011-04-07T15:32:34] sched: Cancel of JobId=2870 by UID=21770, usec=230
      
      To keep things simple, that test has been removed.
      
      (The patch is also necessary because job_signal() may now trigger a basil_release()
       of a pending job which has no ALPS reservation yet.)
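      A hypothetical before/after sketch of the removed test (variable names are
      illustrative; the error string is the one from the log above):

      /* before: a zero ALPS reservation ID was rejected as an error */
      if (resv_id == 0) {
              error("JobId=%u has invalid (ZERO) resId", job_id);
              return SLURM_ERROR;
      }
      /*
       * after: the check is gone -- resv_id == 0 simply means the pending job
       * has no ALPS reservation yet, so there is nothing to release.
       */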
    • select/cray: release ALPS reservation on termination signals · 12772a3a
      Moe Jette authored
      On rosa we experienced severe problems when jobs were killed via scancel or
      as a result of job timeout. Job cleanup took several minutes and created stray
      processes that consumed resources on the slurmd node, leaving the system
      unable to schedule for long spans.
      
      This problem did not show up on the smaller 2-cabinet XE system (which also
      runs a more recent ALPS version). The fix is to keep further script lines from
      starting, by sending apkill only after the reservation has formally been
      released (see the sketch at the end of this message).
      
      For all signals whose default disposition is to terminate or to dump core,
      the reservation is released before signalling the aprun job steps. This
      prevents a race condition where further aprun lines get executed while the
      apkill of the current aprun line in the job script is in progress.
      
      We did a before/after test on rosa under full load and the problem disappeared.
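      In outline, the new ordering looks as follows (helper names are illustrative,
      not taken verbatim from the source; basil_release() is the ALPS release call
      mentioned in the previous commit):

      if (signal_terminates_by_default(signo)) {
              /* 1. release the ALPS reservation first, so no further aprun
               *    line in the batch script can claim it */
              basil_release(resv_id);
      }
      /* 2. only then apkill the aprun job steps of the current script line */
      signal_job_steps(job_id, signo);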
    • add testimonial from CSCS · 44bec602
      Moe Jette authored