1. 13 Jul, 2012 6 commits
    • slurmd: set SLURM_CONF in prolog/epilog environment · b2b5b908
      Mark A. Grondona authored
      Set SLURM_CONF in the default prolog/epilog environment instead
      of only in the spank prolog/epilog environment.
      
      This change fixes a potential hang during spank prolog/epilog
      execution caused by the possibility of memory allocation after
      fork(2) and before exec(2) when invoking the slurmstepd spank
      prolog|epilog.
      
      It also has the benefit that SLURM commands used in prolog and
      epilog scripts will use the correct slurm.conf file.
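      To make the fork/exec-safety point concrete, here is a minimal,
      purely hypothetical sketch (build_env() and run_script() are
      illustrative names, not slurmd's actual functions): the environment,
      including SLURM_CONF, is fully built in the parent, so the child does
      nothing between fork(2) and exec(2) that could allocate memory.
      
          #include <stdlib.h>
          #include <unistd.h>
          
          extern char **environ;
          
          /* Any allocation (setenv may malloc) happens here, in the parent. */
          static char **build_env(const char *conf_path)
          {
              setenv("SLURM_CONF", conf_path, 1);
              return environ;
          }
          
          static int run_script(const char *path, char **env)
          {
              pid_t pid = fork();
          
              if (pid == 0) {
                  /* Child: nothing but exec; no malloc() after fork(). */
                  char *argv[] = { (char *) path, NULL };
                  execve(path, argv, env);
                  _exit(127);                /* exec failed */
              }
              return (pid > 0) ? 0 : -1;     /* parent, or fork failure */
          }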
    • slurmstepd: don't call exec if task fails to get notification from parent · 9006dda4
      Mark A. Grondona authored
      If exec_wait_child_wait_for_parent() fails for any reason, it is safer
      to abort immediately rather than proceed to execute the user's job.
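      A reduced, illustrative version of that child-side check follows;
      wait_for_parent() stands in for exec_wait_child_wait_for_parent()
      and is not the real slurmstepd code:
      
          #include <errno.h>
          #include <unistd.h>
          
          /* Block until the parent writes one byte on the notification pipe. */
          static int wait_for_parent(int parentfd)
          {
              char c;
              ssize_t n;
          
              do {
                  n = read(parentfd, &c, sizeof(c));
              } while (n < 0 && errno == EINTR);
          
              return (n == 1) ? 0 : -1;
          }
          
          static void child_exec(int parentfd, char *const argv[], char *const envp[])
          {
              if (wait_for_parent(parentfd) < 0)
                  _exit(127);        /* abort: do not run the user's job */
              execve(argv[0], argv, envp);
              _exit(127);            /* exec itself failed */
          }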
    • slurmstepd: Kill remaining children if fork fails · 5b8dba9e
      Mark A. Grondona authored
      On a failure of fork(2), slurmstepd would print an error and exit,
      possibly leaving previously forked children waiting.
      
      Ensure a better cleanup by killing all active children on fork failure
      before exiting slurmstepd.
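      In outline, the cleanup amounts to something like the sketch below;
      the exec_wait_info layout and kill_forked_children() are simplified
      stand-ins, not the actual slurmstepd structures:
      
          #include <signal.h>
          #include <sys/types.h>
          
          struct exec_wait_info {
              pid_t pid;        /* forked child, or -1 if fork never happened */
              int   parentfd;   /* write end used to release the child */
              int   childfd;    /* read end the child blocks on */
          };
          
          /* On fork(2) failure, kill every child forked so far instead of
           * exiting and leaving them blocked on their notification pipes. */
          static void kill_forked_children(struct exec_wait_info *ei, int ntasks)
          {
              for (int i = 0; i < ntasks; i++) {
                  if (ei[i].pid > 0)
                      kill(ei[i].pid, SIGKILL);
              }
          }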
    • slurmstepd: Close childfd of exec_wait_info in parent · eca089e3
      Mark A. Grondona authored
      Close the read end of the pipe slurmstepd uses to notify children
      that it is time to call exec(2), in order to save one file descriptor
      per task. (Previously, the read side of the pipe wasn't closed until
      the exec_wait_info was destroyed.)
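      A sketch of that descriptor handling, under the assumption of a
      simplified exec_wait_info (the real code lives in src/slurmd/slurmstepd/):
      after fork(2) the parent closes the child's read end right away instead
      of holding it until the structure is destroyed, saving one descriptor
      per task.
      
          #include <sys/types.h>
          #include <unistd.h>
          
          struct exec_wait_info {
              pid_t pid;
              int   parentfd;   /* parent writes here to release the child */
              int   childfd;    /* child reads here until told to exec */
          };
          
          static int exec_wait_fork_child(struct exec_wait_info *ei)
          {
              int fds[2];
          
              if (pipe(fds) < 0)
                  return -1;
              ei->childfd  = fds[0];
              ei->parentfd = fds[1];
          
              ei->pid = fork();
              if (ei->pid == 0) {
                  close(ei->parentfd);   /* child keeps only the read end */
                  /* ... wait for the parent's notification, then exec ... */
                  _exit(0);
              }
              if (ei->pid > 0) {
                  close(ei->childfd);    /* parent no longer needs the read end */
                  ei->childfd = -1;
                  return 0;
              }
              return -1;                 /* fork failed */
          }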
    • squeue: report number of nodes in completing for completing jobs · 2ddc6e70
      Mark A. Grondona authored
      For some reason squeue was treating completing jobs the same as
      pending jobs and reported the number of nodes as the maximum of the
      requested nodelist size, the requested node count, or the requested
      CPUs (divided into nodes?).
      
      This contradicts the squeue man page, which explicitly states that
      the number of nodes reported for completing jobs should be only the
      nodes that are still allocated to the job.
      
      This patch removes the special handling of completing jobs in
      src/squeue/print.c:_get_node_cnt(), so that the squeue output for
      completing jobs matches the documentation. A comment is also added
      so that developers looking at the code understand what is going on.
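      The corrected behavior can be summarized with a small, purely
      illustrative stand-in for _get_node_cnt(); the field names below are
      assumptions, not the real slurm.h definitions:
      
          #include <stdint.h>
          
          /* Simplified stand-in for slurm's job_info_t. */
          typedef struct {
              int      completing;   /* nonzero once the job is COMPLETING */
              uint32_t num_nodes;    /* nodes currently allocated to the job */
              uint32_t max_nodes;    /* upper bound derived from the request */
          } job_info_stub_t;
          
          static uint32_t get_node_cnt(const job_info_stub_t *job)
          {
              /* Completing jobs: report only the nodes still allocated,
               * as documented in squeue(1). */
              if (job->completing)
                  return job->num_nodes;
          
              /* Pending jobs keep the estimate derived from the request. */
              return job->max_nodes ? job->max_nodes : job->num_nodes;
          }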
  2. 12 Jul, 2012 10 commits
  3. 11 Jul, 2012 4 commits
  4. 09 Jul, 2012 1 commit
  5. 06 Jul, 2012 1 commit
    • Fix for incorrect partition point for job · dd1d573f
      Carles Fenoy authored
      If a job is submitted to more than one partition, its partition
      pointer can be set to an invalid value. This can result in an
      incorrect count of CPUs allocated on a node, causing over- or
      under-allocation of its CPUs.
      Patch by Carles Fenoy, BSC.
      
      Hi all,
      
      After a tough day I've finally found the problem and a solution for 2.4.1.
      I was able to reproduce the described behavior by submitting jobs to two
      partitions. The job is then allocated in one partition, but in the schedule
      function the job's partition pointer is changed to the non-allocated one,
      which means the resources cannot be freed at the end of the job.
      
      I've solved this by moving the IS_PENDING test a few lines within the
      schedule function (job_scheduler.c).
      
      This is the code from the git HEAD (line 801). As this file has changed a
      lot since 2.4.x I have not made a patch, but I'm describing the solution
      here: I've moved the if (!IS_JOB_PENDING) test after the second line
      (part_ptr = ...). This prevents the job's partition pointer from being
      changed if the job is already starting in another partition.
      
          job_ptr = job_queue_rec->job_ptr;
          part_ptr = job_queue_rec->part_ptr;
          job_ptr->part_ptr = part_ptr;
          xfree(job_queue_rec);
          if (!IS_JOB_PENDING(job_ptr))
              continue;    /* started in other partition */
      Hope this is enough information to solve it.
      
      I've just realized (while writing this mail) that my solution has a memory leak as job_queue_rec is not freed.
      
      Regards,
      Carles Fenoy
  6. 04 Jul, 2012 1 commit
  7. 03 Jul, 2012 4 commits
  8. 02 Jul, 2012 5 commits
  9. 29 Jun, 2012 2 commits
  10. 28 Jun, 2012 2 commits
  11. 27 Jun, 2012 3 commits
  12. 26 Jun, 2012 1 commit