1. 05 Feb, 2013 1 commit
    • Fix race condition in job dependency logic · ff26cc50
      Morris Jette authored
      If a job involved in a dependency completes and is purged, the logic
      used to test for circular dependencies can dereference the now-invalid
      pointer and generate an invalid memory reference before the pointer is
      cleared from the dependency list data structure.
      ff26cc50
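
      Below is a minimal sketch of the ordering this fix implies, using
      illustrative names (depend_spec, job_record, clear_depend_refs) rather
      than the real slurmctld structures: a purged job's pointer is cleared
      from every dependency list before the job record is freed, and the
      circular-dependency walk skips cleared entries.

      #include <stdbool.h>
      #include <stdint.h>

      struct job_record;                     /* forward declaration */

      struct depend_spec {
          uint32_t job_id;                   /* job depended upon            */
          struct job_record *job_ptr;        /* NULL once that job is purged */
          struct depend_spec *next;
      };

      struct job_record {
          uint32_t job_id;
          struct depend_spec *depend_list;
          struct job_record *next;           /* global job list, simplified  */
      };

      /* Called before the purged job record is freed: drop every cached
       * pointer to it so later walks never touch freed memory. */
      static void clear_depend_refs(struct job_record *job_list,
                                    const struct job_record *purged)
      {
          for (struct job_record *j = job_list; j; j = j->next) {
              for (struct depend_spec *d = j->depend_list; d; d = d->next) {
                  if (d->job_ptr == purged)
                      d->job_ptr = NULL;
              }
          }
      }

      /* Depth-first walk of the dependency chain; entries whose job has
       * been purged (job_ptr == NULL) are simply skipped. */
      static bool circular_dependency(const struct job_record *job_ptr,
                                      uint32_t start_job_id)
      {
          const struct depend_spec *d = job_ptr ? job_ptr->depend_list : NULL;
          for (; d; d = d->next) {
              if (d->job_ptr == NULL)
                  continue;                  /* dependency already purged */
              if (d->job_ptr->job_id == start_job_id ||
                  circular_dependency(d->job_ptr, start_job_id))
                  return true;
          }
          return false;
      }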
  2. 24 Jul, 2012 1 commit
  3. 23 May, 2012 2 commits
  4. 22 May, 2012 1 commit
  5. 16 May, 2012 2 commits
  6. 09 May, 2012 2 commits
  7. 07 May, 2012 1 commit
    • Job priority reset bug on slurmctld restart · 5e9dca41
      Don Lipari authored
      The commit 8b14f388 on Jan 19, 2011 is causing problems with Moab cluster-scheduled machines.  In this case, Moab hands every submitted job off to SLURM immediately, and the job gets a zero priority.  Once Moab schedules the job, Moab raises the job's priority to 10,000,000 and the job runs.
      
      When you happen to restart the slurmctld under such conditions, the sync_job_priorities() function runs, which attempts to raise job priorities into a higher range if they are getting too close to zero.  The problem as I see it is that you include the "boost" for zero-priority jobs.  Hence the problem we are seeing is that once the slurmctld is restarted, a bunch of zero-priority jobs are suddenly eligible.  This creates a disconnect between the top-priority job Moab is trying to start and the top-priority job SLURM sees.
      
      I believe the fix is simple:
      
      diff job_mgr.c~ job_mgr.c
      6328,6329c6328,6331
      <       while ((job_ptr = (struct job_record *) list_next(job_iterator)))
      <               job_ptr->priority += prio_boost;
      ---
       >       while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
       >               if (job_ptr->priority)
       >                       job_ptr->priority += prio_boost;
       >       }
      Do you agree?
      
      Don
      5e9dca41
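
      For reference, here is a self-contained sketch of the corrected loop.
      The list handling is a simplified stand-in for SLURM's List API, but
      the condition mirrors the diff above: held jobs (priority == 0) are
      not boosted, so a slurmctld restart cannot suddenly make them eligible.

      #include <stdint.h>
      #include <stddef.h>

      struct job_record {
          uint32_t job_id;
          uint32_t priority;           /* 0 means the job is held */
          struct job_record *next;
      };

      static void sync_job_priorities_sketch(struct job_record *job_list,
                                             uint32_t prio_boost)
      {
          for (struct job_record *job_ptr = job_list; job_ptr != NULL;
               job_ptr = job_ptr->next) {
              if (job_ptr->priority)   /* skip held (zero-priority) jobs */
                  job_ptr->priority += prio_boost;
          }
      }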
  8. 03 May, 2012 2 commits
  9. 27 Apr, 2012 1 commit
  10. 26 Apr, 2012 1 commit
  11. 25 Apr, 2012 1 commit
    • Append "*" to default partition name with format and no size · 77645508
      Don Albert authored
      There is a minor problem with the display of partition names in
      "sinfo".  Without options, the partition name field displays an
      asterisk "*" at the end of the name of the default partition.  If you
      specify a formatting option that contains the %P field specifier with
      a width option (e.g., sinfo -o %8P), the asterisk is also appended to
      the default partition name.  With no width option, "%P" displays the
      name based on the full length of the name string; however, no "*" is
      appended to the default partition name.
      
      The attached patch for version 2.4.0-pre4 corrects the problem so that
      the "*" is correctly appended when %P with no width specifier is
      used. The patch will also apply to version 2.3.4.
      
        -Don Albert-
      77645508
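
      A short sketch of the behavior the patch aims for, with snprintf/printf
      standing in for sinfo's print routines (illustrative, not the actual
      code path): the "*" is appended to the default partition's name first,
      and only then is the optional field width applied, so both the %8P and
      the plain %P cases show the marker.

      #include <stdbool.h>
      #include <stdio.h>

      static void print_partition_name(const char *name, bool is_default,
                                       int field_width /* 0 = no width given */)
      {
          char buf[64];

          /* Build "name*" (or just "name") before any padding is applied. */
          snprintf(buf, sizeof(buf), "%s%s", name, is_default ? "*" : "");
          if (field_width > 0)
              printf("%-*s", field_width, buf);   /* e.g. sinfo -o %8P */
          else
              printf("%s", buf);                  /* plain %P, full length */
      }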
  12. 24 Apr, 2012 1 commit
  13. 23 Apr, 2012 2 commits
  14. 20 Apr, 2012 1 commit
  15. 17 Apr, 2012 1 commit
  16. 12 Apr, 2012 3 commits
  17. 10 Apr, 2012 6 commits
  18. 05 Apr, 2012 1 commit
    • Prevent users from extending the EndTime of running jobs · 62edab22
      Don Lipari authored
      While safeguards are in place to prevent unauthorized users from extending the
      TimeLimit of their running jobs, there were no such restrictions for extending
      the EndTime.  This patch adds the same constraints to modifying EndTime that
      currently exist for modifying TimeLimit.
      62edab22
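
      A minimal sketch of the kind of check this adds, with simplified types
      and an illustrative error value (the real slurmctld uses its own job
      structures and errno codes): an unprivileged user may shorten a running
      job's EndTime, but any attempt to extend it is rejected, mirroring the
      existing TimeLimit rule.

      #include <stdbool.h>
      #include <time.h>

      #define ESLURM_ACCESS_DENIED (-1)   /* simplified stand-in */

      struct job_record {
          time_t end_time;
      };

      static int update_end_time(struct job_record *job_ptr, time_t new_end,
                                 bool operator /* true for admin/operator */)
      {
          if ((new_end > job_ptr->end_time) && !operator)
              return ESLURM_ACCESS_DENIED;   /* only operators may extend */
          job_ptr->end_time = new_end;
          return 0;                          /* success */
      }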
  19. 03 Apr, 2012 1 commit
    • Limit depth of circular job dependency check · 0caecbc5
      Morris Jette authored
      Add support for a new SchedulerParameters option, max_depend_depth, defining
      the maximum number of jobs to test for circular dependencies (i.e., job A
      waits for job B to start and job B waits for job A to start). The default
      value is 10 jobs.
      0caecbc5
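
      The option would presumably be set through SchedulerParameters in
      slurm.conf (e.g. SchedulerParameters=max_depend_depth=20). Below is an
      illustrative sketch of a depth-limited dependency walk; the parsing and
      job structures are simplified stand-ins, not the actual slurmctld code.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>

      static int max_depend_depth = 10;      /* documented default */

      /* e.g. sched_params = "defer,max_depend_depth=20" */
      static void parse_sched_params(const char *sched_params)
      {
          const char *opt = strstr(sched_params, "max_depend_depth=");
          if (opt)
              max_depend_depth = atoi(opt + strlen("max_depend_depth="));
      }

      struct job_record {
          uint32_t job_id;
          struct job_record *depends_on;     /* single dependency, for brevity */
      };

      /* Return true if following job_ptr's dependency chain loops back to
       * start_id within max_depend_depth hops; give up (and assume no cycle)
       * once the limit is reached. */
      static bool is_circular(const struct job_record *job_ptr, uint32_t start_id)
      {
          const struct job_record *dep = job_ptr ? job_ptr->depends_on : NULL;
          for (int depth = 0; dep && (depth < max_depend_depth); depth++) {
              if (dep->job_id == start_id)
                  return true;
              dep = dep->depends_on;
          }
          return false;
      }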
  20. 02 Apr, 2012 1 commit
  21. 30 Mar, 2012 1 commit
  22. 29 Mar, 2012 2 commits
    • Fix in select/cons_res+topology+job with node range count · f64b29a2
      Morris Jette authored
      The problem was conflicting logic in the select/cons_res plugin. Some of the code was trying to give the job the maximum node count in the range, while other logic was trying to minimize spreading the job out across multiple switches. As you note, this problem only happens when a range of node counts is specified together with the select/cons_res plugin and the topology/tree plugin, and even then it is not easy to reproduce (you included all of the details below).
      
      Quoting Martin.Perry@Bull.com:
      
      > Certain combinations of topology configuration and srun -N option produce
      > spurious job rejection with "Requested node configuration is not
      > available" with select/cons_res. The following example illustrates the
      > problem.
      >
      > [sulu] (slurm) etc> cat slurm.conf
      > ...
      > TopologyPlugin=topology/tree
      > SelectType=select/cons_res
      > SelectTypeParameters=CR_Core
      > ...
      >
      > [sulu] (slurm) etc> cat topology.conf
      > SwitchName=s1 Nodes=xna[13-26]
      > SwitchName=s2 Nodes=xna[41-45]
      > SwitchName=s3 Switches=s[1-2]
      >
      > [sulu] (slurm) etc> sinfo
      > PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
      > ...
      > jkob         up   infinite      4   idle xna[14,19-20,41]
      > ...
      >
      > [sulu] (slurm) etc> srun -N 2-4 -n 4 -p jkob hostname
      > srun: Force Terminated job 79
      > srun: error: Unable to allocate resources: Requested node configuration is
      > not available
      >
      > The problem does not occur with select/linear, or topology/none, or if -N
      > is omitted, or for certain other values for -N (for example, -N 4-4 and -N
      > 2-3 work ok). The problem seems to be in function _eval_nodes_topo in
      > src/plugins/select/cons_res/job_test.c. The srun man page states that when
      > -N is used, "the job will be allocated as many nodes as possible within
      > the range specified and without delaying the initiation of the job."
      > Consistent with this description, the requested number of nodes in the
      > above example is 4 (req_nodes=4).  However, the code that selects the
      > best-fit topology switches appears to make the selection based on the
      > minimum required number of nodes (min_nodes=2). It therefore selects
      > switch s1.  s1 has only 3 nodes from partition jkob. Since this is fewer
      > than req_nodes the job is rejected with the "node configuration" error.
      >
      > I'm not sure where the code is going wrong.  It could be in the
      > calculation of the number of needed nodes in function _enough_nodes.  Or
      > it could be in the code that initializes/updates req_nodes or rem_nodes. I
      > don't feel confident that I understand the logic well enough to propose a
      > fix without introducing a regression.
      >
      > Regards,
      > Martin
      f64b29a2
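
      An illustrative sketch of the distinction at issue (the switch table and
      selection loop below are simplified stand-ins for the logic in
      src/plugins/select/cons_res/job_test.c): the best-fit leaf switch has to
      be chosen against the node count the job is actually trying to obtain
      (req_nodes, the upper end of the -N range), not the minimum (min_nodes).

      #include <stdint.h>
      #include <stddef.h>

      struct switch_rec {
          const char *name;
          uint32_t avail_nodes;   /* idle nodes reachable through this switch */
      };

      /* Pick the smallest leaf switch that can still supply want_nodes;
       * return NULL when no single switch is big enough, in which case the
       * caller must span multiple switches rather than reject the job. */
      static const struct switch_rec *
      best_fit_switch(const struct switch_rec *sw, size_t sw_cnt,
                      uint32_t want_nodes)
      {
          const struct switch_rec *best = NULL;

          for (size_t i = 0; i < sw_cnt; i++) {
              if (sw[i].avail_nodes < want_nodes)
                  continue;
              if (!best || (sw[i].avail_nodes < best->avail_nodes))
                  best = &sw[i];
          }
          return best;
      }

      With the numbers from the report (3 idle partition nodes under s1, 1
      under s2, req_nodes = 4), calling this with want_nodes = min_nodes = 2
      selects s1 and the job is later rejected; calling it with want_nodes =
      req_nodes = 4 returns NULL, and the allocation must instead span s1 and
      s2 through s3.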
    • Format change, no change in logic · ebca432e
      Morris Jette authored
      ebca432e
  23. 27 Mar, 2012 2 commits
  24. 26 Mar, 2012 1 commit
  25. 21 Mar, 2012 2 commits