Job priority reset bug on slurmctld restart
The commit 8b14f388 on Jan 19, 2011 is causing problems on Moab cluster-scheduled machines. In that configuration, Moab immediately hands every submitted job off to SLURM, where it receives a zero priority (held). Once Moab schedules the job, Moab raises the job's priority to 10,000,000 and the job runs.

When slurmctld is restarted under these conditions, the sync_job_priorities() function runs; it attempts to raise job priorities into a higher range when they are getting too close to zero. The problem, as I see it, is that this "boost" is also applied to zero-priority jobs. So once slurmctld is restarted, a batch of previously held zero-priority jobs suddenly becomes eligible, and there is a disconnect between the top-priority job Moab is trying to start and the top-priority job SLURM sees.

I believe the fix is simple:

diff job_mgr.c~ job_mgr.c
6328,6329c6328,6331
<	while ((job_ptr = (struct job_record *) list_next(job_iterator)))
<		job_ptr->priority += prio_boost;
---
>	while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
>		if (job_ptr->priority)
>			job_ptr->priority += prio_boost;
>	}

Do you agree?

Don
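To make the intended behavior concrete, here is a minimal standalone sketch of the patched loop. It is not the actual SLURM source: the job list and iterator are reduced to a plain array, and the job fields and prio_boost value are simplified stand-ins. Only the skip of zero-priority jobs mirrors the proposed change to sync_job_priorities().

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for SLURM's job_record (hypothetical fields). */
struct job_record {
	uint32_t job_id;
	uint32_t priority;	/* 0 == held, e.g. awaiting Moab scheduling */
};

/* Sketch of the patched boost loop: raise every priority by
 * prio_boost, but leave held (zero-priority) jobs alone so a
 * slurmctld restart cannot make them eligible. */
static void boost_job_priorities(struct job_record *jobs, size_t njobs,
				 uint32_t prio_boost)
{
	for (size_t i = 0; i < njobs; i++) {
		if (jobs[i].priority)
			jobs[i].priority += prio_boost;
	}
}

int main(void)
{
	struct job_record jobs[] = {
		{ 101, 0 },		/* held, waiting for Moab */
		{ 102, 10000000 },	/* already scheduled by Moab */
		{ 103, 0 },		/* held, waiting for Moab */
	};
	size_t njobs = sizeof(jobs) / sizeof(jobs[0]);

	boost_job_priorities(jobs, njobs, 500000);

	/* Jobs 101 and 103 stay at priority 0; only job 102 is boosted. */
	for (size_t i = 0; i < njobs; i++)
		printf("job %u: priority %u\n", jobs[i].job_id,
		       jobs[i].priority);
	return 0;
}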