Job priority reset bug on slurmctld restart
The commit 8b14f388 on Jan 19, 2011 is causing problems on Moab cluster-scheduled machines. In that configuration, Moab immediately hands every submitted job off to SLURM, where it receives a zero priority (held). Once Moab schedules the job, Moab raises the job's priority to 10,000,000 and the job runs.

When slurmctld is restarted under these conditions, the sync_job_priorities() function runs; it attempts to raise job priorities into a higher range when they are getting too close to zero. The problem, as I see it, is that this "boost" is also applied to zero-priority jobs. So once slurmctld is restarted, a batch of previously held zero-priority jobs suddenly becomes eligible, and there is a disconnect between the top-priority job Moab is trying to start and the top-priority job SLURM sees.

I believe the fix is simple:

diff job_mgr.c~ job_mgr.c
6328,6329c6328,6331
<	while ((job_ptr = (struct job_record *) list_next(job_iterator)))
<		job_ptr->priority += prio_boost;
---
>	while ((job_ptr = (struct job_record *) list_next(job_iterator))) {
>		if (job_ptr->priority)
>			job_ptr->priority += prio_boost;
>	}

Do you agree?

Don
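To make the intended behavior concrete, here is a minimal standalone sketch of the patched loop. It is not the actual SLURM source: the job list and iterator are reduced to a plain array, and the job fields and prio_boost value are simplified stand-ins. Only the skip of zero-priority jobs mirrors the proposed change to sync_job_priorities().

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for SLURM's job_record (hypothetical fields). */
struct job_record {
	uint32_t job_id;
	uint32_t priority;	/* 0 == held, e.g. awaiting Moab scheduling */
};

/* Sketch of the patched boost loop: raise every priority by
 * prio_boost, but leave held (zero-priority) jobs alone so a
 * slurmctld restart cannot make them eligible. */
static void boost_job_priorities(struct job_record *jobs, size_t njobs,
				 uint32_t prio_boost)
{
	for (size_t i = 0; i < njobs; i++) {
		if (jobs[i].priority)
			jobs[i].priority += prio_boost;
	}
}

int main(void)
{
	struct job_record jobs[] = {
		{ 101, 0 },		/* held, waiting for Moab */
		{ 102, 10000000 },	/* already scheduled by Moab */
		{ 103, 0 },		/* held, waiting for Moab */
	};
	size_t njobs = sizeof(jobs) / sizeof(jobs[0]);

	boost_job_priorities(jobs, njobs, 500000);

	/* Jobs 101 and 103 stay at priority 0; only job 102 is boosted. */
	for (size_t i = 0; i < njobs; i++)
		printf("job %u: priority %u\n", jobs[i].job_id,
		       jobs[i].priority);
	return 0;
}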