- 10 Apr, 2011 7 commits
-
-
Moe Jette authored
This patch fixes a case where the priority was not recalculated on 'scontrol release', which happens when an authorized operator releases a job, or when the job is released via e.g. the job_submit plugin. The patch reorders the tests in update_job() to
* first test whether the job has been held by the user and, only if not,
* test whether an authorized operator changed the priority or the updated priority is being reduced.
Due to earlier permission checks, we have either
* job_ptr->user_id == uid, or
* authorized,
and in both cases the release-user-hold operation is permitted.
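The reordering described above can be sketched as follows. This is a hypothetical simplification, not the real update_job(): the type `job_t` and the helper `update_priority()` are invented for illustration, and a priority of 0 standing for a held job follows SLURM's convention.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the job record. */
typedef struct {
	unsigned user_id;
	unsigned priority;   /* 0 means the job is held */
	bool     user_held;
} job_t;

/* Sketch of the reordered checks: the user-hold release is tested
 * first; only otherwise do the operator/priority-reduction rules
 * apply.  Earlier permission checks already guarantee that
 * uid == job->user_id or the caller is authorized. */
static bool update_priority(job_t *job, unsigned uid, bool authorized,
			    unsigned new_prio)
{
	(void)uid;	/* already validated by earlier permission checks */

	if (job->user_held && new_prio > 0) {
		/* releasing a user hold: allowed for owner or operator */
		job->user_held = false;
		job->priority  = new_prio;  /* recalculated upstream */
		return true;
	}
	if (authorized || new_prio < job->priority) {
		job->priority = new_prio;
		return true;
	}
	return false;	/* unauthorized priority increase */
}
```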
-
Moe Jette authored
This fix is related to an earlier one and was observed when trying to 'scontrol release' a job previously submitted via 'sbatch --hold' by the same user. Within the job_submit/lua plugin, the user is automatically assigned a partition. So, even though no submitter uid checks are usually expected, a part_check can be performed in the process of releasing a job. In this case, the error messages were
[2011-03-30T18:37:17] _part_access_check: uid 4294967294 access to partition usup denied, bad group
[2011-03-30T18:37:17] error: _slurm_rpc_update_job JobId=12856 uid=21215: User's group not permitted to use this partition
and, as before (in scontrol_update_job()), this was fixed by supplying the UID of the requesting user.
-
Moe Jette authored
-
Moe Jette authored
We build slurm with --enable-memory-leak-debug and twice encountered the same core dump when user 'root' tried to run jobs during a maintenance session. The root user is not in the accounting database, which explains the errors seen below.

palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core
...
Modify: 2011-04-04 19:34:44.000000000 +0200

slurmctld.log:
[2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
[2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
[2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
[2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
[2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
[2011-04-04T19:34:44] completing job 3254
[2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
[2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
[2011-04-04T19:34:44] requeue batch job 3254
[2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285

The gdb session shows that in this invocation the job pointer was NULL:

(gdb) core-file palu7-slurmstepd-6602.core
[New Thread 6604]
Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
Program terminated with signal 11, Segmentation fault.
#0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
413       jobacct_gather_g_destroy(job->jobacct);
(gdb) print job
$1 = (slurmd_job_t *) 0x0
(gdb) list
408
409     #ifdef MEMORY_LEAK_DEBUG
410     static void
411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
412     {
413       jobacct_gather_g_destroy(job->jobacct);
414       if (!job->batch)
415         job_destroy(job);
416       /*
417        * The message cannot be freed until the jobstep is complete
(gdb) print msg
$2 = (slurm_msg_t *) 0x916008
(gdb) print rc
$3 = -1
(gdb)

The patch tests for a NULL job argument before the calls that need to dereference the job pointer.
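The guard described above amounts to dereferencing the job pointer only when it is non-NULL. A minimal sketch, using simplified stand-in types (the real `slurmd_job_t` and `jobacct_gather_g_destroy()` live in slurmstepd):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for slurmstepd's job record. */
typedef struct {
	int batch;
	int jobacct;	/* stands in for the jobacct pointer */
} slurmd_job_t;

static int destroy_calls = 0;

static void jobacct_destroy(int acct)	/* stand-in for the real call */
{
	(void)acct;
	destroy_calls++;
}

static void job_destroy(slurmd_job_t *job)
{
	(void)job;
}

/* Cleanup guarded against a NULL job, as the patch does: only the
 * calls that dereference job are skipped when it is NULL; cleanup
 * that does not touch the job (e.g. the message) still proceeds. */
static void step_cleanup(slurmd_job_t *job)
{
	if (job) {
		jobacct_destroy(job->jobacct);
		if (!job->batch)
			job_destroy(job);
	}
	/* message cleanup, which does not need job, continues here */
}
```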
-
Moe Jette authored
This avoids meaningless error messages that warn about a zero reservation ID:
[2011-04-07T15:31:26] _slurm_rpc_submit_batch_job JobId=2870 usec=33390
... a minute later the user decides to scancel the queued job:
[2011-04-07T15:32:34] error: JobId=2870 has invalid (ZERO) resId
[2011-04-07T15:32:34] sched: Cancel of JobId=2870 by UID=21770, usec=230
To keep things simple, that test has been removed. (The patch is in particular also necessary since job_signal() may now trigger a basil_release() of a pending job which has no ALPS reservation yet.)
-
Moe Jette authored
On rosa we experienced severe problems when jobs were killed via scancel or as a result of job timeout. Job cleanup took several minutes and created stray processes that consumed resources on the slurmd node, leaving the system unable to schedule for long periods. This problem did not show up on the smaller 2-cabinet XE system (which also runs a more recent ALPS version).

The fix is to keep new script lines from starting by sending apkill only after formally releasing the reservation: for all signals whose default disposition is to terminate or to dump core, the reservation is released before signalling the aprun job steps. This prevents a race condition in which further aprun lines get executed while the apkill of the current aprun line in the job script is still in progress. We did a before/after test on rosa under full load and the problem disappeared.
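The ordering described above can be sketched as follows. This is a hypothetical illustration, not the real slurmd code: `release_reservation()`, `signal_steps()`, and `kill_job()` are invented names, and the set of terminating signals shown is a partial example.

```c
#include <signal.h>
#include <stdbool.h>

static int log_idx = 0;
static int action_log[4];	/* records the order of operations:
				 * 1 = release, 2 = signal */

static void release_reservation(void) { action_log[log_idx++] = 1; }
static void signal_steps(int sig) { (void)sig; action_log[log_idx++] = 2; }

/* Partial example: signals whose default disposition is to terminate
 * the process or dump core. */
static bool is_terminating(int sig)
{
	switch (sig) {
	case SIGTERM: case SIGKILL: case SIGINT:
	case SIGQUIT: case SIGSEGV: case SIGABRT:
		return true;
	default:
		return false;
	}
}

/* For terminating signals, release the ALPS reservation *before*
 * signalling the aprun steps, so no further aprun lines from the
 * batch script can start while the apkill is in progress. */
static void kill_job(int sig)
{
	if (is_terminating(sig))
		release_reservation();
	signal_steps(sig);
}
```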
-
Moe Jette authored
-
- 09 Apr, 2011 4 commits
-
-
Don Lipari authored
-
Moe Jette authored
-
Moe Jette authored
Patch from Andrej N. Gritsenko.
-
Moe Jette authored
-
- 08 Apr, 2011 5 commits
-
-
Moe Jette authored
Most of the changes were to support hostnames with a 4-digit suffix. Some other tests are still failing, but will be left as-is for now.
-
Moe Jette authored
they are satisfied.
-
Moe Jette authored
-
Moe Jette authored
rather than the original dependencies. From Dan Rusak, Bull.
-
Moe Jette authored
Add some more logging information.
-
- 07 Apr, 2011 12 commits
-
-
Don Lipari authored
Submitted by Martin Perry / Rod Schultz, Bull.
-
Moe Jette authored
-
Moe Jette authored
avoid purging active jobs.
-
Moe Jette authored
event triggers.
-
Moe Jette authored
errors after slurmctld restarts.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
-
Danny Auble authored
Fix so slurmctld will correctly pack 2.1 step information. (Only needed if a 2.1 client is talking to a 2.2 slurmctld.)
-
Danny Auble authored
-
- 06 Apr, 2011 6 commits
-
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
-
Don Lipari authored
-
Don Lipari authored
-
- 05 Apr, 2011 6 commits
-
-
Moe Jette authored
resources (gres).
-
Moe Jette authored
-
Moe Jette authored
scheduling which prevented some jobs from starting as soon as possible.
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-