- 08 Oct, 2013 1 commit
-
Morris Jette authored
The EpilogSlurmctld pthread is passed the required arguments rather than a pointer to the job record, which under some conditions could be purged and result in an invalid memory reference.
-
- 02 Oct, 2013 1 commit
-
Morris Jette authored
bug 436
-
- 23 Sep, 2013 1 commit
-
Morris Jette authored
bug 428
-
- 13 Aug, 2013 2 commits
-
jette authored
I don't see how this could happen, but it might explain something reported by Harvard University. In any case, this could prevent an infinite loop if the task distribution function is passed a job allocation with zero nodes.
-
jette authored
This problem was reported by Harvard University and could be reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id". With other job types, the error message could be logged many times for each job. This change logs the error once per job and only if the job request does not include the -O/--overcommit option.
-
- 05 Jul, 2013 1 commit
-
jette authored
-
- 28 Jun, 2013 1 commit
-
Morris Jette authored
Affects jobs with the --exclusive and --cpus-per-task options. bug 355
-
- 25 Jun, 2013 1 commit
-
David Gloe authored
The SLURM Makefile.am scripts use pkglibexecdir. One source indicates that this was not added until automake 1.10.2 (https://github.com/rerun/rerun/issues/167). So we made that the minimum required automake version.
-
- 21 Jun, 2013 4 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
default and appeared to break other things.
-
Danny Auble authored
-
- 19 Jun, 2013 1 commit
-
Danny Auble authored
-
- 12 Jun, 2013 1 commit
-
Morris Jette authored
if on "scontrol reconfig" when AllowNodes is manually changed using scontrol since last slurmctld restart.
-
- 11 Jun, 2013 2 commits
-
Gennaro Oliva authored
-
Danny Auble authored
correctly with POE when attempting to run > 32 nodes.
-
- 10 Jun, 2013 1 commit
-
Morris Jette authored
due to either down nodes or explicit resizing. Generated slurmctld errors of this type:

[2013-06-04T12:43:46+06:00] error: gres/gpu: step_test 68662.4294967294 gres_bit_alloc is NULL

This is a movement of the logic introduced in commit https://github.com/SchedMD/slurm/commit/6fff97bb77d2d88aa808c47fd7880246a0c1d090 to eliminate a memory leak.
-
- 06 Jun, 2013 1 commit
-
Mark Nelson authored
-
- 05 Jun, 2013 4 commits
-
Danny Auble authored
-
Danny Auble authored
We don't currently track energy usage per task (only per step); otherwise we would get double the energy.
-
Danny Auble authored
-
Janne Blomqvist authored
Andy Wettstein (University of Chicago) reported privately to me that slurmctld 2.5.4 crashed after he enabled the priority/multifactor2 plugin due to a division by zero error. I was able to reproduce the crash by creating an account hierarchy where all the accounts and users had zero shares. See bug 315
-
- 04 Jun, 2013 3 commits
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
Without this change, it appears that POE ignores the -procs argument, resulting in a job step request with multiple host names but only one task required.
-
- 03 Jun, 2013 2 commits
-
jette authored
Previously, if the required node had no available CPUs left, other nodes in the job allocation would be used.
-
Hongjia Cao authored
We're having some trouble getting our slurm jobs to successfully restart after a checkpoint. For this test, I'm using sbatch and a simple, single-threaded executable. Slurm is 2.5.4, blcr is 0.8.5. I'm submitting the job using sbatch:

$ sbatch -n 1 -t 12:00:00 bin/bowtie-ex.sh

I am able to create the checkpoint and vacate the node:

$ scontrol checkpoint create 137
.... time passes ....
$ scontrol vacate 137

At that point, I see the checkpoint file from blcr in the current directory and the checkpoint file from Slurm in /var/spool/slurm-llnl/checkpoint. However, when I attempt to restart the job:

$ scontrol checkpoint restart 137
scontrol_checkpoint error: Node count specification invalid

In slurmctld's log (at level 7) I see:

[2013-05-29T12:41:08-07:00] debug2: Processing RPC: REQUEST_CHECKPOINT(restart) from uid=*****
[2013-05-29T12:41:08-07:00] debug3: Version string in job_ckpt header is JOB_CKPT_002
[2013-05-29T12:41:08-07:00] _job_create: max_nodes == 0
[2013-05-29T12:41:08-07:00] _slurm_rpc_checkpoint restart 137: Node count specification invalid
-
- 30 May, 2013 1 commit
-
Morris Jette authored
Uninitialized variables resulted in errors of the form "cons_res: sync loop not progressing, holding job #".
-
- 29 May, 2013 1 commit
-
jette authored
The most notable problem case is on a Cray, where a job step specifically requests one or more nodes that are not the first nodes in the job allocation.
-
- 23 May, 2013 8 commits
-
Morris Jette authored
The problem we have observed is that the backfill scheduler temporarily gives up its locks (for one second), but then reclaims them before the backlog of work completes, keeping the backfill scheduler running for a very long time under heavy load. bug 297
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Fix minor bug in the sdiag backfill scheduling time reported on BlueGene systems. Improve explanation of the backfill scheduling cycle time calculation.
-
Morris Jette authored
Defer (rather than forget) a reboot request when a job is running on the node within a reservation.
-
Danny Auble authored
-
Danny Auble authored
-
- 22 May, 2013 3 commits
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-