- 07 Nov, 2012 2 commits
-
Danny Auble authored
-
Danny Auble authored
specifying the number of tasks and not the number of nodes.
-
- 05 Nov, 2012 2 commits
-
Morris Jette authored
On job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, then send SIGKILL. Previously only SIGKILL was sent to the tasks.
-
Morris Jette authored
On job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, then send SIGKILL. Previously only SIGKILL was sent to the tasks.
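A minimal sketch of that escalation sequence, assuming a hypothetical terminate_task() helper and a kill_wait argument standing in for the KillWait configuration value; it illustrates the order of signals rather than the actual slurmd code path.

    #include <signal.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: escalate from a gentle to a forced kill. */
    static void terminate_task(pid_t pid, unsigned int kill_wait)
    {
            kill(pid, SIGCONT);     /* wake the task in case it is stopped */
            kill(pid, SIGTERM);     /* ask it to exit cleanly              */
            sleep(kill_wait);       /* KillWait seconds from slurm.conf    */
            kill(pid, SIGKILL);     /* force termination if still running  */
    }

    int main(void)
    {
            pid_t pid = fork();

            if (pid < 0)
                    return 1;               /* fork failed */
            if (pid == 0) {                 /* child stands in for a task */
                    execlp("sleep", "sleep", "60", (char *) NULL);
                    _exit(1);
            }
            terminate_task(pid, 2);         /* short KillWait for the demo */
            return 0;
    }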
-
- 02 Nov, 2012 3 commits
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 31 Oct, 2012 1 commit
-
Mark Nelson authored
launching that wouldn't be able to run to completion because of a GrpCPUMins limit.
-
- 29 Oct, 2012 2 commits
-
Morris Jette authored
Now try to get the maximum node count rather than minimizing the number of leaf switches used. For example, if each leaf switch has 8 nodes, a request for -N4-16 previously allocated 8 nodes (one leaf switch); it now allocates 16 nodes across two leaf switches.
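A self-contained illustration of the change, under assumed names: the loop keeps adding leaf switches until the job's maximum requested node count is met instead of stopping once the minimum is reachable. pick_node_count() and the idle-node array are hypothetical, not the topology plugin's data structures.

    #include <stdio.h>

    /* Hypothetical helper: walk leaf switches, each with some idle nodes,
     * and keep adding switches until the job's maximum node count is met
     * (the old behaviour stopped as soon as min_nodes was reachable). */
    static int pick_node_count(const int *idle_per_switch, int nswitch,
                               int min_nodes, int max_nodes)
    {
            int total = 0, i;

            for (i = 0; i < nswitch && total < max_nodes; i++)
                    total += idle_per_switch[i];
            if (total > max_nodes)
                    total = max_nodes;
            return (total >= min_nodes) ? total : -1;  /* -1: cannot satisfy */
    }

    int main(void)
    {
            int idle[] = { 8, 8 };          /* two leaf switches, 8 nodes each */

            /* -N4-16: now yields 16 nodes over two switches, not 8 over one. */
            printf("%d\n", pick_node_count(idle, 2, 4, 16));
            return 0;
    }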
-
Morris Jette authored
Anyhow, after applying the patch, I was still running into the same difficulty. Upon a closer look, I saw that I was still receiving the ALPS backend error in the slurmctld.log file. When I examined the code pertaining to this and ran some SLURM-independent tests, I found that we were executing the do_basil_confirm function multiple times in the cases where it would fail. My independent tests show precisely the same behaviour: if you make a reservation request, successfully confirm it, and then attempt to confirm it again, you receive this error message. However, the "apstat -rvv" command shows that the ALPS reservation is fine, so I concluded that this particular ALPS/BASIL message is more of an informational one and not a show-stopper. In other words, I can consider the node ready at this point.

As a simple workaround, I currently just inserted an if-block immediately after the call to "basil_confirm" in function "do_basil_confirm" in ".../src/plugins/select/cray/basil_interface.c". The if-statement checks for "BE_BACKEND"; if that is the result, it prints an informational message to slurmctld.log and sets the variable rc=0 so that we can consider the node ready. This now allows my prolog scripts to run, and I can clearly see the SLURM message that I had placed in that if-block.

However, I am not certain that we really should just allow this error code to pass through, as it seems like a fairly generic code and there could be various other causes of it where we would not wish to let it pass. I really only want to limit the number of calls to basil_confirm to one. Perhaps I could add a field to the job_record so that I can mark whether the ALPS reservation had been confirmed or not.
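A minimal, self-contained sketch of the workaround described above; basil_confirm_stub(), BE_BACKEND_ERR, and the log text are stand-ins for the real BASIL interface in basil_interface.c, and only the control flow (treat the backend error as informational and set rc = 0) is meant to match the report.

    #include <stdio.h>

    #define BE_BACKEND_ERR 1        /* stand-in for the BASIL backend error code */

    /* Stand-in for basil_confirm(): returns the backend error, as happens
     * when an already-confirmed ALPS reservation is confirmed again. */
    static int basil_confirm_stub(unsigned int resv_id)
    {
            (void) resv_id;
            return BE_BACKEND_ERR;
    }

    static int do_basil_confirm_sketch(unsigned int resv_id)
    {
            int rc = basil_confirm_stub(resv_id);

            if (rc == BE_BACKEND_ERR) {
                    /* apstat -rvv shows the reservation is fine, so log an
                     * informational message and treat the node as ready. */
                    printf("ALPS reservation %u already confirmed, continuing\n",
                           resv_id);
                    rc = 0;
            }
            return rc;
    }

    int main(void)
    {
            return do_basil_confirm_sketch(42);
    }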
-
- 26 Oct, 2012 2 commits
-
Danny Auble authored
instead of putting 0's into the database we put NO_VALS and have sacct figure out jobacct_gather wasn't used.
-
Olli-Pekka Lehto authored
-
- 25 Oct, 2012 2 commits
-
Morris Jette authored
Incorrect error codes returned in some cases, especially if the slurmdbd is down
-
Morris Jette authored
-
- 24 Oct, 2012 1 commit
-
Morris Jette authored
Previously, for Linux systems, all information was placed on a single line.
-
- 23 Oct, 2012 1 commit
-
Danny Auble authored
interface instead of through the normal method.
-
- 22 Oct, 2012 3 commits
-
Danny Auble authored
-
Morris Jette authored
-
Don Albert authored
-
- 19 Oct, 2012 1 commit
-
Morris Jette authored
-
- 18 Oct, 2012 6 commits
-
Danny Auble authored
for passthrough gets removed on a dynamic system.
-
Danny Auble authored
in error for passthrough.
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
user's pending allocation was started with srun, then for some reason the slurmctld was brought down, and while it was down the srun was removed.
-
Danny Auble authored
This is needed when a free request is added to a block while jobs are still finishing up, so we don't start new jobs on the block, since they would fail on start.
-
- 17 Oct, 2012 2 commits
-
Morris Jette authored
Previously the node count would change from c-node count to midplane count (but still be interpreted as a c-node count).
-
jette authored
No real changes to logic other than some additional error checking.
-
- 16 Oct, 2012 2 commits
-
Morris Jette authored
Preempt jobs only when insufficient idle resources exist to start the job, regardless of node weight.
-
Danny Auble authored
-
- 05 Oct, 2012 2 commits
-
Morris Jette authored
Preemptor was not being scheduled. Fix for bugzilla #3.
-
Morris Jette authored
While this change lets gang scheduling happen, it overallocates resources from different priority partitions when gang scheduling is not running.
-
- 04 Oct, 2012 1 commit
-
Morris Jette authored
Preemptor was not being scheduled. See bugzilla #3 for details
-
- 02 Oct, 2012 2 commits
-
Morris Jette authored
See bugzilla bug 132.

When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD (2.5.0-pre3).

Conditions:
-----------
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"

Description:
------------
In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has sufficient memory for a job. However, the per-CPU memory limits appear to be scaled by the number of threads. This new value may exceed the available memory on the node. And, once a node is overcommitted on memory, future memory checks in _verify_node_state() will always succeed.

Scenario to reproduce:
----------------------
With the example node linux0, we run a single-core job with 250MB/core:
  srun --mem-per-cpu=250 sleep 60
cons_res checks that it will fit: ((real - alloc) >= job mem), ((400 - 0) >= 250), and the job starts. Then the memory requirement is doubled:
  "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
  "slurmd: scaling CPU count by factor of 2"
This job should not have started.

While the first job is still running, we submit a second, identical job:
  srun --mem-per-cpu=250 sleep 60
cons_res checks that it will fit: ((400 - 500) >= 250); the unsigned int wraps, the test passes, and the job starts. This second job also should not have started.
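A small, self-contained demonstration of the unsigned wrap described in the scenario above; the variable names and values mirror the linux0 example rather than the cons_res internals.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t real_mem  = 400;   /* RealMemory on node linux0 (MB)    */
            uint32_t alloc_mem = 500;   /* memory already (over)allocated    */
            uint32_t job_mem   = 250;   /* second --mem-per-cpu=250 request  */

            /* (real - alloc) wraps to a huge unsigned value, so the check
             * incorrectly passes and the second job is started. */
            if ((real_mem - alloc_mem) >= job_mem)
                    printf("buggy check passes: job would start\n");

            /* Reordering the comparison avoids the underflow. */
            if (real_mem >= alloc_mem + job_mem)
                    printf("job fits\n");
            else
                    printf("job rejected\n");
            return 0;
    }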
-
Morris Jette authored
-
- 27 Sep, 2012 3 commits
-
Danny Auble authored
purged from the system if its front-end node goes down.
-
Danny Auble authored
database, and the job is running on a small block, make sure we free up the correct node count.
-
Bill Brophy authored
-
- 25 Sep, 2012 1 commit
-
Morris Jette authored
Based upon work by Jason Sollom, Cray Inc. and used by permission
-
- 24 Sep, 2012 1 commit
-
Morris Jette authored
This addresses bug 130
-