- 08 Nov, 2012 1 commit
-
-
Danny Auble authored
Signed-off-by: Danny Auble <da@schedmd.com>
-
- 07 Nov, 2012 5 commits
-
-
Danny Auble authored
-
Danny Auble authored
aprun process instead of a Perl script.
-
Janne Blomqvist authored
the attached patch changes the default timestamp format in log files to conform to RFC 5424 (the current version of the syslog RFC). It is identical to the current default "ISO 8601" timestamp used by SLURM, except that the timezone offset is appended. This has the benefits that 1) it's unambiguous, 2) it avoids potential confusion for admins running clusters in different timezones, and 3) it might help debug issues related to DST transitions. (More on that later.)

(To be pedantic, an RFC 5424 timestamp is still a valid ISO 8601 timestamp, but the converse is not necessarily true. RFC 3339 is a "profile" of ISO 8601, that is, a subset recommended for internet protocols; the RFC 5424 timestamp is, in turn, a subset of the RFC 3339 timestamps.)

The previous behavior can be restored by running configure with the --disable-rfc5424time flag.
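For illustration only, a minimal standalone C sketch of the two formats; the colon fix-up is an assumption about strict RFC 5424 output, since strftime's %z prints "+0200" while RFC 5424 wants "+02:00" (SLURM's actual formatting code differs):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        char iso[32], rfc[40];
        time_t now = time(NULL);
        struct tm tm;
        localtime_r(&now, &tm);

        /* Previous default: ISO 8601 without a timezone offset */
        strftime(iso, sizeof(iso), "%Y-%m-%dT%H:%M:%S", &tm);

        /* RFC 5424: the same, plus the offset, e.g. "+02:00" */
        strftime(rfc, sizeof(rfc), "%Y-%m-%dT%H:%M:%S%z", &tm);
        size_t len = strlen(rfc);
        if (len >= 5) {                      /* "+0200" -> "+02:00" */
            memmove(rfc + len - 1, rfc + len - 2, 3);
            rfc[len - 2] = ':';
        }

        printf("ISO 8601: %s\nRFC 5424: %s\n", iso, rfc);
        return 0;
    }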
-
Danny Auble authored
-
Danny Auble authored
specifying the number of tasks and not the number of nodes.
-
- 05 Nov, 2012 2 commits
-
-
Morris Jette authored
On a job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, then send SIGKILL. Previously only SIGKILL was sent to the tasks.
-
Morris Jette authored
On a job kill request, send SIGCONT and SIGTERM, wait KillWait seconds, then send SIGKILL. Previously only SIGKILL was sent to the tasks.
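A minimal sketch of that sequence, assuming the job's tasks share one process group; terminate_tasks is an illustrative name, and the real slurmstepd logic is more involved:

    #include <signal.h>
    #include <unistd.h>

    static void terminate_tasks(pid_t pgid, unsigned kill_wait)
    {
        kill(-pgid, SIGCONT);   /* wake tasks suspended by gang scheduling */
        kill(-pgid, SIGTERM);   /* ask them to exit cleanly */
        sleep(kill_wait);       /* grace period: KillWait seconds */
        kill(-pgid, SIGKILL);   /* force-kill whatever remains */
    }

    int main(void)
    {
        /* usage: terminate_tasks(step_pgid, 30); 30s is KillWait's default */
        (void)terminate_tasks;
        return 0;
    }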
-
- 02 Nov, 2012 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 31 Oct, 2012 1 commit
-
-
Mark Nelson authored
launching that wouldn't be able to run to completion because of a GrpCPUMins limit.
-
- 29 Oct, 2012 2 commits
-
-
Morris Jette authored
Now try to get the maximum node count rather than minimizing the number of leaf switches used. For example, if each leaf switch has 8 nodes, a request for -N4-16 would previously allocate 8 nodes (one leaf switch) rather than 16 nodes over two leaf switches.
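A toy sketch of the new selection goal, assuming all that matters is the idle node count per leaf switch (pick_nodes is an illustrative name, not the plugin's code):

    #include <stdio.h>

    /* Take nodes from as many leaf switches as needed to reach the
     * request's maximum, instead of stopping once the minimum is met
     * on a single switch. Returns the node count, or -1 on failure. */
    static int pick_nodes(const int idle[], int nswitch, int min, int max)
    {
        int total = 0;
        for (int i = 0; i < nswitch && total < max; i++)
            total += (idle[i] < max - total) ? idle[i] : max - total;
        return (total >= min) ? total : -1;
    }

    int main(void)
    {
        int idle[] = { 8, 8 };   /* two leaf switches, 8 idle nodes each */
        /* -N4-16: prints 16 (two switches), where the old logic took 8 */
        printf("%d\n", pick_nodes(idle, 2, 4, 16));
        return 0;
    }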
-
Morris Jette authored
Anyhow, after applying the patch, I was still running into the same difficulty. Upon a closer look, I saw that I was still receiving the ALPS backend error in the slurmctld.log file. When I examined the code pertaining to this and ran some SLURM-independent tests, I found that we were executing the do_basil_confirm function multiple times in the cases where it would fail. My independent tests show precisely the same behaviour: if you make a reservation request, successfully confirm it, and then attempt to confirm it again, you receive this error message. However, the "apstat -rvv" command shows that the ALPS reservation is fine, and I therefore concluded that this particular ALPS/BASIL message is more of an informational one and not a "show-stopper." In other words, I can consider the node ready at this point.

As a simple workaround, I currently just inserted an if-block immediately after the call to "basil_confirm" in function "do_basil_confirm" in ".../src/plugins/select/cray/basil_interface.c." The if-statement checks for "BE_BACKEND", and if this is the result it prints an informational message to slurmctld.log and sets the variable rc=0 so that we can consider the node ready. This now allows my prolog scripts to run, and I can clearly see the SLURM message that I had placed in that if-block.

However, I am not certain whether we really should just allow this error code to pass through, as it seems like it could be a fairly generic code and there could be various other causes of it where we would not wish to allow it to pass. I really only want to limit the number of calls to basil_confirm to one. Perhaps I could add a field to the job_record so that I can mark whether the ALPS reservation has been confirmed or not.
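A self-contained toy model of that workaround; BE_BACKEND's numeric value, the call signature, and everything around the two functions are illustrative assumptions, not the SLURM source:

    #include <stdio.h>

    enum basil_error { BE_NONE = 0, BE_BACKEND = 8 };   /* value assumed */

    static int confirmed = 0;   /* stand-in for ALPS reservation state */

    static int basil_confirm(unsigned resv_id)
    {
        (void)resv_id;
        if (confirmed)
            return -BE_BACKEND;   /* second confirm yields the backend error */
        confirmed = 1;
        return BE_NONE;
    }

    static int do_basil_confirm(unsigned resv_id)
    {
        int rc = basil_confirm(resv_id);
        if (rc == -BE_BACKEND) {
            /* "apstat -rvv" shows the reservation is fine, so treat
             * this as informational and consider the node ready */
            printf("resv %u already confirmed, continuing\n", resv_id);
            rc = 0;
        }
        return rc;
    }

    int main(void)
    {
        printf("first:  %d\n", do_basil_confirm(42));
        printf("second: %d\n", do_basil_confirm(42));
        return 0;
    }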
-
- 26 Oct, 2012 2 commits
-
-
Danny Auble authored
instead of putting 0s into the database, we put NO_VALs and have sacct figure out that jobacct_gather wasn't used.
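A small sketch of the distinction; NO_VAL mirrors the sentinel in slurm.h, while the field and output format here are illustrative:

    #include <inttypes.h>
    #include <stdio.h>

    #define NO_VAL (0xfffffffe)   /* sentinel, as in slurm.h */

    /* sacct side: tell "accounting not gathered" apart from a real zero */
    static void print_max_rss(uint32_t max_rss)
    {
        if (max_rss == NO_VAL)
            printf("MaxRSS: (jobacct_gather not used)\n");
        else
            printf("MaxRSS: %" PRIu32 "K\n", max_rss);
    }

    int main(void)
    {
        print_max_rss(NO_VAL);   /* stored instead of 0 when not gathered */
        print_max_rss(0);        /* a genuine measured zero */
        return 0;
    }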
-
Olli-Pekka Lehto authored
-
- 25 Oct, 2012 2 commits
-
-
Morris Jette authored
Incorrect error codes were returned in some cases, especially if the slurmdbd was down.
-
Morris Jette authored
-
- 24 Oct, 2012 1 commit
-
-
Morris Jette authored
Previously, for Linux systems, all information was placed on a single line.
-
- 23 Oct, 2012 1 commit
-
-
Danny Auble authored
interface instead of through the normal method.
-
- 22 Oct, 2012 3 commits
-
-
Danny Auble authored
-
Morris Jette authored
-
Don Albert authored
-
- 19 Oct, 2012 1 commit
-
-
Morris Jette authored
-
- 18 Oct, 2012 6 commits
-
-
Danny Auble authored
for passthrough gets removed on a dynamic system.
-
Danny Auble authored
in error for passthrough.
-
Morris Jette authored
-
Danny Auble authored
-
Danny Auble authored
user's pending allocation was started with srun and then, for some reason, the slurmctld was brought down; while it was down, the srun was removed.
-
Danny Auble authored
This is needed when a free request is added to a block while jobs are still finishing up, so that we don't start new jobs on the block, since they would fail on start.
-
- 17 Oct, 2012 2 commits
-
-
Morris Jette authored
Previously the node count would change from c-node count to midplane count (but still be interpreted as a c-node count).
-
jette authored
No real changes to logic other than some additional error checking.
-
- 16 Oct, 2012 2 commits
-
-
Morris Jette authored
Preempt jobs only when insufficient idle resources exist to start the job, regardless of node weight.
-
Danny Auble authored
-
- 05 Oct, 2012 2 commits
-
-
Morris Jette authored
Preemptor was not being scheduled. Fix for bugzilla #3.
-
Morris Jette authored
While this change lets gang scheduling happen, it overallocates resources from different priority partitions when gang scheduling is not running.
-
- 04 Oct, 2012 1 commit
-
-
Morris Jette authored
Preemptor was not being scheduled. See bugzilla #3 for details.
-
- 02 Oct, 2012 2 commits
-
-
Morris Jette authored
See bugzilla bug 132.

When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD (2.5.0-pre3).

Conditions:
-----------
* SelectType=select/cons_res
* SelectTypeParameters=CR_Core_Memory
* Using threads - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"

Description:
------------
In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has sufficient memory for a job. However, the per-CPU memory limits appear to be scaled by the number of threads. This new value may exceed the available memory on the node. And, once a node is overcommitted on memory, future memory checks in _verify_node_state() will always succeed.

Scenario to reproduce:
----------------------
With the example node linux0, we run a single-core job with 250MB/core:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((real - alloc) >= job mem), i.e. ((400 - 0) >= 250), and the job starts. Then the memory requirement is doubled:

    "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
    "slurmd: scaling CPU count by factor of 2"

This job should not have started.

While the first job is still running, we submit a second, identical job:

    srun --mem-per-cpu=250 sleep 60

cons_res checks that it will fit: ((400 - 500) >= 250); the unsigned int wraps, the test passes, and the job starts. This second job also should not have started.
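The wraparound is easy to reproduce in isolation; this standalone sketch uses illustrative variable names, not the plugin's:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Demonstrates the unsigned wraparound described above: once
     * alloc_mem exceeds real_mem, the availability check always passes. */
    int main(void)
    {
        uint32_t real_mem  = 400;   /* RealMemory on the node (MB) */
        uint32_t alloc_mem = 500;   /* memory already allocated (MB) */
        uint32_t job_mem   = 250;   /* per-CPU request (MB) */

        if ((real_mem - alloc_mem) >= job_mem)   /* 400 - 500 wraps to 4294967196 */
            printf("job \"fits\": %" PRIu32 " >= %" PRIu32 "\n",
                   real_mem - alloc_mem, job_mem);
        return 0;
    }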
-
Morris Jette authored
-
- 27 Sep, 2012 1 commit
-
-
Danny Auble authored
purged from the system if its front-end node goes down.
-