- 19 Feb, 2013 1 commit
-
-
David Bigagli authored
-
- 15 Feb, 2013 1 commit
-
-
Morris Jette authored
-
- 13 Feb, 2013 2 commits
-
-
Danny Auble authored
midplane.
-
Hongjia Cao authored
Handle the situation where a receiving/forwarding host can't unpack the sender's header (incompatible version).
-
- 12 Feb, 2013 6 commits
-
-
Puenlap Lee authored
-
Morris Jette authored
The logic is only an example and not meant for actual use.
-
Danny Auble authored
built.
-
Morris Jette authored
(within the existing select_nodeinfo field of the node_info_t data structure). Added Allocated Memory to node information displayed by sview and scontrol commands. bug 229
-
Morris Jette authored
This makes the configuration parameter names consistent within a partition and system-wide.
-
Morris Jette authored
Added new field to partition_info data structure
Break up some long lines and minor format changes
Move some definitions and statements into alphabetic order
-
- 11 Feb, 2013 2 commits
-
-
Morris Jette authored
1. Removed the job_submit and job_modify functions from the plugin; they are not required for the "slurmctld" plugin type.
2. Renamed the new parameter from "JobSubmitDynAllocPort" to "DynAllocPort" and renamed the variable (you need to change this in your slurm.conf file).
3. Added logic so you can see the DynAllocPort value using "scontrol show config" or "sview".
4. I made some minor formatting changes, mostly for lines that were too long.
5. Added #ifdef to the msg.h header file.
6. Changed the #ifdef variables in the header files to start with "DYNALLOC_". Perhaps not needed, but it should be safer, especially with some common names like "INFO_H".
7. I re-wrote much of info.c. There was no need to get a copy of the node information and process the copy. We can just work directly with the data structures.
-
Jimmy Cao authored
These provide support for MapReduce+
-
- 08 Feb, 2013 2 commits
-
-
Danny Auble authored
of the current day.
-
David Bigagli authored
the user commands.
-
- 07 Feb, 2013 2 commits
-
-
Morris Jette authored
Make core reservation test a bit more flexible in terms of node names
Fix arithmetic for core reservation count return for old RPCs
Update NEWS and RELEASE_NOTES
Remove trailing spaces from recent scontrol man page changes
-
Danny Auble authored
the allocation to 1 instead of 0, which prevents the allocation from happening.
-
- 06 Feb, 2013 3 commits
-
-
Morris Jette authored
Old logic uses some vestigial power management logic that no longer applies. It creates a job allocation, but the job step blocks indefinitely.
-
Morris Jette authored
Added for MR+ operation
-
Morris Jette authored
-
- 05 Feb, 2013 6 commits
-
-
Morris Jette authored
-
Danny Auble authored
since left the system.
-
Danny Auble authored
-
Danny Auble authored
to compile correctly on newer compilers
Signed-off-by: Danny Auble <da@schedmd.com>
-
Don Lipari authored
-
Morris Jette authored
If a job involved in a dependency completes and is purged, the logic used to test for circular dependencies can use the invalid pointer and generate an invalid memory reference before the pointer is cleared from the dependency list data structure.
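The failure mode described above can be illustrated with a minimal sketch. This is not the slurmctld code: the `depends_on` mapping, the `live_jobs` set, and both function names are hypothetical stand-ins for the dependency list and job table. The idea shown is that the circular-dependency walk should never follow a reference to a job that has already been purged, and that a purge should scrub references to the job from every dependency list.

```python
def has_circular_dependency(job_id, new_dep_id, depends_on, live_jobs):
    """Return True if making job_id depend on new_dep_id would create a cycle.

    depends_on maps a job id to the set of job ids it waits on; live_jobs is
    the set of job ids that still exist. Both are hypothetical structures.
    """
    stack = [new_dep_id]
    seen = set()
    while stack:
        current = stack.pop()
        if current == job_id:
            return True
        if current in seen or current not in live_jobs:
            # A purged job is skipped rather than followed, so a stale
            # reference can never be dereferenced.
            continue
        seen.add(current)
        stack.extend(depends_on.get(current, ()))
    return False


def purge_job(job_id, depends_on, live_jobs):
    """Remove a job and scrub every reference to it from the dependency lists."""
    live_jobs.discard(job_id)
    depends_on.pop(job_id, None)
    for deps in depends_on.values():
        deps.discard(job_id)


# Job 1 waits on job 2, which waits on job 3.
depends_on = {1: {2}, 2: {3}, 3: set()}
live_jobs = {1, 2, 3}
cycle_before = has_circular_dependency(3, 1, depends_on, live_jobs)

# After job 2 is purged the chain is broken and no cycle is reported.
purge_job(2, depends_on, live_jobs)
cycle_after = has_circular_dependency(3, 1, depends_on, live_jobs)
```

In C the equivalent of the `live_jobs` check would be clearing or validating the pointer before the dependency test runs; the sketch only shows the ordering concern, not the actual fix in the commit.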
-
- 04 Feb, 2013 1 commit
-
-
Morris Jette authored
-
- 01 Feb, 2013 2 commits
-
-
Morris Jette authored
This bug was introduced in 2.5.2 for the case where a GPU count was configured, but without device files. Unfortunately it left the CUDA_VISIBLE_DEVICES environment variable unset if GPUs were configured, but the job did not request any of them.
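As a rough sketch of the intended behavior, the variable should be set whenever GPUs are configured, whether or not the job requested any. The function name, its parameters, and the `NO_DEVICES` placeholder value below are assumptions made for illustration; the commit message does not say which value is used when no GPUs are allocated.

```python
# Assumed placeholder for "no usable devices"; the actual value written by
# the plugin is not stated in the commit message.
NO_DEVICES = "NoDevFiles"


def gpu_environment(configured_gpus, allocated_indices):
    """Build the GPU-related environment for a job (hypothetical helper).

    configured_gpus: number of GPUs configured on the node.
    allocated_indices: device indices granted to this job (may be empty).
    """
    env = {}
    if configured_gpus > 0:
        if allocated_indices:
            env["CUDA_VISIBLE_DEVICES"] = ",".join(
                str(i) for i in sorted(allocated_indices)
            )
        else:
            # The bug left the variable unset in this branch, so a job that
            # requested no GPUs could still see all of them.
            env["CUDA_VISIBLE_DEVICES"] = NO_DEVICES
    return env


env_with_gpus = gpu_environment(4, [2, 0])
env_without_request = gpu_environment(4, [])
env_no_gpus_configured = gpu_environment(0, [])
```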
-
Morris Jette authored
-
- 31 Jan, 2013 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
This eliminates the need for libnrt.so on the head node.
-
- 30 Jan, 2013 2 commits
-
-
Danny Auble authored
Signed-off-by: Danny Auble <da@schedmd.com>
-
David Bigagli authored
_prefix /usr/local
_slurm_sysconfdir %{_prefix}/etc/slurm
_mandir %{_prefix}/share/man
_infodir %{_prefix}/share/info
-
- 29 Jan, 2013 4 commits
-
-
Danny Auble authored
block, destroy it correctly.
-
Danny Auble authored
at least 3.5.0. This avoids a stack overflow when running jobs on more than 120k nodes.
-
jette authored
-
Morris Jette authored
Used for generic stack of slurmctld daemon plugins
-
- 25 Jan, 2013 1 commit
-
-
Morris Jette authored
-
- 23 Jan, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
jette authored
I ran into a problem with slurm-2.5.1 where IDLE nodes cannot be allocated to jobs. This can be reproduced as follows:

First, submit a job with the --no-kill option (I have SLURM_EXCLUSIVE set to allocate nodes exclusively by default). Then set one of the nodes allocated to the job (cn2) to state DOWN:

srun: error: Node failure on cn2
srun: error: Node failure on cn2
srun: error: cn2: task 0: Killed
^Csrun: interrupt (one more within 1 sec to abort)
srun: task 1: running
srun: task 0: exited abnormally
^Csrun: sending Ctrl-C to job 22605.0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: Force Terminated job step 22605.0

Then change the state of the node to IDLE again. But it cannot be allocated to jobs:

srun: job 22606 queued and waiting for resources

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
22606 work hostname root PD 0:00 1 (Resources)
22604 work sbatch root R 3:06 1 cn1

NodeName=cn2 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=0.05 Features=abc
Gres=(null)
NodeAddr=cn2 NodeHostName=cn2
OS=Linux RealMemory=30000 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
BootTime=2012-12-24T15:22:34 SlurmdStartTime=2013-01-14T11:06:32
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

I traced and located the problem in select/cons_res. The call sequence is:

slurmctld/node_mgr.c: update_node()
=> slurmctld/job_mgr.c: kill_running_job_by_node_name()
=> excise_node_from_job()
=> plugins/select/cons_res/select_cons_res.c: select_p_job_resized()
=> _rm_job_from_one_node()
=> _build_row_bitmaps()
=> common/job_resources: remove_job_from_cores()

If there are other jobs running in the partition, the partition row bitmap will not be set correctly.

In the example above, before _build_row_bitmaps(), the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-63

After setting the node down, the output of _dump_part() is:

[2013-01-19T13:24:56+08:00] part:work rows:1 pri:1
[2013-01-19T13:24:56+08:00] row0: num_jobs 2: bitmap: 16,32-47

Cores of cn2 are not marked as available. Instead, cores of other nodes are released. When another job requires the node cn2, the following log message appears:

[2013-01-19T13:25:03+08:00] debug3: cons_res: _vns: node cn2 busy

I do not understand the design of select/cons_res well and I do not know how to fix this. But it seems that _build_row_bitmaps() should not be called, since the job is not removed entirely; only one of its nodes is released.
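To make the difference between the two _dump_part() bitmaps quoted above easier to see, the range strings can be expanded and compared. The helper names `expand_bitmap` and `format_bitmap` are made up for this sketch and are not Slurm functions; only the two bitmap strings come from the report.

```python
def expand_bitmap(text):
    """Expand a range string such as "16,32-63" into a set of core indices."""
    cores = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cores.update(range(int(first), int(last) + 1))
        else:
            cores.add(int(part))
    return cores


def format_bitmap(cores):
    """Collapse a set of core indices back into a compact range string."""
    ordered = sorted(cores)
    ranges = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next index is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


before = expand_bitmap("16,32-63")  # row0 bitmap before the node went DOWN
after = expand_bitmap("16,32-47")   # row0 bitmap after the node went DOWN
released = before - after

print(format_bitmap(released))  # prints 48-63
print(len(released))            # prints 16
```

The sixteen cleared bits match one node's worth of cores (CPUTot=16), which is consistent with the report's observation that a full node was released from the row bitmap, just not cn2.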
-