- 29 Mar, 2011 8 commits
-
-
Moe Jette authored
We have installed the same file this morning on all our systems (including a non-Cray cluster which also is SuSe based). I have verified that the limits get picked up by looking at /proc/$(pidof slurmd)/limits.

select/cray: override ulimits on SuSe based system

This provides a sample /etc/sysconfig/slurm file to override ulimits on SuSe systems such as Cray.

Since slurm respects limits configured by the system administrator, and since Cray/SuSe systems (in contrast to Debian-based systems) do not automatically exempt processes owned by the super-user from pam_limits configured in /etc/security/limits.conf, it can (and did) happen on Cray systems that such limits cause premature and counter-intuitive interaction with slurmd frontend nodes.

The provided file overrides limits, using sensible defaults which have been inspired by the defaults set for processes owned by user root
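The verification step mentioned above (inspecting /proc/$(pidof slurmd)/limits) can be sketched as a small parser. This is only an illustration: the helper name `parse_proc_limits` and the sample values are hypothetical and are not part of the commit, and the column layout is the usual one for Linux `/proc/<pid>/limits` files.

```python
import re


def parse_proc_limits(text):
    """Parse the text of a /proc/<pid>/limits file into a dict.

    Returns {limit name: (soft, hard, units)}; units is "" when the
    kernel prints no unit for that limit.
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces;
        # limit names themselves contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue
        units = fields[3] if len(fields) > 3 else ""
        limits[fields[0]] = (fields[1], fields[2], units)
    return limits


# Hypothetical sample in the usual layout (values are made up).
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
"""

if __name__ == "__main__":
    for name, (soft, hard, units) in parse_proc_limits(SAMPLE).items():
        print(f"{name}: soft={soft} hard={hard} {units}")
```

On a live system the same function could be fed the contents of the slurmd limits file, for example `open(f"/proc/{pid}/limits").read()` with `pid` taken from `pidof slurmd`.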
-
Moe Jette authored
did this in preparation for the migration from PBS which will start next week.

sbatch: support mpp.* PBS variants

This adds support for Cray-specific PBS directives:
* mppwidth: Task width (corresponds to --ntasks). This is not directly mapped; it depends on the other parameters.
* mppmem: Memory in units of k/m/g. Default unit is Mbyte, kbyte units are rounded up to the next Mbyte. Actual amount depends on mppnppn.
* mppdepth: Task depth, maps into --cpus-per-task.
* mppnppn: Processing elements per node, maps into --ntasks-per-node.
* mppnodes: Nodelist. In contrast to PBS, requires nid%05u prefix, i.e. the comma-separated list contains single entries nid%05u and/or ranges nid%05u-nid%05u.
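The unit handling for mppmem and the nid%05u naming described in the list above can be illustrated with a minimal sketch. The function names are hypothetical and this is not the sbatch implementation; it assumes 1 Mbyte = 1024 kbyte and 1 Gbyte = 1024 Mbyte, and it ignores the dependence on mppnppn.

```python
import re

# Directives that the commit message says map directly onto sbatch options.
DIRECT_MAP = {
    "mppdepth": "--cpus-per-task",
    "mppnppn": "--ntasks-per-node",
}


def mppmem_to_mb(value):
    """Convert an mppmem value such as '512', '2048k' or '4g' to Mbytes.

    No suffix (or 'm') means Mbytes; kbytes are rounded up to the next
    Mbyte; 'g' is multiplied by 1024 (an assumption of this sketch).
    """
    match = re.fullmatch(r"(\d+)([kmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported mppmem value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "k":
        return (amount + 1023) // 1024  # round up to the next Mbyte
    if unit == "g":
        return amount * 1024
    return amount


def nid_name(node_number):
    """Format a node number with the nid%05u prefix, e.g. 12 -> nid00012."""
    return f"nid{node_number:05d}"


if __name__ == "__main__":
    print(mppmem_to_mb("1025k"), mppmem_to_mb("2g"), nid_name(12))
```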
-
Moe Jette authored
value (due to uint32_t conversion) in sprio. Helpful when fine-tuning weight parameters.

sprio: print overall priority value even if it is less than 0

With some combinations of component values and low weight factors, it can happen that the priority computed by the priority/multifactor plugin lies below 0 (and would be rounded up to 1).

When this condition happens, the negative values are difficult to interpret and can give the wrong impression that the resulting priority is very large (due to the conversion into a large unsigned number).

In our tests we found it more helpful to display the negative priority value: a user can know that SLURM does not use negative values, and having the absolute value gives a better indication of how much weight to add to the other factors so that the overall priority centers around 0.

Before:
  JOBID USER       PRIORITY  AGE FAIRSHARE JOBSIZE PARTITION QOS   NICE
   9968 sukysj         8955  218         0     236         0   0  -8500
  10065 amsmax   4294957826    9         0     340         0   0   9821
  10066 amsmax   4294957826    9         0     340         0   0   9821
  10067 amsmax   4294957826    9         0     340         0   0   9821
  10068 amsmax   4294957826    9         0     340         0   0   9821
  10069 amsmax   4294957826    9         0     340         0   0   9821
  10070 amsmax   4294957826    9         0     340         0   0   9821

After:
  JOBID USER       PRIORITY  AGE FAIRSHARE JOBSIZE PARTITION QOS   NICE
   9968 sukysj         8955  218         0     236         0   0  -8500
  10065 amsmax        -9470    9         0     340         0   0   9821
  10066 amsmax        -9470    9         0     340         0   0   9821
  10067 amsmax        -9470    9         0     340         0   0   9821
  10068 amsmax        -9470    9         0     340         0   0   9821
  10069 amsmax        -9470    9         0     340         0   0   9821
  10070 amsmax        -9470    9         0     340         0   0   9821
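The relation between the two PRIORITY columns above is plain 32-bit wrap-around. The following sketch reproduces the arithmetic; the helper names are hypothetical and only illustrate the conversion, they are not taken from sprio.

```python
def as_uint32(value):
    """Reinterpret an integer as an unsigned 32-bit value (wraps negatives)."""
    return value & 0xFFFFFFFF


def as_int32(value):
    """Reinterpret the low 32 bits of an integer as a signed 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    # The "After" priority shown as unsigned gives the "Before" number.
    print(as_uint32(-9470))      # 4294957826
    print(as_int32(4294957826))  # -9470
```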
-
Moe Jette authored
set directly (since the priority factor fields are 0).

priority/multifactor: skip jobs whose priority has been set directly

This avoids displaying "house numbers" in sprio if the priority has been set directly, as in the following example for aghasemi (whose group is a "bottom-feeder" with a fixed priority of 10):

palu> squeue
  JOBID USER     ACCOUNT NAME      PARTITION ST REASON    START_TIME       TIME TIME_LEFT NODES PRIORITY
   6971 robinson g13     cp2k      day       PD Resources 2011-03-16T13:09 0:00     40:00    35    10327
   6983 rpopescu s190    bash      day       PD Resources N/A              0:00   1:00:00     1     8254
   6958 aghasemi s142    poslow007 day       PD Priority  2011-03-16T15:28 0:00   1:00:00   108       10

palu> sprio
  JOBID USER       PRIORITY  AGE FAIRSHARE JOBSIZE PARTITION QOS   NICE
   6958 aghasemi      10000    0         0       0         0   0 -10000
   6964 rpopescu       8353   71         0      56         0   0  -8225
   6971 robinson      10327   63         0    1988         0   0  -8276
  ...
-
Moe Jette authored
this is ongoing, whenever I see something, I add it to such a patch.
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
- 28 Mar, 2011 7 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
passthrough nodes (the new bit). Sort by
1. Count of full dimensions (higher is better), then
2. Number of passthrough nodes (lower is better)
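The two-level ordering described above can be expressed as a single composite sort key. This is a minimal sketch with a hypothetical `Candidate` record and made-up values; it is not the plugin code.

```python
from collections import namedtuple

# Hypothetical record describing one candidate; field names are assumptions.
Candidate = namedtuple("Candidate", ["name", "full_dims", "passthrough_nodes"])


def rank(candidates):
    """Order candidates: more full dimensions first, then fewer passthrough nodes."""
    return sorted(candidates, key=lambda c: (-c.full_dims, c.passthrough_nodes))


if __name__ == "__main__":
    sample = [
        Candidate("a", full_dims=1, passthrough_nodes=0),
        Candidate("b", full_dims=2, passthrough_nodes=4),
        Candidate("c", full_dims=2, passthrough_nodes=1),
    ]
    print([c.name for c in rank(sample)])  # ['c', 'b', 'a']
```

Negating the first key component keeps the sort a single ascending pass while still preferring the higher count of full dimensions.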
-
Moe Jette authored
user if the limit is not infinite.
-
Moe Jette authored
Patch from Hongjia Cao, NUDT.
-
Moe Jette authored
Load the apkill value from cray.conf even on alps emulated systems
-
- 27 Mar, 2011 3 commits
- 26 Mar, 2011 15 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Added --with-alps-emulation to configure, and also an optional cray.conf to setup alps location and database information.
-
Moe Jette authored
child process, to avoid interactions with network sockets and file descriptors in use by the slurmctld.
-
Moe Jette authored
perform admissibility checks when e.g. moving a pending job into a restricted partition to which the user has access).
-
Moe Jette authored
-
Moe Jette authored
correspondence with Ben - to me this now looks very similar to the way the SGI pagg IDs are used on Cray:
* upper part = node_id (corresponding to alloc_node)
* lower part = 32 bit job ID (corresponding to alloc_sid)
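The upper/lower split described above can be sketched as bit packing. The commit message only states that the lower part is a 32-bit job ID; the 64-bit total width, the 32-bit bound on node_id, and the function names are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF


def pack_id(node_id, job_id):
    """Combine node_id (upper part) and a 32-bit job ID (lower part).

    Assumes a 64-bit result with node_id in the upper 32 bits.
    """
    if not 0 <= job_id <= MASK32:
        raise ValueError("job_id must fit in 32 bits")
    if not 0 <= node_id <= MASK32:
        raise ValueError("node_id must fit in 32 bits (assumption of this sketch)")
    return (node_id << 32) | job_id


def unpack_id(value):
    """Split a packed value back into (node_id, job_id)."""
    return value >> 32, value & MASK32


if __name__ == "__main__":
    packed = pack_id(node_id=12, job_id=6958)
    print(hex(packed), unpack_id(packed))
```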
-
Moe Jette authored
of patches #17 and #18.
-
Moe Jette authored
-
Moe Jette authored
meant to pass the signal on to the plugin, along with the job_ptr, so that the plugin can take signal-dependent actions. In the case of select/cray this is to forward the signal to the remote aprun invocations using apkill (implemented in subsequent patch #20).
-
Moe Jette authored
separates (a) signalling job steps from (b) cleaning up reservations. The former is used by subsequent patches for a better job_signal() on Cray.
-
Moe Jette authored
-
Moe Jette authored
If ALPS directory is explicitly set, then assume it's a Cray system.
-
Danny Auble authored
better error handling, we need to make sure we add blocks to the list before starting the real time server or we could miss things on start up
-
Danny Auble authored
-
- 25 Mar, 2011 7 commits
-
-
Danny Auble authored
BLUEGENE - fixed bug where the usage of a node could get the incorrect usage when adding it from another node when only using it for passthrough
-
Danny Auble authored
-
Moe Jette authored
to use select_g_job_expand_allow() instead
-
https://eris.llnl.gov/svn/slurm/branches/slurm-2.3.expand
Moe Jette authored
-- Add support for jobs to expand in size. Submit additional batch job with the option "--dependency=expand:<jobid>". See web page "faq.html#job_size" for details. Restrictions to be removed in the future.
-
Moe Jette authored
extern bool select_p_job_expand_allow(void); Now only true for select/linear
-
Danny Auble authored
-
Danny Auble authored
-