1. 16 Apr, 2011 6 commits
  2. 15 Apr, 2011 3 commits
  3. 14 Apr, 2011 7 commits
  4. 13 Apr, 2011 5 commits
  5. 12 Apr, 2011 3 commits
  6. 11 Apr, 2011 7 commits
  7. 10 Apr, 2011 9 commits
    • tweaks to some tests to reflect recent changes in priority change · 5498bb90
      Moe Jette authored
      and dependency clearing logic
    • api: remove unreferenced and undocumented function · 22ece52e
      Moe Jette authored
      This removes the function "slurm_pack_msg_no_header", which is not referenced
      anywhere in the src tree and is not listed in any of the slurm manpages.
      
      As far as I understand the documentation, each slurm message needs to have a
      header; this function thus likely dates from very old or initial code.
    • scontrol: refactor if/else statement · 367c71ba
      Moe Jette authored
    • protocol_defs: remove duplicate/identical test · 5c13acad
      Moe Jette authored
      This removes a test statement which appears identically twice.
    • sprio: add support for the SLURM_CLUSTERS environment variable · 0a0efdf2
      Moe Jette authored
      This adds support for the SLURM_CLUSTERS environment variable to sprio as well.
      It also makes the test for the priority plugin type dependent on whether sprio
      is running with multiple-cluster support, as sketched below.
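      A minimal sketch of how the plugin-type check could depend on multi-cluster
      operation (all names below are illustrative stand-ins rather than the actual
      sprio code, and the option handling is simplified):

        #include <stdbool.h>
        #include <stdlib.h>
        #include <string.h>

        /* True when sprio is asked to query more than the local cluster,
         * either via an explicit cluster list or via SLURM_CLUSTERS. */
        static bool multi_cluster_mode(const char *clusters_opt)
        {
            const char *env = getenv("SLURM_CLUSTERS");

            if (clusters_opt && clusters_opt[0])
                return true;
            return (env && env[0]);
        }

        /* sprio normally requires the priority/multifactor plugin; with
         * multiple clusters the local PriorityType setting is not
         * authoritative, so the check is relaxed (one plausible shape
         * of the dependency described above). */
        static bool priority_plugin_ok(const char *clusters_opt,
                                       const char *priority_type)
        {
            if (multi_cluster_mode(clusters_opt))
                return true;
            return (strcmp(priority_type, "priority/multifactor") == 0);
        }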
    • scontrol: add support for the SLURM_CLUSTERS environment variable · c7045c83
      Moe Jette authored
      On our frontend host we support multiple clusters (Cray and non-Cray) by
      setting the SLURM_CLUSTERS environment variable accordingly.
      
      In order to use scontrol (e.g. for hold/release of a user job) from a
      frontend host to control jobs on a remote Cray system, we need support for
      the SLURM_CLUSTERS environment variable in scontrol as well.
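      A minimal sketch of the intended fallback when selecting the cluster(s) to
      operate on (the option structure and helper are hypothetical; the actual
      scontrol code stores and uses this information differently):

        #include <stdlib.h>
        #include <string.h>

        /* Hypothetical stand-in for scontrol's option state. */
        struct client_opts {
            char *clusters;     /* comma-separated cluster names, or NULL */
        };

        /* Prefer an explicit command-line value; otherwise fall back to
         * the SLURM_CLUSTERS environment variable; otherwise stay local. */
        static void set_clusters(struct client_opts *opts, const char *cli_value)
        {
            const char *env = getenv("SLURM_CLUSTERS");

            if (cli_value && cli_value[0])
                opts->clusters = strdup(cli_value);
            else if (env && env[0])
                opts->clusters = strdup(env);
            else
                opts->clusters = NULL;      /* local cluster only */
        }

      With such a fallback in place, exporting SLURM_CLUSTERS on the frontend host is
      enough for 'scontrol hold <jobid>' and 'scontrol release <jobid>' to act on the
      named remote cluster without further command-line options.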
    • slurmctld: keep original nice value when putting job on hold · b414712e
      Moe Jette authored
      The current code erases the old nice value (both negative and positive) when a job is
      put on hold, so that the job has a nice component of 0 upon release.
      
      This interaction causes difficulties if the nice value set at submission time had been
      set there for a reason, for instance when
       * a system administrator has allowed a negative nice value to be set;
       * the user wanted to keep this as a low-priority job and wants his/her other jobs
         to go first (independent of the hold option);
       * the nice value carries other semantics - at our site, for instance, we use it
         for "base priority values" that are computed by looking at how much of
         its quota a given group has already (over)used.
      
      Here is an example which illustrates the loss of original nice values:
      
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 55
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 66
        [2011-03-31T09:47:53] sched: update_job: setting priority to 0 for job_id 77
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 88
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 99
        [2011-03-31T09:47:54] sched: update_job: setting priority to 0 for job_id 110
      
      This is from user 'kraused' whose project 's310' is within the allocated quota and thus
      has an initial nice value of -542 (set via the job_submit/lua plugin).
      
      However, by putting his jobs on hold, he has lost this advantage:
      
        JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
           55  kraused      15181        153          0       5028      10000          0      0
           66  kraused      15181        153          0       5028      10000          0      0
           77  kraused      15181        153          0       5028      10000          0      0
           88  kraused      15178        150          0       5028      10000          0      0
           99  kraused      15178        150          0       5028      10000          0      0
          110  kraused      15178        150          0       5028      10000          0      0
      
      I believe that resetting the nice value has been there for a reason, so the patch keeps
      the reset but skips it when the operation is a user/administrator hold, as sketched below.
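      A much-reduced sketch of the resulting logic inside update_job() (the structure
      definitions are stand-ins; NICE_OFFSET is the neutral nice value used by slurmctld):

        #include <stdint.h>

        #define NO_VAL      0xfffffffe   /* "value not set" marker */
        #define NICE_OFFSET 10000        /* neutral nice value */

        /* Reduced stand-ins for the slurmctld structures involved. */
        struct job_details { uint32_t nice; };
        struct job_record  { uint32_t priority; struct job_details *details; };
        struct job_desc    { uint32_t priority; };

        static void update_priority(struct job_record *job_ptr,
                                    const struct job_desc *job_specs)
        {
            if (job_specs->priority == NO_VAL)
                return;                          /* priority not being changed */

            job_ptr->priority = job_specs->priority;
            if (job_specs->priority != 0) {
                /* ordinary priority update: reset nice as before */
                job_ptr->details->nice = NICE_OFFSET;
            }
            /* priority == 0 is a user/administrator hold: keep the nice value
             * set at submission time (e.g. by the job_submit/lua plugin) so
             * it is applied again when the job is released */
        }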
    • slurmctld: test job_specs->min_nodes before altering the value via partition setting · f8ea48bb
      Moe Jette authored
      This fixes a problem when trying to move a pending job from one partition to another
      while not supplying any other parameters:
       * if a partition value is present, the job is pending and no min_nodes are supplied,
         job_specs->min_nodes gets set from the detail_ptr value;
       * this causes subsequent tests for job_specs->min_nodes ==/!= NO_VAL to fail.
      
      The following illustrates the behaviour; the example is taken from our system:
        palu2:0 ~>scontrol update jobid=3944 partition=night
        slurm_update error: Requested operation not supported on this system
      
        slurmctld.log
        [2011-04-06T14:39:51] update_job: setting partition to night for job_id 3944
        [2011-04-06T14:39:51] Change of size for job 3944 not supported
        [2011-04-06T14:39:51] updating accounting
        [2011-04-06T14:39:51] _slurm_rpc_update_job JobId=3944 uid=21215: Requested operation not supported on this system
      
      ==> The 'Change of size for job 3944' message reveals that the !select_g_job_expand_allow() case
          was triggered after job_specs->min_nodes had been set as a side effect of supplying a
          job_specs->partition.
      
      Fix:
      ====
       Since the test for select_g_job_expand_allow() does not depend on the job state, it is moved
       up, before the test for job_specs->partition. At the same time, the equality test for
       INFINITE/NO_VAL min_nodes values is moved to the same place (see the sketch below).
       The tests for job_specs->min_nodes that remain below the job_specs->partition setting depend
       on the job state:
       - the 'Reset min and max node counts as needed, insure consistency' block requires a pending job;
       - the other remaining test applies only to IS_JOB_RUNNING/SUSPENDED jobs.
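      A much-reduced sketch of the reordered checks (error handling, accounting updates
      and all unrelated fields are omitted; the structures and the stub are stand-ins):

        #include <stdbool.h>
        #include <stdint.h>

        #define NO_VAL   0xfffffffe
        #define INFINITE 0xffffffff

        struct job_details { uint32_t min_nodes; };
        struct job_record  { bool pending; struct job_details *details; };
        struct job_desc    { uint32_t min_nodes; char *partition; };

        /* Stub: on a Cray system the select plugin does not allow resizing. */
        static bool select_g_job_expand_allow(void) { return false; }

        static int update_job_sketch(struct job_record *job_ptr,
                                     struct job_desc *job_specs)
        {
            struct job_details *detail_ptr = job_ptr->details;

            /* 1. State-independent min_nodes tests, now done first. */
            if ((job_specs->min_nodes != NO_VAL)  &&
                (job_specs->min_nodes != INFINITE) &&
                !select_g_job_expand_allow())
                return -1;   /* "Change of size for job ... not supported" */

            /* 2. Partition change; for a pending job this may fill in
             *    job_specs->min_nodes from detail_ptr, which previously
             *    confused the tests that now run in step 1. */
            if (job_specs->partition && job_ptr->pending &&
                (job_specs->min_nodes == NO_VAL))
                job_specs->min_nodes = detail_ptr->min_nodes;

            /* 3. The remaining, state-dependent min_nodes handling follows
             *    here (pending: reset min/max node counts as needed;
             *    running/suspended: attempt the actual resize). */
            return 0;
        }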
    • slurmctld: case of authorized operator releasing user hold · 1895a10a
      Moe Jette authored
      This patch fixes a case where the priority is not recalculated on 'scontrol release',
      which happens when an authorized operator releases a job, or when the job is
      released via e.g. the job_submit plugin.
      
      The patch reorders the tests in update_job() to
       * first test whether the job has been held by the user and, only if not,
       * test whether an authorized operator changed the priority or
         the updated priority is being reduced (see the sketch below).
      
      Due to earlier permission checks, we have either
       * job_ptr->user_id == uid or 
       * authorized,
      where in both cases the release-user-hold operation is authorized.
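      A reduced sketch of the reordered priority tests (the structure and the hold
      test are simplified stand-ins; in slurmctld the user hold is recognised via
      the recorded hold reason):

        #include <stdbool.h>
        #include <stdint.h>

        #define NO_VAL 0xfffffffe

        struct job_record { uint32_t priority; };

        /* Stand-in: a held job has priority 0; the real code additionally
         * distinguishes a user hold from an administrator hold. */
        static bool job_held_by_user(const struct job_record *j)
        {
            return (j->priority == 0);
        }

        static void update_priority_sketch(struct job_record *job_ptr,
                                           uint32_t new_priority, bool authorized)
        {
            if (new_priority == NO_VAL)
                return;                          /* priority not being changed */

            if (job_held_by_user(job_ptr)) {
                /* Releasing a user hold: the earlier permission checks have
                 * ensured job_ptr->user_id == uid or 'authorized', so the
                 * release (and, in the real code, the subsequent priority
                 * recalculation) always happens here. */
                job_ptr->priority = new_priority;
            } else if (authorized || (job_ptr->priority > new_priority)) {
                /* Only now handle the operator-set / reduced-priority case,
                 * which previously shadowed the user-hold release above. */
                job_ptr->priority = new_priority;
            }
        }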