Commits · 380ccdf57743f6c5f4396525ae669f7cfac8fc62 · Manuel G. Marciani / ces_slurm_simulator

22 Apr, 2011 3 commits
- Added a README title · 380ccdf5
  Ralph Bean authored Apr 20, 2011
  
  380ccdf5
- Updated references from README to README.rst · bf94e631
  Ralph Bean authored Apr 20, 2011
  
  bf94e631
- Renamed README to README.rst for github formatting. · 98b1f3ae
  Ralph Bean authored Apr 20, 2011
```
The `.rst` extension stands for reStructuredText, one of the markup formats
supported by github.  The tool they use to render your `README.*` and the list
of markup languages it supports can be found here:
    https://github.com/github/markup
```
  98b1f3ae
21 Apr, 2011 3 commits
- minor updates to cray.html document · c7730a12
  Morris Jette authored Apr 20, 2011
  
  c7730a12
- fix a number of problems with respect to cray systems. · 05ece609
  Morris Jette authored Apr 20, 2011
  
  05ece609
- reverted more of the change from Gerrit since the last revert made functions... · 1fc3dd83
  Danny Auble authored Apr 20, 2011
```
reverted more of the change from Gerrit since the last revert made functions that aren't used anymore.
```
  1fc3dd83
20 Apr, 2011 6 commits
- Fix a bunch of Cray regression tests failures. · 752d0ba1
  Morris Jette authored Apr 19, 2011
  
  752d0ba1
- Increment version number in proctrack plugins to 91 for 64-bit container ids. · a52aa101
  Morris Jette authored Apr 19, 2011
  
  a52aa101
- proctrack: conversion of container ID to 64 bit · 5caef359
  Morris Jette authored Apr 19, 2011
```
This continues the conversion of the cont_id from 32bit to 64bit,
 * updated slurm_container_find() to return u64 in order to match type of container ID;
 * updated slurm_proctrack_ops() to match the update of u32 -> u64 container ID in the
   slurm_container_xxx() functions,
 * miscellaneous type conversions from/to u64,
 * using "%"PRIu64"" for printing 64 bit.
Patch 01_proctrack-64-bit-conversion.diff from Gerrit Renker
```
  5caef359
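
A minimal sketch of the printing point made in the 64-bit conversion above, not the actual SLURM code: `%lu` only matches `uint64_t` where `unsigned long` happens to be 64 bits wide, while the `PRIu64` macro from `<inttypes.h>` expands to the correct conversion on every platform. The helper name `format_cont_id` is hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not part of SLURM: format a 64-bit container ID
 * into buf. PRIu64 expands to the right printf conversion for uint64_t
 * on both 32-bit and 64-bit platforms, which a hard-coded "%lu" does not.
 * Returns the number of characters snprintf() would have written. */
static int format_cont_id(char *buf, size_t len, uint64_t cont_id)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "cont_id=%" PRIu64, cont_id);
}
```

Note that the macro sits outside the string literal, so the format is built by literal concatenation at compile time.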
- modify printing of container id (a uint64_t) from %lu to %"PRIu64". · cebcaa07
  Morris Jette authored Apr 19, 2011
  
  cebcaa07
- r23125 | lipari | 2011-04-19 16:39:48 -0700 (Tue, 19 Apr 2011) | 2 lines · a7b2ce71
  Morris Jette authored Apr 19, 2011
```
Minor mods to man page html generation
```
  a7b2ce71
- Reverted back the change in get_basil_version, and made it a static var, only... · d86ea460
  Danny Auble authored Apr 19, 2011
```
Reverted back the change in get_basil_version, and made it a static var, only compile checked so far.
```
  d86ea460
19 Apr, 2011 5 commits
- -- Add configuration parameter MaxStepCount to limit effect of bad batch · 26c97457
  Moe Jette authored Apr 19, 2011
```
    scripts.
```
  26c97457
- -- Create RPM for srun command that is a wrapper for the Cray/ALPS aprun · 4ce20bea
  Moe Jette authored Apr 19, 2011
```
    command. Dependent upon .rpmmacros parameter of "%_with_srun2aprun"
```
  4ce20bea
- svn merge -r23090:23121 https://eris.llnl.gov/svn/slurm/branches/slurm-2.2 · 6f23e95a
  Moe Jette authored Apr 18, 2011
  
  6f23e95a
- Bug fix to doc/html/Makefile.am · cd23dac2
  Don Lipari authored Apr 18, 2011
  
  cd23dac2
- Remove linkage of StorageType plugin with GatherType plugin. · bd43cd60
  Moe Jette authored Apr 18, 2011
```
Just use whichever StorageType plugin the user specifies to the configurator
```
  bd43cd60
18 Apr, 2011 6 commits
- Don't re-kill a job that has been allocated a node newly marked DOWN · c6ac5499
  Moe Jette authored Apr 18, 2011
  
  c6ac5499
- Note that PreemptType=preempt/qos can not be used with PreemptMode=SUSPEND · a219d47a
  Moe Jette authored Apr 18, 2011
  
  a219d47a
- clarify use of PreemptTime configuration parameter, · 5dfcf720
  Moe Jette authored Apr 18, 2011
```
Patch from Bill Brophy
```
  5dfcf720
- -- Fix some anomalies in select/cons_res task layout when using the · 855ca6f3
  Moe Jette authored Apr 18, 2011
```
    --cpus-per-task option. Patch from Martin Perry, Bull.
```
  855ca6f3
- fix for multi-cluster mode and cray · 0bf9d9a9
  Danny Auble authored Apr 18, 2011
  
  0bf9d9a9
- various patches from Gerrit and such · 9e1e216c
  Danny Auble authored Apr 18, 2011
  
  9e1e216c
17 Apr, 2011 13 commits

- adds a bunch of cosmetic changes from Gerrit. · 24a02ef9
  Moe Jette authored Apr 17, 2011
  
  24a02ef9

- job_submit/lua: expose priority-related job_record fields to lua interface · b22385b7
  Moe Jette authored Apr 17, 2011

This allows scripted modification of job records, by exposing the
 * job_ptr->direct_set_prio
 * job_ptr->priority
 * job_ptr->details->nice
fields to the job_submit.lua script.

b22385b7

- slurmctld: allow job_submit plugin to modify/set the priority/nice values · 98d7059f
  Moe Jette authored Apr 17, 2011

This allows the job_submit plugin to directly set priority values. If it
assigns a priority value different from 0 and NO_VAL, the priority is marked
as "fixed" via job_ptr->direct_set_prio.

To enable this, the permission check for a directly set priority is now done
before calling the job_submit plugin, which additionally allows the plugin to
influence the nice value of the job.

98d7059f

- job_submit: allow job_submit plugin to put job on hold · fecf4769
  Moe Jette authored Apr 17, 2011

This reorders the code of _job_create() so that the job_submit plugin
is able to put a job on hold (by setting the job priority to 0). To prevent the
user from releasing such jobs, jobs put on hold by the job_submit plugin use
WAIT_HELD rather than WAIT_HELD_USER.

fecf4769

- select/cray: unconditionally release reservations · 6d36b50c
  Moe Jette authored Apr 17, 2011

This increases robustness in releasing ALPS reservations. Previously
the reservation was only released through
 * select_g_job_fini() for interactive (salloc) sessions;
 * batch_finish() by slurmstepd for batch sessions.

That arrangement was a single point of failure for batch jobs, since a failure
of batch_finish() would mean that the reservation could only be released
much later, through the detection of orphaned ALPS reservations in
basil_inventory().

For batch jobs that terminate normally this means that the RELEASE method is
called twice: first in job_complete(), and then in batch_finish(). The Basil
1.2 design document by Ben Landsteiner (dated 15 Feb 2011) suggests in section
3.3.5 repeated calls of RELEASE as one possible way of improving the response
of the RELEASE method. There will be additional "entry not found" messages in
the apschedMMDD logs, but (due to the preceding patch) not in the SLURM logs.

For jobs that have to be terminated (e.g. job_timed_out, job_requeue, job_fail),
this patch will mean that the RELEASE is called much sooner and thus is 
expected to improve efficiency.

For interactive salloc sessions that are cancelled via scancel, there is no
longer a warning message about the ALPS reservation that has already been
released (since the release happens first through select_p_job_signal and then
through job_complete -> deallocate_nodes -> select_p_job_fini).

6d36b50c

- select/cray: special case for "no such resId" · d26dc971
  Moe Jette authored Apr 17, 2011

For robustness, it makes sense to call the RELEASE method multiple times.

The Basil 1.2 document by Ben Landsteiner, dated 15th Feb 2011, suggests in
section 3.3.5, "Improve RELEASE method response" the following:

 "Periodically send RELEASE method requests until the RELEASE method
  response indicates the reservation is gone (via an error response)".

The typical error message for this case is (also shown in the document on page 11):

[2011-04-13T17:57:35] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 2087
[2011-04-13T17:57:35] sched: Cancel of JobId=5730 by UID=21215, usec=107543

There is already at least one use case for this type of error: cancelling a salloc session via scancel:
 1. scancel causes job_signal() to be invoked,
 2. job_signal() defers to select_g_job_signal(),
    - select/cray:select_p_job_signal() calls do_basil_release() before forwarding SIGKILL,
    - at this stage the reservation is already released,
 3. in the likely case of a running salloc session, job_signal() then calls deallocate_nodes(),
 4. this calls select_g_job_fini(),
    - the salloc default action for select/cray:select_p_job_fini() is to call do_basil_release(),
    - since at this stage the reservation has already been released, the error message results.

I don't like this error message myself, but avoiding it in all call paths is complicated. Also,
for robustness, I would very much prefer that do_basil_release() is always called from
deallocate_nodes().

Hence this patch creates a custom error class for the case "no entry for resId xxx". This
allows the calling function to still catch the error, but the unnecessary warning is no
longer printed in the logfiles.

The callers of this method are:
 * do_basil_release() - which already is set up to handle error/non-error case;
 * basil_safe_release() - this does no extra error checking, since it is called
   when trying to remove orphaned reservations; any failure in attempting to
   release the reservation will result in repeated "orphaned ALPS reservation ..."
   messages.

d26dc971
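
The idea of a separate error class for "no entry for resId" can be sketched as below. This is an assumption-laden illustration, not the plugin's implementation: the enum, both function names, and the convention that a NULL message means success are all hypothetical. Only the message text "No entry for resId" is taken from the log excerpt in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error classes; the real plugin's types are not shown here. */
enum release_status {
    RELEASE_OK,         /* reservation released */
    RELEASE_NOT_FOUND,  /* "No entry for resId": already gone */
    RELEASE_ERROR       /* any other backend failure */
};

/* Classify the message returned by a RELEASE request.
 * NULL is taken to mean success (an assumption of this sketch). */
static enum release_status classify_release(const char *alps_msg)
{
    if (alps_msg == NULL)
        return RELEASE_OK;
    if (strstr(alps_msg, "No entry for resId") != NULL)
        return RELEASE_NOT_FOUND;
    return RELEASE_ERROR;
}

/* Caller-side handling: the "not found" class is still visible to the
 * caller, but only genuine failures are logged. Returns true if the
 * reservation is gone afterwards. */
static bool handle_release(const char *alps_msg)
{
    switch (classify_release(alps_msg)) {
    case RELEASE_OK:
    case RELEASE_NOT_FOUND:
        return true;
    case RELEASE_ERROR:
        fprintf(stderr, "error: %s\n", alps_msg);
        return false;
    }
    return false; /* not reached */
}
```

Keeping the classification separate from the logging decision is what lets repeated RELEASE calls stay quiet while other backend errors remain visible.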

- select/cray: always release reservation before calling apkill · 164ed8df
  Moe Jette authored Apr 17, 2011

This patch implements the same principle as an earlier one to fix issues
when signalling aprun job steps via apkill: to avoid race conditions
where further aprun lines get started while the current one is still in
progress, always release the reservation first.

164ed8df

- select/cray: refactor Basil 4.0 table · 25f5fbc8
  Moe Jette authored Apr 17, 2011

This refactors the code to parse Basil 4.0 response data, removing
code that is applicable to both Basil 3.1 and 4.0.

25f5fbc8

- Major update of Cray documentation. · f95315c9
  Moe Jette authored Apr 17, 2011
  
  f95315c9
- Add some more Cray-specific tools. · fbe72f99
  Moe Jette authored Apr 17, 2011
  
  fbe72f99
- updated libalps test logic and add more notes · ed422a91
  Moe Jette authored Apr 17, 2011
  
  ed422a91
- -- Added contribs/cray/libalps_test_programs.tar.gz with tools to validate · 2bf4d43e
  Moe Jette authored Apr 17, 2011
```
    SLURM's logic used to support Cray systems.
```
  2bf4d43e
- modify srun wrapper to take single character options without a space between · c3c4e48b
  Moe Jette authored Apr 17, 2011
```
the key and value (e.g. "-N2" gets translated to "-N 2" for the perl parser).
```
  c3c4e48b
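
The option rewriting mentioned in the srun wrapper commit can be illustrated with a small sketch. The wrapper itself is written in Perl, so this C helper, `split_short_option`, is purely hypothetical. It also assumes that every single-character option with trailing text takes a value, which the real wrapper would have to decide per option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: rewrite a single-character option with an
 * attached value, e.g. "-N2", as "-N 2". Long options ("--nodes=2"),
 * bare flags ("-v") and plain words are copied unchanged.
 * Returns true if the argument was split. */
static bool split_short_option(const char *arg, char *out, size_t len)
{
    assert(arg != NULL && out != NULL && len > 0);
    if (arg[0] == '-' && strlen(arg) > 2 && arg[1] != '-') {
        /* Keep the option letter, then emit the value after a space. */
        snprintf(out, len, "-%c %s", arg[1], arg + 2);
        return true;
    }
    snprintf(out, len, "%s", arg);
    return false;
}
```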

16 Apr, 2011 4 commits
- make the test and error logic a bit more clear · 5f468fdc
  Moe Jette authored Apr 16, 2011
  
  5f468fdc
- Put some file parsing functions into slurm_xlator.h. · c4af5600
  Moe Jette authored Apr 16, 2011
```
This is so that the select/cray plugin can read its
configuration file and still be used by the perl 
wrappers.
```
  c4af5600
- Change container ID supported by proctrack plugin from 32-bit to 64-bit. · e6722cf4
  Moe Jette authored Apr 16, 2011
```
This is the size of the container ID in the current SGI_JOB (PAGG)
library.
```
  e6722cf4
- The srun2aprun wrapper should be complete now, although it is still largely untested. · eae5dc75
  Moe Jette authored Apr 16, 2011
  
  eae5dc75