- 21 Apr, 2016 2 commits
-
-
Brian Christiansen authored
-
Morris Jette authored
burst_buffer/cray - Don't call Datawarp "paths" function if script includes only create or destroy of persistent burst buffer. Some versions of Datawarp software return an error for such scripts, causing the job to be held. bug 2624
-
- 20 Apr, 2016 2 commits
-
-
Janne Blomqvist authored
I noticed that the CpuFreqDef config option was only partially implemented. The value was parsed, but then never used. So I took the liberty of repurposing it to mean roughly the opposite, namely the frequency governor to use when running a job step if the job doesn't explicitly provide a --cpu-freq option. I also changed the default of the CpuFreqGovernors option to "ondemand,performance", since ondemand isn't available with the intel_pstate driver. Otherwise the patch should be relatively straightforward and only changes a few minor things here and there.
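
The fallback described in this commit message can be pictured roughly as follows. This is a minimal Python sketch, not Slurm's C code: `pick_governor` and its parameters are hypothetical names, and returning `None` for a governor outside the allowed list is an assumption made for illustration.

```python
from typing import Optional, Sequence


def pick_governor(
    step_cpu_freq: Optional[str],
    cpu_freq_def: Optional[str],
    allowed: Sequence[str] = ("ondemand", "performance"),
) -> Optional[str]:
    """Pick the governor for a job step (hypothetical helper).

    step_cpu_freq: governor named by the job's --cpu-freq option, if any.
    cpu_freq_def:  configured CpuFreqDef value, used only as a fallback.
    allowed:       governors permitted by CpuFreqGovernors.
    """
    # An explicit request from the job always wins over the default.
    requested = step_cpu_freq if step_cpu_freq is not None else cpu_freq_def
    if requested is None:
        return None  # nothing requested and no default configured
    name = requested.lower()
    # Assumption: a governor outside the allowed list is simply not applied.
    return name if name in {g.lower() for g in allowed} else None
```

With `CpuFreqDef` set to `Performance`, a step that passes no `--cpu-freq` option would resolve to `performance` in this sketch, while a step that asks for `OnDemand` keeps its own choice.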
-
Tim Wickberg authored
-
- 15 Apr, 2016 1 commit
-
-
Morris Jette authored
Add TopologyParam option of "TopoOptional" to optimize network topology only for jobs requesting it. bug 2567
-
- 14 Apr, 2016 8 commits
-
-
Tim Wickberg authored
Time out stalled transfers and clean up related data structures. By default, wait five minutes since the last update. Hook onto the registration/ping message type to trigger cleanup in a minimally invasive manner. While here, restructure certain functions to use list_* functions rather than iterating over the structures.
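
A cleanup pass of this kind might look like the sketch below. It is written in Python purely for illustration and is not the code from the commit: `purge_stalled`, the dictionary of last-update timestamps, and the idea of calling it from a periodic message handler are all assumptions.

```python
import time
from typing import Dict, List, Optional

STALL_TIMEOUT_SECS = 5 * 60  # five minutes since the last update


def purge_stalled(
    transfers: Dict[str, float],
    now: Optional[float] = None,
    timeout: float = STALL_TIMEOUT_SECS,
) -> List[str]:
    """Drop transfers whose last update is older than `timeout` seconds.

    `transfers` maps a (hypothetical) transfer id to the monotonic time of
    its last update. Returns the ids that were removed.
    """
    if now is None:
        now = time.monotonic()
    # Collect first, then delete, so the dict is not mutated while iterating.
    stalled = [tid for tid, last in transfers.items() if now - last > timeout]
    for tid in stalled:
        del transfers[tid]
    return stalled
```

Calling such a function whenever a registration or ping message arrives avoids the need for a dedicated timer thread, which is one way to read "minimally invasive" here.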
-
Tim Wickberg authored
Otherwise --mail-type=ALL will send an unexpected stage_out message back to the user. Bug 2541.
-
Morris Jette authored
-
Janne Blomqvist authored
SipHash is a state-of-the-art keyed hash function that is performance-competitive with the usual non-cryptographic hash functions. It is used as the default hash function backing hash tables in, e.g., Perl, Python, and Rust. Here we initially use it for the gid cache hash table and in the common xhash implementation.
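
For readers unfamiliar with the function, here is a compact Python sketch of SipHash-2-4 (two compression rounds per block, four finalization rounds). It is an illustrative reimplementation of the published algorithm, not the C code added by this commit, and the name `siphash24` is chosen here for the example.

```python
MASK = 0xFFFFFFFFFFFFFFFF  # keep all arithmetic within 64 bits


def _rotl(x: int, b: int) -> int:
    return ((x << b) | (x >> (64 - b))) & MASK


def _sipround(v0: int, v1: int, v2: int, v3: int):
    v0 = (v0 + v1) & MASK; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK; v3 = _rotl(v3, 16); v3 ^= v2
    v0 = (v0 + v3) & MASK; v3 = _rotl(v3, 21); v3 ^= v0
    v2 = (v2 + v1) & MASK; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def siphash24(key: bytes, data: bytes) -> int:
    """Return the 64-bit SipHash-2-4 of `data` under a 16-byte `key`."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    full = len(data) - len(data) % 8
    # Compression: absorb each full 8-byte little-endian block.
    for i in range(0, full, 8):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        v0, v1, v2, v3 = _sipround(*_sipround(v0, v1, v2, v3))
        v0 ^= m

    # Final block: leftover bytes plus the message length in the top byte.
    m = int.from_bytes(data[full:], "little") | ((len(data) & 0xFF) << 56)
    v3 ^= m
    v0, v1, v2, v3 = _sipround(*_sipround(v0, v1, v2, v3))
    v0 ^= m

    # Finalization: four more rounds, then fold the state together.
    v2 ^= 0xFF
    for _ in range(4):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3
```

A hash table would typically pick the 16-byte key at random at startup and reduce the 64-bit result modulo the bucket count; the secret key is what makes bucket collisions hard for an outsider to predict.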
-
Brian Christiansen authored
For commits: f980c588 510abf23
-
Brian Christiansen authored
-
Brian Christiansen authored
Bug 1783
-
Morris Jette authored
select/cray - Initiate step node health check at start of step termination rather than after application completely ends so that NHC can capture information about hung (non-killable) processes. bug 2192
-
- 13 Apr, 2016 7 commits
-
-
Tim Wickberg authored
Make default compression vary based on library availability.
-
Morris Jette authored
-
Morris Jette authored
power/cray - Fix bug introduced in 15.08.10 preventing operation in many cases. bug 2628
-
Morris Jette authored
-
Morris Jette authored
burst_buffer/cray - Fix case where a script creating or deleting a persistent buffer would fail the "paths" operation and hold the job. bug 2624
-
Danny Auble authored
and it doesn't meet basic requirements.
-
Danny Auble authored
that wasn't set up correctly.
-
- 12 Apr, 2016 3 commits
-
-
Morris Jette authored
power/cray - Fix bug introduced in 15.08.10 preventing operation in many cases. bug 2628
-
Brian Christiansen authored
Bug 2431
-
Morris Jette authored
-
- 11 Apr, 2016 5 commits
-
-
Morris Jette authored
burst_buffer/cray - Fix case where a script creating or deleting a persistent buffer would fail the "paths" operation and hold the job. bug 2624
-
Danny Auble authored
and it doesn't meet basic requirements.
-
Tim Wickberg authored
Bug 2622.
-
Morris Jette authored
burst_buffer/cray - Decrement job's prolog_running counter if pre_run fails. bug 2621
-
Morris Jette authored
If a job is no longer in configuring state, then clear the prolog_running counter on slurmctld restart or reconfigure. bug 2621
-
- 09 Apr, 2016 1 commit
-
-
Morris Jette authored
When determining when a pending job will be able to start, rather than testing after removing each running job and trying to schedule the pending jobs, remove multiple jobs that all end about the same time before testing. This reduces the number of calls to the job placement logic, which is time consuming.
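
The batching idea can be sketched as follows. This is a simplified Python illustration rather than the scheduler's actual code: `batch_by_end_time`, the `(job_id, end_time)` tuples, and the fixed `window` parameter are assumptions introduced for the example.

```python
from typing import Iterable, List, Tuple


def batch_by_end_time(
    jobs: Iterable[Tuple[str, float]], window: float
) -> List[List[str]]:
    """Group running jobs whose expected end times fall close together.

    `jobs` holds (job_id, end_time) pairs. A new batch starts whenever a
    job ends more than `window` seconds after the first job in the current
    batch. The placement test then runs once per batch, not once per job.
    """
    batches: List[List[str]] = []
    batch_start = None
    for job_id, end in sorted(jobs, key=lambda j: j[1]):
        if batch_start is None or end - batch_start > window:
            batches.append([])
            batch_start = end
        batches[-1].append(job_id)
    return batches
```

With four running jobs ending at times 100, 105, 130, and 200 and a 30-second window, this yields two batches, so the expensive placement logic would run twice instead of four times.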
-
- 08 Apr, 2016 1 commit
-
-
Morris Jette authored
-
- 07 Apr, 2016 2 commits
-
-
Sami Ilvonen authored
-
Morris Jette authored
Fix for job "--contiguous" option that could cause job allocation/launch failure or slurmctld crash. bug 2573
-
- 06 Apr, 2016 7 commits
-
-
Morris Jette authored
-
Danny Auble authored
constraints mattered in a job. Details include: A job doesn't request memory but the system is running with CR_*MEMORY with no default memory limit and the job requests nodes with features of different sizes. Previously the order of constraints mattered where the smaller memory node would need to be requested first or the job would fail. Bug 2608
-
Danny Auble authored
This reverts commit f559a55c.
-
Danny Auble authored
constraints mattered in a job. Details include: A job doesn't request memory but the system is running with CR_*MEMORY with no default memory limit and the job requests nodes with features of different sizes. Previously the order of constraints mattered where the smaller memory node would need to be requested first or the job would fail. Bug 2608
-
Morris Jette authored
Previous logic would get an account and/or QOS time limit and use that value to overwrite the incoming RPC's NO_VAL value, which would change a job's time limit when changing an unrelated field (e.g. priority, QOS, etc.). bug 2610
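
The intended behavior, where a field left at NO_VAL in the update request is not touched, can be sketched like this. The Python below is illustrative only: `apply_update` and the dictionary representation of a job are hypothetical, while `NO_VAL` mirrors the sentinel value Slurm uses for unset 32-bit fields.

```python
NO_VAL = 0xFFFFFFFE  # sentinel meaning "field not set in this request"


def apply_update(job: dict, update: dict) -> dict:
    """Return a copy of `job` with only explicitly set fields applied.

    Any field in `update` that still holds NO_VAL is skipped, so changing
    one field (e.g. priority) cannot silently rewrite another (e.g. the
    time limit) from an account or QOS default.
    """
    changed = dict(job)
    for field, value in update.items():
        if value != NO_VAL:
            changed[field] = value
    return changed
```

For example, updating only the priority of a job with a 60-minute time limit leaves the time limit at 60 in this sketch.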
-
Danny Auble authored
-
Morris Jette authored
bug 2609
-
- 05 Apr, 2016 1 commit
-
-
Morris Jette authored
Fix backfill scheduler race condition that could cause an invalid pointer in the select/cons_res plugin. Bug introduced in 15.08.9, commit efd9d35e. The scenario is as follows:
1. Backfill scheduler is running, then releases locks.
2. Main scheduling loop starts a job "A".
3. Backfill scheduler resumes, finds job "A" in its queue and resets its partition pointer.
4. Job "A" completes and tries to remove its resource allocation record from the select/cons_res data structure, but fails to find it because it is looking in the table for the wrong partition.
5. Job "A" record gets purged from slurmctld.
6. Select/cons_res plugin attempts to operate on the resource allocation data structure, finds a pointer into the now purged data structure of job "A" and aborts or gets SEGV.
Bug 2603
-