1. 14 Jan, 2013 5 commits
    • Prevent srun abort on task launch failure · 163d9547
      Hongjia Cao authored
      On job step launch failure, the function
      "slurm_step_launch_wait_finish()" is called twice in launch/slurm,
      which causes srun to abort:
      
      srun: error: Task launch for 22495.0 failed on node cn6: Job credential expired
      srun: error: Application launch failed: Job credential expired
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      cn5
      cn4
      cn7
      srun: error: Timed out waiting for job step to complete
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: error: Timed out waiting for job step to complete
      srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
      Aborted (core dumped)
      
      The attached patch (against version 2.5.1) fixes the abort, but the messages
      "
      Job step aborted: Waiting up to 2 seconds for job step to finish.
      Timed out waiting for job step to complete
      "
      will still be printed twice.
    • Add debugging hint to MPI guide for MPICH2 · dd8c22c7
      Morris Jette authored
    • Fix bug in accounting_storage/pgsql · 667cbf15
      Yair Yarom authored
    • 08cfbf0a
      Morris Jette authored
    • Revision of gres topology bug fix · e9c216c4
      Morris Jette authored
  2. 11 Jan, 2013 6 commits
  3. 10 Jan, 2013 7 commits
  4. 09 Jan, 2013 6 commits
  5. 08 Jan, 2013 6 commits
    • b50e2269
      Danny Auble authored
    • jette authored
    • Disable a test for select/serial plugin · 9bc0cf0b
      jette authored
    • Morris Jette authored
    • Report node state as MAINT only if not allocated jobs · 2af5ce33
      Rod Schultz authored
      One of our testers has observed that when a long-running job continues to run after a maintenance reservation comes into effect, sinfo reports the node as being in the allocated state, while scontrol shows it to be in the maintenance state.

      This can happen when a node is not completely allocated (select/cons_res, a partition that is not Shared=EXCLUSIVE, jobs allocated without --exclusive, or jobs that are allocated only some of the CPUs on a node).

      In scontrol, the execution paths leading up to calls to node_state_string (slurm_protocol_defs.c) or node_state_string_compact test for allocated_cpus less than total_cpus on the node and set the node state to MIXED rather than ALLOCATED, while the similar paths in sinfo do not.

      I think this is probably a bug, since the MIXED state is defined, and it is desirable that both commands return the same result.

      The problem can be fixed with two logic changes (in multiple places):

      1) node_state_string and node_state_string_compact have to check for MIXED as well as ALLOCATED before returning the MAINT state. This means that the reported state for the node with the allocated job will be MIXED.

      2) sinfo must also check for allocated_cpus less than total_cpus and set the state to MIXED before calling either node_state_string or node_state_string_compact.
      
      The attached patch (against 2.5.1) makes these changes. The attached script is a test case.
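
      The two logic changes described above can be illustrated with a minimal sketch. This is not the attached patch and does not use Slurm's real data structures: the struct, the enum, and the function names effective_base_state() and state_string() are hypothetical stand-ins, and the sketch assumes a node is either idle, partly allocated, or fully allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified node record; NOT Slurm's real node structure. */
enum base_state { STATE_IDLE, STATE_ALLOCATED, STATE_MIXED };

struct node {
    unsigned alloc_cpus;   /* CPUs currently allocated to jobs */
    unsigned total_cpus;   /* CPUs configured on the node */
    bool maint;            /* true while a maintenance reservation is active */
};

/* Change 2 (assumed form): derive MIXED from the CPU counts before
 * asking for the state string, so every caller agrees on the state. */
static enum base_state effective_base_state(const struct node *n)
{
    assert(n->alloc_cpus <= n->total_cpus);
    if (n->alloc_cpus == 0)
        return STATE_IDLE;
    if (n->alloc_cpus < n->total_cpus)
        return STATE_MIXED;
    return STATE_ALLOCATED;
}

/* Change 1 (assumed form): report MAINT only when the node is neither
 * ALLOCATED nor MIXED, i.e. when no job is running on it. */
static const char *state_string(const struct node *n)
{
    enum base_state s = effective_base_state(n);

    if (n->maint && s != STATE_ALLOCATED && s != STATE_MIXED)
        return "MAINT";

    switch (s) {
    case STATE_MIXED:
        return "MIXED";
    case STATE_ALLOCATED:
        return "ALLOCATED";
    default:
        return "IDLE";
    }
}
```

      Under these assumptions, a node with 2 of 8 CPUs allocated during a maintenance reservation is reported as MIXED, and the same node with no allocated CPUs is reported as MAINT.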
    • Morris Jette authored
  6. 03 Jan, 2013 8 commits
  7. 28 Dec, 2012 2 commits