1. 16 Jan, 2014 1 commit
  2. 15 Jan, 2014 1 commit
  3. 13 Jan, 2014 2 commits
  4. 08 Jan, 2014 3 commits
  5. 07 Jan, 2014 2 commits
  6. 06 Jan, 2014 2 commits
    • Reset job priority on manual resume · 65d9196c
      Morris Jette authored
      If a job is explicitly suspended, its priority is set to zero.
      This resets the priority when requeued and also documents that
      if the job is requeued (e.g. due to a node failure), then it
      is placed in a held state.
    • Correct job RunTime if requeued from suspend state · bc3d8828
      Morris Jette authored
      Without this patch, the job's RunTime includes its RunTime from
      before its prior suspend (i.e. the job's full RunTime rather than
      just the RunTime of the requeued job).
  7. 27 Dec, 2013 1 commit
    • Fix sched/backfill bug that could starve jobs · 2bae8bd6
      Filip Skalski authored
      Hello,
      
      I think I found another bug in the code (I'm using 2.6.3 but I checked the 2.6.5 and 14.03 versions and it's the same there).
      
      In file sched/backfill/backfill.c:
      
      1)
      _add_reservation function, from line 1172:
      
      if (placed == true) {
              j = node_space[j].next;
              if (j && (end_reserve < node_space[j].end_time)) {
                      /* insert end entry record */
                      i = *node_space_recs;
                      node_space[i].begin_time = end_reserve;
                      node_space[i].end_time = node_space[j].end_time;
                      node_space[j].end_time = end_reserve;
                      node_space[i].avail_bitmap =
                              bit_copy(node_space[j].avail_bitmap);
                      node_space[i].next = node_space[j].next;
                      node_space[j].next = i;
                      (*node_space_recs)++;
              }
              break;
      }
      I drew a picture of the `node_space` state after 2 iterations (see attachment).
      
      If the new reservation lies fully inside another reservation,
      then everything is OK.
      But if the new reservation spans multiple existing reservations, then the `end entry record` is not created.
      This is because only the newly created `start entry record` is checked.
      
      An easy fix would be to change the `if` into a loop, for example:
      
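      if (placed == true) {
              while ((j = node_space[j].next) > 0) {
                      if (end_reserve < node_space[j].end_time) {
                              /* insert end entry record */
                              i = *node_space_recs;
                              node_space[i].begin_time = end_reserve;
                              node_space[i].end_time = node_space[j].end_time;
                              node_space[j].end_time = end_reserve;
                              node_space[i].avail_bitmap =
                                      bit_copy(node_space[j].avail_bitmap);
                              node_space[i].next = node_space[j].next;
                              node_space[j].next = i;
                              (*node_space_recs)++;
                              break;
                      }
              }
              break;
      }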
      
      2)
      You could also change line 612:
              node_space = xmalloc(sizeof(node_space_map_t) *
                                   (max_backfill_job_cnt + 3));
      To `(max_backfill_job_cnt * 2 + 1)`, since each reservation can add at most two entries (the check at line 982 should then never trigger). At the moment, in the worst case, only half of max_backfill_job_cnt jobs are checked.
      
      NOTE: All of this assumes that the current behavior is not deliberate, i.e. that it does not intentionally trade some accuracy for faster calculations (especially point 2).
      
      Best regards,
      Filip Skalski
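The reservation-splitting issue described in this message can be illustrated with a small standalone sketch. It is not the code from sched/backfill/backfill.c: `slot_t`, `split_at`, `add_reservation_sketch`, `avail_at` and `MAX_RECS` are hypothetical names, a plain bitmask stands in for the node bitmap, and the windows are assumed to be contiguous with index 0 as the list head. The point is only that the end entry is searched for across every window the reservation overlaps.

```c
#define MAX_RECS 16

/* Hypothetical, simplified stand-in for a node_space record: a time window
 * [begin_time, end_time), a bitmask of free nodes, and the index of the
 * next window (0 marks the end of the list). */
typedef struct {
    long begin_time;
    long end_time;
    unsigned avail;
    int next;
} slot_t;

/* Split window j at time t: j keeps [begin, t), a new record gets [t, end). */
static void split_at(slot_t *s, int *recs, int j, long t)
{
    int i = (*recs)++;

    s[i].begin_time = t;
    s[i].end_time = s[j].end_time;
    s[j].end_time = t;
    s[i].avail = s[j].avail;
    s[i].next = s[j].next;
    s[j].next = i;
}

/* Reserve the nodes in mask for [start, end). Returns 0 on success, -1 if
 * the request is invalid or the table cannot hold two more records. */
static int add_reservation_sketch(slot_t *s, int *recs,
                                  long start, long end, unsigned mask)
{
    int j = 0;

    if (start >= end || start < s[0].begin_time || *recs + 2 > MAX_RECS)
        return -1;

    /* Find the window containing start. */
    while (s[j].end_time <= start) {
        if (s[j].next == 0)
            return -1;              /* start lies past the last window */
        j = s[j].next;
    }
    if (s[j].begin_time < start) {  /* start entry record */
        split_at(s, recs, j, start);
        j = s[j].next;
    }

    /* Walk every window the reservation overlaps, not just the first. */
    for (;;) {
        if (end < s[j].end_time)    /* end entry record */
            split_at(s, recs, j, end);
        s[j].avail &= ~mask;
        if (s[j].end_time >= end || s[j].next == 0)
            break;
        j = s[j].next;
    }
    return 0;
}

/* Free-node mask at time t, or 0 if t is outside every window. */
static unsigned avail_at(const slot_t *s, long t)
{
    int j = 0;

    do {
        if (t >= s[j].begin_time && t < s[j].end_time)
            return s[j].avail;
        j = s[j].next;
    } while (j != 0);
    return 0;
}
```

In this sketch each call adds at most two records (one start entry, one end entry), which is the same bound the message relies on when it suggests sizing the table as `max_backfill_job_cnt * 2 + 1`.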
  8. 23 Dec, 2013 2 commits
  9. 20 Dec, 2013 2 commits
  10. 19 Dec, 2013 1 commit
    • scontrol show job - Correct NumNodes value · b31e2176
      Morris Jette authored
      NumNodes now reports an improved calculated value for pending
      jobs and the actual node count for jobs that have been
      started (including suspended, completed, etc.)
      bug 549
  11. 18 Dec, 2013 1 commit
  12. 17 Dec, 2013 2 commits
  13. 16 Dec, 2013 1 commit
  14. 14 Dec, 2013 1 commit
  15. 13 Dec, 2013 2 commits
  16. 12 Dec, 2013 1 commit
    • slurmstepd variable initialization · 06b41cdc
      Morris Jette authored
      Without this patch, free() is called on a random memory location
      (i.e. whatever is on the stack), which can result in slurmstepd
      dying and a completed job not being purged in a timely fashion.
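The class of bug described here (calling free() on a pointer that was never assigned) can be shown with a minimal sketch. The functions `parse_value` and `value_length` are hypothetical and are not taken from slurmstepd; they only illustrate why the pointer needs an initial value when an error path can skip the assignment.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores a heap copy of the text after '=' in *out and
 * returns 0, or returns -1 and leaves *out untouched if there is no '='. */
static int parse_value(const char *line, char **out)
{
    const char *eq = strchr(line, '=');

    if (!eq)
        return -1;
    *out = malloc(strlen(eq + 1) + 1);
    if (!*out)
        return -1;
    strcpy(*out, eq + 1);
    return 0;
}

/* Returns the length of the parsed value, or -1 on a parse failure. */
static int value_length(const char *line)
{
    char *val = NULL;   /* without "= NULL", the error path below would
                         * pass whatever is on the stack to free() */
    int len = -1;

    if (parse_value(line, &val) == 0)
        len = (int) strlen(val);
    free(val);          /* free(NULL) is a defined no-op */
    return len;
}
```

Initializing at the declaration keeps the single free() at the end valid on both the success and the failure path.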
  17. 11 Dec, 2013 2 commits
  18. 09 Dec, 2013 2 commits
    • Modify squeue to support longer job ID values · 17f27007
      Morris Jette authored
      This is needed for job arrays with discontiguous task ID values
      (e.g. "123_[1,3,5,...99999]")
    • Improve sview support for job arrays · d998640f
      Morris Jette authored
      Previously job arrays were only listed with their native job ID
      (e.g. 123_0 listed as 123, 123_1 as 124, etc). Now lists the job ID
      using both formats (e.g. "123_1 (124)"). The same format is used
      for job step IDs (e.g. "123_1.2 (124.2)").
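The combined ID format quoted in this message can be sketched with a small formatter. `format_job_id` and its parameters are hypothetical and do not reflect sview's actual code; the sketch assumes `array_job_id == 0` means the job is not part of an array and `step_id < 0` means there is no step.

```c
#include <stdio.h>

/* Hypothetical formatter for the "array ID (native ID)" display.
 * array_job_id == 0: job is not an array element (assumption of this sketch).
 * step_id < 0: no job step (assumption of this sketch). */
static void format_job_id(char *buf, size_t len, unsigned job_id,
                          unsigned array_job_id, unsigned array_task_id,
                          int step_id)
{
    if (array_job_id == 0) {
        if (step_id < 0)
            snprintf(buf, len, "%u", job_id);
        else
            snprintf(buf, len, "%u.%d", job_id, step_id);
    } else if (step_id < 0) {
        snprintf(buf, len, "%u_%u (%u)",
                 array_job_id, array_task_id, job_id);
    } else {
        snprintf(buf, len, "%u_%u.%d (%u.%d)",
                 array_job_id, array_task_id, step_id, job_id, step_id);
    }
}
```

For example, native job 124 that is task 1 of array 123 would be rendered as "123_1 (124)", and its step 2 as "123_1.2 (124.2)".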
  19. 08 Dec, 2013 1 commit
  20. 07 Dec, 2013 2 commits
  21. 06 Dec, 2013 2 commits
  22. 05 Dec, 2013 1 commit
  23. 04 Dec, 2013 1 commit
  24. 03 Dec, 2013 3 commits
    • Improve REQUEST_JOB_INFO_SINGLE RPC performance · 80d3b343
      Morris Jette authored
      Use hash function to locate job records for improved performance.
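As a rough illustration of why a hash lookup helps here, the sketch below finds a job record by ID through a bucket chain instead of scanning a full job list. All names (`job_rec_t`, `job_hash_tbl`, `job_hash_add`, `job_hash_find`, `JOB_HASH_SIZE`) are hypothetical and the modulo hash is an assumption; the commit does not say how the real table is organized.

```c
#include <stddef.h>
#include <stdint.h>

#define JOB_HASH_SIZE 64    /* hypothetical table size */

typedef struct job_rec {
    uint32_t job_id;
    struct job_rec *hash_next;  /* next record in the same bucket */
} job_rec_t;

static job_rec_t *job_hash_tbl[JOB_HASH_SIZE];

/* Map a job ID to a bucket index (simple modulo hash, an assumption). */
static unsigned job_hash_inx(uint32_t job_id)
{
    return job_id % JOB_HASH_SIZE;
}

/* Insert a record at the head of its bucket chain. */
static void job_hash_add(job_rec_t *job)
{
    unsigned inx = job_hash_inx(job->job_id);

    job->hash_next = job_hash_tbl[inx];
    job_hash_tbl[inx] = job;
}

/* Return the record with this job ID, or NULL if it is not in the table.
 * Only the records sharing one bucket are visited. */
static job_rec_t *job_hash_find(uint32_t job_id)
{
    job_rec_t *job = job_hash_tbl[job_hash_inx(job_id)];

    while (job && job->job_id != job_id)
        job = job->hash_next;
    return job;
}
```

A single-job request then costs one bucket walk, whereas a linear scan grows with the total number of job records.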
    • Improve REQUEST_JOB_INFO_SINGLE RPC performance · 14bcfe58
      Morris Jette authored
      Change partition write lock to a read lock as we use a different
      mechanism for hidden partitions in getting individual jobs.
    • Correct job dependency string · 08265c03
      Morris Jette authored
      Correct logic returning remaining job dependencies in job information
      reported by scontrol and squeue. Eliminates vestigial descriptors with
      no job ID values (e.g. "afterany"). As dependencies were removed, the
      job ID values were removed from the strings, but not the descriptors.
      This eliminates both. It also checks the full job ID to make sure we do
      not remove "afterany:1234" when job "123" completes.
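The two behaviors described in this message (dropping descriptors that have no job IDs left, and comparing whole job IDs rather than prefixes) can be sketched as follows. `remove_dependency` is a hypothetical function, not the Slurm implementation; it assumes a comma-separated list of "type:id[:id...]" items with plain decimal IDs, and an output buffer at least as large as the input string.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Copy dep into out without done_id, dropping any descriptor that is left
 * with no job IDs. out must hold at least strlen(dep) + 1 bytes; the result
 * is never longer than the input because text is only removed. */
static void remove_dependency(const char *dep, unsigned long done_id,
                              char *out)
{
    size_t used = 0;

    out[0] = '\0';
    while (*dep) {
        size_t item_len = strcspn(dep, ",");   /* one "type:id:id" item */
        size_t type_len = strcspn(dep, ":,");  /* descriptor name only */
        const char *item_end = dep + item_len;
        const char *p = dep + type_len;
        int kept = 0;

        if (item_len > 0 && type_len == item_len) {
            /* Item without job IDs to begin with: keep it verbatim. */
            used += (size_t) sprintf(out + used, "%s%.*s",
                                     used ? "," : "", (int) item_len, dep);
        }
        while (p < item_end && *p == ':') {
            char *id_end;
            unsigned long id = strtoul(p + 1, &id_end, 10);

            if (id_end == p + 1)
                break;                  /* no digits: stop on malformed input */
            if (id != done_id) {        /* numeric compare: 1234 != 123 */
                if (!kept)
                    used += (size_t) sprintf(out + used, "%s%.*s",
                                             used ? "," : "",
                                             (int) type_len, dep);
                used += (size_t) sprintf(out + used, "%.*s",
                                         (int) (id_end - p), p);
                kept = 1;
            }
            p = id_end;
        }
        dep = item_end;
        if (*dep == ',')
            dep++;
    }
}
```

Parsing each ID to a number before comparing is what prevents a completed job 123 from matching inside "1234", which a substring search on the dependency string would do.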
  25. 02 Dec, 2013 1 commit
    • fix race condition in batch exit code · 6d1d932b
      Morris Jette authored
      Fix race condition on batch job termination that could result in a job exit
      code of 0xfffffffe if the slurmd on node zero registers its active jobs at
      the same time that slurmstepd is recording the job's exit code.
      bug 535