  1. 09 Nov, 2012 1 commit
  2. 08 Nov, 2012 2 commits
  3. 07 Nov, 2012 5 commits
  4. 05 Nov, 2012 2 commits
  5. 02 Nov, 2012 3 commits
  6. 31 Oct, 2012 1 commit
  7. 29 Oct, 2012 2 commits
    • Fix bug with topology/tree and job with min-max node count. · e15cab3f
      Morris Jette authored
      Now try to satisfy the job's maximum node count rather than minimizing
      the number of leaf switches used. For example, if each leaf switch has
      8 nodes, a request for -N4-16 would previously allocate 8 nodes (one
      leaf switch) rather than 16 nodes across two leaf switches.
    • Cray - Prevent calling basil_confirm more than once per job using a flag. · faa96d55
      Morris Jette authored
          Anyhow, after applying the patch, I was still running into the same difficulty.  Upon a closer look, I saw that I was still receiving the ALPS backend error in the slurmctld.log file.  When I examined the code pertaining to this and ran some SLURM-independent tests, I found that we were executing the do_basil_confirm function multiple times in the cases where it would fail.  My independent tests show precisely the same behaviour: if you make a reservation request, successfully confirm it, and then attempt to confirm it again, you receive this error message.  However, the "apstat -rvv" command shows that the ALPS reservation is fine, so I concluded that this particular ALPS/BASIL message is informational rather than a show-stopper.  In other words, I can consider the node ready at this point.
          As a simple workaround, I inserted an if-block immediately after the call to "basil_confirm" in function "do_basil_confirm" in ".../src/plugins/select/cray/basil_interface.c".  The if-statement checks for "BE_BACKEND"; if that is the result, it prints an informational message to slurmctld.log and sets rc = 0 so that the node can be considered ready.  This now allows my prolog scripts to run, and I can clearly see the SLURM message that I placed in that if-block.
          However, I am not certain that we should simply let this error code pass through, as it seems to be a fairly generic code and there could be other causes of it that we would not wish to let pass.  I really only want to limit the number of calls to basil_confirm to one.  Perhaps I could add a field to the job_record so that I can mark whether the ALPS reservation has been confirmed or not.
  8. 26 Oct, 2012 2 commits
  9. 25 Oct, 2012 2 commits
  10. 24 Oct, 2012 1 commit
  11. 23 Oct, 2012 1 commit
  12. 22 Oct, 2012 3 commits
  13. 19 Oct, 2012 1 commit
  14. 18 Oct, 2012 6 commits
  15. 17 Oct, 2012 2 commits
  16. 16 Oct, 2012 2 commits
  17. 05 Oct, 2012 2 commits
  18. 04 Oct, 2012 1 commit
  19. 02 Oct, 2012 1 commit
    • Correct --mem-per-cpu logic for multiple threads per core · 6a103f2e
      Morris Jette authored
      See bugzilla bug 132
      
      When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
      overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
      (2.5.0-pre3).
      
      Conditions:
      -----------
      * SelectType=select/cons_res
      * SelectTypeParameters=CR_Core_Memory
      * Using threads
        - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=400"
      
      Description:
      ------------
      In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has
      sufficient memory for a job. However, the per-CPU memory limits appear to be
      scaled by the number of threads. This new value may exceed the available memory
      on the node. And, once a node is overcommitted on memory, future memory checks
      in _verify_node_state() will always succeed.
      
      Scenario to reproduce:
      ----------------------
      With the example node linux0, we run a single-core job with 250MB/core
          srun --mem-per-cpu=250 sleep 60
      
      cons_res checks that it will fit: ((real - alloc) >= job mem)
          ((400 - 0) >= 250) and the job starts
      
      Then, the memory requirement is doubled:
          "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for job X"
          "slurmd: scaling CPU count by factor of 2"
      
      This job should not have started.
      
      While the first job is still running, we submit a second, identical job
          srun --mem-per-cpu=250 sleep 60
      
      cons_res checks that it will fit:
          ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job starts
      
      This second job also should not have started.