1. 07 Nov, 2012 3 commits
Modify default log timestamp to conform to RFC 5424 format · 4b941731
      Janne Blomqvist authored
The attached patch changes the default timestamp format in logfiles to conform to RFC 5424 (the current version of the syslog RFC). It is identical to the current default "ISO 8601" timestamp used by slurm, with the exception that the timezone offset is appended. This has several benefits:
      
      1) It's unambiguous.
      
      2) Avoids potential confusion for admins running cluster(s) in different timezones.
      
3) Might help debug issues related to DST transitions. (More on that later.)
      
(To be pedantic, an RFC 5424 timestamp is still a valid ISO 8601 timestamp, but the converse is not necessarily true. Hence RFC 3339, a "profile" of ISO 8601, i.e. a subset, recommended for internet protocols. The RFC 5424 timestamp is, in turn, a subset of the RFC 3339 timestamps.)
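
For illustration, here is a minimal standalone C sketch (not Slurm's actual logging code) that produces such a timestamp; note that strftime's %z yields "+0200" while RFC 5424 requires a colon in the offset ("+02:00"):

    /* Minimal sketch, not Slurm's actual logging code: format the local
     * time as an RFC 5424 timestamp, i.e. ISO 8601 with a numeric UTC
     * offset containing a colon ("+02:00" rather than strftime's "+0200"). */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static void rfc5424_timestamp(char *buf, size_t len)
    {
        time_t now = time(NULL);
        struct tm tm;
        char zone[8];
        size_t n;

        localtime_r(&now, &tm);
        strftime(buf, len, "%Y-%m-%dT%H:%M:%S", &tm);  /* date and time */
        strftime(zone, sizeof(zone), "%z", &tm);       /* e.g. "+0200"  */
        n = strlen(buf);
        /* insert the colon RFC 5424 requires: "+02" + ":" + "00" */
        snprintf(buf + n, len - n, "%.3s:%s", zone, zone + 3);
    }

    int main(void)
    {
        char ts[64];
        rfc5424_timestamp(ts, sizeof(ts));
        printf("%s\n", ts);  /* e.g. 2012-11-07T14:30:00+02:00 */
        return 0;
    }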
      
The previous behavior can be restored by running configure with the
      
      --disable-rfc5424time
      
      flag.
      BGQ - validate correct ntasks_per_node · 7eb1a451
      Danny Auble authored
BGQ - Fix issue when running srun outside of an allocation and only specifying the number of tasks and not the number of nodes. · 9e25da94
Danny Auble authored
  2. 05 Nov, 2012 2 commits
  3. 02 Nov, 2012 3 commits
  4. 31 Oct, 2012 1 commit
  5. 29 Oct, 2012 2 commits
      Fix bug with topology/tree and job with min-max node count. · e15cab3f
      Morris Jette authored
Now try to get the maximum node count rather than minimizing the number of leaf switches used.
      For example, if each leaf switch has 8 nodes then a request for -N4-16
      would allocate 8 nodes (one leaf switch) rather than 16 nodes over two
      leaf switches.
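
For illustration, a hedged sketch of the revised preference (assumed helper name; the real topology/tree selection logic is far more involved):

    /* Illustrative sketch, not the actual topology/tree code: for a
     * min-max request, prefer the largest node count the system can
     * supply instead of stopping at the minimum that fits one switch. */
    #include <stdio.h>

    static int pick_node_count(int min_nodes, int max_nodes, int avail_nodes)
    {
        int n = (avail_nodes < max_nodes) ? avail_nodes : max_nodes;
        return (n >= min_nodes) ? n : -1;  /* -1: request cannot be met */
    }

    int main(void)
    {
        /* -N4-16 with two free 8-node leaf switches: the old code took
         * 8 nodes (one switch); the new code takes all 16. */
        printf("%d nodes\n", pick_node_count(4, 16, 16));
        return 0;
    }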
      Cray - Prevent calling basil_confirm more than once per job using a flag. · faa96d55
      Morris Jette authored
Anyhow, after applying the patch, I was still running into the same difficulty. Upon a closer look, I saw that I was still receiving the ALPS backend error in the slurmctld.log file. When I examined the code pertaining to this and ran some SLURM-independent tests, I found that we were executing the do_basil_confirm function multiple times in the cases where it would fail. My independent tests show precisely the same behaviour; that is, if you make a reservation request, successfully confirm it, and then attempt to confirm it again, you receive this error message. However, the "apstat -rvv" command shows that the ALPS reservation is fine, so I concluded that this particular ALPS/BASIL message is more of an informational one and not a "show-stopper." In other words, I can consider the node ready at this point.
As a simple workaround, I currently just inserted an if-block immediately after the call to "basil_confirm" in function "do_basil_confirm" in ".../src/plugins/select/cray/basil_interface.c." The if-statement checks for "BE_BACKEND"; if that is the result, it prints an informational message to slurmctld.log and sets the variable rc=0 so that we can consider the node ready. This now allows my prolog scripts to run, and I can clearly see the SLURM message that I had placed in that if-block.
However, I am not certain we should really just allow this error code to pass through, as it seems like a fairly generic code and there could be various other causes of it that we would not wish to let pass. I really only want to limit the number of calls to basil_confirm to one. Perhaps I could add a field to the job_record so that I can mark whether the ALPS reservation has been confirmed or not.
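
For illustration, a minimal self-contained sketch of the flag-based fix named in the commit title; the struct field and stub below are assumptions, not Slurm's actual code:

    /* Minimal self-contained sketch (assumed names, not Slurm's actual
     * code) of the fix in the commit title: record confirmation in the
     * job record so the BASIL CONFIRM request is issued at most once. */
    #include <stdio.h>
    #include <stdbool.h>

    struct job_record {
        unsigned int job_id;
        bool alps_confirmed;  /* hypothetical flag per the commit message */
    };

    /* stand-in for the real basil_confirm() BASIL call */
    static int basil_confirm_stub(unsigned int job_id)
    {
        printf("CONFIRM sent to ALPS for job %u\n", job_id);
        return 0;
    }

    static int do_basil_confirm(struct job_record *job)
    {
        if (job->alps_confirmed)
            return 0;  /* already confirmed: skip the ALPS round-trip */
        if (basil_confirm_stub(job->job_id) != 0)
            return -1;
        job->alps_confirmed = true;
        return 0;
    }

    int main(void)
    {
        struct job_record job = { 1234, false };
        do_basil_confirm(&job);  /* issues the confirm */
        do_basil_confirm(&job);  /* no-op thanks to the flag */
        return 0;
    }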
  6. 26 Oct, 2012 2 commits
  7. 25 Oct, 2012 2 commits
  8. 24 Oct, 2012 1 commit
  9. 23 Oct, 2012 1 commit
  10. 22 Oct, 2012 3 commits
  11. 19 Oct, 2012 1 commit
  12. 18 Oct, 2012 6 commits
  13. 17 Oct, 2012 2 commits
  14. 16 Oct, 2012 2 commits
  15. 05 Oct, 2012 2 commits
  16. 04 Oct, 2012 1 commit
  17. 02 Oct, 2012 2 commits
Correct --mem-per-cpu logic for multiple threads per core · 6a103f2e
      Morris Jette authored
      See bugzilla bug 132
      
      When using select/cons_res and CR_Core_Memory, hyperthreaded nodes may be
      overcommitted on memory when CPU counts are scaled. I've tested 2.4.2 and HEAD
      (2.5.0-pre3).
      
      Conditions:
      -----------
      * SelectType=select/cons_res
      * SelectTypeParameters=CR_Core_Memory
      * Using threads
        - Ex. "NodeName=linux0 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
      RealMemory=400"
      
      Description:
      ------------
      In the cons_res plugin, _verify_node_state() in job_test.c checks if a node has
      sufficient memory for a job. However, the per-CPU memory limits appear to be
      scaled by the number of threads. This new value may exceed the available memory
      on the node. And, once a node is overcommitted on memory, future memory checks
      in _verify_node_state() will always succeed.
      
      Scenario to reproduce:
      ----------------------
      With the example node linux0, we run a single-core job with 250MB/core
          srun --mem-per-cpu=250 sleep 60
      
      cons_res checks that it will fit: ((real - alloc) >= job mem)
          ((400 - 0) >= 250) and the job starts
      
      Then, the memory requirement is doubled:
          "slurmctld: error: cons_res: node linux0 memory is overallocated (500) for
      job X"
          "slurmd: scaling CPU count by factor of 2"
      
      This job should not have started
      
      While the first job is still running, we submit a second, identical job
          srun --mem-per-cpu=250 sleep 60
      
      cons_res checks that it will fit:
    ((400 - 500) >= 250), the unsigned int wraps, the test passes, and the job starts
      
      This second job also should not have started
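
The wrap-around is easy to reproduce in isolation. A standalone illustration (assumed variable names; not the actual _verify_node_state() code):

    /* Illustrative only: shows how the unsigned subtraction in the
     * memory check wraps around once a node is overcommitted. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t real_mem = 400, alloc_mem = 500, job_mem = 250;

        /* Buggy form: 400 - 500 wraps to ~4 billion, so the test passes. */
        if (real_mem - alloc_mem >= job_mem)
            printf("buggy check: job \"fits\" due to wrap-around\n");

        /* Safer form: rearrange so nothing can underflow. */
        if (alloc_mem + job_mem <= real_mem)
            printf("safe check: job fits\n");
        else
            printf("safe check: node overcommitted, job rejected\n");
        return 0;
    }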
  18. 27 Sep, 2012 3 commits
  19. 25 Sep, 2012 1 commit