1. 03 Apr, 2014 1 commit
    • launch/poe - fix network value · 01fecf4d
      Morris Jette authored
      If a job step's network value is set by poe, either by executing poe
      directly or by srun launching poe, that value was not being
      propagated to the job step creation RPC, so the network was not
      being set up for the proper protocol (e.g. mpi, lapi, pami).
      The previous logic worked only if the srun command line
      explicitly set the protocol using the --network option.
      01fecf4d
  2. 02 Apr, 2014 1 commit
  3. 31 Mar, 2014 2 commits
  4. 28 Mar, 2014 3 commits
  5. 27 Mar, 2014 3 commits
  6. 26 Mar, 2014 2 commits
  7. 25 Mar, 2014 2 commits
  8. 24 Mar, 2014 4 commits
    • Added sacctmgr mod qos set RawUsage=0 · f7fb80ec
      Danny Auble authored
      f7fb80ec
    • Add job array hash table · ac7fabc6
      Morris Jette authored
      Previous logic would typically do a list search to find job array elements.
      This commit adds two hash tables for job arrays. The first is based upon
      the "base" job ID which is common to all tasks. The second hash table
      is based upon the sum of the "base" job ID plus the task ID in the array.
      This will substantially improve performance for handling dependencies
      with job arrays.
      ac7fabc6
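
The two-table lookup described in this commit can be sketched as follows. This is a minimal illustration, not the slurmctld code: the class name, method names, and tuple records are hypothetical, and Python dictionaries stand in for the hash tables.

```python
from collections import defaultdict


class JobArrayIndex:
    """Illustrative two-table index for job array records (hypothetical names)."""

    def __init__(self):
        # Table 1: keyed by the "base" job ID shared by all tasks of an array.
        self.by_base = defaultdict(list)
        # Table 2: keyed by the sum of the base job ID and the task ID.
        self.by_sum = defaultdict(list)

    def add(self, base_id, task_id):
        rec = (base_id, task_id)
        self.by_base[base_id].append(rec)
        self.by_sum[base_id + task_id].append(rec)

    def find_array(self, base_id):
        """Return every task record belonging to one job array."""
        return list(self.by_base.get(base_id, []))

    def find_task(self, base_id, task_id):
        """Return one task record, or None if it is not indexed."""
        # Different arrays can yield the same sum, so compare both fields.
        for rec in self.by_sum.get(base_id + task_id, []):
            if rec == (base_id, task_id):
                return rec
        return None
```

A dependency on a whole array would use `find_array`, while a dependency on a single element would use `find_task`; neither needs to walk the full job list.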
    • job array dependency recovery fix · fca71890
      Morris Jette authored
      When slurmctld restarted, it would not recover dependencies on
      job array elements and would just discard the dependency. This
      corrects the parsing problem to recover the dependency. The old code
      would print a message like this and discard it:
      slurmctld: error: Invalid dependencies discarded for job 51: afterany:47_*
      fca71890
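
To show what recovering such a dependency involves, here is a rough sketch of parsing a single dependency string that may name a job array element. Only the `afterany:47_*` form comes from the log message above. The regular expression, function name, and return shape are assumptions, and this is not the slurmctld parser.

```python
import re

# Assumed shape: "<type>:<job_id>", optionally followed by "_<task_id>" or "_*".
_DEP_RE = re.compile(r"^(?P<type>[a-z]+):(?P<job>\d+)(?:_(?P<task>\d+|\*))?$")


def parse_dependency(spec):
    """Return (type, job_id, task) or None if the string is not recognized.

    task is an int for one array element, "*" for all elements,
    and None when no array suffix is present.
    """
    m = _DEP_RE.match(spec)
    if m is None:
        return None
    task = m.group("task")
    if task is not None and task != "*":
        task = int(task)
    return (m.group("type"), int(m.group("job")), task)
```

With this sketch, `afterany:47_*` is kept as a dependency on every element of array 47 instead of being rejected as invalid.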
  9. 22 Mar, 2014 1 commit
    • Fix sview abort when adding/removing columns · fbfd0e4d
      Morris Jette authored
      Adding or removing columns for most data types (jobs, partitions,
      nodes, etc.) generates an abort on some system types. This appears
      to be because changing the displayed columns changes the address of
      "model" on some systems, while on others (like my laptops) the
      address does not change. This fix explicitly sets last_model to
      NULL when the columns are changed rather than relying upon the data
      structure's address changing.
      fbfd0e4d
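
The idea behind this fix can be sketched as below. It is an illustration only: sview is a C/GTK program, and the class and method names here are hypothetical. Object identity stands in for the address comparison on "model".

```python
class ColumnView:
    """Illustrative cache that rebuilds its display when the model changes."""

    def __init__(self):
        self.last_model = None
        self.rebuilds = 0

    def columns_changed(self):
        # Explicit reset, rather than assuming the model's address will change.
        self.last_model = None

    def refresh(self, model):
        # Rebuild only when the cached model differs from the current one.
        if self.last_model is not model:
            self.rebuilds += 1
            self.last_model = model
```

If the model keeps the same identity after a column change, only the explicit reset in `columns_changed` forces the rebuild.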
  10. 21 Mar, 2014 4 commits
    • NRT - Fix issue with 1 node jobs. It turns out the network does need to · 440932df
      Danny Auble authored
      be set up for 1 node jobs.  Here are some of the reasons from IBM...
      
      1. PE expects it.
      2. For failover, if there was some challenge or difficulty with the
         shared-memory method of data transfer, the protocol stack might
         want to go through the adapter instead.
      3. For flexibility, the protocol stack might want to be able to transfer
         data using some variable combination of shared memory and adapter-based
         communication, and
      4. Possibly most important, for overall performance, it might be that
         bandwidth or efficiency (BW per CPU cycles) might be better using the
         adapter resources.  (An obvious case is for large messages, it might
         require a lot fewer CPU cycles to program the DMA engines on the
         adapter to move data between tasks, rather than depend on the CPU
         to move the data with loads and stores, or page re-mapping -- and
         a DMA engine might actually move the data more quickly, if it's well
         integrated with the memory system, as it is in the P775 case.)
      440932df
    • get implicit MPMD task count from config file · 718c8479
      Morris Jette authored
      If srun is invoked with the --multi-prog option but no task count, then
      use the task count provided in the MPMD configuration file.
      718c8479
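
One way to derive a task count from an MPMD configuration file is sketched below. It assumes the count is the highest task rank named in the file plus one. That rule and the function name are assumptions made for illustration, not a description of what srun actually does.

```python
def implicit_task_count(config_text):
    """Return highest task rank + 1 found in MPMD config text (assumed rule)."""
    highest = -1
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ranks = line.split(None, 1)[0]  # first field lists the task ranks
        if ranks == "*":
            continue  # the wildcard names no specific rank
        for part in ranks.split(","):
            # "1-3" is a range; a single rank has no "-".
            last = part.split("-")[-1]
            highest = max(highest, int(last))
    return highest + 1
```

For a file with the lines `0 ./master` and `1-3 ./worker`, the sketch yields a task count of 4.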
    • Added scontrol errnumstr command · 04bd1b88
      Morris Jette authored
      04bd1b88
  11. 20 Mar, 2014 3 commits
  12. 19 Mar, 2014 2 commits
  13. 18 Mar, 2014 4 commits
  14. 17 Mar, 2014 4 commits
  15. 16 Mar, 2014 3 commits
    • Export "SLURM*" env vars if --export=NONE · 9b4f3634
      Morris Jette authored
      Previously, if the sbatch --export=NONE option was used, several
      Slurm environment variables (SLURM_SUBMIT_DIR, SLURM_SUBMIT_HOST,
      SLURM_JOB_NAME, etc.) were not propagated from the sbatch command.
      9b4f3634
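
The behavior described here can be sketched as a filter over the submission environment. The function name and the prefix test are assumptions taken from the commit title's "SLURM*" pattern, not from the sbatch source.

```python
def build_job_env(submit_env, export_none):
    """Return the environment passed to the job (illustrative only)."""
    if not export_none:
        return dict(submit_env)
    # With --export=NONE, user variables are dropped but SLURM* ones are kept.
    return {k: v for k, v in submit_env.items() if k.startswith("SLURM")}
```

With `export_none=True`, a variable such as `SLURM_SUBMIT_DIR` survives while `PATH` or `HOME` does not.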
    • schedule enhancement for reservation · 08f0f57c
      Morris Jette authored
      Scheduler enhancements for reservations: When a job needs to run in a
      reservation but cannot due to busy resources, do not block all jobs
      in that partition from being scheduled; block only the jobs in that
      reservation.
      08f0f57c
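
A much simplified sketch of this scheduling rule follows. The job dictionaries, the `can_start` callback, and the assumption that a stuck job without a reservation blocks its whole partition are all made up for illustration. The real scheduler is considerably more involved.

```python
def schedule(jobs, can_start):
    """Walk jobs in priority order and return the IDs that were started."""
    blocked_partitions = set()
    blocked_reservations = set()
    started = []
    for job in jobs:
        part = job["partition"]
        resv = job.get("reservation")
        if part in blocked_partitions or resv in blocked_reservations:
            continue
        if can_start(job):
            started.append(job["id"])
        elif resv is not None:
            # Only later jobs in the same reservation are held back.
            blocked_reservations.add(resv)
        else:
            # Assumed: a stuck non-reservation job blocks its partition.
            blocked_partitions.add(part)
    return started
```

In this sketch, a stuck reservation job holds back only later jobs in the same reservation, so other jobs in the partition can still start.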
    • Reset node's CpuLoad more frequently · fae55cbe
      Morris Jette authored
      Reset a node's CpuLoad value at least once each SlurmdTimeout seconds.
      Previously the value would not be reset unless communications with the
      slurmd did not happen for at least 1/3 of the SlurmdTimeout value.
      That means nodes that were actively running and terminating jobs would
      not get the CpuLoad value reset in a timely fashion. Added a CpuLoad
      reset timer to prevent this.
      fae55cbe
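
The timing rule described above can be sketched as a single predicate. The function and parameter names are hypothetical, and how the two conditions are combined is an assumption drawn from the commit message rather than from the slurmctld source.

```python
def should_refresh_cpu_load(now, last_contact, last_load_reset, slurmd_timeout):
    """Decide whether a node's CpuLoad should be refreshed (illustrative)."""
    # Old condition: no communication with slurmd for 1/3 of SlurmdTimeout.
    idle_too_long = (now - last_contact) >= slurmd_timeout / 3
    # Added timer: refresh at least once every SlurmdTimeout seconds.
    load_is_stale = (now - last_load_reset) >= slurmd_timeout
    return idle_too_long or load_is_stale
```

A busy node that talks to the controller constantly never satisfies the first condition, so the second one is what keeps its CpuLoad current.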
  16. 15 Mar, 2014 1 commit
    • retry slurm.conf file · 42081d87
      Morris Jette authored
      Add logic to sleep and retry if slurm.conf can't be read.
      Without this, the slurmd daemons may die and, when the SlurmdTimeout
      is reached, the nodes will be marked DOWN and their jobs will be
      killed.
      In the long term, it would be better to exit only if the files cannot
      be read on program startup, and to have the daemons keep running with
      the old configuration if the read fails on reconfiguration, but I
      don't have time to do that work now.
      42081d87
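
The sleep-and-retry approach can be sketched as below. The function name, attempt count, and delay are assumptions chosen for illustration. The commit does not state how many times or how long the daemons actually wait.

```python
import time


def read_config_with_retry(path, attempts=5, delay=1.0, sleep=time.sleep):
    """Read a config file, sleeping and retrying if it cannot be read."""
    for attempt in range(1, attempts + 1):
        try:
            with open(path, "r") as fh:
                return fh.read()
        except OSError:
            if attempt == attempts:
                raise  # give up only after the final attempt
            sleep(delay)
```

A short outage of the file system holding the configuration then costs a few seconds of waiting rather than a dead daemon.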