1. 18 Jan, 2013 5 commits
    • Morris Jette's avatar
      Merge branch 'slurm-2.5' · e1d6ee3f
      Morris Jette authored
      Conflicts:
      	doc/html/documentation.shtml
      e1d6ee3f
    • Morris Jette's avatar
      Update rosetta stone · 3cec511a
      Morris Jette authored
      3cec511a
    • Morris Jette's avatar
      a4417570
    • Phil Eckert's avatar
      Permit job with invalid QOS to run if QOS set by administrator · 7aef4f80
      Phil Eckert authored
      About a year ago I submitted a modification that you incorporated
      into SLURM 2.4, which was to allow an admin to modify a job to use
      a QOS even though the user did not have access to the QOS.
      
      However, I must have tested it without Accounting set to
      enforce QOS. So, if an admin modifies a job to use a QOS that
      the user doesn't have access to, the modification succeeds, but
      the job ends up in a state of InvalidQOS. That is reasonable,
      since it handles the case where a user has their QOS removed.
      A problem, however, is that even though the scheduler won't
      schedule the job, backfill still will.
      
      One approach would be to fix backfill to be consistent with
      the scheduler (which should probably occur regardless), but
      my thought would be to modify the scheduler to allow the QOS
      as long as it was set by an admin, since that was the intent
      of the modification to begin with.
      
      I believe it would only take a single line to change, just
      adding a check on job_ptr->limit_set_qos to make sure
      it was set by an admin:
      
                      if (job_ptr->qos_id) {
                              slurmdb_association_rec_t *assoc_ptr;
                              assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
                              if (assoc_ptr &&
                                  !bit_test(assoc_ptr->usage->valid_qos,
                                            job_ptr->qos_id) &&
                                  !job_ptr->limit_set_qos) {
                                      info("sched: JobId=%u has invalid QOS",
                                              job_ptr->job_id);
                                      xfree(job_ptr->state_desc);
                                      job_ptr->state_reason = FAIL_QOS;
                                      continue;
                              } else if (job_ptr->state_reason == FAIL_QOS) {
                                      xfree(job_ptr->state_desc);
                                      job_ptr->state_reason = WAIT_NO_REASON;
                              }
                      }
      
      Phil
      7aef4f80
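The fragment in the commit above depends on scheduler state and cannot be compiled on its own. As a minimal sketch of the same decision, assuming the three conditions can be reduced to booleans, the rule could be modelled like this. The enum and the function `check_job_qos` are hypothetical names for illustration only, not SLURM identifiers.

```c
#include <stdbool.h>

/* Hypothetical outcomes for illustration; these are not SLURM types. */
enum qos_decision { QOS_RUN, QOS_HOLD_INVALID };

/*
 * Mirrors the condition in the patch: a job is held for an invalid QOS
 * only when it has an association, the QOS is not in that association's
 * valid set, and the QOS was not set by an administrator.
 */
enum qos_decision check_job_qos(bool has_assoc,
                                bool qos_valid_for_assoc,
                                bool qos_set_by_admin)
{
    if (has_assoc && !qos_valid_for_assoc && !qos_set_by_admin)
        return QOS_HOLD_INVALID;
    return QOS_RUN;
}
```

The only case that changes with the added check is the one where the QOS is invalid for the association but was set by an administrator: it moves from held to runnable.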
    • jette's avatar
      Replace socket shutdown call with linger sockopt · 777bf478
      jette authored
      The shutdown call was causing all pending I/O to be discarded.
      Linger waits for pending I/O to complete before the close call returns.
      777bf478
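For readers unfamiliar with the linger option named in the commit above, here is a minimal sketch using the standard `SO_LINGER` socket option. The helper name `set_linger` and the idea of passing the timeout as a parameter are assumptions for illustration; the commit message does not show the actual change or the timeout it uses.

```c
#include <sys/socket.h>

/*
 * Hypothetical helper: ask the kernel to let close() block for up to
 * `seconds` while unsent data is flushed, instead of returning at once.
 * Returns 0 on success and -1 on error (errno is set by setsockopt).
 */
int set_linger(int fd, int seconds)
{
    struct linger lg;

    lg.l_onoff = 1;         /* enable lingering on close() */
    lg.l_linger = seconds;  /* upper bound on the wait */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

A caller would typically invoke this once after creating or accepting the socket, so that the later close() waits for pending output.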
  2. 17 Jan, 2013 6 commits
    • Morris Jette's avatar
      Merge branch 'slurm-2.5' · d44a7cbd
      Morris Jette authored
      Conflicts:
      	src/sacctmgr/sacctmgr.c
      	src/sreport/sreport.c
      d44a7cbd
    • David Bigagli's avatar
      Terminate sreport on EOF · 892b14aa
      David Bigagli authored
      892b14aa
    • Morris Jette's avatar
      Fix typo in comment · 56a821a7
      Morris Jette authored
      56a821a7
    • Morris Jette's avatar
      Use shutdown() rather than close() for slurmstepd/srun sockets · 30f31198
      Morris Jette authored
      From Matthieu Hautreux:
      However, after discussing the point with the onsite Bull support team and
      looking at the slurmstepd code concerning stdout/err/in redirection, we would
      like to recommend two things for future versions of SLURM:
      
      - shutdown(...,SHUT_WR) should be performed when managing the TCP sockets: no
      shutdown(...,SHUT_WR) is performed on the TCP socket in slurmstepd eio
      management. Thus, the close() cannot reliably inform the other end of the
      socket that the transmission is done (no TCP_FIN transmitted). As the close is
      followed by an exit(), the kernel is the only entity that knows that the
      close may not have been taken into account by the other side (which
      might be our initial problem) and thus no retry can be performed, leaving the
      server side of the socket (srun) in a position where it can wait for a read
      until the end of time.
      
      - TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, thus leaving the system potentially deadlocked if a remote host
      disappears and the local host is waiting on a read (a write would result in an
      EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without too much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
      30f31198
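The first recommendation in the commit above, a write-side shutdown before closing, can be sketched as follows. This is an illustration under stated assumptions rather than the slurmstepd code: `send_and_finish` is a hypothetical helper, and it works on any connected stream socket.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: write the whole buffer, then half-close the
 * socket with shutdown(SHUT_WR). The peer can still read everything
 * that was sent and then sees end-of-file (read() returns 0).
 * Returns 0 on success and -1 on error.
 */
int send_and_finish(int fd, const char *buf, size_t len)
{
    size_t off = 0;

    while (off < len) {
        ssize_t n = write(fd, buf + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, retry */
            return -1;
        }
        off += (size_t)n;
    }
    return shutdown(fd, SHUT_WR);
}
```

On the receiving side, a read loop that stops when read() returns 0 will terminate once the sender has called this helper, which is the signal the message says srun was not reliably getting.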
    • Morris Jette's avatar
      Add support for configurable keep alive time for srun/slurmstepd communications · 5752c6ce
      Morris Jette authored
      Added "KeepAliveTime" configuration parameter
      
      From Matthieu Hautreux:
      TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP
      exchanges, thus leaving the system potentially deadlocked if a remote host
      disappears and the local host is waiting on a read (a write would result in an
      EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a
      relatively large timeout value (5 minutes) could enhance the resilience of
      SLURM to unexpected packet/connection loss without too much impact on the
      scalability of the solution. The timeout could be configurable in case it is
      found too aggressive for particular configurations.
      5752c6ce
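As a rough illustration of what enabling keepalive with a configurable idle time looks like at the socket level, here is a minimal sketch. The helper name `enable_keepalive` is hypothetical, and how SLURM maps its "KeepAliveTime" parameter onto socket options is not shown in the commit message, so treat the mapping as an assumption. `TCP_KEEPIDLE` is not available on every platform, hence the guard.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: turn on TCP keepalive probes for a socket and,
 * where the platform supports it, set the idle time before the first
 * probe. Returns 0 on success and -1 on error.
 */
int enable_keepalive(int fd, int idle_seconds)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
#ifdef TCP_KEEPIDLE
    /* Idle time in seconds, e.g. 300 for the 5 minutes suggested above. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_seconds, sizeof(idle_seconds)) < 0)
        return -1;
#endif
    return 0;
}
```

With keepalive enabled, a blocked read on a connection whose peer has vanished eventually fails with an error instead of waiting indefinitely.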
  3. 16 Jan, 2013 18 commits
  4. 15 Jan, 2013 8 commits
  5. 14 Jan, 2013 3 commits
    • Hongjia Cao's avatar
      Prevent srun abort on task launch failure · 163d9547
      Hongjia Cao authored
      On job step launch failure, the function
      "slurm_step_launch_wait_finish()" will be called twice in launch/slurm,
      which causes srun to be aborted:
      
      srun: error: Task launch for 22495.0 failed on node cn6: Job credential
      expired
      srun: error: Application launch failed: Job credential expired
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      cn5
      cn4
      cn7
      srun: error: Timed out waiting for job step to complete
      srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
      srun: error: Timed out waiting for job step to complete
      srun: bitstring.c:174: bit_test: Assertion `(b) != ((void *)0)' failed.
      Aborted (core dumped)
      
      The attached patch (version 2.5.1) fixes it. But the message of
      "
      Job step aborted: Waiting up to 2 seconds for job step to finish.
      Timed out waiting for job step to complete
      "
      will still be printed twice.
      163d9547
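The commit message above does not show how the patch avoids the second call. One common way to make a wait-and-cleanup routine safe to call twice is a "finished" flag, sketched below. The struct and function are hypothetical and are not SLURM's step launch API.

```c
#include <stdbool.h>

/* Hypothetical launch context for illustration; not a SLURM type. */
struct launch_ctx {
    bool finished;   /* set once the wait and cleanup have run */
    int wait_calls;  /* how many times the real work was performed */
};

/*
 * Performs the wait and cleanup at most once. Returns 1 if this call
 * did the work and 0 if an earlier call had already done it, so a
 * second call cannot touch state that was already released.
 */
int wait_finish_once(struct launch_ctx *ctx)
{
    if (ctx->finished)
        return 0;
    ctx->finished = true;
    ctx->wait_calls++;
    return 1;
}
```

A guard of this kind prevents the repeated cleanup, though on its own it would not stop messages printed before the guard is reached from appearing twice.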