- 22 Jan, 2013 5 commits
-
-
Morris Jette authored
-
jette authored
Conflicts:
	doc/html/Makefile.am
	doc/html/Makefile.in
-
Magnus Jonsson authored
-
jette authored
Correction to CPU allocation logic for cores without hyperthreading. Backport of https://github.com/SchedMD/slurm/commit/1ef41ac9590e018e631eaefb31254622984b7d2d
-
jette authored
-
- 19 Jan, 2013 2 commits
- 18 Jan, 2013 15 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
From Chris Holmes, HP: After several days of brainstorming and debugging, I have identified a bug in SLURM 2.5.0rc2 related to the 'tree' topology. It occurs so early in the execution of the whole SLURM machinery that it took me some time to figure out (say, 100 or 200 jobs showing the issue, with debugging levels increased and extra instrumentation, with sometimes uncertain reliability)... For every "switch", a bitmap of the nodes seen below that switch is built as the topology is discovered through 'topology.conf'. There is code in read_config.c, executed when the SLURM control daemon starts, that reorders the nodes (by hostname, by default) after the switch table (i.e. the bitmaps) has already been built. Reordering the nodes makes the switch bitmaps wrong.
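The failure mode can be shown with a minimal, self-contained C sketch (the toy uint64_t bitmap, struct sw, and node_index() below are illustrative stand-ins, not SLURM's actual data structures): a bitmap that records pre-sort node indexes silently points at the wrong node once the table is sorted.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct sw {
	const char *name;
	uint64_t node_bitmap;	/* bit i set => node_table[i] is below this switch */
};

static int node_index(char *node_table[], int cnt, const char *name)
{
	for (int i = 0; i < cnt; i++)
		if (!strcmp(node_table[i], name))
			return i;
	return -1;
}

int main(void)
{
	char *node_table[] = { "tux2", "tux0", "tux1" };	/* as parsed, unsorted */
	int cnt = 3;

	/* Bitmap built from topology.conf while the table is still unsorted:
	 * switch s0 sees tux0, which currently sits at index 1. */
	struct sw s = { "s0", 0 };
	s.node_bitmap |= 1ULL << node_index(node_table, cnt, "tux0");

	/* read_config.c later sorts the nodes by hostname; indexes change. */
	char *sorted[] = { "tux0", "tux1", "tux2" };
	memcpy(node_table, sorted, sizeof(sorted));

	/* The stale bitmap now resolves to tux1 instead of tux0. */
	for (int i = 0; i < cnt; i++)
		if (s.node_bitmap & (1ULL << i))
			printf("switch %s believes it sees %s\n", s.name, node_table[i]);
	return 0;
}

Building (or rebuilding) the per-switch bitmaps only after the node table has reached its final ordering avoids the mismatch.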
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
slurm.MEM_PER_CPU, slurm.NO_VAL, etc.
-
Morris Jette authored
-
Chris Harwell authored
"indeces" is not a word, but "indices" is. However, the choice of "indexes" has already been made in other parts of this code. Standardize: sed -i -e's/indecies/indexes/g' doc/man/man1/sbatch.1 src/sbatch/opt.c src/slurmctld/slurmctld.h src/common/node_conf.c src/common/node_conf.h
-
Morris Jette authored
Conflicts:
	doc/html/documentation.shtml
-
Morris Jette authored
-
Morris Jette authored
-
Phil Eckert authored
About a year ago I submitted a modification that you incorporated into SLURM 2.4, which allows an admin to modify a job to use a QOS even though the user does not have access to that QOS. However, I must have tested it without Accounting set to enforce QOS's. So, if an admin modifies a job to use a QOS the user doesn't have access to, the job will be modified, but it will end up in an InvalidQOS state. That is reasonable, since it handles the case where a user has their QOS removed. A problem, however, is that even though the scheduler won't schedule the job, backfill still will. One approach would be to fix backfill to be consistent with the scheduler (which should probably happen regardless), but my thought would be to modify the scheduler to allow the QOS as long as it was set by an admin, since that was the intent of the modification to begin with. I believe it would only take a single line to change, just adding a check on job_ptr->limit_set_qos to make sure it was set by an admin:

if (job_ptr->qos_id) {
	slurmdb_association_rec_t *assoc_ptr;
	assoc_ptr = (slurmdb_association_rec_t *)job_ptr->assoc_ptr;
	if (assoc_ptr &&
	    !bit_test(assoc_ptr->usage->valid_qos, job_ptr->qos_id) &&
	    !job_ptr->limit_set_qos) {
		info("sched: JobId=%u has invalid QOS", job_ptr->job_id);
		xfree(job_ptr->state_desc);
		job_ptr->state_reason = FAIL_QOS;
		continue;
	} else if (job_ptr->state_reason == FAIL_QOS) {
		xfree(job_ptr->state_desc);
		job_ptr->state_reason = WAIT_NO_REASON;
	}
}

Phil
-
jette authored
The shutdown call was causing all pending I/O to be discarded. Linger waits for pending I/O to complete before the close call returns.
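A hedged sketch of the fix, assuming a plain Berkeley-sockets descriptor (the helper name close_with_linger and the 30-second timeout are illustrative, not SLURM's actual values): with SO_LINGER enabled, close() blocks until queued output has been transmitted or the timeout expires, rather than discarding it.

#include <sys/socket.h>
#include <unistd.h>

static int close_with_linger(int fd)
{
	struct linger lg = {
		.l_onoff  = 1,	/* linger on close */
		.l_linger = 30,	/* wait up to 30 s for queued data to drain */
	};

	if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) < 0)
		return -1;
	return close(fd);	/* returns once pending I/O completes or times out */
}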
-
- 17 Jan, 2013 6 commits
-
-
Morris Jette authored
Conflicts:
	src/sacctmgr/sacctmgr.c
	src/sreport/sreport.c
-
David Bigagli authored
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
From Matthieu Hautreux: However, after discussing the point with the onsite Bull support team and looking at the slurmstepd code concerning stdout/err/in redirection, we would like to recommend two things for future versions of SLURM:
- shutdown(..., SHUT_WR) should be performed when managing the TCP sockets (see the sketch after this list). No shutdown(..., SHUT_WR) is performed on the TCP socket in slurmstepd eio management, so close() cannot reliably inform the other end of the socket that the transmission is done (no TCP FIN transmitted). As the close is followed by an exit(), the kernel is the only entity that knows the close may not have been taken into account by the other side (which might be our initial problem), so no retry can be performed, leaving the server side of the socket (srun) in a position where it can wait on a read until the end of time.
- TCP keepalive addition. No TCP keepalive seems to be configured in SLURM TCP exchanges, potentially leaving the system deadlocked if a remote host disappears while the local host is waiting on a read (a write would instead result in EPIPE or SIGPIPE, depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes) could improve the resilience of SLURM against unexpected packet/connection loss without much impact on the scalability of the solution. The timeout could be configurable in case it is found too aggressive for particular configurations.
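A minimal sketch of the first recommendation (the helper name finish_write_side and the drain loop are illustrative, not slurmstepd's actual eio code): half-closing with shutdown(SHUT_WR) transmits the FIN so the peer's read() returns 0 instead of blocking forever, and the local side can still drain any remaining data before close().

#include <sys/socket.h>
#include <unistd.h>

static void finish_write_side(int fd)
{
	char buf[4096];

	/* Half-close: send TCP FIN, keeping the read side open. */
	shutdown(fd, SHUT_WR);

	/* Drain until the peer closes its side, then fully close. */
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
}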
-
Morris Jette authored
Added "KeepAliveTime" configuration parameter From Matthieu Hautreux: TCP_KEEPALIVE addition. No TCP_KEEPALIVE seems to be configured in SLURM TCP exchanges, thus letting the system potentially deadlocked if a remote host dissapear and the local host is waiting on a read (the write would result in a EPIPE or SIGPIPE depending on the masked signals). Adding keepalive with a relatively large timeout value (5 minutes), could enhance the resilience of SLURM for unexpected packet/connection loss without too much implication on the scalability of the solution. The timeout could be configurable in case it is find too aggresive for particular configurations.
-
- 16 Jan, 2013 12 commits
-
-
Morris Jette authored
Conflicts:
	src/slurmctld/proc_req.c
-
Morris Jette authored
-
David Bigagli authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
Without this change a high priority batch job may not start at submit time. In addition, a pending job with multiple partitions may be cancelled when the scheduler runs if any of its partitions cannot be used by the job.
-
David Bigagli authored
-
Morris Jette authored
The original work this was based upon has been replaced with new logic.
-
Morris Jette authored
Without this patch, if the first listed partition lacks nodes with the required features, the job would be rejected.
-
Morris Jette authored
While this will validate a job at submit time, it results in redundant looping when scheduling jobs. Working on an alternate patch now.
-
Danny Auble authored
-
Danny Auble authored
submission.
-