- 03 Apr, 2011 3 commits
-
-
Moe Jette authored
This audits the select/cray code so that it does not accidentally dereference a NULL job_ptr. This instance happens once, upon restart of slurmctld (detailed description below). Similar checks are already in place in other select plugins, and in any case it is better to check this. Almost all cases use xassert(); the only exception is p_job_fini(), which assumes NULL means there is nothing to be finalized.
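The two guard styles described above can be sketched as follows. This is an illustrative sketch, not the actual plugin code: the struct and function names are hypothetical stand-ins, and plain assert() is used where SLURM uses its xassert() macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the real SLURM job record; illustration only. */
struct job_record { int job_id; };

/* Most entry points treat a NULL job_ptr as a caller bug and assert,
 * mirroring the xassert() guards described in the commit message. */
static int job_begin(struct job_record *job_ptr)
{
	assert(job_ptr != NULL);	/* SLURM uses xassert(); plain assert here */
	return job_ptr->job_id;
}

/* The finalizer is the one exception: NULL means "nothing to finalize". */
static int job_fini(struct job_record *job_ptr)
{
	if (job_ptr == NULL)
		return 0;		/* nothing to do */
	printf("finalizing job %d\n", job_ptr->job_id);
	return 1;
}
```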
-
Moe Jette authored
When running in multiple-slurmd mode, the actual hardware configuration reported by the slurmd is ignored, and internal entries (via register_front_ends()) just use 1 as a dummy value for CPUs, sockets, cores, and threads. On a dual-core service node this led to continual warning messages like

[2011-04-01T10:06:40] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
[2011-04-01T10:07:24] Node configuration differs from hardware Procs=1:2(hw) Sockets=1:1(hw) CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)

Since validate_nodes_via_front_end() ignores the reported values, it is safe to use the actual hardware configuration here, which also helps with taking stock of the current cluster configuration (e.g. via scontrol show slurmd). After applying this patch, the slurmds report without warnings as

[2011-04-01T12:03:38] slurmd version 2.3.0-pre4 started
[2011-04-01T12:03:38] slurmd started on Fri 01 Apr 2011 12:03:38 +0200
[2011-04-01T12:03:38] Procs=2 Sockets=1 Cores=2 Threads=1 Memory=3886 TmpDisk=1943 Uptime=14355
-
Moe Jette authored
This caused segfaults/core dumps when the slurmd/slurmctld unloaded the select/cray plugin.
-
- 02 Apr, 2011 1 commit
-
-
Don Lipari authored
-
- 01 Apr, 2011 4 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 31 Mar, 2011 10 commits
-
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
Also add the function to sview. log_init() is now one of the first statements for all commands and daemons.
-
Moe Jette authored
-
Moe Jette authored
be required.
-
Moe Jette authored
-
Moe Jette authored
This fixes a bug introduced when adding the configuration file parser. The cray_conf structure must always be created, since we are also using the plugin in the stepdmgr context. The observed symptoms were core dumps and the inability to run batch jobs (since trying to confirm the ALPS reservation with a NULL cray_conf->apbasil resulted in segfaults).
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
- 30 Mar, 2011 19 commits
-
-
Danny Auble authored
-
Danny Auble authored
BLUEGENE - fixed some issues where a block could mistakenly be freed in memory when it shouldn't have been.
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
longer time than necessary after restart.
-
Danny Auble authored
-
-
Danny Auble authored
BLUEGENE - Added back a lock when creating dynamic blocks to be more thread safe on larger systems with heavy load.
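The locking pattern this commit restores can be sketched as below. This is a minimal sketch, assuming a single mutex serializing dynamic block creation; the names (block_state_mutex, create_dynamic_block) are illustrative, not the actual SLURM symbols.

```c
#include <pthread.h>

/* Serialize dynamic block creation so that concurrent allocators on a
 * heavily loaded system cannot race while examining and modifying the
 * shared block state. Hypothetical names; illustration only. */
static pthread_mutex_t block_state_mutex = PTHREAD_MUTEX_INITIALIZER;
static int blocks_created = 0;

static int create_dynamic_block(void)
{
	pthread_mutex_lock(&block_state_mutex);
	/* ... examine and modify shared block lists here ... */
	int id = ++blocks_created;
	pthread_mutex_unlock(&block_state_mutex);
	return id;
}
```

Holding one coarse lock across the whole create path trades some parallelism for correctness, which matches the commit's stated goal of being "more thread safe on larger systems with heavy load".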
-
Moe Jette authored
-
Moe Jette authored
fuzzy_equal(qos_ptr->usage_thres, X). This deals with imprecision in the storage of a double, especially with respect to un/packing across machine architectures.
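A fuzzy double comparison of this kind can be sketched as below. The tolerance value is an assumption for illustration; SLURM's actual fuzzy_equal macro may use a different epsilon or form.

```c
#include <math.h>

/* Treat two doubles as equal if they differ by less than a small
 * epsilon, absorbing rounding drift introduced when values are
 * packed/unpacked across machine architectures. Epsilon is illustrative. */
#define FUZZY_EPSILON 0.00001

static int fuzzy_equal(double a, double b)
{
	return fabs(a - b) < FUZZY_EPSILON;
}
```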
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Fix associations/QOS so that when adding back a previously deleted object, the object is cleared of all old limits.
-
Danny Auble authored
-
Moe Jette authored
Change #include <slurm/slurm.h> to #include "slurm/slurm.h" so that the original source tree gets searched first.
-
Don Lipari authored
https://eris.llnl.gov/svn/slurm/branches/cgroups_Matthieu/
-- Added proctrack/cgroup and task/cgroup plugins written by Matthieu Hautreux, CEA.
-
Moe Jette authored
-
- 29 Mar, 2011 3 commits
-
-
Moe Jette authored
The man page for slurm.conf, select/cons_res parameter SelectTypeParameters, values CR_Socket and CR_Socket_Memory states the following: "Note that jobs requesting one CPU will only be given access to that one CPU" I think this statement is incorrect, or at least very misleading to users. A job requesting one CPU will only be allocated one CPU, but unless task/affinity is enabled or some other CPU binding mechanism is used, the job can access all of the CPUs on the node. That is, a task that is distributed to the node can run on any of the CPUs on the node, not just on the one CPU that was allocated to its job. I propose the following patch to replace "given access to" with "allocated". Regards, Martin Perry
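The allocated-vs-accessible distinction drawn above can be observed directly from a task. The sketch below is Linux-specific and illustrative: without task/affinity or another binding mechanism, a process's CPU affinity mask typically covers every CPU on the node, even when its job was only allocated one CPU.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Count the CPUs this process is currently permitted to run on.
 * With no CPU binding in effect, this is all CPUs on the node,
 * regardless of how many CPUs the job was allocated. */
static int accessible_cpus(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
		return -1;	/* affinity query failed */
	return CPU_COUNT(&mask);
}
```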
-
Moe Jette authored
printed if memory is not allocated.
-