Commits · f0da1597b30edeed14159b3e2f1d80706b9eeea6 · Manuel G. Marciani / ces_slurm_simulator

08 Mar, 2011 11 commits
- svn merge -r22656:22709 https://eris.llnl.gov/svn/slurm/branches/slurm-2.2 · f0da1597
  Moe Jette authored Mar 08, 2011
  
- note that shared midplane reservations are still under development · c3beb3b5
  Moe Jette authored Mar 08, 2011
  
- added missing files · 734c0fc6
  Danny Auble authored Mar 08, 2011
  
- moved helper functions to a new file · ff7bd42f
  Danny Auble authored Mar 08, 2011
  
- Add job_submit/cnode plugin to support resource reservations of less than · b227a7d9
  Moe Jette authored Mar 08, 2011
```
    a full midplane on BlueGene computers. Treat cnodes as licenses which can
    be reserved and are consumed by jobs.
```
- When building contribs/perlapi ignore INSTALL_BASE and use PREFIX instead. · 95a7d376
  Moe Jette authored Mar 08, 2011
```
    Perl cannot build if both are set.
```
- result of running autogen.sh with new cray m4 · ef391513
  Moe Jette authored Mar 08, 2011
  
- slurmctld: cosmetics in _valid_node_state_change · 2c143381
  Moe Jette authored Mar 08, 2011
```
This removes a redundant test for NODE_RESUME if the
old state was NODE_STATE_UNKNOWN, and an unreached
break.
```
- select/cray: update for patch 09, which tests for Cray binaries 'apbasil' and 'apkill' · cd0818dd
  Moe Jette authored Mar 08, 2011
```
Just a suggestion, also updates comment text.
```
- shorten states · 47bf59c1
  Danny Auble authored Mar 08, 2011
  
- added a bit of logic to restore state · 544b37c6
  Danny Auble authored Mar 08, 2011
  
07 Mar, 2011 6 commits
- -- In accounting_storage/filetxt plugin, substitute spaces within job names, · e520268b
  Moe Jette authored Mar 07, 2011
```
    step names, and account names with an underscore to ensure proper parsing.
```
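The substitution described in this commit can be sketched as follows. This is only an illustration and not the plugin's code (the plugin is written in C): the helper name `sanitize_field` and the space-separated record layout are assumptions made for the example.

```python
def sanitize_field(name: str) -> str:
    """Replace spaces with underscores so the field stays a single token.

    Hypothetical helper, not part of the filetxt plugin; it only
    illustrates the substitution the commit message describes.
    """
    return name.replace(" ", "_")


# Assumed layout: one record per line, fields separated by single spaces.
fields = ["1234", sanitize_field("my test job"), sanitize_field("proj a")]
record = " ".join(fields)

# Splitting on spaces now recovers exactly the three original fields.
parsed = record.split(" ")
print(parsed)
```

Without the substitution, the same record would split into six tokens and the parser could no longer tell where the job name ends.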
- Flesh out the SUG registration form a bit. · 58c1142f
  Moe Jette authored Mar 07, 2011
  
- Convert calls to ALPS to use proper directory · ad29b646
  Moe Jette authored Mar 07, 2011
  
- Results of autogen.sh · afc014a5
  Moe Jette authored Mar 07, 2011
  
- Add test for ALPS execute files · d216c9bb
  Moe Jette authored Mar 07, 2011
  
- Note recent Cray-specific mods · 89bea660
  Moe Jette authored Mar 07, 2011
  
06 Mar, 2011 10 commits

salloc: disable --no-shell mode · ca906680

Moe Jette authored Mar 06, 2011

Since "aprun" is used on Cray instead of srun, the --no-shell option does not
make any difference: with or without this option, the ALPS reservation is made,
and since it is confirmed using the SID of the current shell, aprun will run
even if the BASIL_RESERVATION_ID is not set.

NB: the patch aborts with an error message. If this is instead turned into a
warning and processing continues, opt.no_shell should be disabled, since
otherwise interactive mode (and thus job control) is disabled.


slurmctld: remove dead code · 7b5a5dee

Moe Jette authored Mar 06, 2011

return_hostlist is not populated in validate_nodes_via_front_end,
hence never printed out.


select/cray: typos and outdated comments · 10f20cfc

Moe Jette authored Mar 06, 2011

This 
 * removes outdated and no longer applicable comments regarding
   consecutive node numbering (dating from an earlier revision);
 * fixes a typo and clarifies condition on XT/SeaStar systems.


libalps: use proper type for timestamps · d089c7c9

Moe Jette authored Mar 06, 2011

This fixes an inconsistency: time_t is not necessarily u32, so use a separate
routine to parse the absolute value and use the proper time_t type.

Also tidied up code where possible.


select/cray: handling errors in do_basil_release() · 70869e06

Moe Jette authored Mar 06, 2011

This reduces the amount of error text printed on failure of do_basil_release():
 * parameter failures are caught by the existing calls to error(),
 * internal (ALPS) errors are printed by basil_release(),
 * there is no need to return additional error information via errno,
 * functions calling select_g_job_fini() just interpret the error, but no
   further action is taken, hence it is not necessary to indicate failure
   more than once.

The following shows how setting SLURM_ERROR/errno produces unnecessarily long error text:

 [2011-02-09T18:19:51] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP uid=21215
 [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
 [2011-02-09T18:19:51] error: releasing ALPS resId 286 for JobId 2940 FAILED with -5
 [2011-02-09T18:19:51] error: select_g_job_fini(2940): No error

With the patch, only
 [2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
would be printed, which is sufficient to diagnose the problem (resId 286 had been
terminated by ALPS internally, after not receiving a confirmation quickly enough).


select/cray: perform "safe" release of ALPS reservations · e0801444

Moe Jette authored Mar 06, 2011

This introduces a function which, before releasing an ALPS reservation, checks
whether any application APIDs are still associated with it.

If there are, it attempts to kill those presumably stray job steps using the
Cray apkill(1) binary. In most cases this is sufficient to successfully release
the reservation. If, on the other hand, the reservation is formally released
while APIDs are still associated with it, the reservation will remain (and
its resources will not be released back) until the associated applications
(APIDs) have terminated.

Use of this function is restricted to cleaning up orphaned reservations. An
attempt to also use it for normal (non-abortive) job termination resulted in
error messages for APIDs that were still associated with the reservation but
had been released just shortly before, i.e. it generated false positives.

The patch passed the following test case:
 1. set up an ALPS reservation: salloc -N 12
 2. spawn long-running apruns:  for i in {1..13};do aprun sleep 3600&done
 3. (in a different window)     kill -9 $(pidof salloc)
                                scancel -u $USER
 4. after the job had completed within slurm, the following cleanup happened:
    [2011-03-05T13:19:07] debug2: purge_old_job: purged 1 old job records
    [2011-03-05T13:19:37] debug:  BASIL 3.1 INVENTORY: 128/176 batch nodes available
    [2011-03-05T13:19:37] debug:  ALPS: 12 node(s) still held
    [2011-03-05T13:19:37] error: orphaned ALPS reservation 147, trying to remove
    [2011-03-05T13:19:37] error: apkill live apid 168913 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168912 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168911 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168910 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168909 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168908 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168907 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168906 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168905 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168904 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168903 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168902 of ALPS resId 147

 ==> Subsequently, reservation 147 was released, and a new salloc could be granted.
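
The control flow of such a "safe" release can be sketched as follows. This is a minimal illustration, not the select/cray code: the three callables are hypothetical stand-ins for querying the APIDs of a reservation, invoking apkill, and releasing the reservation, and the in-memory fakes below replace a real ALPS installation.

```python
from typing import Callable, Dict, Iterable, List


def safe_release(
    resv_id: int,
    list_apids: Callable[[int], Iterable[int]],
    kill_apid: Callable[[int], None],
    release: Callable[[int], bool],
) -> bool:
    """Kill applications still attached to a reservation, then release it.

    All three callables are hypothetical; they are not libalps functions.
    """
    for apid in list_apids(resv_id):
        # A live APID would otherwise keep the reservation pinned.
        kill_apid(apid)
    return release(resv_id)


# In-memory fakes: reservation 147 with three stray APIDs.
live: Dict[int, List[int]] = {147: [168913, 168912, 168911]}
killed: List[int] = []


def fake_list(resv_id: int) -> List[int]:
    # Return a copy so that killing does not disturb the iteration.
    return list(live.get(resv_id, []))


def fake_kill(apid: int) -> None:
    killed.append(apid)
    for apids in live.values():
        if apid in apids:
            apids.remove(apid)


def fake_release(resv_id: int) -> bool:
    # In this fake, the release succeeds only once no APIDs remain.
    return not live.get(resv_id)


released = safe_release(147, fake_list, fake_kill, fake_release)
```

Passing the ALPS operations in as callables keeps the sketch testable without Cray binaries; it says nothing about how the actual patch structures its calls.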


select/cray: do not override the node reason field · cc4dc6ac

Moe Jette authored Mar 06, 2011

With the current configuration, setting DownNodes in slurm.conf was not possible,
since node_ptr->reason gets overwritten by basil_get_initial_state().

The patch updates setting the initial state so that
 * initial 'reason' fields remain untouched;
 * a new 'reason' is set only if 
   - the node is not already recognized as down or
   - no reason has been set so far;
 * it frees any previously set 'reason' if the node is allocated or idle.

This code has been tested to work while we were waiting for a missing replacement
blade (marked as 'DownNodes' in slurm.conf).
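
One way to read the three rules above is sketched below. It is an interpretation under stated assumptions, not the code of basil_get_initial_state(): the `Node` class, the state labels 'down', 'allocated' and 'idle', and the function name `apply_alps_state` are all invented for the example, and the sketch assumes that "allocated or idle" refers to the state reported by ALPS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node record."""
    name: str
    is_down: bool = False
    reason: Optional[str] = None


def apply_alps_state(node: Node, alps_state: str,
                     alps_reason: Optional[str] = None) -> None:
    """Update 'reason' from an ALPS-reported state (assumed labels)."""
    if alps_state == "down":
        # Set a new reason only if the node is not already known to be
        # down, or if no reason has been recorded so far.
        if not node.is_down or node.reason is None:
            node.reason = alps_reason
        node.is_down = True
    elif alps_state in ("allocated", "idle"):
        # A usable node should not carry a stale reason.
        node.reason = None


# A node listed under DownNodes keeps its configured reason.
configured = Node("nid00804", is_down=True, reason="awaiting replacement blade")
apply_alps_state(configured, "down", "ALPS reports node down")

# A node without any prior information receives the new reason.
fresh = Node("nid00012")
apply_alps_state(fresh, "down", "ALPS reports node down")
```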


select/cray: check on the ALPS side whether node is allocated · dd03bb07

Moe Jette authored Mar 06, 2011

This fixes a bug in handling nodes: the code so far ignored whether nodes are
still allocated to jobs.

The patch therefore adds the following ALPS test:

 "If any node still has an ALPS reservation for CPUs or memory, it is
  considered allocated (has an active ALPS reservation associated with it)."

Details of changes:
-------------------
 1. general: resurrected node_is_allocated() libalps function
    - returns true if there is an ALPS reservation for CPUs/memory on a node;
 2. basil_get_initial_state():
    - clarified reliance on reset_job_bitmaps() and _sync_nodes_to_jobs(), to
      clean up associated jobs (the latter function to kill jobs on DOWN nodes),
    - added missing case for nodes that are still allocated after SLURM restart,
    - fixed an error in documentation: comment about allocation was wrong!;
 3. basil_inventory():
    - now looks at both SLURM/ALPS node-allocation state,
    - if ALPS-allocated and not SLURM-allocated, sets 'mismatch' flag (if this
      case is triggered by an orphaned ALPS reservation, the flag is set again),
    - if there is a SLURM/ALPS mismatch, scheduling is deferred.
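
The ALPS test and the resulting mismatch check can be sketched as follows. This is a simplified model, not the libalps implementation: the `AlpsNode` fields and the function `inventory_mismatch` are assumptions for the example, and only the name node_is_allocated() is taken from the description above.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class AlpsNode:
    """Hypothetical view of one node in the ALPS inventory."""
    reserved_cpus: int = 0
    reserved_mem_mb: int = 0


def node_is_allocated(node: AlpsNode) -> bool:
    """True if ALPS still holds a CPU or memory reservation on the node."""
    return node.reserved_cpus > 0 or node.reserved_mem_mb > 0


def inventory_mismatch(alps_nodes: Dict[str, AlpsNode],
                       slurm_allocated: Set[str]) -> bool:
    """True if any node is allocated in ALPS but not in SLURM."""
    return any(
        node_is_allocated(node) and name not in slurm_allocated
        for name, node in alps_nodes.items()
    )


alps = {
    "nid00010": AlpsNode(reserved_cpus=24),
    "nid00011": AlpsNode(),
}

# SLURM also considers nid00010 allocated: both views agree.
consistent = inventory_mismatch(alps, {"nid00010"})

# SLURM has forgotten the allocation (e.g. an orphaned reservation):
# in this model, scheduling would be deferred.
orphaned = inventory_mismatch(alps, set())
```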


select/cray: better resiliency against bad nodes · 0a25539d

Moe Jette authored Mar 06, 2011

This lets the select/cray code deal more gracefully with bad nodes:
 * avoid sscanf(NULL, ...) in basil_geometry();
 * avoid fatal() if node_ptr->name[0] == '\0'.

The other three functions,
 * basil_node_ranking(),
 * basil_get_initial_state(), and
 * basil_inventory()
rely on find_node_record() to return NULL on finding a bad node - which will
trigger an error condition, but not cause the program to abort.


select/cray: fix error in 'is_gemini' logic · 6c927b3f

Moe Jette authored Mar 06, 2011

The is_gemini logic is too simple: as just observed on a SeaStar system, it can
be fooled into the wrong result if more than 1 row has NULL coordinates. 

This case happens if a blade has been powered down completely, so that the SeaStar
network chip is also powered off. The routing system recognizes this case and 
routes around the powered-down node in the torus. It is plausible that in such a
case the torus coordinates are NULL, since the node(s) are no longer part of the
torus. 

(It is also possible to set all nodes on a blade down, but leave power switched
 on. The SeaStar chip, which is independent of the motherboard, will continue to
 provide routing connectivity, i.e. the torus coordinates would all be non-NULL,
 but no computing can be done by the node, the ALPS state is "ROUTING".)

Here is the example which revealed this behaviour: one blade, nodes 804-807,
had been powered down after system failure.

mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1882 |                                    1878 | 
+----------+-----------------------------------------+

==> There are 4 more node IDs than there are distinct coordinates.

mysql> select processor_id,x_coord,y_coord,z_coord from processor\
       WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
+--------------+---------+---------+---------+
| processor_id | x_coord | y_coord | z_coord |
+--------------+---------+---------+---------+
|          804 |    NULL |    NULL |    NULL | 
|          805 |    NULL |    NULL |    NULL | 
|          806 |    NULL |    NULL |    NULL | 
|          807 |    NULL |    NULL |    NULL | 
+--------------+---------+---------+---------+

==> The corrected query now also gives the correct result (equality):
mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
       WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1878 |                                    1878 | 
+----------+-----------------------------------------+
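
The effect of excluding NULL rows can be reproduced with a small sketch. It assumes, as the queries above suggest, that the test compares the number of nodes with the number of distinct torus coordinates, and that a surplus of nodes is taken to indicate Gemini. The function names and sample data are invented for the example.

```python
from typing import List, Optional, Tuple

Coord = Tuple[Optional[int], Optional[int], Optional[int]]


def distinct_coords(rows: List[Coord]) -> int:
    """Count distinct coordinate triples, skipping rows that contain a
    NULL (None), as COUNT(DISTINCT x, y, z) does in MySQL."""
    return len({c for c in rows if None not in c})


def is_gemini_naive(rows: List[Coord]) -> bool:
    # Compares the count of all rows, NULL rows included.
    return len(rows) > distinct_coords(rows)


def is_gemini_fixed(rows: List[Coord]) -> bool:
    # Rows with NULL coordinates are excluded from both counts.
    valid = [c for c in rows if None not in c]
    return len(valid) > distinct_coords(valid)


# One node per coordinate, plus a powered-down blade of four nodes.
seastar: List[Coord] = [(x, 0, 0) for x in range(6)]
seastar += [(None, None, None)] * 4

# Two nodes per coordinate (the assumed Gemini signature).
shared: List[Coord] = [(x // 2, 0, 0) for x in range(6)]

print(is_gemini_naive(seastar), is_gemini_fixed(seastar), is_gemini_fixed(shared))
```

In this model the naive test misclassifies the system with the powered-down blade, while the filtered test treats it like any other system with one node per coordinate.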


04 Mar, 2011 13 commits
- svn merge -r22620:22656 https://eris.llnl.gov/svn/slurm/branches/slurm-2.2 · 3d96a32e
  Danny Auble authored Mar 04, 2011
  
- fix for handling extremely overloaded system · 261ee46c
  Danny Auble authored Mar 04, 2011
  
- BLUEGENE - Fix for handling extremely overloaded system on Dynamic system... · 1826c82c
  Danny Auble authored Mar 04, 2011
```
BLUEGENE - Fix for handling extremely overloaded system on Dynamic system dealing with starting jobs on overlapping blocks.  Previous fallout was job would be requeued.  (happens very rarely)
```
- avoid printing an error from sview if it reads configuration parameters that it does · 88f64c5e
  Moe Jette authored Mar 04, 2011
```
not recognize. The options may be from a newer version of sview.
```
- better logic for dealing with marking midplanes temporarily used · 55d37b2e
  Danny Auble authored Mar 04, 2011
  
- migrated over to new bitmap remove functions · 6fd811d3
  Danny Auble authored Mar 04, 2011
  
- Add registration form for meetings · 5f25cb2c
  Moe Jette authored Mar 04, 2011
  
- fixed multicluster mode · 7a1f34bc
  Danny Auble authored Mar 04, 2011
  
- fixed geo rotate for Q · 5b336b20
  Danny Auble authored Mar 04, 2011
  
- Add info about SLURM User Group Meeting · 46d5fe78
  Moe Jette authored Mar 04, 2011
  
- Add ability for sview to modify DebugFlags · 04216e9f
  Moe Jette authored Mar 04, 2011
  
- now setting and clearing each DebugFlag · 9af85d75
  Moe Jette authored Mar 04, 2011
  
- modify sview to report proper DebugFlags · b9584805
  Moe Jette authored Mar 04, 2011
  