- 09 Mar, 2011 2 commits
-
Danny Auble authored
-
Danny Auble authored
-
- 08 Mar, 2011 16 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
a full midplane on BlueGene computers. Treat cnodes as licenses which can be reserved and are consumed by jobs.
-
Moe Jette authored
Perl cannot build if both are set.
-
Moe Jette authored
-
Moe Jette authored
This removes a redundant test for NODE_RESUME when the old state was NODE_STATE_UNKNOWN, and an unreachable break statement.
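For illustration, a minimal sketch of the pattern described; the constants are real SLURM node-state values, but the surrounding switch is an assumption, not the actual slurmctld source:

    switch (state_val) {
    case NODE_RESUME:
        if (base_state == NODE_STATE_UNKNOWN) {
            if (state_val == NODE_RESUME)   /* redundant: this case arm
                                             * already guarantees it */
                base_state = NODE_STATE_IDLE;
        }
        break;
        break;                              /* never reached */
    default:
        break;
    }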
-
Moe Jette authored
Just a suggestion; also updates comment text.
-
Danny Auble authored
-
Danny Auble authored
-
- 07 Mar, 2011 6 commits
- 06 Mar, 2011 10 commits
-
Moe Jette authored
Since "aprun" is used on Cray instead of srun, the --no-shell option does not make any difference: with or without this option, the ALPS reservation is made, and since it is confirmed using the SID of the current shell, aprun will run even if the BASIL_RESERVATION_ID is not set. NB: the patch aborts with an error message. If deciding to turn this into a warning, and continue processing, opt.no_shell should be disabled, since otherwise interactive mode (and thus job control) is disabled.
-
Moe Jette authored
return_hostlist is not populated in validate_nodes_via_front_end and is therefore never printed.
-
Moe Jette authored
This:
 * removes outdated and no longer applicable comments regarding consecutive node numbering (dating from an earlier revision);
 * fixes a typo and clarifies a condition on XT/SeaStar systems.
-
Moe Jette authored
This fixes an inconsistency: time_t is not necessarily a u32, so a separate routine now parses the absolute value using the proper time_t type. Also tidies up code where possible.
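A self-contained sketch of the kind of helper described; the function name parse_abs_time() is an assumption:

    #include <stdlib.h>
    #include <time.h>

    /* Parse an absolute time value directly into a time_t rather than
     * squeezing it through a 32-bit integer. */
    static int parse_abs_time(const char *str, time_t *valp)
    {
        char *end = NULL;
        long long val = strtoll(str, &end, 10);

        if ((end == str) || (*end != '\0') || (val < 0))
            return -1;          /* not a valid absolute time value */
        *valp = (time_t)val;
        return 0;
    }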
-
Moe Jette authored
This reduces the amount of error text printed on failure of do_basil_release():
 * parameter failures are caught by the existing calls to error(),
 * internal (ALPS) errors are printed by basil_release(),
 * there is no need to return additional error information via errno,
 * functions calling select_g_job_fini() just interpret the error but take no further action, hence there is no need to indicate failure more than once.

The following shows how setting SLURM_ERROR/errno produces unnecessarily long error text:

[2011-02-09T18:19:51] debug2: Processing RPC: REQUEST_CANCEL_JOB_STEP uid=21215
[2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286
[2011-02-09T18:19:51] error: releasing ALPS resId 286 for JobId 2940 FAILED with -5
[2011-02-09T18:19:51] error: select_g_job_fini(2940): No error

With the patch, only

[2011-02-09T18:19:51] error: PERMANENT ALPS BACKEND error: ALPS error: apsched: No entry for resId 286

would be printed, which is sufficient to diagnose the problem (resId 286 had been terminated internally by ALPS after not receiving a confirmation quickly enough).
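A sketch of the resulting call pattern, assuming the plugin entry point simply stops propagating a failure that has already been logged; the function names are from the commit text, the body is not the actual source:

    extern int select_p_job_fini(struct job_record *job_ptr)
    {
        /* a failure has already been reported exactly once: by error()
         * for bad parameters, or by basil_release() for ALPS errors */
        do_basil_release(job_ptr);
        return SLURM_SUCCESS;
    }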
-
Moe Jette authored
This introduces a function which, before releasing an ALPS reservation, first checks whether any application APIDs are still associated with it. If there are, it attempts to kill those presumably stray job steps using the Cray apkill(1) binary. In most cases this is sufficient to successfully release the reservation. If, on the other hand, the reservation is formally released while APIDs are still associated with it, the reservation will remain (and its resources will not be released back) until the associated applications (APIDs) have terminated.

Use of this function is restricted to cleaning up orphaned reservations. When it was also tried for normal (non-abortive) job termination, it produced error messages where the APID was still associated with the reservation but had been released just shortly before, i.e. it generated false positives.

The patch passed the following test case:
1. set up an ALPS reservation: salloc -N 12
2. spawn long-running apruns: for i in {1..13}; do aprun sleep 3600 & done
3. (in a different window) kill -9 $(pidof salloc); scancel -u $USER
4. after the job had completed within slurm, the following cleanup happened:

[2011-03-05T13:19:07] debug2: purge_old_job: purged 1 old job records
[2011-03-05T13:19:37] debug: BASIL 3.1 INVENTORY: 128/176 batch nodes available
[2011-03-05T13:19:37] debug: ALPS: 12 node(s) still held
[2011-03-05T13:19:37] error: orphaned ALPS reservation 147, trying to remove
[2011-03-05T13:19:37] error: apkill live apid 168913 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168912 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168911 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168910 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168909 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168908 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168907 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168906 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168905 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168904 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168903 of ALPS resId 147
[2011-03-05T13:19:37] error: apkill live apid 168902 of ALPS resId 147

==> Subsequently, reservation 147 was released, and a new salloc could be granted.
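A self-contained sketch of the cleanup loop; the function name and the APID list are assumptions (the real code takes live APIDs from the BASIL inventory and logs via SLURM's error()), while apkill(1) itself is the real Cray utility:

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void kill_stray_apids(uint32_t rsvn_id, const uint64_t *apids, int n)
    {
        char cmd[64];
        int i;

        for (i = 0; i < n; i++) {
            fprintf(stderr, "apkill live apid %"PRIu64" of ALPS resId %"PRIu32"\n",
                    apids[i], rsvn_id);
            snprintf(cmd, sizeof(cmd), "apkill %"PRIu64, apids[i]);
            if (system(cmd) != 0)
                continue;   /* best effort; the release is retried later */
        }
    }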
-
Moe Jette authored
With the current configuration, setting DownNodes in slurm.conf was not possible, since node_ptr->reason gets overwritten by basil_get_initial_state(). The patch updates the setting of the initial state so that:
 * initial 'reason' fields remain untouched;
 * a new 'reason' is set only if
   - the node is not already recognized as down, or
   - no reason has been set so far;
 * any previously set 'reason' is freed if the node is allocated or idle.
This code has been tested to work while we were waiting for a missing replacement blade (marked as 'DownNodes' in slurm.conf).
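A minimal sketch of the 'reason' handling described in the list above; IS_NODE_DOWN, xfree() and xstrdup() are real SLURM helpers, while the node_is_down flag and the reason text are assumptions:

    if (node_is_down) {
        if (!IS_NODE_DOWN(node_ptr) || (node_ptr->reason == NULL)) {
            /* set a new reason only here, so that an admin-set
             * DownNodes reason from slurm.conf survives */
            xfree(node_ptr->reason);
            node_ptr->reason = xstrdup("ALPS marked it DOWN");
        }
    } else {
        /* node is allocated or idle: free any previously set reason */
        xfree(node_ptr->reason);
    }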
-
Moe Jette authored
This fixes a bug in handling nodes: the code so far ignored whether nodes are still allocated to jobs. The patch therefore adds the following ALPS test: "If any node still has an ALPS reservation for CPUs or memory, it is considered allocated (has an active ALPS reservation associated with it)."

Details of changes:
-------------------
1. general:
   - resurrected the node_is_allocated() libalps function, which returns true if there is an ALPS reservation for CPUs/memory on a node;
2. basil_get_initial_state():
   - clarified reliance on reset_job_bitmaps() and _sync_nodes_to_jobs() to clean up associated jobs (the latter function kills jobs on DOWN nodes),
   - added a missing case for nodes that are still allocated after SLURM restart,
   - fixed an error in the documentation: the comment about allocation was wrong;
3. basil_inventory():
   - now looks at both the SLURM and ALPS node-allocation state,
   - if a node is ALPS-allocated but not SLURM-allocated, sets the 'mismatch' flag (if this case is triggered by an orphaned ALPS reservation, the flag is set again),
   - if there is a SLURM/ALPS mismatch, scheduling is deferred.
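A self-contained sketch of the resurrected helper named above; the inventory structure here is an assumption, the real libalps types differ:

    #include <stdbool.h>

    struct basil_node_sketch {          /* hypothetical inventory record */
        unsigned reserved_cpus;         /* CPUs held by ALPS reservations */
        unsigned reserved_memory;       /* memory held by ALPS reservations */
    };

    /* True if any ALPS reservation still holds CPUs or memory on the node. */
    static bool node_is_allocated(const struct basil_node_sketch *node)
    {
        return (node->reserved_cpus > 0) || (node->reserved_memory > 0);
    }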
-
Moe Jette authored
This lets the select/cray code deal more gracefully with bad nodes:
 * avoid sscanf(NULL, ...) in basil_geometry();
 * avoid fatal() if node_ptr->name[0] == '\0'.
The other three functions,
 * basil_node_ranking(),
 * basil_get_initial_state(), and
 * basil_inventory()
rely on find_node_record() to return NULL on finding a bad node, which triggers an error condition but does not cause the program to abort.
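A minimal sketch of the two guards in basil_geometry(); node_ptr->name is the real field, while coord_str is an assumed name for the string previously passed unchecked to sscanf():

    if ((node_ptr->name == NULL) || (node_ptr->name[0] == '\0')) {
        error("node record with empty name, ignoring it");
        continue;                   /* previously: fatal() */
    }
    if (coord_str == NULL) {
        error("no coordinates for node %s", node_ptr->name);
        continue;                   /* previously: sscanf(NULL, ...) */
    }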
-
Moe Jette authored
The is_gemini logic is too simple: as just observed on a SeaStar system, it can be fooled into the wrong result if more than one row has NULL coordinates. This case happens if a blade has been powered down completely, so that the SeaStar network chip is also powered off. The routing system recognizes this case and routes around the powered-down node in the torus. It is plausible that in such a case the torus coordinates are NULL, since the node(s) are no longer part of the torus. (It is also possible to set all nodes on a blade down but leave power switched on. The SeaStar chip, which is independent of the motherboard, will continue to provide routing connectivity, i.e. the torus coordinates would all be non-NULL, but no computing can be done by the node; the ALPS state is "ROUTING".)

Here is the example which revealed this behaviour: one blade, nodes 804-807, had been powered down after a system failure.

mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1882 |                                    1878 |
+----------+-----------------------------------------+

==> There are 4 more node IDs than there are distinct coordinates.

mysql> select processor_id,x_coord,y_coord,z_coord from processor\
       WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
+--------------+---------+---------+---------+
| processor_id | x_coord | y_coord | z_coord |
+--------------+---------+---------+---------+
|          804 |    NULL |    NULL |    NULL |
|          805 |    NULL |    NULL |    NULL |
|          806 |    NULL |    NULL |    NULL |
|          807 |    NULL |    NULL |    NULL |
+--------------+---------+---------+---------+

==> The corrected query now also gives the correct result (equality):

mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
       WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1878 |                                    1878 |
+----------+-----------------------------------------+
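A hedged sketch of applying the corrected query via the MySQL C API; the inference that unequal counts indicate Gemini (where two nodes share one network ASIC and thus one set of torus coordinates) is an assumption based on the commit text, not the actual select/cray source:

    #include <stdlib.h>
    #include <mysql.h>

    static int query_is_gemini(MYSQL *db)
    {
        MYSQL_RES *res;
        MYSQL_ROW row;
        int is_gemini = -1;

        if (mysql_query(db,
                "SELECT COUNT(*), COUNT(DISTINCT x_coord, y_coord, z_coord) "
                "FROM processor WHERE x_coord IS NOT NULL "
                "AND y_coord IS NOT NULL AND z_coord IS NOT NULL"))
            return -1;
        if ((res = mysql_store_result(db)) == NULL)
            return -1;
        if ((row = mysql_fetch_row(res)) && row[0] && row[1])
            is_gemini = (atoll(row[0]) > atoll(row[1]));
        mysql_free_result(res);
        return is_gemini;           /* 1 = Gemini, 0 = SeaStar, -1 = error */
    }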
-
- 04 Mar, 2011 6 commits
-
Danny Auble authored
-
Danny Auble authored
BLUEGENE - Fix for handling an extremely overloaded dynamic system when starting jobs on overlapping blocks. Previously the fallout was that the job would be requeued. (Happens very rarely.)
-
Moe Jette authored
not recognize. The options may be from a newer version of sview.
-
Danny Auble authored
-
Danny Auble authored
-