- 06 Mar, 2011 1 commit
-
-
Moe Jette authored
The is_gemini logic is too simple: as just observed on a SeaStar system, it can be fooled into the wrong result if more than 1 row has NULL coordinates.

This case happens if a blade has been powered down completely, so that the SeaStar network chip is also powered off. The routing system recognizes this case and routes around the powered-down node in the torus. It is plausible that in such a case the torus coordinates are NULL, since the node(s) are no longer part of the torus.

(It is also possible to set all nodes on a blade down, but leave power switched on. The SeaStar chip, which is independent of the motherboard, will continue to provide routing connectivity, i.e. the torus coordinates would all be non-NULL, but no computing can be done by the node, the ALPS state is "ROUTING".)

Here is the example which revealed this behaviour: one blade, nodes 804-807, had been powered down after system failure.

mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1882 |                                    1878 |
+----------+-----------------------------------------+

==> There are 4 more node IDs than there are distinct coordinates.

mysql> select processor_id,x_coord,y_coord,z_coord from processor\
       WHERE x_coord IS NULL OR y_coord IS NULL OR z_coord IS NULL;
+--------------+---------+---------+---------+
| processor_id | x_coord | y_coord | z_coord |
+--------------+---------+---------+---------+
|          804 |    NULL |    NULL |    NULL |
|          805 |    NULL |    NULL |    NULL |
|          806 |    NULL |    NULL |    NULL |
|          807 |    NULL |    NULL |    NULL |
+--------------+---------+---------+---------+

==> The corrected query now also gives the correct result (equality):

mysql> select COUNT(*), COUNT(DISTINCT x_coord,y_coord,z_coord) FROM processor\
       WHERE x_coord IS NOT NULL AND y_coord IS NOT NULL AND z_coord IS NOT NULL;
+----------+-----------------------------------------+
| COUNT(*) | COUNT(DISTINCT x_coord,y_coord,z_coord) |
+----------+-----------------------------------------+
|     1878 |                                    1878 |
+----------+-----------------------------------------+
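
The effect of the NULL rows on the count comparison can be reproduced without a database. The sketch below is illustrative only and is not the plugin's code: it assumes that the check treats "fewer distinct coordinate triples than rows" as the sign of a Gemini system, and it mimics the MySQL behaviour that COUNT(DISTINCT a,b,c) skips any row containing a NULL. The function names and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

# One (x_coord, y_coord, z_coord) triple per row of the processor table;
# None stands in for SQL NULL.
Coord = Tuple[Optional[int], Optional[int], Optional[int]]


def has_null(coord: Coord) -> bool:
    """True if any coordinate of the triple is NULL."""
    return any(c is None for c in coord)


def looks_like_gemini(coords: List[Coord], skip_null_rows: bool) -> bool:
    """Assumed heuristic: fewer distinct coordinates than rows means Gemini.

    skip_null_rows=True corresponds to adding the
    'WHERE ... IS NOT NULL AND ...' clause of the corrected query.
    """
    if skip_null_rows:
        coords = [c for c in coords if not has_null(c)]
    total = len(coords)  # COUNT(*)
    # COUNT(DISTINCT x,y,z) ignores rows in which any column is NULL.
    distinct = len({c for c in coords if not has_null(c)})
    return distinct < total


# Made-up SeaStar-like data: six nodes with unique coordinates plus one
# powered-down blade of four nodes whose coordinates are NULL.
seastar = [(x, 0, 0) for x in range(6)] + [(None, None, None)] * 4

# Made-up Gemini-like data: every two nodes share one coordinate triple.
gemini = [(x // 2, 0, 0) for x in range(6)]
```

With the unfiltered comparison the SeaStar sample is misclassified, because the four NULL rows are counted as rows but not as coordinates. Filtering the NULL rows first restores equality for SeaStar while the Gemini sample is still detected.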
-
- 04 Mar, 2011 23 commits
-
-
-
Danny Auble authored
-
Danny Auble authored
BLUEGENE - Fix handling of an extremely overloaded Dynamic system when starting jobs on overlapping blocks. Previously the job would be requeued (this happens very rarely).
-
Moe Jette authored
not recognize. The options may be from a newer version of sview.
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
This sets cursor visibility to 0 in smap mode; it is restored after the call to endwin().
-
Moe Jette authored
This restores the if condition so that do_basil_confirm() is also executed in the stepdmgr (otherwise batch jobs would fail).
-
Moe Jette authored
In file included from nodespec.c:10:
basil_alps.h:31:19: error: expat.h: No such file or directory
make[1]: *** [select_cray_la-nodespec.lo] Error 1
make[1]: Leaving directory `/root/src/slurm/slurm-2.3.0-0.pre3/src/plugins/select/cray'
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
- 03 Mar, 2011 16 commits
-
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
given this time interval before termination. Work by Bill Brophy, Bull.
-
Moe Jette authored
-
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
"%j" in the job's output or error file specification.
-