- 18 May, 2011 9 commits
-
-
Danny Auble authored
-
Morris Jette authored
This improves the initial configuration code:

a) Better handling of DownNodes lines
The previous basil_geometry() would set the node Reason field on failure, irrespective of whether that node had already been marked using a DownNodes line.

b) Check all cases of nodes being invisible to ALPS
Up until now basil_geometry() had to be fixed each time a new source of discrepancy between ALPS and SDB state was discovered (the most recent case was NULL coordinates when taking out a blade). Depending on ALPS interface changes, there may be other possibilities. Instead of fixing the SLURM code for each new case, it is better to check whether SLURM and ALPS agree. The price is a tiny delay at SLURM initialisation time (since each node is first looked up in the ALPS inventory), but it pays off well, as it eases system administration by pointing to the source of error. Any node that has suddenly disappeared from the ALPS horizon will now show up in the logs, and will also be marked down in sinfo.

c) At initialisation time, give a summary of how many ALPS nodes are online.

d) Turn the ALPS-node-invisibility error into a warning message, since such nodes may already have been covered by a DownNodes statement.

By merging basil_get_initial_state() into basil_geometry(), the previously separate knowledge about system state (database state, ALPS inventory) is combined, making it easier to identify sources of failure.

Patch from Gerrit Renker, CSCS.
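The agreement check in (b), the summary in (c) and the warning in (d) could be sketched roughly as follows. This is not the code from the patch: `struct conf_node`, `alps_sees()` and `check_alps_agreement()` are hypothetical names, the ALPS inventory is modelled as a plain array of node IDs, and the message wording is invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a node configured in slurm.conf. */
struct conf_node {
    unsigned nid;        /* node ID, e.g. 28 for nid00028 */
    bool in_downnodes;   /* already listed on a DownNodes line */
    bool marked_down;    /* set by the check below */
};

/* Return true if nid appears in the (assumed) ALPS inventory array. */
bool alps_sees(const unsigned *inv, size_t inv_cnt, unsigned nid)
{
    for (size_t i = 0; i < inv_cnt; i++)
        if (inv[i] == nid)
            return true;
    return false;
}

/*
 * Look up each configured node in the inventory, mark invisible nodes
 * down, warn about them, and print a one-line summary. Returns the
 * number of configured nodes that the inventory contains.
 */
size_t check_alps_agreement(struct conf_node *nodes, size_t node_cnt,
                            const unsigned *inv, size_t inv_cnt)
{
    size_t visible = 0;

    for (size_t i = 0; i < node_cnt; i++) {
        if (alps_sees(inv, inv_cnt, nodes[i].nid)) {
            visible++;
            continue;
        }
        nodes[i].marked_down = true;
        /* Warning rather than error: the node may be a known DownNode. */
        fprintf(stderr, "warning: nid%05u not visible to ALPS%s\n",
                nodes[i].nid,
                nodes[i].in_downnodes ? " (listed in DownNodes)" : "");
    }
    fprintf(stderr, "ALPS sees %zu/%zu slurm.conf nodes\n",
            visible, node_cnt);
    return visible;
}
```

The linear lookup per node is the "tiny delay" mentioned above; for the node counts in this example it is negligible.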
-
Moe Jette authored
-
Moe Jette authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Moe Jette authored
One logging message was misleading, and another used an incorrect pointer.
-
Moe Jette authored
Former logic failed to properly allocate resources to a job step when specifying both a task count and a node count range on a heterogeneous cluster.
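To illustrate why heterogeneity matters here, the sketch below picks a node count from a range while summing the actual per-node CPU counts. It is not SLURM's step allocation logic: `pick_step_node_cnt()` is a hypothetical helper, nodes are taken in array order, and one CPU per task is assumed.

```c
/*
 * Hypothetical helper: walk the nodes in order and return how many are
 * needed so that the node count lies in [min_nodes, max_nodes] and the
 * summed CPUs cover ntasks (one CPU per task assumed).
 * Returns -1 if no node count in the range is sufficient.
 */
int pick_step_node_cnt(const int *cpus, int node_cnt, int ntasks,
                       int min_nodes, int max_nodes)
{
    int total = 0;

    if (max_nodes > node_cnt)
        max_nodes = node_cnt;
    for (int i = 0; i < max_nodes; i++) {
        total += cpus[i];   /* per-node value, not an assumed uniform one */
        if (i + 1 >= min_nodes && total >= ntasks)
            return i + 1;
    }
    return -1;
}
```

With CPU counts {8, 2, 2, 2}, a calculation that assumed every node had 8 CPUs would accept 16 tasks on 2 nodes, whereas summing the real counts shows that even all 4 nodes provide only 14 CPUs.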
-
- 17 May, 2011 15 commits
-
-
Danny Auble authored
-
Danny Auble authored
BLUEGENE - Fixed print of geo portion of the select_jobinfo struct to work correctly with the regression tests.
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Moe Jette authored
-
Danny Auble authored
-
Danny Auble authored
-
Morris Jette authored
-
Morris Jette authored
Latest Cray-specific modifications
-
Morris Jette authored
The enum is only needed and referenced in basil_geometry() and has otherwise no special meaning since it directly depends on the selected output columns. Patch from Gerrit Renker, CSCS.
-
Morris Jette authored
This case was observed after taking a blade out of a CLE 2.x system. ALPS does not list the removed nodes, but they still appear in the XTAdmin.processor table, with NULL coordinates. Hence set node down if at least one coordinate is NULL.

Also add a check to compare how many out of the nodes in slurm.conf are visible to ALPS (the absence of this test masked the bug), always list DOWN nodes at startup, and clarify that not failing due to ALPS errors during the initial SLURM configuration is not an option.

On the system which is missing a blade, the log information now is:

[2011-05-16T16:09:54] error: ALPS sees only 12/16 slurm.conf nodes
[2011-05-16T16:09:54] Recovered state of 16 nodes
[2011-05-16T16:09:54] Recovered state of 2 front_end nodes
[2011-05-16T16:09:54] Recovered information about 0 jobs
[2011-05-16T16:09:54] error: nid00028: unknown coordinates - hardware failure?
[2011-05-16T16:09:54] error: nid00029: unknown coordinates - hardware failure?
[2011-05-16T16:09:54] error: nid00030: unknown coordinates - hardware failure?
[2011-05-16T16:09:54] error: nid00031: unknown coordinates - hardware failure?

Patch from Gerrit Renker, CSCS.
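The rule "set node down if at least one coordinate is NULL" amounts to a check like the one below. This is a minimal sketch and not the patch itself: `struct proc_row` and `coords_unknown()` are hypothetical, and a NULL pointer stands in for an SQL NULL in a database result row.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result row; a NULL pointer models an SQL NULL column. */
struct proc_row {
    unsigned nid;
    const int *x, *y, *z;
};

/* A node is treated as down if at least one coordinate is NULL. */
bool coords_unknown(const struct proc_row *row)
{
    return row->x == NULL || row->y == NULL || row->z == NULL;
}
```

Checking for any NULL coordinate, rather than all three, covers partially populated rows as well.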
-
Morris Jette authored
This fixes some errors in the documentation of how memory is allocated, and adds missing bits. Patch from Gerrit Renker, CSCS.
-
Morris Jette authored
-
Danny Auble authored
BLUEGENE - Added block node cnt to be able to differentiate between a sub-block job and a regular full block job.
-
- 16 May, 2011 6 commits
-
-
Danny Auble authored
Conflicts: src/plugins/select/bluegene/bg_record_functions.c
-
Danny Auble authored
BLUEGENE - Fix issue where accounting wasn't updated correctly when a block that had gone into an error state was resumed.
-
Moe Jette authored
Clearly document that only PreemptType=preempt/partition_prio can be used with PreemptMode=suspend. Only partition data structures exist in the module that suspends and resumes jobs.
-
Moe Jette authored
The node state was formerly reported "UNKNOWN" on node state change request errors.
-
Moe Jette authored
-
- 14 May, 2011 1 commit
-
-
Morris Jette authored
-
- 13 May, 2011 9 commits
-
-
Moe Jette authored
-
Moe Jette authored
Remove local timer functions and make use of functions in src/common/timers.[ch]
-
Moe Jette authored
Change diff_tv_str() to slurm_diff_tv_str() and diff_tv() to slurm_diff_tv() so the symbols are exported for use in the pmi library
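For readers unfamiliar with this kind of helper, a timeval difference can be computed as in the sketch below. It is not the code in src/common/timers.c: `example_diff_tv_usec()` is a hypothetical name and signature chosen for the example, and it assumes the result fits in a `long`.

```c
#include <sys/time.h>

/* Hypothetical helper: microseconds elapsed from *start to *end. */
long example_diff_tv_usec(const struct timeval *start,
                          const struct timeval *end)
{
    long delta = (long)(end->tv_sec - start->tv_sec) * 1000000L;

    /* tv_usec of end may be smaller than that of start; the signed
     * addition handles the borrow. */
    delta += (long)end->tv_usec - (long)start->tv_usec;
    return delta;
}
```

A unique prefix on the exported name, as in the rename above, reduces the chance of a symbol clash when a library is linked into other programs.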
-
Moe Jette authored
Remove "inline" declarations on extern functions in src/common so that the symbols are available to plugins.
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
When enforcing accounting, fix polling for unknown uids for users after the slurmctld started. Previously one would have to issue a reconfigure to the slurmctld to have it look for new uids.
-
Moe Jette authored
-
Moe Jette authored
-