- 12 Dec, 2013 4 commits
-
-
Morris Jette authored
There were some parsing issues and the test was not as general as it should have been.
-
Danny Auble authored
-
Danny Auble authored
throw away initialized variable.
-
Morris Jette authored
Without this patch, free() is called on a random memory location (i.e. whatever is on the stack), which can result in slurmstepd dying and a completed job not being purged in a timely fashion.
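The failure mode described above can be illustrated with a minimal sketch. It assumes a hypothetical function, build_message(), that has an early-exit path; it is not the patched slurmstepd code. Initializing the pointer to NULL makes the shared cleanup safe on every path, because free(NULL) is defined to do nothing.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical function illustrating the pattern; not the actual patch. */
static int build_message(int fail_early)
{
    char *buf = NULL;   /* initialized, so cleanup is safe on every path */
    int rc = -1;

    if (fail_early)
        goto cleanup;   /* buf was never allocated on this path */

    buf = malloc(16);
    if (buf == NULL)
        goto cleanup;
    strcpy(buf, "ok");
    rc = 0;

cleanup:
    free(buf);          /* free(NULL) is a no-op */
    return rc;
}
```

Without the "= NULL" initializer, the early-exit path would pass whatever value happened to be on the stack to free().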
-
- 11 Dec, 2013 3 commits
-
-
Morris Jette authored
-
Danny Auble authored
-
Morris Jette authored
Fix race condition in authentication credential creation that could corrupt memory. (NOTE: This race condition has existed since 2003 and would be exceedingly rare.)
-
- 10 Dec, 2013 3 commits
-
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
- 09 Dec, 2013 2 commits
-
-
Morris Jette authored
This is needed for job arrays with discontiguous task ID values (e.g. "123_[1,3,5,...99999]").
-
Morris Jette authored
Previously job arrays were only listed with their native job ID (e.g. 123_0 listed as 123, 123_1 as 124, etc). Now lists the job ID using both formats (e.g. "123_1 (124)"). The same format is used for job step IDs (e.g. "123_1.2 (124.2)").
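The combined format described above can be sketched as follows. The helper names and parameters are hypothetical and only reproduce the output shape shown in the examples; the real code may build these strings differently.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: "<array job ID>_<task ID> (<native job ID>)". */
static int format_array_job_id(char *buf, size_t len, unsigned int array_job_id,
                               unsigned int task_id, unsigned int native_job_id)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u_%u (%u)",
                    array_job_id, task_id, native_job_id);
}

/* Hypothetical helper: the same format with a step ID appended to both forms. */
static int format_array_step_id(char *buf, size_t len, unsigned int array_job_id,
                                unsigned int task_id, unsigned int native_job_id,
                                unsigned int step_id)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u_%u.%u (%u.%u)",
                    array_job_id, task_id, step_id, native_job_id, step_id);
}
```

Calling format_array_job_id(buf, sizeof(buf), 123, 1, 124) fills buf with "123_1 (124)".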
-
- 08 Dec, 2013 2 commits
-
-
jette authored
-
jette authored
If the GRES is associated with specific files AND the GRES count is reset using scontrol AND the slurmd is restarted either without a gres.conf file or with a count and no specific files AND the GRES count is then increased using scontrol, the GRES bitmap will not match its count. This fixes the root cause of the mismatch between bitmap size and GRES count and should render the rebuilding of the bitmap unnecessary. The rebuilding was handled in the following commits: ec4df3bf and 1712d619.
-
- 07 Dec, 2013 2 commits
-
-
Danny Auble authored
-
Philip D. Eckert authored
-
- 06 Dec, 2013 5 commits
-
-
Jason Bacon authored
Using CPU: Intel(R) Pentium(R) 4 CPU 2.40GHz (2392.04-MHz 686-class CPU) Origin = "GenuineIntel" Id = 0xf27 Family = f Model = 2 Stepping = 7 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>. It's also using an older version of hwloc (1.3.1) and I have not yet tested it with a newer one, but since 0 and -1 are legitimate return values for hwloc_get_nbobjs_by_type(), I think they should be handled in any case. From the hwloc_get_nbobjs_by_type() man page: "static inline int hwloc_get_nbobjs_by_type (hwloc_topology_t topology, hwloc_obj_type_t type) [static]: Returns the width of level type type. If no object for that type exists, 0 is returned. If there are several levels with objects of that type, -1 is returned." I'm attaching a smarter patch that handles both 0 and -1 return values for both CORE and SOCKET. It logs a warning if it has to fudge a 0 return code and bails out with a helpful error message for -1, which I have no idea how to handle. At least people won't have to waste time tracking down the problem this way. Happy Friday, Jason
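The handling described in this message can be sketched without hwloc itself. The helper below is hypothetical and is not the attached patch: it takes the integer already returned by hwloc_get_nbobjs_by_type() and applies the two special cases. Substituting a count of 1 for a 0 return is an assumption made for illustration; the message only says the value is "fudged".

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: interpret a value returned by
 * hwloc_get_nbobjs_by_type() for one object type (e.g. "core", "socket").
 * Returns 0 with *count set on success, or -1 if the count is ambiguous.
 */
static int check_obj_count(int nbobjs, const char *type_name, int *count)
{
    assert(type_name != NULL && count != NULL);

    if (nbobjs == -1) {
        /* Several levels contain this type: no sensible count to use. */
        fprintf(stderr, "error: multiple %s levels in topology\n", type_name);
        return -1;
    }
    if (nbobjs == 0) {
        /* No object of this type reported: assume one (assumption) and warn. */
        fprintf(stderr, "warning: no %s objects reported, assuming 1\n",
                type_name);
        *count = 1;
        return 0;
    }
    *count = nbobjs;
    return 0;
}
```

A caller would run this check once for the core count and once for the socket count before using either value.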
-
Trofinoff Stephen authored
This adds a mechanism to kill a hung apbasil command.
-
Morris Jette authored
error introduced in commit ec4df3bf
-
Jason Bacon authored
-
Morris Jette authored
An abort has been reported if the node's gres count differs from its bitmap. This has been induced by changing the count manually (e.g. "scontrol update nodename=tux123 gres=gpu:4"). I have not been able to reproduce this problem, but this will resize the bitmap in order to avoid the assert failure.
-
- 05 Dec, 2013 2 commits
-
-
Danny Auble authored
-
Danny Auble authored
news.html.
-
- 04 Dec, 2013 2 commits
-
-
Morris Jette authored
-
Morris Jette authored
Previous logic never reopened the file, preventing proper functioning of logrotate.
-
- 03 Dec, 2013 4 commits
-
-
Morris Jette authored
Use hash function to locate job records for improved performance.
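As a rough illustration of the idea, a job ID can be mapped to a bucket so that a lookup walks one short chain instead of the whole job list. The structure, table size, and function names below are hypothetical and are not taken from the Slurm source.

```c
#include <assert.h>
#include <stddef.h>

#define JOB_HASH_SIZE 64            /* hypothetical table size */

struct job_rec {
    unsigned int job_id;
    struct job_rec *hash_next;      /* next record in the same bucket */
};

static struct job_rec *job_hash[JOB_HASH_SIZE];

/* Map a job ID to a bucket index. */
static unsigned int job_hash_inx(unsigned int job_id)
{
    return job_id % JOB_HASH_SIZE;
}

/* Insert a record at the head of its bucket's chain. */
static void job_hash_add(struct job_rec *job)
{
    unsigned int inx;

    assert(job != NULL);
    inx = job_hash_inx(job->job_id);
    job->hash_next = job_hash[inx];
    job_hash[inx] = job;
}

/* Walk only the matching bucket; returns NULL if the job is unknown. */
static struct job_rec *job_hash_find(unsigned int job_id)
{
    struct job_rec *job = job_hash[job_hash_inx(job_id)];

    while (job != NULL && job->job_id != job_id)
        job = job->hash_next;
    return job;
}
```

Job IDs 123 and 187 fall into the same bucket with this table size, so both are found by walking a two-element chain.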
-
Morris Jette authored
Change partition write lock to a read lock as we use a different mechanism for hidden partitions in getting individual jobs.
-
Morris Jette authored
-
Morris Jette authored
Correct logic returning remaining job dependencies in job information reported by scontrol and squeue. Eliminates vestigial descriptors with no job ID values (e.g. "afterany"). As dependencies were removed, the job ID values were removed from the strings, but not the descriptors. This eliminates both. It also checks the full job ID to make sure we do not remove "afterany:1234" when job "123" completes.
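The full job ID check can be sketched as below. This is a minimal illustration with a hypothetical helper, not the code from the commit, and it assumes job IDs in a dependency string are terminated by ':', ',' or the end of the string. Parsing the whole number avoids the prefix match that would treat "1234" as job 123.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return 1 only if the text at `s` is the complete
 * decimal job ID `job_id`, followed by ':', ',' or end of string.
 */
static int full_job_id_match(const char *s, unsigned long job_id)
{
    char *end = NULL;
    unsigned long id;

    assert(s != NULL);
    id = strtoul(s, &end, 10);
    if (end == s)                       /* no digits were parsed */
        return 0;
    if (*end != '\0' && *end != ':' && *end != ',')
        return 0;                       /* trailing junk after the number */
    return id == job_id;
}
```

A plain prefix comparison of "123" against "1234" would succeed; this version compares the parsed number, so it does not.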
-
- 02 Dec, 2013 3 commits
-
-
Morris Jette authored
Fix race condition on batch job termination that could result in a job exit code of 0xfffffffe if the slurmd on node zero registers its active jobs at the same time that slurmstepd is recording the job's exit code. bug 535
-
Morris Jette authored
-
David Bigagli authored
-
- 29 Nov, 2013 6 commits
-
-
Morris Jette authored
-
Morris Jette authored
There was already cgroup locking in the version 14.03 code base using different variable names and slightly different logic from that in commit 3f6d9e36. This commit is a variant of that commit in order to make the logic in version 2.6 match that of our next release (logic which is already pretty well tested). bug 447
-
Morris Jette authored
proctrack/cgroup - Add locking to prevent race condition where one job step is ending for a user or job at the same time another job step is starting and the user or job container is deleted from under the starting job step. bug 447
-
Morris Jette authored
This eliminates some now redundant arrays and variable copying introduced in commit 74d1a4b4. bug 525
-
David Bigagli authored
Substantial performance improvement for systems with Shared=YES or FORCE and large numbers of running jobs (replace bubble sort with quick sort). Bug 525
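The kind of change described can be illustrated with the C library's qsort(), which replaces an O(n^2) hand-written bubble sort with a library sort. The record layout and the choice of end time as the sort key are assumptions made for this sketch; the commit message does not say what is being sorted. Note that the C standard does not require qsort() to be a quicksort.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record; the real structure and sort key are not shown here. */
struct job_entry {
    unsigned int job_id;
    long end_time;
};

/* Order by ascending end_time; written to avoid overflow from subtraction. */
static int cmp_end_time(const void *a, const void *b)
{
    const struct job_entry *ja = a;
    const struct job_entry *jb = b;

    return (ja->end_time > jb->end_time) - (ja->end_time < jb->end_time);
}

/* Sort the array in place using the library sort. */
static void sort_jobs(struct job_entry *jobs, size_t n)
{
    assert(jobs != NULL || n == 0);
    if (n > 1)
        qsort(jobs, n, sizeof(jobs[0]), cmp_end_time);
}
```

With many running jobs, the difference between quadratic and O(n log n) average behaviour is where the reported improvement would come from.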
-
David Bigagli authored
Remove trailing spaces. No changes in logic.
-
- 27 Nov, 2013 2 commits
-
-
Morris Jette authored
Original code worked only for Cray systems. For other systems it set gres_alloc to the total number of each GRES allocated on each node to any job.
-
Morris Jette authored
-