- 16 Aug, 2013 2 commits
-
Danny Auble authored
Conflicts:
	src/common/stepd_api.c
-
Danny Auble authored
-
- 15 Aug, 2013 11 commits
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
-
Danny Auble authored
Conflicts:
	src/common/read_config.c
	src/plugins/accounting_storage/pgsql/accounting_storage_pgsql.c
	src/plugins/jobcomp/pgsql/jobcomp_pgsql.c
	src/slurmctld/job_mgr.c
-
Danny Auble authored
-
Danny Auble authored
could end up before the job started. Bug 371
-
Danny Auble authored
-
Morris Jette authored
This function can now be called to test for processes which are dumping in order to avoid sending them a SIGKILL until dump completes. Change in logic required for job_container/cray.
-
Morris Jette authored
Handle invalid process ID better (probably never happens). Re-arrange logic to eliminate duplicate file close on error.
-
Danny Auble authored
-
- 14 Aug, 2013 16 commits
-
Danny Auble authored
-
Danny Auble authored
plugins more maintainable.
-
https://github.com/SchedMD/slurm
jette authored
-
jette authored
We now reject jobs with an invalid accounting frequency at submit time rather than launch time, so the error is slightly different and the test needs to change for that.
-
Morris Jette authored
-
Morris Jette authored
This avoids waiting for the job's initiation to fail.
-
Morris Jette authored
Only cancel the job.
-
Danny Auble authored
-
Morris Jette authored
Fix job state recovery logic in which a job's accounting frequency was not set. This would result in a value of 65534 seconds being used (the equivalent of NO_VAL in uint16_t), which could result in the job being requeued or aborted.
-
Morris Jette authored
-
Danny Auble authored
-
jette authored
-
David Bigagli authored
-
Morris Jette authored
Problem reported by BYU. slurm.conf included a file one byte in length. Logic created a buffer one byte long and used fgets() to read the file. fgets() reads one byte less than the buffer size to include a trailing '\0', so it fails to read the file.
-
Danny Auble authored
Basically the system size has to be set up before you call the priority/multifactor plugin. If a job is finishing while the slurmctld is starting then it would fatal on the init if it wasn't set up.
-
Danny Auble authored
-
- 13 Aug, 2013 11 commits
-
-
Morris Jette authored
-
Morris Jette authored
core reservations and reservation prolog/epilog
-
John Thiltges authored
-
Morris Jette authored
-
Morris Jette authored
-
Morris Jette authored
-
Michael Gutteridge authored
I'm running Slurm 2.6.0 and MWM 7.2.4 in our test cluster at the moment. I happened to notice that node load reporting wasn't consistent: periodically you'd see a "sane" load reported in Moab, but most of the time the reported load was zero despite an accurate CPULoad value reported by "scontrol show node".

Finally got to digging into this. It appears that the only time load was being reported properly was in the Moab scheduling cycle directly after slurmctld did a node ping. In subsequent scheduling cycles the load (again, as reported by Moab) was back to zero. The node ping is significant as that is the only time the node is updated: since the wiki2 interface only reports records that change, and the load record isn't changed, it isn't reported in the queries after the node ping. Judging from this behavior, I'm guessing that Moab does not store the load value; every time it queries resources in Slurm it sets the node's load back to zero.

I've altered src/plugins/sched/wiki2/get_nodes.c slightly: basically moved the section that reports CPULOAD above the check for updated info (update_time > last_node_update). So I don't know if this is the appropriate way to fix it. The wiki specification that Adaptive has published doesn't seem to indicate how this should function. Either MWM should assume the last value reported is still accurate or Slurm needs to report it for every wiki GETNODES command.

Anyway, the patch is attached, it seems to be working for me, and I've rolled it into our debian build directory. YMMV.

Michael
-
jette authored
I don't see how this could happen, but it might explain something reported by Harvard University. In any case, this could prevent an infinite loop if the task distribution function is passed a job allocation with zero nodes.
-
Morris Jette authored
-
jette authored
This problem was reported by Harvard University and could be reproduced with a command line of "srun -N1 --tasks-per-node=2 -O id". With other job types, the error message could be logged many times for each job. This change logs the error once per job and only if the job request does not include the -O/--overcommit option.
-
Danny Auble authored
went from the old single table per enterprise style to that of separate tables per clusters. (2.0 -> 2.*). If people are still running <2.2 they really need to upgrade (long before this), and can get the translations by upgrading to >=2.1.0 before they install this version.
-