10 Apr, 2011 (4 commits)
    • slurmstepd: avoid coredump in case of NULL job · e0d92b8a
      Moe Jette authored
      We build slurm with --enable-memory-leak-debug and twice encountered the same core
      dump when user 'root' tried to run jobs during a maintenance session.
      
      The root user is not in the accounting database, which explains the errors seen
      below. The gdb session shows that in this invocation the job pointer passed to
      _step_cleanup() was NULL.
      
      palu7:0 log>stat /var/crash/palu7-slurmstepd-6602.core 
      ...
      Modify: 2011-04-04 19:34:44.000000000 +0200
      
      slurmctld.log
      [2011-04-04T19:34:44] _slurm_rpc_submit_batch_job JobId=3254 usec=1773
      [2011-04-04T19:34:44] ALPS RESERVATION #5, JobId 3254: BASIL -n 1920 -N 0 -d 1 -m 1333
      [2011-04-04T19:34:44] sched: Allocate JobId=3254 NodeList=nid000[03-13,18-29,32-88] #CPUs=1920
      [2011-04-04T19:34:44] error: slurmd error 4005 running JobId=3254 on front_end=palu7: User not found on host
      [2011-04-04T19:34:44] update_front_end: set state of palu7 to DRAINING
      [2011-04-04T19:34:44] completing job 3254
      [2011-04-04T19:34:44] Requeue JobId=3254 due to node failure
      [2011-04-04T19:34:44] sched: job_complete for JobId=3254 successful
      [2011-04-04T19:34:44] requeue batch job 3254
      [2011-04-04T20:28:43] sched: Cancel of JobId=3254 by UID=0, usec=57285
      
      (gdb) core-file palu7-slurmstepd-6602.core 
      [New Thread 6604]
      Core was generated by `/opt/slurm/2.3.0/sbin/slurmstepd'.
      Program terminated with signal 11, Segmentation fault.
      #0  main (argc=1, argv=0x7fffd65a1fd8) at slurmstepd.c:413
      413             jobacct_gather_g_destroy(job->jobacct);
      (gdb) print job
      $1 = (slurmd_job_t *) 0x0
      (gdb) list
      408
      409     #ifdef MEMORY_LEAK_DEBUG
      410     static void
      411     _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      412     {
      413             jobacct_gather_g_destroy(job->jobacct);
      414             if (!job->batch)
      415                     job_destroy(job);
      416             /*
      417              * The message cannot be freed until the jobstep is complete
      (gdb) print msg
      $2 = (slurm_msg_t *) 0x916008
      (gdb) print rc
      $3 = -1
      (gdb) 
      
      The patch tests for a NULL job argument before the calls that need to dereference the job pointer.
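      A minimal sketch of that guard, applied to the _step_cleanup() listing shown in
      the gdb session above (the actual patch may arrange the checks differently):

      static void
      _step_cleanup(slurmd_job_t *job, slurm_msg_t *msg, int rc)
      {
              /* job may be NULL, e.g. when step setup failed before a
               * slurmd_job_t was created (rc == -1 in the core dump above) */
              if (job) {
                      jobacct_gather_g_destroy(job->jobacct);
                      if (!job->batch)
                              job_destroy(job);
              }
              /* ... remainder of the cleanup (msg handling) unchanged ... */
      }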
    • select/cray: zero reservation ID is not an error · 03f984aa
      Moe Jette authored
      This avoids meaningless error messages that warn about a zero reservation ID:
      
       [2011-04-07T15:31:26] _slurm_rpc_submit_batch_job JobId=2870 usec=33390
                             ... a minute later the user decides to scancel the queued job:
       [2011-04-07T15:32:34] error: JobId=2870 has invalid (ZERO) resId
       [2011-04-07T15:32:34] sched: Cancel of JobId=2870 by UID=21770, usec=230
      
      To keep things simple, that test has been removed.
      
      (The patch is also necessary because job_signal() may now trigger a basil_release()
       of a pending job which has no ALPS reservation yet.)
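      A hypothetical before/after sketch of the removed test (variable names are
      illustrative; the error string is the one from the log above):

      /* before: a zero ALPS reservation ID was rejected as an error */
      if (resv_id == 0) {
              error("JobId=%u has invalid (ZERO) resId", job_id);
              return SLURM_ERROR;
      }
      /*
       * after: the check is gone -- resv_id == 0 simply means the pending job
       * has no ALPS reservation yet, so there is nothing to release.
       */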
    • select/cray: release ALPS reservation on termination signals · 12772a3a
      Moe Jette authored
      On rosa we experienced severe problems when jobs were killed via scancel or
      as a result of job timeout. Job cleanup took several minutes and created stray
      processes that consumed resources on the slurmd node, leaving the system
      unable to schedule for long spans.
      
      This problem did not show up on the smaller 2-cabinet XE system (which also
      runs a more recent ALPS version). The fix is to keep further script lines from
      starting, by sending apkill only after the reservation has formally been
      released (see the sketch at the end of this message).
      
      For all signals whose default disposition is to terminate or to dump core,
      the reservation is released before signalling the aprun job steps. This
      prevents a race condition where further aprun lines get executed while the
      apkill of the current aprun line in the job script is in progress.
      
      We did a before/after test on rosa under full load and the problem disappeared.
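      In outline, the new ordering looks as follows (helper names are illustrative,
      not taken verbatim from the source; basil_release() is the ALPS release call
      mentioned in the previous commit):

      if (signal_terminates_by_default(signo)) {
              /* 1. release the ALPS reservation first, so no further aprun
               *    line in the batch script can claim it */
              basil_release(resv_id);
      }
      /* 2. only then apkill the aprun job steps of the current script line */
      signal_job_steps(job_id, signo);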
    • add testimonial from CSCS · 44bec602
      Moe Jette authored