Commit e0801444 authored by Moe Jette

select/cray: perform "safe" release of ALPS reservations

This introduces a function which, before releasing an ALPS reservation, checks
whether any application IDs (APIDs) are still associated with it.

If there are, it attempts to kill those presumably stray job steps using the
Cray apkill(1) binary. In most cases this is sufficient to release the
reservation. If, on the other hand, the reservation is formally released while
APIDs are still associated with it, the reservation remains (and its resources
are not released back) until the associated applications (APIDs) have
terminated.
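
The order of operations described above can be illustrated with a minimal
Python sketch. It is not the code of this patch: the name safe_release and the
three callbacks (list_apids, kill_apid, release) are hypothetical stand-ins for
the ALPS inventory query, the apkill(1) invocation and the reservation release.

```python
from typing import Callable, Iterable, List


def safe_release(
    resv_id: int,
    list_apids: Callable[[int], Iterable[int]],  # hypothetical: APIDs still on the reservation
    kill_apid: Callable[[int], bool],            # hypothetical: wraps the apkill(1) call
    release: Callable[[int], bool],              # hypothetical: releases the reservation
) -> List[int]:
    """Kill any APIDs still attached to resv_id, then release it.

    Returns the list of APIDs for which a kill was attempted.
    """
    attempted: List[int] = []
    # Snapshot the APIDs first so the kill loop works on a stable list.
    for apid in list(list_apids(resv_id)):
        kill_apid(apid)
        attempted.append(apid)
    # Release only after the stray applications have been signalled;
    # otherwise the reservation would linger until they exit on their own.
    release(resv_id)
    return attempted
```

Passing the three operations in as callbacks keeps the sketch independent of
any real ALPS interface and lets the ordering be exercised with in-memory
fakes.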

Use of this function is restricted to cleaning up orphaned reservations. An
attempt to also use it for normal (non-abortive) job termination generated
false positives: error messages reported an APID as still associated with the
reservation although it had been released only shortly before.

The patch passed the following test case:
 1. set up an ALPS reservation: salloc -N 12
 2. spawn long-running apruns:  for i in {1..13};do aprun sleep 3600&done
 3. (in a different window)     kill -9 $(pidof salloc)
                                scancel -u $USER
 4. after the job had completed within slurm, the following cleanup happened:
    [2011-03-05T13:19:07] debug2: purge_old_job: purged 1 old job records
    [2011-03-05T13:19:37] debug:  BASIL 3.1 INVENTORY: 128/176 batch nodes available
    [2011-03-05T13:19:37] debug:  ALPS: 12 node(s) still held
    [2011-03-05T13:19:37] error: orphaned ALPS reservation 147, trying to remove
    [2011-03-05T13:19:37] error: apkill live apid 168913 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168912 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168911 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168910 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168909 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168908 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168907 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168906 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168905 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168904 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168903 of ALPS resId 147
    [2011-03-05T13:19:37] error: apkill live apid 168902 of ALPS resId 147

 ==> Subsequently, reservation 147 was released, and a new salloc could be granted.
parent cc4dc6ac