1. 07 Apr, 2011 6 commits
  2. 06 Apr, 2011 6 commits
  3. 05 Apr, 2011 10 commits
  4. 04 Apr, 2011 8 commits
  5. 03 Apr, 2011 8 commits
    • 8d7290f7
    • version 1 of srun wrapper for aprun, still needs support added for many options and not really tested, but this is a good start · 4bc12c62
      Moe Jette authored
    • slurmctld: ignore secondary group members that have no valid login · 2f03a66f
      Moe Jette authored
      This avoids a warning message which is repeated on each reconfiguration of slurm
      and which is due to a dangling group configuration in LDAP entries.
      
      The error occurs when traversing the secondary group members of a given group
      name, when trying to add these to a configured group. If these secondary group
      members have no valid login (e.g. disabled via LDAP configuration), the error
      is repeated on each reconfigure of slurm.
      
      The error is harmless: since the users have no valid login, they cannot log into
      the system anyway. I have raised the issue described below with our LDAP admin,
      but there was no reply (likely because it was not important enough).
      
      Since slurm is not a tool to debug the work of system administrators, and since
      the secondary group members cannot log in anyway, this patch replaces the error
      message with a comment. It leaves untouched the positive case, in which secondary
      group members that have a valid passwd/LDAP login entry are successfully added
      to a configured group.
      
      Here is the case which gets repeated on our system, showing that each error message
      corresponds to a 'no such user' error when trying to look up the user id:
      
      -----------------------------------------------------------------------------------------------
      [2011-03-29T08:19:35] error: Could not find user baradmin in configured group csstaff
      [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csstaff
      [2011-03-29T08:19:35] error: Could not find user puradm in configured group csstaff
      [2011-03-29T08:19:35] error: Could not find user ggobbi in configured group csappli
      [2011-03-29T08:19:35] error: Could not find user mvalle in configured group csappli
      -----------------------------------------------------------------------------------------------
      
      palu2:0 ~>getent group csstaff
      csstaff:*:1000:baradmin,biddisco,jfavre,mvalle,puradm
      palu2:0 ~>id baradmin
      id: baradmin: No such user
      palu2:1 ~>id mvalle
      id: mvalle: No such user
      palu2:1 ~>id puradm
      id: puradm: No such user
      
      ==> The secondary group members 'biddisco' and 'jfavre' are ok, no warnings.  
      -----------------------------------------------------------------------------------------------
      palu2:1 ~>getent group csappli
      csappli:*:1010:ajocksch,alam,amangili,annaloro,biddisco,cordery,cponti,fgilles,ggobbi,grenker,jfavre,mgg,mvalle,nstring,piccinal,robinson,soumagne,tack,tadrian,uvaretto,wsawyer
      palu2:0 ~>id ggobbi
      id: ggobbi: No such user
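The behavior described in this commit message (skip group members that have no passwd entry, keep the ones that resolve) can be sketched roughly as follows. This is a minimal illustration, not the actual slurmctld code: `resolve_group_members`, `uid_lookup_fn`, `passwd_lookup`, and `demo_lookup` are hypothetical names, and the UIDs in `demo_lookup` are invented for the example. Only `getpwnam()` is a real library call.

```c
#include <assert.h>
#include <pwd.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical callback type: store the UID of `user` in *uid and return 0,
 * or return -1 if there is no such user. */
typedef int (*uid_lookup_fn)(const char *user, uid_t *uid);

/* Lookup backed by the system passwd database (files, LDAP, ... via NSS). */
int passwd_lookup(const char *user, uid_t *uid)
{
    struct passwd *pwd = getpwnam(user);

    if (pwd == NULL)
        return -1;              /* "No such user" */
    *uid = pwd->pw_uid;
    return 0;
}

/* Stand-in lookup for demonstration only: it knows the two members that
 * resolved in the log above. The UIDs are invented. */
int demo_lookup(const char *user, uid_t *uid)
{
    if (strcmp(user, "biddisco") == 0) { *uid = 2001; return 0; }
    if (strcmp(user, "jfavre") == 0)   { *uid = 2002; return 0; }
    return -1;
}

/* Walk a NULL-terminated member list (such as gr_mem from getgrnam()) and
 * collect the UIDs of members that resolve. Members without a valid login
 * are skipped silently instead of being reported as errors.
 * Returns the number of UIDs stored in `uids` (at most `max`). */
size_t resolve_group_members(char *const *members, uid_lookup_fn lookup,
                             uid_t *uids, size_t max)
{
    size_t count = 0;

    assert(members != NULL && lookup != NULL && uids != NULL);
    for (; *members != NULL && count < max; members++) {
        uid_t uid;

        if (lookup(*members, &uid) != 0)
            continue;           /* dangling entry: ignore quietly */
        uids[count++] = uid;
    }
    return count;
}
```

On a real system one would pass `passwd_lookup` together with the `gr_mem` array returned by `getgrnam()`. The callback exists only so that the filtering logic can be exercised without depending on the local user database.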
    • multiple frontend mode: avoid unnecessary slurmd configuration warning · adb3806b
      Moe Jette authored
      When running in multiple-slurmd mode, the actual hardware configuration reported
      by the slurmd is ignored, and internal entries (via register_front_ends()) just
      use 1 as a dummy value for CPUs, sockets, cores, and threads.
      
      On a dual-core service node this led to continual warning messages like:
      
      [2011-04-01T10:06:40] Node configuration differs from hardware
         Procs=1:2(hw) Sockets=1:1(hw)
         CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
      [2011-04-01T10:07:24] Node configuration differs from hardware
         Procs=1:2(hw) Sockets=1:1(hw)
         CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
    • select/cray: more carefully test for NULL job pointer · bf2f57c9
      Moe Jette authored
      This audits the select/cray code so that it does not accidentally dereference a NULL job_ptr.
      This instance occurs once, upon restart of slurmctld (detailed description below).
      Similar checks are also in place in other select plugins; in any case, it is better to check this.
      Almost all cases use xassert(); the only exception is p_job_fini(), which assumes that NULL means
      there is nothing to be finalized.
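The two NULL-handling policies mentioned in this message can be illustrated with a small sketch. It is not the select/cray code: `struct job_record` is reduced to two invented fields, `job_begin` and `job_fini` are hypothetical names, and the standard `assert()` stands in for Slurm's `xassert()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a job record. */
struct job_record {
    unsigned int job_id;
    int reservation_active;     /* invented field for the example */
};

/* Entry points that need a job treat NULL as a programming error and
 * assert on it before dereferencing the pointer. */
int job_begin(struct job_record *job_ptr)
{
    assert(job_ptr != NULL);
    job_ptr->reservation_active = 1;
    return 0;
}

/* Finalization is the exception: a NULL pointer is taken to mean that
 * there is nothing to finalize, so the call succeeds without doing work. */
int job_fini(struct job_record *job_ptr)
{
    if (job_ptr == NULL)
        return 0;
    job_ptr->reservation_active = 0;
    return 0;
}
```

The asymmetry is a design choice: an assertion exposes callers that pass NULL by mistake, while a tolerant finalizer keeps cleanup paths (for example after a controller restart) from failing on jobs that were never set up.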
    • multiple frontend mode: avoid unnecessary slurmd configuration warning · 7d15aa3d
      Moe Jette authored
      When running in multiple-slurmd mode, the actual hardware configuration reported
      by the slurmd is ignored, and internal entries (via register_front_ends()) just
      use 1 as a dummy value for CPUs, sockets, cores, and threads.
      
      On a dual-core service node this led to continual warning messages like:
      
      [2011-04-01T10:06:40] Node configuration differs from hardware
         Procs=1:2(hw) Sockets=1:1(hw)
         CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
      [2011-04-01T10:07:24] Node configuration differs from hardware
         Procs=1:2(hw) Sockets=1:1(hw)
         CoresPerSocket=1:2(hw) ThreadsPerCore=1:1(hw)
      
      Since validate_nodes_via_front_end() ignores the reported values, it is safe
      to use the actual hardware configuration here, which also helps with taking
      stock of the current cluster configuration (e.g. via scontrol show slurmd).
      
      After applying this patch, the slurmds report without warnings as
      
      [2011-04-01T12:03:38] slurmd version 2.3.0-pre4 started
      [2011-04-01T12:03:38] slurmd started on Fri 01 Apr 2011 12:03:38 +0200
      [2011-04-01T12:03:38] Procs=2 Sockets=1 Cores=2 Threads=1 Memory=3886 TmpDisk=1943 Uptime=14355
    • select/cray: attempt to free non-allocated storage · 31df4987
      Moe Jette authored
      This caused segfaults/core dumps when the slurmd/slurmctld unloaded the select/cray plugin.
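The commit message does not say how the fix was made. As a general illustration of how this class of crash is usually avoided in C, the sketch below initializes owned pointers to NULL and resets them after freeing, so that a cleanup routine never frees storage that was not allocated. All names (`plugin_state`, `plugin_state_init`, `plugin_state_set_name`, `plugin_state_fini`) are hypothetical and not taken from the plugin.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical plugin state that owns one heap-allocated string. */
struct plugin_state {
    char *node_name;            /* heap-allocated, or NULL if never set */
};

/* Start from a known state so that fini never sees a garbage pointer. */
void plugin_state_init(struct plugin_state *st)
{
    assert(st != NULL);
    st->node_name = NULL;
}

/* Store a private copy of `name`, releasing any previous copy. */
int plugin_state_set_name(struct plugin_state *st, const char *name)
{
    size_t len;
    char *copy;

    assert(st != NULL && name != NULL);
    len = strlen(name) + 1;
    copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, name, len);
    free(st->node_name);        /* free(NULL) is a no-op */
    st->node_name = copy;
    return 0;
}

/* Called on unload: safe whether or not a name was ever set, and safe
 * to call more than once. */
void plugin_state_fini(struct plugin_state *st)
{
    assert(st != NULL);
    free(st->node_name);
    st->node_name = NULL;       /* prevents a double free on repeat calls */
}
```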
  6. 02 Apr, 2011 2 commits