Using statistics to find dead processes when debugging at scale



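  • When debugging at scale you regularly run into baffling failures, and one of the nastiest is a process that has died or hung somewhere unknown.
    Here is a tiny example: rank 1 spins in an infinite loop, so every other rank hangs in the MPI_Barrier that follows: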

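    //broken.c
    #include <mpi.h>
    #include <string.h>
    #include <stdlib.h>
    int main(int argc, char **argv){
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      while (rank == 1);              /* deliberate bug: rank 1 spins forever */
      MPI_Barrier(MPI_COMM_WORLD);    /* every other rank waits here for rank 1 */
      MPI_Finalize();
      return 0;
    }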
    

    Suppose we submit it with 64 processes:

    [swsduhpc@psn004 example]$ mpicc broken.c
    [swsduhpc@psn004 example]$ bsub -I -n 64 ./a.out
    

    Open a new terminal and look up the job ID first:

    [swsduhpc@psn004 example]$ bjobs
    JOBID   STAT     USER       JOB_NAME        QUEUE             FROM  SUBMIT_TIME    START_TIME     NODENUM NODELIST
    -------------------------------------------------------------------------------------------------------------------
    42191725 RUN      swsduhpc   a.out           q_sw_expr         psn0* May 04 15:18   May 04 15:18   16      58-60,63,65-67,81,92,96-99,101-103
    

    Now send signal 11 (SIGSEGV) to the job to force a segmentation fault:

    [swsduhpc@psn004 example]$ bsignal -s 11 42191725
    

    This makes the hung processes segfault, and each one reports the PC it was stopped at, giving output like this:

     CATCHSIG: Myid = 0(CPU   58,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff047b700)
     CATCHSIG: Myid = 1(CPU   58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
     CATCHSIG: Myid = 2(CPU   58,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff0625780)
    [vn000058:mpi_rank_2][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 2: No such file or directory (2)
     CATCHSIG: Myid = 3(CPU   58,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b578)
    [vn000058:mpi_rank_3][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 3: No such file or directory (2)
     CATCHSIG: Myid = 4(CPU   59,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff04ac900)
     CATCHSIG: Myid = 5(CPU   59,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0628780)
    [vn000059:mpi_rank_5][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 5: No such file or directory (2)
     CATCHSIG: Myid = 6(CPU   59,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047cebc)
    [vn000059:mpi_rank_6][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 6: No such file or directory (2)
     CATCHSIG: Myid = 7(CPU   59,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04aca9c)
    [vn000059:mpi_rank_7][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 7: No such file or directory (2)
     CATCHSIG: Myid = 9(CPU   60,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff047b66c)
     CATCHSIG: Myid = 10(CPU   60,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047b63c)
     CATCHSIG: Myid = 11(CPU   60,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04a9390)
     CATCHSIG: Myid = 13(CPU   63,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff04acba8)
     CATCHSIG: Myid = 14(CPU   63,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ceb4)
     CATCHSIG: Myid = 15(CPU   63,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff06257f0)
     CATCHSIG: Myid = 25(CPU   67,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0625990)
     CATCHSIG: Myid = 26(CPU   67,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ce40)
     CATCHSIG: Myid = 27(CPU   67,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b620)
    ...
    
    

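    Now we need some way to digest this pile of PC values and spot the odd one out. bpeek shows a job's output from another terminal; grep with a regular expression extracts every "PC = xxxx" string, sort orders them, and uniq -c counts how many processes stopped at each PC: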

    [swsduhpc@psn004 example]$ bpeek 42191725  | grep -o -E "PC = 0x[0-9a-f]*" | sort | uniq -c
          2 PC = 0x4ff047b588
          1 PC = 0x4ff047b5f0
          1 PC = 0x4ff047b600
          2 PC = 0x4ff047b604
          1 PC = 0x4ff047b610
          1 PC = 0x4ff047b620
          1 PC = 0x4ff047b64c
          5 PC = 0x4ff047b6ec
          2 PC = 0x4ff047b728
          1 PC = 0x4ff047ce50
          1 PC = 0x4ff047cec4
          1 PC = 0x4ff047cf58
          2 PC = 0x4ff047cf74
          1 PC = 0x4ff04a93a0
          1 PC = 0x4ff04a93c0
          1 PC = 0x4ff04a94f0
          2 PC = 0x4ff04a951c
          1 PC = 0x4ff04a97a0
          1 PC = 0x4ff04ac8c0
          1 PC = 0x4ff04ac8f0
          1 PC = 0x4ff04ac930
          1 PC = 0x4ff04ac940
          1 PC = 0x4ff04ac950
          2 PC = 0x4ff04ac98c
          1 PC = 0x4ff04acab0
          1 PC = 0x4ff04acb18
          2 PC = 0x4ff04acb2c
          1 PC = 0x4ff04acbb0
          1 PC = 0x4ff04acbb8
          1 PC = 0x4ff04acde0
          1 PC = 0x4ff04e02d0
          1 PC = 0x4ff04e0310
          1 PC = 0x4ff04e0340
          1 PC = 0x4ff04e1038
          1 PC = 0x4ff04e1040
          1 PC = 0x4ff0625178
          1 PC = 0x4ff06251a4
          1 PC = 0x4ff06251c0
          2 PC = 0x4ff06251e0
          1 PC = 0x4ff0625790
          2 PC = 0x4ff06257dc
          3 PC = 0x4ff06257f0
          1 PC = 0x4ff0625800
          1 PC = 0x4ff0625970
          1 PC = 0x4ff062597c
          2 PC = 0x4ff06259a0
          1 PC = 0x4ff06287b0
          1 PC = 0x4ff0629b6c
    

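    The picture is not very clear yet, but it is easy to see that many PCs sit close together while a few look out of place.
    If we are feeling lazy, we can change the grep pattern to keep only the first few hex digits of each PC, which gives a much sharper result: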

    [swsduhpc@psn004 swpf]$ bpeek 42191725  | grep -o -E "PC = 0x[0-9a-f]{6}" | sort | uniq -c
          1 PC = 0x4ff041
         21 PC = 0x4ff047
         21 PC = 0x4ff04a
          3 PC = 0x4ff04e
         18 PC = 0x4ff062
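
    The same bucketing can be wrapped in a small shell function so that the bucket width is a parameter and the rarest buckets come first. This is only a sketch: pc_buckets is a name made up here, and it assumes the job output keeps the "PC = 0x..." format shown above.

```shell
# pc_buckets [DIGITS]: hypothetical helper, not part of the job tools.
# Reads job output on stdin and prints "count prefix" pairs, rarest first.
# DIGITS (default 6) is how many leading hex digits of each PC to keep.
pc_buckets() {
  grep -o -E "PC = 0x[0-9a-f]{${1:-6}}" \
    | sort | uniq -c | sort -n \
    | awk '{ print $1, $4 }'
}

# Usage, with the job ID from the example:
#   bpeek 42191725 | pc_buckets 6
```

    Sorting by count puts the suspicious buckets at the top, which helps once the list is too long to scan by eye.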
    

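    Since we prefer to believe that most processes are healthy, the faulty process should be at a PC that appears only rarely, and those few can be checked by hand. First we recover the full PC: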

    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
     CATCHSIG: Myid = 1(CPU   58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
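
    When several buckets are rare, grepping for each prefix by hand gets tedious. The sketch below prints the full CATCHSIG line of every process whose PC bucket is rare in a single pass. The name outlier_lines and both of its parameters are assumptions made for illustration.

```shell
# outlier_lines [DIGITS] [MAX]: hypothetical helper.
# Reads job output on stdin and prints every line whose PC bucket
# (first DIGITS hex digits, default 6) occurs at most MAX times (default 1).
outlier_lines() {
  awk -v d="${1:-6}" -v m="${2:-1}" '
    match($0, /PC = 0x[0-9a-f]+/) {
      key = substr($0, RSTART + 7, d)   # skip the 7 characters of "PC = 0x"
      count[key]++
      line[++n] = $0
      bucket[n] = key
    }
    END {
      for (i = 1; i <= n; i++)
        if (count[bucket[i]] <= m) print line[i]
    }
  '
}

# Usage, with the job ID from the example:
#   bpeek 42191725 | outlier_lines 6 1
```

    Raising MAX widens the net. With the bucket counts above, a value of 3 would also list the three processes in the 0x4ff04e bucket.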
    

    Then use addr2line to locate the source line:

    [swsduhpc@psn004 example]$ addr2line -e ../swdb/a.out 0x4ff0416b70
    [path redacted]/example/broken.c:8
    

    Or, if there is no debug information, use sw5objdump:

    [swsduhpc@psn004 example]$ sw5objdump -d ./a.out | less
    

    less will show output like the following:

    a.out:     file format elf64-sw_64
    
    Disassembly of section .text1:
    
    0000004ff0410120 <slave___phy_put_for_host>:
      4ff0410120:   e7 0f bb ff     ldih    gp,4071(t12)
      4ff0410124:   40 72 bd fb     ldi     gp,29248(gp)
      4ff0410128:   78 92 7d 8f     ldl     t12,-28040(gp)
      4ff041012c:   00 00 1f fe     ldih    a0,0(zero)
      4ff0410130:   00 00 1f fc     ldih    v0,0(zero)
      4ff0410134:   00 00 1f fd     ldih    t7,0(zero)
      4ff0410138:   00 00 df fc     ldih    t5,0(zero)
      4ff041013c:   80 ff de fb     ldi     sp,-128(sp)
    
    0000004ff0410140 <.LCFI___phy_put_for_host_adjustsp>:
      4ff0410140:   d9 15 e8 4b     log2xe  zero,0x40,t11
      4ff0410144:   00 01 10 fa     ldi     a0,256(a0)
      4ff0410148:   24 00 3e ab     stw     t11,36(sp)
      4ff041014c:   20 00 fe ab     stw     zero,32(sp)
      4ff0410150:   d0 00 00 f8     ldi     v0,208(v0)
      4ff0410154:   08 01 08 f9     ldi     t7,264(t7)
      4ff0410158:   e0 00 c6 f8     ldi     t5,224(t5)
      4ff041015c:   01 fe 1f 1b     rcsr    t10,0x1
      4ff0410160:   02 fe 5f 1a     rcsr    a2,0x2
      4ff0410164:   08 00 3b 8e     ldl     a1,8(t12)
      4ff0410168:   93 01 1f 43     s8addl  t10,zero,a3
    
    0000004ff041016c <.BB2___phy_put_for_host>:
      4ff041016c:   c8 9c 9d 8e     ldl     a4,-25400(gp)
      4ff0410170:   19 00 f3 43     sextw   a3,t11
      4ff0410174:   19 00 59 42     addw    a2,t11,t11
      4ff0410178:   10 00 3b 8c     ldl     t0,16(t12)
      4ff041017c:   18 00 9b 8b     ldw     at,24(t12)
    

    In less, press /, paste the PC, and hit Enter to search:

    /4ff0416b70
    

    less will jump to something like:

    0000004ff0416b54 <.BB3_main>:
      4ff0416b54:   10 00 1e 88     ldw     v0,16(sp)
      4ff0416b58:   e7 0f ba ff     ldih    gp,4071(ra)
      4ff0416b5c:   0c 08 bd fb     ldi     gp,2060(gp)
      4ff0416b60:   00 25 00 48     cmpeq   v0,0x1,v0
      4ff0416b64:   03 00 00 c0     beq     v0,4ff0416b74 <.L_3_1794>
    
    0000004ff0416b68 <.L_3_1538>:
      4ff0416b68:   10 00 1e 88     ldw     v0,16(sp)
      4ff0416b6c:   00 25 00 48     cmpeq   v0,0x1,v0
      4ff0416b70:   fd ff 1f c4     bne     v0,4ff0416b68 <.L_3_1538> <--- less highlights the match here
    
    0000004ff0416b74 <.L_3_1794>:
      4ff0416b74:   00 00 e0 13     br      4ff0416b78 <.Lt_3_770>
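
    The enclosing function can also be found without scrolling. The sketch below remembers the most recent symbol header whose name does not start with a dot and prints it when it reaches the wanted address. It assumes, based only on the excerpts here, that compiler-generated block labels (.BB3_main, .L_3_1538, .LCFI_...) start with "." while real function names do not. The name enclosing_function is hypothetical.

```shell
# enclosing_function ADDR: hypothetical helper.
# Reads "objdump -d"-style disassembly on stdin and prints the last label
# not starting with "." that precedes the instruction at ADDR
# (hex digits, without the 0x prefix).
enclosing_function() {
  awk -v addr="$1" '
    # Symbol header, e.g. "0000004ff0416b54 <.BB3_main>:"
    $2 ~ /^<[^.][^>]*>:$/ {
      name = $2
      sub(/^</, "", name)
      sub(/>:$/, "", name)
    }
    # Instruction line, e.g. "4ff0416b70:   fd ff 1f c4     bne ..."
    $1 == (addr ":") { print name; exit }
  '
}

# Usage:
#   sw5objdump -d ./a.out | enclosing_function 4ff0416b70
```

    Nothing is printed if the address never appears in the disassembly, so an empty result means the PC lies outside this binary's text section.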
    
    

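    Scrolling up shows that this is inside the main function.
    So we can reasonably guess that this is the bad PC, and search for the process that reported it: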

    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
     CATCHSIG: Myid = 1(CPU   58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
    

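    Our conclusion: rank 1 on CPU 58 is stuck, and the other processes hang because they are waiting for it.
    Put the node that failed last time to sleep, then submit the job again: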

    [swsduhpc@psn004 example]$ bsub -node 58 sleep 1h
    

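    Then work out whether the node or the program is at fault: if the node is to blame, the rerun will likely succeed or a different process will fail; if the program is to blame, the same rank will usually hang again.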



  • Respect. Taking this with me!



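  • Let me chime in as well.
    A big reason the BBS is so quiet is that people prefer to walk up and ask in person.
    That is not ideal.
    Think about programming on x86: whatever goes wrong, you look it up on StackOverflow. When we hit a problem, there is nowhere to search and solve it ourselves.
    So I strongly urge everyone to post their questions on the BBS.
    Even if you asked in person, posting the answer you got afterwards would help.
    On another note, I ran into a different problem: while helping A-Liao debug CAM today, I found that two processes had simply vanished.
    The following pipeline finds where ranks are missing from the sequence (each number printed is the rank right after a gap, that is, its predecessor is missing). Note the sort -n, since a plain sort would order the ranks as text:

    bpeek <jobid> | grep -o -E 'Myid = [0-9]*' | awk '{print $3}' | sort -n | awk 'BEGIN{last=-1}{if ($1 > last +1) print $1; last = $1}'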
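
    A variant that prints the missing ranks themselves, rather than the rank after each gap, is sketched below. It needs the total number of ranks as an argument (64 in the first post's example), which also lets it notice ranks missing at the very end. The name missing_ranks is made up here, and it treats any rank that printed no "Myid = N" line as missing.

```shell
# missing_ranks TOTAL: hypothetical helper.
# Reads job output on stdin and prints every rank in 0..TOTAL-1
# that never appears in a "Myid = N" string.
missing_ranks() {
  grep -o -E 'Myid = [0-9]+' | awk -v n="$1" '
    { seen[$3] = 1 }                 # $3 is the rank number
    END {
      for (r = 0; r < n; r++)
        if (!(r in seen)) print r
    }
  '
}

# Usage:
#   bpeek <jobid> | missing_ranks 64
```

    Because the ranks are stored in an array instead of compared with their neighbours, the input does not have to be sorted and repeated lines from the same rank do no harm.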

