Using a statistical approach to find dead processes when debugging at scale
-
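Large-scale debugging regularly turns up baffling errors. One of the nastier cases is a process that has silently hung somewhere.
Here is a tiny example, broken.c. Rank 1 spins in an infinite loop, so every other rank blocks forever in the MPI_Barrier that follows:

    #include <mpi.h>
    #include <string.h>
    #include <stdlib.h>
    int main(int argc, char **argv){
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        while (rank == 1);              /* rank 1 spins here forever */
        MPI_Barrier(MPI_COMM_WORLD);    /* every other rank waits here */
        MPI_Finalize();
        return 0;
    }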
Suppose we submit it with 64 processes:

    [swsduhpc@psn004 example]$ mpicc broken.c
    [swsduhpc@psn004 example]$ bsub -I -n 64 ./a.out
Open a new terminal and look up the job ID first:

    [swsduhpc@psn004 example]$ bjobs
    JOBID     STAT  USER      JOB_NAME  QUEUE      FROM   SUBMIT_TIME   START_TIME    NODENUM  NODELIST
    ---------------------------------------------------------------------------------------------------------------------
    42191725  RUN   swsduhpc  a.out     q_sw_expr  psn0*  May 04 15:18  May 04 15:18  16       58-60,63,65-67,81,92,96-99,101-103
Now deliberately trigger a segmentation fault in the job by sending it signal 11 (SIGSEGV):

    [swsduhpc@psn004 example]$ bsignal -s 11 42191725
The hung processes then segfault, and each one prints the PC it was stopped at, giving output like this:
    CATCHSIG: Myid = 0(CPU 58,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff047b700)
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
    CATCHSIG: Myid = 2(CPU 58,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff0625780)
    [vn000058:mpi_rank_2][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 2: No such file or directory (2)
    CATCHSIG: Myid = 3(CPU 58,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b578)
    [vn000058:mpi_rank_3][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 3: No such file or directory (2)
    CATCHSIG: Myid = 4(CPU 59,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff04ac900)
    CATCHSIG: Myid = 5(CPU 59,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0628780)
    [vn000059:mpi_rank_5][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 5: No such file or directory (2)
    CATCHSIG: Myid = 6(CPU 59,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047cebc)
    [vn000059:mpi_rank_6][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 6: No such file or directory (2)
    CATCHSIG: Myid = 7(CPU 59,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04aca9c)
    [vn000059:mpi_rank_7][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 7: No such file or directory (2)
    CATCHSIG: Myid = 9(CPU 60,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff047b66c)
    CATCHSIG: Myid = 10(CPU 60,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047b63c)
    CATCHSIG: Myid = 11(CPU 60,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04a9390)
    CATCHSIG: Myid = 13(CPU 63,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff04acba8)
    CATCHSIG: Myid = 14(CPU 63,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ceb4)
    CATCHSIG: Myid = 15(CPU 63,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff06257f0)
    CATCHSIG: Myid = 25(CPU 67,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0625990)
    CATCHSIG: Myid = 26(CPU 67,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ce40)
    CATCHSIG: Myid = 27(CPU 67,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b620)
    ...
What we need now is a way to process this pile of PC values and spot the one that does not belong. bpeek shows a job's output from another terminal; grep with a regular expression pulls out every "PC = xxxx" string, sort orders them, and uniq -c counts how many processes stopped at each PC:

    [swsduhpc@psn004 example]$ bpeek 42191725 | grep -o -E "PC = 0x[0-9a-f]*" | sort | uniq -c
          2 PC = 0x4ff047b588
          1 PC = 0x4ff047b5f0
          1 PC = 0x4ff047b600
          2 PC = 0x4ff047b604
          1 PC = 0x4ff047b610
          1 PC = 0x4ff047b620
          1 PC = 0x4ff047b64c
          5 PC = 0x4ff047b6ec
          2 PC = 0x4ff047b728
          1 PC = 0x4ff047ce50
          1 PC = 0x4ff047cec4
          1 PC = 0x4ff047cf58
          2 PC = 0x4ff047cf74
          1 PC = 0x4ff04a93a0
          1 PC = 0x4ff04a93c0
          1 PC = 0x4ff04a94f0
          2 PC = 0x4ff04a951c
          1 PC = 0x4ff04a97a0
          1 PC = 0x4ff04ac8c0
          1 PC = 0x4ff04ac8f0
          1 PC = 0x4ff04ac930
          1 PC = 0x4ff04ac940
          1 PC = 0x4ff04ac950
          2 PC = 0x4ff04ac98c
          1 PC = 0x4ff04acab0
          1 PC = 0x4ff04acb18
          2 PC = 0x4ff04acb2c
          1 PC = 0x4ff04acbb0
          1 PC = 0x4ff04acbb8
          1 PC = 0x4ff04acde0
          1 PC = 0x4ff04e02d0
          1 PC = 0x4ff04e0310
          1 PC = 0x4ff04e0340
          1 PC = 0x4ff04e1038
          1 PC = 0x4ff04e1040
          1 PC = 0x4ff0625178
          1 PC = 0x4ff06251a4
          1 PC = 0x4ff06251c0
          2 PC = 0x4ff06251e0
          1 PC = 0x4ff0625790
          2 PC = 0x4ff06257dc
          3 PC = 0x4ff06257f0
          1 PC = 0x4ff0625800
          1 PC = 0x4ff0625970
          1 PC = 0x4ff062597c
          2 PC = 0x4ff06259a0
          1 PC = 0x4ff06287b0
          1 PC = 0x4ff0629b6c
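The picture is not very clear yet, but it is easy to see that some PCs sit close together while others look out of place.
If we are feeling lazy, we can change the grep regex to keep only the first few hex digits of each PC, which gives a much more striking result:

    [swsduhpc@psn004 swpf]$ bpeek 42191725 | grep -o -E "PC = 0x[0-9a-f]{6}" | sort | uniq -c
          1 PC = 0x4ff041
         21 PC = 0x4ff047
         21 PC = 0x4ff04a
          3 PC = 0x4ff04e
         18 PC = 0x4ff062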
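The prefix-counting step can be wrapped in a small helper so the prefix length is easy to vary. This is only a sketch: the function name pc_histogram and its DIGITS argument are made up for illustration, and it assumes the job output contains "PC = 0x..." strings in the format shown above. It also sorts by count, so the rarest prefixes come first.

```shell
# pc_histogram [DIGITS]: read job output on stdin and count how many
# processes share each PC prefix. DIGITS (hypothetical parameter) is the
# number of hex digits kept after "0x"; it defaults to 6.
pc_histogram() {
    digits="${1:-6}"
    grep -o -E "PC = 0x[0-9a-f]{${digits}}" |   # keep only the truncated PC
        sort |                                    # group identical prefixes
        uniq -c |                                 # count each group
        sort -n                                   # rarest prefix first
}
```

Typical use would be: bpeek 42191725 | pc_histogram 6. Raising DIGITS splits the groups further, lowering it merges them.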
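Since we prefer to believe that most processes are healthy, the abnormal process should be the one whose PC shows up least often. From here we can investigate by hand, starting with the full PC:

    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)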
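Picking the rarest prefix and looking up its full lines can be combined into one step. The sketch below assumes the job output was first saved to a file (for example with bpeek 42191725 > job.log, where job.log is a hypothetical name), because the log is read twice. suspect_lines is also a made-up name. It reports only the single rarest prefix, and ties are broken arbitrarily, so it is a starting point for the manual check rather than a verdict.

```shell
# suspect_lines DIGITS LOGFILE: print the log lines whose PC prefix is the
# least common one. Both arguments are illustrative, not part of any tool.
suspect_lines() {
    digits="$1"
    log="$2"
    # First pass: find the PC prefix with the smallest count.
    rare=$(grep -o -E "PC = 0x[0-9a-f]{${digits}}" "$log" |
        sort | uniq -c | sort -n | awk 'NR == 1 {print $4}')
    # Second pass: show the complete lines for that prefix, if any was found.
    if [ -n "$rare" ]; then
        grep -F "PC = ${rare}" "$log"
    fi
}
```

With the 64-process example above, suspect_lines 6 job.log would be expected to print the Myid = 1 line, since 0x4ff041 occurs only once.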
Then try addr2line to locate the offending source line:

    [swsduhpc@psn004 example]$ addr2line -e ../swdb/a.out 0x4ff0416b70
    [path redacted]/example/broken.c:8
Or, if there is no debug information, use sw5objdump:

    [swsduhpc@psn004 example]$ sw5objdump -d ./a.out | less

less will show output like the following:

    a.out:     file format elf64-sw_64

    Disassembly of section .text1:

    0000004ff0410120 <slave___phy_put_for_host>:
      4ff0410120:   e7 0f bb ff   ldih    gp,4071(t12)
      4ff0410124:   40 72 bd fb   ldi     gp,29248(gp)
      4ff0410128:   78 92 7d 8f   ldl     t12,-28040(gp)
      4ff041012c:   00 00 1f fe   ldih    a0,0(zero)
      4ff0410130:   00 00 1f fc   ldih    v0,0(zero)
      4ff0410134:   00 00 1f fd   ldih    t7,0(zero)
      4ff0410138:   00 00 df fc   ldih    t5,0(zero)
      4ff041013c:   80 ff de fb   ldi     sp,-128(sp)

    0000004ff0410140 <.LCFI___phy_put_for_host_adjustsp>:
      4ff0410140:   d9 15 e8 4b   log2xe  zero,0x40,t11
      4ff0410144:   00 01 10 fa   ldi     a0,256(a0)
      4ff0410148:   24 00 3e ab   stw     t11,36(sp)
      4ff041014c:   20 00 fe ab   stw     zero,32(sp)
      4ff0410150:   d0 00 00 f8   ldi     v0,208(v0)
      4ff0410154:   08 01 08 f9   ldi     t7,264(t7)
      4ff0410158:   e0 00 c6 f8   ldi     t5,224(t5)
      4ff041015c:   01 fe 1f 1b   rcsr    t10,0x1
      4ff0410160:   02 fe 5f 1a   rcsr    a2,0x2
      4ff0410164:   08 00 3b 8e   ldl     a1,8(t12)
      4ff0410168:   93 01 1f 43   s8addl  t10,zero,a3

    0000004ff041016c <.BB2___phy_put_for_host>:
      4ff041016c:   c8 9c 9d 8e   ldl     a4,-25400(gp)
      4ff0410170:   19 00 f3 43   sextw   a3,t11
      4ff0410174:   19 00 59 42   addw    a2,t11,t11
      4ff0410178:   10 00 3b 8c   ldl     t0,16(t12)
      4ff041017c:   18 00 9b 8b   ldw     at,24(t12)
In less, press /, paste the PC (without the 0x prefix), and hit Enter to search:

    /4ff0416b70

less jumps to something like:

    0000004ff0416b54 <.BB3_main>:
      4ff0416b54:   10 00 1e 88   ldw     v0,16(sp)
      4ff0416b58:   e7 0f ba ff   ldih    gp,4071(ra)
      4ff0416b5c:   0c 08 bd fb   ldi     gp,2060(gp)
      4ff0416b60:   00 25 00 48   cmpeq   v0,0x1,v0
      4ff0416b64:   03 00 00 c0   beq     v0,4ff0416b74 <.L_3_1794>

    0000004ff0416b68 <.L_3_1538>:
      4ff0416b68:   10 00 1e 88   ldw     v0,16(sp)
      4ff0416b6c:   00 25 00 48   cmpeq   v0,0x1,v0
      4ff0416b70:   fd ff 1f c4   bne     v0,4ff0416b68 <.L_3_1538>    <--- less highlights the match here

    0000004ff0416b74 <.L_3_1794>:
      4ff0416b74:   00 00 e0 13   br      4ff0416b78 <.Lt_3_770>
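As an alternative to paging through the whole disassembly, one could disassemble only a small window around the suspect PC. The helper below just computes that window. pc_window and its BYTES argument are made-up names, and the default of 32 bytes is an arbitrary choice.

```shell
# pc_window PC [BYTES]: print "START STOP", the hex addresses BYTES below
# and above PC. BYTES (hypothetical parameter) defaults to 32.
pc_window() {
    pc="$1"
    span="${2:-32}"
    # Shell arithmetic accepts 0x-prefixed hex constants.
    printf '0x%x 0x%x\n' "$(($pc - $span))" "$(($pc + $span))"
}
```

GNU objdump accepts --start-address and --stop-address. Whether sw5objdump supports the same options is an assumption that has not been checked here. If it does, something like the following would print just the loop: set -- $(pc_window 0x4ff0416b70); sw5objdump -d --start-address="$1" --stop-address="$2" ./a.out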
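Scrolling up shows that this is inside the main function (the enclosing label is .BB3_main). The bne that jumps back to .L_3_1538 is the while (rank == 1); loop.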
So we can be fairly confident that this is the bad PC. Search for the process that goes with it:

    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
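Our working conclusion: process 1 on CPU 58 is hung, and the other processes are stuck waiting for it.
Park the node that misbehaved with a sleep job so the next run cannot land on it, then resubmit:

    [swsduhpc@psn004 example]$ bsub -node 58 sleep 1h

Then work out whether the fault lies with the node (the rerun may succeed, or a different process may fail) or with the program (in which case the same rank usually hangs again).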
-
Respect. Bookmarking this!
-
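Let me chime in once more.
The BBS is this quiet largely because people prefer to walk up and ask in person.
That is actually not great.
Think about programming on x86: whenever something goes wrong you head to Stack Overflow. When we hit a problem, there is nowhere to search for an answer ourselves.
So I strongly recommend posting your questions on the BBS.
Even if you asked in person and got an answer, sharing it here afterwards is fine too.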
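Separately, I ran into another problem: while helping A-Liao debug CAM today, I found that two processes had simply vanished.
Running

    bpeek <jobid> | grep -o -E 'Myid = [0-9]*' | awk '{print $3}' | sort -n | awk 'BEGIN{last=-1}{if ($1 > last + 1) print $1; last = $1}'

shows where a process is missing in the middle of the sequence (sort -n keeps the ranks in numeric order). Each number printed is the rank that comes right after a gap, meaning its predecessor is the one that was lost.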
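If the total number of processes is known, a small variation can list the missing ranks themselves, including several in a row or ones at the very end. This is only a sketch: missing_ranks and its TOTAL argument are made-up names, and it assumes every surviving process prints a "Myid = N" line in the format shown above.

```shell
# missing_ranks TOTAL: read job output on stdin and print every rank in
# 0..TOTAL-1 that has no "Myid = N" line. TOTAL is a hypothetical parameter
# (the number of processes the job was submitted with).
missing_ranks() {
    total="$1"
    grep -o -E 'Myid = [0-9]+' |
        awk -v total="$total" '
            { seen[$3] = 1 }                       # $3 is the rank number
            END {
                for (r = 0; r < total; r++)
                    if (!(r in seen)) print r      # rank never reported
            }'
}
```

For the 64-process job above this would be run as: bpeek <jobid> | missing_ranks 64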