Nice!
For --whole-archive, see this article.
swmore
@swmore
Posts by swmore
-
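Tracing MPI communication through screen output
The function-wrapping (wrap) feature of gcc/Open64 makes it possible to print each MPI call to the screen at the moment it is made.
Click wrap-mpi.tgz to download a small example.
Run gen_and_test.sh to generate the wrapped MPI functions.
Edit wrapping_funcs to choose which functions are wrapped.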
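As a rough illustration of how a list of function names can become wrap options (this is not the content of gen_and_test.sh, which is not shown here): GNU ld's `--wrap=symbol` option redirects calls to `symbol` to `__wrap_symbol` and keeps the original reachable as `__real_symbol`. The sketch below assumes a hypothetical list format with one function name per line; the real format of wrapping_funcs may differ.

```shell
# Turn a list of function names (one per line on stdin; blank lines and
# lines starting with '#' are ignored) into -Wl,--wrap=NAME flags.
# The one-name-per-line format is an assumption, not the documented
# format of wrapping_funcs.
wrap_flags() {
    awk '
        /^[ \t]*(#|$)/ { next }                          # skip comments and blanks
        { printf "%s-Wl,--wrap=%s", sep, $1; sep = " " } # space-separated flags
        END { if (sep != "") printf "\n" }
    '
}
```

A possible use would be `mpicc main.c wrappers.c $(wrap_flags < wrapping_funcs)`, where wrappers.c is a hypothetical file defining `__wrap_MPI_Send` and friends that print the call and then forward to `__real_MPI_Send`.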
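-
RE: Cannot open the graphical interface
Please tell us which ssh client you are using.
The basic things to try:
ssh -X username@server: X11 forwarding in untrusted mode (subject to the X11 SECURITY extension restrictions)
ssh -Y username@server: trusted X11 forwarding (not subject to those restrictions)
Beginners are advised to use MobaXterm; experienced users can try vcxsrv plus the Windows Subsystem for Linux.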
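A quick way to tell whether forwarding was actually negotiated is to look at DISPLAY on the server side. The helper below is only a sketch (the function name is made up for illustration); it relies on the fact that sshd sets DISPLAY when X11 forwarding is active.

```shell
# Run on the server after logging in with ssh -X or ssh -Y.
# sshd sets DISPLAY (typically something like localhost:10.0) when X11
# forwarding was set up; an empty value means it was not.
check_x11_forwarding() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "DISPLAY=$DISPLAY: X11 forwarding appears to be active"
    else
        echo "DISPLAY is not set: X11 forwarding is not active"
        return 1
    fi
}
```

If DISPLAY is empty even with -X or -Y, the usual suspects are a missing X server on the client side or X11 forwarding being disabled on the server.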
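-
Using a statistical method to find dead processes in large-scale debugging
Large-scale debugging often turns up baffling errors, and one of the nastier cases is a process that has died or hung somewhere unknown.
Below is a tiny example in which rank 1 spins in an infinite loop, so the MPI_Barrier that follows never completes. broken.c:
    #include <mpi.h>
    #include <string.h>
    #include <stdlib.h>
    int main(int argc, char **argv){
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        while (rank == 1);              /* rank 1 never leaves this loop */
        MPI_Barrier(MPI_COMM_WORLD);    /* so every other rank waits here forever */
        MPI_Finalize();
        return 0;
    }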
Suppose it is submitted with 64 processes:
    [swsduhpc@psn004 example]$ mpicc broken.c
    [swsduhpc@psn004 example]$ bsub -I -n 64 ./a.out
Open a new terminal and look up the job ID first:
    [swsduhpc@psn004 example]$ bjobs
    JOBID     STAT  USER      JOB_NAME  QUEUE      FROM   SUBMIT_TIME   START_TIME    NODENUM  NODELIST
    -------------------------------------------------------------------------------------------------------------------
    42191725  RUN   swsduhpc  a.out     q_sw_expr  psn0*  May 04 15:18  May 04 15:18  16       58-60,63,65-67,81,92,96-99,101-103
Now send signal 11 (SIGSEGV) to the job to force a segmentation fault on its nodes:
    [swsduhpc@psn004 example]$ bsignal -s 11 42191725
Every process, including the stuck one, then takes a segmentation fault and prints output similar to this:
    CATCHSIG: Myid = 0(CPU 58,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff047b700)
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
    CATCHSIG: Myid = 2(CPU 58,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff0625780)
    [vn000058:mpi_rank_2][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 2: No such file or directory (2)
    CATCHSIG: Myid = 3(CPU 58,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b578)
    [vn000058:mpi_rank_3][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 3: No such file or directory (2)
    CATCHSIG: Myid = 4(CPU 59,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff04ac900)
    CATCHSIG: Myid = 5(CPU 59,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0628780)
    [vn000059:mpi_rank_5][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 5: No such file or directory (2)
    CATCHSIG: Myid = 6(CPU 59,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047cebc)
    [vn000059:mpi_rank_6][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 6: No such file or directory (2)
    CATCHSIG: Myid = 7(CPU 59,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04aca9c)
    [vn000059:mpi_rank_7][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 7: No such file or directory (2)
    CATCHSIG: Myid = 9(CPU 60,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff047b66c)
    CATCHSIG: Myid = 10(CPU 60,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047b63c)
    CATCHSIG: Myid = 11(CPU 60,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff04a9390)
    CATCHSIG: Myid = 13(CPU 63,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff04acba8)
    CATCHSIG: Myid = 14(CPU 63,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ceb4)
    CATCHSIG: Myid = 15(CPU 63,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff06257f0)
    CATCHSIG: Myid = 25(CPU 67,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0625990)
    CATCHSIG: Myid = 26(CPU 67,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff047ce40)
    CATCHSIG: Myid = 27(CPU 67,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff047b620)
    ...
We now need a way to process this pile of PC values and find the odd one out. bpeek is a tool for viewing a job's output from another terminal; grep with a regular expression extracts every "PC = xxxx" item, sort orders them, and uniq -c counts how many processes stopped at each PC:
    [swsduhpc@psn004 example]$ bpeek 42191725 | grep -o -E "PC = 0x[0-9a-f]*" | sort | uniq -c
          2 PC = 0x4ff047b588
          1 PC = 0x4ff047b5f0
          1 PC = 0x4ff047b600
          2 PC = 0x4ff047b604
          1 PC = 0x4ff047b610
          1 PC = 0x4ff047b620
          1 PC = 0x4ff047b64c
          5 PC = 0x4ff047b6ec
          2 PC = 0x4ff047b728
          1 PC = 0x4ff047ce50
          1 PC = 0x4ff047cec4
          1 PC = 0x4ff047cf58
          2 PC = 0x4ff047cf74
          1 PC = 0x4ff04a93a0
          1 PC = 0x4ff04a93c0
          1 PC = 0x4ff04a94f0
          2 PC = 0x4ff04a951c
          1 PC = 0x4ff04a97a0
          1 PC = 0x4ff04ac8c0
          1 PC = 0x4ff04ac8f0
          1 PC = 0x4ff04ac930
          1 PC = 0x4ff04ac940
          1 PC = 0x4ff04ac950
          2 PC = 0x4ff04ac98c
          1 PC = 0x4ff04acab0
          1 PC = 0x4ff04acb18
          2 PC = 0x4ff04acb2c
          1 PC = 0x4ff04acbb0
          1 PC = 0x4ff04acbb8
          1 PC = 0x4ff04acde0
          1 PC = 0x4ff04e02d0
          1 PC = 0x4ff04e0310
          1 PC = 0x4ff04e0340
          1 PC = 0x4ff04e1038
          1 PC = 0x4ff04e1040
          1 PC = 0x4ff0625178
          1 PC = 0x4ff06251a4
          1 PC = 0x4ff06251c0
          2 PC = 0x4ff06251e0
          1 PC = 0x4ff0625790
          2 PC = 0x4ff06257dc
          3 PC = 0x4ff06257f0
          1 PC = 0x4ff0625800
          1 PC = 0x4ff0625970
          1 PC = 0x4ff062597c
          2 PC = 0x4ff06259a0
          1 PC = 0x4ff06287b0
          1 PC = 0x4ff0629b6c
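The result is not very striking yet, but it is easy to see that most PCs cluster closely together while a few stand apart.
To save effort, change the grep regular expression so that it keeps only the leading digits of each PC, which gives a much clearer picture:
    [swsduhpc@psn004 swpf]$ bpeek 42191725 | grep -o -E "PC = 0x[0-9a-f]{6}" | sort | uniq -c
          1 PC = 0x4ff041
         21 PC = 0x4ff047
         21 PC = 0x4ff04a
          3 PC = 0x4ff04e
         18 PC = 0x4ff062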
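The prefix length is the only knob in that pipeline, so it can be wrapped in a small function. This is just a convenience sketch (the function name is made up); it adds a final numeric sort so that the rarest prefixes come first.

```shell
# Count how many CATCHSIG records share each PC prefix, rarest first.
# Reads job output on stdin; $1 is the number of hex digits to keep
# (default 6, as in the command above).
count_pc_prefix() {
    digits=${1:-6}
    grep -o -E "PC = 0x[0-9a-f]{${digits}}" | sort | uniq -c | sort -n
}
```

For example, `bpeek 42191725 | count_pc_prefix 6` would list the same groups as above with the least common one on top; trying a few prefix lengths is a cheap way to find a grouping in which the outlier stands out.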
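Since we prefer to believe that most processes are healthy, the abnormal process should be the one whose PC occurs least often. At this point we can investigate by hand, starting by recovering the full PC:
    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)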
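The count-then-grep step can also be done in one go. The sketch below (function name and argument order are my own choice) prints the full records whose PC prefix is shared by only a few processes; it reads the file twice, so the job output has to be saved first.

```shell
# Print the full CATCHSIG records whose PC prefix is rare.
#   $1: file holding the job output (read twice, so not a pipe)
#   $2: prefix length in hex digits (default 6)
#   $3: report prefixes shared by at most this many records (default 1)
rare_pc_records() {
    file=$1; digits=${2:-6}; max=${3:-1}
    awk -v n="$digits" -v max="$max" '
        match($0, /PC = 0x[0-9a-f]+/) {
            key = substr($0, RSTART + 5, n + 2)   # "0x" plus n hex digits
            if (FNR == NR) count[key]++           # first pass: tally prefixes
            else if (count[key] <= max) print     # second pass: report rare ones
        }
    ' "$file" "$file"
}
```

Something like `bpeek 42191725 > job.out; rare_pc_records job.out 6 1` would then print the record of the suspicious rank directly (job.out is a placeholder name).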
Then use addr2line to locate the source position:
    [swsduhpc@psn004 example]$ addr2line -e ../swdb/a.out 0x4ff0416b70
    [path redacted]/example/broken.c:8
Alternatively, when there is no debug information, use sw5objdump:
    [swsduhpc@psn004 example]$ sw5objdump -d ./a.out | less
less will show output similar to the following:
    a.out:     file format elf64-sw_64
    Disassembly of section .text1:
    0000004ff0410120 <slave___phy_put_for_host>:
      4ff0410120:   e7 0f bb ff   ldih gp,4071(t12)
      4ff0410124:   40 72 bd fb   ldi gp,29248(gp)
      4ff0410128:   78 92 7d 8f   ldl t12,-28040(gp)
      4ff041012c:   00 00 1f fe   ldih a0,0(zero)
      4ff0410130:   00 00 1f fc   ldih v0,0(zero)
      4ff0410134:   00 00 1f fd   ldih t7,0(zero)
      4ff0410138:   00 00 df fc   ldih t5,0(zero)
      4ff041013c:   80 ff de fb   ldi sp,-128(sp)
    0000004ff0410140 <.LCFI___phy_put_for_host_adjustsp>:
      4ff0410140:   d9 15 e8 4b   log2xe zero,0x40,t11
      4ff0410144:   00 01 10 fa   ldi a0,256(a0)
      4ff0410148:   24 00 3e ab   stw t11,36(sp)
      4ff041014c:   20 00 fe ab   stw zero,32(sp)
      4ff0410150:   d0 00 00 f8   ldi v0,208(v0)
      4ff0410154:   08 01 08 f9   ldi t7,264(t7)
      4ff0410158:   e0 00 c6 f8   ldi t5,224(t5)
      4ff041015c:   01 fe 1f 1b   rcsr t10,0x1
      4ff0410160:   02 fe 5f 1a   rcsr a2,0x2
      4ff0410164:   08 00 3b 8e   ldl a1,8(t12)
      4ff0410168:   93 01 1f 43   s8addl t10,zero,a3
    0000004ff041016c <.BB2___phy_put_for_host>:
      4ff041016c:   c8 9c 9d 8e   ldl a4,-25400(gp)
      4ff0410170:   19 00 f3 43   sextw a3,t11
      4ff0410174:   19 00 59 42   addw a2,t11,t11
      4ff0410178:   10 00 3b 8c   ldl t0,16(t12)
      4ff041017c:   18 00 9b 8b   ldw at,24(t12)
In less, press / and then paste the PC (without the 0x prefix) and press Enter to search:
    /4ff0416b70
less will jump to something like:
    0000004ff0416b54 <.BB3_main>:
      4ff0416b54:   10 00 1e 88   ldw v0,16(sp)
      4ff0416b58:   e7 0f ba ff   ldih gp,4071(ra)
      4ff0416b5c:   0c 08 bd fb   ldi gp,2060(gp)
      4ff0416b60:   00 25 00 48   cmpeq v0,0x1,v0
      4ff0416b64:   03 00 00 c0   beq v0,4ff0416b74 <.L_3_1794>
    0000004ff0416b68 <.L_3_1538>:
      4ff0416b68:   10 00 1e 88   ldw v0,16(sp)
      4ff0416b6c:   00 25 00 48   cmpeq v0,0x1,v0
      4ff0416b70:   fd ff 1f c4   bne v0,4ff0416b68 <.L_3_1538>   <--- less highlights the match here
    0000004ff0416b74 <.L_3_1794>:
      4ff0416b74:   00 00 e0 13   br 4ff0416b78 <.Lt_3_770>
Scrolling up shows that this is inside the main function.
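The search-and-scroll step can be approximated without less. The sketch below (function name made up) walks the disassembly and reports the last function-style label seen before the given PC. It assumes, based only on the listings above, that compiler-generated local labels start with a dot (such as .BB3_main or .L_3_1538) while function labels do not.

```shell
# Print the nearest preceding function label for a PC in "objdump -d" style
# output read from stdin. Labels that start with "." are treated as local
# labels and skipped (an assumption drawn from the sample listing).
enclosing_symbol() {
    pc=${1#0x}                                    # accept the PC with or without 0x
    awk -v pc="$pc" '
        /^[0-9a-f]+ <[^.][^>]*>:/ { sym = $2 }    # remember the last function label
        $1 == (pc ":") { gsub(/[<>:]/, "", sym); print sym; exit }
    '
}
```

With it, `sw5objdump -d ./a.out | enclosing_symbol 0x4ff0416b70` should print the name of the function containing that instruction, provided the assumption about label naming holds for your binary.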
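So we can reasonably guess that this is the faulty PC, and look up the corresponding process:
    [swsduhpc@psn004 example]$ bpeek 42191725 | grep 0x4ff0416
    CATCHSIG: Myid = 1(CPU 58,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff0416b70)
The tentative conclusion: rank 1 on CPU 58 is stuck, and the other processes are stuck waiting for it.
Keep the node that failed last time busy with a sleep job, then resubmit and run again:
    [swsduhpc@psn004 example]$ bsub -node 58 sleep 1h
Then work out whether the node is at fault (the rerun may succeed, or a different process may report an error) or the program is (in which case the same rank will usually fail again).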
-
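RE: A semi-automatic performance sampling and debugging library I wrote for the Sunway 26010
@xiaoq
I am working on a new version that will incorporate the performance counters and use Python for post-processing and visualization.
Also, is there an elegant way to wrap functions on the slave cores?
In addition, I would like to build a customized compiler that instruments DMA.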
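-
RE: Job submission error: job submit failed, ret = -19, reason: No enough compute nodes
Also note the following possibilities:
- The queue is short of resources and the job was submitted interactively (-I). Use qload -w to check whether the queue has enough compute resources.
- On the domestic queues, -np 4 was specified but the queue does not have enough complete chips. If you do not use the cross segment, try removing the -np 4 option.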
-
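A semi-automatic performance sampling and debugging library I wrote for the Sunway 26010
Runtime data is read using the information in the SW3 IO register manual.
Debug information is obtained with GNU libbfd and GNU libiberty.
Download:
Download from this link.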
Build:
make tests
generates four test programs: test_pf_serial, test_db_serial, test_pf_mpi and test_db_mpi. They can be tested on the compute nodes.
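Running:
The library is controlled through the following environment variables:
- ENABLED_PROC: a comma-separated list of process numbers, or "ALL" to enable the library in every process. By default it is enabled in process 0 only.
- ENABLE_PROF: sampling is enabled when this variable is defined and is not "FALSE".
- OUT_PATTERN: file name pattern for the sampling output, given as a printf format string. Under MPI it needs one %d so that different processes write to different files.
- VERBOSE: verbose mode is enabled when this variable is defined and is not "FALSE".
cgsp must be at least 1, because some runtime information has to be collected on the slave cores.
The extra memory use is close to twice the size of the binary; it is needed for parsing debug information with libbfd.
For example:
    VERBOSE=1 ENABLED_PROC=ALL ENABLE_PROF=1 bsub -I -cgsp 1 -b -n 4 -share_size 128 ./test_pf_mpi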
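Before submitting, it can help to preview which file names a given OUT_PATTERN would produce. The sketch below assumes the library fills the %d with the MPI rank, which the description above suggests but does not state outright; the function name and the pattern pc_hits_%d.txt are placeholders.

```shell
# Preview the file names an OUT_PATTERN would expand to for ranks 0..N-1,
# assuming %d is replaced by the MPI rank (an assumption).
#   $1: the printf-style pattern   $2: number of processes
preview_out_pattern() {
    pattern=$1; nprocs=$2
    rank=0
    while [ "$rank" -lt "$nprocs" ]; do
        printf "$pattern\n" "$rank"   # the pattern is deliberately used as the format
        rank=$((rank + 1))
    done
}
```

Calling `preview_out_pattern 'pc_hits_%d.txt' 4` lists four distinct names; a pattern without %d would print the same name four times, which is exactly the collision the %d requirement is there to avoid.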
Features
The debugging tool can be used to locate slave-core DMA and SDLB errors as well as the site of an infinite loop:
    bsignal -s 30 <JobId>   # makes the job print its PCs
    bsignal -s 31 <JobId>   # makes the job try to print its PCs together with line numbers
The sampling tool counts how often the slave-core PC is seen at each address and maps the addresses to line numbers. The output goes to the file given by OUT_PATTERN, or to pc_hits*.txt.
Because of limits of the IO interface, the PCs of four consecutive instructions currently appear to be merged into one, for example:
    0x4ff0410940:  7803022 test_stub_cpe.c: 5: func1
    0x4ff0410944:        0 test_stub_cpe.c: 5: func1
    0x4ff0410948:        0 test_stub_cpe.c: 5: func1
    0x4ff041094c:        0 test_stub_cpe.c: 5: func1
    0x4ff0410950:  3901847 test_stub_cpe.c: 5: func1
    0x4ff0410954:        0 test_stub_cpe.c: 5: func1
    0x4ff0410958:        0 test_stub_cpe.c: 5: func1
    0x4ff041095c:        0 test_stub_cpe.c: 5: func1
    0x4ff0410960: 11707315 test_stub_cpe.c: 5: func1
    0x4ff0410964:        0 test_stub_cpe.c: 5: func1
    0x4ff0410968:        0 test_stub_cpe.c: 5: func1
    0x4ff041096c:        0 test_stub_cpe.c: 5: func1
    0x4ff0410970: 11706904 test_stub_cpe.c: 5: func1
    0x4ff0410974:        0 test_stub_cpe.c: 5: func1
    0x4ff0410978:        0 test_stub_cpe.c: 5: func1
    0x4ff041097c:        0 test_stub_cpe.c: 5: func1
    0x4ff0410980:  3901244 test_stub_cpe.c: 5: func1
    0x4ff0410984:        0 test_stub_cpe.c: 5: func1
    0x4ff0410988:        0 test_stub_cpe.c: 5: func1
    0x4ff041098c:        0 test_stub_cpe.c: 5: func1
    0x4ff0410990:  3904345 test_stub_cpe.c: 7: func2
    0x4ff0410994:        0 test_stub_cpe.c: 7: func2
    0x4ff0410998:        0 test_stub_cpe.c: 8: func2
    0x4ff041099c:        0 test_stub_cpe.c: 7: func2
    0x4ff04109a0:        0 test_stub_cpe.c: 9: func2
    0x4ff04109a4:        0 test_stub_cpe.c: 8: func2
    0x4ff04109a8:        0 test_stub_cpe.c: 8: func2
    0x4ff04109ac:        0 test_stub_cpe.c: 9: func2
    0x4ff04109b0:        0 test_stub_cpe.c: 9: func2
    0x4ff04109b4:        0 test_stub_cpe.c: 9: func2
    0x4ff04109b8:        0 test_stub_cpe.c: 9: func2
    0x4ff04109bc:        0 test_stub_cpe.c: 9: func2
    0x4ff04109c0:  7803191 test_stub_cpe.c: 9: func2
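Since hits are lumped into groups of four instructions, per-line attribution is fuzzy, and summing per function is often the more trustworthy view. The following is a post-processing sketch of my own, not part of the library; it assumes the column layout shown above (address, hit count, file, line, function).

```shell
# Sum sampled hits per function from pc_hits-style output on stdin,
# most frequently hit function first.
# Assumed columns: "ADDR: HITS FILE: LINE: FUNC" (as in the sample above).
hits_by_function() {
    awk '
        NF >= 5 { hits[$5] += $2 }                          # accumulate per function name
        END { for (f in hits) printf "%d %s\n", hits[f], f }
    ' | sort -rn
}
```

It would be used as `hits_by_function < pc_hits_0.txt`, where the file name is a placeholder for whatever OUT_PATTERN produced.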
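Linking:
See mpi.flag and serial.flag for how to link the library into other programs.
Limitations:
athread_init, athread_join and athread_spawn are wrapped, which may affect the performance of these functions.
The MPI version wraps MPI_Init and MPI_Finalize. It can be used with C/C++, but the program must call MPI_Finalize.
The serial version wraps main, so main must be declared as int main(int argc, char **argv). C++ has not been tested.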