Errors when using around 200 nodes
-
Code used to test MPI initialization:
/// \file mpi_test.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!rank) {
        printf("[Rank %d] Finalizing...\n", rank);
    }
    MPI_Finalize();
    return 0;
}
Compile and run:
mpicc -O3 mpi_test.c -o mpi_test
bsub -q q_sw_share -N 200 -np 4 ./mpi_test
I tested a selection of node counts between 128 and 201.
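With 128 nodes (512 processes) the program runs normally; with 196 nodes (784 processes) it only occasionally succeeds.
Problem 1: at around 200 nodes the run frequently fails with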
Other MPI error
I ran the 200-node case several times and got the same error every time. Does the error below indicate a problem with the nodes?
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000454ecc, status=0x5000454ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
[vn025377:mpi_rank_118][MPIDI_CH3_Abort] Fatal error in PMPI_Wait:
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000728ecc, status=0x5000728ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
Problem 2: with 201 nodes the run produced
CATCHSIG...Segmentation Fault...
An excerpt of the output:
CATCHSIG: Myid = 6(CPU 25346,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 643(CPU 25528,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 640(CPU 25528,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_640][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
CATCHSIG: Myid = 641(CPU 25528,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_641][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
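The output file is filled with these messages. I counted them: the output contains an error message from every process on every node, and the messages are all identical.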
In total there are 804 PC values, and every one is exactly the same address, 0x4ff067c844. Looking that address up in the disassembly of the executable shows that it lies inside memcpy (line 826069, marked with * below). What is causing this?
826023 0000004ff067c790 <memcpy>:
826024 4ff067c790: 40 07 f0 43 or zero,a0,v0
826025 4ff067c794: 8a 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826026 4ff067c798: 81 07 11 42 xor a0,a1,t0
826027 4ff067c79c: 01 e7 20 48 and t0,0x7,t0
826028 4ff067c7a0: 5f 00 20 c4 bne t0,4ff067c920 <memcpy+0x190>
826029 4ff067c7a4: 01 e7 00 4a and a0,0x7,t0
826030 4ff067c7a8: 09 00 20 c0 beq t0,4ff067c7d0 <memcpy+0x40>
826031 4ff067c7ac: 5f 07 ff 43 or zero,zero,zero
826032 4ff067c7b0: 00 00 31 80 ldbu t0,0(a1)
826033 4ff067c7b4: 32 21 40 4a subl a2,0x1,a2
826034 4ff067c7b8: 11 21 20 4a addl a1,0x1,a1
826035 4ff067c7bc: 00 00 30 a0 stb t0,0(a0)
826036 4ff067c7c0: 10 21 00 4a addl a0,0x1,a0
826037 4ff067c7c4: 01 e7 00 4a and a0,0x7,t0
826038 4ff067c7c8: 7d 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826039 4ff067c7cc: f8 ff 3f c4 bne t0,4ff067c7b0 <memcpy+0x20>
826040 4ff067c7d0: 41 e5 4f 4a cmple a2,0x7f,t0
826041 4ff067c7d4: 36 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826042 4ff067c7d8: 01 e7 07 4a and a0,0x3f,t0
826043 4ff067c7dc: 08 00 20 c0 beq t0,4ff067c800 <memcpy+0x70>
826044 4ff067c7e0: 00 00 31 8c ldl t0,0(a1)
826045 4ff067c7e4: 32 01 41 4a subl a2,0x8,a2
826046 4ff067c7e8: 11 01 21 4a addl a1,0x8,a1
826047 4ff067c7ec: 5f 07 ff 43 or zero,zero,zero
826048 4ff067c7f0: 00 00 30 ac stl t0,0(a0)
826049 4ff067c7f4: 10 01 01 4a addl a0,0x8,a0
826050 4ff067c7f8: 01 e7 07 4a and a0,0x3f,t0
826051 4ff067c7fc: f8 ff 3f c4 bne t0,4ff067c7e0 <memcpy+0x50>
826052 4ff067c800: 07 01 08 4a addl a0,0x40,t6
826053 4ff067c804: 41 e5 4f 4a cmple a2,0x7f,t0
826054 4ff067c808: 29 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826055 4ff067c80c: 5f 07 ff 43 or zero,zero,zero
826056 4ff067c810: 00 01 e7 9b fetchd_w 256(t6)
826057 4ff067c814: 00 00 d1 8c ldl t5,0(a1)
826058 4ff067c818: 5f 07 ff 43 or zero,zero,zero
826059 4ff067c81c: 5f 07 ff 43 or zero,zero,zero
826060 4ff067c820: 08 00 91 8c ldl t3,8(a1)
826061 4ff067c824: 10 00 b1 8c ldl t4,16(a1)
826062 4ff067c828: 07 01 e8 48 addl t6,0x40,t6
826063 4ff067c82c: 5f 07 ff 43 or zero,zero,zero
826064 4ff067c830: 18 00 71 8c ldl t2,24(a1)
826065 4ff067c834: 01 01 08 4a addl a0,0x40,t0
826066 4ff067c838: 5f 07 ff 43 or zero,zero,zero
826067 4ff067c83c: 5f 07 ff 43 or zero,zero,zero
826068 4ff067c840: 11 01 24 4a addl a1,0x20,a1
* 826069 4ff067c844: 00 00 d0 ac stl t5,0(a0)
826070 4ff067c848: 5f 07 ff 43 or zero,zero,zero
826071 4ff067c84c: 5f 07 ff 43 or zero,zero,zero
826072 4ff067c850: 08 00 90 ac stl t3,8(a0)
...
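The counting described above (how many PC values appear, and whether they are all the same) could be reproduced with a small helper along these lines. This is only a sketch: the names `pc_summary`, `summarize_pcs` and `expected_ranks` are made up for illustration, the log is assumed to be already loaded into a NUL-terminated buffer, and the only format assumption is the `PC = 0x<hex>` field seen in the CATCHSIG lines.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Summary of the "PC = 0x..." fields found in a CATCHSIG log (hypothetical type). */
struct pc_summary {
    int count;                    /* number of PC fields found */
    int all_same;                 /* 1 if every PC equals first_pc (or none found) */
    unsigned long long first_pc;  /* first PC value seen, 0 if none */
};

/* Scan a NUL-terminated log buffer for every "PC = 0x<hex>" field. */
struct pc_summary summarize_pcs(const char *log)
{
    static const char key[] = "PC = 0x";
    struct pc_summary s = {0, 1, 0};
    assert(log != NULL);

    for (const char *p = strstr(log, key); p != NULL; p = strstr(p, key)) {
        p += sizeof key - 1;  /* skip past "PC = 0x" */
        unsigned long long pc = strtoull(p, NULL, 16);  /* stops at ')' */
        if (s.count == 0)
            s.first_pc = pc;
        else if (pc != s.first_pc)
            s.all_same = 0;
        s.count++;
    }
    return s;
}

/* Expected rank count, assuming a fixed number of processes per node. */
int expected_ranks(int nodes, int procs_per_node)
{
    return nodes * procs_per_node;
}
```

Taking 4 processes per node, as the 128-node/512-process and 196-node/784-process figures suggest, `expected_ranks(201, 4)` is 804, the same as the number of PC values counted. That is consistent with every rank faulting at the same instruction, rather than only the ranks on a few nodes.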
-
After I added the -share_size option when submitting the job, the program ran normally. With 256 nodes and 1024 processes, I need at least -share_size 76 for it to run.