Errors when running on around 200 nodes



  • Test code for MPI initialization:

    /// \file mpi_test.c
    #include <mpi.h>
    #include <stdio.h>
    
    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);
    
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
        if (!rank) {
            printf("[Rank %d] Finalizing...\n", rank);
        }
    
        MPI_Finalize();
    
        return 0;
    }
    

    Compile and run:

    mpicc -O3 mpi_test.c -o mpi_test
    bsub -q q_sw_share -N 200 -np 4 ./mpi_test
    

    I tested a sample of node counts between 128 and 201.
    At 128 nodes (512 processes) everything works; at 196 nodes (784 processes) it only sometimes succeeds.

    Problem 1: at around 200 nodes, frequent "Other MPI error" failures

    I ran the 200-node case several times and hit exactly the same error every time. Is the error below a problem with the nodes?

    Other MPI error, error stack:
    PMPI_Wait(182).....................: MPI_Wait(request=0x5000454ecc, status=0x5000454ed0) failed
    MPIR_Wait_impl(71).................:
    _MPIDI_CH3I_Progress(292)..........:
    handle_read(1134)..................:
    handle_read_individual(1325).......:
    MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
    : No such file or directory (2)
    [vn025377:mpi_rank_118][MPIDI_CH3_Abort] Fatal error in PMPI_Wait:
    Other MPI error, error stack:
    PMPI_Wait(182).....................: MPI_Wait(request=0x5000728ecc, status=0x5000728ed0) failed
    MPIR_Wait_impl(71).................:
    _MPIDI_CH3I_Progress(292)..........:
    handle_read(1134)..................:
    handle_read_individual(1325).......:
    MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
    : No such file or directory (2)
    

    Problem 2: at 201 nodes the job dies with CATCHSIG...Segmentation Fault...

    An excerpt of the output:

     CATCHSIG: Myid = 6(CPU 25346,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
     CATCHSIG: Myid = 643(CPU 25528,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
     CATCHSIG: Myid = 640(CPU 25528,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
    [vn025528:mpi_rank_640][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
     CATCHSIG: Myid = 641(CPU 25528,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
    [vn025528:mpi_rank_641][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
    

    The output file is full of these lines; counting them, there is one identical error record for every process on every node.
    In total there are 804 PCs, all at exactly the same address, 0x4ff067c844. Looking that address up in the executable's disassembly shows it lies inside memcpy (line 826069). What does this indicate?

      826023 0000004ff067c790 <memcpy>:
      826024   4ff067c790:   40 07 f0 43     or  zero,a0,v0
      826025   4ff067c794:   8a 00 40 ce     ble a2,4ff067c9c0 <memcpy+0x230>
      826026   4ff067c798:   81 07 11 42     xor a0,a1,t0
      826027   4ff067c79c:   01 e7 20 48     and t0,0x7,t0
      826028   4ff067c7a0:   5f 00 20 c4     bne t0,4ff067c920 <memcpy+0x190>
      826029   4ff067c7a4:   01 e7 00 4a     and a0,0x7,t0
      826030   4ff067c7a8:   09 00 20 c0     beq t0,4ff067c7d0 <memcpy+0x40>
      826031   4ff067c7ac:   5f 07 ff 43     or  zero,zero,zero
      826032   4ff067c7b0:   00 00 31 80     ldbu    t0,0(a1)
      826033   4ff067c7b4:   32 21 40 4a     subl    a2,0x1,a2
      826034   4ff067c7b8:   11 21 20 4a     addl    a1,0x1,a1
      826035   4ff067c7bc:   00 00 30 a0     stb t0,0(a0)
      826036   4ff067c7c0:   10 21 00 4a     addl    a0,0x1,a0
      826037   4ff067c7c4:   01 e7 00 4a     and a0,0x7,t0
      826038   4ff067c7c8:   7d 00 40 ce     ble a2,4ff067c9c0 <memcpy+0x230>
      826039   4ff067c7cc:   f8 ff 3f c4     bne t0,4ff067c7b0 <memcpy+0x20>
      826040   4ff067c7d0:   41 e5 4f 4a     cmple   a2,0x7f,t0
      826041   4ff067c7d4:   36 00 20 c4     bne t0,4ff067c8b0 <memcpy+0x120>
      826042   4ff067c7d8:   01 e7 07 4a     and a0,0x3f,t0
      826043   4ff067c7dc:   08 00 20 c0     beq t0,4ff067c800 <memcpy+0x70>
      826044   4ff067c7e0:   00 00 31 8c     ldl t0,0(a1)
      826045   4ff067c7e4:   32 01 41 4a     subl    a2,0x8,a2
      826046   4ff067c7e8:   11 01 21 4a     addl    a1,0x8,a1
      826047   4ff067c7ec:   5f 07 ff 43     or  zero,zero,zero
      826048   4ff067c7f0:   00 00 30 ac     stl t0,0(a0)
      826049   4ff067c7f4:   10 01 01 4a     addl    a0,0x8,a0
      826050   4ff067c7f8:   01 e7 07 4a     and a0,0x3f,t0
      826051   4ff067c7fc:   f8 ff 3f c4     bne t0,4ff067c7e0 <memcpy+0x50>
      826052   4ff067c800:   07 01 08 4a     addl    a0,0x40,t6
      826053   4ff067c804:   41 e5 4f 4a     cmple   a2,0x7f,t0
      826054   4ff067c808:   29 00 20 c4     bne t0,4ff067c8b0 <memcpy+0x120>
      826055   4ff067c80c:   5f 07 ff 43     or  zero,zero,zero
      826056   4ff067c810:   00 01 e7 9b     fetchd_w    256(t6)
      826057   4ff067c814:   00 00 d1 8c     ldl t5,0(a1)
      826058   4ff067c818:   5f 07 ff 43     or  zero,zero,zero
      826059   4ff067c81c:   5f 07 ff 43     or  zero,zero,zero
      826060   4ff067c820:   08 00 91 8c     ldl t3,8(a1)
      826061   4ff067c824:   10 00 b1 8c     ldl t4,16(a1)
      826062   4ff067c828:   07 01 e8 48     addl    t6,0x40,t6
      826063   4ff067c82c:   5f 07 ff 43     or  zero,zero,zero
      826064   4ff067c830:   18 00 71 8c     ldl t2,24(a1)
      826065   4ff067c834:   01 01 08 4a     addl    a0,0x40,t0
      826066   4ff067c838:   5f 07 ff 43     or  zero,zero,zero
      826067   4ff067c83c:   5f 07 ff 43     or  zero,zero,zero
      826068   4ff067c840:   11 01 24 4a     addl    a1,0x20,a1
    * 826069   4ff067c844:   00 00 d0 ac     stl t5,0(a0)
      826070   4ff067c848:   5f 07 ff 43     or  zero,zero,zero
      826071   4ff067c84c:   5f 07 ff 43     or  zero,zero,zero
      826072   4ff067c850:   08 00 90 ac     stl t3,8(a0)
    ...
    


  • After adding the -share_size parameter to the job submission, the program runs normally. With 256 nodes and 1024 processes, I need at least -share_size 76 for it to work.

