Solved: add -OPT:IEEE_arith=1 or -OPT:IEEE_arith=2 at compile time.
warning: node [26528]: user's mpe task: tid= 0, pid= 22346, terminated by sig 8
When compiling with sw5cc, floating-point arithmetic on the MPE (master core) does not appear to follow the IEEE 754 standard by default: floats whose exponent field is 0 (subnormals) are flushed to zero, and any attempt to use a subnormal in MPE code raises SIGFPE.
With the compiler option -OPT:IEEE_arith enabled, subnormals work normally on the MPE. The problem recorded below can be resolved with either of the following compile options:
-OPT:IEEE_arith=1
-OPT:IEEE_arith=2
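For context, a minimal MPE-side sketch of the behaviour described above (the literal 5.877e-39f is just an arbitrary value below FLT_MIN; the fault only appears when building without -OPT:IEEE_arith):

/* Sketch: a subnormal float can be printed on the MPE, but arithmetic on it
   raises SIGFPE unless the code is compiled with -OPT:IEEE_arith=1 or =2. */
#include <stdio.h>

int main(void) {
    float x = 5.877e-39f;        /* below FLT_MIN (~1.175e-38), i.e. subnormal */
    printf("printing works: %e\n", x);
    float y = x + x;             /* arithmetic on the subnormal faults without the flag */
    printf("arithmetic works: %e\n", y);
    return 0;
}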
Starting from the C division example in the manual, the same expression is used on both the MPE and the CPEs to repeatedly divide a float by 1000 (the array indices are j and i, where j is also the thread id; element (j, i) is divided by 1000 a total of (j+1)*i times). The data is then transferred from the CPEs back to the MPE, and all elements are summed as a checksum for comparison. A sketch of the host-side reference loop is shown below.
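/* Sketch of the host-side half of the test described above (not the original
   manual code; the starting value and dimensions are placeholders).  Element
   (j, i) is divided by 1000 a total of (j+1)*i times; the CPEs compute array c
   the same way and the two results are compared. */
#define NTHREADS 2
#define NELEM    16

float cc[NTHREADS][NELEM];

void host_reference(float start) {
    int i, j, n;
    for (j = 0; j < NTHREADS; j++) {          /* j is also the CPE thread id */
        for (i = 0; i < NELEM; i++) {
            float v = start;
            for (n = 0; n < (j + 1) * i; n++)
                v /= 1000.0f;
            cc[j][i] = v;
        }
    }
}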
Both sw5cc and sw5gcc were tried. The MPE and CPE sides were compiled with the same options; -O1, -O2 and -O3 all fail in the same way.
With float data and only 2 CPEs, the following happens: a number around 2^(-127) ≈ 5.877e-39 can be printed on the MPE, but using it in arithmetic raises SIGFPE. The detailed output is below. Array c is computed on the CPEs, array cc on the MPE, and the line after each pair of values is the hexadecimal representation of the element of c.
Results from CPE 0:
c[0][0] = 3.6666666623209573e-18, cc[0][0] = 3.6666666623209573e-18
22 87 46 b0
c[0][1] = 3.6666666849391771e-21, cc[0][1] = 3.6666666849391771e-21
1d 8a 85 d3
c[0][2] = 3.6666665082343344e-24, cc[0][2] = 3.6666665082343344e-24
18 8d d8 e8
c[0][3] = 3.6666663171820839e-27, cc[0][3] = 3.6666663171820839e-27
13 91 40 6a
c[0][4] = 3.6666663021357562e-30, cc[0][4] = 3.6666663021357562e-30
0e 94 bc d7
c[0][5] = 3.6666662639321898e-33, cc[0][5] = 3.6666662639321898e-33
09 98 4e af
c[0][6] = 3.6666663500279674e-36, cc[0][6] = 3.6666663500279674e-36
04 9b f6 76
c[0][7] = 6.9405734356891773e-309, cc[0][7] = 0.0000000000000000e+00
00 27 ed 2d
c[0][8] = 6.9415787311951471e-312, cc[0][8] = 0.0000000000000000e+00
00 00 0a 39
c[0][9] = 7.9574842161197712e-315, cc[0][9] = 0.0000000000000000e+00
00 00 00 03
c[0][10] = 0.0000000000000000e+00, cc[0][10] = 0.0000000000000000e+00
00 00 00 00
Results from CPE 1:
c[1][0] = 3.6666666623209573e-18, cc[1][0] = 3.6666666623209573e-18
22 87 46 b0
c[1][1] = 3.6666665082343344e-24, cc[1][1] = 3.6666665082343344e-24
18 8d d8 e8
c[1][2] = 3.6666663021357562e-30, cc[1][2] = 3.6666663021357562e-30
0e 94 bc d7
c[1][3] = 3.6666663500279674e-36, cc[1][3] = 3.6666663500279674e-36
04 9b f6 76
c[1][4] = 6.9415787311951471e-312, cc[1][4] = 0.0000000000000000e+00
00 00 0a 39
c[1][5] = 0.0000000000000000e+00, cc[1][5] = 0.0000000000000000e+00
00 00 00 00
With their different loop counts, the two CPEs print different numbers of subnormals, and the printed values are far smaller than any subnormal the float type can represent.
One of these numbers, printed with printf("%.16e\n"), is
- decimal: 7.9574842161197712e-315
- hexadecimal: 00 00 00 03
- binary: 0000 0000 0000 0000 0000 0000 0000 0011
The 4-byte single-precision float that the hex pattern 00 00 00 03 actually encodes is 4.2039e-45. Treated as an 8-byte double-precision value, 7.9574842161197712e-315 corresponds to the hex pattern 00 00 00 00 60 00 00 00.
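The two interpretations can be checked on any little-endian machine with a small sketch like this (values taken from the output above):

/* Sketch: reinterpret the 4-byte pattern 00 00 00 03 as a float, and dump the
   byte pattern of 7.9574842161197712e-315 stored as a double. */
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int bits = 0x00000003u;   /* the "00 00 00 03" pattern */
    float f;
    double d = 7.9574842161197712e-315;
    unsigned char b[8];
    int i;

    memcpy(&f, &bits, sizeof f);
    printf("as float: %.4e\n", f);     /* 4.2039e-45 */

    memcpy(b, &d, sizeof d);
    for (i = 7; i >= 0; i--)           /* print most significant byte first */
        printf("%02x ", b[i]);
    printf("\n");                      /* 00 00 00 00 60 00 00 00 */
    return 0;
}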
It seems that subnormals simply cannot be used on the MPE. Going downward, the last float value whose exponent field is non-zero is 0x00800000 (FLT_MIN, the smallest normal number); using it in a computation does not raise an error. The first value whose exponent field is zero is 0x007fffff (the largest subnormal); using it in a computation raises the error.
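That boundary can be pinned down with a bit-pattern test along these lines (whether the second multiplication faults depends on the -OPT:IEEE_arith setting described at the top):

/* Sketch: 0x00800000 is the smallest normal float (FLT_MIN), 0x007fffff the
   largest subnormal.  On the MPE the former works in arithmetic, the latter
   raises SIGFPE unless -OPT:IEEE_arith is enabled. */
#include <stdio.h>

union fbits { unsigned int u; float f; };

int main(void) {
    union fbits smallest_normal   = { 0x00800000u };  /* exponent field = 1 */
    union fbits largest_subnormal = { 0x007fffffu };  /* exponent field = 0 */
    printf("normal * 2    : %e\n", smallest_normal.f * 2.0f);
    printf("subnormal * 2 : %e\n", largest_subnormal.f * 2.0f);
    return 0;
}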
What is going on here? Do the MPE and the CPEs handle floating-point subnormals differently?
Is there a way to avoid this? Adding a check on the CPEs that flushes all of these numbers to 0 lets the program run correctly (a sketch of such a check is below), but it wastes some time.
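The flush can be done without any library calls by testing the exponent field directly; the helper name here is mine:

/* Sketch of the CPE-side workaround: flush any float whose exponent field is
   zero (i.e. zero or subnormal) to 0.0f before it is sent back to the MPE. */
static inline float flush_subnormal(float x) {
    union { float f; unsigned int u; } v;
    v.f = x;
    if ((v.u & 0x7f800000u) == 0)   /* exponent bits all zero */
        v.f = 0.0f;
    return v.f;
}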
Adding the -share_size parameter when submitting the job made it run normally. With 256 nodes and 1024 processes, at least -share_size 76 is needed for it to run.
Code for testing MPI initialization:
/// \file mpi_test.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!rank) {
        printf("[Rank %d] Finalizing...\n", rank);
    }
    MPI_Finalize();
    return 0;
}
Compile and run:
mpicc -O3 mpi_test.c -o mpi_test
bsub -q q_sw_share -N 200 -np 4 ./mpi_test
I picked several node counts between 128 and 201 to test.
128 nodes (512 processes) works fine; 196 nodes (784 processes) only works some of the time.
Problem 1: at around 200 nodes the job frequently fails with "Other MPI error".
I ran the 200-node case several times and always got exactly the same error. Is the error below caused by the nodes themselves?
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000454ecc, status=0x5000454ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
[vn025377:mpi_rank_118][MPIDI_CH3_Abort] Fatal error in PMPI_Wait:
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000728ecc, status=0x5000728ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
Problem 2: at 201 nodes I get CATCHSIG...Segmentation Fault...
An excerpt of the printed output:
CATCHSIG: Myid = 6(CPU 25346,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 643(CPU 25528,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 640(CPU 25528,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_640][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
CATCHSIG: Myid = 641(CPU 25528,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_641][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
The output file is full of these lines; counting them shows one identical error message per process on every node.
In total there are 804 PC values, all with exactly the same address, 0x4ff067c844. Looking that address up in the executable shows it lies inside memcpy (listing line 826069 below, marked with *). What is going on here?
826023 0000004ff067c790 <memcpy>:
826024 4ff067c790: 40 07 f0 43 or zero,a0,v0
826025 4ff067c794: 8a 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826026 4ff067c798: 81 07 11 42 xor a0,a1,t0
826027 4ff067c79c: 01 e7 20 48 and t0,0x7,t0
826028 4ff067c7a0: 5f 00 20 c4 bne t0,4ff067c920 <memcpy+0x190>
826029 4ff067c7a4: 01 e7 00 4a and a0,0x7,t0
826030 4ff067c7a8: 09 00 20 c0 beq t0,4ff067c7d0 <memcpy+0x40>
826031 4ff067c7ac: 5f 07 ff 43 or zero,zero,zero
826032 4ff067c7b0: 00 00 31 80 ldbu t0,0(a1)
826033 4ff067c7b4: 32 21 40 4a subl a2,0x1,a2
826034 4ff067c7b8: 11 21 20 4a addl a1,0x1,a1
826035 4ff067c7bc: 00 00 30 a0 stb t0,0(a0)
826036 4ff067c7c0: 10 21 00 4a addl a0,0x1,a0
826037 4ff067c7c4: 01 e7 00 4a and a0,0x7,t0
826038 4ff067c7c8: 7d 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826039 4ff067c7cc: f8 ff 3f c4 bne t0,4ff067c7b0 <memcpy+0x20>
826040 4ff067c7d0: 41 e5 4f 4a cmple a2,0x7f,t0
826041 4ff067c7d4: 36 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826042 4ff067c7d8: 01 e7 07 4a and a0,0x3f,t0
826043 4ff067c7dc: 08 00 20 c0 beq t0,4ff067c800 <memcpy+0x70>
826044 4ff067c7e0: 00 00 31 8c ldl t0,0(a1)
826045 4ff067c7e4: 32 01 41 4a subl a2,0x8,a2
826046 4ff067c7e8: 11 01 21 4a addl a1,0x8,a1
826047 4ff067c7ec: 5f 07 ff 43 or zero,zero,zero
826048 4ff067c7f0: 00 00 30 ac stl t0,0(a0)
826049 4ff067c7f4: 10 01 01 4a addl a0,0x8,a0
826050 4ff067c7f8: 01 e7 07 4a and a0,0x3f,t0
826051 4ff067c7fc: f8 ff 3f c4 bne t0,4ff067c7e0 <memcpy+0x50>
826052 4ff067c800: 07 01 08 4a addl a0,0x40,t6
826053 4ff067c804: 41 e5 4f 4a cmple a2,0x7f,t0
826054 4ff067c808: 29 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826055 4ff067c80c: 5f 07 ff 43 or zero,zero,zero
826056 4ff067c810: 00 01 e7 9b fetchd_w 256(t6)
826057 4ff067c814: 00 00 d1 8c ldl t5,0(a1)
826058 4ff067c818: 5f 07 ff 43 or zero,zero,zero
826059 4ff067c81c: 5f 07 ff 43 or zero,zero,zero
826060 4ff067c820: 08 00 91 8c ldl t3,8(a1)
826061 4ff067c824: 10 00 b1 8c ldl t4,16(a1)
826062 4ff067c828: 07 01 e8 48 addl t6,0x40,t6
826063 4ff067c82c: 5f 07 ff 43 or zero,zero,zero
826064 4ff067c830: 18 00 71 8c ldl t2,24(a1)
826065 4ff067c834: 01 01 08 4a addl a0,0x40,t0
826066 4ff067c838: 5f 07 ff 43 or zero,zero,zero
826067 4ff067c83c: 5f 07 ff 43 or zero,zero,zero
826068 4ff067c840: 11 01 24 4a addl a1,0x20,a1
* 826069 4ff067c844: 00 00 d0 ac stl t5,0(a0)
826070 4ff067c848: 5f 07 ff 43 or zero,zero,zero
826071 4ff067c84c: 5f 07 ff 43 or zero,zero,zero
826072 4ff067c850: 08 00 90 ac stl t3,8(a0)
...
When requesting multiple nodes, job submission failed with:
submit-job failed, because some compute-nodes in list: [3365,17508,18162,18231,18243,18246,18248,18252,18255,18295,18297,18313,18319,18325,18327,18362,18364,18379-18380,18383-18385,18390,18416-18418,18420-18421,18423,31469,40382,40392,40395,40399-40400,40420,40434,40445,40461,40463-40464,40501,40818,40835,40838,40841] unavailable! check these nodes please!!!
Is this a problem with the nodes?
I ran into a few problems while trying to convert the matrixMul example from the manual into an athread version.
Problem 1: how do I get PE_MODE to work?
slave.c: In function 'func':
slave.c:24: error: 'PE_MODE' undeclared (first use in this function)
slave.c:24: error: (Each undeclared identifier is reported only once
slave.c:24: error: for each function it appears in.)
Can PE_MODE really not be found anywhere? As a workaround I simply defined PE_MODE as 0 myself.
Problem 2: how is __thread_local used?
Following the manual, I wrote this line:
__thread_local volatile unsigned long get_reply, put_reply;
but it does not compile:
slave.c:10: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'volatile'
slave.c: In function 'func':
slave.c:23: error: 'get_reply' undeclared (first use in this function)
slave.c:23: error: (Each undeclared identifier is reported only once
slave.c:23: error: for each function it appears in.)
slave.c:39: error: 'put_reply' undeclared (first use in this function)
So for now I changed them to local variables. How is __thread_local supposed to be used correctly?
Update: I have figured it out. slave.h is not my own header file; it is this one: /usr/sw-mpp/swcc/sw5gcc-binary/include/slave.h.
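Based on that, the corrected slave.c preamble would presumably look like the sketch below (assuming the system header provides PE_MODE and the __thread_local qualifier, as the update suggests):

/* Sketch: include the system athread slave header instead of a local "slave.h",
   so that PE_MODE and __thread_local are available. */
#include <slave.h>

__thread_local volatile unsigned long get_reply, put_reply;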
$ sw5cc -host -O3 -c matrixMul.c
$ sw5cc -slave -O3 -c slave.c
$ sw5cc -hybrid matrixMul.o slave.o -o test
$ bsub -I -b -q q_sw_share -n 1 -cgsp 64 ./test
/// \file matrixMul.c
#include <athread.h>
#include <stdio.h>
#include <sys/time.h>
#include "slave.h"

#define M 1024
#define N 1024
#define K 1024

double A[M][N];
double B[N][K];
double C[M][K];

extern SLAVE_FUN(func)();

// Timer
static inline unsigned long rpcc() {
    unsigned long time;
    asm("rtc %0": "=r" (time) : );
    return time;
}

int main() {
    int i, j, k;
    unsigned long count;

    //--------------------------------------------------------------------------
    // Matrix multiplication on host
    //--------------------------------------------------------------------------
    // init A, B
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            A[i][j] = i;
    for (i = 0; i < N; i++)
        for (j = 0; j < K; j++)
            B[i][j] = j;

    count = -rpcc();
    // on host: multiplication C = A*B
    for (i = 0; i < M; i++)
        for (k = 0; k < K; k++)
            for (j = 0; j < N; j++)
                C[i][k] += A[i][j] * B[j][k];
    count += rpcc();
    printf("Host: Matrix multiplication A[%d][%d] * B[%d][%d], counter: %ld\n", M, N, N, K, count);

    //--------------------------------------------------------------------------
    // Matrix multiplication on device
    //--------------------------------------------------------------------------
    // init A, B
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            A[i][j] = i;
    for (i = 0; i < N; i++)
        for (j = 0; j < K; j++)
            B[i][j] = j;

    athread_init();
    //athread_set_num_threads(64);

    count = -rpcc();
    // on device: multiplication C = A*B
    athread_spawn(func, 0);
    athread_join();
    count += rpcc();
    printf("Device: Matrix multiplication A[%d][%d] * B[%d][%d], counter: %ld\n", M, N, N, K, count);

    athread_halt();
    return 0;
}
/// \file slave.h
void func();
/// \file slave.c
#include "slave.h"

#define M 1024
#define N 1024
#define K 1024
#define PE_MODE 0 // error: PE_MODE undeclared

extern double A[M][N], B[N][K], C[M][K];

//__thread_local volatile unsigned long get_reply, put_reply; // error: expected '=', ',', ';', 'asm' or '__attribute__' before 'volatile'

void func() {
    volatile unsigned long get_reply, put_reply;
    double A_dev[N], B_dev[4][K], C_dev[K];
    int tid, tsize, round, roundsize;
    int j, k;

    tid = athread_get_id(-1);
    //tsize = athread_get_max_threads(); // error: slave_athread_get_max_threads undeclared
    tsize = 64;
    roundsize = N / 4;

    while (tid < M) {
        // Fetch a single row of A into LDM
        get_reply = 0;
        athread_get(PE_MODE, &A[tid][0], &A_dev[0], N*8, &get_reply, 0, 0, 0);
        while (get_reply != 1);   // one DMA issued, so wait for the counter to reach 1

        // Clear the accumulator for this row of C
        for (k = 0; k < K; k++)
            C_dev[k] = 0.0;

        // Matrix-vector multiplication, 4 rows of B at a time
        for (round = 0; round < roundsize; round++) {
            // Fetch rows 4*round .. 4*round+3 of B
            get_reply = 0;
            athread_get(PE_MODE, &B[4 * round][0], &B_dev[0][0], 4*K*8, &get_reply, 0, 0, 0);
            while (get_reply != 1);   // again a single DMA

            // Partial multiplication with the matching 4 elements of the A row
            for (k = 0; k < K; k++)
                for (j = 0; j < 4; j++)
                    C_dev[k] += A_dev[4 * round + j] * B_dev[j][k];
        }

        // Send the single row of C back to host
        put_reply = 0;
        athread_put(PE_MODE, &C_dev[0], &C[tid][0], K*8, &put_reply, 0, 0, 0);
        while (put_reply != 1);

        tid += tsize;
    }
}