Solved: add -OPT:IEEE_arith=1 or -OPT:IEEE_arith=2 at compile time.
warning: node [26528]: user's mpe task: tid= 0, pid= 22346, terminated by sig 8
When compiling with sw5cc, floating-point arithmetic on the MPE (master core) does not appear to follow the IEEE 754 standard by default: floats whose exponent field is 0 (subnormals) are flushed to zero, and any attempt to use a subnormal in MPE code raises SIGFPE.
With the compiler option -OPT:IEEE_arith enabled, subnormals work normally on the MPE. The problem recorded below can be resolved with either of the following compile options:
-OPT:IEEE_arith=1
-OPT:IEEE_arith=2
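For context, a minimal MPE-side sketch of the behaviour described above (the literal 5.877e-39f is just an arbitrary value below FLT_MIN; the fault only appears when building without -OPT:IEEE_arith):

/* Sketch: a subnormal float can be printed on the MPE, but arithmetic on it
   raises SIGFPE unless the code is compiled with -OPT:IEEE_arith=1 or =2. */
#include <stdio.h>

int main(void) {
    float x = 5.877e-39f;        /* below FLT_MIN (~1.175e-38), i.e. subnormal */
    printf("printing works: %e\n", x);
    float y = x + x;             /* arithmetic on the subnormal faults without the flag */
    printf("arithmetic works: %e\n", y);
    return 0;
}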
Starting from the C division example in the manual, the same expression is used on both the MPE and the CPEs to repeatedly divide a float by 1000 (the array indices are j and i, where j is also the thread id; element (j, i) is divided by 1000 a total of (j+1)*i times). The data is then transferred from the CPEs back to the MPE, and all elements are summed as a checksum for comparison. A sketch of the host-side reference loop is shown below.
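/* Sketch of the host-side half of the test described above (not the original
   manual code; the starting value and dimensions are placeholders).  Element
   (j, i) is divided by 1000 a total of (j+1)*i times; the CPEs compute array c
   the same way and the two results are compared. */
#define NTHREADS 2
#define NELEM    16

float cc[NTHREADS][NELEM];

void host_reference(float start) {
    int i, j, n;
    for (j = 0; j < NTHREADS; j++) {          /* j is also the CPE thread id */
        for (i = 0; i < NELEM; i++) {
            float v = start;
            for (n = 0; n < (j + 1) * i; n++)
                v /= 1000.0f;
            cc[j][i] = v;
        }
    }
}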
Both sw5cc and sw5gcc were tried. The MPE and CPE sides were compiled with the same options; -O1, -O2 and -O3 all fail in the same way.
With float data and only 2 CPEs, the following happens: a number around 2^(-127) ≈ 5.877e-39 can be printed on the MPE, but using it in arithmetic raises SIGFPE. The detailed output is below. Array c is computed on the CPEs, array cc on the MPE, and the line after each pair of values is the hexadecimal representation of the element of c.
Results from CPE 0:
c[0][0] = 3.6666666623209573e-18, cc[0][0] = 3.6666666623209573e-18
22 87 46 b0
c[0][1] = 3.6666666849391771e-21, cc[0][1] = 3.6666666849391771e-21
1d 8a 85 d3
c[0][2] = 3.6666665082343344e-24, cc[0][2] = 3.6666665082343344e-24
18 8d d8 e8
c[0][3] = 3.6666663171820839e-27, cc[0][3] = 3.6666663171820839e-27
13 91 40 6a
c[0][4] = 3.6666663021357562e-30, cc[0][4] = 3.6666663021357562e-30
0e 94 bc d7
c[0][5] = 3.6666662639321898e-33, cc[0][5] = 3.6666662639321898e-33
09 98 4e af
c[0][6] = 3.6666663500279674e-36, cc[0][6] = 3.6666663500279674e-36
04 9b f6 76
c[0][7] = 6.9405734356891773e-309, cc[0][7] = 0.0000000000000000e+00
00 27 ed 2d
c[0][8] = 6.9415787311951471e-312, cc[0][8] = 0.0000000000000000e+00
00 00 0a 39
c[0][9] = 7.9574842161197712e-315, cc[0][9] = 0.0000000000000000e+00
00 00 00 03
c[0][10] = 0.0000000000000000e+00, cc[0][10] = 0.0000000000000000e+00
00 00 00 00
Results from CPE 1:
c[1][0] = 3.6666666623209573e-18, cc[1][0] = 3.6666666623209573e-18
22 87 46 b0
c[1][1] = 3.6666665082343344e-24, cc[1][1] = 3.6666665082343344e-24
18 8d d8 e8
c[1][2] = 3.6666663021357562e-30, cc[1][2] = 3.6666663021357562e-30
0e 94 bc d7
c[1][3] = 3.6666663500279674e-36, cc[1][3] = 3.6666663500279674e-36
04 9b f6 76
c[1][4] = 6.9415787311951471e-312, cc[1][4] = 0.0000000000000000e+00
00 00 0a 39
c[1][5] = 0.0000000000000000e+00, cc[1][5] = 0.0000000000000000e+00
00 00 00 00
With their different loop counts, the two CPEs print different numbers of subnormals, and the printed values are far smaller than any subnormal the float type can represent.
One of these numbers, printed with printf("%.16e\n"), is
- decimal: 7.9574842161197712e-315
- hexadecimal: 00 00 00 03
- binary: 0000 0000 0000 0000 0000 0000 0000 0011
The 4-byte single-precision float that the hex pattern 00 00 00 03 actually encodes is 4.2039e-45. Treated as an 8-byte double-precision value, 7.9574842161197712e-315 corresponds to the hex pattern 00 00 00 00 60 00 00 00.
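The two interpretations can be checked on any little-endian machine with a small sketch like this (values taken from the output above):

/* Sketch: reinterpret the 4-byte pattern 00 00 00 03 as a float, and dump the
   byte pattern of 7.9574842161197712e-315 stored as a double. */
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned int bits = 0x00000003u;   /* the "00 00 00 03" pattern */
    float f;
    double d = 7.9574842161197712e-315;
    unsigned char b[8];
    int i;

    memcpy(&f, &bits, sizeof f);
    printf("as float: %.4e\n", f);     /* 4.2039e-45 */

    memcpy(b, &d, sizeof d);
    for (i = 7; i >= 0; i--)           /* print most significant byte first */
        printf("%02x ", b[i]);
    printf("\n");                      /* 00 00 00 00 60 00 00 00 */
    return 0;
}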
It seems that subnormals simply cannot be used on the MPE. Going downward, the last float value whose exponent field is non-zero is 0x00800000 (FLT_MIN, the smallest normal number); using it in a computation does not raise an error. The first value whose exponent field is zero is 0x007fffff (the largest subnormal); using it in a computation raises the error.
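That boundary can be pinned down with a bit-pattern test along these lines (whether the second multiplication faults depends on the -OPT:IEEE_arith setting described at the top):

/* Sketch: 0x00800000 is the smallest normal float (FLT_MIN), 0x007fffff the
   largest subnormal.  On the MPE the former works in arithmetic, the latter
   raises SIGFPE unless -OPT:IEEE_arith is enabled. */
#include <stdio.h>

union fbits { unsigned int u; float f; };

int main(void) {
    union fbits smallest_normal   = { 0x00800000u };  /* exponent field = 1 */
    union fbits largest_subnormal = { 0x007fffffu };  /* exponent field = 0 */
    printf("normal * 2    : %e\n", smallest_normal.f * 2.0f);
    printf("subnormal * 2 : %e\n", largest_subnormal.f * 2.0f);
    return 0;
}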
What is going on here? Do the MPE and the CPEs handle floating-point subnormals differently?
Is there a way to avoid this? Adding a check on the CPEs that flushes all of these numbers to 0 lets the program run correctly (a sketch of such a check is below), but it wastes some time.
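The flush can be done without any library calls by testing the exponent field directly; the helper name here is mine:

/* Sketch of the CPE-side workaround: flush any float whose exponent field is
   zero (i.e. zero or subnormal) to 0.0f before it is sent back to the MPE. */
static inline float flush_subnormal(float x) {
    union { float f; unsigned int u; } v;
    v.f = x;
    if ((v.u & 0x7f800000u) == 0)   /* exponent bits all zero */
        v.f = 0.0f;
    return v.f;
}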
Adding the -share_size parameter when submitting the job made it run normally. With 256 nodes and 1024 processes, at least -share_size 76 is needed for it to run.
Code for testing MPI initialization:
/// \file mpi_test.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (!rank) {
        printf("[Rank %d] Finalizing...\n", rank);
    }
    MPI_Finalize();
    return 0;
}
Compile and run:
mpicc -O3 mpi_test.c -o mpi_test
bsub -q q_sw_share -N 200 -np 4 ./mpi_test
I picked several node counts between 128 and 201 to test.
128 nodes (512 processes) works fine; 196 nodes (784 processes) only works some of the time.
Problem 1: at around 200 nodes the job frequently fails with "Other MPI error".
I ran the 200-node case several times and always got exactly the same error. Is the error below caused by the nodes themselves?
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000454ecc, status=0x5000454ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
[vn025377:mpi_rank_118][MPIDI_CH3_Abort] Fatal error in PMPI_Wait:
Other MPI error, error stack:
PMPI_Wait(182).....................: MPI_Wait(request=0x5000728ecc, status=0x5000728ed0) failed
MPIR_Wait_impl(71).................:
_MPIDI_CH3I_Progress(292)..........:
handle_read(1134)..................:
handle_read_individual(1325).......:
MPIDI_CH3_PktHandler_EagerSend(875): Failed to allocate memory for an unexpected message. 7 unexpected messages queued.
: No such file or directory (2)
Problem 2: at 201 nodes I get CATCHSIG...Segmentation Fault...
An excerpt of the printed output:
CATCHSIG: Myid = 6(CPU 25346,CG 2), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 643(CPU 25528,CG 3), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
CATCHSIG: Myid = 640(CPU 25528,CG 0), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_640][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
CATCHSIG: Myid = 641(CPU 25528,CG 1), si_signo = 11(Segmentation Fault: PC = 0x4ff067c844)
[vn025528:mpi_rank_641][MPIDI_CH3_Abort] application called MPI_Abort(MPI_COMM_WORLD, 1000) - process 0: No such file or directory (2)
The output file is full of these lines; counting them shows one identical error message per process on every node.
In total there are 804 PC values, all with exactly the same address, 0x4ff067c844. Looking that address up in the executable shows it lies inside memcpy (listing line 826069 below, marked with *). What is going on here?
826023 0000004ff067c790 <memcpy>:
826024 4ff067c790: 40 07 f0 43 or zero,a0,v0
826025 4ff067c794: 8a 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826026 4ff067c798: 81 07 11 42 xor a0,a1,t0
826027 4ff067c79c: 01 e7 20 48 and t0,0x7,t0
826028 4ff067c7a0: 5f 00 20 c4 bne t0,4ff067c920 <memcpy+0x190>
826029 4ff067c7a4: 01 e7 00 4a and a0,0x7,t0
826030 4ff067c7a8: 09 00 20 c0 beq t0,4ff067c7d0 <memcpy+0x40>
826031 4ff067c7ac: 5f 07 ff 43 or zero,zero,zero
826032 4ff067c7b0: 00 00 31 80 ldbu t0,0(a1)
826033 4ff067c7b4: 32 21 40 4a subl a2,0x1,a2
826034 4ff067c7b8: 11 21 20 4a addl a1,0x1,a1
826035 4ff067c7bc: 00 00 30 a0 stb t0,0(a0)
826036 4ff067c7c0: 10 21 00 4a addl a0,0x1,a0
826037 4ff067c7c4: 01 e7 00 4a and a0,0x7,t0
826038 4ff067c7c8: 7d 00 40 ce ble a2,4ff067c9c0 <memcpy+0x230>
826039 4ff067c7cc: f8 ff 3f c4 bne t0,4ff067c7b0 <memcpy+0x20>
826040 4ff067c7d0: 41 e5 4f 4a cmple a2,0x7f,t0
826041 4ff067c7d4: 36 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826042 4ff067c7d8: 01 e7 07 4a and a0,0x3f,t0
826043 4ff067c7dc: 08 00 20 c0 beq t0,4ff067c800 <memcpy+0x70>
826044 4ff067c7e0: 00 00 31 8c ldl t0,0(a1)
826045 4ff067c7e4: 32 01 41 4a subl a2,0x8,a2
826046 4ff067c7e8: 11 01 21 4a addl a1,0x8,a1
826047 4ff067c7ec: 5f 07 ff 43 or zero,zero,zero
826048 4ff067c7f0: 00 00 30 ac stl t0,0(a0)
826049 4ff067c7f4: 10 01 01 4a addl a0,0x8,a0
826050 4ff067c7f8: 01 e7 07 4a and a0,0x3f,t0
826051 4ff067c7fc: f8 ff 3f c4 bne t0,4ff067c7e0 <memcpy+0x50>
826052 4ff067c800: 07 01 08 4a addl a0,0x40,t6
826053 4ff067c804: 41 e5 4f 4a cmple a2,0x7f,t0
826054 4ff067c808: 29 00 20 c4 bne t0,4ff067c8b0 <memcpy+0x120>
826055 4ff067c80c: 5f 07 ff 43 or zero,zero,zero
826056 4ff067c810: 00 01 e7 9b fetchd_w 256(t6)
826057 4ff067c814: 00 00 d1 8c ldl t5,0(a1)
826058 4ff067c818: 5f 07 ff 43 or zero,zero,zero
826059 4ff067c81c: 5f 07 ff 43 or zero,zero,zero
826060 4ff067c820: 08 00 91 8c ldl t3,8(a1)
826061 4ff067c824: 10 00 b1 8c ldl t4,16(a1)
826062 4ff067c828: 07 01 e8 48 addl t6,0x40,t6
826063 4ff067c82c: 5f 07 ff 43 or zero,zero,zero
826064 4ff067c830: 18 00 71 8c ldl t2,24(a1)
826065 4ff067c834: 01 01 08 4a addl a0,0x40,t0
826066 4ff067c838: 5f 07 ff 43 or zero,zero,zero
826067 4ff067c83c: 5f 07 ff 43 or zero,zero,zero
826068 4ff067c840: 11 01 24 4a addl a1,0x20,a1
* 826069 4ff067c844: 00 00 d0 ac stl t5,0(a0)
826070 4ff067c848: 5f 07 ff 43 or zero,zero,zero
826071 4ff067c84c: 5f 07 ff 43 or zero,zero,zero
826072 4ff067c850: 08 00 90 ac stl t3,8(a0)
...
When requesting multiple nodes, job submission failed with:
submit-job failed, because some compute-nodes in list: [3365,17508,18162,18231,18243,18246,18248,18252,18255,18295,18297,18313,18319,18325,18327,18362,18364,18379-18380,18383-18385,18390,18416-18418,18420-18421,18423,31469,40382,40392,40395,40399-40400,40420,40434,40445,40461,40463-40464,40501,40818,40835,40838,40841] unavailable! check these nodes please!!!
Is this a problem with the nodes?
I ran into a few problems while trying to convert the matrixMul example from the manual into an athread version.
Problem 1: how do I get PE_MODE to work?
slave.c: In function 'func':
slave.c:24: error: 'PE_MODE' undeclared (first use in this function)
slave.c:24: error: (Each undeclared identifier is reported only once
slave.c:24: error: for each function it appears in.)
Can PE_MODE really not be found anywhere? As a workaround I simply defined PE_MODE as 0 myself.
Problem 2: how is __thread_local used?
Following the manual, I wrote this line:
__thread_local volatile unsigned long get_reply, put_reply;
but it does not compile:
slave.c:10: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'volatile'
slave.c: In function 'func':
slave.c:23: error: 'get_reply' undeclared (first use in this function)
slave.c:23: error: (Each undeclared identifier is reported only once
slave.c:23: error: for each function it appears in.)
slave.c:39: error: 'put_reply' undeclared (first use in this function)
So for now I changed them to local variables. How is __thread_local supposed to be used correctly?
Update: I have figured it out. slave.h is not my own header file; it is this one: /usr/sw-mpp/swcc/sw5gcc-binary/include/slave.h.
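Based on that, the corrected slave.c preamble would presumably look like the sketch below (assuming the system header provides PE_MODE and the __thread_local qualifier, as the update suggests):

/* Sketch: include the system athread slave header instead of a local "slave.h",
   so that PE_MODE and __thread_local are available. */
#include <slave.h>

__thread_local volatile unsigned long get_reply, put_reply;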
$ sw5cc -host -O3 -c matrixMul.c
$ sw5cc -slave -O3 -c slave.c
$ sw5cc -hybrid matrixMul.o slave.o -o test
$ bsub -I -b -q q_sw_share -n 1 -cgsp 64 ./test
/// \file matrixMul.c
#include <athread.h>
#include <stdio.h>
#include <sys/time.h>
#include "slave.h"

#define M 1024
#define N 1024
#define K 1024

double A[M][N];
double B[N][K];
double C[M][K];

extern SLAVE_FUN(func)();

// Timer
static inline unsigned long rpcc() {
    unsigned long time;
    asm("rtc %0": "=r" (time) : );
    return time;
}

int main() {
    int i, j, k;
    unsigned long count;

    //--------------------------------------------------------------------------
    // Matrix multiplication on host
    //--------------------------------------------------------------------------
    // init A, B
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            A[i][j] = i;
    for (i = 0; i < N; i++)
        for (j = 0; j < K; j++)
            B[i][j] = j;

    count = -rpcc();
    // on host: multiplication C = A*B
    for (i = 0; i < M; i++)
        for (k = 0; k < K; k++)
            for (j = 0; j < N; j++)
                C[i][k] += A[i][j] * B[j][k];
    count += rpcc();
    printf("Host: Matrix multiplication A[%d][%d] * B[%d][%d], counter: %ld\n", M, N, N, K, count);

    //--------------------------------------------------------------------------
    // Matrix multiplication on device
    //--------------------------------------------------------------------------
    // init A, B
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            A[i][j] = i;
    for (i = 0; i < N; i++)
        for (j = 0; j < K; j++)
            B[i][j] = j;

    athread_init();
    //athread_set_num_threads(64);

    count = -rpcc();
    // on device: multiplication C = A*B
    athread_spawn(func, 0);
    athread_join();
    count += rpcc();
    printf("Device: Matrix multiplication A[%d][%d] * B[%d][%d], counter: %ld\n", M, N, N, K, count);

    athread_halt();
    return 0;
}
/// \file slave.h
void func();
/// \file slave.c
#include "slave.h"

#define M 1024
#define N 1024
#define K 1024
#define PE_MODE 0 // error: PE_MODE undeclared

extern double A[M][N], B[N][K], C[M][K];

//__thread_local volatile unsigned long get_reply, put_reply; // error: expected '=', ',', ';', 'asm' or '__attribute__' before 'volatile'

void func() {
    volatile unsigned long get_reply, put_reply;
    double A_dev[N], B_dev[4][K], C_dev[K];
    int tid, tsize, round, roundsize;
    int j, k;

    tid = athread_get_id(-1);
    //tsize = athread_get_max_threads(); // error: slave_athread_get_max_threads undeclared
    tsize = 64;
    roundsize = N / 4;

    while (tid < M) {
        // Fetch a single row of A into LDM
        get_reply = 0;
        athread_get(PE_MODE, &A[tid][0], &A_dev[0], N*8, &get_reply, 0, 0, 0);
        while (get_reply != 1);   // one DMA issued, so wait for the counter to reach 1

        // Clear the accumulator for this row of C
        for (k = 0; k < K; k++)
            C_dev[k] = 0.0;

        // Matrix-vector multiplication, 4 rows of B at a time
        for (round = 0; round < roundsize; round++) {
            // Fetch rows 4*round .. 4*round+3 of B
            get_reply = 0;
            athread_get(PE_MODE, &B[4 * round][0], &B_dev[0][0], 4*K*8, &get_reply, 0, 0, 0);
            while (get_reply != 1);   // again a single DMA

            // Partial multiplication with the matching 4 elements of the A row
            for (k = 0; k < K; k++)
                for (j = 0; j < 4; j++)
                    C_dev[k] += A_dev[4 * round + j] * B_dev[j][k];
        }

        // Send the single row of C back to host
        put_reply = 0;
        athread_put(PE_MODE, &C_dev[0], &C[tid][0], K*8, &put_reply, 0, 0, 0);
        while (put_reply != 1);

        tid += tsize;
    }
}