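First attach with swgdb and see where it is...
Posts by swmore
-
RE: math.h errors when compiling on Sunway
Try putting -lm and -lm_slave after the two .o files.
Or try -Wl,--whole-archive -lm -lm_slave -Wl,--no-whole-archive.
-
A small CPE (slave-core) debugging tool (updated 2018-12-21)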
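Update 2018-12-21:
- The tool can now report whether a CPE has raised an SDLB Transform Exception or a DMA Descriptor Examination Warning.
- CPE exception signals can now be received multiple times; each new dump is appended to the existing CPE spot file.
- CPE errors that carry a precise exception are now handled correctly, and the MPE program is paused at the point of the error.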
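Here is a small tool for debugging CPE code.
Main features of the stable version:
- Prints the CPE PC values on an SDLB Transform Exception or a DMA Descriptor Examination Warning.
- Prints the CPE PC values when the CPEs hang.
- On an Unknown Exception, records the entry address of the last CPE function that was spawned.
- Decodes the error spot of SDLB and DMA exceptions.
The beta version adds:
- Dumps every general-purpose register of every CPE when the corresponding signal is received.
- Can wake up a CPE program that was paused with the halt instruction (asm volatile("halt")), a bit like a breakpoint?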
Download:
libspc.tgz
There is now also a release directory on psn, at /home/export/online1/swmore/release/lib.
Usage:
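Linking:
Old method, now superseded: when linking the final program, add -Wl,--whole-archive libspc.a -Wl,--no-whole-archive,-wrap,athread_init,-wrap,__real_athread_spawn,-wrap,__expt_handler (use libspc-beta.a for the beta version).
There is no separate beta version any more. The former beta looks stable enough so far, and I have changed the output format a little.
source /home/export/online1/swmore/release/setenv   # set the required environment variables
<your original link command> $LINK_SPC
# In a build script that cannot see environment variables, run echo $LINK_SPC, look at the options it prints, and paste them in instead.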
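Running:
- If the program hits a CPE or core-group interface error, the error spot information is printed.
- If the program receives SIGUSR1, the error spot information is printed.
- If the beta version is linked and the program receives SIGUSR2, the halted CPEs are resumed and the error spot information is printed.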
Sending SIGUSR1/SIGUSR2:
- Use bsignal -s 30 <job ID> to send SIGUSR1 to the program, and bsignal -s 31 <job ID> to send SIGUSR2.
- After attaching to the program with swgdb or another GDB-like tool, use signal SIGUSR1 / signal SIGUSR2 to send the signal.
Error output:
Sample screen output:
Job <43262426> has been submitted to queue <q_sw_expr>
waiting for dispatch ...
dispatching ...
Before halt
Before halt
Before halt
Before halt
(NODE=vn000144, CG=0) wrote CPE spots to /home/export/online1/cesm06/dxh/workspace/cesm-scripts/spc/vn000144_cg0_cpe_spot.txt due to signal 31
(NODE=vn000144, CG=1) wrote CPE spots to /home/export/online1/cesm06/dxh/workspace/cesm-scripts/spc/vn000144_cg1_cpe_spot.txt due to signal 31
(NODE=vn000144, CG=2) wrote CPE spots to /home/export/online1/cesm06/dxh/workspace/cesm-scripts/spc/vn000144_cg2_cpe_spot.txt due to signal 31
(NODE=vn000144, CG=3) wrote CPE spots to /home/export/online1/cesm06/dxh/workspace/cesm-scripts/spc/vn000144_cg3_cpe_spot.txt due to signal 31
Waking up CPE 1!
After halt
Waking up CPE 1!
After halt
Waking up CPE 1!
Waking up CPE 1!
After halt
After halt
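Note the four lines in the middle of the form
(NODE=vn000144, CG=0) wrote CPE spots to /home/export/online1/cesm06/dxh/workspace/cesm-scripts/spc/vn000144_cg0_cpe_spot.txt due to signal 31
Each of them gives the node ID (not the process ID!!!), the core group number, the path of the error spot file, and the reason for the dump (30 is SIGUSR1, 31 is SIGUSR2, 55 is a CPE error).
Sample error file: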
==================DECODING ERR SPOT==================
Last function spawned 4ff04101b0
==================DECODE OF CPE PCs==================
0: 4ff0410830 1: 4ff0410830 2: 4ff0410824 3: 4ff0410824 4: 4ff0410824 5: 4ff0410210 6: 4ff0410824 7: 4ff0410824
8: 4ff0410830 9: 4ff0410830 10: 4ff0410830 11: 4ff0410824 12: 4ff0410824 13: 4ff0410824 14: 4ff0410824 15: 4ff0410830
16: 4ff0410824 17: 4ff0410830 18: 4ff0410824 19: 4ff0410824 20: 4ff0410824 21: 4ff0410824 22: 4ff0410830 23: 4ff0410830
24: 4ff0410824 25: 4ff0410830 26: 4ff0410824 27: 4ff0410824 28: 4ff0410830 29: 4ff0410824 30: 4ff0410830 31: 4ff0410830
32: 4ff0410830 33: 4ff0410824 34: 4ff0410830 35: 4ff0410830 36: 4ff0410830 37: 4ff0410824 38: 4ff0410830 39: 4ff0410830
40: 4ff0410824 41: 4ff0410830 42: 4ff0410830 43: 4ff0410830 44: 4ff0410830 45: 4ff0410824 46: 4ff0410824 47: 4ff0410830
48: 4ff0410830 49: 4ff0410824 50: 4ff0410830 51: 4ff0410824 52: 4ff0410824 53: 4ff0410824 54: 4ff0410830 55: 4ff0410830
56: 4ff0410830 57: 4ff0410830 58: 4ff0410824 59: 4ff0410830 60: 4ff0410830 61: 4ff0410824 62: 4ff0410824 63: 4ff0410824
#####################################################
## Can't determine the DMA and SDLB status due to  ##
## unknown reason, take the following info when job##
## shows corresponding exception.                  ##
#####################################################
==============DECODE OF SDLB ERROR SPOT==============
TC_SDLB_ERR_SPOT: 5800ff0000000080
REQ_TYPE: read
TC_SDLB_REQ_ADDR: 80
SRC_PE: 63
GRAIN: 8
SRC_TYPE: dma
OUT_OF_RANGE: no
OUT_OF_PERM: yes
===============DECODE OF DMACHK_FIELDS===============
DMACHK_FIELD0: 7000003
DMACHK_FIELD1: 6050
DMACHK_FIELD2: 200000
ERR_DETECTED: LDM unaligned
ERR_DETECTED: MEM unaligned
ERR_DETECTED: size unaligned
ERR_DETECTED: bsize unaligned
ERR_DETECTED: stride unaligned
SRC_PE: 5
TRASFER_SIZE: 3 (0x3)
BSIZE: 7 (0x7)
OP: DMA_GET
MODE: PE_MODE
REPLY_ADDR: 4
BCAST_MASK: 5
STRIDE: 6 (0x6)
MEM_ADDR: 1
LDM_ADDR: 2
DMA_PC[24:2]: 0x1040d5
=================DECODE OF CPE REGs================== # This section and everything below it appear only in the beta version.
CPE[00]:
$0( v0 ):
hex : 0000000000000000000000000000000000000000000000000000000000000000
4B mark : 7 6 5 4 3 2 1 0
8B mark : 3 2 1 0
doublev4_value: 0.000000e+00, 0.000000e+00, 0.000000e+00, 0.000000e+00
longv4_value : 0, 0, 0, 0
intv8_value : 0, 0, 0, 0, 0, 0, 0, 0
$1( t0 ):
hex : 0000000000000000000000000000000000000000000000000000005000018d88
4B mark : 7 6 5 4 3 2 1 0
8B mark : 3 2 1 0
doublev4_value: 1.697597e-312, 0.000000e+00, 0.000000e+00, 0.000000e+00
longv4_value : 343597485448, 0, 0, 0
intv8_value : 101768, 80, 0, 0, 0, 0, 0, 0
$2( t1 ):
hex : 000000000000000000000000000000000000000000000000000000500001c688
4B mark : 7 6 5 4 3 2 1 0
8B mark : 3 2 1 0
doublev4_value: 1.697597e-312, 0.000000e+00, 0.000000e+00, 0.000000e+00
longv4_value : 343597500040, 0, 0, 0
intv8_value : 116360, 80, 0, 0, 0, 0, 0, 0
$3( t2 ):
hex : 000000000000000000000000000000000000000000000000000000500001c688
4B mark : 7 6 5 4 3 2 1 0
8B mark : 3 2 1 0
doublev4_value: 1.697597e-312, 0.000000e+00, 0.000000e+00, 0.000000e+00
longv4_value : 343597500040, 0, 0, 0
# In the real file, data for many more registers and CPEs follows.
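Each screen line of the form "(NODE=..., CG=...) wrote CPE spots to ... due to signal N" names one dump file like the sample above, so a short script can collect those paths from a job log. The following is only a sketch in Python: the function name find_spot_files and the REASONS table are made up for illustration, and the regular expression assumes the message format stays exactly as in the sample screen output.

```python
import re

# Pattern for the "(NODE=..., CG=...) wrote CPE spots to ... due to signal N"
# message, copied from the sample screen output (the format is an assumption).
SPOT_LINE = re.compile(
    r"\(NODE=(?P<node>\w+), CG=(?P<cg>\d+)\) wrote CPE spots to "
    r"(?P<path>\S+) due to signal (?P<signal>\d+)"
)

# Signal numbers as listed in the post: 30 = SIGUSR1, 31 = SIGUSR2, 55 = CPE error.
REASONS = {30: "SIGUSR1", 31: "SIGUSR2", 55: "CPE error"}


def find_spot_files(log_text):
    """Return one dict per spot-file message found anywhere in log_text."""
    hits = []
    for match in SPOT_LINE.finditer(log_text):
        signal_number = int(match.group("signal"))
        hits.append({
            "node": match.group("node"),
            "cg": int(match.group("cg")),
            "path": match.group("path"),
            "signal": signal_number,
            "reason": REASONS.get(signal_number, "unknown"),
        })
    return hits
```

Because the search runs over the whole text rather than line by line, it also works on a log in which several messages ended up on one line.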
Interpreting the error information:
Last function spawned:
The last CPE function that was spawned; accurate in all cases.
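DECODE OF CPE PCs:
- Each line holds the current PCs of 8 CPEs. When an exception with a precise PC occurs (for example LDM Access Exception, Unaligned Exception, and other exceptions that print "the exact PC is..."), the CPE function has already been killed and the CPE is back in slave_waiting_for_task, so these values are not accurate; use the CPE PC printed with the exception itself instead.
- To look up the source line for a PC: addr2line -Cfe <application executable> <PC value>.
- To see the instructions around a PC: sw5objdump -d <application executable> | less, then type /<PC value> and press Enter to search the disassembly for the PC.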
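With 64 PCs per core group, it is often easier to group the CPEs by PC first and only then run addr2line on the few distinct values; a CPE sitting alone at an unusual PC (CPE 5 in the sample file) tends to be the interesting one. The following is a sketch in Python under the assumption that the spot file keeps the layout of the sample, with the PC table between the "DECODE OF CPE PCs" header and the next line starting with "=" or "#". The names parse_cpe_pcs and group_by_pc are hypothetical.

```python
import re
from collections import defaultdict

# One "<cpe number>: <hex pc>" entry, as in "5: 4ff0410210".
PC_ENTRY = re.compile(r"\b(\d+):\s*([0-9a-fA-F]+)\b")


def parse_cpe_pcs(spot_text):
    """Return {cpe_number: pc} read from the DECODE OF CPE PCs section."""
    pcs = {}
    in_section = False
    for line in spot_text.splitlines():
        if "DECODE OF CPE PCs" in line:
            in_section = True
            continue
        if in_section:
            # The next banner ("====" or "####") ends the PC table.
            if line.startswith(("=", "#")):
                break
            for cpe, pc in PC_ENTRY.findall(line):
                pcs[int(cpe)] = int(pc, 16)
    return pcs


def group_by_pc(pcs):
    """Return {pc: [cpe numbers at that pc]} with CPE numbers in ascending order."""
    groups = defaultdict(list)
    for cpe, pc in sorted(pcs.items()):
        groups[pc].append(cpe)
    return dict(groups)
```

Each key of the result, formatted with format(pc, "x"), can then be passed to addr2line -Cfe <application executable> as described above.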
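DECODE OF SDLB ERROR SPOT:
- Accurate when an SDLB Transform Exception occurs; random values in all other cases.
- TC_SDLB_ERR_SPOT: raw value of the SDLB error spot; only useful for debugging the debugging tool itself.
- REQ_TYPE: type of the SDLB request: read, write, or atomic operation.
- TC_SDLB_REQ_ADDR: address of the SDLB request.
- SRC_PE: the CPE that issued the faulting request.
- GRAIN: memory access granularity of the request (exact for GLD/GST; I cannot make sense of it for DMA).
- SRC_TYPE: origin of the request (dma for a DMA request, ibox for an instruction fetch, cpe for GLD/GST).
- OUT_OF_RANGE: whether the access is out of bounds.
- OUT_OF_PERM: whether the access violates permissions.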
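DECODE OF DMACHK_FIELDS:
- Accurate when a DMA Descriptor Examination Warning occurs; random values in all other cases.
- DMACHK_FIELDx: raw values of the DMA error spot; only useful for debugging the debugging tool itself.
- ERR_DETECTED: type of the DMA descriptor error; there may be several.
- SRC_PE: number of the CPE that issued the DMA request.
- TRASFER_SIZE, BSIZE, OP, MODE, REPLY_ADDR, BCAST_MASK, STRIDE, MEM_ADDR, LDM_ADDR: the corresponding parameters of the DMA or athread_get/put request.
- DMA_PC[24:2]: bits 2 to 24 of the PC of the DMA request. Shift it left by two bits and fill in the high address bits to find the PC where the DMA error was raised. Note: with athread_get/put, the DMA is issued inside the athread_get/put function itself, so refer to the DECODE OF CPE PCs section for the PC instead; it may point at a place that is waiting for the reply word.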
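The DMA_PC[24:2] reconstruction can be sketched as follows in Python. The post does not say where the high address bits come from; this sketch assumes they can be borrowed from a known CPE code address in the same dump, such as the "Last function spawned" value, which only holds if both addresses lie in the same 32 MiB-aligned region. The function name reconstruct_dma_pc is hypothetical.

```python
def reconstruct_dma_pc(dma_pc_field, reference_pc):
    """Rebuild a full PC from the DMA_PC[24:2] field.

    dma_pc_field: the value printed as DMA_PC[24:2] (23 bits, PC bits 24..2).
    reference_pc: any known CPE code address used as the source of bits 25
                  and above (an assumption, see the text above).
    """
    low_bits = (dma_pc_field & 0x7FFFFF) << 2   # restore bits 24..2, bits 1..0 are zero
    high_bits = reference_pc & ~0x1FFFFFF       # keep bits 25 and above
    return high_bits | low_bits
```

With the values from the sample file, reconstruct_dma_pc(0x1040d5, 0x4ff04101b0) gives 0x4ff0410354, which can then be looked up with addr2line or in the sw5objdump output.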
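DECODE OF CPE REGs:
- Decoded values of the CPE registers, handy for comparing against assembly or disassembly while debugging.
- Layout:
CPE[<CPE number>]:
$<register number>(<register alias>):
hex: the 256-bit value of the general-purpose register in hexadecimal, most significant digit first.
4B/8B mark: one digit per 4-byte/8-byte lane, to make it easy to cut out the hexadecimal form of an int/long.
doublev4_value: the 4 double values, from low to high, when the register is read as a doublev4 (still a usable reference when it holds a floatv4).
longv4_value: the 4 long values, from low to high, when the register is read as 4 longs (although there is no real longv4 type).
intv8_value: the 8 int values, from low to high, when the register is read as an intv8.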
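To double-check a decoded line, or to reinterpret a register in a way the tool does not print, the hex string can be split into lanes by hand. The sketch below, in Python, assumes what the layout above implies: the hex is printed most significant byte first, and lane 0 is the lowest 4 or 8 bytes. Treating the integer lanes as signed is a further assumption, and decode_register is a hypothetical name.

```python
import struct


def decode_register(hex_str):
    """Split a 256-bit register, given as 64 hex digits, into its lanes."""
    raw = bytes.fromhex(hex_str)      # most significant byte first, as printed
    if len(raw) != 32:
        raise ValueError("expected 64 hex digits (256 bits)")
    little_endian = raw[::-1]         # lowest byte first, so lane 0 comes first
    return {
        "intv8": struct.unpack("<8i", little_endian),     # 8 signed 32-bit lanes
        "longv4": struct.unpack("<4q", little_endian),    # 4 signed 64-bit lanes
        "doublev4": struct.unpack("<4d", little_endian),  # 4 IEEE-754 doubles
    }
```

For register $1 of the sample file, the longv4 lanes come out as (343597485448, 0, 0, 0) and the intv8 lanes as (101768, 80, 0, 0, 0, 0, 0, 0), matching the values the tool printed.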
Thanks:
- To the developers of LAMMPS, GRAPES and CESM, whose bugs provided plenty of test environments and material.