Sunway TaihuLight Bad Node Statistics Project

swmore

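Bad nodes have been turning up on the machine more and more often lately, and I would like to draw everyone's attention to this in a more public way.
I also want to build a documented, searchable record, to make it easier for everyone to work with queues and jobs.
The current idea is this: everyone reports the bad or suspected bad nodes they run into as replies to this thread, in the format below (remove the leading spaces), and I will write a script that checks a given CPU list for nodes that have been reported. Please follow the format exactly, otherwise the script will not find your report.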

    ```bad-node
    13165: floating-point computation error
    01886: the master-core version of CESM 1.3 produces incorrect numerical results.
    ```

13165: floating-point computation error
01886: the master-core version of CESM 1.3 produces incorrect numerical results.
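Before posting, a report line can be checked against the same pattern the lookup script uses. This is only a minimal sketch: `parse_report_line` is a hypothetical helper name that does not appear in the script, and the regular expression is copied from it. Since the node ID is converted with `int`, `01886` and `1886` end up as the same node.

```python
import re

# Same pattern as the lookup script: digits, an ASCII colon, then free text.
BAD_NODE_LINE = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")

def parse_report_line(line):
    """Return (node_id, problem) for a well-formed report line, else None.

    parse_report_line is a hypothetical helper, for illustration only.
    """
    m = BAD_NODE_LINE.match(line.strip())
    if m is None:
        return None
    return int(m.group("nodeid")), m.group("problem").strip()

if __name__ == "__main__":
    samples = [
        "13165: floating-point computation error",
        "node 13165 is broken",
    ]
    for sample in samples:
        print(repr(sample), "->", parse_report_line(sample))
```

A line that starts with anything other than the node number, or that uses a full-width colon (`：`) instead of `:`, returns `None` here and would likewise be skipped by the script.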

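Attached is a script for searching the reported bad nodes; run it with python3 on a machine that can reach the bbs. Click here to download it, then change the extension to py.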

import html.parser
import json
import re
import sys
import urllib.request

# A report line looks like "13165: description"; the node ID is all digits.
badnodeline_regex = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")
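class JSONDataFinder(html.parser.HTMLParser):
    """Collect the text of <script id="ajaxify-data">, which holds the topic as JSON."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.in_data = False
    @property
    def json_data(self):
        # Fall back to an empty JSON object when the script tag was not found.
        return "".join(self.chunks) or "{}"
    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            attrs_dict = dict(attrs)
            if attrs_dict.get('id', None) == 'ajaxify-data':
                self.in_data = True
    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_data = False
    def handle_data(self, data):
        if self.in_data:
            # The parser may deliver the script body in several pieces.
            self.chunks.append(data)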
        
class CodeFinder(html.parser.HTMLParser):
    """Collect the text of every <code class="language-bad-node"> block."""
    def __init__(self):
        super().__init__()
        self.code = ""
        self.in_data = False
    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            # The class attribute may carry several names, so test membership.
            classes = (dict(attrs).get('class') or '').split()
            if 'language-bad-node' in classes:
                self.in_data = True
    def handle_endtag(self, tag):
        if tag == 'code':
            self.in_data = False
    def handle_data(self, data):
        if self.in_data:
            self.code = self.code + data.strip() + "\n"

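def extract_node_list(nlist):
    """Expand a bjobs-style cpulist such as "13161-13166,17494" into node IDs."""
    ret = []
    blocks = nlist.strip().split(",")
    for block in blocks:
        block = block.strip()
        if not block:
            # Skip empty entries (blank input or a stray comma).
            continue
        sp = block.split("-")
        if len(sp) == 1:
            ret.append(int(sp[0]))
        else:
            bs = int(sp[0])
            be = int(sp[1])
            ret.extend(range(bs, be + 1))
    return ret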

# urllib.request.URLopener is deprecated; urlopen does the same job here.
url = 'http://bbs.nsccwx.cn/topic/119/%E7%A5%9E%E5%A8%81%E5%A4%AA%E6%B9%96%E4%B9%8B%E5%85%89%E5%9D%8F%E7%82%B9%E7%BB%9F%E8%AE%A1%E8%AE%A1%E5%88%92'
with urllib.request.urlopen(url) as response:
    response_html = response.read().decode()

# The page embeds the topic data, including every post, as JSON in the
# ajaxify-data script tag; each post's content is rendered HTML.
finder = JSONDataFinder()
finder.feed(response_html)
data = json.loads(finder.json_data)
code_finder = CodeFinder()
for post in data.get('posts', []):
    code_finder.feed(post['content'])

# Map node ID -> list of reported problems.
bad_node_dict = {}
for bad_node_line in code_finder.code.split("\n"):
    m = badnodeline_regex.match(bad_node_line)
    if m:
        nodeid = int(m.group('nodeid'))
        bad_node_dict.setdefault(nodeid, []).append(m.group('problem'))
    elif bad_node_line.strip():
        # Non-empty line that does not follow the "id: problem" format.
        print('error in parsing: "%s"' % bad_node_line)

while True:
    print("paste a cpulist below (following bjobs's format), CTRL-C to exit:")
    nlist = sys.stdin.readline()
    if not nlist:
        # End of input (e.g. CTRL-D): exit cleanly.
        break
    try:
        nodes = extract_node_list(nlist)
    except ValueError:
        print('cannot parse cpulist: "%s"' % nlist.strip())
        continue
    for node in nodes:
        if node in bad_node_dict:
            print("node: \033[1m\033[91m%05d\033[0m" % node)
            print("problems reported:")
            print("\033[91m    " + "\n    ".join(bad_node_dict[node]) + "\033[0m")

The result looks something like this:

[xduan@xduan-pc ~]$ python query_bad.py 
paste a cpulist below (following bjobs's format), CTRL-C to exit:
13161-13166,17494,20753
node: 13165
problems reported:
    floating-point computation error
paste a cpulist below (following bjobs's format), CTRL-C to exit:
13161-13166,17494,21753    
node: 13165
problems reported:
    floating-point computation error
node: 21753
problems reported:
    HOMME dynamical core and LAMMPS produce DMA Descriptor Examination Warning
paste a cpulist below (following bjobs's format), CTRL-C to exit:

Yes, it scrapes every problem reported on this page and then checks your node list against them.
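The lookup step can also be tried offline, without fetching the thread. The sketch below is only an illustration under stated assumptions: `expand_cpulist` and `find_reported` are hypothetical names, and the `reports` dictionary is filled in by hand with two entries taken from this thread rather than scraped.

```python
def expand_cpulist(cpulist):
    """Expand a bjobs-style list such as '13161-13166,17494' into node IDs."""
    nodes = []
    for block in cpulist.strip().split(","):
        block = block.strip()
        if not block:
            continue  # tolerate blank input and stray commas
        first, _, last = block.partition("-")
        start = int(first)
        end = int(last) if last else start
        nodes.extend(range(start, end + 1))
    return nodes

def find_reported(cpulist, reports):
    """Return {node_id: [problems]} for nodes in cpulist that appear in reports."""
    return {n: reports[n] for n in expand_cpulist(cpulist) if n in reports}

if __name__ == "__main__":
    # Hand-filled sample data; the real script builds this from the thread.
    reports = {
        13165: ["floating-point computation error"],
        21753: ["DMA Descriptor Examination Warning in HOMME and LAMMPS"],
    }
    hits = find_reported("13161-13166,17494,21753", reports)
    for node, problems in hits.items():
        print("%05d: %s" % (node, "; ".join(problems)))
```

Returning a dictionary instead of printing directly makes it easy to reuse the result, for example to decide whether a job's node list should be resubmitted.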

swmore

A test reply to help with writing the script, though these are all bad nodes that I have actually tracked down myself.

21753: HOMME dynamical core and LAMMPS produce DMA Descriptor Examination Warning
19474: HOMME dynamical core and LAMMPS produce DMA Descriptor Examination Warning
33246: athread_join fails in LAMMPS.
26987: LAMMPS runs noticeably slower than on other nodes.
3649: LAMMPS run raises SDLB transform Exception
11475: LAMMPS run raises SDLB transform Exception
6057: LAMMPS run raises Floating Point Exception
6065: LAMMPS run raises Floating Point Exception

popo

24842: GRAPES run cannot complete athread_join
25088: GRAPES run cannot complete athread_join

swmore

1139: the master-core version of CESM 1.3 shows a performance drop of about 40%.