Sunway TaihuLight Bad-Node Tracking Project



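  • Bad nodes have been turning up on the machine more and more often lately. I would like to draw everyone's attention to this in a more public way,
    and also to build a documented record that makes the queues and jobs easier for everyone to use.
    The current idea: everyone reports the bad (or suspected bad) nodes they run into as replies to this thread, in the format below (with the leading spaces removed), and I will write a script that looks up which nodes in a given CPU list have been reported. Please, please stick to the format exactly, otherwise the script will not find your report.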

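        ```bad-node
        13165: floating-point computation error
        01886: the main-core (MPE) version of CESM 1.3 produces incorrect numerical results.
        ```
    
    13165: floating-point computation error
    01886: the main-core (MPE) version of CESM 1.3 produces incorrect numerical results.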
    

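    Attached is a script for searching the bad-node reports; run it with python3 on a machine that can reach the BBS. Click here to download it, then change the file extension to py.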

    import html.parser
    import json
    import re
    import sys
    import urllib.request

    # One report line: "<node id>: <problem description>"
    badnodeline_regex = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")
    class JSONDataFinder(html.parser.HTMLParser):
        """Collect the JSON embedded in <script id="ajaxify-data"> on the topic page."""
        def __init__(self):
            super().__init__()
            self.json_data = "{}"
            self.in_data = False
        def handle_starttag(self, tag, attrs):
            if tag == 'script':
                attrs_dict = dict(attrs)
                if attrs_dict.get('id', None) == 'ajaxify-data':
                    self.in_data = True
        def handle_endtag(self, tag):
            if tag == 'script':
                self.in_data = False
        def handle_data(self, data):
            if self.in_data:
                self.json_data = data

    class CodeFinder(html.parser.HTMLParser):
        """Collect the text of every <code class="language-bad-node"> block in a post."""
        def __init__(self):
            super().__init__()
            self.code = ""
            self.in_data = False
        def handle_starttag(self, tag, attrs):
            if tag == 'code':
                attrs_dict = dict(attrs)
                # The class attribute may carry several names, so test membership.
                if 'language-bad-node' in (attrs_dict.get('class') or '').split():
                    self.in_data = True
        def handle_endtag(self, tag):
            if tag == 'code':
                self.in_data = False
        def handle_data(self, data):
            if self.in_data:
                self.code = self.code + data.strip() + "\n"
    
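    def extract_node_list(nlist):
        """Expand a bjobs-style cpulist such as "13161-13166,17494" into a list of ints."""
        ret = []
        for block in nlist.strip().split(","):
            if not block.strip():
                continue  # tolerate empty input and stray commas
            sp = block.split("-")
            if len(sp) == 1:
                ret.append(int(sp[0]))
            else:
                # "a-b" is an inclusive range
                ret.extend(range(int(sp[0]), int(sp[1]) + 1))
        return ret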
    
    # urllib.request.URLopener is deprecated; urlopen() is the supported call.
    topic_url = 'http://bbs.nsccwx.cn/topic/119/%E7%A5%9E%E5%A8%81%E5%A4%AA%E6%B9%96%E4%B9%8B%E5%85%89%E5%9D%8F%E7%82%B9%E7%BB%9F%E8%AE%A1%E8%AE%A1%E5%88%92'
    with urllib.request.urlopen(topic_url) as response:
        response_html = response.read().decode()

    # Pull the embedded topic JSON out of the page, then scan each post's
    # HTML for bad-node code blocks.
    finder = JSONDataFinder()
    finder.feed(response_html)
    data = json.loads(finder.json_data)
    code_finder = CodeFinder()
    for post in data.get('posts', []):
        code_finder.feed(post['content'])
    
    # Map node id -> list of problems reported for that node.
    bad_node_dict = {}
    for bad_node_line in code_finder.code.split("\n"):
        m = badnodeline_regex.match(bad_node_line)
        if m:
            bad_node_dict.setdefault(int(m.group('nodeid')), []).append(m.group('problem'))
        elif bad_node_line.strip():
            print('error in parsing: "%s"' % bad_node_line)
    
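    try:
        while True:
            print("paste a cpulist below (following bjobs's format), CTRL-C to exit:")
            nlist = sys.stdin.readline()
            if not nlist:
                break  # EOF on stdin
            try:
                nodes = extract_node_list(nlist)
            except ValueError:
                print('cannot parse cpulist: "%s"' % nlist.strip())
                continue
            for node in nodes:
                if node in bad_node_dict:
                    # ANSI escapes: bold + bright red for the node id, red for the problems
                    print("node: \033[1m\033[91m%05d\033[0m" % node)
                    print("problems reported:")
                    print("\033[91m    " + "\n    ".join(bad_node_dict[node]) + "\033[0m")
    except KeyboardInterrupt:
        print()  # exit quietly on CTRL-C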
    

    The output looks something like this:

    [xduan@xduan-pc ~]$ python query_bad.py 
    paste a cpulist below (following bjobs's format), CTRL-C to exit:
    13161-13166,17494,20753
    node: 13165
    problems reported:
        floating-point computation error
    paste a cpulist below (following bjobs's format), CTRL-C to exit:
    13161-13166,17494,21753
    node: 13165
    problems reported:
        floating-point computation error
    node: 21753
    problems reported:
        DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
    paste a cpulist below (following bjobs's format), CTRL-C to exit:
    

    Yes: it scrapes every error reported on this page and then checks your node list against them.
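
Since the script only picks up lines of the form `<node id>: <problem>`, it may be worth checking a report before posting it. The following is a minimal sketch, not part of the original script: `check_report_line` is a hypothetical helper that reuses the script's regex, and the full-width-colon check is an assumption about a likely typing slip with a Chinese input method.

```python
import re

# Same pattern the query script uses for one report line.
REPORT_LINE = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")

def check_report_line(line):
    """Return (ok, message) for one line of a bad-node report (hypothetical helper)."""
    text = line.strip()
    # Assumed pitfall: a full-width colon (U+FF1A) is not matched by the regex.
    if "\uff1a" in text and ":" not in text:
        return False, "full-width colon; use an ASCII ':'"
    m = REPORT_LINE.match(text)
    if m is None:
        return False, "expected '<node id>: <problem>'"
    if not m.group("problem").strip():
        return False, "missing problem description"
    return True, "ok"

if __name__ == "__main__":
    for sample in ["13165: floating-point computation error", "node 13165 is bad"]:
        print(sample, "->", check_report_line(sample))
```

A line that passes this check still has to be posted inside a `bad-node` code block for the script to see it.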



  • A test reply to help with writing the script; that said, these are all bad nodes I have actually tracked down.

    21753: DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
    19474: DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
    33246: athread_join misbehaves in LAMMPS.
    26987: LAMMPS runs noticeably slower than on other nodes.
    3649: LAMMPS run raises SDLB transform Exception
    11475: LAMMPS run raises SDLB transform Exception
    6057: LAMMPS run raises Floating Point Exception
    6065: LAMMPS run raises Floating Point Exception
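
With a list of reported nodes like this one, a natural follow-up is to see what is left of a cpulist once the reported nodes are dropped. The sketch below is only an illustration under assumptions: the helper names are hypothetical, the range syntax is the `a-b,c` form shown in the sample session, and the output is not zero-padded. How (or whether) such a list can be handed back to the scheduler is not covered here.

```python
def expand_cpulist(cpulist):
    """Expand '13161-13163,17494' into [13161, 13162, 13163, 17494]."""
    nodes = []
    for block in cpulist.strip().split(","):
        block = block.strip()
        if not block:
            continue
        first, _, last = block.partition("-")
        # A single id has no '-' part, so the range collapses to one node.
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes

def compress_cpulist(nodes):
    """Collapse node ids back into the 'a-b,c' range form (ids are not zero-padded)."""
    blocks = []
    run_start = prev = None
    for node in sorted(set(nodes)):
        if prev is not None and node == prev + 1:
            prev = node  # still inside the current consecutive run
            continue
        if prev is not None:
            blocks.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
        run_start = prev = node
    if prev is not None:
        blocks.append(str(run_start) if run_start == prev else f"{run_start}-{prev}")
    return ",".join(blocks)

def drop_bad_nodes(cpulist, bad_nodes):
    """Return the cpulist with every node in bad_nodes removed (hypothetical helper)."""
    bad = set(bad_nodes)
    return compress_cpulist(n for n in expand_cpulist(cpulist) if n not in bad)

if __name__ == "__main__":
    print(drop_bad_nodes("13161-13166,17494,21753", {13165, 21753}))
```

Here `bad_nodes` could be the keys of the `bad_node_dict` built by the query script.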
    


  • 24842: athread_join fails during GRAPES runs
    25088: athread_join fails during GRAPES runs
    


  • 1139: the main-core (MPE) version of CESM 1.3 suffers a performance drop of about 40%.
    
