Sunway TaihuLight Bad-Node Tracking Project
-
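Bad nodes have been turning up on the machine more and more often lately, and I would like to draw everyone's attention to this in a more public way.
I also want to build a documented record, so that everyone can make better decisions about which queues to use and how to run jobs.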
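The current idea: whenever you hit a bad node or a suspected bad node, report it as a reply to this thread in the format below (with the leading space removed). I will write a script that checks a given CPU list for nodes that have been reported. Please, please get the format right, otherwise the script will not find your report.

 ```bad-node
 13165: floating-point computation error
 01886: the master-core (MPE) version of CESM 1.3 produces incorrect numerical results.
 ```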
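Before posting, a draft report can be checked against the same line pattern the lookup script in this thread uses. This is only a minimal sketch: `check_report_lines` is a hypothetical helper and not part of that script.

```python
import re

# Same pattern the lookup script in this thread applies to each report line:
# "<node id>: <problem description>"
BADNODE_LINE = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")


def check_report_lines(text):
    """Split a draft report into (parsed, rejected) lines.

    parsed   -- list of (node_id, problem) tuples the script would pick up
    rejected -- list of non-empty lines the script would silently skip
    """
    parsed, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = BADNODE_LINE.match(line)
        if m:
            parsed.append((int(m.group("nodeid")), m.group("problem")))
        else:
            rejected.append(line)
    return parsed, rejected
```

Note that the pattern only accepts an ASCII colon, so a line typed with a full-width colon ends up in `rejected`. Leading zeros in the node id are dropped when it is converted to an integer.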
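Attached is a script for looking up bad nodes. Run it with python3 on a machine that can reach the BBS. If you download the attachment, change its extension to .py.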
```python
import html.parser
import json
import re
import sys
import urllib.request

# One report line: "<node id>: <problem description>"
badnodeline_regex = re.compile(r"(?P<nodeid>[0-9]+)\s*:\s*(?P<problem>.*)")


class JSONDataFinder(html.parser.HTMLParser):
    """Grab the JSON payload inside <script id="ajaxify-data">."""

    def __init__(self):
        super().__init__()
        self.json_data = "{}"
        self.in_data = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            attrs_dict = dict(attrs)
            if attrs_dict.get('id', None) == 'ajaxify-data':
                self.in_data = True

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_data = False

    def handle_data(self, data):
        if self.in_data:
            self.json_data = data


class CodeFinder(html.parser.HTMLParser):
    """Collect the text of every <code class="language-bad-node"> block."""

    def __init__(self):
        super().__init__()
        self.code = ""
        self.in_data = False

    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            attrs_dict = dict(attrs)
            if attrs_dict.get('class', None) == 'language-bad-node':
                self.in_data = True

    def handle_endtag(self, tag):
        if tag == 'code':
            self.in_data = False

    def handle_data(self, data):
        if self.in_data:
            self.code = self.code + data.strip() + "\n"


def extract_node_list(nlist):
    """Expand a bjobs-style cpulist such as "13161-13166,17494"."""
    ret = []
    blocks = nlist.strip().split(",")
    for block in blocks:
        sp = block.split("-")
        if len(sp) == 1:
            ret.append(int(sp[0]))
        else:
            bs = int(sp[0])
            be = int(sp[1])
            for i in range(bs, be + 1):
                ret.append(i)
    return ret


# Fetch this thread's page.
url = 'http://bbs.nsccwx.cn/topic/119/%E7%A5%9E%E5%A8%81%E5%A4%AA%E6%B9%96%E4%B9%8B%E5%85%89%E5%9D%8F%E7%82%B9%E7%BB%9F%E8%AE%A1%E8%AE%A1%E5%88%92'
with urllib.request.urlopen(url) as response:
    response_html = response.read().decode()

# The post data is embedded in the page as JSON.
finder = JSONDataFinder()
finder.feed(response_html)
data = json.loads(finder.json_data)

# Pull the bad-node code blocks out of every post.
code_finder = CodeFinder()
for post in data['posts']:
    code_finder.feed(post['content'])

# Map node id -> list of reported problems.
bad_node_dict = {}
for bad_node_line in code_finder.code.split("\n"):
    m = badnodeline_regex.match(bad_node_line)
    if m:
        nodeid = int(m.group('nodeid'))
        bad_node_dict.setdefault(nodeid, []).append(m.group('problem'))

try:
    while True:
        print("paste a cpulist below (following bjobs's format), CTRL-C to exit:")
        nlist = sys.stdin.readline()
        if not nlist:          # EOF on stdin
            break
        if not nlist.strip():  # ignore blank lines
            continue
        nodes = extract_node_list(nlist)
        for node in nodes:
            if node in bad_node_dict:
                print("node: \033[1m\033[91m%05d\033[0m" % node)
                print("problems reported:")
                print("\033[91m " + "\n ".join(bad_node_dict[node]) + "\033[0m")
except KeyboardInterrupt:
    print()
```
The output looks something like this:
```
[xduan@xduan-pc ~]$ python query_bad.py
paste a cpulist below (following bjobs's format), CTRL-C to exit:
13161-13166,17494,20753
node: 13165
problems reported:
 floating-point computation error
paste a cpulist below (following bjobs's format), CTRL-C to exit:
13161-13166,17494,21753
node: 13165
problems reported:
 floating-point computation error
node: 21753
problems reported:
 DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
paste a cpulist below (following bjobs's format), CTRL-C to exit:
```
Yes, it scrapes every error reported on this page and then checks your node list against them.
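Stripped of the scraping, the lookup comes down to expanding the cpulist and intersecting it with the reported nodes. Here is a minimal sketch of just that step. The names `expand_cpulist` and `lookup` are hypothetical, and the hand-written `reports` dict merely stands in for whatever the script collects from the thread.

```python
def expand_cpulist(cpulist):
    """Expand a bjobs-style list like '13161-13166,17494' into node ids."""
    nodes = []
    for block in cpulist.strip().split(","):
        if not block:
            continue
        first, _, last = block.partition("-")
        start = int(first)
        end = int(last) if last else start  # a single id is a range of one
        nodes.extend(range(start, end + 1))
    return nodes


def lookup(cpulist, reports):
    """Return {node: [problems]} for the nodes in cpulist that were reported."""
    return {n: reports[n] for n in expand_cpulist(cpulist) if n in reports}


# Illustrative data only; the real script builds this from the thread.
reports = {
    13165: ["floating-point computation error"],
    21753: ["DMA Descriptor Examination Warning"],
}

for node, problems in lookup("13161-13166,17494,21753", reports).items():
    print("node: %05d" % node)
    for problem in problems:
        print(" " + problem)
```

Keeping the reports in a dict keyed by integer node id makes each membership check constant-time, so even a cpulist spanning thousands of nodes is cheap to scan.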
-
Posting a test reply to make writing the script easier. That said, these are all bad nodes I have actually tracked down myself.
```bad-node
21753: DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
19474: DMA Descriptor Examination Warning raised in the HOMME dynamical core and in LAMMPS
33246: problems with athread_join in LAMMPS.
26987: LAMMPS runs noticeably slower than on other nodes.
3649: SDLB transform Exception when running LAMMPS
11475: SDLB transform Exception when running LAMMPS
6057: Floating Point Exception when running LAMMPS
6065: Floating Point Exception when running LAMMPS
```
-
```bad-node
24842: athread_join fails when running GRAPES
25088: athread_join fails when running GRAPES
```
-
```bad-node
1139: the master-core (MPE) version of CESM 1.3 suffers roughly a 40% performance drop.
```