Scalable Approach to Failure Analysis of High-Performance Computing Systems
Doaa Shawky, Electronics and Telecommunications Research Institute (ETRI), 2014, ETRI Journal Vol.36 No.6
Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
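The abstract does not describe the implementation, so the following is only a minimal sketch of one standard rough set measure, the dependency degree, which indicates how well a set of condition attributes determines a decision attribute. It is not the paper's method or code. The attribute names (`workload`, `node_age`, `root_cause`), the toy records, and the helper functions `partition` and `dependency_degree` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into indiscernibility classes:
    rows that agree on every attribute in attrs share a class."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())

def dependency_degree(rows, condition_attrs, decision_attr):
    """Return |POS| / |U|, where POS is the union of condition classes
    that fall entirely inside a single decision class."""
    decision_blocks = partition(rows, [decision_attr])
    positive = set()
    for block in partition(rows, condition_attrs):
        if any(block <= d for d in decision_blocks):
            positive |= block
    return len(positive) / len(rows)

# Hypothetical toy failure records (not taken from the paper's data set).
# The last two records are deliberately inconsistent: same condition
# values, different root cause.
records = [
    {"workload": "compute",  "node_age": "old", "root_cause": "hardware"},
    {"workload": "compute",  "node_age": "old", "root_cause": "hardware"},
    {"workload": "compute",  "node_age": "new", "root_cause": "software"},
    {"workload": "graphics", "node_age": "new", "root_cause": "software"},
    {"workload": "graphics", "node_age": "old", "root_cause": "hardware"},
    {"workload": "graphics", "node_age": "old", "root_cause": "software"},
]

gamma_both = dependency_degree(records, ["workload", "node_age"], "root_cause")
gamma_age = dependency_degree(records, ["node_age"], "root_cause")
gamma_workload = dependency_degree(records, ["workload"], "root_cause")

print(f"both attributes: {gamma_both:.2f}")
print(f"node_age only:   {gamma_age:.2f}")
print(f"workload only:   {gamma_workload:.2f}")
```

In this toy table, an attribute whose removal lowers the dependency degree carries information about the root cause, which is the general sense in which rough set analysis can highlight influential attributes. Inconsistent records are kept out of the positive region rather than discarded, which is how the measure tolerates the ambiguity the abstract mentions.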