An experimental analysis of limitations of MapReduce for iterative algorithms on Spark
Kang, Minseo; Lee, Jae-Gil. Baltzer Science Publishers, 2017. Cluster Computing, Vol. 20, No. 4
MapReduce is the most popular framework for distributed processing. Recently, the scalability of data mining and machine learning algorithms has improved significantly with the help of MapReduce. However, MapReduce does not handle iterative algorithms efficiently, and many data mining and machine learning algorithms are iterative by nature. To overcome the limitations of MapReduce, many advanced distributed systems have been developed, including HaLoop, iMapReduce, Twister, and Spark. In this paper, we identify and categorize the limitations of MapReduce in handling iterative algorithms, and then experimentally investigate the consequences of these limitations using the most flexible and stable distributed system, Spark. According to our experimental results, network I/O overhead was the factor that affected system performance the most. Disk I/O overhead also affected system performance, but less significantly than network I/O overhead. Synchronization overhead affected system performance only when the static data was not cached.
Implementation and Comparative Analysis of Recursive Query Algorithms on High-Speed Distributed In-Memory Platforms for Large-Scale Data Processing
Minseo Kang, Jaesung Kim, Jaegil Lee. Korean Institute of Information Scientists and Eng, 2016. Journal of KIISE (정보과학회논문지), Vol. 43, No. 6
Recursive query algorithms are used in many social network services, e.g., for reachability queries in social networks. The size of social network data has grown as these services have evolved, so it is now almost impossible to run a recursive query algorithm on a single machine. In this paper, we address this problem by implementing a recursive query on two popular in-memory distributed platforms, Spark and Twister. We evaluate the performance of the two implementations using 50 machines on Amazon EC2 and two real-world data sets, LiveJournal and ClueWeb. The results show that the recursive query algorithm performs better on Spark for the LiveJournal data set, which has a relatively high average degree but fewer vertices. In contrast, the recursive query on Twister outperforms Spark for the ClueWeb data set, which has a relatively low average degree but many vertices.
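
To illustrate what a recursive reachability query computes, the following is a minimal single-machine Python sketch. It is not the Spark or Twister implementation evaluated in the paper; the function name `reachable_from` and the edge-list input format are assumptions made for this example. Each loop iteration expands only the newly discovered vertices, which is the kind of iterative step a distributed platform would repeat across machines.

```python
from collections import defaultdict


def reachable_from(edges, source):
    """Return the set of vertices reachable from `source` via one or more edges.

    `edges` is an iterable of (src, dst) pairs (a hypothetical input format).
    `source` itself is included only if it lies on a cycle.
    """
    # Build an adjacency list once; this plays the role of the static data
    # that stays unchanged across iterations.
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)

    reached = set()
    frontier = {source}
    while frontier:
        # Follow one hop from every vertex discovered in the previous round.
        candidates = set()
        for vertex in frontier:
            candidates |= adjacency.get(vertex, set())
        # Keep only vertices not seen before; stop when nothing new appears.
        frontier = candidates - reached
        reached |= frontier
    return reached


if __name__ == "__main__":
    sample_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]
    print(sorted(reachable_from(sample_edges, 1)))
```

The loop terminates when an iteration discovers no new vertices, so the number of iterations is bounded by the longest shortest path from the source plus one.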