Spark 기반 개인선호도를 반영한 추천시스템 연구 = A Study on the Recommendation System Reflecting Spark-Based Personal Preference

Multilingual Abstract

With the rapid development of the Internet and information technology, extracting valuable data from complex information has become an urgent problem. Recommendation is one effective way to address it: a recommendation system recommends similar products to target users based on their past behavior and preference information. However, several problems remain, such as data sparsity, cold start, and limited prediction accuracy. In particular, as the number of users and items grows, existing standalone (single-machine) recommendation algorithms hit a scalability bottleneck. Spark is a new parallel, in-memory big data computing engine. Because it is well suited to iterative parallel computation, it has received a lot of attention in the field of big data processing.
Neighbor-based and model-based recommendation algorithms are improved to address sparsity, cold start, and degraded prediction accuracy.
Considering the development trends of Spark-based applications at home and abroad, this paper studies recommendation algorithm technology on the Spark platform, covering the following two aspects.
(1) Parallelization of recommendation algorithms on the Spark platform
First, building on research on the Spark platform and on recommendation systems, a process for parallelizing recommendation algorithms on Spark is designed. Second, parallel versions of the recommendation algorithms, mainly user-based and item-based collaborative filtering, are implemented on the Spark platform. Finally, we analyze in detail how data and tasks are parallelized in the Spark in-memory implementation of these algorithms.
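The abstract does not give the implementation, but the way item-based collaborative filtering splits into independent map and reduce steps can be sketched in plain Python. This is a minimal illustration on a hypothetical toy rating list; the names `RATINGS` and `item_cosine_similarities` are assumptions made for this example, and the Spark operations named in the comments are analogies, not the thesis's code.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: (user, item, rating) triples.
RATINGS = [
    ("u1", "A", 5.0), ("u1", "B", 3.0),
    ("u2", "A", 4.0), ("u2", "B", 2.0), ("u2", "C", 5.0),
    ("u3", "B", 4.0), ("u3", "C", 1.0),
]


def item_cosine_similarities(ratings):
    """Cosine similarity between item rating vectors (missing ratings count as 0)."""
    # Step 1: group ratings by user (comparable to groupByKey on the user id).
    by_user = defaultdict(list)
    for user, item, rating in ratings:
        by_user[user].append((item, rating))

    # Step 2: each user independently emits partial results
    # (comparable to flatMap), which are summed per key
    # (comparable to reduceByKey).
    dot = defaultdict(float)      # (item_i, item_j) -> sum of rating products
    norm_sq = defaultdict(float)  # item -> sum of squared ratings
    for items in by_user.values():
        for item, rating in items:
            norm_sq[item] += rating * rating
        for (i, ri), (j, rj) in combinations(sorted(items), 2):
            dot[(i, j)] += ri * rj

    # Step 3: normalize each summed dot product by the two item norms.
    return {
        (i, j): d / (sqrt(norm_sq[i]) * sqrt(norm_sq[j]))
        for (i, j), d in dot.items()
    }


if __name__ == "__main__":
    for pair, sim in sorted(item_cosine_similarities(RATINGS).items()):
        print(pair, round(sim, 4))
```

Because every user's contribution in step 2 is computed without reference to any other user, the work can be partitioned by user across cluster nodes and only the per-pair sums need to be shuffled.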
(2) Optimization based on parallelization of the Spark platform
Optimization involves two aspects: platform optimization and recommendation algorithm optimization. For the platform, HSATS is proposed to solve the problem of unreasonable task scheduling in the parallel implementation of the recommendation algorithm when Spark cluster nodes are heterogeneous. For the neighbor-based recommendation algorithm, a novel approach is proposed that quantifies the implicit label attributes of users or items and fuses them into the similarity calculation. For the ALS model-based recommendation algorithm, a new loss function is designed that incorporates the similarity information of users and items before training.
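The abstract does not state the new loss function. As an illustration only, one common way to add similarity information to a matrix-factorization objective is a similarity-weighted smoothness penalty on the latent factors. Every symbol below is an assumption made for this sketch: $\mathcal{K}$ is the set of observed ratings $r_{ui}$, $p_u$ and $q_i$ are user and item factor vectors, $N(\cdot)$ is a precomputed neighbor set, $s_{uv}$ and $s_{ij}$ are precomputed similarities, and $\lambda$, $\alpha$, $\beta$ are regularization weights.

```latex
\min_{P,Q}\;
\sum_{(u,i)\in\mathcal{K}} \left(r_{ui} - p_u^{\top} q_i\right)^2
+ \lambda \left( \sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)
+ \alpha \sum_u \sum_{v \in N(u)} s_{uv}\, \lVert p_u - p_v \rVert^2
+ \beta \sum_i \sum_{j \in N(i)} s_{ij}\, \lVert q_i - q_j \rVert^2
```

The first two terms are the standard regularized squared-error objective that ALS minimizes. The last two pull the factors of similar users and similar items toward each other. With all other factors held fixed, the objective remains quadratic in each $p_u$ or $q_i$, so an alternating least-squares update is still possible under this assumed form.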
Experimental results show that Spark outperforms Hadoop in the parallel implementation of the recommendation algorithm, which requires many iterations. On heterogeneous Spark clusters, the HSATS adaptive task scheduling strategy reduces task completion time and makes more reasonable use of cluster node resources. The proposed algorithm optimizations improve the evaluation metrics of the recommendation system.
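The abstract does not describe how HSATS works. The sketch below is therefore not HSATS; it only illustrates why ignoring node speed hurts completion time on a heterogeneous cluster. It compares round-robin assignment with a simple greedy rule that gives each task to the node expected to finish it earliest. The function names, task sizes, and node speeds are all hypothetical.

```python
def round_robin_makespan(tasks, speeds):
    """Assign tasks to nodes in turn, ignoring node speed; return completion time."""
    finish = [0.0] * len(speeds)
    for k, work in enumerate(tasks):
        node = k % len(speeds)
        finish[node] += work / speeds[node]
    return max(finish)


def speed_aware_makespan(tasks, speeds):
    """Greedy rule: give each task (largest first) to the node that would finish it earliest."""
    finish = [0.0] * len(speeds)
    for work in sorted(tasks, reverse=True):
        node = min(range(len(speeds)), key=lambda n: finish[n] + work / speeds[n])
        finish[node] += work / speeds[node]
    return max(finish)


# Hypothetical workload: eight equal tasks, two slow nodes and one node twice as fast.
TASKS = [4.0] * 8
SPEEDS = [1.0, 1.0, 2.0]

if __name__ == "__main__":
    print("round robin:", round_robin_makespan(TASKS, SPEEDS))   # 12.0
    print("speed aware:", speed_aware_makespan(TASKS, SPEEDS))   # 8.0
```

In this toy setting, round-robin leaves the fast node idle while the slow nodes work through three tasks each. The speed-aware rule gives the fast node half of the tasks, so all three nodes finish at the same time.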


Korean Abstract

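With the rapid development of the Internet and information technology, how to obtain valuable data from complex information is a problem that urgently needs solving. Recommendation is one effective way to solve it. A recommendation system recommends similar products to target users based on their past behavior and preference information. However, several problems remain, such as data sparsity, cold start, and system prediction accuracy. In particular, as the number of users and items increases, existing standalone recommendation algorithms run into a scalability bottleneck. Spark is a new memory-based parallel big data computing engine. Owing to its advantages for iterative parallel computation, it has attracted much attention in the field of big data processing. Neighbor-based and model-based recommendation algorithms are improved to address sparsity, cold start, and degraded prediction accuracy.
Considering the development trends of Spark-based applications at home and abroad, this paper studies recommendation algorithm technology based on the Spark platform, covering the following two aspects.
(1) Parallelization of recommendation algorithms on the Spark platform
First, building on research on the Spark platform and on recommendation systems, a process for parallelizing recommendation algorithms on the Spark platform is designed. Second, the parallelization of Spark-based recommendation algorithms, mainly user-based and item-based collaborative filtering, is implemented. Finally, the way data and tasks are parallelized in the Spark in-memory implementation of the algorithms is analyzed in detail.
(2) Optimization based on parallelization of the Spark platform
Optimization involves two aspects: platform optimization and recommendation algorithm optimization. In the parallel implementation of the recommendation algorithm, HSATS is proposed to solve the problem of unreasonable task scheduling when Spark cluster nodes are heterogeneous. To optimize the neighbor-based recommendation algorithm, a new approach to the implicit label attributes of users or items is proposed: the attributes are quantified and then fused into the similarity calculation. For the ALS model-based recommendation algorithm, a new loss function is designed that incorporates the similarity information of users and items before training.
Experimental results show that Spark outperforms Hadoop in the parallel implementation of the recommendation algorithm, which requires many iterations. On heterogeneous Spark clusters, the HSATS adaptive task scheduling strategy shortens task completion time and uses cluster node resources more reasonably. The proposed algorithm optimizations improve the evaluation metrics of the recommendation system.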


Table of Contents