With the rapid development of the Internet and information technology, extracting valuable data from complex information has become an urgent problem. Recommendation is one effective way to address it. A recommendation system recommends similar products to target users based on their past behavior and preference information. However, several problems remain, such as data sparsity, cold start, and limited prediction accuracy. In particular, as the number of users and items grows, existing standalone recommendation algorithms hit a scalability bottleneck. Spark is a newer in-memory parallel computing engine for big data. Because of its advantages in iterative, parallel computation, it has received a lot of attention in the field of big data processing.
Neighbor-based and model-based recommendation algorithms have been improved to address the problems of sparsity, cold start, and degraded prediction accuracy.
Considering the development trends of Spark-based applications at home and abroad, this paper studies recommendation algorithm technology on the Spark platform, covering the following two aspects.
(1) A Study on the Parallelization of Recommendation Algorithms Based on the Spark Platform
First, building on a study of the Spark platform and recommendation systems, a process for parallelizing recommendation algorithms on the Spark platform is designed. Second, Spark-based recommendation algorithms, mainly user-based and item-based collaborative filtering, are implemented in parallel. Finally, we analyze in detail how data and tasks are parallelized in the in-memory implementation of these algorithms on Spark.
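As a rough illustration of how data and tasks might be split in item-based collaborative filtering, the following is a minimal sketch in plain Python rather than Spark code. It assumes cosine similarity, a partitioning of ratings by user, and toy data; the function names (split, map_partition, reduce_partials, item_similarities) are hypothetical and are not taken from the paper's implementation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Toy (user, item, rating) triples; hypothetical data for illustration only.
RATINGS = [
    ("u1", "a", 5.0), ("u1", "b", 3.0),
    ("u2", "a", 4.0), ("u2", "b", 2.0), ("u2", "c", 5.0),
    ("u3", "b", 4.0), ("u3", "c", 1.0),
]

def split(records, n):
    """Data parallelism: split records into n partitions keyed by user,
    so all ratings of one user land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[sum(map(ord, rec[0])) % n].append(rec)
    return parts

def map_partition(part):
    """Task parallelism: each partition computes partial sums on its own."""
    by_user = defaultdict(list)
    for user, item, rating in part:
        by_user[user].append((item, rating))
    dots, norms = defaultdict(float), defaultdict(float)
    for items in by_user.values():
        for item, r in items:
            norms[item] += r * r
        # Every pair of items co-rated by this user contributes to a dot product.
        for (i, ri), (j, rj) in combinations(sorted(items), 2):
            dots[(i, j)] += ri * rj
    return dots, norms

def reduce_partials(partials):
    """Merge partial sums from all partitions into cosine similarities."""
    dots, norms = defaultdict(float), defaultdict(float)
    for d, n in partials:
        for k, v in d.items():
            dots[k] += v
        for k, v in n.items():
            norms[k] += v
    return {(i, j): v / (sqrt(norms[i]) * sqrt(norms[j]))
            for (i, j), v in dots.items()}

def item_similarities(records, n_partitions=2):
    return reduce_partials(map_partition(p) for p in split(records, n_partitions))

if __name__ == "__main__":
    for pair, sim in sorted(item_similarities(RATINGS).items()):
        print(pair, round(sim, 3))
```

Because each partition only emits additive partial sums, the merged result does not depend on how many partitions are used, which is the property that makes this kind of computation straightforward to distribute.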
(2) Optimization based on parallelization of the Spark platform
Optimization covers two aspects: platform optimization and recommendation algorithm optimization. For the platform, the HSATS adaptive task scheduling strategy is proposed to solve the problem of unreasonable task scheduling when Spark cluster nodes are heterogeneous during parallel execution of the recommendation algorithms. For the neighbor-based recommendation algorithm, a novel approach is proposed that quantifies the implicit tag attributes of users or items and fuses them into the similarity calculation. For the ALS model-based recommendation algorithm, a new loss function is designed that incorporates the similarity information of users and items before training.
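The two algorithm-level ideas can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the actual formulas, so the Jaccard tag overlap, the linear mixing weight alpha, and the form of the similarity penalty (weight beta) are all hypothetical choices, as are the function names.

```python
def jaccard(a, b):
    """Tag-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def fused_similarity(rating_sim, tags_x, tags_y, alpha=0.7):
    """Blend a rating-based similarity with a tag-based one.
    alpha is an assumed mixing weight, not a value from the paper."""
    return alpha * rating_sim + (1.0 - alpha) * jaccard(tags_x, tags_y)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def similarity_regularized_loss(ratings, P, Q, user_sim, item_sim,
                                lam=0.1, beta=0.05):
    """Assumed loss: squared rating error + L2 penalty + a term that pulls
    the latent factors of similar users (and similar items) together.
    P and Q map user / item ids to latent factor lists."""
    err = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    l2 = lam * (sum(dot(p, p) for p in P.values())
                + sum(dot(q, q) for q in Q.values()))
    pull = beta * (
        sum(s * sq_dist(P[u], P[v]) for (u, v), s in user_sim.items())
        + sum(s * sq_dist(Q[i], Q[j]) for (i, j), s in item_sim.items())
    )
    return err + l2 + pull

if __name__ == "__main__":
    print(fused_similarity(0.5, {"x", "y"}, {"y", "z"}, alpha=0.5))
    P = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
    Q = {"a": [1.0, 1.0]}
    ratings = [("u1", "a", 1.0), ("u2", "a", 1.0)]
    print(similarity_regularized_loss(ratings, P, Q, {("u1", "u2"): 1.0}, {}))
```

With beta set to 0 the loss reduces to the usual regularized squared error minimized by ALS; a positive beta penalizes similar users or items whose latent factors drift apart.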
Experimental results show that Spark outperforms Hadoop in the parallel implementation of recommendation algorithms, which require many iterations. On heterogeneous Spark clusters, the HSATS adaptive task scheduling strategy can reduce task completion time and make more reasonable use of cluster node resources. The proposed recommendation algorithm optimizations improve the evaluation metrics of the recommendation system.
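To illustrate why heterogeneity-aware scheduling can shorten completion time, the sketch below compares a heterogeneity-blind round-robin assignment with a simple greedy, speed-aware one. It is not HSATS itself, whose details are not given here; the relative node speeds, task costs, and function names are assumptions made for the example.

```python
def makespan(task_costs, node_speeds, assignment):
    """Completion time: the largest per-node value of assigned work / speed."""
    load = [0.0] * len(node_speeds)
    for cost, node in zip(task_costs, assignment):
        load[node] += cost
    return max(l / s for l, s in zip(load, node_speeds))

def round_robin(task_costs, node_speeds):
    """Heterogeneity-blind baseline: deal tasks out evenly across nodes."""
    return [t % len(node_speeds) for t in range(len(task_costs))]

def speed_aware(task_costs, node_speeds):
    """Greedy alternative: give each task to the node that would finish it
    earliest, given the work already assigned to that node."""
    finish = [0.0] * len(node_speeds)
    assignment = []
    for cost in task_costs:
        node = min(range(len(node_speeds)),
                   key=lambda n: finish[n] + cost / node_speeds[n])
        finish[node] += cost / node_speeds[node]
        assignment.append(node)
    return assignment

if __name__ == "__main__":
    tasks = [2.0] * 6      # six equal tasks (hypothetical cost units)
    speeds = [1.0, 2.0]    # the second node is assumed to be twice as fast
    print(makespan(tasks, speeds, round_robin(tasks, speeds)))
    print(makespan(tasks, speeds, speed_aware(tasks, speeds)))
```

In this toy setting the even split leaves the fast node idle while the slow node finishes, whereas the speed-aware assignment balances finish times across both nodes.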