Rapid advances in information technology and the emergence of information channels such as mobile devices and social media have produced a tremendous amount of data. The evolution of smartphones and social network services (SNS) has led to the big data revolution.
Not only has the amount of data been growing exponentially, but more diverse types of data (structured, semi-structured, and unstructured) are also emerging. In the case of Twitter and Facebook, different analytical methods are needed depending on the type of data.
In the case of online shopping, log data can be used to analyze consumers' purchase patterns by measuring how long it takes them to purchase an item after logging in to the website. Collecting and analyzing such large and varied data is more challenging than handling standard, conventional data.
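The login-to-purchase measurement described above can be sketched as follows. This is a minimal illustration, not the method used in the study: the event tuple layout `(user, action, timestamp)`, the action names `"login"` and `"purchase"`, and the function name `time_to_purchase` are all assumptions made for the example.

```python
from datetime import datetime


def time_to_purchase(events):
    """Return (user, seconds) pairs: time from a user's latest login to each purchase.

    `events` is assumed to be an iterable of (user, action, iso_timestamp)
    tuples already sorted by timestamp (hypothetical log layout).
    """
    last_login = {}
    durations = []
    for user, action, ts in events:
        t = datetime.fromisoformat(ts)
        if action == "login":
            last_login[user] = t
        elif action == "purchase" and user in last_login:
            durations.append((user, (t - last_login[user]).total_seconds()))
    return durations


# Toy log, sorted by time (made-up values for illustration only).
log = [
    ("u1", "login", "2024-01-01T10:00:00"),
    ("u2", "login", "2024-01-01T10:01:00"),
    ("u2", "purchase", "2024-01-01T10:03:00"),
    ("u1", "purchase", "2024-01-01T10:05:30"),
]
durations = time_to_purchase(log)
```

Purchases without a preceding login are skipped here; a real log would also need session boundaries and time-zone handling.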
Even when the same data is used to extract meaning, it can be interpreted in various ways depending on how it was pre-filtered and which data mining methods were used. The importance of pre-filtering and of appropriate data mining techniques should therefore be considered when mining the semantics of large and varied data. Research on unstructured, large, and varied data has begun in pursuit of more systematic and appropriate methods of collection and analysis.
In this study, Twitter data was collected, stored, and analyzed in a multi-dimensional fashion on top of the Hadoop platform, which is widely used for distributed computing, in order to find out which factors affect smartphone preference. The data, around 600,000 tweets or 2.5 GB, was collected over one month using smartphone-related keywords. The factors affecting smartphone preference were identified through multi-dimensional analysis after pre-filtering and natural language processing. The most serious problem is the quality of the results, which suffers largely from the shortage of samples caused by the short collection period (one month). Another major problem comes from synonyms, including acronyms, used on the Internet and on smartphones. However, these problems can be mitigated as the data collection period lengthens and the number of synonyms and acronyms in the dictionary increases.
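The pre-filtering, synonym handling, and multi-dimensional counting described above can be illustrated with a small single-machine sketch. It is not the Hadoop implementation used in the study: the dictionary entries, the brand and factor keyword sets, and the function names are hypothetical and chosen only for the example.

```python
import re
from collections import Counter

# Hypothetical synonym/acronym dictionary and keyword sets (assumptions).
SYNONYMS = {"batt": "battery", "cam": "camera"}
BRANDS = {"iphone", "galaxy"}
FACTORS = {"battery", "camera", "price"}


def normalize(text):
    """Lowercase, tokenize, and map synonyms/acronyms to a canonical term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [SYNONYMS.get(t, t) for t in tokens]


def count_brand_factor(tweets):
    """Count tweets per (brand, factor) pair, one count per tweet per pair."""
    counts = Counter()
    for tweet in tweets:
        tokens = set(normalize(tweet))
        brands = tokens & BRANDS
        if not brands:
            continue  # pre-filter: drop tweets that mention no tracked brand
        for brand in brands:
            for factor in tokens & FACTORS:
                counts[(brand, factor)] += 1
    return counts


# Made-up tweets for illustration only.
tweets = [
    "My iPhone batt dies so fast",
    "Galaxy cam is great, price too",
    "Nice weather today",
    "iphone camera > galaxy cam",
]
counts = count_brand_factor(tweets)
```

Without the dictionary, "batt" and "cam" would be counted separately from "battery" and "camera", which is the synonym problem noted above; enlarging the dictionary merges more of these variants into one dimension value.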