Today, many companies and government agencies use a variety of methods to develop and manage their personal information protection systems. In particular, under the Privacy Act, anyone who collects and uses personal information must publish a Privacy Policy, and that policy must include certain information about how the personal information is used. Recently, the Ministry of Security and Public Administration and other government agencies have been monitoring and evaluating the personal information protection systems of companies and government agencies by analyzing the information on their Privacy Policy web pages. For accurate evaluation and monitoring, collecting precise information is more important than the design of the evaluation method. So far, however, the evaluation process has been slow and inaccurate, because human investigators collected the information manually.
In this study, we developed a Document Object Model (DOM) based automatic information extraction system for Privacy Policy web pages. Existing web data extraction methodologies, which extract data by analyzing the patterns that recur across many similarly structured web pages, cannot be applied to Privacy Policy web pages: Privacy Policies follow no fixed rules or format, so no recurring patterns can be identified across them. To overcome this limitation, we developed a system that analyzes the structure of each individual Privacy Policy web page and extracts data from it. In addition to the Document Object Model, the system uses TF-IDF (Term Frequency-Inverse Document Frequency), cosine similarity, and Network Distance Based similarity (NDB similarity). The system automatically extracts data from Privacy Policy web pages, and it will be used to devise extraction methodologies for other unstructured web data.
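As a rough illustration of how TF-IDF weighting and cosine similarity can be combined with a DOM traversal of a single page, the following minimal sketch scores each text node of one HTML document against a keyword query. It is not the system described above: the names `BlockCollector`, `tfidf_vectors`, and `best_block`, the smoothed IDF formula, and the sample query are all assumptions made for illustration, and NDB similarity is left out because this abstract does not define it.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Hypothetical helper: records (DOM path, text) for each non-empty text node."""

    VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.blocks = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append(("/".join(self.stack), text))


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumption; the source does not state which variant it uses).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts], idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_block(html, query):
    """Return (score, DOM path, text) of the text node most similar to the query."""
    parser = BlockCollector()
    parser.feed(html)
    vectors, idf = tfidf_vectors([text for _, text in parser.blocks])
    # Query terms that never occur on the page get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), path, text)
              for v, (path, text) in zip(vectors, parser.blocks)]
    return max(scored) if scored else None
```

Because each text node of a single page is treated as its own "document", this sketch needs no collection of similarly structured pages, which is the constraint the abstract identifies for Privacy Policies. A real extractor would likely group text at the level of sections or tables rather than individual text nodes.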