Multilingual Abstract

In modern society, where the number of digital documents has grown exponentially, obtaining the important information within a document efficiently has become essential. The sheer volume of documents, however, makes it difficult for humans to identify and condense the important information in each one. Document summarization is a natural language processing task that extracts or generates sentences forming a text shorter than the original while preserving its key information. Because no Korean summarization dataset suitable as a benchmark exists, research has proceeded without a baseline, and progress in the field has been limited. In this paper, we select two document datasets that are accessible and verified and that differ in text characteristics. We also select BERT-based multilingual and Korean pre-trained language models and compare them experimentally. On Korean documents, the Korean pre-trained language models outperform the multilingual pre-trained language models in ROUGE scores. We analyze the cause through the extraction ratio of the selected summary sentences.

Abstract

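In modern society, where the number of digital documents has grown exponentially, efficiently obtaining the important information within a document has become an important requirement. However, the vast volume of digital documents has made it difficult to identify and condense the important information in individual documents. Document summarization is a field of natural language processing that extracts or generates important sentences while preserving the core information of the original document. Because Korean document data suitable for use as a benchmark are lacking and document summarization research has proceeded without a baseline, progress has been limited. In this paper, we selected two document collections that satisfy data verification and accessibility requirements and that differ in text characteristics. We selected BERT-based multilingual and Korean pre-trained language models and compared them experimentally. As the main result, the Korean pre-trained language models outperformed the multilingual pre-trained language models in ROUGE scores, and we analyzed the cause through the ratio of extracted summary sentences.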
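The ROUGE scores used for the comparison measure n-gram overlap between a system summary and a reference summary. The following is a minimal sketch of ROUGE-1 (unigram overlap), not the paper's evaluation code. The function name `rouge_1` and the whitespace tokenization are assumptions made for illustration; Korean text is usually tokenized at the morpheme or subword level, and the record does not state which ROUGE configuration the authors used.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Return ROUGE-1 precision, recall, and F1 for two texts.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    cand_tokens = candidate.split()
    ref_tokens = reference.split()

    # Clipped overlap: a token counts at most as many times as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge_1("the cat sat", "the cat sat on the mat")
    print(scores)
```

In this example the candidate's three tokens all appear in the six-token reference, so precision is 1.0 and recall is 0.5. An extractive model that selects too few sentences tends to show this pattern of high precision and low recall, which is one reason the ratio of extracted sentences can help explain differences in ROUGE.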
