News Trend Summarization Using TF-IDF-based Topic Guiding and Fine-tuned KoBART (TF-IDF 기반 토픽 가이딩과 Fine-tuned KoBART를 활용한 뉴스 동향 요약)

Multilingual Abstract

This study analyzes how faithfully generated summaries reflect the topic structure of the input data when generative summarization models are used for news trend analysis, and proposes an input-stage topic guiding method to improve that fidelity. Existing news summarization research has focused mainly on evaluating performance with reference-based similarity metrics such as ROUGE, or on improving model architectures and training techniques. These approaches do not fully explain how representatively a summary reflects the issue structure of an entire period, that is, whether the summary is suitable from a trend analysis perspective.

To address this gap, the study set period buckets of 1 day, 1 week, 1 month, and 3 months for Naver's economic and financial news data and constructed a set of news articles for each period. Topic keywords were first extracted for each period using TF-IDF analysis. Their representativeness was then re-evaluated using article-level coverage, which yielded a hierarchical topic cluster structure of Primary Topic Clusters (PTC), Secondary Topic Clusters (STC), and Tertiary Topic Clusters (TTC). Throughout this process, TF-IDF served only as a tool for identifying candidate issues that emerged during the period, not as a substitute for the summarization model's internal judgments.

The summary generation experiments used a fine-tuned KoBART model under two conditions, Unguided and Guided. The Unguided condition assumes that the user does not explicitly supply topic keywords, so the entire article set for the period is used as input. The Guided condition composes the input from only those articles that match the TF-IDF-based topic keywords. Both conditions share the same summarization model and generation settings, so any change in how topics are represented in the summaries can be attributed to the difference in input composition.
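The input-stage pipeline described above (TF-IDF keywords per period bucket, article-level coverage, PTC/STC/TTC assignment, and Guided filtering) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the whitespace tokenizer, the summed TF-IDF score, the coverage thresholds `primary=0.6` and `secondary=0.3`, every function name, and the English toy articles are assumptions made for the example. The abstract does not state the actual scoring rule or thresholds.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); Korean news text would
    # normally need a morphological analyzer instead.
    return text.lower().split()


def tfidf_keywords(articles, top_k=5):
    """Rank the terms of one period bucket by TF-IDF summed over its articles."""
    docs = [tokenize(a) for a in articles]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed IDF
            scores[term] += (count / len(doc)) * idf
    return [term for term, _ in scores.most_common(top_k)]


def article_coverage(articles, keyword):
    """Fraction of articles in the bucket that contain the keyword."""
    docs = [set(tokenize(a)) for a in articles]
    return sum(keyword in doc for doc in docs) / len(docs)


def assign_clusters(articles, keywords, primary=0.6, secondary=0.3):
    """Split keywords into PTC/STC/TTC by coverage (thresholds are hypothetical)."""
    clusters = {"PTC": [], "STC": [], "TTC": []}
    for kw in keywords:
        cov = article_coverage(articles, kw)
        tier = "PTC" if cov >= primary else "STC" if cov >= secondary else "TTC"
        clusters[tier].append(kw)
    return clusters


def guided_input(articles, keywords):
    """Guided condition: keep only articles that mention a topic keyword."""
    wanted = set(keywords)
    return [a for a in articles if wanted & set(tokenize(a))]


# Toy bucket of four invented headlines, not data from the study.
articles = [
    "rate hike lifts bank shares",
    "central bank signals rate hike",
    "rate hike fears weigh on bonds",
    "chipmaker exports rebound",
]
top_terms = tfidf_keywords(articles, top_k=2)
clusters = assign_clusters(articles, ["rate", "bank", "exports"])
guided = guided_input(articles, clusters["PTC"])
```

Under the Unguided condition the summarizer would receive all of `articles`; under the Guided condition it would receive only `guided`. The model and generation settings stay the same in both cases, which is what lets the comparison isolate the effect of input composition.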
The summaries were evaluated not on linguistic quality or similarity to reference summaries, but on how the topic keywords contained in each generated summary were distributed across PTC, STC, and TTC. The results showed that TF-IDF-based topic guiding was not equally effective across all period units. It meaningfully improved the concentration of summaries on core issues for medium-scale period data such as 1 week. The guiding effect was limited for short-term data with high issue density, such as 1-day data. For long-term data such as the 1-month and 3-month periods, a single summary faced structural limits in representing the topic structure of the entire period.

To explore alternatives to fine-tuning, the study also ran experiments that combined the base KoBART model with TF-IDF-based RAG (Retrieval-Augmented Generation). Natural sentences were generated under some conditions, but overall the combination did not reach performance consistent enough to reliably replace the fine-tuned model. This indicates that fine-tuning does not determine topics. Rather, it provides the linguistic foundation for reliably summarizing multi-article inputs and expressing guided information in coherent sentences.

In summary, the significance of this study lies not in directly improving the performance of generative summarization models, but in proposing a methodological foundation for a news trend summarization system that users without domain knowledge can use. It does so by separating topic guiding at the input stage from topic reflection evaluation at the output stage. This approach is expected to serve as a foundation for building input-design-centric generative systems in various application domains, including not only news summarization but also time-series document analysis and automated report generation.
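The output-stage evaluation described in the abstract, where summary keywords are checked against PTC, STC, and TTC, can be illustrated with a small sketch. The exact metric is not given in the abstract, so two things here are assumptions: each cluster's ratio is taken as its share of all matched keywords, and ΔPTC is taken as the Guided PTC share minus the Unguided PTC share. The function names, keyword clusters, and summary strings are hypothetical toy values, not results from the study.

```python
def topic_reflection(summary, clusters):
    """Share of matched topic keywords that fall in each cluster (assumed metric)."""
    tokens = set(summary.lower().split())  # naive whitespace matching (assumption)
    hits = {tier: sum(kw in tokens for kw in kws) for tier, kws in clusters.items()}
    total = sum(hits.values())
    return {tier: (n / total if total else 0.0) for tier, n in hits.items()}


def delta_ptc(guided_summary, unguided_summary, clusters):
    """Change in PTC share when moving from Unguided to Guided input."""
    guided = topic_reflection(guided_summary, clusters)["PTC"]
    unguided = topic_reflection(unguided_summary, clusters)["PTC"]
    return guided - unguided


# Hypothetical clusters and summaries, used only to exercise the functions.
clusters = {"PTC": ["rate", "hike"], "STC": ["bank"], "TTC": ["exports"]}
unguided_summary = "exports rebound while bank shares rise on rate news"
guided_summary = "rate hike expected as central bank tightens"

unguided_ratio = topic_reflection(unguided_summary, clusters)
guided_ratio = topic_reflection(guided_summary, clusters)
gain = delta_ptc(guided_summary, unguided_summary, clusters)
```

A positive `gain` means the Guided summary concentrates more of its matched keywords in the primary cluster. A measure of this kind says nothing about fluency or factual accuracy, which matches the abstract's statement that linguistic quality was not the evaluation target.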

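Korean Abstract (translated)

This study analyzes how faithfully summary outputs reflect the topic structure of the input data in news trend analysis with generative summarization models, and proposes an input-stage topic guiding method to improve it. Prior news summarization research has mainly evaluated performance with reference-summary-based similarity metrics such as ROUGE, or has focused on improving model architectures and training techniques. Such approaches cannot adequately explain how representatively a summary reflects the issue structure of a whole period, that is, its suitability from a trend analysis perspective.

To address this, the study sets period units (buckets) of 1day, 1week, 1month, and 3months over economic and financial news data and builds a set of news articles for each period. It first extracts topic keywords for each period through TF-IDF analysis, then re-evaluates the representativeness of the topics by article-level coverage, and thereby defines hierarchical topic clusters consisting of PTC, STC, and TTC. In this process, TF-IDF is used strictly as a tool for identifying candidate issues that appeared in the period, not as a criterion that replaces the internal judgment of the summarization model.

The summary generation experiments use a Fine-tuned KoBART model and are divided into an Unguided method and a Guided method. The Unguided method assumes that the user does not explicitly enter topic keywords and takes the entire article set of the period as input. The Guided method filters only the relevant articles by TF-IDF-based topic keywords and composes the input from them. Both methods keep the same summarization model and generation conditions, so that changes in the topic reflection of the summaries can be compared as a function of input composition. The summaries are evaluated not by linguistic quality or similarity to reference summaries, but by analyzing how the topic keywords contained in the generated summaries are distributed across PTC, STC, and TTC.

The experiments show that TF-IDF-based topic guiding does not have the same effect in every period unit. It meaningfully improves the concentration of summaries on core issues for medium-scale period data such as 1week. For short-term data with high issue density, such as 1day, the guiding effect is limited. For long-term data such as 1month and 3months, a single summary has structural limits in representing the topic structure of the whole period.

In addition, to examine whether Fine-tuning can be replaced, the study runs experiments that combine the Base KoBART model with TF-IDF-based RAG (Retrieval-Augmented Generation). Natural sentences are generated under some conditions, but overall the combination does not secure performance consistent enough to stably replace the Fine-tuned model. This confirms that Fine-tuning does not decide topics. It provides the linguistic foundation for stably summarizing multi-article input and expressing guided information in consistent sentences.

Taken together, the significance of this study is that, rather than directly improving the performance of generative summarization models, it separates topic guiding at the input stage from topic reflection evaluation at the output stage, and thereby presents a methodological foundation for a news trend analysis summarization system that users without domain knowledge can also use. The results are expected to serve as a basis for building input-design-centric generative systems in various applications, including not only news summarization but also time-series document analysis and automated report generation.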

Table of Contents

List of Tables
List of Figures
Glossary of Terms
Abstract in Korean

1. Introduction
1.1. Research Background
1.2. Research Objectives
2. Related Work
2.1. Trends in News Summarization Research
2.2. Trends in TF-IDF-based Topic Guiding Research
2.3. Trends in Research on Fine-tuning and Time-Series Data Partitioning
3. Methodology
3.1. Overview of the Methodology
3.2. Dataset Construction
3.2.1. Constructing Datasets by Period Unit
3.3. TF-IDF-based Topic Keyword Extraction
3.3.1. Redefining the Role of TF-IDF Keywords
3.3.2. TF-IDF from the Perspective of User Input Guiding
3.3.3. Connection to Article-Level Coverage-based Sets
3.4. Defining Article-Level Coverage-based Sets
3.4.1. Definition of Article-Level Coverage
3.4.2. Defining Coverage-based Topic Sets
3.4.3. Defining TF-IDF-based Topic Sets
3.5. Defining Coverage-based Topic Clusters (PTC/STC/TTC)
3.6. Design of Topic-Guided Summary Generation Experiments
3.6.1. Preconditions of the Experimental Design
3.6.2. Unguided Summary Generation Experiment
3.6.3. Guided Summary Generation Experiment
3.6.4. Guided vs. Unguided Evaluation Criteria
3.6.5. Redefining the Role of Fine-tuning
3.6.6. RAG-based Summarization Experiment
3.7. Defining Topic Reflection Evaluation Metrics
3.7.1. Matching Topic Keywords in Summaries
3.7.2. Computing Topic Reflection Ratios per Cluster
3.7.3. Interpretive Scope and Limitations of the Metrics
4. Experiments and Analysis of Results
4.1. Summary of Experimental Settings
4.2. Analysis of Fine-tuned KoBART Summarization Results
4.2.1. Experimental Data Composition and Model Settings
4.2.2. Definition of Guided / Unguided Input Methods
4.2.3. Comparison of Topic Reflection Distributions by Bucket
4.2.4. Analysis of Topic Reflection Characteristics by Period Unit
4.3. Analysis of Additional RAG-based Experiment Results (Exploratory Study)
4.3.1. Experiment Purpose and Settings
4.3.2. RAG-based Summary Generation Results
4.3.3. Case Analysis of Summaries Generated by Fine-tuning vs. RAG
4.3.4. Incremental Analysis of Topic Reflection under Guided Input (ΔPTC)
4.3.5. Section Summary
5. Conclusion
5.1. Summary of the Study
5.2. Key Experimental Results and Implications
5.3. Significance and Limitations of the Additional RAG Experiments
5.4. Contributions and Future Work
5.5. Conclusion
References
ABSTRACT
