A Modular Framework for Visual Emotion Analysis and Captioning with Psychological Insights

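Abstract (translated from Korean)

Visual emotion analysis aims to predict the emotion a person is most likely to feel in response to an image. Emotion-aware captioning, by contrast, focuses on describing the visual content of an image while also expressing in language the emotional response the image evokes. Visual emotion analysis is usually treated as a classification problem, but the subjectivity and psychological complexity of human emotion pose distinct challenges for both predictive performance and interpretability. Guided by psychological insights, this study uses a vision-language model to extract emotionally salient textual cues from images and feeds them into a text-based classification pipeline to predict emotion. The predicted emotion and the textual cues are also combined to lay the groundwork for emotion-aware caption generation. To this end, a small task-specific dataset is built with a large language model, and a small language model is fine-tuned on it. Although only the text classifier is fine-tuned and the vision-language model is left unmodified, the proposed approach improves visual emotion analysis performance by more than 30%. The generated emotion-aware captions are also found to be contextually natural and effective at distinguishing between different emotions. Because the framework is modular, improvements to any component carry over to overall performance, and it remains flexibly compatible with the rapid evolution of foundation models.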

Multilingual Abstract

Visual Emotion Analysis (VEA) aims to predict the emotion that humans are most likely to experience when viewing an image, while emotion-aware captioning focuses on generating captions that not only describe image content but also explain the emotional response it evokes. VEA is typically framed as a classification task; however, unlike standard visual classification, it must contend with the inherently subjective and psychologically complex nature of human emotions. Despite progress in this field, this characteristic poses unique challenges for both performance and interpretability. This work explores using vision-language models (VLMs) to extract emotionally relevant textual cues from images, guided by psychological insights. These cues are then used in a simple text-based classification pipeline for emotion prediction. The extracted information and predicted emotion also serve as the foundation for generating emotion-aware captions that explain the rationale behind each emotional interpretation. To achieve this, a small task-specific dataset is first constructed using large language model APIs, and a smaller language model is then fine-tuned on this dataset for the same task. Without modifying the underlying VLM and solely by fine-tuning the selected text classifier, this approach improves VEA performance by over 30%. Moreover, the generated captions are not only contextually relevant but also more effective at distinguishing between different emotions. Furthermore, the proposed framework is inherently plug-and-play, allowing enhancements to any individual component to directly translate into overall performance gains, naturally aligning with the rapid evolution of foundation models.
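The plug-and-play structure described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the class `EmotionPipeline`, the helper `format_description`, the attribute names (`scene`, `action`), the template wording, and the three toy functions are all hypothetical stand-ins, chosen only to show how a cue extractor, a text classifier, and a captioner could be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical component interfaces (assumptions, not the thesis's API).
CueExtractor = Callable[[str], Dict[str, str]]  # image id -> textual cues (VLM stand-in)
TextClassifier = Callable[[str], str]           # description -> emotion label
Captioner = Callable[[str, str], str]           # (description, emotion) -> caption


def format_description(cues: Dict[str, str]) -> str:
    """Fill a fixed sentence template with the cues, in sorted key order."""
    return " ".join(f"The {name} is {value}." for name, value in sorted(cues.items()))


@dataclass
class EmotionPipeline:
    """Chains the three components; each one can be replaced on its own."""
    extract: CueExtractor
    classify: TextClassifier
    caption: Captioner

    def run(self, image: str) -> Dict[str, str]:
        description = format_description(self.extract(image))
        emotion = self.classify(description)
        return {
            "description": description,
            "emotion": emotion,
            "caption": self.caption(description, emotion),
        }


# Toy stubs so the sketch runs without any model.
def toy_extract(image: str) -> Dict[str, str]:
    return {"scene": "a sunny beach", "action": "children laughing"}


def toy_classify(description: str) -> str:
    return "amusement" if "laughing" in description else "sadness"


def toy_caption(description: str, emotion: str) -> str:
    return f"{description} This likely evokes {emotion}."


pipeline = EmotionPipeline(toy_extract, toy_classify, toy_caption)
result = pipeline.run("beach.jpg")

# Swapping only the classifier leaves the other components untouched.
swapped = EmotionPipeline(toy_extract, lambda description: "contentment", toy_caption)
swapped_result = swapped.run("beach.jpg")
```

In this sketch the vision-language model is reached only through `extract`, so a classifier could be fine-tuned or replaced without touching it, which mirrors the claim that only the text classifier is fine-tuned.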

Table of Contents

Abstract
I Introduction
II Related Work
2.1 Emotions and Emotion Analysis
2.2 Automatic Visual Emotion Analysis
2.3 Language Models
2.3.1 Transformer
2.3.2 BERT
2.3.3 GPT Models
2.3.4 Phi Models
2.4 Vision-Language Models
2.4.1 BLIP Models
III Method
3.1 Conceptual Overview
3.2 Emotion-aware Description
3.2.1 Emotion-aware Attributes
3.2.2 Emotion-aware Template Formatting
3.3 Visual Emotion Analysis
3.4 Emotion-aware Captioning
3.4.1 Captioning with Larger Language Model APIs
3.4.2 Captioning with Fine-tuned Smaller Language Models
IV Experiments and Results
4.1 Datasets
4.2 Evaluation Metrics
4.3 Implementation Details
4.4 Performance Evaluations
4.4.1 Emotion Prediction on EmoSet
4.4.2 Emotion Prediction on Other Datasets
4.4.3 Emotion-aware Captioning Evaluation
4.4.4 Ablation and Replacement Studies
4.4.5 Visualizations
V Conclusion
References
Abstract in Korean
