Analyzing the Impact of Text Input Types on Vision-Language Models for Fashion Style Classification (original Korean title: 패션 스타일 분류를 위한 Vision-Language 모델에서 텍스트 입력 유형이 성능에 미치는 영향 분석)

Korean Abstract

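Recent fashion recommendation and style analysis systems have relied mainly on collaborative filtering based on user purchase histories or click logs, or on vision-centric models that focus on image features. These approaches can be effective when sufficient user behavior data is available, but they suffer from the cold-start problem for new or sparse items and are limited in how well they reflect the abstract style concepts that clothing carries. Because fashion style includes contextual and semantic characteristics as well as visual elements such as color and silhouette, expressing it more precisely requires a multimodal approach that considers image and text information together.

Starting from this problem, this study compares the performance of a Vision-only model and Vision-Language models on a fashion style classification task under identical conditions, and empirically analyzes how differences in the type and abstraction level of text data affect model performance. Whereas prior Vision-Language research has focused on the premise that combining images and text in itself improves performance, this study aims to determine how the text input method and its fit with the task, rather than the mere presence of text, affect actual classification performance.

To this end, a 14-label style scheme was constructed from the clothing images and product metadata of the public H&M dataset, and three models were compared under the same data split and evaluation environment. The Vision-only model classified style using image information alone. The Vision-Language models kept the same network architecture and were divided into a VL-Style model, which uses abstract style description text, and a VL-Product model, which uses actual product description text. A hybrid training scheme combining a classification loss with a contrastive loss for image–text alignment was applied to the Vision-Language models, so that style classification performance and semantic representation learning were considered together. In addition, sensitive categories such as underwear and nightwear were excluded during preprocessing, reflecting both training data consistency and ethical considerations.

In the experiments, the Vision-only model showed stable style classification performance even without text information, which indicates that visual information plays a central role in fashion style classification. The VL-Style model showed a slight improvement over the Vision-only model and was competitive in both style classification and retrieval, confirming the importance of text condition design. In contrast, the VL-Product model, which used actual product descriptions as-is, showed performance variance and an overall drop in performance due to the diversity and noise of its text. Class-wise performance analysis and failure case analysis confirmed that the attribute listings and marketing copy in product descriptions act as noise for style judgment and are a main cause of weakened semantic alignment between image and text.

The significance of this study is that it demonstrates experimentally that the text modality in Vision-Language models does not always guarantee better performance, and that the fit between the text's abstraction level and the task objective is a key factor determining model performance. These results present Vision-Language methods as a possible alternative that can compensate for the limits of image-centric approaches, and they provide practical design criteria that can be extended to applications such as fashion recommendation, style matching, and personalized services.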

Multilingual Abstract

Recent fashion recommendation and style analysis systems have predominantly relied on collaborative filtering techniques based on user purchase histories or click logs, as well as vision-centric models that focus solely on visual features. While such approaches can be effective when sufficient user interaction data are available, they often suffer from cold-start issues for new or sparse items and show limitations in capturing the abstract and semantic nature of fashion styles. Since fashion styles encompass not only visual attributes such as color and silhouette but also contextual and semantic characteristics, multimodal approaches that jointly consider image and text information have gained increasing attention.
Motivated by this observation, this study investigates fashion style classification by systematically comparing Vision-only models and Vision-Language (VL) models under identical experimental conditions, with a particular focus on how different types and abstraction levels of text inputs affect model performance. Unlike prior studies that primarily emphasize the effectiveness of combining image and text modalities, this work aims to empirically examine whether and under what conditions textual information contributes to improved classification performance.
To this end, a 14-class fashion style labeling scheme was constructed using clothing images and product metadata from the publicly available H&M dataset. Three models were evaluated using the same data splits and evaluation protocols: a Vision-only model utilizing image information alone, a VL-Style model employing abstract style description templates, and a VL-Product model using raw product description texts. The Vision-Language models adopted a hybrid learning framework that combines classification loss with contrastive loss to jointly optimize style classification accuracy and image–text embedding alignment. In addition, sensitive categories such as underwear and nightwear were excluded during data preprocessing, both to keep the dataset consistent and to address ethical considerations.
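The hybrid objective described above can be illustrated with a minimal sketch. The abstract does not state the exact form of the contrastive term or how the two losses are weighted, so everything specific here is an assumption: a CLIP-style symmetric InfoNCE loss stands in for the contrastive term, and the function names, the weight `lam`, and the `temperature` value are hypothetical. Plain Python is used so the arithmetic is visible; a real training setup would use a deep learning framework.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch of logit rows and integer targets."""
    losses = [-log_softmax(row)[t] for row, t in zip(logits, targets)]
    return sum(losses) / len(losses)

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE (assumed form): pair i is (img_emb[i], txt_emb[i])."""
    img = [l2_normalize(v) for v in img_emb]
    txt = [l2_normalize(v) for v in txt_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-image direction
    idx = list(range(len(img)))               # the matching pair is on the diagonal
    return 0.5 * (cross_entropy(sim, idx) + cross_entropy(sim_t, idx))

def hybrid_loss(class_logits, labels, img_emb, txt_emb, lam=0.5, temperature=0.07):
    """Classification loss plus lam times the contrastive loss (lam is hypothetical)."""
    cls = cross_entropy(class_logits, labels)
    con = contrastive_loss(img_emb, txt_emb, temperature)
    return cls + lam * con
```

One design caveat under these assumptions: when several images in a batch share the same style template text, the diagonal-only targets treat identical texts as negatives, so a template-based setup may need label-aware targets instead.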
Experimental results demonstrate that the Vision-only model achieves stable and competitive performance despite the absence of textual input, highlighting the dominant role of visual information in fashion style classification. The VL-Style model yields a modest but consistent improvement over the Vision-only baseline, indicating that carefully designed abstract style descriptions can serve as effective auxiliary information. In contrast, the VL-Product model exhibits notable performance degradation and instability, largely due to noise and heterogeneity in raw product descriptions. Further class-wise analysis and qualitative failure case studies reveal that attribute-level details and marketing-oriented expressions in product texts often interfere with style recognition by weakening semantic alignment between image and text representations.
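A class-wise comparison of the kind mentioned above could be computed along the following lines. This is only a sketch: the abstract does not say which per-class metric the thesis reports, so per-class recall is assumed here, and the function names are hypothetical.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}

def recall_gap(y_true, pred_a, pred_b):
    """Per-class recall of model B minus model A on the same test labels."""
    ra = per_class_recall(y_true, pred_a)
    rb = per_class_recall(y_true, pred_b)
    return {c: rb[c] - ra[c] for c in ra}
```

Calling `recall_gap` with the Vision-only predictions as `pred_a` and a Vision-Language model's predictions as `pred_b` would show, per style class, where adding text helped (positive gap) and where it hurt (negative gap).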
This study provides empirical evidence that textual information in Vision-Language models does not inherently guarantee performance gains; rather, the alignment between text abstraction level and task objectives plays a critical role. By highlighting the limitations of naively incorporating raw textual descriptions, this work offers practical insights into text design strategies for fashion Vision-Language models and suggests meaningful directions for extending multimodal approaches to fashion recommendation, style matching, and personalized fashion services.

Table of Contents

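List of Tables
List of Figures
Glossary of Terms
Korean Abstract
1. Introduction
1.1 Research Background
1.2 Distinctive Characteristics of the Fashion Style Classification Problem
1.3 Limitations of Existing Approaches
1.3.1 Limitations and Potential of the Vision-only Approach
1.3.2 Premises and Potential Blind Spots of the Vision-Language Approach
1.4 Research Objectives and Research Questions
1.5 Research Objectives and Contributions
1.6 Thesis Organization
2. Related Work
2.1 Traditional Fashion Recommendation Systems
2.2 Content-Based Recommendation Systems
2.3 Collaborative Filtering-Based Recommendation Systems
2.4 Vision-Language Models and the CLIP Architecture
2.5 Summary and Positioning of This Study
2.6 Limitations of Prior Work and Distinctiveness of This Study
3. Methodology
3.1 Research Overview and Design Principles
3.2 Overall Experimental Pipeline
3.3 Dataset Construction and Preprocessing
3.3.1 Dataset Overview
3.3.2 Data Splitting and Preprocessing
3.4 Style Class Definitions
3.4.1 List of Style Classes
3.4.2 Criteria for Distinguishing Similar Style Classes
3.5 Model Configuration
3.5.1 Vision-only Model
3.5.2 Vision-Language Model
3.6 Training Strategy and Loss Functions
3.6.1 Classification Loss
3.6.2 Contrastive Loss
3.6.3 Final Loss Function
3.7 Training Algorithm
3.8 Evaluation Method
3.9 Summary of Methodology
4. Experiments and Results
4.1 Experimental Setup and Evaluation Criteria
4.2 Overall Performance Comparison
4.3 Comparative Analysis of Vision-only and Vision-Language Models
4.4 Class-wise Model Performance Analysis
4.4.1 Purpose and Method of Analysis
4.4.2 Class-wise Performance Comparison Results
4.4.3 Class Dependence on Text Data
4.5 Failure Case Analysis
4.5.1 Summary of Failure Cases
4.6 Summary
5. Conclusion and Future Work
5.1 Conclusion
5.2 Limitations of the Study
5.3 Future Research Directions
References
ABSTRACT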
