Synthetic Data-based Knowledge Transfer from Foundation Models to Visual Downstream Tasks

Multilingual Abstract

In data-scarce settings such as few-shot learning and low-resource cultural domains, training reliable visual downstream models is fundamentally limited. Collecting diverse training data is time-consuming and expensive, and conventional data augmentation methods are confined to simple transformations of observed samples, which limits their ability to supply missing semantic and visual information. This dissertation treats Large Language Models (LLMs) and Diffusion Models (DMs) not as mere generators but as sources of knowledge, and proposes a framework that structures their internal knowledge into synthetic data and transfers it to visual downstream tasks. The core objectives are: (1) generating synthetic images with high diversity within the target distribution, (2) mitigating data scarcity in one-shot image classification and culture-specific text-to-image (T2I) generation, and (3) achieving both computational efficiency and performance improvements without additional DM fine-tuning.

The proposed framework consists of three stages. First, in the DALDA stage, we combine the semantic knowledge of LLMs with the visual priors of DMs to synthesize novel scenarios that go beyond class names, thereby strengthening sample-level knowledge transfer. By adaptively adjusting per-sample guidance based on CLIP similarity, we reduce the risk of generating out-of-distribution samples and improve one-shot classification performance without DM fine-tuning.

Second, in the CABIN stage, we use the LLM not as a simple prompt generator but as a tool for constructing class-wise descriptors, and design two complementary synthesis modes: diversity-centric (broad exploration of the visual-semantic space) and key-feature-centric (emphasis on class-invariant attributes). By sampling scenarios from descriptor sets pre-constructed for each class, we significantly reduce LLM token costs while limiting the bias introduced into the synthetic dataset. Through a multi-stage learning strategy that exploits the complementarity of the two modes, we achieve significant accuracy improvements over existing methods across various few-shot benchmarks, again without DM fine-tuning.

Third, in the KoDi stage, we extend the framework to a new domain, Korean culture-specific T2I generation, to validate its generalizability. To this end, we construct the Korean Cultural Dataset (KCD) and assign three caption types to each image (Korean, semantic English translation, and phonetic romanization) to prevent loss of cultural meaning during synthesis. Building on this, we propose KoDi, a bilingual T2I model, and apply CABIN's diversity-centric synthesis strategy to demonstrate that the framework transfers effectively even to culturally specific settings. Using MetaCLIP-based metrics and Large Vision-Language Model evaluation, we assess both visual quality and cultural fidelity and show that our approach represents Korean culture better than existing multilingual T2I models.

In summary, this dissertation demonstrates that explicit data design, combining diffusion model-based data augmentation with the internal knowledge of LLMs, enables efficient knowledge transfer from foundation models to visual downstream tasks. Moving beyond naive expansion of real data, the proposed approach improves performance and generalization in data-scarce settings through purpose-driven data design that balances diversity, distribution alignment, and cost efficiency, and it provides practical guidelines for scalable synthetic data generation and deployment.
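
The per-sample adaptive guidance of the DALDA stage can be illustrated with a minimal sketch. The abstract does not state the actual rule, so everything here is an assumption: the embeddings are taken to be precomputed by a CLIP encoder, the function names are hypothetical, and the linear mapping and its thresholds are illustrative choices. The idea shown is that a real sample whose image embedding aligns poorly with its class text gets stronger image conditioning (to stay near the target distribution), while a well-aligned sample gets weaker image conditioning (to leave room for text-driven diversity).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def image_guidance_weight(
    similarity: float,
    sim_low: float = 0.20,   # hypothetical: at or below this, alignment counts as poor
    sim_high: float = 0.35,  # hypothetical: at or above this, alignment counts as good
    w_min: float = 0.3,      # hypothetical: weakest image conditioning
    w_max: float = 0.9,      # hypothetical: strongest image conditioning
) -> float:
    """Map an image-text similarity to an image-conditioning weight.

    Assumed rule (not taken from the dissertation): linear interpolation,
    clamped so the weight always stays inside [w_min, w_max].
    """
    if sim_high <= sim_low:
        raise ValueError("sim_high must be greater than sim_low")
    # Relative position of the similarity inside [sim_low, sim_high].
    t = (similarity - sim_low) / (sim_high - sim_low)
    t = min(1.0, max(0.0, t))
    # Poor alignment (t = 0) gives w_max; good alignment (t = 1) gives w_min.
    return w_max - t * (w_max - w_min)
```

In a real pipeline the returned weight would be passed to whatever image-conditioning control the diffusion pipeline exposes; that wiring is omitted because it depends on the specific pipeline.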

Korean Abstract

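In data-scarce settings such as few-shot learning and low-resource cultural domains, there are fundamental limits to training reliable visual downstream models. Collecting diverse training data is costly in both time and money, and existing data augmentation methods stop at simple transformations of observed samples, so they are limited in their ability to make up for missing semantic and visual information. This dissertation reinterprets large language models (LLMs) and diffusion models (DMs) as sources of knowledge rather than mere generators, and proposes a framework that organizes their internal knowledge into synthetic data and transfers it to visual downstream tasks. The core objectives are (1) generating synthetic images with high diversity within the target distribution, (2) alleviating data scarcity in few-shot image classification and culture-specific text-to-image (T2I) generation, and (3) simultaneously achieving computational cost efficiency and performance gains without additional diffusion model fine-tuning.

The proposed framework consists of three stages. First, the DALDA stage combines the semantic knowledge of LLMs with the visual prior knowledge of DMs to synthesize new scenarios that go beyond class names, strengthening knowledge transfer at the sample level. It adaptively adjusts per-sample guidance based on CLIP similarity to reduce the risk of generating out-of-distribution samples, and improves one-shot classification performance without DM fine-tuning.

Second, the CABIN stage reinterprets the LLM as a tool for building class-wise descriptors rather than a simple prompt generator, and designs two complementary synthesis modes: diversity-centric (broad exploration of the visual-semantic space) and key-feature-centric (emphasis on invariant attributes specific to each class). Sampling scenarios from descriptor sets built in advance for each class substantially reduces LLM token costs while suppressing bias entering the synthetic dataset. A multi-stage learning strategy that exploits the complementarity of the two modes achieves meaningful accuracy improvements over existing methods on a variety of few-shot benchmarks, again without DM fine-tuning.

Third, the KoDi stage extends the framework to a new area, Korean culture-specific T2I generation, in order to verify its generalizability. To this end, the Korean Cultural Dataset (KCD) is constructed, and each image is given three caption types (Korean, semantic English translation, and phonetic romanization) to prevent loss of cultural meaning during synthesis. On this basis, the bilingual T2I model KoDi is proposed, and CABIN's diversity-centric synthesis strategy is applied to show that the proposed framework transfers effectively even to culturally specific settings. Evaluation of visual quality and cultural fidelity with MetaCLIP-based metrics and a large vision-language model shows better representation of Korean culture than existing multilingual T2I models.

In sum, this dissertation shows that explicit data design combining diffusion model-based data augmentation with the internal knowledge of LLMs makes efficient knowledge transfer from foundation models to visual downstream tasks possible. Going beyond simple expansion of real data, purpose-driven data design that balances diversity, distribution alignment, and cost efficiency improves performance and generalization in data-scarce settings, and the dissertation offers concrete guidelines for scalable synthetic data generation and use.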
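
The descriptor-based sampling of the CABIN stage might look roughly like the following sketch. It is not the dissertation's implementation: the descriptor contents, the prompt template, and the function name `sample_prompts` are all hypothetical. It only illustrates why pre-built per-class descriptor sets save LLM tokens: the sets are assumed to be produced once per class, after which every prompt is drawn locally with no further LLM call.

```python
import random
from typing import Dict, List

# Illustrative descriptor sets, assumed to be generated once per class
# (for example by a single LLM query) and then reused for every image.
DESCRIPTORS: Dict[str, Dict[str, List[str]]] = {
    "golden retriever": {
        # Diversity-centric mode: varied scenes around the class.
        "diversity": [
            "running on a beach at sunset",
            "sitting in a snowy park",
            "lying on a sofa in a living room",
        ],
        # Key-feature-centric mode: attributes that stay constant for the class.
        "key_feature": [
            "with a long golden coat",
            "with floppy ears",
            "with a broad head and dark eyes",
        ],
    },
}


def sample_prompts(class_name: str, mode: str, n: int, seed: int = 0) -> List[str]:
    """Draw n text-to-image prompts for one class from its pre-built descriptors."""
    descriptors = DESCRIPTORS[class_name][mode]
    rng = random.Random(seed)  # seeded so the same prompt list can be regenerated
    return [
        f"a photo of a {class_name} {rng.choice(descriptors)}"
        for _ in range(n)
    ]
```

Calling `sample_prompts("golden retriever", "diversity", 4)` yields four scene prompts, and switching `mode` to `"key_feature"` yields attribute-focused prompts from the same class entry. How the two resulting image sets are scheduled during training is a separate step that this sketch does not cover.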

Table of Contents