In data-scarce environments such as few-shot learning and low-resource cultural domains, training reliable visual downstream models faces fundamental limitations. Collecting diverse training data is time-consuming and expensive, and conventional data augmentation methods are confined to simple transformations of observed samples, limiting their ability to supplement insufficient semantic and visual information. This dissertation reinterprets Large Language Models (LLMs) and Diffusion Models (DMs) not as mere generators but as sources of knowledge, proposing a framework that structures their internal knowledge into synthetic data and transfers it to visual downstream tasks. The core objectives are: (1) generating synthetic images with high diversity within the target distribution, (2) mitigating data scarcity in one-shot image classification and culture-specific text-to-image (T2I) generation, and (3) achieving both computational efficiency and performance improvements without additional DM fine-tuning.

The proposed framework consists of three stages. First, in the DALDA stage, we combine the semantic knowledge of LLMs with the visual priors of DMs to synthesize novel scenarios beyond class names, thereby strengthening sample-level knowledge transfer. By adaptively adjusting per-sample guidance based on CLIP similarity, we reduce the risk of generating out-of-distribution samples and improve one-shot classification performance without DM fine-tuning.

Second, in the CABIN stage, we reinterpret the LLM not as a simple prompt generator but as a tool for constructing class-wise descriptors, designing two complementary synthesis modes: diversity-centric (broad exploration of the visual-semantic space) and key-feature-centric (emphasizing class-invariant attributes). By sampling scenarios from pre-constructed descriptor sets for each class, we significantly reduce LLM token costs while suppressing bias introduction into the synthetic dataset.
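As a rough illustration of sampling scenarios from pre-constructed class-wise descriptor sets, the sketch below composes prompts for the two synthesis modes without querying an LLM per image. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the descriptor categories, the example class, the prompt templates, and the names `DESCRIPTORS`, `build_prompt`, and `sample_prompts` are all hypothetical.

```python
import random

# Hypothetical descriptor sets. The assumption is that an LLM is queried once
# per class to build them, after which prompts are sampled locally.
DESCRIPTORS = {
    "golden retriever": {
        "diversity": {
            "background": ["on a beach", "in a snowy park", "inside a living room"],
            "viewpoint": ["seen from the side", "in a close-up", "seen from above"],
            "lighting": ["at sunset", "under overcast light", "in bright daylight"],
        },
        "key_features": ["a dense golden coat", "floppy ears", "a feathered tail"],
    },
}


def build_prompt(class_name, mode, rng):
    """Compose one text prompt for a class in the requested synthesis mode."""
    entry = DESCRIPTORS[class_name]
    if mode == "diversity":
        # Diversity-centric: draw one option from every descriptor category.
        parts = [rng.choice(options) for options in entry["diversity"].values()]
        return f"a photo of a {class_name} " + ", ".join(parts)
    if mode == "key_feature":
        # Key-feature-centric: emphasize two distinct class-invariant attributes.
        features = rng.sample(entry["key_features"], k=2)
        return f"a photo of a {class_name} with " + " and ".join(features)
    raise ValueError(f"unknown mode: {mode}")


def sample_prompts(class_name, mode, n, seed=0):
    """Sample n prompts reproducibly from the stored descriptor sets."""
    rng = random.Random(seed)
    return [build_prompt(class_name, mode, rng) for _ in range(n)]
```

Because the descriptor sets are stored, generating more prompts in this sketch costs only local sampling; the LLM token cost is paid once per class rather than once per synthetic image.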
Through a multi-stage learning strategy that leverages the complementarity of these two modes, we achieve significant accuracy improvements over existing methods across various few-shot benchmarks, again without DM fine-tuning.

Third, in the KoDi stage, we extend the proposed knowledge transfer framework to a new domain, Korean culture-specific T2I generation, to validate its generalizability. To this end, we construct the Korean Cultural Dataset (KCD) and assign three caption types to each image (Korean, semantic English translation, and phonetic romanization) to prevent loss of cultural meaning during synthesis. Building upon this, we propose KoDi, a bilingual T2I model, and apply CABIN’s diversity-centric synthesis strategy to demonstrate that the proposed framework transfers effectively even to culturally specific environments. Using MetaCLIP-based metrics and Large Vision-Language Model evaluation, we assess both visual quality and cultural fidelity, and show that our approach represents Korean culture more faithfully than existing multilingual T2I models.

In summary, this dissertation demonstrates that explicit data design combining diffusion model-based data augmentation with the internal knowledge of LLMs enables efficient knowledge transfer from foundation models to visual downstream tasks. Moving beyond naive expansion of real data, the proposed approach improves performance and generalization in data-scarce environments through purpose-driven data design that balances diversity, distribution alignment, and cost efficiency, while providing practical guidelines for scalable synthetic data generation and deployment.
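The per-sample adaptive guidance described for the DALDA stage can be illustrated with a small sketch. The abstract does not state the exact rule, so everything here is an assumption: a linear mapping from a CLIP-style cosine similarity to a hypothetical image-conditioning weight, with made-up thresholds and bounds. Plain Python lists stand in for CLIP embeddings so the example stays self-contained.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def adaptive_image_weight(similarity, sim_low=0.15, sim_high=0.35,
                          w_min=0.2, w_max=0.8):
    """Map a similarity score to a per-sample image-conditioning weight.

    Assumed rule (illustrative only): when the reference image matches the
    text poorly (low similarity), rely more on the image to stay within the
    target distribution; when it matches well, allow more text-driven
    variation. All thresholds and bounds are hypothetical defaults.
    """
    # Clamp the similarity into [sim_low, sim_high], then normalize to [0, 1].
    clamped = min(max(similarity, sim_low), sim_high)
    t = (clamped - sim_low) / (sim_high - sim_low)
    # Interpolate linearly from w_max (low similarity) to w_min (high similarity).
    return w_max - t * (w_max - w_min)
```

In a real pipeline the similarity would come from a CLIP image encoder and text encoder, and the resulting weight would be passed to whatever conditioning mechanism the diffusion model exposes; both of those pieces are outside this sketch.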