Zero-Training for Image Editing Model based on MMDiT Architecture and Attention Control (MMDiT 구조와 Attention control 을 이용한 이미지 편집 모델을 위한 Zero-Training 기법)

Korean Abstract

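This study proposes a Zero-Training method for image editing that uses the pre-trained Stable Diffusion 3.5 Medium model. Existing image editing models require separate mask creation or model retraining for fine-grained edits such as object replacement, attribute modification, and style adjustment. This study uses BLIP-based automatic captioning to extract the semantic information of the input image and automatically constructs a source prompt that preserves the original meaning. The user-provided edit prompt and negative prompt define the object to edit and the elements to suppress, and act as control signals for the entire editing process. The location of the editing target is generated automatically during inference, without an external mask. The method estimates and accumulates per-word Attention Maps from the Cross-Attention structure of Stable Diffusion 3.5 in real time to generate a text-based automatic mask. In addition, the object-level Attention manipulation of PixelMan is reinterpreted for the MMDiT architecture: Attention corresponding to the target object is selectively strengthened while changes to non-target regions are suppressed, which yields stable edits. Editing strength is controlled using only inference-stage parameters such as denoise and step, so the model's weights and architecture are not changed and the pre-trained backbone is kept as-is, which guarantees the Zero-Training property. In short, this study proposes a new Zero-Training Image Editing framework that automatically estimates object locations from text without masks and performs high-quality object-level editing using only Attention control and sampling-parameter adjustment. The proposed architecture outperforms existing training-free editing models such as Prompt-to-Prompt and InstructPix2Pix, as well as an SD3.5 Fine-Tuning model fine-tuned on specific data, in object localization accuracy, preservation of the original, clarity of the edited region, and overall image quality.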
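The idea of accumulating per-word attention maps into a text-driven mask can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `accumulate_attention_mask`, the input layout (one array per step or layer, shaped image tokens by text tokens), the min-max normalization, and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def accumulate_attention_mask(attn_maps, token_index, threshold=0.5):
    """Average one word's attention over several maps and threshold it.

    attn_maps: list of arrays shaped (num_image_tokens, num_text_tokens),
               e.g. one per denoising step or layer (assumed layout).
    token_index: column of the word whose location we want.
    Returns a boolean mask shaped (side, side) over a square latent grid.
    """
    if not attn_maps:
        raise ValueError("attn_maps must contain at least one map")

    # Sum the attention every image token pays to the chosen word.
    acc = np.zeros(attn_maps[0].shape[0], dtype=np.float64)
    for attn in attn_maps:
        acc += attn[:, token_index]
    acc /= len(attn_maps)

    # Min-max normalize to [0, 1] so a fixed threshold is meaningful.
    lo, hi = acc.min(), acc.max()
    norm = (acc - lo) / (hi - lo + 1e-8)

    # Assumption: image tokens form a square grid.
    side = int(round(np.sqrt(norm.size)))
    if side * side != norm.size:
        raise ValueError("number of image tokens is not a perfect square")
    return norm.reshape(side, side) >= threshold
```

In this sketch the mask marks the image tokens that attend most strongly to the chosen word; a real pipeline would likely also smooth or upsample the mask before using it.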

Multilingual Abstract

This research proposed a Zero-Training Image Editing method that utilized the pre-trained Stable Diffusion 3.5 Medium model to perform image editing without any additional training. Conventional image editing approaches typically required manual mask creation or model fine-tuning to achieve fine-grained tasks such as object replacement, attribute modification, or style adjustment. In contrast, our approach leveraged BLIP-based automatic captioning to extract semantic information from input images and constructed a source prompt that preserved the original meaning. User-provided edit prompts and negative prompts served as control signals that defined the target objects and the semantic elements to suppress during editing. The spatial information of the target object was obtained automatically during inference without requiring external masks. Specifically, the method estimated and accumulated word-level Cross-Attention Maps within the Stable Diffusion 3.5 architecture to generate text-driven dynamic masks. Furthermore, the PixelMan attention-manipulation strategy was reinterpreted for the MMDiT architecture, enabling selective amplification of attention on target objects while suppressing undesired changes in non-target regions, which ensured stable editing. Editing intensity was controlled solely through inference-stage parameters such as denoising strength and the number of sampling steps, while the backbone model remained unchanged, thereby strictly maintaining the Zero-Training property. The proposed framework achieved superior performance compared to existing training-free editing methods, such as Prompt-to-Prompt and InstructPix2Pix, in terms of object localization accuracy, structural preservation, clarity of edited regions, and overall image quality.
In summary, this study introduced a novel Zero-Training Image Editing framework that enabled high-quality object-level editing by automatically estimating object locations from text prompts and controlling Cross-Attention and sampling parameters without requiring any masks or model retraining. This approach provided enhanced accuracy, practicality, and scalability for a wide range of image editing tasks.
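How a denoising strength and a step count can jointly set editing intensity is easiest to see with a small worked example. The sketch below follows a common image-to-image convention in which only the last fraction of the sampling schedule is run; the abstract does not state that the thesis uses this exact rule, and the function name `steps_for_strength` is hypothetical.

```python
def steps_for_strength(num_inference_steps, strength):
    """Map (step count, denoising strength) to the part of the schedule that runs.

    Assumed convention: strength 0.0 leaves the input untouched (no steps run),
    strength 1.0 runs the full schedule starting from pure noise.
    Returns (start_index, steps_to_run).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be positive")

    # Number of denoising steps actually executed.
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    # Index into the schedule where denoising begins.
    start_index = max(num_inference_steps - steps_to_run, 0)
    return start_index, steps_to_run
```

Under this convention, `steps_for_strength(40, 0.25)` skips the first 30 scheduled steps and runs the last 10, so a lower strength keeps the result closer to the input image without touching model weights.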

Table of Contents

List of Figures
List of Tables
Glossary of Terms
Korean Abstract
Appendix

1. Introduction
1.1. Research Background and Necessity
1.2. Research Purpose and Method
1.3. Organization of the Study
2. Related Work
2.1. Diffusion Model-Based Image Generation
2.1.1 GAN-Based Image Generation
2.1.2 Basic Principles of Diffusion Models and the Latent Diffusion Model
2.1.3 VAE (Variational Autoencoder)
2.1.4 Reconstruction Loss
2.1.5 KL Divergence Regularization
2.1.6 U-Net
2.1.7 Transformer
2.1.8 Vision Transformer
2.1.9 Classifier-Free Guidance
2.1.10 Stable Diffusion Architecture
2.1.11 MMDiT (Multi-Modal Diffusion Transformer) Architecture
2.2. Zero-Training-Based Image Editing
2.2.1 BLIP-Based Automatic Captioning
2.2.2 Structure via DDIM and Null-Text Inversion
2.2.3 Prompt-to-Prompt and InstructPix2Pix
2.3. Attention Manipulation-Based Image Editing Techniques
2.3.1 Pixel-level Attention Control: PixelMan
2.3.2 Dual Attention Control
2.4. Performance Evaluation Metrics for Image Editing Models
2.4.1 PSNR (Peak Signal-to-Noise Ratio)
2.4.2 SSIM (Structural Similarity Index Measure)
2.4.3 LPIPS (Learned Perceptual Image Patch Similarity)
2.4.4 CLIP Score (Contrastive Language-Image Pre-Training)
3. Proposed Method
3.1. Main Components
3.1.1 Source Prompt Generation
3.1.2 Mask Generation and Spatial Reconstruction
3.2. Zero-Training Editing Strategy
3.2.1 PixelMan-Based Noise Blending
3.2.2 Dual Attention Control
3.3. Image Generation and Optimization
3.3.1 Suppressing Unnecessary Information with the Negative Prompt
3.3.2 Mask-Based Local Editing
3.4. Performance Evaluation Metrics
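4. Experimental Results and Analysis
4.1. Experimental Environment
4.2. Experimental Results
4.2.1 Optimization of the Denoising Scheduler and Step Value
4.2.2 Comprehensive Editing Performance Evaluation for Object Attributes and Object Replacement
4.2.3 Quantitative Analysis of the Effect of the Negative Prompt
4.2.4 Comparative Evaluation with Other Models
5. Conclusion
5.1. Summary of Research Results
5.2. Limitations and Future Research Directions
References
ABSTRACT
Appendix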
