Efficient Transformer-Driven Trapezoidal Attention Framework with Contextual Learning for Visual Saliency Detection

Abstract

Salient object detection (SOD) aims to identify and segment the most visually prominent objects in an image and is a fundamental task in computer vision. Despite significant advances in RGB SOD, current approaches still face key limitations. Contextual refinement is often inadequate, which limits global scene understanding. Many methods also overlook the intermediate features that connect spatial and semantic information, and they struggle to handle multi-scale features efficiently, which increases computational complexity. This thesis addresses these challenges with a unified Trapezoidal Attention Network (TRSNet) that integrates hierarchical representation learning with adaptive attention to achieve a better performance-efficiency trade-off under computational constraints. The proposed architecture uses Pyramid Vision Transformer v2 (PVTv2) as the backbone to generate feature representations at four levels. To interpret each visual element within the broader scene context, we introduce a Context-aware Feature Refinement Block (CFRB). This contextual learning enables the network to distinguish genuinely salient objects from visually similar but contextually irrelevant regions. Recognizing that different feature levels require specialized processing, we propose a Trapezoidal Attention Mechanism (TAM) that assigns a different attention operation to each part of the feature hierarchy. TAM comprises three specialized attention modules: (i) Enhanced Spatial Coordinate Attention (ESCA) for low-level features, which preserves fine-grained positional information through directional average pooling along the height and width dimensions; (ii) Learned Compact Channel Gate (LCCG) for high-level semantic features, which models local cross-channel interaction using adaptive convolutions with learned kernels instead of fixed averaging; and (iii) Dual-Path Multi-Head Attention (DPMHA) for the often-overlooked intermediate features, which combines efficient spatial reduction with inter-head interaction modeling. The decoder uses a progressive fusion strategy that combines the attention-refined features from all levels over four sequential stages. Extensive ablation studies and impact analysis validate the contribution of each component. A comprehensive experimental evaluation on six benchmark datasets shows that the proposed architecture outperforms state-of-the-art methods while incurring lower computational cost.
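
As a rough illustration of two operations named in the abstract, the sketch below shows directional average pooling along height and width (the operation attributed to ESCA) and a 1D convolution over per-channel descriptors (the kind of local cross-channel interaction attributed to LCCG). It is not the thesis's implementation. The function names, the sigmoid gating, the zero padding, and the fixed kernel weights standing in for learned ones are all assumptions made for illustration, and plain Python lists are used in place of tensors to keep the example self-contained.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def directional_avg_pool(fmap):
    """Average one H x W feature map along each spatial axis.

    Returns (row_means, col_means): one mean per row (length H)
    and one mean per column (length W).
    """
    h, w = len(fmap), len(fmap[0])
    row_means = [sum(row) / w for row in fmap]
    col_means = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    return row_means, col_means


def coordinate_gate(fmap):
    """Hypothetical gating step (an assumption, not the ESCA definition):
    scale each position by sigmoid(row mean) * sigmoid(column mean)."""
    rows, cols = directional_avg_pool(fmap)
    return [
        [fmap[i][j] * sigmoid(rows[i]) * sigmoid(cols[j]) for j in range(len(cols))]
        for i in range(len(rows))
    ]


def channel_gate(channels, kernel):
    """Gate C feature maps using a 1D convolution across channel descriptors.

    channels: list of C maps, each H x W.
    kernel:   odd-length list of weights; fixed here, learned in a real model.
    """
    # One global-average descriptor per channel.
    desc = [sum(sum(row) for row in c) / (len(c) * len(c[0])) for c in channels]
    pad = len(kernel) // 2
    padded = [0.0] * pad + desc + [0.0] * pad  # zero padding keeps length C
    weights = [
        sigmoid(sum(k * padded[i + t] for t, k in enumerate(kernel)))
        for i in range(len(desc))
    ]
    # Scale every value in a channel by that channel's gate weight.
    return [[[v * wgt for v in row] for row in c] for c, wgt in zip(channels, weights)]


if __name__ == "__main__":
    rows, cols = directional_avg_pool([[1.0, 3.0], [5.0, 7.0]])
    print(rows, cols)
```

Because each channel's gate depends only on its neighbours within the kernel window, the number of gate parameters equals the kernel length rather than growing with the channel count, which is one plausible reading of "compact" in the module's name.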



Table of Contents