Salient object detection (SOD) aims to identify and segment visually prominent objects in images, serving as a fundamental task in computer
vision applications. Despite significant advances in RGB SOD, current approaches still face key limitations, including inadequate contextual
refinement, which limits global scene understanding. Moreover, many methods overlook intermediate features that connect spatial and
semantic information and struggle to handle multi-scale features efficiently, thereby increasing computational complexity. This thesis
addresses these challenges through a unified Trapezoidal Attention Network (TRSNet) that integrates hierarchical representation learning
with adaptive attention to achieve a better performance-efficiency trade-off under computational constraints. The proposed architecture
employs Pyramid Vision Transformer v2 (PVTv2) as the backbone network to generate representations at four distinct levels. To interpret
each visual element within the broader scene context, we introduce a Context-aware Feature Refinement Block (CFRB). This contextual learning
enables the network to distinguish genuinely salient objects from visually similar but contextually irrelevant regions. Recognizing that different
feature levels require specialized processing, we propose a Trapezoidal Attention Mechanism (TAM) that applies a different attention
operation at each level of the feature hierarchy. TAM comprises three specialized attention modules: (i) Enhanced Spatial Coordinate
Attention (ESCA) for low-level features, which preserves fine-grained positional information through directional average pooling along
the height and width dimensions; (ii) Learned Compact Channel Gate (LCCG) for high-level semantic features, which employs adaptive convolutions
with learned kernels for local cross-channel interaction modeling instead of fixed averaging; (iii) Dual-Path Multi-Head Attention (DPMHA) for
intermediate features, which combines efficient spatial reduction strategies with inter-head interaction modeling to exploit these
often-overlooked features. The decoder employs a progressive fusion strategy that merges attention-refined features from all levels
through four sequential stages. Extensive ablation studies and impact analyses validate the contribution of each component. A comprehensive
experimental evaluation on six benchmark datasets shows that the proposed architecture outperforms state-of-the-art methods while
incurring lower computational cost.
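
As an illustration only, two of the operations named above, directional average pooling along height and width (the basis of ESCA) and local cross-channel interaction with a learned kernel (the basis of LCCG), can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the TRSNet implementation: the function names (directional_average_pool, coordinate_gate, local_channel_gate) are hypothetical, the learned transforms applied to the pooled descriptors are omitted, a fixed example kernel stands in for learned weights, and a single channel stored as nested lists stands in for a real feature tensor.

```python
import math


def sigmoid(x):
    """Logistic function used to squash descriptors into (0, 1) gates."""
    return 1.0 / (1.0 + math.exp(-x))


def directional_average_pool(fmap):
    """Pool one H x W channel along each spatial axis separately.

    Returns (row_means, col_means): row_means[i] averages row i over the
    width, col_means[j] averages column j over the height.
    """
    height, width = len(fmap), len(fmap[0])
    row_means = [sum(row) / width for row in fmap]
    col_means = [sum(fmap[i][j] for i in range(height)) / height
                 for j in range(width)]
    return row_means, col_means


def coordinate_gate(fmap):
    """Reweight each position by gates derived from its row and its column.

    Assumption: the learned transforms a real module would apply to the
    pooled descriptors are omitted; the sigmoid is applied directly.
    """
    row_means, col_means = directional_average_pool(fmap)
    row_gate = [sigmoid(v) for v in row_means]
    col_gate = [sigmoid(v) for v in col_means]
    return [[fmap[i][j] * row_gate[i] * col_gate[j]
             for j in range(len(col_gate))]
            for i in range(len(row_gate))]


def local_channel_gate(channel_means, kernel):
    """Gate each channel from a weighted window of neighbouring channels.

    `kernel` is a hypothetical stand-in for learned weights (odd length).
    Zero padding keeps exactly one gate per channel. The window is applied
    as a cross-correlation, as in deep-learning convolution layers.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    return [sigmoid(sum(w * padded[c + t] for t, w in enumerate(kernel)))
            for c in range(len(channel_means))]


# Toy single-channel feature map with H = 2 and W = 3.
fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
rows, cols = directional_average_pool(fmap)
gated = coordinate_gate(fmap)

# Toy per-channel descriptors (e.g. global averages) for four channels.
gates = local_channel_gate([0.5, -1.0, 2.0, 0.0], kernel=[0.25, 0.5, 0.25])
```

In this sketch the pooled row and column descriptors keep positional information along one axis each, which a single global average would discard, while the channel gate lets each channel's weight depend only on a small neighbourhood of channels rather than on a fixed average over all of them.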