In this work, we demonstrate that using Multi-Task Representation Learning to capture both pixel-level details and high-level semantics improves anomaly detection performance.
This study proposes a multi-task learning (MTL)-based approach to image-based anomaly detection, where the scarcity of abnormal data makes supervised learning methods difficult to apply. Existing multi-class unsupervised anomaly detection methods, such as ViTAD, commonly rely on a Vision Transformer encoder pretrained on large-scale datasets. However, these methods still have limitations. In particular, ViTAD struggles to reconstruct fine-grained local defects and achieves relatively low pixel-level segmentation performance. To overcome these issues, this study designs an MTL framework in which a shared encoder learns detailed representations of normal data through a combination of reconstruction and classification tasks. The reconstruction task enables the encoder to learn fine-grained structural features at the pixel level, while the classification task strengthens global semantic discrimination. Together, the two tasks allow the encoder to simultaneously learn detailed texture information and high-level representations that more clearly define inter-class boundaries in the latent space. Through this MTL-based representation learning, the proposed model achieves more precise segmentation and higher pixel-level AP and F1 scores than ViTAD.
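The combined objective described above can be illustrated with a minimal sketch. The abstract does not specify the loss functions or how the two tasks are weighted, so everything here is an assumption made for illustration: mean squared error for the reconstruction task, softmax cross-entropy for the classification task, a simple weighted sum, and the hypothetical names `mse_loss`, `cross_entropy_loss`, `multitask_loss`, and `cls_weight`. In the actual framework, both loss terms would be computed from the outputs of two heads attached to the shared encoder.

```python
import math


def mse_loss(recon, target):
    """Reconstruction term: mean squared error over flattened pixel values.

    Assumed loss; the abstract does not name the reconstruction loss.
    """
    if len(recon) != len(target):
        raise ValueError("recon and target must have the same length")
    return sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)


def cross_entropy_loss(logits, label):
    """Classification term: softmax cross-entropy for a single sample.

    Assumed loss; the abstract does not name the classification loss.
    """
    m = max(logits)  # subtract the max logit for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(recon, target, logits, label, cls_weight=1.0):
    """Weighted sum of both task losses.

    cls_weight is a hypothetical hyperparameter balancing the two tasks.
    """
    rec = mse_loss(recon, target)
    cls = cross_entropy_loss(logits, label)
    return rec + cls_weight * cls


# Toy example: a 4-pixel "image" of a normal sample and 3 class logits.
target = [0.0, 0.5, 1.0, 0.5]
recon = [0.1, 0.5, 0.9, 0.5]
logits = [2.0, 0.0, 0.0]
label = 0

loss = multitask_loss(recon, target, logits, label, cls_weight=0.5)
print(f"combined loss: {loss:.4f}")
```

Because both terms are backpropagated through the same encoder, a single set of encoder weights is pushed to serve pixel-level reconstruction and class-level discrimination at once, which is the mechanism the abstract credits for the improved representations.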