Shuai Yuan, Jianjian Yin, Runcheng Li, Yi Chen, Yudong Zhang
Current Transformer-based methods demonstrate outstanding performance in medical image segmentation. However, they extract contextual information only from a single image and ignore the potential semantic correlations between images, which makes it difficult for the model to capture a broader range of semantic class information. In this paper, we propose a unified semantic model (UniSEM) that captures semantic information from both cross-image and single-image perspectives to enhance the feature representations of pixels. Specifically, we first design a cross-image semantic aggregation module that extracts the potential class semantic correlations between images and stores the learned correlation features in a memory bank that is dynamically updated during training. Furthermore, we propose a single-image semantic aggregation module that extracts features at multiple scales using multiple windows and adopts dual-attention to fuse the multi-scale features into more refined single-image semantic information. Finally, we use the cross-image and single-image semantic information to enhance the original pixel feature representations, thereby improving the performance of the model. Thorough ablation studies and comprehensive comparative experiments on the ACDC and BUSI datasets demonstrate that our method achieves state-of-the-art performance. The source code is publicly available at https://github.com/NJNUCS/UniSEM.
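To make the idea of a dynamically updated memory bank more concrete, the following is a minimal sketch and not the UniSEM implementation. It assumes one feature vector is stored per semantic class and that each slot is refreshed by a momentum (exponential moving average) update from the mean pixel feature of that class in the current batch. The class name `ClassMemoryBank`, the `momentum` parameter, and the update rule itself are hypothetical choices for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Sequence


class ClassMemoryBank:
    """Hypothetical per-class feature memory (illustrative, not the paper's code)."""

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.9) -> None:
        self.momentum = momentum  # assumed EMA coefficient
        self.bank: List[List[float]] = [[0.0] * dim for _ in range(num_classes)]
        self.initialized: List[bool] = [False] * num_classes

    def update(self, features: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        """Refresh each class slot from the pixel features of one training batch."""
        sums: Dict[int, List[float]] = {}
        counts: Dict[int, int] = {}
        # Accumulate per-class feature sums and pixel counts.
        for feat, lab in zip(features, labels):
            acc = sums.setdefault(lab, [0.0] * len(feat))
            for i, value in enumerate(feat):
                acc[i] += value
            counts[lab] = counts.get(lab, 0) + 1
        for lab, acc in sums.items():
            mean = [value / counts[lab] for value in acc]
            if not self.initialized[lab]:
                # First time this class is seen: store the batch mean directly.
                self.bank[lab] = mean
                self.initialized[lab] = True
            else:
                # Later batches: blend the stored feature with the new batch mean.
                m = self.momentum
                self.bank[lab] = [m * old + (1.0 - m) * new
                                  for old, new in zip(self.bank[lab], mean)]


if __name__ == "__main__":
    bank = ClassMemoryBank(num_classes=2, dim=2, momentum=0.5)
    bank.update([[1.0, 1.0], [3.0, 3.0], [0.0, 2.0]], [0, 0, 1])
    bank.update([[4.0, 4.0]], [0])
    print(bank.bank)
```

Because the bank persists across batches, a slot for a given class can carry information from images other than the one currently being segmented, which is the cross-image property the abstract describes. Classes absent from a batch simply keep their stored feature.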