MDCT-Unet: A dual-encoder network combining multi-scale dilated convolutions with Transformer for medical image segmentation

    Zhengshi Chen, Juanjuan He, Xiaoming Liu, Jinshan Tang

    Abstract

    Precise medical image segmentation is crucial in clinical diagnosis and pathological analysis. Most segmentation methods are based on U-shaped convolutional neural networks (U-Net). Although U-Net performs well in medical image segmentation, as a CNN-based method its main drawbacks are the difficulty of establishing long-range pixel dependencies and a constrained receptive field, both of which restrict segmentation accuracy. Many models address this issue by incorporating Transformers into U-Net architectures to better capture long-range dependencies. However, these methods often suffer from simplistic feature fusion techniques and limited receptive fields for local features. To address these challenges, we propose a dual-encoder framework, named MDCT-Unet, which combines a Swin Transformer and a CNN for enhanced medical image segmentation. The framework introduces a novel dynamic feature fusion module to better integrate local and global features. By combining channel and spatial attention mechanisms and inducing competition between them, we strengthen the coupling of these two types of features, ensuring a richer information representation. In addition, to better extract multi-scale local features from medical images, we design a dilated convolution encoder (DCE) as the CNN branch of our model. By incorporating dilated convolutions with varying receptive fields, the DCE captures rich local features at multiple scales, thereby enhancing the model's ability to segment challenging regions such as boundaries and small organs. We conducted extensive experiments on four datasets: Synapse, ISIC2018, CHASEDB1, and MMWHS. The experimental results show that our method outperforms most current medical image segmentation methods both quantitatively and qualitatively.
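
    To make the fusion idea concrete, the following is a minimal PyTorch sketch of a dual-attention fusion of local (CNN) and global (Transformer) features. The SE-style channel attention, the 7x7 spatial attention, and the softmax-based "competition" between the two attention branches are illustrative assumptions, not the paper's exact dynamic feature fusion design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFeatureFusion(nn.Module):
    """Sketch of a dual-attention fusion module (assumed design).

    Channel attention and spatial attention each produce a re-weighted
    copy of the merged features; a softmax over the two branches'
    global responses makes them compete for influence in the output.
    """

    def __init__(self, channels, reduction=4):
        super().__init__()
        # Channel attention (squeeze-and-excitation style, assumed)
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention over avg- and max-pooled channel maps (assumed)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, local_feat, global_feat):
        x = local_feat + global_feat                     # (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel-attended features
        ca = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x_ca = x * ca
        # Spatial-attended features
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.amax(1, keepdim=True)], dim=1)
        sa = torch.sigmoid(self.spatial_conv(pooled))
        x_sa = x * sa
        # "Competition": softmax over each branch's mean response
        # weights the two attention paths against each other
        scores = torch.stack([x_ca.mean(dim=(1, 2, 3)),
                              x_sa.mean(dim=(1, 2, 3))], dim=1)
        w = F.softmax(scores, dim=1)                     # (B, 2)
        return (w[:, 0].view(b, 1, 1, 1) * x_ca +
                w[:, 1].view(b, 1, 1, 1) * x_sa)
```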
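
    Similarly, here is a minimal sketch of the kind of multi-scale dilated convolution block the DCE could be built from. The specific dilation rates (1, 2, 3) and the concatenate-then-project fusion are assumptions; the abstract only states that the DCE uses dilated convolutions with varying receptive fields.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedBlock(nn.Module):
    """Sketch of a multi-scale dilated convolution block (assumed design).

    Parallel 3x3 branches with different dilation rates see different
    receptive fields; their outputs are concatenated and projected back
    to the target channel count with a 1x1 convolution.
    """

    def __init__(self, in_ch, out_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding = dilation keeps spatial size for a 3x3 kernel
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # 1x1 conv merges the concatenated multi-scale features
        self.project = nn.Conv2d(out_ch * len(dilations), out_ch, 1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

if __name__ == "__main__":
    block = MultiScaleDilatedBlock(64, 128)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 128, 56, 56])
```

    Because every branch preserves spatial resolution, such a block drops into a U-Net encoder stage in place of a plain convolution, letting boundary and small-organ features be captured at several scales at once.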