Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou
The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient fine-tuning of medical images via a two-stage hierarchical decoding procedure. In the initial stage, H-SAM employs SAM’s original decoder to generate a prior probabilistic mask, guiding a more intricate decoding process in the second stage. Specifically, we propose two key designs: 1) A class-balanced, mask-guided self-attention mechanism addressing the unbalanced label distribution, enhancing image embedding; 2) A learnable mask cross-attention mechanism spatially modulating the interplay among different image regions based on the prior mask. Moreover, the inclusion of a hierarchical pixel decoder in H-SAM enhances its proficiency in capturing fine-grained and localized details. This approach enables SAM to effectively integrate learned medical priors, facilitating enhanced adaptation for medical image segmentation with limited samples. Our H-SAM demonstrates a 4.78% improvement in average Dice compared to existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM even outperforms state-of-the-art semi-supervised models relying on extensive unlabeled training data across various medical datasets. Our code is available at https://github.com/Cccccczh404/H-SAM.
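For readers unfamiliar with the metric behind the reported improvement, the following is a minimal sketch of how an average Dice score over organ classes is commonly computed from label maps. It is illustrative only and is not taken from the H-SAM code: the function names `dice_score` and `mean_dice`, the toy label maps, the choice to treat class 0 as background, and the convention for empty masks are all assumptions, and the abstract does not state the exact evaluation protocol (for example, per-slice versus per-volume averaging).

```python
import numpy as np


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pred_labels, true_labels, num_classes):
    """Average Dice over foreground classes 1..num_classes-1.

    Class 0 is assumed to be background and is excluded from the average.
    """
    scores = [
        dice_score(pred_labels == c, true_labels == c)
        for c in range(1, num_classes)
    ]
    return float(np.mean(scores))


# Toy 3x3 label maps with two hypothetical organ classes (1 and 2).
true_labels = np.array([[0, 1, 1],
                        [0, 2, 2],
                        [0, 0, 2]])
pred_labels = np.array([[0, 1, 0],
                        [0, 2, 2],
                        [0, 2, 2]])

avg = mean_dice(pred_labels, true_labels, num_classes=3)
print(f"average Dice: {avg:.4f}")
```

In this toy case, class 1 scores 2/3 (one of two true pixels recovered, no false positives) and class 2 scores 6/7 (all three true pixels recovered, one false positive), so the average reflects both a missed region and an over-segmented one.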