| Architecture | Hybrid strategy | Model | Core method | Advantages | Limitations |
| --- | --- | --- | --- | --- | --- |
| U-Net | – | UNet++ [15] | Dense skip connections combined with a deep supervision mechanism encourage the network to better learn and exploit hierarchical features | Deeper, multi-scale feature extraction and fusion improve segmentation performance | Large number of parameters and high computational complexity |
| U-Net | – | Attention U-Net [46] | Attention gates suppress noise and irrelevant information in the skip connections and highlight key features (see the attention-gate sketch below) | Effectively captures important local and global information in the image, improving segmentation accuracy and performance | Segmentation performance gains are limited |
| Transformer | – | Swin-Unet [58] | Replaces U-Net's 2D convolutional blocks with Swin Transformer modules | Multi-scale feature fusion; parameter count and computational efficiency optimized for practical applicability | Limited extraction of local feature information |
| Transformer | – | C2Former [59] | Designs a cross-convolutional self-attention mechanism to improve semantic feature understanding | Integrates multi-scale feature information and filters out interfering information | Loss of feature information |
| U-Net + Transformer | Single-scale serial | TransUNet [62] | A CNN extracts local features, and a Transformer module cascaded after the CNN extracts global contextual information (see the serial-hybrid sketch below) | Perceives global information, processes sequential data directly, and adapts to multi-scale tasks | High computational complexity |
| U-Net + Transformer | Single-scale serial | MultiIB-TransUNet [67] | A single Transformer layer extracts global features to reduce the parameter count, and multiple IB blocks compress noise to improve robustness | Reduced number of model parameters | Compressing relevant features slightly reduces segmentation accuracy |
| U-Net + Transformer | Multi-scale serial | CoTr [70] | The Transformer in the encoder receives multi-scale feature maps from the CNN and introduces a deformable self-attention mechanism that attends only to key sampling points | Reduced computational and spatial complexity; multi-scale processing of 3D feature maps | Insufficient generalizability |
| U-Net + Transformer | Multi-scale serial | HTUNet [73] | A 2D U-Net with an MSCAT module in the skip connections extracts intra-frame features, while a 3D Transformer U-Net extracts inter-frame features | Intra- and inter-frame feature fusion improves segmentation performance | Fails to delineate the nodule region from the gland; computationally expensive |
| U-Net + Transformer | Alternating serial | HCTNet [76] | TEBlocks are designed to learn global contextual information and are combined with CNN blocks for feature extraction | Combines the CNN's inductive bias for spatial-association modeling with the Transformer's long-range dependency modeling | Boundary details are not segmented effectively |
| U-Net + Transformer | Alternating serial | UTNet [79] | Replaces the last convolution at each resolution level of U-Net with a Transformer module | Combines convolution and self-attention; easy to understand and use | Poor segmentation performance in large-scale training tasks |
| U-Net + Transformer | Global parallelism | TransFuse [82] | CNN and Transformer branches extract features in parallel, and a BiFusion module fuses the features from the two branches (see the parallel-fusion sketch below) | Captures global information while remaining sensitive to low-level context; strong fused representation | Transformer layers are less efficient; insufficient generalization |
| U-Net + Transformer | Global parallelism | HSNet [86] | Encoder and decoder are connected through an interactive attention mechanism, and the decoder has a two-branch structure | Generates discriminative long-range dependencies, recovers detail features, and generalizes well | High model complexity |
| U-Net + Transformer | Hierarchical parallelism | UconvTrans [88] | Each level uses a two-branch structure, and a feature fusion module passes the fused features to the next level | Few parameters and little computation, better balancing segmentation accuracy and efficiency | Not applicable to 3D medical image segmentation, where inter-slice information is richer |
| U-Net + Transformer | Hierarchical parallelism | RMTF-Net [93] | The encoder combines Mix Transformer and RCNN structures, and the decoder is designed with a GFI module that re-fuses the feature information extracted by the encoder | Stronger boundary encoding; locally and globally balanced features at every encoder level | Not extended to 3D image segmentation |
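For concreteness, the sketches below illustrate three of the design patterns in the table. First, the skip-connection attention gate used by Attention U-Net [46]: a minimal PyTorch sketch in which the channel sizes, the toy inputs, and the module name `AttentionGate` are illustrative assumptions rather than the paper's configuration.

```python
# Minimal additive attention-gate sketch (assumed shapes, not the paper's config).
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Gates an encoder skip feature x with a coarser decoder signal g."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.wx = nn.Conv2d(x_ch, inter_ch, kernel_size=1)  # project skip feature
        self.wg = nn.Conv2d(g_ch, inter_ch, kernel_size=1)  # project gating signal
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)    # scalar attention map
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, g):
        # g is assumed already upsampled to x's spatial size
        a = self.relu(self.wx(x) + self.wg(g))
        alpha = self.sigmoid(self.psi(a))  # (N, 1, H, W), values in [0, 1]
        return x * alpha                   # suppress irrelevant skip activations

x = torch.randn(1, 64, 56, 56)   # encoder skip feature
g = torch.randn(1, 128, 56, 56)  # decoder gating signal
print(AttentionGate(64, 128, 32)(x, g).shape)  # torch.Size([1, 64, 56, 56])
```

Because the gate multiplies the skip feature by a learned [0, 1] map before concatenation in the decoder, irrelevant background activations are attenuated rather than passed through the skip connection unchanged.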
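Second, the single-scale serial pattern in the spirit of TransUNet [62]: a CNN extracts local features, then a Transformer encoder models global context over the flattened feature map. This is a minimal sketch assuming a toy three-layer CNN stem and the standard PyTorch Transformer encoder; positional embeddings and the U-Net decoder are omitted for brevity, and all sizes are illustrative.

```python
# Minimal single-scale serial CNN -> Transformer hybrid sketch.
import torch
import torch.nn as nn

class SerialCNNTransformer(nn.Module):
    def __init__(self, in_ch=3, embed_dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem: local feature extraction with 8x spatial downsampling
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        f = self.cnn(x)                        # (N, C, H/8, W/8) local features
        n, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (N, HW/64, C) token sequence
        tokens = self.transformer(tokens)      # global self-attention over tokens
        return tokens.transpose(1, 2).reshape(n, c, h, w)  # back to a feature map

feat = SerialCNNTransformer()(torch.randn(1, 3, 128, 128))
print(feat.shape)  # torch.Size([1, 256, 16, 16])
```

The high computational complexity noted in the table follows directly from the self-attention step, which is quadratic in the number of tokens.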
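Third, the global-parallelism pattern loosely following TransFuse [82]: CNN and Transformer branches run in parallel and a fusion module combines their outputs. The 1x1-conv fusion here is a simplified stand-in for the paper's BiFusion module, and all dimensions are illustrative assumptions.

```python
# Minimal parallel CNN/Transformer two-branch sketch with a simple fusion.
import torch
import torch.nn as nn

class ParallelFusion(nn.Module):
    def __init__(self, in_ch=3, dim=128, heads=4):
        super().__init__()
        # CNN branch: keeps sensitivity to low-level spatial detail
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=4, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        # Transformer branch: patchify, then global self-attention
        self.patch = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Fusion: concatenate branch outputs, mix with a 1x1 conv
        # (a stand-in for BiFusion, not the paper's design)
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, x):
        c = self.cnn(x)                             # (N, dim, H/4, W/4) local branch
        t = self.patch(x)                           # (N, dim, H/4, W/4) token grid
        n, d, h, w = t.shape
        seq = t.flatten(2).transpose(1, 2)          # tokens for attention
        t = self.transformer(seq).transpose(1, 2).reshape(n, d, h, w)
        return self.fuse(torch.cat([c, t], dim=1))  # fused representation

out = ParallelFusion()(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 128, 16, 16])
```

Running the two branches in parallel, rather than serially, is what lets this family capture global context without sacrificing the CNN branch's low-level detail, at the cost of the Transformer branch's extra compute.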