Laser & Optoelectronics Progress, Vol. 62, Issue 2, 0200001 (2025)
Yixiao Yin*, Jingang Ma, Wenkai Zhang, and Liang Jiang
Author Affiliations
School of Medical Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan 250355, Shandong, China
DOI: 10.3788/LOP240875
Yixiao Yin, Jingang Ma, Wenkai Zhang, Liang Jiang. From U-Net to Transformer: Progress in the Application of Hybrid Models in Medical Image Segmentation[J]. Laser & Optoelectronics Progress, 2025, 62(2): 0200001
Fig. 1. Different types of medical images
Fig. 2. U-Net structure diagram[12]
Fig. 3. Transformer structure diagram[55]
Fig. 4. Conceptual structure diagram of single-scale serial combination
Fig. 5. TransUNet structure diagram[62]
Fig. 6. Conceptual structure diagram of multi-scale serial combination
Fig. 7. CoTr structure diagram[70]
Fig. 8. Conceptual structure diagram of alternating serial combination
Fig. 9. UTNet structure diagram[79]
Fig. 10. Conceptual structure diagram of overall parallel combination
Fig. 11. DuPNet structure diagram[84]
Fig. 12. Conceptual structure diagram of hierarchical parallel combination
Fig. 13. UconvTrans structure diagram[88]
Imaging type | Imaging mechanism | Imaging objective | Advantage | Disadvantage | Common dataset
X-ray | Penetration of electromagnetic radiation through body tissues | Bones, lungs, heart | Fast, easy to operate, low cost, good film contrast | Radiation exposure; limited soft-tissue imaging; only 2D images available | COVID-19[33]
CT | Rotational X-ray acquisition of in vivo cross-sectional images | Any part of the body | High-resolution images with no overlap between tissue structures and lesions | Radiation exposure; prone to artifacts; only average soft-tissue resolution | LiTS[34]
MRI | Strong magnetic fields and radio waves | Soft tissues, brain, joints | Non-invasive, no radiation risk, high soft-tissue resolution, can image slices directly in any direction | Time-consuming, costly, confined scanning space, limited by metal implants | BraTS[35]
US | High-frequency sound waves | Soft tissues, blood flow, organs | Non-invasive, radiation-free, real-time imaging | Poor imaging of gas and bony structures; highly operator-dependent | EchoNet-Dynamic[36]
Digital pathology | High-resolution scans under an optical microscope | Tissues, cells | High-resolution images, digital storage | Requires prior sampling, which is costly and time-consuming | SEED[37]
Table 1. Comparison of five imaging methods
Evaluation index | Mathematical description
Dice | Dice = 2|A∩B| / (|A| + |B|)
IoU | IoU = |A∩B| / |A∪B|
ACC | ACC = (R_TP + R_TN) / (R_TP + R_TN + R_FP + R_FN)
SE | SE = R_TP / (R_TP + R_FN)
Table 2. Segmentation performance evaluation indexes and mathematical description
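The four indexes in Table 2 can all be computed from the confusion-matrix counts of a binary prediction mask A against a ground-truth mask B. A minimal NumPy sketch (the function name and toy masks below are illustrative, not from any cited work):

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Compute Dice, IoU, ACC, and SE for a pair of binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.sum(pred & target)     # |A ∩ B|
    tn = np.sum(~pred & ~target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    dice = 2 * tp / (2 * tp + fp + fn)       # 2|A∩B| / (|A| + |B|)
    iou = tp / (tp + fp + fn)                # |A∩B| / |A∪B|
    acc = (tp + tn) / (tp + tn + fp + fn)
    se = tp / (tp + fn)
    return dice, iou, acc, se

# Toy 4-pixel example: one true positive, one false positive,
# one false negative, one true negative
pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 1, 0])
print(segmentation_metrics(pred, target))  # Dice 0.5, IoU 1/3, ACC 0.5, SE 0.5
```

Note that Dice and IoU ignore true negatives, which is why they are preferred over ACC for small lesions occupying a tiny fraction of the image.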
Combination strategy | Typology | Main idea | Advantage | Disadvantage | Model complexity
Serial | Single-scale serial | CNN and Transformer are serially combined at a single scale | Easy to implement and understand | Information loss between the CNN and the Transformer | Low
Serial | Multi-scale serial | CNN is serially combined with Transformers at different scales | Multi-scale information becomes available | Managing different scales increases complexity, and detailed information may be lost when resolution is lowered | Middle
Serial | Alternating serial | CNN and Transformer blocks alternate throughout the network | More balanced use of CNN and Transformer capabilities, improving information flow at each stage | Interaction layers must be managed, which increases complexity and can cause information bottlenecks or redundancy if not carefully designed | High
Parallel | Global parallelism | CNN and Transformer process the image in parallel and their features are finally fused | Exploits the strengths of both networks; spatial and sequence information can be captured efficiently | Requires an effective feature-fusion mechanism and increases computation | High
Parallel | Hierarchical parallelism | CNN and Transformer run in parallel at multiple levels of the model, with feature fusion at each level | Both networks fully learn features at each level, enhancing model generalization | Requires an effective feature-fusion mechanism; the structure is complex and hard to tune | High
Table 3. Summary of different combination strategies
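The single-scale serial strategy of Table 3 (the TransUNet-style pattern: a CNN stage extracts local features, then a Transformer stage strung after it mixes global context) can be sketched in plain NumPy. All weights, dimensions, and function names below are illustrative assumptions, not the layers of any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_3x3(x, w):
    """Naive 'same'-padded 3x3 convolution (1 channel in/out): the local CNN stage."""
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention: the global Transformer stage."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                # every token attends to every other

# Serial combination: conv + ReLU first, then flatten pixels into tokens for attention.
img = rng.standard_normal((8, 8))
feat = np.maximum(conv2d_3x3(img, rng.standard_normal((3, 3))), 0)

d = 4
tokens = feat.reshape(-1, 1) @ rng.standard_normal((1, d))   # embed each pixel as a d-dim token
out = self_attention(tokens,
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
print(out.shape)   # (64, 4): 64 tokens, each now carrying globally mixed context
```

The quadratic token-token score matrix in `self_attention` is also why the table lists higher complexity for the Transformer-heavy strategies: attention cost grows with the square of the number of spatial positions.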
Method | Model | Improvement idea | Advantage | Disadvantage
U-Net | UNet++[15] | Dense skip connections with a deep supervision mechanism motivate the network to better learn and exploit hierarchical features | Deeper, multi-scale feature extraction and fusion improve segmentation performance | Large number of parameters and high computational complexity
U-Net | Attention U-Net[46] | An attention-gate mechanism suppresses noise and irrelevant information in the skip connections and extracts key features | Effectively captures important local and global information in the image, improving segmentation accuracy and performance | Segmentation performance improves only modestly
Transformer | Swin-Unet[58] | Replaces U-Net's 2D convolutional blocks with Swin Transformer modules | Multi-scale feature fusion; parameter count and computational efficiency are optimized | Lacks extraction of local feature information
Transformer | C2Former[59] | Designs a cross-convolutional self-attention mechanism to improve semantic feature understanding | Integrates multi-scale feature information and filters out interference | Loss of feature information
U-Net+Transformer (single-scale serial) | TransUNet[62] | A CNN extracts local features, and a Transformer module strung after the CNN extracts global contextual information | Perceives global information, processes sequential data directly, adapts to multi-scale tasks | High computational complexity
U-Net+Transformer (single-scale serial) | MultiIB-TransUNet[67] | A single Transformer layer extracts global features to reduce the parameter count, and multiple IB blocks compress noise and improve robustness | Fewer model parameters | Compressing the relevant features slightly reduces segmentation accuracy
U-Net+Transformer (multi-scale serial) | CoTr[70] | The Transformer in the encoder receives CNN feature maps at multiple scales and uses a deformable self-attention mechanism that focuses on key sampling points | Reduced computational and spatial complexity; multi-scale processing of 3D feature maps | Insufficient generalizability
U-Net+Transformer (multi-scale serial) | HTUNet[73] | A 2D U-Net with an MSCAT module in the skip connections extracts intra-frame features, while a 3D Transformer U-Net extracts inter-frame features | Intra- and inter-frame feature fusion improves segmentation performance | Cannot delineate the nodal region from the gland; computationally expensive
U-Net+Transformer (alternating serial) | HCTNet[76] | TEBlocks are designed to learn global context information and are combined with CNN blocks for feature extraction | Combines the CNN inductive bias for spatial modeling with the Transformer's long-range dependency modeling | Boundary details are not segmented effectively
U-Net+Transformer (alternating serial) | UTNet[79] | Replaces the last convolution at each resolution of U-Net with a Transformer module | Combines convolution and self-attention; easy to understand and use | Poor segmentation performance in large-scale training tasks
U-Net+Transformer (global parallelism) | Transfuse[82] | CNN and Transformer extract features in parallel, and a BiFusion module fuses the features of the two branches | Captures global information while remaining sensitive to low-level context; strong fused representation | Transformer layers are less efficient; insufficient generalization
U-Net+Transformer (global parallelism) | HSNet[86] | Encoder and decoder are connected through an interactive attention mechanism, and the decoder has a two-branch structure | Generates discriminative long-range dependencies, recovers detail features, generalizes well | High model complexity
U-Net+Transformer (hierarchical parallelism) | UconvTrans[88] | Each level uses a two-branch structure, and a feature-fusion module passes the fused features to the next level | Few parameters and little computation, better balancing segmentation accuracy and efficiency | Not applicable to 3D medical image segmentation, where slice information is richer
U-Net+Transformer (hierarchical parallelism) | RMTF-Net[93] | The encoder combines Mix Transformer and RCNN structures, and the decoder's GFI module re-fuses the feature information extracted by the encoder | Stronger boundary encoding; a locally-globally balanced feature is available at each encoder level | Not extended to 3D image segmentation
Table 4. Ideas and comparison of advantages and disadvantages of partially improved models
Method | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /%
U-Net | UNet++[15] | Nucleus/colon polyp/liver/pulmonary nodule | Microscopy/RGB video/CT/CT | - | 92.52/32.10/82.90/77.21 | -
U-Net | UNet 3+[43] | Liver/spleen | CT/CT | 96.75/96.20 | - | -
U-Net | Attention U-Net[46] | Pancreas | CT | 81.48 | - | -
U-Net | R2U-Net[48] | Retinal vessel/skin lesion/pulmonary nodule | Fundus image/dermoscopy/CT | - | -/94.21/99.18 | 97.12/94.24/99.18
Transformer | Swin-Unet[58] | Abdominal multi-organ/heart | CT/MRI | 79.13/90.00 | - | -
Transformer | C2Former[59] | Abdominal multi-organ/heart/skin cancer | CT/MRI/dermoscopy | 83.22/91.42/86.78 | - | -
Transformer | MISSFormer[60] | Abdominal multi-organ/heart | CT/MRI | 81.96/90.86 | - | -
Transformer | DS-TransUNet[61] | Polyp/skin lesion/gland/nucleus | Endoscopy/dermoscopy/pathology/microscopy | 93.5/-/87.19/- | 88.9/85.23/78.45/86.12 | -
Table 5. Comparison of segmentation methods based on U-Net and Transformer
Combination strategy | Typology | Model | Segmentation task | Image type | Dice /% | IoU /% | ACC /%
Serial | Single-scale serial | TransUNet[62] | Multi-organ/heart | CT/MRI | 77.48/89.71 | - | -
Serial | Single-scale serial | TransUNet+[63] | Multi-organ/gland/heart | CT/filmstrip/MRI | 81.57/90.42/90.47 | -/82.69/- | -
Serial | Single-scale serial | PKRT-Net[64] | Optic cup/optic disc | Fundus image | 91.20/97.66 | - | -
Serial | Single-scale serial | TU-Net[65] | Vessel | US | - | 92.00/85.00/67.00 | -
Serial | Single-scale serial | GL-Segnet[66] | Rectal adenocarcinoma cell/skin lesion/glioma/thoracic organ | Pathology/dermoscopy/CT/X-ray | 93.10/91.50/93.10/96.90 | 87.30/85.80/87.70/94.20 | -
Serial | Single-scale serial | MultiIB-TransUNet[67] | Breast | US/CT | -/81.83 | 67.75/- | -
Serial | Single-scale serial | UNETR[68] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 89.10/71.10/96.40 | - | -
Serial | Single-scale serial | UMSTC[69] | Microscope image | Microscopy | 82.50 | 76.50 | -
Serial | Multi-scale serial | CoTr[70] | Cranial vault multi-organ | CT | 85.00 | - | -
Serial | Multi-scale serial | Multi-compound Transformer[71] | Cell/colon/skin lesion | Microscopy/endoscopy/dermoscopy | 68.40/92.30/90.35 | - | -
Serial | Multi-scale serial | TFNet[72] | Breast | US | 87.90 | 78.40 | -
Serial | Multi-scale serial | HTUNet[73] | Thyroid | US | 98.59 | 97.26 | -
Serial | Multi-scale serial | TransHRNet[74] | Abdominal multi-organ/brain tumor/spleen | CT/MRI/CT | 86.70/72.50/97.40 | - | -
Serial | Multi-scale serial | Swin UNETR[75] | Brain tumor | MRI | 91.30 | - | -
Serial | Alternating serial | HCTNet[76] | Breast | US | 97.23 | 94.63 | 97.41
Serial | Alternating serial | Feature integration network[77] | Abdominal multi-organ/prostate | CT/MRI | 92.45/81.63 | - | -
Serial | Alternating serial | SWTRU[78] | Liver and tumor/skin lesion/glioma | CT/dermoscopy/MRI | 97.20/90.40/89.70 | 94.90/-/- | -
Serial | Alternating serial | UTNet[79] | Heart | MRI | 88.30 | - | -
Serial | Alternating serial | nnFormer[80] | Brain tumor/abdominal multi-organ/heart | MRI/CT/MRI | 86.40/86.57/92.06 | - | -
Serial | Alternating serial | SwinBTS[81] | Brain tumor | MRI | 81.15 | 81.10 | -
Parallel | Global parallelism | Transfuse[82] | Polyp/skin lesion/hip/prostate | Endoscopy/dermoscopy/X-ray/MRI | 94.20/87.20/-/- | 89.70/-/-/- | -/94.40/-/-
Parallel | Global parallelism | TransFusionNet[83] | Liver tumor/vessel | CT/CT | 96.10/90.10 | 92.70/85.40 | -
Parallel | Global parallelism | DuPNet[84] | Rectal cancer | MRI | 98.22 | 89.34 | -
Parallel | Global parallelism | CTC-Net[85] | Multi-organ/heart | CT/MRI | 78.41/90.77 | - | -
Parallel | Global parallelism | HSNet[86] | Polyp | Endoscopy | 92.60 | 87.70 | -
Parallel | Global parallelism | TDD-UNet[87] | COVID-19 pneumonia | CT/X-ray | 78.94 | - | 96.34
Parallel | Hierarchical parallelism | UconvTrans[88] | Heart | MRI | 89.60 | - | -
Parallel | Hierarchical parallelism | Swin Unet3D[89] | Brain tumor | MRI | 90.50 | - | -
Parallel | Hierarchical parallelism | P-TransUNet[90] | Polyp/nucleus/gland | Endoscopy/microscopy/pathology | 93.52/93.63/95.93 | 88.93/88.75/91.42 | -
Parallel | Hierarchical parallelism | PCT[91] | Parotid tumor | US | 91.51 | 84.34 | -
Parallel | Hierarchical parallelism | TransConver[92] | Brain tumor | MRI | 86.32 | - | -
Parallel | Hierarchical parallelism | RMTF-Net[93] | Brain tumor | MRI | 93.50 | 88.20 | -
Table 6. Comparison of segmentation methods based on U-Net+Transformer