- Advanced Photonics
- Vol. 6, Issue 6, 064001 (2024)
1 Introduction
Modern optical microscopy methods provide researchers with a window into the microscopic world with visual clarity not possible using traditional bright-field microscopy. While bright-field microscopy relies on light absorption by the sample to generate visual contrast, biological specimens often lack sufficient light absorption for clear, analyzable images.1 To overcome this challenge, scientists have traditionally employed various staining techniques and specialized microscopy methods tailored to derive contrast from diverse properties of the sample across different scales, portrayed in Fig. 1. For tissues, researchers use chemical dyes to stain the sample and create contrast.2 Similarly, fluorescent dyes are employed to highlight specific cellular structures.3,4 At the molecular level, fluorophores are commonly utilized to bind to target molecules, enabling researchers to track and observe individual molecules using fluorescence microscopy.5,6
Figure 1.Applications of cross-modality transformations across biological scales. At the largest scales, virtual staining is used to enhance imaging contrast. At intermediate scales, virtual staining is used in conjunction with noise reduction techniques. At the smallest scales, superresolution is used to study systems far beyond the optical diffraction limit. Image created with the assistance of BioRender.
However, modern microscopy techniques also present challenges. Staining tissue samples is a laborious, invasive, and often irreversible process, resulting in varying staining outcomes for different tissues and limiting their reuse for alternative purposes.7 Similarly, imaging cellular and subcellular structures poses challenges, such as costly and time-consuming staining procedures that often limit sample utility.8 Furthermore, at the molecular scale, light microscopy encounters limitations in image resolution. Researchers must use specialized objectives, sophisticated setups, and complex fluorophore mixes and buffers to observe sufficiently small structures, resulting in expensive and intricate optical arrangements, e.g., interferometric scattering microscopy9 and direct stochastic optical reconstruction microscopy (STORM).10,11
Recently, deep learning (DL) has emerged as a potential solution to overcome the aforementioned challenges in microscopy.12 By employing neural networks to perform numerical transformations of images between different optical modalities,13 researchers can capture images using a low-cost modality, such as bright-field microscopy, and convert them to a preferred modality, such as fluorescence microscopy, for simplified analysis.14 This process, known as cross-modality transformation, eliminates the need for costly and invasive staining procedures,15 allowing for multiple staining techniques to be produced from the same sample with minimal additional expense. Moreover, since the entire process is numerical, results can be easily replicated by independent teams, ensuring reproducible and reliable outcomes.
In this review, we demonstrate the utilization of cross-modality transformations across biological scales. We outline common strategies for training neural networks for cross-modality transformations, while addressing the specific challenges and possibilities associated with each biological scale. Finally, we summarize the most successful techniques as rules-of-thumb and provide guidelines for the development and utilization of cross-modality transformations.
2 Introduction to DL for Multimodal Transformations
DL is a subset of machine learning that uses artificial neural networks to perform specific tasks. Neural networks are complex computational models processing input data to generate output results. The performance of a neural network is determined by its parameters, commonly referred to as weights, which can range from tens of thousands to hundreds of millions, depending on the application. Primarily, the objective of DL is to optimize these weights by a process called training, enabling the neural network to yield desired outcomes for a given input space.
In cross-modality transformations, the neural networks used are frequently trained using supervised learning.16 During this process, the network is presented with an image captured from one modality (e.g., bright-field) and trained to generate the corresponding image in another modality (e.g., fluorescence). Typically, these training data are obtained using either a dual-modality microscope,17 where the sample is imaged using both modalities, or alternatively, the sample can be imaged twice—before and after a specific treatment (e.g., staining)—with subsequent image processing to align the two views.18 In certain cases, it may be feasible to derive the transformation analytically, necessitating the imaging of the sample using only one modality to train the network to reconstruct the original image. Although the physical staining or alternative imaging process must be conducted at least once to establish a training pool, subsequent experiments benefit from this simplification.
Cross-modality transformation involves translating images from one modality to another, typically employing encoder–decoder-style fully convolutional neural networks (CNNs). A CNN is a regularized feed-forward network that learns features on its own through kernel (filter) optimization. These networks utilize convolutional operations to process input images and are commonly exemplified by architectures such as U-Net, ResNet, and InceptionNet.
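As a concrete illustration, the sketch below (in PyTorch) shows a minimal encoder–decoder CNN for image-to-image translation. The layer sizes, channel counts, and the `TinyEncoderDecoder` name are illustrative rather than taken from any specific published model; a full U-Net would additionally include skip connections between encoder and decoder.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Toy encoder-decoder for image-to-image translation (a real U-Net adds skip connections)."""
    def __init__(self, in_channels=1, out_channels=1):
        super().__init__()
        self.encoder = nn.Sequential(                              # downsampling path
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                              # upsampling path
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

brightfield = torch.randn(8, 1, 256, 256)             # batch of source-modality images
virtual_stain = TinyEncoderDecoder()(brightfield)      # predicted target-modality images
```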
An essential aspect of training neural networks for cross-modality transformation is the choice of loss function, the metric minimized during training. Traditionally, minimizing the mean absolute error ($L_1$) or mean squared error ($L_2$) distance between the predicted and ground-truth images is common.19 However, this approach often yields low-resolution and nonphysical results. To address this, an auxiliary adversarial loss function is frequently incorporated.20 This involves training a discriminator neural network alongside the main generator network, where the discriminator distinguishes between generated and ground-truth images. The generator is trained to deceive the discriminator by producing physically reasonable results. Such networks are referred to as generative adversarial networks (GANs).
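The sketch below illustrates how such a combined objective can be assembled in the style of pix2pix: a pixel-wise $L_1$ term plus an adversarial term computed from a discriminator. The `generator` and `discriminator` objects and the weighting factor are placeholders, not the configuration used in any particular study.

```python
import torch
import torch.nn as nn

l1_loss = nn.L1Loss()
bce = nn.BCEWithLogitsLoss()
lambda_pix = 100.0                         # relative weight of the pixel-wise term

def generator_loss(generator, discriminator, source, target):
    """Adversarial + pixel-wise objective for the generator (pix2pix-style)."""
    fake = generator(source)
    # Adversarial term: the generator tries to make the discriminator label its output as real.
    pred_fake = discriminator(torch.cat([source, fake], dim=1))
    adv = bce(pred_fake, torch.ones_like(pred_fake))
    # Pixel-wise term: keep the prediction close to the ground-truth modality.
    pix = l1_loss(fake, target)
    return adv + lambda_pix * pix
```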
Alternatively, diffusion models represent a recent approach to generative modeling. These models utilize probabilistic generative techniques in a two-step process: forward diffusion, where noise is iteratively added to images until they become pure Gaussian noise; and reverse diffusion, where images are iteratively denoised using a neural network. By conditioning the reverse diffusion process, diffusion models effectively handle image-to-image transformation tasks.21 While diffusion models can produce higher-quality images compared to GANs, they come with significantly higher computational costs.
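For intuition, the following sketch implements only the forward (noising) half of a diffusion model with a simple linear variance schedule; the schedule parameters are illustrative, and the reverse (denoising) network that would be conditioned on the source modality is omitted.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear variance schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) for a batch of timesteps t (forward diffusion)."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

x0 = torch.randn(4, 1, 64, 64)                     # clean target-modality images
t = torch.randint(0, T, (4,))
x_t = q_sample(x0, t, torch.randn_like(x0))        # progressively noisier as t grows
```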
3 Tissue Imaging/Histology
Histological staining is a cornerstone of clinical pathology and research, playing a pivotal role in unraveling tissue details at the microscopic level. It enables the visualization of structures crucial for medical diagnosis, scientific study, autopsy, and forensic investigation.22 In recent years, DL advances have revolutionized tissue imaging and histology analysis, offering innovative solutions to overcome the limitations of traditional physical staining methods.23,24 In the following sections, we will explore the transformative impact of DL techniques in substituting conventional staining approaches, with the aim of improving the analysis of histological samples.
3.1 Limitations of Chemical Staining
One of the most impactful applications of DL for cross-modality transformation is in histology, where visual tissue analysis often faces challenges associated with traditional physical staining. Tissues, as the largest biological structures routinely observed through optical microscopy, require staining protocols to create visual contrast between features. However, these protocols often rely on chemical dyes that can be hazardous and may adversely affect the samples, especially during critical steps when sample structures are vulnerable.25
The histological process usually comprises several steps, including fixation, embedding, sectioning, staining, and mounting, although the specific steps may vary according to the staining technique and the target tissue. The first step is fixation, where the full tissue sample is preserved using chemicals, such as formaldehyde or glutaraldehyde. Fixation prevents decay and maintains structural integrity by cross-linking the proteins in the sample; however, the tissue's original chemistry is altered. An alternative approach is to freeze the sample, often using liquid nitrogen, which can preserve the natural state of proteins and lipids without chemicals.

The next step is to dehydrate the tissue sample through a series of diluted alcohol solutions and to clear it using clearing agents that dissolve remnant lipids and simultaneously homogenize the refractive index. This process renders the tissue transparent as a consequence of even light scattering across the sample,28 and prepares it for infiltration, but it can cause tissue shrinkage and morphological alterations. The sample can then be completely encased in an embedding medium, such as paraffin wax, and left to solidify. Once hardened, the sample is cut with a microtome into very thin slices, typically around 4 µm thick, in a step known as sectioning. This process requires high skill and precision, as incorrect microtome alignment or use can lead to tearing or crushing of the tissue and obscure important details. Finally, the slices are placed on microscope glass slides for observation. Handling must avoid stretching or folding, which can distort the samples and hinder analysis.

Once mounted, the samples are ready for the next step: staining. During staining, various stains are applied based on the cellular structures of interest. Among histological stains, hematoxylin and eosin (H&E) are among the most widely used, with hematoxylin staining cell nuclei purple and eosin coloring the cytoplasm and extracellular matrix pink. Other stains target different structures, such as Picrosirius Red (PSR) for collagen fibers or Alcian Blue for acidic mucins and cartilage. An important technique within this process is immunohistochemistry (IHC), which detects specific proteins in the sample using antibodies. This method requires antigen retrieval to unmask target proteins after fixation, alongside a blocking step to reduce nonspecific antibody binding. The staining process is sensitive to factors such as concentration and timing, and if applied unevenly, it can obscure details. Other limitations of IHC include nonspecific binding or cross-reactions of the added antibodies, which can negatively affect the outcome. Finally, the sample is ready for detailed observation, and the histological features can be identified via microscopy imaging.
Traditionally a qualitative method, chemical staining can also yield quantitative data through image analysis,29 such as measuring staining intensity, to estimate the presence of biomolecules. However, identifying biological features often requires additional context and expert analysis, and variability in staining intensity, reagent quality, and human interpretation can affect results. Standardized protocols and software tools can mitigate some of this variability, though issues persist compared to virtual staining and inherent contrast techniques.
Inherent contrast techniques—such as phase contrast, differential interference contrast (DIC),30,31 and quantitative phase imaging (QPI)—provide an alternative to chemical staining by exploiting refractive index variations in biological tissues to enhance contrast.32 These methods require minimal additional equipment but lack the specificity of chemical dyes and are prone to optical artifacts, such as halo effects.33 To improve specificity, they are often combined with other techniques. The choice between virtual staining and inherent contrast methods depends on factors such as cost-effectiveness and imaging quality.
3.2 DL for Tissue Imaging
DL models can be trained to virtually stain samples, whether they are unstained34 or have been stained using a different method.35 By bypassing the physical staining process, multiple readouts equivalent to different dyes can be obtained from the same image. This not only maximizes information output36 for analysis and diagnostics37 but also simplifies the experimental setup requirements, as shown in Fig. 2.
Figure 2.Contrast between physical and virtual approaches to obtain a stained image. In the physical approach, the sample undergoes a series of complex procedures, including preparation, staining, and imaging. Tissue preparation may involve fixing, embedding, and sectioning, among other steps. Similarly, histological staining of an unstained sample requires permeabilization, chemical dye application, washing, counterstaining, and protocol optimization before imaging. In contrast, virtual staining offers a simplified alternative to these protocols, eliminating the need for physical processing.
In histology, DL models are trained to virtually stain tissue using collections of stained/unstained image pairs as a reference (Fig. 3).36,37,39
Figure 3.Representative applications of cross-modality transformations for tissue imaging using DL. (a) Virtual staining of an unlabeled sample image to obtain the equivalent H&E stained image. Adapted from Rana et al.
Therefore, virtual and histological staining are not mutually exclusive and can be complementary. While virtual staining offers convenience and reproducibility, it still relies on histological staining to provide the ground truth for generating large training data sets. The main trends are summarized in Table 1 (Sec. 6). Some models benefit from existing databases of images of stained tissues for their training, reducing the manual effort required to obtain a sufficient training set.34,61,62 Virtual staining also offers advantages over traditional histology, including the potential for real-time staining of tissue samples63 and three-dimensional (3D) reconstruction of full tissues.63 In the latter case, Wang et al.63 virtually stained images of volumetric samples acquired with light-field microscopy (not to be confused with bright-field microscopy), merging two typically incompatible techniques. Furthermore, certain virtual staining models extend their capabilities by integrating segmentation of the highlighted regions of interest.48,64
Virtual staining models must undergo rigorous validation against chemically stained samples to ensure accurate representation of biological features, as variations in color, texture, and detail may occur. A comprehensive comparison between traditional and virtual staining methods is therefore crucial for assessing reliability,69 which requires either manual or automated ground-truth data annotation. Objective metrics, such as the structural similarity index measure (SSIM),51,70 can complement expert review in quantifying this agreement.
In general, a single dye is insufficient to provide comprehensive information about a particular tissue sample. Instead, various dyes can be applied to different samples of the same original tissue, such as different slices of a single specimen. In the study by Li et al.,40 three distinct dyes (H&E, PSR, and orcein, the last used to reveal elastic fibers) were applied to carotid tissue. Each dye targets different components of artery tissue, aiding in the identification of coronary artery disease and vascular injuries. First, an independent model was trained for each dye to virtually stain an unstained sample. Then, a complete model was trained to simultaneously produce all three modalities from the original sample. This was accomplished by applying one of the three corresponding staining protocols to the originally unstained samples, generating pairs of stained and unstained images for each type. A total of 60 whole-slide images of each stain, along with their unstained equivalents, were utilized, yielding 1500 to 1800 image tiles for training and 150 to 200 for validation for each staining protocol. Following standard practice, a conditional generative adversarial network (cGAN) was implemented to learn the generation of stained images from the acquired data set. The generator architecture was based on U-Net, while the discriminator comprised a PatchGAN architecture. To accomplish virtual staining according to three distinct protocols, the StarGAN76 architecture was implemented. This framework enables image-to-image translations across multiple domains using only a single model, offering practically unlimited potential for utilizing unstained samples, as they can potentially be transformed into any other protocol with an appropriately trained network.36,77
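The conditioning idea behind such multi-domain models can be sketched as follows: a one-hot label indicating the desired stain is broadcast to the image size and concatenated to the input channels, so that a single generator can be steered toward any of the target protocols. The function and tensor shapes below are illustrative, not the exact StarGAN implementation.

```python
import torch

def condition_on_stain(image, stain_idx, num_stains=3):
    """Concatenate a one-hot stain label, broadcast to image size, as extra input channels."""
    b, _, h, w = image.shape
    label = torch.zeros(b, num_stains, h, w)
    label[torch.arange(b), stain_idx] = 1.0
    return torch.cat([image, label], dim=1)

unstained = torch.randn(2, 1, 256, 256)
# The same generator can then be asked for different stains, e.g., H&E vs. PSR:
x_he = condition_on_stain(unstained, torch.tensor([0, 0]))
x_psr = condition_on_stain(unstained, torch.tensor([1, 1]))
```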
These studies suggest that DL holds significant potential for histological staining, yet its widespread adoption remains limited. Although DL has emerged as a leading choice for analyzing and interpreting histology images, with the potential to enhance medical diagnostics,78 very few algorithms have transitioned to clinical implementation.79 Several challenges persist, notably the need for accurate labeling and for addressing variations in slide colors.80
Diffusion models have emerged as a powerful tool in histology for virtual staining, addressing some of the limitations seen in traditional models, such as GANs and encoder–decoder networks. For instance, StainDiff is a diffusion probabilistic model designed to improve stain-to-stain transformations by overcoming issues such as mode collapse, where the generator of a GAN produces images covering only a limited range of the training samples, and the posterior mismatching found in other networks.88 These models are also applied to generate virtual IHC images from H&E stained slides, as seen in PST-Diff, which ensures structural and pathological consistency through mechanisms such as asymmetric attention and latent transfer.89 Despite their potential, diffusion models generally require large data sets, making them less feasible for histological applications with limited data. To address this, multitask architectures such as StainDiffuser have been developed to simultaneously generate cell-specific stains and segment cells, optimizing performance even with constrained data sets.90 In addition, advanced methods, such as virtual IHC multiplex staining, utilize large vision-language diffusion models to generate multiple IHC stains from a single H&E image, addressing tissue preservation challenges often faced in biopsies.91 However, challenges remain. Diffusion models have shown limitations in unpaired image translation tasks, such as slide-free microscopy virtual staining, where the sample preparation process is bypassed along with the staining process. In such cases, they underperform compared to models such as CycleGAN without additional regularization, highlighting the need for further refinement in certain applications.92 Despite these challenges, diffusion models show great promise in virtual staining, with ongoing research focused on enhancing their reliability and applicability in histology.
4 Cellular and Subcellular Structure Imaging
Biologists and clinical laboratories routinely employ optical microscopy to examine cell cultures, enabling the study of cellular and subcellular morphologies and physiology. This examination helps in understanding intercellular communication networks, dynamic cell behaviors, and pathophysiological mechanisms.93 For instance, changes in the morphological characteristics of cellular structures serve as effective indicators of a cell culture’s physiological status and its response under drug exposure.94,95 In the subsequent sections of this review, we delve into the limitations of fluorescence staining techniques, shedding light on the challenges associated with both fixed and live staining methods. In addition, we explore how DL approaches are revolutionizing cellular imaging analysis, offering innovative solutions to overcome these limitations and ushering in a new era of advanced and automated cell culture investigations.96
4.1 Limitations of Fluorescence Staining
Standard cell imaging workflows typically rely on fluorescence microscopy, employing either fixed or live fluorescent staining techniques to highlight specific cell structures. Despite their widespread use, both fixed and live fluorescent staining methods have limitations. These procedures can be invasive and toxic, potentially impacting cell health and behavior.97 In fixed staining, as for tissue, the fixation process itself can introduce artifacts by altering the native state of cellular components. In addition, the use of permeabilization methods compromises cell membrane integrity.98 Furthermore, fixed staining provides only a static view of cellular processes, limiting the ability to study dynamic processes. Conversely, live staining, while theoretically preserving the native state of cells, often alters their biological activity and can be toxic.99 The availability of specific and effective live-staining dyes can also be limiting, restricting the visualization of certain cellular components. In addition, real-time observation of cellular processes in live staining may be challenging due to phototoxicity and photobleaching over extended imaging periods.100 Lastly, the use of multiple fluorophores can lead to spectral cross talk between fluorescence channels, potentially resulting in misleading results and complicating image analysis. These challenges hinder the acquisition of reliable longitudinal data, which is often crucial for studying the effects of drug exposure over time.101
4.2 DL for Cellular and Subcellular Imaging
Recently, research has proposed the use of DL as an alternative to conventional physical staining methods to mitigate inherent problems. These works suggest replacing physical staining and fluorescence microscopy with a neural network that generates virtual fluorescence-stained images from unlabeled samples.
Virtual cell staining has been achieved from various imaging modalities, including phase contrast,55,102 QPI,103 and holographic microscopy.104 Moreover, recent studies have shown that bright-field images, despite their limited detail, contain sufficient information for a CNN to reproduce different types of staining.
For example, Ounkomol et al.105 introduced a CNN-based framework to map the relationship between paired 3D bright-field and fluorescence live-cell images for various key subcellular structures (e.g., DNA, cell membrane, nuclear envelope, and mitochondria). Each cellular component is modeled separately, with a U-Net trained independently for each one. The training process minimizes the mean-squared error between the ground-truth fluorescence image and the predicted image. Once trained, these models can be combined, allowing a single 3D bright-field input to generate multichannel, integrated fluorescence images across multiple subcellular structures. Particularly advantageous is that training requires relatively few paired examples (only 30 pairs per structure), lowering the entry barrier for machine learning.
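A minimal sketch of this per-structure training scheme is shown below, assuming paired 3D bright-field/fluorescence stacks and a placeholder `UNet3D` model; the loop simply minimizes the mean-squared error for one structure at a time, mirroring the strategy described above.

```python
import torch
import torch.nn as nn

def train_structure_model(model, loader, epochs=10, lr=1e-4):
    """Train one virtual-staining model for a single subcellular structure."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for brightfield, fluorescence in loader:        # paired 3D image stacks
            optimizer.zero_grad()
            loss = mse(model(brightfield), fluorescence)
            loss.backward()
            optimizer.step()
    return model

# One independently trained model per structure; predictions can later be
# stacked into a multichannel virtual-fluorescence image, e.g.:
# models = {name: train_structure_model(UNet3D(), loaders[name])
#           for name in ["dna", "membrane", "mitochondria"]}
```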
The work by Helgadottir et al.54 offers yet another compelling example of the potential of virtual staining of cellular structures from bright-field images. Similar to other approaches, this method relies on a modified version of the U-Net to learn the cross-modality mapping. However, it enhances the reconstruction accuracy by incorporating GAN-based training.
GANs have become a widely adopted framework in virtual cell staining due to their capacity to generate high-quality, realistic images.54,55,102
Figure 4.Virtual cell staining using DL. (a) Helgadottir et al. introduced a cGAN to virtually stain lipid droplets, cytoplasm, and nuclei using bright-field images of human stem-cell-derived fat cells (adipocytes). The U-Net-based generator processes bright-field image stacks captured at various focal planes.
While GANs have proven effective in enhancing the performance of virtual staining networks for various applications, they rely on co-registered input and ground-truth images. Nevertheless, obtaining perfectly co-registered training pairs is often challenging due to the rapid dynamics of biological processes or the incompatibility of different imaging modalities. To address this limitation, Li et al.55 introduced unsupervised content-preserving transformation for optical microscopy (UTOM). This approach utilizes a CycleGAN to transform images between domains without requiring paired data. Unlike traditional GAN models, CycleGANs employ two generator-discriminator pairs, one for each domain, to learn bidirectional mappings between imaging modalities [Fig. 4(c)]. UTOM has been applied, among other examples, to the virtual staining of phase-contrast images of differentiated human motor neurons, notably delivering competitive performance compared to a CNN architecture trained on paired samples under supervision despite the lack of paired training data [Fig. 4(d)].
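The core of this unpaired strategy is the cycle-consistency term, sketched below with placeholder generators `G_ab` and `G_ba`: images translated from one modality to the other and back are penalized for deviating from their starting point, which is what removes the need for pixel-aligned training pairs.

```python
import torch.nn as nn

l1 = nn.L1Loss()

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, lam=10.0):
    """Penalize round trips A -> B -> A and B -> A -> B that do not return to the input."""
    fake_b = G_ab(real_a)          # e.g., phase contrast -> fluorescence
    fake_a = G_ba(real_b)          # e.g., fluorescence -> phase contrast
    rec_a = G_ba(fake_b)           # round trip back to modality A
    rec_b = G_ab(fake_a)           # round trip back to modality B
    return lam * (l1(rec_a, real_a) + l1(rec_b, real_b))
```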
Importantly, although the architecture and training of the neural network play a decisive role in the performance of virtual staining models, the input imaging modality must capture sufficient contrast of the different cell structures, providing the network with enough information to learn the transformation to the desired high-contrast, high-specificity fluorescently stained samples.
Recent research has centered on the development of optical systems that capture the rich structural details of cells and embed inductive bias within the network to enhance its performance. For instance, Cheng et al.106 exploited the rich structural information and high sensitivity of reflectance microscopy to boost the performance of virtual staining models. Specifically, the authors employed an LED array reflectance microscope to acquire co-registered label-free reflectance and fluorescence images.107 This platform collects four dark-field reflectance images using half-annulus LED patterns oriented in different directions (top, bottom, left, and right). From these measurements, two dark-field reflectance differential phase contrast (drDPC) images are computed along orthogonal orientations [Fig. 4(e)]. Interestingly, the oblique-illumination dark-field and drDPC images provide complementary contrast information. While raw dark-field images highlight subcellular structures, such as nuclei, nucleoli, and hyperreflective areas near the nuclear periphery, the drDPC images emphasize cell membranes with clearly defined boundaries. These images serve as multichannel input for the virtual staining model, which, boosted by the enhanced resolution and sensitivity of the backscattering data, provides a reliable prediction of subcellular features [Fig. 4(f)].
In a similar vein, Cooke et al.108 proposed incorporating a physical model of the experimental microscope into the virtual staining model. This approach utilizes a CNN which incorporates a “physical layer” representing the microscope’s illumination model. Consequently, during training, the network learns task-specific LED patterns that significantly enhance its ability to infer fluorescence image information from label-free transmission microscopy images. This work, in particular, further underscores the importance of rich input data and highlights the potential combination of programmable optical elements and physics-informed DL to open new possibilities for exploring the structure and function of cells.
5 Molecular Imaging
One of the most significant advancements in molecular imaging, which involves the optical imaging of single biological molecules at micro- and nanoscales, has been the introduction of fluorescence microscopy techniques.3 However, unlike in tissue and cellular imaging, the lack of viable techniques for studying molecules without fluorescence currently prevents virtual staining in molecular imaging. Rather, the focus of cross-modality transformations in molecular imaging typically revolves around superresolution microscopy aiming to surpass diffraction-imposed limits in imaging molecules.16 Traditionally, achieving such high resolutions requires either expensive microscopy setups with specialized objectives, complex numerical estimations of the imaging process, or specific fluorophores.109 Nevertheless, recent research has indicated that DL-based approaches using generative learning (Sec. 2) can enhance the resolution of images captured with ordinary objectives, comparable to those obtained with costly specialized objectives. Further, DL-driven cross-modality transformations have demonstrated the ability to achieve superresolution across various microscope modalities.16,110
5.1 Physics of Superresolution Microscopy
When light from a point-like light source (an object with a diameter much smaller than the wavelength of light) traverses a lens, it undergoes diffraction, producing a characteristic pattern known as the Airy disk. This pattern comprises a bright central region surrounded by concentric rings of diminishing intensity. The Airy disk represents the smallest focal point achievable by a light beam. Below the object and Airy disk representations in Fig. 5, corresponding intensity plots illustrating the point spread functions (PSFs) are displayed. As Airy patterns approach each other, they interfere significantly, causing a reduction in contrast; they merge, becoming indistinguishable and limiting the spatial resolution. Spatial resolution, the shortest physical distance between two distinguishable points within an image, stands out as the single most important feature in optical microscopy.113 The primary constraints affecting the achievable spatial resolution stem from an intrinsic phenomenon of diffraction physics. Regardless of lens quality or optical component alignment, a microscope's resolution ultimately correlates with the wavelength of the detected scattered light and inversely with the numerical aperture (NA) of its objective. This relationship is shown in Fig. 5(a), where light from the sample traverses an objective to the image plane, generating a fundamentally limited diffraction pattern known as a PSF. The PSF inherently limits the minimal distance between two discernible points in the sample, shown in Fig. 5(b). The full width at half-maximum (FWHM) of a PSF in the lateral directions can be approximated as $\lambda / (2\,\mathrm{NA})$, where $\lambda$ represents the light's wavelength and NA denotes the numerical aperture of the objective. Thus, for a typical high-NA oil-immersion objective (NA ≈ 1.4), the resulting PSF has a lateral size of about 200 nm and an axial size of about 500 nm, effectively restricting the resolution to this range for visible-light studies.114 Comparing these scales with those depicted in Fig. 1, it becomes evident that the diffraction limit rarely poses a challenge in most imaging at organ, tissue, or even cellular levels. However, in cellular exploration, where subcellular and molecular structures are of interest, issues regarding diffraction limits become prominent. These issues are exacerbated by the typically dense distribution of molecules and subcellular structures, causing their PSFs to overlap, thus blurring many intricate details together. Hence, the development of superresolution techniques that surpass the diffraction limit becomes imperative for further exploration of these structures using noninvasive optical light. Various microscope techniques have been developed to overcome this limitation, including single-molecule localization microscopy (SMLM) methods, such as STORM,115 photo-activated localization microscopy (PALM),116 and fluorescence photoactivation localization microscopy.117 Other methods of transcending the standard resolution of microscopes exist, including complex numerical estimations of point spread (transfer) functions seeking to estimate the diffraction behavior, illumination pattern engineering methods that reduce the effective PSF size,118 as well as specialized fluorophores.109 However, these approaches pose their own challenges, including complex and multivariate dependencies on imaging conditions, making solving diffraction integrals of PSFs exceedingly difficult for practically relevant systems,119 as well as increased costs associated with the aforementioned fluorophores.120
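The quoted numbers follow directly from this approximation, as the short calculation below shows for an assumed green emission wavelength and a typical high-NA oil-immersion objective.

```python
wavelength_nm = 550.0       # assumed emission wavelength (green)
numerical_aperture = 1.4    # typical high-NA oil-immersion objective

fwhm_lateral_nm = wavelength_nm / (2.0 * numerical_aperture)
print(f"Lateral PSF FWHM ~ {fwhm_lateral_nm:.0f} nm")   # ~196 nm, i.e., roughly 200 nm
```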
Figure 5.Superresolution physical principles. (a) Illustration of the PSF resulting from imaging an object of sub-wavelength diameter.
In recent years, another promising avenue for achieving superresolution has emerged as a consequence of the astounding growth and success of DL-based computer vision algorithms. Analogous to the cross-modality transforms mentioned above, the DL approach to superresolution involves training neural networks to transform one imaging modality (regular-resolution images) into another (superresolved images). Some of these approaches utilize generative learning, effectively learning the complex interpolation function between regular- and superresolved images; others use direct supervised learning, for example, by estimating the positions of the underlying diffraction-limited emitters. The specific techniques for training these networks vary considerably across applications, as elaborated upon below.
5.2 DL for Superresolution Microscopy
In general, DL for superresolution can be categorized into two approaches, each with two learning paradigms. The first approach aims to enhance resolution by training end-to-end, directly transforming low-resolution images into high-resolution ones. This can be achieved through supervised learning, using pairs of simulated or experimentally measured images of the same sample at different resolutions to train neural networks, or through unsupervised learning, where only low-resolution or high-resolution images are obtained, either experimentally or through simulations. The other approach seeks resolution enhancement by training a network to output the position of each individual molecule (or equivalent scattering object) within an image and then reconstructing the high-resolution image from these positions, thus transcending the diffraction limit. This approach can also be trained in either a supervised or unsupervised fashion. A summary of the different models and their characteristics can be found in Table 1 (Sec. 6). Once such a network is trained, it can swiftly generate high-resolution images without the need for parameter adjustment, yielding an efficient algorithm for improving image resolution within a specific modality.121
5.2.1 End-to-end superresolution mapping
One common approach for supervised end-to-end low- to high-resolution mapping involves pre-upsampling the low-resolution image using a traditional interpolating algorithm, followed by training a CNN to refine the upsampled image until accurate superresolution is achieved. This approach, initially implemented for single-image superresolution,127 has been used in various biological applications, such as enhancing the resolution of magnetic resonance (MR) images.128
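A minimal sketch of this pre-upsampling strategy is given below: the low-resolution image is first interpolated to the target size, and a small CNN then predicts a residual correction. The architecture is SRCNN-like, but the exact layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementCNN(nn.Module):
    """Pre-upsampling superresolution: interpolate first, then learn a residual correction."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, 9, padding=4), nn.ReLU(),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 1, 5, padding=2),
        )

    def forward(self, low_res, scale=4):
        upsampled = F.interpolate(low_res, scale_factor=scale,
                                  mode="bicubic", align_corners=False)
        return upsampled + self.net(upsampled)          # refined high-resolution estimate

high_res = RefinementCNN()(torch.randn(1, 1, 64, 64))    # -> 1 x 1 x 256 x 256
```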
Another approach is to apply superresolution to the image after it undergoes computationally intensive CNN layers. This reduces the overall computational burden, as most of the computations are performed on low-resolution images. This approach, known for its efficiency, was first introduced by Dong et al.134 and has also been applied in various biological contexts, including superresolution of X-ray images,135 endoscopy images,136,137 cardiac images,138 and MR images.131
A different strategy for end-to-end low- to high-resolution mapping involves iteratively up- and downsampling the image using downsampling convolutional layers and upsampling transposed convolutional layers. This technique, utilized in the "back-projection" networks presented by Harris et al.,139 incorporates an error feedback mechanism for projection errors at each iteration stage. Each up- and downsampling stage is mutually connected through concatenation, reflecting the mutual dependence of low- and high-resolution image pairs, which the authors demonstrated to yield superior results across multiple data sets, as outlined in Table 1 (Sec. 6). This approach has also been applied in various biological applications, such as the transformation of CT brain scans into higher-resolution MR images140,141 for the detection of multiple sclerosis142 and Alzheimer's disease,143 as well as to cardiac MRI scans144 and 3D scans.145
Yet another approach involves sequentially upsampling low-resolution images in several separate steps using separate models, as introduced by Lai et al.146 This approach offers two main benefits. First, it allows the user to choose desired resolutions for their high-resolution images without retraining models. Second, it simplifies the learning task for individual networks, since their task is simpler than performing full superresolution in a single feedforward pass. This may potentially improve the performance of the final models.
5.2.2 Specific architectures and methods
Although the generic approaches mentioned above can result in a practically infinite variety of specific architectures, many significant superresolution studies in molecular microscopy have been achieved with a small number of named architectures. One example, shown in Fig. 6, is a structured-feature superresolution microscopy architecture that allows precise live-cell imaging with high spatial and temporal resolution to continuously monitor subcellular dynamics over extended periods. Among the most prominent named architectures are ANNA-PALM,60 a U-Net-based cGAN trained solely on experimental data; Deep-STORM,56 a CNN encoder–decoder network trained on simulated data; and smNet,57 which directly outputs molecule location, dipole orientation, and wavefront distortion from complex and subtle features of the PSF. In single-molecule superresolution microscopy, there is generally a trade-off between throughput and resolution. To construct a high-quality superresolution image, a large number of molecules need to be localized with high precision, requiring a sufficient number of localizations to adequately sample the structure of interest.
Figure 6.Superresolution applied architecture. The superresolution network enhances image resolution by training on pairs of simulated low-resolution (LR) and high-resolution ground-truth images or on wide-field (WF) and STORM images from a STORM microscope. First, the LR/WF image undergoes preprocessing through a subpixel edge detector to generate an edge map, both of which serve as inputs to the network. Training is guided by a multi-component loss function that incorporates the combination of multiscale structure similarity index measure and mean absolute error loss (MS-SSIM L1) to capture pixel-level accuracy between the superresolution (SR) and ground-truth/STORM images through multiscale similarity and mean absolute error, perceptual loss to assess feature map differences via the visual geometry group network, adversarial loss using a U-Net discriminator to differentiate ground-truth/STORM images from SR images, and frequency loss to compare differences in the frequency spectrum between SR and ground-truth/STORM images within a specific frequency range using the fast Fourier transform function. This comprehensive loss function helps the superresolution network model achieve precise and perceptually accurate superresolution imaging. Image adapted from Chen et al.
ANNA-PALM accomplishes this by training on a set of blinking single molecules from which a high-quality superresolution image can be experimentally acquired. A subset of these frames is used to generate a (low-quality) "sparse" superresolution image, which, alongside the diffraction-limited image and information about the imaged structure, serves as input to ANNA-PALM. The output, or label, is the full superresolution image reconstructed using all frames. Once trained, ANNA-PALM demonstrated the ability to provide high-quality results in imaging mitochondria, the nuclear pore complex, and microtubules60 at significantly higher speeds than conventional methods.
ANNA-PALM has proven to be a valuable method for accelerating the acquisition of high-density superresolution images by several orders of magnitude. However, many other DL models are better suited for direct single-molecule localization. An important early model for this is Deep-STORM, developed for the acquisition of superresolution images of microtubules with single or multiple overlapping PSFs. While non-DL algorithms exist for this purpose,148 they typically suffer from high computational costs and require sample-specific parameter tuning. Deep-STORM is an encoder–decoder CNN trained on simulated images consisting of PSFs placed at various positions on top of experimentally relevant background levels. These simulated images are then upsampled by a constant factor to constitute the superresolution targets, and the two versions of the same image are fed as input and output, respectively, to Deep-STORM during training. Such a model has been used to superresolve images of microtubules and quantum dots56 and has inspired work on localizing high-density ultrasound scatterers149 and CRISPR-Cas protein–DNA binding events.150
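The flavor of such simulation-based training data can be sketched as follows: random emitter positions are rendered as Gaussian PSFs on a background with shot noise (the network input), while the corresponding target marks the true positions on an upsampled grid. All parameters below are illustrative rather than those used in Deep-STORM.

```python
import numpy as np

def simulate_frame(size=64, upsample=8, n_emitters=10,
                   psf_sigma=1.5, background=100.0, rng=np.random):
    low_res = np.full((size, size), background)           # camera-resolution frame
    high_res = np.zeros((size * upsample, size * upsample))
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_emitters):
        x, y = rng.uniform(0, size - 1, 2)                 # random emitter position
        low_res = low_res + 1000.0 * np.exp(
            -((xx - x) ** 2 + (yy - y) ** 2) / (2 * psf_sigma ** 2))  # diffraction-limited spot
        high_res[int(round(y * upsample)), int(round(x * upsample))] = 1.0  # true position
    low_res = rng.poisson(low_res).astype(float)           # shot noise
    return low_res, high_res                               # network input and target
```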
Another impactful model is the aforementioned smNet, which similarly employs a (ResNet-inspired) CNN trained on simulated images for superresolution. The key distinctions are that smNet operates on 3D images and outputs the 3D coordinates and orientations of the PSF-convolved emitters, from which the superresolution image can be reconstructed. smNet has been demonstrated to localize highly astigmatic single-molecule PSFs in experimental images with significantly higher quality than conventional Gaussian fitting methods.57 This approach is reminiscent of DeepLoco,151 which was developed around the same time and trains neural networks to reconstruct simulated emitters in 3D through well-defined mathematical models of astigmatic PSFs. Other related examples of 3D superresolution are those of Zhou et al., who used a dual-GAN framework to directly superresolve fluorescence microscopy images of mouse brains and bodies,152 and Zhang et al.153 and Zelger et al.,154 who used a U-Net-based and a CNN-based approach, respectively, for superresolution in SMLM.
For images with higher PSF density, Speiser et al. introduced the method known as DECODE.58 This architecture consists of a stack of two U-Nets, where the first U-Net processes a feature representation of a single frame, and the second U-Net processes feature representations of consecutive frames. The output of this method consists of several channels, each containing information in each pixel of the input image regarding (1) the probability of containing an emitter, (2) its brightness, (3) its 3D coordinates, (4) its background intensity, and (5) epistemic uncertainty of its localization and brightness. In the work of Speiser et al.,58 this architecture is trained on simulated PSFs with a loss function connected to all five aforementioned types of pixel-level information and has been successfully applied to microtubules in conditions of low light exposure and ultrahigh sample densities.
5.2.3 Superresolution by emitter localization
Further, there are DL methods revolving around improving the precision of localizing underlying emitters in images. This is particularly relevant in SMLM imaging, where spatial resolution of the microscope is in practice directly correlated with the localization precision of single molecules.155 Since this localization is normally enabled by conventional heuristic-based fitting algorithms, using DL methods may enhance its performance. BGNet59 is one such architecture, designed to accurately identify the centroid of a PSF. It achieves this by training on (simulated) corrupted PSF images and outputting the background of the image. A trained BGNet can then be used to correct the background of a given image at inference time by subtracting its predicted background. Thus, one obtains background-corrected PSF images, which can be fed into conventional maximum likelihood estimation-fitting algorithms for superresolution, thereby enhancing the overall final output without the need to replace the entire analysis pipeline with an end-to-end DL-based system.
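Conceptually, this pipeline looks like the sketch below, where `background_net` stands for the trained background-prediction model and `mle_fit_centroid` for an existing localization routine; both are placeholders rather than parts of any published code base.

```python
import numpy as np

def localize_with_background_correction(psf_image, background_net, mle_fit_centroid):
    """Subtract a learned background estimate before conventional MLE fitting."""
    predicted_background = background_net(psf_image)      # learned structured background
    corrected = np.clip(psf_image - predicted_background, 0, None)  # keep counts non-negative
    return mle_fit_centroid(corrected)                    # existing localization routine
```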
5.2.4 Extracting additional information from PSFs
Further, DL methods aim to extract more information from PSFs themselves.10,147,156
The progress in DL for superresolution has been astounding in the past half-decade, driven mainly by different forms of CNNs trained within GAN frameworks through supervised learning. More recently, there have been highly promising developments in few-shot,161 single-shot,162 and zero-shot learning,163 and even untrained neural networks for image superresolution.164 Diffusion models have also shown significant promise in improving the fidelity and robustness of image superresolution methods.165
Recently, they have been used to generate superresolution images of microtubules,169 to reconstruct 3D microscopy data with previously unseen low axial resolution into high axial resolution,170 and to outperform the state of the art in high-fidelity continuous image superresolution.171 These advancements suggest that the field will continue to progress significantly in the near future.
6 Guidelines
This section provides detailed recommendations for developing cross-modality transformation models in microscopy, with an emphasis on data quality, model architecture selection, and evaluation metrics. Researchers can use this as a framework to navigate the key decisions and challenges associated with their tasks.
6.1 Data Quality, Augmentations, and Data Normalization
The quality of data plays a critical role in determining the performance of DL models. Two major issues commonly affect model quality: insufficient data to capture the variability within the data set or training data that fail to represent the conditions under which the model will be applied.
To detect the issue of insufficient data, a standard approach is to set aside a validation set that the model never sees during training. If the model’s performance on this validation set is significantly worse than on the training set, it likely indicates a lack of sufficient training data to generalize effectively. A common practice is to allocate ∼20% to 30% of the data set as a validation set. However, it is important to ensure that the validation set is maximally decorrelated from the training set to avoid misleading results. For example, it is ideal to sample from different locations in the sample or even from entirely different experimental videos. A poor sampling strategy, such as selecting every fifth frame from the same video, would introduce a high correlation between the training and validation sets, resulting in overly optimistic performance estimates.
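A simple way to enforce such decorrelation is to split at the level of whole videos or fields of view rather than individual frames, as in the sketch below (the variable names and split fraction are illustrative).

```python
import random

def split_by_experiment(frames_by_video, val_fraction=0.25, seed=0):
    """Assign whole videos to train or validation to keep the two sets decorrelated."""
    video_ids = sorted(frames_by_video)                  # e.g., {"video_01": [frames], ...}
    random.Random(seed).shuffle(video_ids)
    n_val = max(1, int(len(video_ids) * val_fraction))
    val_ids = set(video_ids[:n_val])
    train = [f for v in video_ids if v not in val_ids for f in frames_by_video[v]]
    val = [f for v in val_ids for f in frames_by_video[v]]
    return train, val
```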
If detected, data scarcity can be mitigated by data augmentation techniques to synthetically increase the diversity of training data. Techniques such as geometric transformations (rotation, scaling), noise injection, and intensity variation can simulate a broader range of conditions. However, care must be taken to ensure that these transformations can be meaningfully applied across modalities. For instance, intensity variations in quantitative phase contrast imaging hold physical significance and altering them synthetically could distort biologically relevant information. Geometric translations, in most cases, provide limited benefit, as convolutional models are inherently translation-equivariant. However, if the chosen model breaks translation equivariance (such as vision transformers), they may be useful.
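The sketch below illustrates augmentations that are usually safe for paired data: identical geometric transforms applied to both modalities and mild noise injection on the input only, with intensity changes deliberately left out for the reasons given above. The parameters are illustrative.

```python
import numpy as np

def augment_pair(source, target, rng=np.random):
    """Apply identical geometric transforms to a paired image; add noise to the input only."""
    k = rng.randint(4)                                     # random 90-degree rotation
    source, target = np.rot90(source, k), np.rot90(target, k)
    if rng.rand() < 0.5:                                   # random horizontal flip
        source, target = np.fliplr(source), np.fliplr(target)
    source = source + rng.normal(0.0, 0.01, source.shape)  # mild noise injection on the input
    return source.copy(), target.copy()
```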
Regularization techniques also play an essential role in improving model robustness, especially when data are scarce or noisy. Methods such as dropout, weight decay, or $L_1$/$L_2$ regularization are commonly used to prevent overfitting by penalizing overly complex models that may fit noise in the data rather than the underlying patterns. In scenarios where the model could easily memorize the training data, these techniques ensure that the model learns generalizable features rather than artifacts specific to the training set. Advanced regularization techniques, such as Bayesian regularization, can further improve robustness by incorporating uncertainty into the model's predictions, making them especially useful in tasks where noisy or variable data are expected.
Transfer learning offers another potential solution to address limited data availability. Pretrained models, especially those trained on large data sets from similar domains, can be fine-tuned to perform specific tasks in microscopy. By leveraging features learned from related tasks, transfer learning reduces the need for extensive training data while still allowing the model to generalize effectively. This approach not only speeds up training but also improves the model’s performance on smaller, domain-specific data sets. In some cases, transfer learning from pretrained models in related fields, such as medical imaging, can be more effective than starting from scratch, especially in scenarios where biological structures share visual characteristics across different imaging modalities.
When the training data are nonrepresentative, this issue can be identified by observing a drop in model performance under real-world conditions, even though the performance on the validation set remains strong. This discrepancy often arises due to variations in optical systems, sample preparation protocols, or environmental factors that differ from those present in the training data. For instance, subtle differences in microscope settings, sample staining techniques, or even temperature can cause a shift in the data distribution, leading to poor generalization when the model is applied in different scenarios.
The primary strategy to address issues of representativeness is through effective data normalization. Normalization techniques aim to reduce variability in the data by standardizing features across data sets, such as intensity scaling, contrast adjustment, or color normalization. This can help minimize discrepancies between data sets generated under different conditions. However, caution must be taken in modalities where quantitative relationships between intensity values are critical. In such cases, aggressive normalization may disrupt important mappings between intensity and biological features, potentially degrading the model’s ability to learn meaningful cross-modality transformations. Furthermore, domain adaptation techniques can be employed to align the distributions of training and application data, improving the robustness of the model across diverse conditions.
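A common, simple choice is percentile-based intensity normalization, sketched below; the percentile cutoffs are illustrative and should be used cautiously for quantitative modalities, as noted above.

```python
import numpy as np

def percentile_normalize(image, low=1.0, high=99.0):
    """Rescale intensities to [0, 1] based on robust percentiles."""
    lo, hi = np.percentile(image, [low, high])
    return np.clip((image - lo) / (hi - lo + 1e-8), 0.0, 1.0)
```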
6.2 Model Selection
The choice of model architecture can have a significant impact on the performance of the model and depends on several key factors, such as data availability, target task, and specific requirements. Here, we give a general guideline for choosing an appropriate model.
If your data are not aligned, CycleGAN is recommended. Aligned data refer to cases where each image in one modality has a direct counterpart in the other modality, meaning that both images capture the same part of the sample under the same conditions, making it possible to map pixel-to-pixel relationships between the two. When such paired data are unavailable, CycleGAN is suitable because it learns to map between modalities without requiring this strict correspondence. However, the less constrained training procedure is also likely to result in the model learning transformations that are less precise or less biologically relevant, especially when precise quantitative relationships between modalities are required. Careful evaluation and additional constraints may be necessary to ensure that the model's outputs are meaningful and accurate.
Assuming your data are paired and aligned, we recommend starting with a conditional GAN architecture, specifically using a U-Net-like generator and a spatial discriminator. A well-established configuration for this setup is the pix2pix model. Conditional GANs are optimized to generate quantitative, physically meaningful images by leveraging paired data to learn a direct mapping between input and output modalities. The U-Net generator was originally developed for biomedical images and is one of the most proven and widely adopted architectures for tasks involving fine-scale structural details. Spatial discriminators, in turn, evaluate the realism of local regions of the image rather than assessing it as a whole, often resulting in more detailed and accurate outputs.
However, depending on the specific requirements of your task, other architectures may be better suited. For example, if the goal is simply to enhance the contrast of specific substructures without requiring physical realism in the produced images, it may be more practical to forgo generative models entirely. In such cases, direct supervised training of a U-Net can offer a simpler and more stable solution. The drawback of using a purely supervised U-Net is that it may lack the ability to generate the nuanced, high-fidelity details that generative models, particularly GANs, are capable of producing. However, for applications where interpretability and stability are more important than photorealism, this trade-off can be worthwhile.
On the other hand, if maximal photorealism is required, diffusion models are worth considering. These models have consistently been shown to produce highly realistic images, often outperforming GANs in terms of image quality and stability. Diffusion models work by iteratively denoising random noise to generate an image, which allows them to better capture fine-grained details and complex textures. However, diffusion models are typically much more computationally expensive, both to train and to evaluate, compared to GANs. Moreover, one should be careful not to conflate photorealism with better quantitative performance on downstream tasks.
Another important consideration is the spatial distribution of information in the image. The U-Net generator is highly effective for analyzing local, position-invariant features, making it ideal for tasks where the meaning of a structure does not depend on its specific location within the image. However, for data where the spatial context is crucial, such as brain scans, attention-based models may be more suitable. Attention mechanisms allow the model to focus on specific regions of the image while considering their global relationships, enabling more context-aware analysis. This makes attention-based architectures a better choice for tasks that require understanding both local features and their larger spatial context.
Finally, for more specialized applications, more complex, hybrid models may be necessary. For tasks where interpretability is a priority, incorporating latent-space constraints can improve both stability and clarity in the results. For example, using a Wasserstein GAN (WGAN) with a carefully designed loss function can provide more control over the training process and generate smoother, more interpretable transformations. In addition, hybrid models that combine multiple architectures, such as variational autoencoders (VAEs) with GANs, can provide both generative flexibility and the ability to impose structural constraints, improving the model’s capacity to generate accurate, interpretable results for complex tasks. In superresolution tasks, specialized models such as Deep-STORM or DECODE utilize domain knowledge to far outperform what standard cGANs can achieve.
| Method | Discriminator | Generator/network | Data sets | Learning type | Significant aspect |
| --- | --- | --- | --- | --- | --- |
| Tissue | | | | | |
| Conditional GAN (cGAN) | CNN | U-Net | Paired labeled/annotated images | Supervised | Conditioning mechanism based on additional input information |
| CycleGAN | CNN | U-Net | Unpaired histology images | Unsupervised | Cycle-consistency loss enforces consistency and unsupervised translation |
| StarGAN | PatchGAN | U-Net/ResNet | Unpaired images of tissue structures from multiple domains | Unsupervised | Unified architecture for a single model across multiple domains |
| Cellular and subcellular structures | | | | | |
| Conditional GAN (cGAN) | CNN | U-Net | Paired images from fluorescence, confocal, and electron microscopy | Supervised | Conditioning mechanism based on additional input information |
| CycleGAN | CNN | U-Net | Unpaired images of bright-field, phase-contrast, fluorescence, and DIC microscopy | Unsupervised | Cycle-consistency loss enforces consistency and unsupervised translation |
| Molecular structures | | | | | |
| Deep-STORM | — | Encoder–decoder CNN | Fluorescent images from techniques such as STORM, PALM, or dSTORM | Supervised, labeled | Trained on simulated data to enhance resolution in SMLM, enabling superresolution imaging of molecular structures with improved accuracy |
| smNet | — | ResNet-inspired CNN | Fluorescent images from techniques such as STORM, PALM, or SIM | Supervised, labeled | Trained on simulated PSFs with ground-truth 3D position labels; accurately localizes astigmatic single-molecule PSFs |
| DECODE | — | Stack of two U-Nets | Images captured via optical diffraction tomography | Supervised, labeled | Stacked U-Nets process single and consecutive frames; improved accuracy and resolution under low-light conditions |
| DeepSTORM3D | — | Encoder–decoder CNN | Simulated images of fluorescent emitters with known positions, including PSFs, background noise, and the optical properties of the microscope | Supervised, labeled | Image formation model and decoder CNN pinpoint 3D emitter coordinates from simulated PSFs, enabling high-resolution volumetric molecular imaging |
| BGNet | — | CNN | Fluorescent images from fluorescence microscopy or SMLM | Supervised, labeled | Identifies PSF centroids for background correction, improving single-molecule localization and overall imaging resolution |
| ANNA-PALM | — | U-Net-based cGAN | Images of photoactivated single molecules captured by PALM or STORM | Supervised, labeled | Trained on experimental data to rapidly acquire high-density superresolution images, especially of mitochondria, nuclear pore complexes, and microtubules |
Table 1. Overview of the key parameters for common approaches of DL in microscopy across scales.
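For completeness, the cycle-consistency constraint listed for CycleGAN in Table 1 can be written in a few lines; the sketch below uses stand-in convolutional generators and random unpaired batches purely for illustration.

```python
# Minimal sketch of the cycle-consistency term: translating A -> B -> A should recover the input.
import torch
import torch.nn as nn

G_ab = nn.Conv2d(1, 1, 3, padding=1)           # stand-ins for full U-Net generators
G_ba = nn.Conv2d(1, 1, 3, padding=1)
l1 = nn.L1Loss()

real_a = torch.rand(4, 1, 64, 64)              # unpaired batch from modality A (placeholder)
real_b = torch.rand(4, 1, 64, 64)              # unpaired batch from modality B (placeholder)

cycle_loss = l1(G_ba(G_ab(real_a)), real_a) + l1(G_ab(G_ba(real_b)), real_b)
# In training, this term is added (with a weight, commonly around 10) to the adversarial losses.
```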
6.3 Evaluation Metrics
Evaluating the performance of a cross-modality transformation model can be challenging. Typical strategies measure the visual fidelity of the generated images, for example, with pixel-wise or perceptual similarity metrics such as the peak signal-to-noise ratio (PSNR) or the structural similarity index (SSIM), but such measures may not fully correlate with the retention of biologically relevant information.
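As a concrete example of such fidelity measures, the following sketch computes PSNR and SSIM with scikit-image on a placeholder image pair; in practice, the generated image would be compared against a co-registered experimental ground truth.

```python
# Minimal sketch of image-fidelity evaluation with PSNR and SSIM (scikit-image).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

ground_truth = np.random.rand(256, 256).astype(np.float32)   # placeholder for a real target image
generated = (ground_truth + 0.05 * np.random.randn(256, 256)).astype(np.float32)

psnr = peak_signal_noise_ratio(ground_truth, generated, data_range=1.0)
ssim = structural_similarity(ground_truth, generated, data_range=1.0)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```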
Another approach is to evaluate the biological relevance of the generated images by performing downstream analyses, such as cell counting or feature segmentation, and comparing the results to known quantities or those obtained from real experimental data. This method more directly assesses the retention of biologically meaningful information, but it introduces additional uncertainties. For example, inaccuracies in the downstream task, such as errors in the cell-counting algorithm, can confound the evaluation of the model’s performance, making it difficult to disentangle the model’s contributions from errors in postprocessing or analysis pipelines.
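A minimal sketch of such a downstream check is given below: both the real and the virtually generated images are segmented with a simple threshold-and-label pipeline, and the resulting object counts are compared. The segmentation strategy and placeholder data are illustrative assumptions; real pipelines are typically more elaborate.

```python
# Minimal sketch of a downstream-task comparison: object counts in real vs. generated images.
import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

def count_objects(image):
    mask = image > threshold_otsu(image)       # crude segmentation; real pipelines may differ
    return int(label(mask).max())              # number of connected components

real = np.random.rand(512, 512)                # placeholders for real and generated images
virtual = np.random.rand(512, 512)

n_real, n_virtual = count_objects(real), count_objects(virtual)
relative_error = abs(n_real - n_virtual) / max(n_real, 1)
print(f"real: {n_real}, virtual: {n_virtual}, relative error: {relative_error:.2%}")
```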
6.4 Ethical Considerations
Ethical considerations are essential for ensuring responsible and fair use of AI in the image analysis of biological samples. A key concern is protecting the privacy of patients and donors, as these samples often contain sensitive personal information. Handling biological samples must comply with data protection laws, which require informed consent from all parties involved and transparency about how the samples will be used. These laws vary by region, with the General Data Protection Regulation (GDPR) in the European Union,172 the Health Insurance Portability and Accountability Act (HIPAA) in the United States,173 the Data Protection Act 2018 in the United Kingdom,174 and the Personal Information Protection Law (PIPL) in China.175 International efforts, such as those led by the World Health Organization,176,177 along with national initiatives, such as India’s data protection frameworks,178 continue to evolve these regulations to keep pace with the rapid growth of AI technologies.179
These regulations address issues such as privacy breaches due to improper data use and cybersecurity threats.185 Furthermore, the responsibility for AI models used in diagnostics is a major concern, particularly in relation to bias mitigation. Biased data sets can result in inaccurate or discriminatory outcomes, especially in healthcare applications. Rigorous validation of AI models is critical to ensure accuracy, reproducibility, and the prevention of errors that may lead to misdiagnosis or flawed scientific conclusions. Transparency is also crucial, requiring clear documentation of model training, data sources, usage frameworks, and decision-making processes.186 Guidelines should promote not only sharing data sets but also the trained model weights, enabling researchers to independently validate and replicate findings. Lastly, accountability frameworks are necessary to ensure that researchers and developers are held responsible for the ethical use of AI, with proper oversight to enforce compliance with established guidelines.
7 Perspectives
Cross-modality transformations in biological microscopy present advanced techniques with important implications for biology, medicine, and materials science. Although these advances suggest exciting opportunities, with AI set to increase diagnostic accuracy and improve workflow efficiency, the field still faces challenges that require innovative solutions before such methods can be implemented and deliver societal benefit. Figure 7 summarizes the ongoing developments and potential outcomes of combining AI with novel imaging modalities.
Figure 7.Potential application perspectives of AI in the imaging of biological samples. Current developments found in the literature are contained in green boxes, while speculative prospects for the future are contained in yellow boxes. Starting from the top left, AI is extensively used in diagnostics such as virtual staining and other cross-modality transforms (image in the green panel adapted from Li et al.
For example, the synthesis of high-resolution images from less invasive imaging methods such as MRI and CT scans provides tissue insights without the need for biopsies,37,49 illustrated in Fig. 7(a). This approach offers another important advantage: access to living-tissue data that can have a significant impact on clinical studies. Similar strategies may also advance preclinical in vitro cell-culture studies by creating physiologically relevant 3D environments, such as promoting cell growth into spheroids or inoculating cells on microphysiology platforms.190 Recent advances in this technology have enabled co-cultures of single or multiple cell types that more closely mimic in vivo conditions by implementing 3D architectures, fluid dynamics, and the material gradients found in living tissues. In these environments, extracting probe-free information on cellular behavior, metabolic states, or migration is currently not feasible, but it may soon become achievable through AI combined with various imaging modalities, as presented in the bottom right panel.
It is also likely that new imaging technologies, combined with AI, will integrate data across scales to provide an unprecedented perspective on diseases at the cellular and molecular levels, portrayed in Fig. 7(b). This approach would produce realistic models that combine visual, molecular, and genomic information. By integrating data from a variety of modalities, AI will enable a more comprehensive analysis of biological models, including the practical information required for the diagnosis of complex diseases where tissue morphology and function are crucial, exemplified in Fig. 7(d). In addition, longitudinal disease monitoring could become more sensitive, allowing clinicians to track tissue changes over time, shown in Fig. 7(c), and tailor treatment responses accordingly. This possibility extends to transplantation biology, where the cellular integration, biocompatibility, and biodegradation of transplanted tissues, synthetic materials, or prostheses can be monitored over time. In summary, AI-powered tools will enable faster and more accurate diagnostics with reduced bias, thus minimizing the time required for human review. Over time, these advancements have the potential to make high-quality histological analysis more accessible, particularly in areas with limited pathology expertise, while also standardizing diagnostic protocols across institutions.
Beyond diagnostics, AI is already being used to guide physicians during advanced surgeries, directing the surgeon’s movements189 with accuracy and precision. Future cross-modality transformations may facilitate real-time tissue mapping during surgery, giving surgeons immediate insights from various imaging techniques, represented in the top right panels. AI-powered methods are also being used in drug-discovery research to identify new drug candidates and their potential folding structures,188 as demonstrated in the bottom central panels. Subsequent preclinical studies that predict how tissues will respond to novel treatments could be envisioned using cross-modality transformations, as seen in the bottom left panel. In the future, AI may simulate the effects of drugs on a patient’s tissue, aiding the development of personalized therapies. However, challenges remain, including the need for high-quality multimodal data sets to train AI systems and the development of interpretable AI models that biologists and clinicians can trust. In addition, integrating AI into clinical workflows requires careful consideration to ensure these new technologies are used effectively and ethically by healthcare professionals. Despite these hurdles, the future of AI-driven cross-modality transformations in biology is promising, with profound implications for both biomedical research and clinical diagnostics.
8 Conclusions
The incorporation of DL techniques in biological microscopy represents a significant advancement, with the potential to enhance our understanding of histology, cellular structures, and molecular imaging. While these technologies offer promise, it is essential to acknowledge that the field is still evolving. The current state of these methods often involves grappling with their black-box nature, necessitating further refinement and investigation. Researchers continue to address challenges related to interpretability, and extensive development is still needed to unlock the full transformative potential of DL in biological microscopy. Beyond the technological advances, these methods offer a paradigm shift by enabling imaging without reliance on chemical stains or fluorescent labels, simplifying experimental workflows while preserving the integrity of the specimens under investigation. This noninvasive, label-free alternative marks a pivotal shift in microscopy. Cross-modality transformations have a significant impact not only in laboratory settings but also in clinical diagnostics and fundamental biological research, opening new avenues for discoveries and breakthroughs. Furthermore, these techniques are becoming more accessible and affordable, democratizing access to microscopic exploration and empowering researchers across disciplines.
Jesús Manuel Antúnez Domínguez is a biophysicist with expertise in microscopic approaches to bacterial collective behaviour. Holding an industrial PhD in biophysics, he has experience in both academia and industry, notably at the Innovation Unit of Elvesys in Paris and the Biophysics Lab in the Department of Physics at the University of Gothenburg. His research interests span microfluidics, active matter, and advanced image analysis.
Giovanni Volpe is a professor of physics at the University of Gothenburg, with expertise in artificial intelligence, complex systems, and active matter. He leads interdisciplinary research exploring AI-driven solutions to understand emergent behaviors in biological and synthetic systems. He is currently co-authoring the book Deep Learning Crash Course for No Starch Press. His work integrates experimental, theoretical, and computational approaches, contributing widely to scientific literature and innovation.
Caroline B. Adiels is an associate professor of biophysics at the University of Gothenburg, with a focus on microfluidics, biology, and artificial intelligence applications. She leads an interdisciplinary research group dedicated to advancing single-cell analysis and communication studies using optics and microfluidics, which extends to organ-on-a-chip technology. Her work integrates AI-based image analysis software tailored for life sciences, creating a research portfolio that bridges the fields of physics and biology.
Biographies of the other authors are not available.
References
[1] D. Murphy, M. Davidson. Fundamentals of Light Microscopy and Electronic Imaging(2012).
[2] K. Suvarna, C. Layton, J. Bancroft. Bancroft’s Theory and Practice of Histological Techniques E-Book(2012).
[3] J. W. Lichtman, J.-A. Conchello. Fluorescence microscopy. Nat. Methods, 2, 910-919(2005).
[4] I. Johnson. Molecular probes handbook: a guide to fluorescent probes and labeling technologies(2010).
[6] S. Shashkova, M. Leake. Single-molecule fluorescence microscopy review: shedding new light on old problems. Biosci. Rep., 37, BSR20170031(2017).
[8] F. Helmchen, W. Denk. Deep tissue two-photon microscopy. Nat. Methods, 2, 932-940(2005).
[11] J. Ferreira, L. Groc. Surface Glutamate Receptor Nanoscale Organization with Super-Resolution Microscopy (dSTORM), 35-52(2024).
[17] X. Cao et al. Deep learning based inter-modality image registration supervised by intra-modality similarity. Lecture Notes in Computer Science, 55-63(2018).
[19] J. Johnson, A. Alahi, L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. Lecture Notes in Computer Science, 694-711(2016).
[20] C. Ledig et al. Photo-realistic single image super-resolution using a generative adversarial network, 15640-15649(2017).
[21] K. Zhang et al. Negative-aware attention framework for image-text matching(2022).
[22] T. S. Gurina, L. Simms. Histology, staining(2023).
[23] C. L. Chen et al. Deep learning in label-free cell classification. Sci. Rep., 6, 21471(2016).
[25] B. Bai et al. Label-free virtual HER2 immunohistochemical staining of breast tissue using deep learning. BME Frontiers, 2022, 9786242(2022).
[28] D. S. Richardson, J. W. Lichtman. Clarifying tissue clearing. Cell, 162, 246-257(2015).
[29] X. Wang et al. Single-shot isotropic differential interference contrast microscopy. Nat. Commun., 14, 2063(2023).
[34] E. Breznik et al. Cross-modality sub-image retrieval using contrastive multimodal image representations. Sci. Rep., 14, 18798(2024).
[35] A. Lahiani et al. Virtualization of tissue staining in digital pathology using an unsupervised deep learning approach. Lecture Notes in Comput. Sci., 47-55(2019).
[48] Y. Hong et al. Deep learning-based virtual cytokeratin staining of gastric carcinomas to measure tumor–stroma ratio. Sci. Rep., 11, 19255(2021).
[50] J. Y. Zhu et al. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2242-2251(2017).
[55] X. Li et al. Unsupervised content-preserving transformation for optical microscopy. Light Sci. Appl., 10, 44(2021).
[61] R. Sanyal, D. Kar, R. Sarkar. Carcinoma type classification from high-resolution breast microscopy images using a hybrid ensemble of deep convolutional features and gradient boosting trees classifiers. IEEE/ACM Trans. Comput. Biol. Bioinf., 19, 2124-2136(2021).
[65] Q. Dou et al. Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss, 691-697(2018).
[72] M. Dohmen et al. Similarity metrics for MR image-to-image translation(2024).
[76] M. Luella, B. Paul, A. Javad. Generative AI in medical imaging and its application in low dose computed tomography (CT) image denoising. Applications of Generative AI, 387-401(2024).
[77] Y. Choi et al. StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, 8789-8797(2018).
[78] J. Vasiljević et al. HistostarGAN: a unified approach to stain normalisation, stain transfer and stain invariant segmentation in renal histopathology. Knowl.-Based Syst., 277, 110780(2023).
[82] S. Banerji, S. Mitra. Deep learning in histopathology: a review. WIREs Data Min. Knowl. Discovery, 12, e1439(2022).
[83] J. Xu et al. Deep Learning for Histopathological Image Analysis: Towards Computerized Diagnosis on Cancers, 73-95(2017).
[89] P. A. Moghadam et al. A morphology focused diffusion probabilistic model for synthesis of histopathology images, 1999-2008(2023).
[90] A. Greenspan, Y. Shen, J. Ke et al. StainDiff: transfer stain styles of histology images with denoising diffusion probabilistic models and self-ensemble. Med. Image Computing and Computer Assisted Intervention, 549-559(2023).
[91] T. Kataria, B. Knudsen, S. Y. Elhabian. StainDiffuser: multitask dual diffusion model for virtual staining(2024).
[92] S. Dubey, X. Xu et al. VIMS: virtual immunohistochemistry multiplex staining via text-to-stain diffusion trained on uniplex stains. Mach. Learn. Med. Imaging, 143-155(2024).
[93] T. M. Abraham, R. Levenson. A comparison of diffusion models and CycleGANs for virtual staining of slide-free microscopy images, 1-6(2023).
[95] O. Kepp et al. Cell death assays for drug discovery. Nat. Rev. Drug Disc., 10, 221-237(2011).
[96] E. Moen et al. Deep learning for cellular image analysis. Nat. Methods, 16, 1233-1246(2019).
[108] C. L. Cooke et al. Physics-enhanced machine learning for virtual fluorescence microscopy, 3803-3813(2021).
[113] X. Zhuang. Nano-imaging with STORM. Nat. Photonics, 3, 365-367(2009).
[119] E. Wolf. Progress in Optics(2008).
[121] C. Ledig et al. Photo-realistic single image super-resolution using a generative adversarial network, 105-114(2016).
[123] C. Chen et al. Synergistic image and feature adaptation: towards cross-modality domain adaptation for medical image segmentation, 865-872(2019).
[124] D. Maleki, H. Tizhoosh. LILE: look in-depth before looking elsewhere: a dual attention network using transformers for cross-modal information retrieval in histopathology archives(2022).
[125] Q. Yang et al. MRI cross-modality image-to-image translation. Sci. Rep., 10, 3753(2020).
[126] R. Naseem et al. Cross modality guided liver image enhancement of CT using MRI, 46-51(2019).
[127] C. Dong et al. Learning a deep convolutional network for image super-resolution, 184-199(2014).
[133] E. Kang, J. Min, J. C. Ye. A deep convolutional neural network using directional wavelets for low-dose X-ray CT reconstruction. Med. Phys., 44, e360-e375(2017).
[134] C. Dong, C. C. Loy, X. Tang. Accelerating the super-resolution convolutional neural network. Computer Vision–ECCV 2016, 391-407(2016).
[139] M. Haris, G. Shakhnarovich, N. Ukita. Deep back-projection networks for super-resolution, 1664-1673(2018).
[144] J.-Y. Lin, Y.-C. Chang, W. H. Hsu. Efficient and phase-aware video super-resolution for cardiac MRI, 66-76(2020).
[145] Y. Huang, L. Shao, A. F. Frangi. Simultaneous super-resolution and cross-modality synthesis of 3D medical images using weakly-supervised joint convolutional sparse coding, 5787-5796(2017).
[146] W.-S. Lai et al. Deep Laplacian pyramid networks for fast and accurate super-resolution, 624-632(2017).
[151] N. Boyd et al. DeepLOCO: fast 3D localization microscopy using neural networks. bioRxiv(2018).
[163] J. Soh, S. Cho, N. Cho. Meta-transfer learning for zero-shot super-resolution, 3513-3522(2020).
[165] H. Sahak et al. Denoising diffusion probabilistic models for robust image super-resolution in the wild(2023).
[166] S. Gao et al. Implicit diffusion models for continuous super-resolution, 10021-10030(2023).
[167] J. Ho et al. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23, 1-33(2021).
[168] R. Rombach et al. High-resolution image synthesis with latent diffusion models, 10674-10685(2021).
[169] A. Saguy et al. This microtubule does not exist: super-resolution microscopy image generation by a diffusion model, 2400672(2024).
[170] H. Greenspan, M. Pan et al. DiffuseIR: diffusion models for isotropic reconstruction of 3D microscopic images, 323-332(2023).
[171] S. Gao et al. GDPR Requirements for Biobanking Activities Across Europe, 10021-10030(2023).
[172] V. Colcelli et al. GDPR Requirements for Biobanking Activities Across Europe(2023).
[173] Health Insurance Portability and Accountability Act of 1996 (HIPAA), Pub. L. 104-191(1996).
[174] Data Protection Act 2018(2018).
[176] D. Dhingra, A. Dabas. Global strategy on digital health. Indian Pediatrics, 57, 356-358(2020).
[177] The protection of personal data in health information systems–principles and processes for public health(2020).
[180] A. S. Pillai. Utilizing deep learning in medical image analysis for enhanced diagnostic accuracy and patient care: challenges, opportunities, and ethical implications. J. Deep Learn. Genomic Data Anal., 1, 1-17(2021).
[182] N. Forgó et al. Big data, AI and health data: between national, European, and international legal frameworks. Legal Challenges in the New Digital Age, 358-394(2023).
[186] K. Grünberg et al. Ethical and privacy aspects of using medical image data. Cloud-Based Benchmarking of Medical Image Analysis, 33-43(2017).
[189] J. Zhang et al. AI co-pilot bronchoscope robot. Nat. Commun., 15, 241(2024).