In scientific and industrial research, three-dimensional (3D) imaging, or depth measurement, is a critical tool that provides detailed insight into surface properties. Confocal microscopy, known for its precision in surface measurements, plays a key role in this field. However, 3D imaging based on confocal microscopy is often limited by large data requirements and slow measurement speeds. In this paper, we present a novel self-supervised learning algorithm, SSL Depth, that overcomes these challenges. Specifically, our method exploits the feature learning capabilities of neural networks while avoiding the labeled data sets required by supervised learning approaches. Through practical demonstrations on a commercially available confocal microscope, we find that our method not only delivers higher-quality results but also significantly reduces the z-axis sampling frequency required for 3D imaging. This reduction yields a 16× speedup in measurement, with the potential for further acceleration in the future. Our method enables highly efficient and accurate 3D surface reconstructions, thereby expanding the potential applications of confocal microscopy in various scientific and industrial fields.

- Chinese Optics Letters
- Vol. 22, Issue 6, 060002 (2024)
1. Introduction
Three-dimensional (3D) surface imaging is an important technique in a variety of fields and has been realized by several methods. Classical interferometry[1,2] uses light interference to make precise surface measurements. Geometric moiré and holographic interferometry[3,4] provide insight into surface deformation by analyzing interference patterns. Digital holography[5–9] advances this approach by using digital techniques for enhanced analysis. Fringe projection profilometry and deflectometry[10–13] use structured light to reveal surface topographies. Digital image correlation and stereovision provide noncontact surface imaging. Finally, confocal microscopy[14–16] measures surfaces by peak detection during axial scanning.
In this paper, we focus on confocal surface metrology, a method that uses
With the advent of deep learning, supervised learning methods have been applied to 3D surface measurement[20], but these approaches introduce new challenges. First, neural networks trained on labeled scenarios struggle to adapt to new, unlabeled microscopy environments[21]. Second, the reliance on labeled data limits measurement capability to the range covered by the provided labels and restricts exploration beyond that range.
To address these challenges, we introduce a novel transformer architecture[22,23], SSL Depth, which incorporates physical information and utilizes self-supervised learning (SSL). Specifically, our method efficiently predicts the intensity, depth, and width of a Gaussian function directly from raw images captured by confocal microscopy. Here, the intensity corresponds to the scattered light intensity of the surface, the depth represents the morphology, and the width indicates the peak width of the confocal system in the
2. Framework for SSL
SSL has the ability to extract physical parameters directly from raw microscopy images[24]. Central to our approach, as shown in Fig. 1, is a physics-informed transformer for analyzing confocal scanning images. Inspired by the spatiotemporal vision transformer (ViT)[25], our model incorporates a high-dimensional data embedding scheme that allows the application of pretrained weights from foundation image models[26], enhancing the capabilities of the network. Additionally, we integrate adapters[27] to accelerate the training process. Our network consists of three decoders for predicting intensity, depth, and width. These predictions are then synthesized by a Gaussian function to reconstruct the raw images.
Figure 1. Network architecture. The encoder consists of a ViT and an adapter for accelerated training. Intensity, depth, and width are predicted by three decoders to finally reconstruct the raw data.
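The reconstruction step described above can be illustrated with a minimal NumPy sketch (not the authors' JAX code). It assumes the Gaussian is parameterized by a standard deviation and that depth and width are expressed in units of z-steps; the function name `reconstruct_stack` and the toy values are hypothetical.

```python
import numpy as np

def reconstruct_stack(intensity, depth, width, num_steps):
    """Synthesize a (Z, H, W) stack from per-pixel Gaussian parameters.

    intensity, depth, width: arrays of shape (H, W).
    Assumption: depth and width are measured in z-steps and width acts
    as the standard deviation of the axial response.
    """
    z = np.arange(num_steps, dtype=np.float64)[:, None, None]  # (Z, 1, 1)
    return intensity[None] * np.exp(-((z - depth[None]) ** 2) / (2.0 * width[None] ** 2))

# Toy example: a 2x2 field of view scanned over 11 z-steps.
intensity = np.array([[1.0, 0.5], [0.8, 0.2]])
depth = np.array([[3.0, 5.0], [7.0, 2.0]])
width = np.full((2, 2), 1.5)
stack = reconstruct_stack(intensity, depth, width, num_steps=11)

# Peak detection along z recovers the depth map used to build the stack.
recovered_depth = stack.argmax(axis=0)
```

In a self-supervised setting, a stack synthesized this way would be compared with the raw measured stack, so that no depth labels are needed.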
As shown in Fig. 2(a), the patch embedding module in our method is inspired by the handling of the temporal dimension in spatiotemporal ViT. We use a 3D convolution operation to transform stacks of confocal microscopy images scanned along the
Figure 2. Network module design. (a) Conversion of the microscopy image stack into patches by a 3D convolution; (b) prediction head design for intensity, depth, and width.
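A patch embedding of this kind can be sketched in NumPy as follows. The sketch assumes that the 3D convolution uses a stride equal to its kernel size, as is common in ViT-style models, so that it reduces to cutting non-overlapping blocks and applying one shared linear projection. The patch size, embedding dimension, and the name `patch_embed_3d` are hypothetical and not taken from the paper.

```python
import numpy as np

def patch_embed_3d(stack, patch, weight, bias):
    """Embed a (Z, H, W) stack into tokens.

    Assumption: stride == kernel size, so the 3D convolution is
    equivalent to flattening non-overlapping (pz, ph, pw) blocks and
    multiplying each by the same weight matrix.
    """
    pz, ph, pw = patch
    Z, H, W = stack.shape
    assert Z % pz == 0 and H % ph == 0 and W % pw == 0, "stack must tile exactly"
    nz, nh, nw = Z // pz, H // ph, W // pw
    # Split every axis into (number of blocks, block size), then group the
    # block indices first and the within-block indices last.
    blocks = stack.reshape(nz, pz, nh, ph, nw, pw).transpose(0, 2, 4, 1, 3, 5)
    tokens = blocks.reshape(nz * nh * nw, pz * ph * pw)
    return tokens @ weight + bias

rng = np.random.default_rng(0)
stack = rng.random((8, 32, 32))          # toy stack: 8 z-slices of 32x32 pixels
patch = (2, 16, 16)                      # hypothetical patch size
embed_dim = 64                           # hypothetical embedding dimension
weight = rng.normal(size=(2 * 16 * 16, embed_dim)) * 0.02
bias = np.zeros(embed_dim)
tokens = patch_embed_3d(stack, patch, weight, bias)  # one token per 3D patch
```

Because the z-axis is patched together with the two lateral axes, the number of tokens grows with the scan depth, which is one reason 3D inputs are more expensive than 2D images.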
To predict physical features, we decode each feature separately using dedicated decoders designed for physical information. These decoders consist of convolutional and upsampling operations that transform the features into images of the same size as the input. As shown in Fig. 2(b), we process three physical parameters: intensity, depth, and width. (1) Since the raw confocal microscopy images are normalized, we apply a sigmoid function to the intensity to ensure that it remains in the range 0–1. (2) Since the measurable morphology is within the range of the confocal scan, we apply a sigmoid function to the depth and then multiply it by the number of steps in the
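The range constraints on the intensity and depth heads can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the decoder outputs are treated as unconstrained logits, `num_steps=100` is a placeholder value, and the function name `constrain_outputs` is hypothetical. The width head is omitted from the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def constrain_outputs(intensity_logits, depth_logits, num_steps):
    """Map raw decoder outputs to physically meaningful ranges."""
    # Raw images are normalized, so intensity is kept within (0, 1).
    intensity = sigmoid(intensity_logits)
    # Depth must lie inside the scan range, expressed here in z-steps.
    depth = sigmoid(depth_logits) * num_steps
    return intensity, depth

logits = np.array([[-4.0, 0.0], [2.0, 6.0]])   # toy decoder outputs
intensity, depth = constrain_outputs(logits, logits, num_steps=100)
```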
In our self-supervised training process, the goal is to ensure that the reconstructed stack is highly similar to the raw confocal microscopy images, a task that involves measuring the similarity between 3D stacks. To achieve this, we apply the sum of mean absolute difference (
3. Data Collection
For our measurements, we use a commercial confocal microscope, Sensofar S-mart Optical Profilometer[28]. As shown in Fig. 3, the objects of our measurements are surface roughness comparators according to the ISO 2632/1-1975 standard. These comparators are tools used to determine surface roughness by comparative methods, visual estimation, or with the aid of a magnifying glass. Our focus is on measuring standard parts labeled “VERTICAL MILLING.” We perform three sets of measurements on the same field of view but with different
Figure 3. Examples of raw measurement data. Scale bar is 50 µm.
4. Results
4.1. Training
In our study, the raw measurement data have pixel sizes of (151, 1028, 1232) for the
We develop our code using JAX[30], a high-performance numerical computing library that provides an expressive array programming model and is optimized for both GPU and CPU performance. Training is performed on a single NVIDIA Tesla A100 40 GB GPU. For the
4.2. Evaluation
After training, we save the weights for the inference process. We start by performing a central crop on the raw data in the
4.3. Comparable measurement capability at 16× speed
We compared the results of traditional commercial algorithm processing data in
Figure 4. Comparative Experiment 1. (a) Confocal microscope intensity image, with the yellow area indicating the field of view for subsequent analysis; (b) depth image obtained from 1× mode data using the traditional commercial algorithm; (c) depth image obtained from 1× mode data using SSL Depth; (d) depth image obtained from 4× mode data using the traditional commercial algorithm; (e) depth image obtained from 4× mode data using SSL Depth; (f) depth image obtained from 16× mode data using SSL Depth; (g) mean absolute error (L1) along the cross-sectional line, taking the traditional commercial 1× result as the ground truth. The red dashed line shows the error at 4× speed for the commercial microscope. Scale bar is 50 µm.
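The error metric in panel (g) can be illustrated with a short NumPy sketch. It assumes the cross-sectional line is a single image row and uses made-up depth values. The function name `line_abs_error` is hypothetical, and the reference map stands in for the commercial 1× result.

```python
import numpy as np

def line_abs_error(depth_test, depth_ref, row):
    """Absolute depth error along one horizontal cross-section (image row)."""
    return np.abs(depth_test[row] - depth_ref[row])

# Toy depth maps in arbitrary units; depth_ref plays the role of the reference.
depth_ref = np.array([[0.0, 1.0, 2.0, 3.0],
                      [1.0, 2.0, 3.0, 4.0]])
depth_test = depth_ref + np.array([[0.1, -0.1, 0.2, 0.0],
                                   [0.0, 0.3, -0.1, 0.2]])

profile = line_abs_error(depth_test, depth_ref, row=1)  # per-pixel error along the line
mae = profile.mean()                                    # mean absolute error (L1)
```

Repeating the same computation for each method and speed mode would allow the results to be compared against a common reference.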
4.4. Superior measurement with the same data
We further discover that SSL Depth can give better results under the same standard measurement conditions. As shown in Fig. 5(a), we select a region measuring
Figure 5. Comparative Experiment 2. (a) Confocal microscope intensity image, with the yellow area indicating the field of view for subsequent analysis; (b) depth image obtained from 1× mode data using the traditional commercial algorithm; (c) depth image obtained from 1× mode data using SSL Depth. Scale bar is 50 µm.
5. Discussion
Our SSL approach does not rely on additional prior knowledge, such as labeled data or extra prior rules, nor does it require collecting more data for the same measurement target. Instead, it makes more comprehensive use of the measurement data than standard fitting methods do. Specifically, we use SSL to learn effective representations of the data distribution from 40 raw measurement stacks. This process essentially compresses the raw data and acts as an implicit constraint on the solution. Traditional methods, which typically combine function fitting, filtering, and image processing, operate only at the pixel level and do not support feature learning. Our method is more effective because it goes beyond pixel-level processing to learn and exploit deeper features of the data.
One disadvantage of our method is its high memory consumption and long training time (about 2 days). This is because our network processes 3D data directly, which produces many more patches than typical visual tasks. This is an area for future improvement. However, the inference stage is fast, processing a complete data stack in less than 10 s. In addition, our model only needs to be trained once for samples of the same type. In the future, we plan to explore training on a larger data set to achieve sufficient generalizability. This would allow the model to be applied directly to entirely new types of samples, facilitating easier deployment and broader adoption.
6. Conclusion
In this paper, we present a self-supervised approach, SSL Depth, for processing 3D surface measurements using confocal microscopy. We have developed a novel architecture based on the ViT, which incorporates physical information relevant to 3D surface measurements. This method is trained directly on raw confocal microscopy measurement stacks without the need for labeled data. Our approach demonstrates superiority over the traditional commercial algorithm in confocal microscopy measurements of surface roughness comparators. Not only can our method achieve comparable measurement capability with
References
[1] K. Creath. V phase-measurement interferometry techniques. Prog. Opt., 26, 349(1988).
[2] P. Hariharan. Basics of Interferometry(2010).
[4] D. Post, B. Han, P. Ifju. High Sensitivity Moiré: Experimental Analysis for Mechanics and Materials(2012).
[5] U. Schnars, C. Falldorf, J. Watson et al. Digital holography and wavefront sensing. Principles, Techniques and Applications(2015).
[11] J. Geng. Structured-light 3D surface imaging: a tutorial. Adv. Opt. Photonics, 3, 128(2011).
[12] M. C. Knauer, J. Kaminski, G. Hausler. Phase measuring deflectometry: a new approach to measure specular free-form surfaces. Optical Metrology in Production Engineering, 366(2004).
[14] M. Minsky. Memoir on inventing the confocal scanning microscope. Scanning, 10, 128(1988).
[17] R. Artigas. Imaging confocal microscopy. Optical Measurement of Surface Topography, 237(2011).
[22] A. Vaswani, N. Shazeer, N. Parmar et al. Attention is all you need. Advances in Neural Information Processing Systems, 1(2017).
[23] A. Dosovitskiy, L. Beyer, A. Kolesnikov et al. An image is worth 16 × 16 words: transformers for image recognition at scale(2020).
[25] C. Feichtenhofer, Y. Li, K. He et al. Masked autoencoders as spatiotemporal learners. Adv. Neural Inf. Process. Syst., 35, 35946(2022).
[26] A. Radford, J. W. Kim, C. Hallacy et al. Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 8748(2021).
[27] Z. Chen, Y. Duan, W. Wang et al. Vision transformer adapter for dense predictions(2022).
[29] I. Loshchilov, F. Hutter. Decoupled weight decay regularization(2017).
