
- Opto-Electronic Advances
- Vol. 6, Issue 2, 220049 (2023)
Introduction
Imaging through scattering media is a classical inverse problem.
Due to the limited size of the convolutional kernel, CNNs are a “local” model. As a “non-local” mechanism, the attention mechanism weighs the significance of each part of the input data and extracts long-range dependencies of the feature maps.
Here, we propose a high-performance “non-local” generic feature extraction and reconstruction model, the SpT UNet. The network adopts a UNet architecture built from advanced transformer encoder and decoder blocks. For better feature reservation/extraction, we propose and demonstrate three key mechanisms: pre-batch normalization (pre-BN), position encoding in the multi-head attention/multi-head cross-attention (MHA/MHCA), and self-built up/down-sampling pipelines. For “scalable” data acquisition, diffusers of four different grits within a 40 mm detection range are considered. We further quantitatively evaluate the network performance with four scientific indicators, namely the Pearson correlation coefficient (PCC), structural similarity index measure (SSIM), Jaccard index (JI), and peak signal-to-noise ratio (PSNR). The SpT UNet shows lower computational complexity and far better reconstruction and generalization ability than other state-of-the-art transformer models in vision.
Method
SpT UNet implementation
The architecture of the SpT UNet
As shown in Fig. 1, the SpT UNet adopts a UNet architecture in which transformer encoder and decoder blocks are connected through skip connections, and each encoder block consists of two layers.

Figure 1. SpT UNet architecture for spatially dense feature reconstruction.
Besides the two layers in each encoder block, the decoder block contains an extra MHCA layer, which is used to aggregate the features delivered through the skip connections. The embedded label is fed to the MHCA layer.
Transformer module for the SpT UNet
The transformer adopts the attention mechanism [15], which can be expressed as

$$ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\mathrm T}}{\sqrt{d_k}}\right)V\,, $$

where the queries $Q$, keys $K$, and values $V$ lie in their respective embedding spaces, and $d_k$ is the dimension of the keys.
The core of the SpT UNet encoder blocks is the multi-head attention (MHA) mechanism, which jointly extracts information from different representation subspaces at different positions. The MHA-based model can be expressed as

$$ \mathrm{MHA}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O},\qquad \mathrm{head}_i=\mathrm{Attention}\!\left(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}\right). $$

Here, the projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable parameter tensors.
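For concreteness, the following is a minimal PyTorch sketch of the standard MHA formulation above, operating on flattened token sequences. The head count, embedding dimension, and the pre-BN placement used in the SpT UNet are not reproduced here; this is only an illustration of the mechanism.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Scaled dot-product multi-head attention in the standard form of ref. [15]."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)    # W^Q
        self.k_proj = nn.Linear(dim, dim)    # W^K
        self.v_proj = nn.Linear(dim, dim)    # W^V
        self.out_proj = nn.Linear(dim, dim)  # W^O

    def forward(self, q, k, v):
        # q, k, v: (batch, tokens, dim); for self-attention q = k = v
        B, _, D = q.shape

        def split(x):  # (B, N, D) -> (B, heads, N, head_dim)
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        # softmax(Q K^T / sqrt(d_k)) V, computed per head
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, D)
        return self.out_proj(out)
```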
Besides the MHA mechanism, we propose the multi-head cross-attention (MHCA) mechanism for the SpT UNet decoder blocks. MHCA takes the same multi-head form as MHA, but its attention is computed across two different feature sources, so that the decoder features can attend to the features delivered through the skip connections.
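Under that reading, MHCA can be illustrated by calling the multi-head attention sketch above with two different feature sources. The token shapes and the assignment of queries to the decoder and keys/values to the skip connection below are illustrative assumptions, not the paper's exact implementation.

```python
import torch

# Dummy token sequences standing in for flattened decoder feature maps and the
# encoder feature maps delivered through a skip connection (shapes are assumptions).
decoder_tokens = torch.randn(2, 625, 256)   # (batch, tokens, dim)
skip_tokens = torch.randn(2, 625, 256)

# Reuses the MultiHeadAttention sketch above: queries from the decoder,
# keys/values from the skip connection (one possible reading of MHCA).
mhca = MultiHeadAttention(dim=256, num_heads=8)
fused = mhca(decoder_tokens, skip_tokens, skip_tokens)   # (2, 625, 256)
```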
Module-level modification of SpT UNet
Pre-batch normalization (Pre-BN)
For better training of the network, pre-normalization is implemented in each block to stabilize the convergence of the gradients. As each batch contains more varied cross-speckle features than the features between channels, we further upgrade the pre-layer normalization to pre-batch normalization.
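A minimal sketch of the pre-BN idea is given below, assuming the batch normalization is applied to the token features immediately before the attention layer and followed by a residual connection (both placements are assumptions). It reuses the MultiHeadAttention sketch above.

```python
import torch.nn as nn


class PreBNAttention(nn.Module):
    """Pre-normalization sub-block with LayerNorm swapped for BatchNorm1d."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim)                   # normalize across the batch
        self.attn = MultiHeadAttention(dim, num_heads)  # from the sketch above

    def forward(self, x):                               # x: (batch, tokens, dim)
        y = self.bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (B, C, N)
        return x + self.attn(y, y, y)                   # residual connection (assumption)
```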
Position encoding in MHA/MHCA
The SpT UNet backbone is a transformer module using MHA and MHCA as its core. MHA and MHCA are coupled with a three-dimensional position encoding, which serves as an inductive bias; its purpose is stable and efficient feature extraction. Specifically, we designed a three-dimensional absolute sinusoidal position encoding for the feature maps, in which each position index of a feature map is encoded by sine and cosine functions of different frequencies along the embedding vector.
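The paper's exact three-dimensional formula is not reproduced here; the sketch below illustrates one common way to extend the standard sinusoidal encoding of ref. [15] to three axes, by encoding each axis separately and concatenating along the embedding dimension (an assumption about the construction).

```python
import math
import torch


def sinusoidal_1d(length: int, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal encoding (ref. [15]): sin/cos at geometric frequencies."""
    assert dim % 2 == 0, "per-axis dimension assumed even"
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # (L, 1)
    k = torch.arange(0, dim, 2, dtype=torch.float32)               # even channel indices
    freq = torch.exp(-math.log(10000.0) * k / dim)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe


def sinusoidal_3d(d1: int, d2: int, d3: int, dim: int) -> torch.Tensor:
    """One possible 3D extension: encode each axis with dim // 3 channels and concatenate."""
    assert dim % 6 == 0, "dim should split evenly into three even-sized parts"
    c = dim // 3
    e1 = sinusoidal_1d(d1, c)[:, None, None, :].expand(d1, d2, d3, c)
    e2 = sinusoidal_1d(d2, c)[None, :, None, :].expand(d1, d2, d3, c)
    e3 = sinusoidal_1d(d3, c)[None, None, :, :].expand(d1, d2, d3, c)
    return torch.cat([e1, e2, e3], dim=-1)   # (d1, d2, d3, dim)
```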
Puffed downsampling and leaky upsampling
For better speckle feature extraction, we introduce two sampling methods: puffed downsampling with a sandwich-like autoencoder structure, and leaky upsampling with a bottleneck structure inspired by compressed sensing.
The puffed downsampling structure is shown in Fig. 2.
The leaky upsampling structure is shown in Fig. 3.
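The sketch below shows one possible reading of the two modules: a channel "puff-up" sandwiched around a strided spatial reduction for the downsampling, and a channel bottleneck around a learned spatial expansion for the upsampling. The layer choices and the expansion/reduction ratios are assumptions for illustration, not the paper's implementation.

```python
import torch.nn as nn


class PuffedDownsample(nn.Module):
    """Illustrative sandwich-like (expand-compress) downsampling block."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * expansion, 1),               # puff up channels
            nn.Conv2d(channels * expansion, channels * expansion, 3,
                      stride=2, padding=1),                             # halve spatial size
            nn.Conv2d(channels * expansion, channels, 1),               # compress back
        )

    def forward(self, x):
        return self.body(x)


class LeakyUpsample(nn.Module):
    """Illustrative bottleneck (compressed-sensing inspired) upsampling block."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),              # bottleneck
            nn.ConvTranspose2d(channels // reduction, channels // reduction,
                               2, stride=2),                            # double spatial size
            nn.Conv2d(channels // reduction, channels, 1),              # restore channels
        )

    def forward(self, x):
        return self.body(x)
```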
Optical imaging system and data acquisition
The optical imaging system used for data acquisition is shown in Fig. 4.
To collect the training and testing datasets, 1500 Faces-LFW face images [29] were used as the ground-truth objects (see Fig. 5). Four training/testing cases were considered:
Case 1: Train/test the network with the speckles produced by a 120-grit diffuser at 0 and 20 mm away from the image plane.
Case 2: Train/test the network with the speckles produced by a 220-grit diffuser at 0 and 20 mm away from the image plane.
Case 3: Train/test the network with the speckles produced by a 600-grit diffuser at 0 and 20 mm away from the image plane.
Case 4: Train/test the network with the speckles produced by a 1500-grit diffuser at 0 and 20 mm away from the image plane.
To better evaluate the generalization ability of the network, especially for a varied depth of range, the validation dataset consists of 6000 pairs produced by the four diffusers with the 1500 seen Faces-LFW face images at 40 mm away from the focal plane.
Data processing
The speckle patterns were first normalized between 0 and 1, and the labels for the generic face images were binary values. To reduce the number of network parameters and the demand on GPU memory and training data, the input speckle patterns were first downsampled from 800 × 800 pixels to 200 × 200 pixels using bilinear interpolation. The network was implemented with Python 3.8.5 and PyTorch 1.7.1 (Facebook Inc.) and ran on an NVIDIA GeForce RTX 3090 GPU. It was trained for 200 epochs with a learning rate of 10^−4 for the first 100 epochs, 10^−5 for the next 50 epochs, and 10^−6 for the final 50 epochs. The batch size in the training/testing process was 2. Moreover, the lightweight SpT UNet contains 6.6 million neurons. For the network configuration, the Adam optimizer, an L2-norm regularizer, and the cross-entropy (CE)/negative Pearson correlation coefficient (NPCC) loss functions were chosen. Once the model was trained, each prediction was made within 16 ms.
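A minimal sketch of the preprocessing and training configuration described above is given below. The stand-in model, the weight-decay value used for the L2 regularizer, and per-epoch stepping of the scheduler are assumptions; the learning-rate schedule and downsampling match the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def preprocess_speckle(speckle: torch.Tensor) -> torch.Tensor:
    """Normalize a raw (H, W) speckle pattern to [0, 1] and downsample 800x800 -> 200x200."""
    speckle = speckle.float()
    speckle = (speckle - speckle.min()) / (speckle.max() - speckle.min() + 1e-8)
    speckle = F.interpolate(speckle[None, None], size=(200, 200),
                            mode="bilinear", align_corners=False)
    return speckle[0]                                   # (1, 200, 200)


def npcc_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative Pearson correlation coefficient, one of the two losses used."""
    p = pred - pred.mean()
    t = target - target.mean()
    return -(p * t).sum() / (p.norm() * t.norm() + 1e-8)


# Adam with L2 (weight-decay) regularization and the staged learning rate
# 1e-4 / 1e-5 / 1e-6 over 100 / 50 / 50 epochs (scheduler stepped once per epoch).
model = nn.Conv2d(1, 1, 3, padding=1)                   # stand-in; replace with the SpT UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             weight_decay=1e-5)         # decay value is an assumption
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)
```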
Results and discussions
To intuitively visualize the JI score, the generic human faces and the related reconstructed pictures are shown in Fig. 6.
We further quantitatively evaluate the performance of the network using PCC, JI, SSIM, and PSNR. The PCC is essentially a normalized measurement of covariance, with a value of 1 representing perfect correlation. The SSIM evaluates the similarity between the reconstructed patterns and the related ground truth; it is a decimal value between 0 and 1, where 1 represents perfect structural similarity and 0 indicates no structural similarity. Similar to the SSIM, the JI gauges the similarity and diversity between a prediction and its ground truth. The PSNR is used to quantify the quality of the reconstruction: the higher the PSNR, the better the reconstructed image.
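The four indicators can be computed as sketched below; the binarization threshold for the JI and the data range of 1.0 for SSIM/PSNR are assumptions consistent with the [0, 1] normalization and binary labels used here.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim
from skimage.metrics import peak_signal_noise_ratio as psnr


def pcc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson correlation coefficient (1 means perfect correlation)."""
    p = pred.ravel() - pred.mean()
    g = gt.ravel() - gt.mean()
    return float(np.dot(p, g) / (np.linalg.norm(p) * np.linalg.norm(g) + 1e-12))


def jaccard_index(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5) -> float:
    """Jaccard index between the binarized prediction and the binary label."""
    p, g = pred > thr, gt > 0.5
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / (union + 1e-12))


# Example usage on one reconstruction / ground-truth pair scaled to [0, 1]:
rec = np.random.rand(200, 200)
gt = (np.random.rand(200, 200) > 0.5).astype(np.float64)
scores = {"PCC": pcc(rec, gt),
          "JI": jaccard_index(rec, gt),
          "SSIM": ssim(rec, gt, data_range=1.0),
          "PSNR": psnr(gt, rec, data_range=1.0)}
```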
Moreover, to evaluate the loss and the reconstruction accuracy of the SpT UNet, the plots of loss and accuracy for the trained network as a function of the epoch are shown in Fig. 7.

Figure 7. Quantitative analysis of the trained SpT UNet using NPCC as the loss function.
We also quantitatively evaluate the complexity of the SpT UNet and its downsized version, the SpT UNet-B, and compare the performance of the two models.
It is worth noting that, as lightweight networks, the SpT UNet and SpT UNet-B use over an order of magnitude fewer parameters than ViT [18].
Conclusions
We have proposed a “non-local” spatially dense object feature extraction and reconstruction model, i.e., the lightweight SpT UNet. It shows excellent performance, with competitive values on the scientific indicators, for generic face images imaged through varied types of diffusers at different detection planes. Although we only consider the reconstruction of binary generic face images here, the reconstruction of spatially dense images at grayscale using the SpT UNet can be considered in the future. For biomedical imaging, we believe that the network can be further implemented in complex tissue imaging to boost the image contrast and depth of range. For photonic computing, as a parallel processing model, the SpT UNet can be further implemented as an all-optical diffractive neural network with superior feature extraction ability, light-speed processing, and even lower energy consumption.
References
[1] Goodman JW. Speckle Phenomena in Optics: Theory and Applications (Roberts and Company Publishers, Englewood, 2007).
[15] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems 6000–6010 (ACM, 2017).
[17] Lin TY, Wang YX, Liu XY, Qiu XP. A survey of transformers. (2021); https://arxiv.org/abs/2106.04554.
[18] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai XH et al. An image is worth 16x16 words: transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR, 2020).
[19] Touvron H, Cord M, Douze M, Massa F, Sablayrolles A et al. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning 10347–10357 (PMLR, 2021).
[20] Ye LW, Rochan M, Liu Z, Wang Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition 10494–10503 (IEEE, 2019).
[21] Yang FZ, Yang H, Fu JL, Lu HT, Guo BN. Learning texture transformer network for image super-resolution. In Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition 5790–5799 (IEEE, 2020).
[22] Sun C, Myers A, Vondrick C, Murphy K, Schmid C. Videobert: a joint model for video and language representation learning. In Proceedings of 2019 IEEE/CVF International Conference on Computer Vision 7463–7472 (IEEE, 2019).
[23] Girdhar R, Carreira JJ, Doersch C, Zisserman A. Video action transformer network. In Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition 244–253 (IEEE, 2021).
[24] Chen HT, Wang YH, Guo TY, Xu C, Deng YP et al. Pre-trained image processing transformer. In Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition 12294–12305 (IEEE, 2021); http://doi.org/10.1109/CVPR46437.2021.01212.
[25] Ramesh A, Pavlov M, Goh G, Gray S, Voss C et al. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning 8821–8831 (PMLR, 2021).
[26] Khan S, Naseer M, Hayat M, Zamir SW, Khan FS et al. Transformers in vision: a survey. (2021); https://arxiv.org/abs/2101.01169.
[27] Liu Z, Lin YT, Cao Y, Hu H, Wei YX et al. Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of 2021 IEEE/CVF International Conference on Computer Vision 9992–10002 (IEEE, 2021).
[28] He KM, Zhang XY, Ren SQ, Sun J. Deep residual learning for image recognition. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016); http://doi.org/10.1109/CVPR.2016.90.
[29] Huang GB, Mattar M, Berg T, Learned-Miller E. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In Proceedings of Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition (HAL, 2008).
