
- Advanced Photonics Nexus
- Vol. 4, Issue 2, 026010 (2025)
Abstract
1 Introduction
Single-pixel imaging (SPI) is a novel imaging technique that captures spatially modulated two-dimensional (2D) scenes using just one bucket detector.1,2 SPI offers numerous benefits, such as high signal-to-noise ratio (SNR), wide spectrum coverage, and low cost, which have made it widely used in various fields, including multispectral imaging,3,4 optical encryption,5,6 and image-free sensing.7,8 However, the reported pixel resolution of SPI systems ranges from
Researchers have explored various approaches to achieve large-scale SPI. Stojek et al.11 developed a high-resolution SPI scheme; however, it requires the target scene to be spatially sparse, otherwise the recovered image loses details. Wenwen et al.12 applied a compressive sensing (CS) algorithm to retrieve target scenes. However, traditional CS-based restoration techniques are memory- and time-intensive, making it hard to reconstruct large-scale scenes. Recently, with the development of deep learning, convolutional neural network (CNN)-based approaches have been widely applied to SPI. Lyu et al.13 proposed a ghost imaging14 technique based on deep learning, which validated the feasibility of deep learning in simulating the physical process of ghost imaging. Kulkarni et al.15 fed SPI measurements into the CNN-based ReconNet and sent the output feature map, as an intermediate reconstruction result, to an off-the-shelf denoiser to produce the final high-quality reconstruction. However, due to the limited receptive field and spatially uniform attention of CNNs, the reconstructed images still suffer from blurring, missing details, and low resolution.
In recent years, the development of single-pixel systems16 has extended beyond imaging and gradually expanded into the field of semantic sensing.17,18 Different from SPI, image-free single-pixel sensing perceives the target scene directly from the measurement values, eliminating the scene reconstruction step and saving much computational overhead. In image-free single-pixel sensing tasks, the sampling performance is strongly correlated with the type and number of patterns.19 Although reducing the number of unoptimized patterns (such as random patterns20 and Hadamard patterns21) can significantly improve sampling efficiency, the sampling performance is not satisfactory. Increasing the number of patterns can alleviate the problem of poor sampling performance, but it reduces the efficiency of single-pixel sensing. Recently, researchers have started to use convolutional networks to simulate the SPI process22,23 and use full-size convolution kernels to simulate the modulation patterns. This operation embeds the modulation patterns into the network and optimizes them by training the network. Based on the optimized full-sized patterns, we have produced several performance-leading single-pixel sensing works.24,25 However, full-sized optimized patterns still require excessive parameters and cannot preserve the position information of the target in the scene, resulting in unsatisfactory imaging and sensing performance. Therefore, how to balance the number of patterns (sampling rate or sampling speed) with sensing performance remains the major challenge for single-pixel sensing.
To tackle the above challenge, we report a large-scale uncertainty-driven single-pixel imaging and sensing (SPIS) method that can achieve high-quality
Figure 1.Overview of the SPIS technique. (a) The optical setup of the SPIS technique. The structured illumination was generated using a DMD and a white-light source. A single-pixel detector was used to collect the light reflected from the target scene. The collected 1D measurements were digitized and then reshaped into 2D measurements. (b) In SPIS, we scan and sample the scene using small-sized optimized patterns, which achieve higher sampling performance with an order of magnitude fewer pattern parameters. The 2D measurements were then fed into the encoder to extract high-dimensional semantic features, and the features were sent to the task-specific plug-and-play decoder to complete large-scale SPI or image-free sensing. (c) The transformer-based encoder and UDL function can guide SPIS to pay more attention to the target area with more details in the scene, thus extracting high-dimensional semantic features that are effective for imaging and sensing. (d) The existing state-of-the-art SPI method ReconNet
The performance of the reported technique was demonstrated on three public data sets.31
The rest of this paper is organized as follows: Sec. 2 details our proposed SPIS approach, including the network structure and training strategy. Section 3 presents the experimental results to verify the performance of SPIS in large-scale SPI and image-free sensing tasks. Finally, Sec. 4 summarizes our research results, explores the advantages and challenges of this approach in different application scenarios, and proposes future research directions.
2 Method
As shown in Fig. 2(a), the SPIS network consists of two main modules, including the encoder and the decoder. The encoder is composed of an encoding module (which consists of several convolution layers) and a transformer-based high-dimensional semantic feature extraction module.26 The decoder type is determined by specific tasks. When SPIS is applied for SPI and image-free segmentation, the decoder consists of a multiscale upsampling pyramid convolutional block with residual connections. When SPIS is used for image-free object detection, the decoder consists of an MSAN module, a BMFP module, and a predict head.
Figure 2.SPIS network structure. (a) Overview of the SPIS technique. (b) The transformer-based encoder. (c) The decoder for large-scale imaging and image-free segmentation. The decoder consists of multiple upsampling convolution blocks, each of which consists of multiple convolutional layers and one upsampling layer. (d) The decoder for image-free object detection.
2.1 Detailed Structures of the Encoder and the Decoder
The encoder module consists of several convolutional blocks with a kernel size of
We implement small-sized pattern sampling by embedding nonoverlapping small-sized patterns in multiple zero-initialized full-sized patterns and quickly switching between the full-sized patterns (see Appendix C for details). The patterns for all three applications are obtained by first training the network to decode the original image and then fine-tuning the patterns in the convolutional layers after switching to each downstream decoding task. This sampling method combines the advantages of compressed sensing and point-scanning imaging.34 Compared with the conventional full-sized pattern sampling method, the small-sized sampling approach can retain the position information of the target and improve sampling efficiency. Compared with a point-scanning system, our method samples a larger portion of the scene at once, thus reducing the number of samplings and increasing sampling speed. Moreover, to improve SNR, we binarized the illuminating patterns and normalized the signals by maintaining an equal white/black ratio in each illuminating pattern.
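To make the scanning scheme concrete, the following minimal numpy sketch embeds a binarized small pattern into zero-initialized full-size DMD frames, one scan position per frame (our reading of the scheme detailed in Appendix C). The equal white/black ratio is enforced here by median thresholding, which is our simplification; all sizes are hypothetical, not the paper's settings.

```python
import numpy as np

# Hypothetical sizes for illustration (not the paper's actual settings):
# a 128x128 scene scanned by 8x8 optimized patterns.
H = W = 128          # full-size DMD frame resolution
h = w = 8            # small-size optimized pattern resolution
K = 3                # number of optimized small patterns

rng = np.random.default_rng(0)

def binarize_balanced(p):
    """Binarize a small pattern so half its pixels are white (1).

    The paper keeps an equal white/black ratio in every illuminating
    pattern; thresholding at the median is one simple way to do that.
    """
    return (p > np.median(p)).astype(np.uint8)

def embed_small_patterns(small_patterns, H, W):
    """Place each small pattern at every nonoverlapping scan position of
    a zero-initialized full-size frame, one frame per position."""
    h, w = small_patterns.shape[1:]
    frames = []
    for p in small_patterns:
        for i in range(0, H, h):
            for j in range(0, W, w):
                frame = np.zeros((H, W), dtype=np.uint8)
                frame[i:i + h, j:j + w] = p
                frames.append(frame)
    return np.stack(frames)

small = rng.random((K, h, w))
small_bin = np.stack([binarize_balanced(p) for p in small])
frames = embed_small_patterns(small_bin, H, W)
print(frames.shape)  # (K * (H//h) * (W//w), H, W) = (768, 128, 128)
```

Switching through `frames` on the DMD then yields one bucket measurement per scan position, which is what preserves the coarse target location.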
The high-dimensional semantic feature extraction module of the encoder consists of several transformer layers [as shown in Fig. 2(b)]. A transformer is a basic deep-learning network structure.26 Thanks to its excellent contextual information capturing and global feature modeling capabilities, the Global Vision Transformer26 has achieved great success in the image processing field recently. Our transformer-based encoder can guide the network to focus on the regions with interesting targets to extract high-dimensional semantic features that are effective for imaging and sensing26 (more details about the encoder can be found in Sec. 4 in the Supplementary Material).
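The global receptive field that distinguishes the transformer encoder from CNNs comes from self-attention, in which every token attends to every other token. A generic single-head numpy sketch (illustrative only, not the paper's exact encoder):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    The (N, N) score matrix couples every token with every other one,
    which is what gives transformer layers their global receptive field.
    """
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise affinities
    return softmax(scores, axis=-1) @ V       # weighted mix of all tokens

rng = np.random.default_rng(1)
N, d = 16, 32                      # e.g., 16 measurement tokens, 32-dim features
tokens = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)  # (16, 32)
```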
When SPIS is applied for large-scale SPI or image-free single-pixel segmentation, the decoder [as shown in Fig. 2(c)] consists of a multiscale upsampling pyramid network with residual connections. Each upsampling block is composed of multiple
When SPIS is applied for image-free object detection, the decoder consists of an MSAN module, BMFP module, and predict head. The MSAN and BMFP modules are constructed by stacking multiscale LC blocks. As shown in Fig. 2(d), the LC block combines local-window (
The feature
2.2 Training Strategy and Loss Function
To reinforce the network’s attention to the edge and texture-rich regions, we use a UDL loss function in the large-scale SPI task and the image-free single-pixel segmentation task. The UDL loss function is inspired by Ref. 30. The training of the network is divided into two stages. In the first stage, the network estimates both the reconstructed result and uncertainty values. In the second stage, the uncertainty values are used to generate a spatially adaptive weight to guide the network to prioritize the image regions with rich textures and details, thus improving the reconstruction quality of these regions. The parameters of the encoder and decoder in both stages are updated simultaneously, and the trained encoding module
For the detailed derivation of the UDL function, please refer to Ref. 30. We made two changes to the UDL function, specifically for large-scale SPI tasks. First, we use a smoother, slowly increasing monotonic function
Second, we select different hybrid loss functions for different downstream tasks and embed them into the uncertainty-driven loss function. Specifically, we selected a hybrid loss function composed of structural similarity index (SSIM) loss and
In the image-free segmentation task, we selected a hybrid loss function composed of the IOU loss function and the BCE loss function,
To validate the effectiveness of the UDL function, we performed ablation experiments on imaging and image-free single-pixel segmentation. The results are shown in Fig. 3(c), from which we can see that UDL can enhance the imaging and segmentation performance on texture-rich regions and edge regions.
Figure 3.Two-step training strategy of the SPIS network. (a) The overview of the two-step training strategy for imaging and image-free segmentation. In the first stage, the network estimates both the output result and the uncertainty values. In the second stage, the uncertainty values are used to generate a spatially adaptive weight to guide the network to prioritize the pixels in the texture-rich regions and edge regions. (b) The overview of the two-step training method for image-free single-pixel object detection. (c) Ablation study of the UDL loss function. “Step 1” represents the results output by SPIS that has not been trained by UDL, and “step 2” represents the results output by SPIS that has been trained by UDL.
When SPIS is applied for the image-free object detection task, we adopt another two-step training strategy, as shown in Fig. 3(b). The first training stage aims to train the encoder’s high-dimensional semantic feature extraction ability and obtain an optimized spatial light modulation mask. Training at this stage is performed on a large-scale self-supervised pretraining data set, and the function of the decoder in this stage is to reconstruct the scene image. The parameters of the encoder and decoder are updated simultaneously, and the parameter value range of the encoder is limited to [0, 1]. In the second training stage, the decoder is replaced with our designed object detection network module, as shown in Fig. 2(d). The encoder and decoder are updated simultaneously to find the optimal network parameters. The second training stage is performed on the public object detection data set Pascal VOC 2012. The reason for not using UDL in object detection training is that, in the image-free object detection task, our main goal is to detect the location, size, and category of the object of interest in the scene rather than to accurately reconstruct the scene image. Using uncertainty maps to increase the peak signal-to-noise ratio (PSNR) of the reconstructed result by 1 or 2 dB has a very limited effect on improving object detection accuracy, and it would increase the computational cost and time during training.
As for the loss function, in the first training stage, we use the
3 Results
3.1 Large-Scale Single-Pixel Imaging
To validate the imaging performance of SPIS, we first conducted simulations on the public data set Flickr2K,32 which contains 2650 images. We used 90% of the images for training and the remaining 10% for testing. In the simulation, the imaging resolution was
| Method | PSNR (SR = 3%) | SSIM (SR = 3%) | PSNR (SR = 5%) | SSIM (SR = 5%) | PSNR (SR = 7%) | SSIM (SR = 7%) | PSNR (SR = 10%) | SSIM (SR = 10%) | PSNR (SR = 15%) | SSIM (SR = 15%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DCAN | 18.17 | 0.51 | 18.86 | 0.54 | 19.63 | 0.59 | 20.03 | 0.62 | 20.63 | 0.63 |
| GAN | 18.82 | 0.58 | 19.35 | 0.61 | 20.19 | 0.65 | 20.61 | 0.68 | 21.44 | 0.72 |
| ReconNet | 19.89 | 0.62 | 20.01 | 0.67 | 20.90 | 0.72 | 21.12 | 0.75 | 22.41 | 0.75 |
| Ours |  |  |  |  |  |  |  |  |  |  |
Table 1. Quantitative results of large-scale SPI comparison
For the comparison methods (ReconNet,15 DCAN,22 and GAN23), we used the code made available by the respective authors in their original papers or on their websites. Because their original reconstruction resolution did not reach
In Fig. 4(a), we present the statistical results of pattern comparison, noise robustness, and an ablation study of the uncertainty loss function. For pattern comparison, we compared the reported small-sized optimized pattern with the conventional Hadamard pattern,22 random pattern,20 and optimized full-sized pattern.24 The results show that our method maintains stable performance even at an extremely low sampling rate of 3%. The Hadamard and random patterns did not perform as well as the optimized patterns. The optimized full-sized pattern did not perform well because it can only acquire one-dimensional measurements and cannot retain the position information of the target. In summary, the optimized small-sized pattern is superior for large-scale SPI.
Figure 4.SPI experiment. (a) The statistical results of pattern comparison, noise robustness, and ablation study for uncertainty loss functions. (b) The proof-of-concept setup for large-scale SPI and image-free single-pixel sensing. (c) The visualization results of large-scale SPI on 3D scenes at a sampling rate of 3% and a resolution of
To examine robustness to measurement noise, we also added noise with an SNR of 10 to 20 dB to the measurements (details on how the noise is added can be found in Sec. 1 in the Supplementary Material). The results in Fig. 4(a) show that the reported technique still outperforms the others. Even at a sampling rate of 3% and a noise level of 20 dB, our method still produced a reconstruction PSNR of 22.13 dB. In addition, the PSNR gaps between SPIS reconstructions under different noise levels are small, which shows that our method is robust to noise interference.
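The paper's exact noise-injection procedure is in its Supplementary Material; a common way to add white Gaussian noise at a target SNR, which we assume here for illustration, scales the noise power from the measured signal power:

```python
import numpy as np

def add_noise_at_snr(measurements, snr_db, rng=None):
    """Add white Gaussian noise so the result has the requested SNR (dB).

    Uses the standard definition SNR_dB = 10*log10(P_signal / P_noise);
    the paper's own procedure may differ in detail.
    """
    rng = rng or np.random.default_rng()
    p_signal = np.mean(measurements ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=measurements.shape)
    return measurements + noise

rng = np.random.default_rng(2)
y = rng.random(10000)                       # synthetic 1D measurements
y_noisy = add_noise_at_snr(y, snr_db=10, rng=rng)
achieved = 10 * np.log10(np.mean(y**2) / np.mean((y_noisy - y)**2))
print(achieved)  # close to 10 dB (up to sampling variance)
```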
Also, we conducted an ablation study of the reported UDL function and the two-stage training strategy, with the results shown in Fig. 4(a). Three models were involved in the comparison: the SPIS model without the UDL function (W/O uncertainty), the SPIS model after the first training stage (step 1), and the full SPIS model after both training stages (step 2). We can see that the UDL function effectively improves reconstruction quality. This benefit comes from the fact that the uncertainty map
To experimentally demonstrate the large-scale imaging performance of the reported SPIS technique, we built a proof-of-concept setup [as shown in Fig. 4(b)] based on the optical system shown in Fig. 1(a). The structured illumination was generated using a DMD (V-9501, Vialux) and a white-light source (Thorlabs SLS302). With the help of the universal high-performance programming and development tool (ALP4.3), we can achieve a refresh rate of up to 50 kHz by utilizing only some of the DMD pixels. This is equivalent to increasing the refresh speed of the DMD by sacrificing DMD resolution under limited data bandwidth. A single-pixel detector (PDA100A2, Thorlabs) was used to collect the light reflected from the target scene. The collected 1D measurements were digitized using a data acquisition card (PCIE8514, 10 MHz) and then fed into the computing unit. The computing unit converts the collected 1D measurements into 2D measurements and sends them to the SPIS network to complete the imaging or image-free sensing tasks.
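The 1D-to-2D conversion step can be sketched as follows, assuming (hypothetically) one bucket value per nonoverlapping scan position, so that the 2D layout simply follows the scan order:

```python
import numpy as np

# Hypothetical acquisition geometry: the small pattern is scanned over
# 16 x 16 nonoverlapping positions, one bucket value per position.
positions_y, positions_x = 16, 16

def reshape_measurements(raw_1d, positions_y, positions_x):
    """Reorder the serial bucket-detector samples into a 2D measurement
    map whose layout follows the scan order, preserving the coarse
    position of targets in the scene."""
    assert raw_1d.size == positions_y * positions_x
    return raw_1d.reshape(positions_y, positions_x)

raw = np.arange(256, dtype=float)           # stand-in for digitized samples
m2d = reshape_measurements(raw, positions_y, positions_x)
print(m2d.shape)  # (16, 16)
```

The resulting 2D map is what the encoder consumes; its spatial layout is exactly why small-pattern scanning retains target position information.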
We used multiple 3D plaster sculpture models and a resolution board as the target scenes and conducted large-scale SPI comparison at a resolution of
3.2 Image-Free Single-Pixel Segmentation
The SPIS technique can be applied to the image-free semantic segmentation task without modifying the network structure. To demonstrate the segmentation performance of our method, we trained our SPIS network on the cell segmentation data set WBC33 and conducted an experimental comparison with the existing state-of-the-art single-pixel segmentation methods. The experimental results are shown in Table 2 and Fig. 5. More experimental results can be found in Sec. 2 in the Supplementary Material.
| Method | Dice (SR = 3%) | mIoU (SR = 3%) | Dice (SR = 5%) | mIoU (SR = 5%) | Dice (SR = 7%) | mIoU (SR = 7%) | Dice (SR = 10%) | mIoU (SR = 10%) | Dice (SR = 15%) | mIoU (SR = 15%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hadamard | 0.148 | 0.071 | 0.218 | 0.099 | 0.151 | 0.072 | 0.224 | 0.095 | 0.237 | 0.096 |
| Random | 0.207 | 0.092 | 0.121 | 0.061 | 0.123 | 0.062 | 0.162 | 0.077 | 0.173 | 0.074 |
| Optimized | 0.716 | 0.634 | 0.721 | 0.652 | 0.739 | 0.649 | 0.763 | 0.684 | 0.793 | 0.658 |
| Imaging | 0.638 | 0.408 | 0.690 | 0.466 | 0.715 | 0.524 | 0.713 | 0.546 | 0.674 | 0.522 |
| SPS | 0.705 | 0.628 | 0.717 | 0.659 | 0.729 | 0.673 | 0.743 | 0.691 | 0.785 | 0.648 |
| Ours |  |  |  |  |  |  |  |  |  |  |
Table 2. Quantitative performance comparison of different image-free segmentation methods at different SRs. The methods for comparison include the image-free segmentation methods using the Hadamard pattern,22 random pattern,20 and optimized full-sized pattern (“Optimized”),24 the imaging-first15-sensing-later40 method (“Imaging”), and the UNet-based image-free segmentation method (SPS).25 In the “Imaging” method, we used the method in Ref. 15 for imaging and UNet40 for semantic segmentation.
Figure 5.Image-free single-pixel segmentation experiment. (a) Experimental results of segmentation performance of the five methods involved in the comparison and our proposed image-free single-pixel image segmentation method at different sampling rates. (b) Visualization results of image-free single-pixel image segmentation performance comparison experiments at different sampling rates (SRs).
We would like to note that the WBC data set contains a total of 400 pairs of images. We used 90% of the data set for training and the remaining 10% for testing. Because the resolution of the WBC data set33 is not high, all the simulation results in Table 2 and Fig. 5(a) were produced at a resolution of
As shown in Table 2, our method achieved superior image-free segmentation performance compared with the other methods. The Hadamard and random patterns are not optimized by the network. These patterns sample all regions with the same weight, which means that they cannot adaptively assign higher sampling weights to the targets, resulting in relatively poor sampling efficiency. Although the optimized full-sized pattern can achieve globally adaptive sampling, it can only produce 1D sampling measurements, which cannot retain the location and size information of the targets, resulting in unsatisfactory segmentation performance. The imaging-first-sensing-later approach first reconstructs the image and then performs segmentation. Because the reconstruction precision is usually unsatisfactory at very low sampling rates, this step directly leads to poor segmentation performance. By contrast, the image-free single-pixel segmentation method uses the high-dimensional semantic information extracted from the measurements directly for image segmentation, which avoids information loss and error accumulation. Due to the limited receptive field of the convolutional block and the lack of global feature modeling capability, the segmentation performance of SPS is not as good as that of our transformer-based image-free single-pixel segmentation network.
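For reference, the two metrics reported in Table 2 can be computed for binary masks as follows (a standard formulation, with hypothetical toy masks):

```python
import numpy as np

def dice_and_iou(pred, gt, eps=1e-8):
    """Dice coefficient and IoU for a pair of binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou

gt = np.zeros((8, 8), dtype=np.uint8)
gt[2:6, 2:6] = 1                 # 16-pixel ground-truth cell mask
pred = np.zeros_like(gt)
pred[3:7, 3:7] = 1               # prediction shifted by one pixel
dice, iou = dice_and_iou(pred, gt)
print(dice, iou)  # 2*9/32 = 0.5625 and 9/23 ≈ 0.391
```

mIoU in Table 2 is then the IoU averaged over classes (or images), following the usual convention.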
Figure 5(b) shows the visual experimental results of different image-free single-pixel segmentation methods at different sampling rates. We again employed the proof-of-concept setup in Fig. 4(b) to acquire measurements. Images randomly selected from the WBC testing data set were printed on films as the target scene. We first performed classical SPI at a sampling rate of 100% to obtain high-quality reconstructed images. Next, we manually annotated the reconstructed images according to the annotations in the WBC data set to obtain the ground-truth segmentation. In Fig. 5(b), we can see that our SPIS method achieves the highest segmentation performance at different sampling rates. In particular, at the extremely low sampling rate of 0.1%, only our method can still accurately segment the cells.
3.3 Image-Free Single-Pixel Object Detection
The reported SPIS technique can also be applied for image-free single-pixel object detection. To reduce computational overhead while ensuring detection accuracy, we sampled the scene at a resolution of
| Method | Data throughput (Mbps) | Accuracy (mAP) | Time (ms) | Speed (fps) |
| --- | --- | --- | --- | --- |
| R-CNN | 1.280 | 58.51% | 200 | 5.0 |
| Faster R-CNN | 2.586 | 73.22% | 99 | 10.1 |
| YOLO | 6.732 | 78.64% | 38 | 26.3 |
| SSD | 4.915 | 79.63% | 52 | 19.2 |
| DETReg | 3.098 | 81.16% | 83 | 12.1 |
| SPIS (ours) |  |  |  |  |
Table 3. Comparison of data throughput and running speed between SPIS and the other existing object detection methods.
| Object class | Accuracy |
| --- | --- |
| horse | 92% |
| cat | 89% |
| bicycle | 78% |
| sofa | 86% |
| boat | 84% |
| cow | 81% |
| sheep | 76% |
| person | 76% |
| bird | 86% |
| table | 74% |
Table 4. Experimental detection results of SPIS at 5% sampling rate on the testing set of Pascal VOC2012.
Figure 6.Image-free single-pixel object detection experiment. (a) Statistic results of pattern sampling performance comparison and noise interference experiments. (b) Visualization results of image-free single-pixel object detection. The “min” and “max” represent the relative coordinates of the upper left corner and lower right corner of the target bounding box, respectively. To better demonstrate the detection results, we visualized the output of SPIS on the input scene.
Figure 6(a) validates that our SPIS technique using small-sized patterns outperforms image-free object detection techniques using other patterns. Even at a low sampling rate of 5%, SPIS still produced an accuracy of 82.41% (mAP). To further validate the robustness of our method to measurement noise (details on how the noise is added can be found in Sec. 1 in the Supplementary Material), we also added Gaussian noise with an SNR from 10 to 20 dB to the 2D measurements. Figure 6(a) shows that the SPIS technique still performs well under measurement noise; even at 10 dB SNR and a 5% sampling rate, the detection accuracy reached 78.17% (mAP).
To validate that SPIS can greatly reduce data redundancy and data throughput, we conducted a data throughput comparison with the existing object detection algorithms, as shown in Table 3. We can see that SPIS achieves higher object detection accuracy with 1 order of magnitude less data throughput than the existing object detection methods; the detection speed is also faster. This is because the image-free sensing strategy eliminates the computational overhead of scene reconstruction, and the small-sized pattern achieves better sampling performance with fewer parameters. In addition, the LC block in the decoder uses a parallel design to combine the LWSA with CWC, allowing the decoder to obtain a global receptive field while maintaining linear complexity, thereby reducing the number of parameters and computational overhead.
We used the proof-of-concept setup in Fig. 4(b) to validate the experimental image-free object detection performance of SPIS. All the parameter settings of the proof-of-concept setup remain unchanged, implying that the SPIS technique can use one hardware device to achieve multiple image-free sensing tasks. Images randomly selected from the Pascal VOC testing data set were printed on films and used as the target scene. We first performed classical SPI at a sampling rate of 100% to obtain high-quality reconstructed images. Next, we manually annotated the reconstructed images according to the annotations in the Pascal VOC data set to obtain the ground-truth detection results. Under the sampling rate of 5%, the average time to complete spatial light modulation and image-free object detection per scene is 0.016 s. This is faster than performing scene reconstruction (0.05 s)22 first and then object detection (0.018 s).27 Figure 6(b) presents the detection results corresponding to several exemplar scenes. Among them, the attention heat maps validate that the transformer-based encoder can indeed reinforce the network’s attention to the targets. Table 4 shows the statistical accuracy for various objects and the overall accuracy. We can see that the SPIS technique maintains a high detection accuracy on different classes of objects (the average detection accuracy over all the object classes reaches 82.2%).
4 Conclusion and Discussion
In this work, we reported the large-scale SPIS technique, which can achieve megapixel high-quality SPI and highly efficient image-free sensing at a low sampling rate. The SPIS technique utilizes an encoder–decoder structure, in which the illumination patterns are jointly optimized during the network training. Unlike the conventional full-sized patterns,20,22,24 we introduced small-sized optimized patterns to scan and sample the target scene, which achieves higher sampling performance with 1 order of magnitude fewer parameters. Moreover, we designed the encoder module based on the transformer architecture,26 which can better model global features and extract high-dimensional semantic features for high-quality reconstruction and sensing. On the other hand, we can replace the decoder module with a task-specific plug-and-play decoder, providing great adaptivity to different tasks. Considering that texture-rich and edge regions are more difficult to reconstruct, we introduced a novel UDL loss function to reinforce the network’s attention to these regions, thus further improving the imaging and sensing precision.
With the high-resolution SPI and high-efficiency image-free sensing performance at a very low sampling rate, the SPIS technique can adapt to the applications of low bandwidth or limited computational resources. For example, it can be applied on mobile platforms with limited loads, such as vehicle radar and UAVs. The SPIS technique can help them achieve efficient scene reconstruction and intelligent image-free sensing at a low cost.
In the experiments, we noticed that the size of the small-sized pattern will affect the performance of imaging and image-free sensing. Theoretically, the smaller the pattern size, the better the image-free sensing performance. This is because a smaller pattern can retain more detailed location information and capture rich local features. However, as the pattern becomes smaller, the luminous flux becomes lower, which will undoubtedly reduce the performance of imaging and image-free sensing. We tried six different resolutions of small-sized patterns (including
We also noticed that the practical implementation of small-sized patterns has an impact on sampling efficiency. In our experiments, we embedded nonoverlapping small-sized patterns in zero-initialized full-sized patterns and implemented small-sized pattern sampling by quickly switching the full-sized patterns. The illumination speed is limited by the DMD. Another way to improve sampling efficiency is to first use the DMD to modulate the beam and generate a small-sized light pattern, and then use a resonant galvanometer scanner set to quickly scan the pattern over the scene without overlap. The resonant scanner and galvanometer mirrors are optically coupled through a relay lens set. This small-sized pattern sampling method can increase sampling speed.
How to apply the SPIS technique to various complex real-world scenes is challenging. In this work, the measurement noise obtained from simulation or laboratory environments is relatively homogeneous, which may be different from complex real-world environments. To suppress noise in complex real-world environments, we can first study the complex photon flow model to accurately characterize multiple physical noise sources44 and collect SPI data sets in real-world scenes to calibrate noise model parameters. With this real-world physical noise model, we can improve the generalization ability of the SPIS technique and achieve robust high-precision imaging and sensing in practical applications.
5 Appendix A: Training Details
We implemented SPIS on Ubuntu 20 using the PyTorch framework and trained it with the Adam optimizer on an NVIDIA RTX 3090 GPU.
In the large-scale SPI experiments, we trained SPIS and other methods for comparison on the Flickr2K32 data set. We enhanced the training set by cropping, rotating, and flipping the images. All the images were resized to
In the image-free single-pixel segmentation experiments, we trained our SPIS and other methods for comparison on the WBC33 data set. We enhanced the training set by cropping, rotating, and flipping the images. All the images were resized to
In the large-scale SPI and image-free single-pixel segmentation experiments, to reinforce the network’s attention to the edge regions and the regions containing rich textures and details, we introduced a UDL function. The training process was divided into two stages. In the first stage, the network estimated both the reconstructed results and the uncertainty values. In the second stage, the uncertainty values were used to generate a spatially adaptive weight to guide the network to prioritize the image regions with rich textures and the edge regions, thus improving the reconstruction quality of these regions. The parameters of the encoder and decoder in both stages were updated simultaneously.
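As a simplified illustration of this two-stage strategy (the exact UDL formulation is in Ref. 30, and our stage-1 term follows the common heteroscedastic form rather than the paper's exact function), stage 1 fits a per-pixel log-variance alongside the reconstruction, and stage 2 reuses it as a spatially adaptive weight:

```python
import numpy as np

def stage1_uncertainty_loss(pred, gt, log_var):
    """Stage 1: jointly fit the reconstruction and a per-pixel
    log-variance s, heteroscedastic-style: mean(|x - x_hat|*exp(-s) + s)."""
    return np.mean(np.abs(pred - gt) * np.exp(-log_var) + log_var)

def stage2_weighted_loss(pred, gt, log_var):
    """Stage 2: turn the learned uncertainty into a spatially adaptive
    weight (normalized to mean 1) so high-uncertainty pixels, i.e. edges
    and textures, dominate the loss. A simplification of UDL (Ref. 30)."""
    weight = np.exp(log_var)
    weight = weight / weight.mean()
    return np.mean(weight * np.abs(pred - gt))

rng = np.random.default_rng(3)
gt = rng.random((32, 32))
pred = gt + 0.05 * rng.standard_normal((32, 32))
log_var = np.zeros((32, 32))       # flat uncertainty for illustration
l1 = stage1_uncertainty_loss(pred, gt, log_var)
l2 = stage2_weighted_loss(pred, gt, log_var)
print(abs(l1 - l2) < 1e-12)  # True: with s = 0 the two losses coincide
```

With a non-flat `log_var`, stage 2 up-weights exactly the pixels stage 1 found hardest to reconstruct, which is the intended behavior.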
In the image-free single-pixel object detection experiments, we trained our SPIS and other methods for comparison on the Pascal VOC data set. We enhanced the training set by cropping, rotating, and flipping the images. All the images were resized to
When the SPIS technique was applied for the image-free object detection task, we adopted another two-step training strategy. The first training stage aimed to train the encoder’s high-dimensional semantic feature extraction ability and obtain optimized small-size modulation patterns. The function of the decoder in this stage was to reconstruct the target scene. The encoder and decoder parameters were updated simultaneously in the first training stage. During the second training stage, the decoder was replaced with our designed object detection network. The encoder and decoder were updated simultaneously to find the optimal network parameters.
6 Appendix B: Sampling Rate Calculation
All the sampling rates mentioned in this paper can be calculated as the ratio between measurement number and sampling resolution,
For the conventional full-sized modulation patterns, one pattern can produce one measurement. Therefore, the sampling rate of full-sized modulation patterns can be directly derived as
For the small-sized optimized pattern reported in this work, a small-sized optimized pattern scans the entire scene and produces multiple measurements. The measurement number can be calculated as
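In code, the ratio defined above can be sketched as follows; the scan-position count for small patterns assumes the nonoverlapping tiling described in Appendix C, and all concrete numbers are hypothetical:

```python
def sampling_rate_full(num_patterns, H, W):
    """Full-size patterns: one pattern yields one measurement, so the
    sampling rate is num_patterns / (H * W)."""
    return num_patterns / (H * W)

def sampling_rate_small(num_patterns, H, W, h, w):
    """Small patterns (assumed to tile the scene without overlap): each
    pattern is scanned over (H//h) * (W//w) positions, producing one
    measurement per position."""
    measurements = num_patterns * (H // h) * (W // w)
    return measurements / (H * W)

# Hypothetical numbers: a 1024x1024 scene, 32x32 small patterns.
print(sampling_rate_full(32768, 1024, 1024))        # 0.03125 (~3%)
print(sampling_rate_small(32, 1024, 1024, 32, 32))  # 0.03125 (~3%)
```

The comparison makes the parameter saving concrete: 32 small patterns of 32x32 pixels reach the same sampling rate as 32,768 full-size patterns while storing orders of magnitude fewer pattern parameters.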
7 Appendix C: DMD Scanning Strategy
As shown in Fig. 7, we implemented small-sized pattern sampling by embedding nonoverlapping small-sized patterns in multiple zero-initialized full-sized patterns. For example, when the sampling rate is 5% and the scene resolution is
Figure 7.Sampling process of small-sized patterns. We take a
The above sampling method through small-sized optimized patterns combines the advantages of both compressed sensing and point scanning imaging.34 Compared with the conventional full-sized pattern, our small-sized pattern can retain the position information of the target and improve sampling efficiency. Compared with the point scanning system, our method samples a much larger portion of the scene at once, thus reducing sampling time and increasing sampling speed.
Lintao Peng received his BS degree from the School of Computer Science and Technology, Xidian University, Xi’an, China, in 2020. He is currently pursuing a PhD with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. His research interests include computer vision, computational photography, and deep learning.
Siyu Xie received his BS degree from the School of Communication Engineering, Jilin University, Changchun, China, in 2021. He is currently pursuing an MS degree in Electronic Information Engineering with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China, and is expected to graduate in 2024. His research interests include computational photography and deep learning.
Hui Lu received her BS degree from the School of Communication Engineering, Jilin University, Changchun, China, in 2022. She is currently pursuing an MS degree in Electronic Information Engineering with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. Her research interests include computational photography and deep learning.
Liheng Bian received his PhD from the Department of Automation, Tsinghua University, Beijing, China, in 2018. He is currently an associate professor with the Beijing Institute of Technology. His research interests include computational imaging and computational sensing. More information of him can be found at https://bianlab.github.io/.
References
[7] S. Ota et al. Ghost cytometry. Science, 360, 1246-1251 (2018).
[12] M. Wenwen et al. Sparse Fourier single-pixel imaging. Opt. Express, 27, 31490-31503 (2019).
[13] M. Lyu et al. Deep-learning-based ghost imaging. Sci. Rep., 7, 17865 (2017).
[14] P. Ryczkowski et al. Ghost imaging in the time domain. Nat. Photonics, 10, 167-170 (2016).
[16] O. Graydon. Imaging: retina-like single-pixel camera. Nat. Photonics, 11, 335 (2017).
[22] C. F. Higham et al. Deep learning for real-time single-pixel video. Sci. Rep., 8, 1-9 (2018).
[26] A. Vaswani et al. Attention is all you need. Adv. Neural Inf. Process. Syst., 6000-6010 (2017).
[27] J. Redmon, A. Farhadi. YOLO9000: better, faster, stronger. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 6517-6525 (2017).
[28] A. Dosovitskiy et al. An image is worth 16x16 words: transformers for image recognition at scale. Int. Conf. Learn. Represent. (2021).
[29] F. Chollet. Xception: deep learning with depthwise separable convolutions. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 1800-1807 (2017).
[30] Q. Ning et al. Uncertainty-driven loss for single image super-resolution. Adv. Neural Inf. Process. Syst., 16398-16409 (2021).
[31] M. Everingham et al. The PASCAL visual object classes challenge 2010 (VOC2010) results (2010).
[35] D. Jha et al. Kvasir-SEG: a segmented polyp dataset. Lect. Notes Comput. Sci., 11962, 451-462 (2020).
[37] S. Ren et al. Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. (2015).
[38] I. Goodfellow et al. Generative adversarial nets. Adv. Neural Inf. Process. Syst., 2672-2680 (2014).
[42] W. Liu et al. SSD: single shot multibox detector. Lect. Notes Comput. Sci., 9905, 21-37 (2016).
