• Advanced Photonics Nexus
  • Vol. 4, Issue 2, 026010 (2025)
Lintao Peng1, Siyu Xie1, Hui Lu1, and Liheng Bian1,2,3,*
Author Affiliations
  • 1Beijing Institute of Technology, MIIT Key Laboratory of Complex-Field Intelligent Sensing, Beijing, China
  • 2Beijing Institute of Technology, Guangdong Province Key Laboratory of Intelligent Detection in Complex Environment of Aerospace, Land and Sea, Zhuhai, China
  • 3Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China
    DOI: 10.1117/1.APN.4.2.026010
    Lintao Peng, Siyu Xie, Hui Lu, Liheng Bian, "Large-scale single-pixel imaging and sensing," Adv. Photon. Nexus 4, 026010 (2025)

    Abstract

    Existing single-pixel imaging (SPI) and sensing techniques suffer from poor reconstruction quality and heavy computation cost, limiting their widespread application. To tackle these challenges, we propose a large-scale single-pixel imaging and sensing (SPIS) technique that enables high-quality megapixel SPI and highly efficient image-free sensing with a low sampling rate. Specifically, we first scan and sample the entire scene using small-size optimized patterns to obtain information-coupled measurements. Compared with the conventional full-sized patterns, small-sized optimized patterns achieve higher imaging fidelity and sensing accuracy with 1 order of magnitude fewer pattern parameters. Next, the coupled measurements are processed through a transformer-based encoder to extract high-dimensional features, followed by a task-specific plug-and-play decoder for imaging or image-free sensing. Considering that the regions with rich textures and edges are more difficult to reconstruct, we use an uncertainty-driven self-adaptive loss function to reinforce the network’s attention to these regions, thereby improving the imaging and sensing performance. Extensive experiments demonstrate that the reported technique achieves 24.13 dB megapixel SPI at a sampling rate of 3% within 1 s. In terms of sensing, it outperforms existing methods by 12% on image-free segmentation accuracy and achieves state-of-the-art image-free object detection accuracy with an order of magnitude less data bandwidth.

    1 Introduction

    Single-pixel imaging (SPI) is a novel imaging technique that captures spatially modulated two-dimensional (2D) scenes using just one bucket detector.1,2 SPI offers numerous benefits, such as high signal-to-noise ratio (SNR), wide spectrum coverage, and low cost, which have made it widely used in various fields, including multispectral imaging,3,4 optical encryption,5,6 and image-free sensing.7,8 However, the reported pixel resolution of SPI systems ranges from 32 pixel × 32 pixel to 128 pixel × 128 pixel, which is below the imaging standards achieved by conventional methods.9,10 This low resolution is due to the trade-off among the acceptable compression ratio, limited digital micromirror device (DMD) modulation frequency, and reasonable reconstruction time. The challenge of recovering large-scale scenes with reduced measurements still poses a significant obstacle, limiting the practical use of SPI for high-resolution and wide field-of-view vision applications.1

    Researchers have explored various approaches to achieve large-scale SPI. Stojek et al.11 made an attempt to develop a high-resolution SPI scheme. However, it requires the target scene to be spatially sparse; otherwise, the recovered image will lose details. Wenwen et al.12 applied a compressive sensing (CS) algorithm to retrieve target scenes. However, traditional CS-based restoration techniques are costly in terms of memory and time, making it hard to reconstruct large-scale scenes. Recently, with the development of deep learning, convolutional neural network (CNN)-based approaches have been widely applied to SPI. Lyu et al.13 proposed a ghost imaging14 technique based on deep learning, which validated the feasibility of deep learning in simulating the physical process of ghost imaging. Kulkarni et al.15 input SPI measurements into the CNN-based ReconNet and sent the output feature map as an intermediate reconstruction result to an off-the-shelf denoiser to produce the final high-quality reconstruction. However, due to the limited receptive field and globally consistent attention of CNN networks, the reconstructed images still suffer from blurring, missing details, and low resolution.

    In recent years, the development of single-pixel systems16 has extended beyond imaging techniques and gradually expanded into the field of semantic sensing.17,18 Different from SPI, image-free single-pixel sensing directly perceives the target scene from the measurement values, eliminating the scene reconstruction step and saving much computational overhead. In image-free single-pixel sensing tasks, the sampling performance is strongly correlated with the type and number of patterns.19 Although reducing the number of unoptimized patterns (such as random pattern20 and Hadamard pattern21) can significantly improve sampling efficiency, the sampling performance is not satisfactory. Increasing the number of patterns can alleviate the problem of poor sampling performance, but it will reduce the efficiency of single-pixel sensing. Recently, researchers have started to use convolutional networks to simulate the SPI process22,23 and use full-size convolution kernels to simulate the modulation patterns. This operation embeds modulation patterns into the network and optimizes these patterns by training the network. Based on the optimized full-sized patterns, we have produced several performance-leading single-pixel sensing works.24,25 However, the full-sized optimized pattern still has the problem of excessive parameters, and it cannot preserve the position information of the target in the scene, resulting in unsatisfactory imaging and sensing performance. Therefore, how to balance the number of patterns (sampling rate or sampling speed) with sensing performance becomes the major challenge for single-pixel sensing.

    To tackle the above challenge, we report a large-scale uncertainty-driven single-pixel imaging and sensing (SPIS) method that can achieve high-quality 1024×1024 SPI and highly efficient image-free sensing from a small number of measurements. Specifically, unlike previous full-sized patterns, SPIS uses optimized small-sized patterns to sample the scene and obtain measurements [as shown in Fig. 1(a)] that achieve higher sampling performance with 1 order of magnitude fewer pattern parameters. Then, the 2D measurements are input into the transformer-based26 encoder to extract effective high-dimensional semantic features, which are then sent to the task-specific plug-and-play decoder to complete megapixel SPI or image-free sensing [as shown in Fig. 1(b)]. The decoder used for imaging and image-free segmentation consists of multiscale upsampling residual convolutional layers. The decoder used for image-free object detection consists of a multi-scale attention net (MSAN) and a bidirectional multi-scale feature pyramid (BMFP)27 module. Compared with other model architectures, MSAN can better extract multiscale scene features, and the BMFP module can help features of different scales to fully integrate and complement each other. The MSAN and BMFP are constructed by stacking multiscale basic parallel blocks. These blocks combine local-window self-attention (LWSA)28 and channel-wise convolution (CWC)29 in a parallel design, termed LWSA-CWC (LC) blocks. Such a design allows the interaction of spatial and channel features, thereby establishing cross-window connections and expanding the receptive field while maintaining linear complexity. In addition, considering that texture-rich regions and edge regions are more difficult to reconstruct, we use an uncertainty-driven self-adaptive loss (UDL) function inspired by Ning et al.30 that can reinforce the network’s attention on the regions with rich textures and edges, thus improving the imaging and image-free segmentation quality.


    Figure 1. Overview of the SPIS technique. (a) The optical setup of the SPIS technique. The structured illumination was generated using a DMD and a white-light source. A single-pixel detector was used to collect the light reflected from the target scene. The collected 1D measurements were digitized and then reshaped into 2D measurements. (b) In SPIS, we scan and sample the scene using small-sized optimized patterns, which achieve higher sampling performance with an order of magnitude fewer pattern parameters. The 2D measurements were then fed into the encoder to extract high-dimensional semantic features, and the features were sent to the task-specific plug-and-play decoder to complete large-scale SPI or image-free sensing. (c) The transformer-based encoder and UDL function can guide SPIS to pay more attention to the target area with more details in the scene, thus extracting high-dimensional semantic features that are effective for imaging and sensing. (d) The existing state-of-the-art SPI method ReconNet15 cannot reconstruct clear images at a sampling rate of 3% and a resolution of 1024×1024, but our SPIS can still reconstruct high-quality images in this case.

    The performance of the reported technique was demonstrated on three public data sets.31–33 In terms of imaging, SPIS achieved 24.13 dB 1024×1024 high-quality SPI at a sampling rate of 3%. As for image-free sensing, SPIS reduced data throughput by 1 order of magnitude while improving sensing accuracy. Specifically, in image-free segmentation, our method achieved a segmentation accuracy of 82.1% at a 3% sampling rate. For image-free object detection, our method achieved an accuracy of 82.41% at a 5% sampling rate with a refresh rate of 63 frames per second (fps).

    The rest of this paper is organized as follows: Sec. 2 details our proposed SPIS approach, including the network structure and training strategy. Section 3 presents the experimental results to verify the performance of SPIS in large-scale SPI and image-free sensing tasks. Finally, Sec. 4 summarizes our research results, explores the advantages and challenges of this approach in different application scenarios, and proposes future research directions.

    2 Method

    As shown in Fig. 2(a), the SPIS network consists of two main modules: the encoder and the decoder. The encoder is composed of an encoding module (which consists of several convolution layers) and a transformer-based high-dimensional semantic feature extraction module.26 The decoder type is determined by the specific task. When SPIS is applied for SPI and image-free segmentation, the decoder consists of a multiscale upsampling pyramid convolutional block with residual connections. When SPIS is used for image-free object detection, the decoder consists of an MSAN module, a BMFP module, and a prediction head.


    Figure 2. SPIS network structure. (a) Overview of the SPIS technique. (b) The transformer-based encoder. (c) The decoder for large-scale imaging and image-free segmentation. The decoder consists of multiple upsampling convolution blocks, each of which consists of multiple convolutional layers and one upsampling layer. (d) The decoder for image-free object detection.

    2.1 Detailed Structures of the Encoder and the Decoder

    The encoding module consists of several convolutional blocks with a kernel size of 32×32 and a stride of 32, which are used to simulate the sampling process of SPI. The trained encoding module g ∈ R^(k×k×n) (k is the pattern size, set to 32, and n is the number of patterns) is extracted and used as the optimized small-sized pattern in practice. Assuming that the scene is s ∈ R^(H×W×C_in) (H, W, and C_in are the height, width, and channel number, respectively), we use the pattern g to scan and sample the scene s to obtain the coupled 2D measurements F_m ∈ R^((H/32)×(W/32)×C) (C represents the number of channels of the acquired 2D coupled measurements). The above process can be characterized as F_m = f_{k×k}(s ∗ g), where f_{k×k} denotes the convolution-based sampling with kernel size k and stride k.
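To make the measurement model concrete, here is a minimal NumPy sketch of block-wise sampling with small patterns. The function name `spis_measure` and the toy sizes are illustrative assumptions, not from the paper; each k×k scene block is modulated by every pattern and summed to one bucket value.

```python
import numpy as np

def spis_measure(scene, patterns):
    """Simulate small-pattern single-pixel sampling: each k-by-k scene block
    is modulated by every pattern and summed to one bucket value."""
    H, W = scene.shape
    k = patterns.shape[0]          # pattern size (32 in the paper)
    n = patterns.shape[2]          # number of patterns per block
    assert H % k == 0 and W % k == 0
    meas = np.zeros((H // k, W // k, n))
    for i in range(H // k):
        for j in range(W // k):
            block = scene[i * k:(i + 1) * k, j * k:(j + 1) * k]
            # bucket value = inner product of the block with each pattern
            meas[i, j] = np.tensordot(patterns, block, axes=([0, 1], [0, 1]))
    return meas

# toy example: 64x64 scene, 32x32 binary patterns, n = 3 patterns per block
rng = np.random.default_rng(0)
scene = rng.random((64, 64))
patterns = (rng.random((32, 32, 3)) > 0.5).astype(float)
m = spis_measure(scene, patterns)
print(m.shape)   # (2, 2, 3)
```

The resulting 2D measurement stack keeps the block position of each target, which is what the small-sized patterns preserve and full-sized patterns lose.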

    We implement small-sized pattern sampling by embedding nonoverlapping small-sized patterns in multiple zero-initialized full-sized patterns and quickly switching between the full-sized patterns (see Appendix C for details). The patterns in all three applications are obtained by first training with a decoder that reconstructs the original image, then switching to the task-specific decoding task and fine-tuning the patterns in the convolutional layers. This sampling method combines the advantages of compressed sensing and point-scanning imaging.34 Compared with the conventional full-sized pattern sampling method, the small-sized sampling approach can retain the position information of the target and improve sampling efficiency. Compared with a point-scanning system, our method samples a larger portion of the scene at once, thus reducing sampling times and increasing sampling speed. Moreover, to improve SNR, we binarized the illuminating patterns and normalized the signals by maintaining an equal white/black ratio in each illuminating pattern.
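The binarization and embedding steps can be sketched as follows. The median-threshold rule and function names are our illustrative assumptions; the paper only states that an equal white/black ratio is maintained and that small patterns are embedded in zero-initialized full-sized patterns.

```python
import numpy as np

def binarize_equal_ratio(pattern):
    """Binarize a real-valued pattern with an equal white/black ratio
    by thresholding at the median (illustrative rule)."""
    return (pattern > np.median(pattern)).astype(np.uint8)

def embed_small_pattern(small, full_shape, top, left):
    """Embed one k-by-k small pattern into a zero-initialized
    full-sized DMD pattern at the given block position."""
    full = np.zeros(full_shape, dtype=small.dtype)
    k = small.shape[0]
    full[top:top + k, left:left + k] = small
    return full

rng = np.random.default_rng(0)
b = binarize_equal_ratio(rng.random((32, 32)))
full = embed_small_pattern(b, (1024, 1024), 0, 0)
print(b.mean())   # 0.5: equal white/black ratio
```

Switching which block position holds the small pattern, frame by frame, realizes the scanning described above.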

    The high-dimensional semantic feature extraction module of the encoder consists of several transformer layers [as shown in Fig. 2(b)]. A transformer is a basic deep-learning network structure.26 Thanks to its excellent contextual information capturing and global feature modeling capabilities, the Global Vision Transformer26 has achieved great success in the image processing field recently. Our transformer-based encoder can guide the network to focus on the regions with interesting targets to extract high-dimensional semantic features that are effective for imaging and sensing26 (more details about the encoder can be found in Sec. 4 in the Supplementary Material).

    When SPIS is applied for large-scale SPI or image-free single-pixel segmentation, the decoder [as shown in Fig. 2(c)] consists of a multiscale upsampling pyramid network with residual connections. Each upsampling block is composed of multiple 3×3 convolutional layers and an upsampling function. A rearrangement layer adjusts the number of channels after upsampling.

    When SPIS is applied for image-free object detection, the decoder consists of an MSAN module, a BMFP module, and a prediction head. The MSAN and BMFP modules are constructed by stacking multiscale LC blocks. As shown in Fig. 2(d), the LC block combines local-window (7×7 in this work) self-attention and CWC in a parallel design to model cross-window connections, expanding its receptive field and capturing contextual information.

    The feature F_e output from the encoder is first fed into the MSAN module for feature extraction, and the extracted features are termed feature layers. In the backbone part, we acquire three effective feature layers. Then, the three effective feature layers are fed into the BMFP module [as shown in Fig. 2(d)] for bidirectional multiscale feature fusion. In BMFP, we upsample and downsample the features simultaneously and perform feature fusion to fully fuse the feature information at different scales. After the MSAN and BMFP processing, we obtain three enhanced effective feature layers. They are then fed into the prediction head for the final object detection. As shown in Fig. 2(d), we divide the prediction head into two branches to implement classification and regression separately and finally integrate them when making predictions (more details about the decoder can be found in Sec. 5 in the Supplementary Material).
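The bidirectional fusion idea can be sketched with plain NumPy. Nearest-neighbor upsampling and average-pool downsampling are our simplifying assumptions; the actual BMFP uses learned convolutional layers on three feature scales.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def downsample2x(x):
    """2x average-pool downsampling."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def bidirectional_fuse(f_hi, f_mid, f_lo):
    """Fuse three feature scales in both directions: every output scale
    sums its own map with neighboring scales resampled to match it."""
    out_hi = f_hi + upsample2x(f_mid)
    out_mid = f_mid + downsample2x(f_hi) + upsample2x(f_lo)
    out_lo = f_lo + downsample2x(f_mid)
    return out_hi, out_mid, out_lo

f_hi, f_mid, f_lo = np.ones((32, 32)), np.ones((16, 16)), np.ones((8, 8))
o_hi, o_mid, o_lo = bidirectional_fuse(f_hi, f_mid, f_lo)
print(o_hi.shape, o_mid.shape, o_lo.shape)   # (32, 32) (16, 16) (8, 8)
```

The key property, visible even in this toy version, is that each output scale receives information flowing both up and down the pyramid.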

    2.2 Training Strategy and Loss Function

    To reinforce the network’s attention to the edge and texture-rich regions, we use the UDL function in the large-scale SPI task and the image-free single-pixel segmentation task. The UDL function is inspired by Ref. 30. The training of the network is divided into two stages. In the first stage, the network estimates both the reconstructed result and uncertainty values. In the second stage, the uncertainty values are used to generate a spatially adaptive weight to guide the network to prioritize the image regions with rich textures and details, thus improving the reconstruction quality of these regions. The parameters of the encoder and decoder in both stages are updated simultaneously, and the trained encoding module g ∈ R^(k×k×n) (k is the pattern size, set to 32) is extracted and used as the small-sized optimized pattern in practice.

    For the detailed derivation of the UDL function, please refer to Ref. 30. We made two changes to the UDL function specifically for large-scale SPI tasks. First, we use a smoother, slowly growing monotonically increasing function ŝ_i = ln(1 + e^{s_i}) to assign priority to regions according to their uncertainty, because overly high priorities for some regions may lead to poor reconstruction of other texture-rich and edge regions with lower priorities. This monotonically increasing function makes the uncertainty distribution more uniform, preventing the network from ignoring texture and edge regions with lower priorities while also preventing overenhancement of regions with higher uncertainty.
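The weighting function ŝ = ln(1 + e^s) is the softplus; a numerically stable NumPy version (the stable rewriting is an implementation detail we add, not from the paper):

```python
import numpy as np

def softplus(s):
    """s_hat = ln(1 + e^s), computed stably for large |s|."""
    return np.log1p(np.exp(-np.abs(s))) + np.maximum(s, 0)

# grows slowly and stays positive, so no region's weight explodes or vanishes
print(softplus(np.array([-5.0, 0.0, 5.0])))
```

Because the function is everywhere positive and nearly linear only for large s, low-uncertainty regions keep a nonzero weight instead of being ignored.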

    Second, we select different hybrid loss functions for different downstream tasks and embed them into the uncertainty-driven loss function. Specifically, in the SPI task, we select a hybrid loss composed of the structural similarity index (SSIM) loss and the L1 loss,

    Loss_UDL = (1/N) Σ_{i=1}^{N} ŝ_i (Loss_SSIM + Loss_L1),

    where Loss_SSIM and Loss_L1 denote the SSIM loss and L1 loss, respectively.

    In the image-free segmentation task, we select a hybrid loss composed of the IoU loss and the BCE loss,

    Loss_UDSL = (1/N) Σ_{i=1}^{N} ŝ_i (Loss_IoU + Loss_BCE),

    where Loss_IoU and Loss_BCE denote the intersection over union (IoU) loss and binary cross-entropy (BCE) loss, respectively. These loss functions are defined in the same way as in Refs. 35 and 36. In Loss_UDSL (UDSL represents uncertainty-driven segmentation loss), edge pixels with higher uncertainty tend to have greater weights than pixels in smooth regions. In summary, the uncertainty estimation serves as the bridge connecting the two steps: it is the output of the first step and is passed on to the second step as the guidance required for calculating Loss_UDSL.
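Under our reading of the equation above, a minimal NumPy sketch of the uncertainty-weighted hybrid segmentation loss (the soft-IoU form and all function names are illustrative assumptions; the paper defers the exact definitions to Refs. 35 and 36):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def udsl_loss(pred, target, uncertainty, eps=1e-7):
    """Loss_UDSL = mean( s_hat_i * (Loss_IoU + Loss_BCE) ), s_hat = softplus(s)."""
    pred = np.clip(pred, eps, 1 - eps)
    w = softplus(uncertainty)                       # per-pixel weights s_hat_i
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    iou = 1.0 - inter / (union + eps)               # soft-IoU loss (scalar)
    return float(np.mean(w * (iou + bce)))

rng = np.random.default_rng(0)
target = (rng.random((8, 8)) > 0.5).astype(float)
u = np.zeros((8, 8))
good = udsl_loss(target, target, u)       # near-perfect prediction
bad = udsl_loss(1.0 - target, target, u)  # inverted prediction
print(good < bad)   # True
```

With larger uncertainty values at edge pixels, the weight map w concentrates the penalty exactly where the text says it should.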

    To validate the effectiveness of the UDL function, we performed ablation experiments on imaging and image-free single-pixel segmentation. The results are shown in Fig. 3(c), from which we can see that UDL can enhance the imaging and segmentation performance on texture-rich regions and edge regions.


    Figure 3. Two-step training strategy of the SPIS network. (a) The overview of the two-step training strategy for imaging and image-free segmentation. In the first stage, the network estimates both the output result and the uncertainty values. In the second stage, the uncertainty values are used to generate a spatially adaptive weight to guide the network to prioritize the pixels in the texture-rich regions and edge regions. (b) The overview of the two-step training method for image-free single-pixel object detection. (c) Ablation study of the UDL loss function. “Step 1” represents the results output by SPIS that has not been trained by UDL, and “step 2” represents the results output by SPIS that has been trained by UDL.

    When SPIS is applied for the image-free object detection task, we adopt another two-step training strategy, as shown in Fig. 3(b). The first training stage aims to train the encoder’s high-dimensional semantic feature extraction ability and obtain an optimized spatial light modulation mask. Training at this stage is performed on a large-scale self-supervised pretraining data set, and the decoder in this stage reconstructs the scene image. The parameters of the encoder and decoder are updated simultaneously, and the parameter value range of the encoder is limited to [0, 1]. In the second training stage, the decoder is replaced with our designed object detection network module, as shown in Fig. 2(d). The encoder and decoder are updated simultaneously to find the optimal network parameters. The second training stage is performed on the public object detection data set Pascal VOC 2012. The reason for not using UDL in object detection training is that, in the image-free object detection task, our main goal is to detect the location, size, and category of the objects of interest in the scene rather than accurately reconstruct the scene image. Using uncertainty maps to increase the peak signal-to-noise ratio (PSNR) of the reconstructed result by 1 or 2 dB has a very limited effect on object detection accuracy, and it would increase the computational cost and time during training.

    As for the loss function, in the first training stage, we use the L1 loss,

    Loss_L1(I_RHQ, I_HQ) = E[‖I_RHQ − I_HQ‖_1],

    where I_RHQ stands for the high-quality image reconstructed by the network, and I_HQ stands for the ground truth. In the second training stage, the loss function consists of the regression loss L_reg,37 confidence loss L_con,27 and classification loss L_cls.37 The entire loss function is denoted as

    Loss = αL_reg + βL_con + μL_cls,

    where α, β, and μ are hyperparameters that keep the three subloss functions on the same order of magnitude.
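As a small sketch of the two training-stage losses (the function names and example values are illustrative; only the formulas come from the text above):

```python
import numpy as np

def l1_loss(recon, gt):
    """Stage-1 reconstruction loss: mean absolute error between
    the reconstruction and the ground truth."""
    return float(np.mean(np.abs(recon - gt)))

def detection_loss(l_reg, l_con, l_cls, alpha=1.0, beta=1.0, mu=1.0):
    """Stage-2 loss: alpha*L_reg + beta*L_con + mu*L_cls, with the weights
    chosen to keep the three terms on the same order of magnitude."""
    return alpha * l_reg + beta * l_con + mu * l_cls

print(l1_loss(np.array([1.0, 2.0]), np.array([0.0, 0.0])))  # 1.5
print(detection_loss(0.5, 0.3, 0.2))                         # 1.0
```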

    3 Results

    3.1 Large-Scale Single-Pixel Imaging

    To validate the imaging performance of SPIS, we first conducted simulations on the public data set Flickr2K,32 which contains 2650 images. We used 90% of the images for training and the remaining 10% for testing. In the simulation, the imaging resolution was 1024×1024, and the original gray-scale images from the data set were used as the ground truth. We used the existing ReconNet,15 DCAN,22 and GAN23 methods for performance comparison. ReconNet first introduced deep learning for SPI reconstruction. It uses a convolutional neural network to replace the previous iterative reconstruction methods and achieves efficient SPI. DCAN uses a convolutional network to optimize full-sized patterns, which helps the reconstruction network module achieve higher reconstruction quality at low sampling rates. GAN inherits the ideas of ReconNet and DCAN and introduces the generative adversarial network38 for SPI reconstruction, thereby achieving higher reconstruction fidelity. The results are shown in Table 1. The corresponding visual comparison results can be found in Sec. 1 in the Supplementary Material.

    Method        SR = 3%       SR = 5%       SR = 7%       SR = 10%      SR = 15%
                  PSNR   SSIM   PSNR   SSIM   PSNR   SSIM   PSNR   SSIM   PSNR   SSIM
    DCAN22        18.17  0.51   18.86  0.54   19.63  0.59   20.03  0.62   20.63  0.63
    GAN23         18.82  0.58   19.35  0.61   20.19  0.65   20.61  0.68   21.44  0.72
    ReconNet15    19.89  0.62   20.01  0.67   20.90  0.72   21.12  0.75   22.41  0.75
    Ours          24.13  0.83   24.64  0.86   25.33  0.86   25.69  0.88   26.17  0.89

    Table 1. Quantitative results of the large-scale SPI comparison (PSNR in dB).

    For the comparison methods (ReconNet,15 DCAN,22 and GAN23), we used the code made available by the respective authors in their original papers or on their websites. Because their original reconstruction resolution did not reach 1024 pixel × 1024 pixel, we added several upsampling residual convolution layers at the end of their networks, consistent with the upsampling residual convolutional layer structure in SPIS’s image reconstruction decoder. In this way, we increased their reconstruction resolution to 1024 pixel × 1024 pixel. In addition, the sampling resolution of the comparison methods was also increased to 1024×1024. Except for the above modifications, other parameters followed the original settings of each method. In Table 1, we can see that the reported SPIS technique achieves the highest scores in both PSNR and SSIM. In addition, the imaging time of SPIS at 1024×1024 resolution is 0.17 s, whereas the shortest imaging time of the other methods at 1024×1024 resolution is more than 1 s (the imaging time only includes the algorithm reconstruction time with an RTX 3090 GPU and an AMD 3700X CPU). The results validate that the SPIS technique is capable of more efficient and high-quality large-scale SPI. The superior performance originates from several aspects. First, the proposed small-sized optimized pattern sampling method achieves better sampling performance with fewer patterns. Second, the encoder designed based on the transformer architecture can better model global features and extract high-dimensional semantic features that are effective for reconstruction. Third, the UDL function reinforces the network’s attention to the regions with rich textures and details.
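For reference, the PSNR figures in Table 1 follow the standard definition; a minimal sketch, assuming images normalized to [0, 1] (the peak value is our assumption):

```python
import numpy as np

def psnr(recon, gt, peak=1.0):
    """PSNR = 10*log10(peak^2 / MSE), in dB."""
    mse = np.mean((recon - gt) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

gt = np.linspace(0, 1, 1000)
print(round(psnr(gt + 0.01, gt), 2))   # 40.0, since MSE = 1e-4
```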

    In Fig. 4(a), we present the statistical results of the pattern comparison, noise robustness test, and ablation study of the uncertainty loss function. For the pattern comparison, we compared the reported small-sized optimized pattern with the conventional Hadamard pattern,22 random pattern,20 and the optimized full-sized pattern.24 The results show that our method maintains stable performance even at an extremely low sampling rate of 3%. The Hadamard and random patterns did not perform as well as the optimized patterns. The optimized full-sized pattern did not perform well because it can only acquire one-dimensional measurements and cannot retain the position information of the target. In summary, the optimized small-sized pattern is superior for large-scale SPI.


    Figure 4. SPI experiment. (a) The statistical results of pattern comparison, noise robustness, and ablation study for uncertainty loss functions. (b) The proof-of-concept setup for large-scale SPI and image-free single-pixel sensing. (c) The visualization results of large-scale SPI on 3D scenes at a sampling rate of 3% and a resolution of 1024×1024. (d) The visualization results of large-scale SPI on a resolution board at a sampling rate of 3% and a resolution of 1024×1024. Step 1 marks the results output by SPIS after the first training stage, which has not been trained by the UDL. Step 2 denotes the reconstruction results of SPIS with UDL.

    To examine robustness to measurement noise, we also added noise with an SNR of 10 to 20 dB to the measurements (details on how the noise is added can be found in Sec. 1 in the Supplementary Material). The results in Fig. 4(a) show that the reported technique still outperforms the others. Even at a sampling rate of 3% and an SNR of 20 dB, our method still produced a reconstruction PSNR of 22.13 dB. In addition, the PSNR gaps between SPIS-reconstructed images under different noise levels are small, which shows that our method is robust to noise interference.
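The noise-robustness test can be reproduced with a simple routine that adds white Gaussian noise at a target SNR. Additive Gaussian noise is our assumption following common convention; the exact protocol is given in the Supplementary Material.

```python
import numpy as np

def add_noise_snr(meas, snr_db, rng):
    """Add white Gaussian noise so that 10*log10(P_signal/P_noise) = snr_db."""
    sig_power = np.mean(meas ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))
    return meas + rng.normal(0.0, np.sqrt(noise_power), meas.shape)

rng = np.random.default_rng(1)
meas = rng.random(200_000)
noisy = add_noise_snr(meas, 20.0, rng)    # target SNR: 20 dB
est = 10 * np.log10(np.mean(meas ** 2) / np.mean((noisy - meas) ** 2))
print(round(est, 1))
```

The empirical SNR of the noisy measurements closely matches the requested value for a large sample.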

    Also, we conducted an ablation study of the reported UDL function and the two-stage training strategy, with the results shown in Fig. 4(a). Three models were involved in the comparison: the SPIS model without the UDL function (W/O uncertainty), the SPIS model after the first training stage (step 1), and the full SPIS model after both training stages (step 2). We can see that the UDL function effectively improves reconstruction quality. This benefit arises because the uncertainty map θ generated in the first training step reinforces the network’s attention to the texture-rich and edge regions in the second training stage, thus improving the reconstruction quality of these regions.

    To experimentally demonstrate the large-scale imaging performance of the reported SPIS technique, we built a proof-of-concept setup [as shown in Fig. 4(b)] based on the optical system shown in Fig. 1(a). The structured illumination was generated using a DMD (V-9501, Vialux) and a white-light source (Thorlabs SLS302). With the help of the high-performance programming and development tool ALP4.3, we can achieve a refresh rate of up to 50 kHz by utilizing only some of the DMD pixels, which is equivalent to increasing the refresh speed of the DMD by sacrificing DMD resolution under limited data bandwidth. A single-pixel detector (PDA100A2, Thorlabs) was used to collect the light reflected from the target scene. The collected 1D measurements were digitized using a data acquisition card (PCIE8514, 10 MHz) and then fed into the computing unit. The computing unit converts the collected 1D measurements into 2D measurements and sends them to the SPIS network to complete imaging or image-free sensing tasks.

    We used multiple 3D plaster sculpture models and a resolution board as the target scenes and conducted a large-scale SPI comparison at a resolution of 1024 pixel × 1024 pixel and a sampling rate of 3%. Based on the results in Table 1, we chose ReconNet for comparison. To obtain the ground truth, we performed classical SPI to reconstruct the scene using a sampling rate of 100%: we displayed Hadamard patterns22 on the DMD, and the traditional compressive sensing-based total variation (TV)39 algorithm was employed to reconstruct the ground-truth images. The experimental results are shown in Figs. 4(c) and 4(d). Thanks to the ultrahigh refresh rate (50 kHz) using partial pixels of the DMD, we can complete scene sampling and acquire measurements within 0.63 s at a sampling rate of 3% and a resolution of 1024 pixel × 1024 pixel. On the other hand, the reconstruction time for one image with a resolution of 1024 pixel × 1024 pixel is 0.17 s. From the experimental results in Figs. 4(c) and 4(d), we can see that the reported SPIS technique outperforms the comparison methods with higher imaging quality and finer details. This superiority originates from the following aspects. First, the proposed small-sized optimized pattern method achieves better sampling performance with fewer patterns. Second, the UDL reinforces the network’s attention to the regions with rich textures and edges, thus improving image reconstruction quality.

    3.2 Image-Free Single-Pixel Segmentation

    The SPIS technique can be applied to the image-free semantic segmentation task without modifying the network structure. To demonstrate the segmentation performance of our method, we trained our SPIS network on the cell segmentation data set WBC33 and conducted an experimental comparison with the existing state-of-the-art single-pixel segmentation methods. The experimental results are shown in Table 2 and Fig. 5. More experimental results can be found in Sec. 2 in the Supplementary Material.

    Each cell lists Dice / mIoU.

    | Method | SR = 3% | SR = 5% | SR = 7% | SR = 10% | SR = 15% |
    | --- | --- | --- | --- | --- | --- |
    | Hadamard22 | 0.148 / 0.071 | 0.218 / 0.099 | 0.151 / 0.072 | 0.224 / 0.095 | 0.237 / 0.096 |
    | Random20 | 0.207 / 0.092 | 0.121 / 0.061 | 0.123 / 0.062 | 0.162 / 0.077 | 0.173 / 0.074 |
    | Optimized24 | 0.716 / 0.634 | 0.721 / 0.652 | 0.739 / 0.649 | 0.763 / 0.684 | 0.793 / 0.658 |
    | Imaging15,40 | 0.638 / 0.408 | 0.690 / 0.466 | 0.715 / 0.524 | 0.713 / 0.546 | 0.674 / 0.522 |
    | SPS25 | 0.705 / 0.628 | 0.717 / 0.659 | 0.729 / 0.673 | 0.743 / 0.691 | 0.785 / 0.648 |
    | Ours | 0.821 / 0.701 | 0.823 / 0.704 | 0.884 / 0.793 | 0.889 / 0.803 | 0.921 / 0.854 |

    Table 2. Quantitative performance comparison of different image-free segmentation methods at different SRs. The methods for comparison include the image-free segmentation methods using the Hadamard pattern,22 random pattern,20 and optimized full-sized pattern (“Optimized”),24 the imaging-first15-sensing-later40 method (“Imaging”), and the UNet-based image-free segmentation method (SPS).25 In the “Imaging” method, we used the method in Ref. 15 for imaging and UNet40 for semantic segmentation.


    Figure 5.Image-free single-pixel segmentation experiment. (a) Experimental results of segmentation performance of the five methods involved in the comparison and our proposed image-free single-pixel image segmentation method at different sampling rates. (b) Visualization results of image-free single-pixel image segmentation performance comparison experiments at different sampling rates (SRs).

    We would like to note that the WBC data set contains a total of 400 pairs of images. We used 90% of the data set for training and the remaining 10% for testing. Because the resolution of the WBC data set33 is not high, all the simulation results in Table 2 and Fig. 5(a) were produced at a resolution of 256 pixel × 256 pixel. The methods involved in the comparison include Hadamard,22 random,20 optimized full-sized patterns,24 imaging-first15-sensing-later,40 and SPS.25 Among them, the first three methods were implemented by replacing the modulation patterns in SPIS; the other settings were kept consistent with SPIS. The imaging-first-sensing-later method used ReconNet15 for SPI reconstruction and input the imaging results into UNet40 for subsequent segmentation. The SPS method is a recently reported image-free single-pixel segmentation technique. It uses optimized full-sized patterns to sample the scene and uses a convolutional network to preliminarily process the measurements and output intermediate results. Finally, it inputs the intermediate results into a UNet to perform segmentation. We added an upsampling residual convolution layer (consistent with the upsampling residual convolutional layer structure in SPIS’s image reconstruction decoder) to increase the output resolution to 256 pixel × 256 pixel (originally, it was 128 × 128), and the other settings remained the same as in the original publication.

    As shown in Table 2, our method achieved superior image-free segmentation performance compared with the other methods. Neither the Hadamard pattern nor the random pattern is optimized by the network. These patterns sample all regions with the same weight, which means that they cannot adaptively assign higher sampling weights to the targets, resulting in relatively poor sampling efficiency. Although the optimized full-sized pattern can achieve global self-adaptive sampling, it can only produce 1D sampling measurements, which cannot retain the location and size information of the targets, resulting in unsatisfactory segmentation performance. The imaging-first-sensing-later approach first implements image reconstruction and then performs segmentation on the reconstructed image. Because the reconstruction precision is usually unsatisfactory at very low sampling rates, this step directly leads to poor segmentation performance. By contrast, the image-free single-pixel segmentation method uses the high-dimensional semantic information extracted from the measurements directly for image segmentation, which avoids information loss and error accumulation. Due to the limited receptive field of the convolutional block and the lack of global feature modeling capability, the segmentation performance of SPS is not as good as that of our image-free single-pixel segmentation network designed based on the transformer framework.

    Figure 5(b) shows the visual experimental results of different image-free single-pixel segmentation methods at different sampling rates. We again employed the proof-of-concept setup in Fig. 4(b) to acquire measurements. Images randomly selected from the WBC testing data set were printed on films as the target scene. We first performed classical SPI at a sampling rate of 100% to obtain high-quality reconstructed images. Next, we manually annotated the reconstructed images according to the annotations in the WBC data set to obtain the ground-truth segmentation. From Fig. 5(b), we can see that our SPIS method achieves the highest segmentation performance at different sampling rates. In particular, at the extremely low sampling rate of 0.1%, only our method can still accurately segment cells.

    3.3 Image-Free Single-Pixel Object Detection

    The reported SPIS technique can also be applied to image-free single-pixel object detection. To reduce computational overhead while ensuring detection accuracy, we sampled the scene at a resolution of 128 pixel × 128 pixel in the image-free object detection task. Different from the image-free single-pixel segmentation task, image-free object detection does not require an uncertainty map for training. Therefore, in the first training stage, we did not output the uncertainty map but only the reconstructed image. The purpose of this training step is to enhance the high-dimensional semantic feature extraction ability of the encoder. In the second training stage, we replaced the decoder with a task-specific plug-and-play decoder consisting of MSAN and BMFP modules for end-to-end image-free object detection training [as shown in Fig. 3(b)]. We tested the performance of our method for image-free object detection on the Pascal VOC data set and conducted a performance comparison with existing state-of-the-art object detection methods. The experimental results are shown in Tables 3 and 4 and Fig. 6. More experimental results can be found in Sec. 3 in the Supplementary Material.

    | Method | Data throughput (Mbps) | Accuracy (mAP) | Time (ms) | Speed (fps) |
    | --- | --- | --- | --- | --- |
    | R-CNN41 | 1.280 | 58.51% | 200 | 5.0 |
    | Faster R-CNN37 | 2.586 | 73.22% | 99 | 10.1 |
    | YOLO27 | 6.732 | 78.64% | 38 | 26.3 |
    | SSD42 | 4.915 | 79.63% | 52 | 19.2 |
    | DETReg43 | 3.098 | 81.16% | 83 | 12.1 |
    | SPIS (ours) | 0.396 | 82.41% | 16 | 62.5 |

    Table 3. Comparison of data throughput and running speed between SPIS and the other existing object detection methods.

    | Object class | “horse” | “cat” | “bicycle” | “sofa” | “boat” |
    | --- | --- | --- | --- | --- | --- |
    | Accuracy | 92% | 89% | 78% | 86% | 84% |

    | Object class | “cow” | “sheep” | “person” | “bird” | “table” |
    | --- | --- | --- | --- | --- | --- |
    | Accuracy | 81% | 76% | 76% | 86% | 74% |

    Table 4. Experimental detection results of SPIS at 5% sampling rate on the testing set of Pascal VOC2012.


    Figure 6.Image-free single-pixel object detection experiment. (a) Statistic results of pattern sampling performance comparison and noise interference experiments. (b) Visualization results of image-free single-pixel object detection. The “min” and “max” represent the relative coordinates of the upper left corner and lower right corner of the target bounding box, respectively. To better demonstrate the detection results, we visualized the output of SPIS on the input scene.

    Figure 6(a) validates that our SPIS technique using small-sized patterns outperforms the image-free object detection techniques using other patterns. Even at a low sampling rate of 5%, SPIS still produced an accuracy of 82.41% (mAP). To further validate the robustness of our method to measurement noise (details on how the noise is added can be found in Sec. 1 in the Supplementary Material), we also added Gaussian noise to the 2D measurements with an SNR from 10 to 20 dB. Figure 6(a) shows that the SPIS technique still performs well with measurement noise; even at 10 dB SNR and a 5% sampling rate, the detection accuracy reached 78.17% (mAP).
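    For reference, the noise injection can be sketched as follows (a minimal NumPy sketch assuming additive white Gaussian noise scaled to a target SNR; the exact protocol used in the paper is given in Sec. 1 in the Supplementary Material, and the function name is illustrative):

```python
import numpy as np

def add_noise_at_snr(measurements, snr_db, rng=None):
    """Corrupt 2D measurements with white Gaussian noise at a target SNR (dB).

    Sketch only: the noise power is set so that the ratio of mean signal
    power to mean noise power equals 10**(snr_db / 10).
    """
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(measurements ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=measurements.shape)
    return measurements + noise
```

    At 10 dB SNR the noise power is one tenth of the signal power, which matches the harshest setting tested in Fig. 6(a).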

    To validate that SPIS can greatly reduce data redundancy and data throughput, we conducted a data throughput comparison with the existing object detection algorithms, as shown in Table 3. We can see that SPIS achieves higher object detection accuracy with 1 order of magnitude less data throughput than the existing object detection methods; the detection speed is also faster. This is because the image-free sensing strategy eliminates the computational overhead of scene reconstruction, and the small-sized pattern achieves better sampling performance with fewer parameters. In addition, the LC block in the decoder uses a parallel design to combine the LWSA with CWC, allowing the decoder to obtain a global receptive field while maintaining linear complexity, thereby reducing the number of parameters and the computational overhead.

    We used the proof-of-concept setup in Fig. 4(b) to validate the experimental image-free object detection performance of SPIS. All the parameter settings of the proof-of-concept setup remained unchanged, implying that the SPIS technique can use one hardware device to achieve multiple image-free sensing tasks. Images randomly selected from the Pascal VOC testing data set were printed on films and used as the target scene. We first performed classical SPI at a sampling rate of 100% to obtain high-quality reconstructed images. Next, we manually annotated the reconstructed images according to the annotations in the Pascal VOC data set to obtain the ground-truth detection results. At a sampling rate of 5%, the average time to complete spatial light modulation and image-free object detection per scene is 0.016 s. This is faster than performing scene reconstruction (0.05 s)22 first and then object detection (0.018 s).27 Figure 6(b) presents the detection results corresponding to several exemplar scenes. Among them, the attention heat maps validate that the transformer-based encoder can indeed reinforce the network’s attention to the targets. Table 4 shows the statistical accuracy of various objects and the overall accuracy. We can see that the SPIS technique maintains a high detection accuracy on different classes of objects (the average detection accuracy over all the object classes reaches 82.2%).

    4 Conclusion and Discussion

    In this work, we reported the large-scale SPIS technique, which can achieve megapixel high-quality SPI and highly efficient image-free sensing at a low sampling rate. The SPIS technique utilizes an encoder–decoder structure, in which the illumination patterns are jointly optimized during the network training. Unlike the conventional full-sized patterns,20,22,24 we introduced small-sized optimized patterns to scan and sample the target scene, which achieves higher sampling performance with 1 order of magnitude fewer parameters. Moreover, we designed the encoder module based on the transformer architecture,26 which can better model global features and extract high-dimensional semantic features for high-quality reconstruction and sensing. On the other hand, we can replace the decoder module with a task-specific plug-and-play decoder, providing great adaptivity to different tasks. Considering that texture-rich and edge regions are more difficult to reconstruct, we introduced a novel UDL loss function to reinforce the network’s attention to these regions, thus further improving the imaging and sensing precision.

    With the high-resolution SPI and high-efficiency image-free sensing performance at a very low sampling rate, the SPIS technique can adapt to the applications of low bandwidth or limited computational resources. For example, it can be applied on mobile platforms with limited loads, such as vehicle radar and UAVs. The SPIS technique can help them achieve efficient scene reconstruction and intelligent image-free sensing at a low cost.

    In the experiments, we noticed that the size of the small-sized pattern affects the performance of imaging and image-free sensing. Theoretically, the smaller the pattern size, the better the image-free sensing performance, because a smaller pattern can retain more detailed location information and capture rich local features. However, as the pattern becomes smaller, the luminous flux becomes lower, which inevitably reduces the performance of imaging and image-free sensing. We tried six different resolutions of small-sized patterns (4×4, 8×8, 16×16, 32×32, 64×64, and 128×128) and found that the pattern size of 32×32 produced the best sampling performance (the specific comparison details and results are presented in Sec. 6 in the Supplementary Material). Therefore, the 32×32 small-sized pattern was used in all the above experiments. However, in practical applications with different environmental parameters, the most appropriate pattern size may change. For example, in a low-light environment, a larger pattern size can provide more luminous flux. How to choose the appropriate pattern size according to different application environments is one of the future research directions.

    We also noticed that the practical implementation of small-sized patterns has an impact on sampling efficiency. In our experiments, we embedded nonoverlapping small-sized patterns in zero-initialized full-sized patterns and implemented small-sized pattern sampling by quickly switching the full-sized patterns; the illumination speed is thus limited by the DMD. Another way to improve sampling efficiency is to first use the DMD to modulate the beam and generate a small-sized light pattern and then use a resonant galvanometer scanner set to quickly scan the pattern over the scene without overlapping, with the resonant scanner and galvanometer mirrors optically coupled through a relay lens set. This small-sized pattern sampling method can increase the sampling speed.

    How to apply the SPIS technique to various complex real-world scenes is challenging. In this work, the measurement noise obtained from simulation or laboratory environments is relatively homogeneous, which may be different from complex real-world environments. To suppress noise in complex real-world environments, we can first study the complex photon flow model to accurately characterize multiple physical noise sources44 and collect SPI data sets in real-world scenes to calibrate noise model parameters. With this real-world physical noise model, we can improve the generalization ability of the SPIS technique and achieve robust high-precision imaging and sensing in practical applications.

    5 Appendix A: Training Details

    We implemented SPIS on Ubuntu 20 using the PyTorch framework and trained it with the Adam optimizer on an NVIDIA RTX 3090 GPU.

    In the large-scale SPI experiments, we trained SPIS and the other methods for comparison on the Flickr2K32 data set. We augmented the training set by cropping, rotating, and flipping the images. All the images were resized to 1024 pixel × 1024 pixel, and the pixel values were normalized to 0–1 before being fed into the networks. During training, the batch size was set to 6, and the SPIS network was trained for a total of 800 epochs. In the first stage, a total of 600 epochs were trained; the initial learning rate was set to 0.0001 and dropped by 60% after 200 epochs. In the second stage, a total of 200 epochs were trained. The default values of β1 and β2 were set to 0.5 and 0.999, respectively. The weight decay was set to 0.00005.

    In the image-free single-pixel segmentation experiments, we trained our SPIS and the other methods for comparison on the WBC33 data set. We augmented the training set by cropping, rotating, and flipping the images. All the images were resized to 256 pixel × 256 pixel, and the pixel values were normalized to 0–1 before being fed into the networks. During training, the batch size was set to 8, and the SPIS network was trained for a total of 600 epochs. In the first stage, a total of 400 epochs were trained; the initial learning rate was set to 0.0001 and dropped by 60% after 200 epochs. In the second stage, a total of 200 epochs were trained. The default values of β1 and β2 were set to 0.5 and 0.999, respectively. The weight decay was set to 0.00005.

    In the large-scale SPI and imaging-free single-pixel segmentation experiments, to reinforce the network’s attention to the edge regions and the regions containing rich textures and details, we introduced a UDL function. The training process was divided into two stages. In the first stage, the network estimated both the reconstructed results and the uncertainty values. In the second stage, the uncertainty values were used to generate a spatially adaptive weight to guide the network to prioritize the image regions with rich texture and edge regions, thus improving the reconstruction quality of these regions. The parameters of the encoder and decoder in both stages were updated simultaneously.
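    As a rough illustration of the two-stage UDL scheme described above, the NumPy sketch below follows the generic uncertainty-driven loss formulation of Ref. 30; the function names, the log-variance parameterization, and the exact loss terms are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def stage1_loss(pred, target, log_var):
    # Stage 1 (sketch): jointly estimate the reconstruction and a
    # per-pixel uncertainty map, here parameterized as a log-variance.
    return np.mean(np.abs(pred - target) * np.exp(-log_var) + log_var)

def stage2_loss(pred, target, log_var):
    # Stage 2 (sketch): the learned uncertainty acts as a spatially
    # adaptive weight that emphasizes hard-to-reconstruct regions
    # (rich textures and edges).
    weight = np.exp(log_var)
    weight = weight / weight.mean()  # normalize to keep the loss scale stable
    return np.mean(weight * np.abs(pred - target))
```

    In stage 2, pixels with higher estimated uncertainty receive proportionally larger weight, so gradient updates prioritize the texture-rich and edge regions.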

    In the image-free single-pixel object detection experiments, we trained our SPIS and the other methods for comparison on the Pascal VOC data set. We augmented the training set by cropping, rotating, and flipping the images. All the images were resized to 128 pixel × 128 pixel, and the pixel values were normalized to 0–1 before being fed into the networks. During training, the batch size was set to 16, and the SPIS was trained for a total of 800 epochs. In the first stage, a total of 600 epochs were trained; the initial learning rate was set to 0.0001 and dropped by 60% after 200 epochs. In the second stage, a total of 200 epochs were trained. The default values of β1 and β2 were set to 0.5 and 0.999, respectively. The weight decay was set to 0.00005.
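    The learning-rate schedule shared by the three tasks can be sketched as a stepwise decay (a minimal sketch; we assume the 60% drop is applied once per 200-epoch interval, and the function name is illustrative):

```python
def step_lr(epoch, initial_lr=1e-4, step=200, drop=0.6):
    """Stepwise schedule: the learning rate drops by 60%
    (i.e., is multiplied by 0.4) every `step` epochs."""
    return initial_lr * (1.0 - drop) ** (epoch // step)
```

    Combined with Adam (β1 = 0.5, β2 = 0.999) and weight decay 5e-5, this reproduces the optimizer settings listed above.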

    When the SPIS technique was applied for the image-free object detection task, we adopted another two-step training strategy. The first training stage aimed to train the encoder’s high-dimensional semantic feature extraction ability and obtain optimized small-size modulation patterns. The function of the decoder in this stage was to reconstruct the target scene. The encoder and decoder parameters were updated simultaneously in the first training stage. During the second training stage, the decoder was replaced with our designed object detection network. The encoder and decoder were updated simultaneously to find the optimal network parameters.

    6 Appendix B: Sampling Rate Calculation

    All the sampling rates mentioned in this paper can be calculated as the ratio between the measurement number and the sampling resolution,

    SR = X_measurement / Y_resolution,

    where SR represents the sampling rate, X_measurement is the total number of measurements, and Y_resolution denotes the sampling resolution (the total number of pixels).

    For the conventional full-sized modulation patterns, one pattern produces one measurement. Therefore, the sampling rate of full-sized modulation patterns can be directly derived as

    SR = X_patterns / Y_resolution,

    where X_patterns represents the number of modulation patterns.

    For the small-sized optimized pattern reported in this work, a small-sized optimized pattern scans the entire scene and produces multiple measurements. The measurement number can be calculated as

    X_measurement = (Y_resolution / Size_patterns) × X_patterns,

    where Size_patterns represents the size (pixel count) of the small-sized optimized pattern. Therefore, for the small-sized optimized pattern, the sampling rate can be calculated as

    SR = (Y_resolution / Size_patterns) × X_patterns / Y_resolution = X_patterns / Size_patterns.

    From the above formula, the number of small-sized patterns required at a specific sampling rate is

    X_patterns = SR × Size_patterns.
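    These relations can be checked numerically; in the sketch below, resolution and pattern_size are total pixel counts, and the function names are illustrative:

```python
def num_patterns(sr, pattern_size):
    # X_patterns = SR × Size_patterns
    return round(sr * pattern_size)

def num_measurements(resolution, pattern_size, n_patterns):
    # Each small pattern is scanned over resolution / pattern_size
    # nonoverlapping positions, producing that many measurements.
    return (resolution // pattern_size) * n_patterns
```

    For example, at SR = 5% with 32×32 patterns on a 128×128 scene, num_patterns(0.05, 32*32) gives 51 patterns and num_measurements(128*128, 32*32, 51) gives 816 measurements, matching the example in Appendix C.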

    7 Appendix C: DMD Scanning Strategy

    As shown in Fig. 7, we implemented small-sized pattern sampling by embedding nonoverlapping small-sized patterns in multiple zero-initialized full-sized patterns. For example, when the sampling rate is 5% and the scene resolution is 128×128, we need 51 small-sized optimized patterns of size 32×32. Each 32×32 pattern needs to scan and sample 16 positions on the 128×128 target scene, finally producing a total of 51 2D measurements of size 4×4. To achieve this, we embedded each 32×32 pattern in 16 zero-initialized 128×128 patterns without overlapping, so that we obtained a total of 816 (16×51) locally valid 128×128 patterns. These patterns were then fed into the DMD to sample the target scene, producing 816 1D measurements, which were sequentially reshaped into 51 2D measurements of size 4×4.
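    The embedding and reshaping steps above can be sketched as follows (a minimal NumPy sketch; the array layouts and function names are illustrative):

```python
import numpy as np

def embed_small_patterns(small_patterns, scene_size):
    """Embed each small pattern at every nonoverlapping position of a
    zero-initialized full-sized pattern, yielding the locally valid
    patterns that are fed to the DMD."""
    n, p, _ = small_patterns.shape
    steps = scene_size // p  # positions per row/column
    full = np.zeros((n * steps * steps, scene_size, scene_size),
                    dtype=small_patterns.dtype)
    k = 0
    for pat in small_patterns:
        for i in range(steps):
            for j in range(steps):
                full[k, i * p:(i + 1) * p, j * p:(j + 1) * p] = pat
                k += 1
    return full

def measurements_to_2d(meas_1d, n_patterns, steps):
    # The sequential 1D measurements are reshaped into small 2D
    # measurements of size steps × steps.
    return np.asarray(meas_1d).reshape(n_patterns, steps, steps)
```

    With 51 patterns of size 32×32 on a 128×128 scene, embed_small_patterns returns 816 full-sized patterns, and the 816 sequential measurements reshape into 51 2D measurements of size 4×4.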


    Figure 7.Sampling process of small-sized patterns. We take a 2×2 small-sized pattern to scan and sample an 8×8 scene as an example to show the sampling process of our small-sized optimized patterns. We embed the nonoverlapping 2×2 small-sized patterns in multiple zero-initialized 8×8 full-sized patterns and quickly switch between full-sized patterns.

    The above sampling method through small-sized optimized patterns combines the advantages of both compressed sensing and point scanning imaging.34 Compared with the conventional full-sized pattern, our small-sized pattern can retain the position information of the target and improve sampling efficiency. Compared with the point scanning system, our method samples a much larger portion of the scene at once, thus reducing sampling time and increasing sampling speed.

    Lintao Peng received his BS degree from the School of Computer Science and Technology, Xidian University, Xi’an, China, in 2020. He is currently pursuing a PhD with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. His research interests include computer vision, computational photography, and deep learning.

    Siyu Xie received his BS degree from the School of Communication Engineering, Jilin University, Changchun, China, in 2021. He is currently pursuing an MS degree in Electronic Information Engineering with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China, and is expected to graduate in 2024. His research interests include computational photography and deep learning.

    Hui Lu received her BS degree from the School of Communication Engineering, Jilin University, Changchun, China, in 2022. She is currently pursuing an MS degree in Electronic Information Engineering with the School of Information and Electronics, Beijing Institute of Technology, Beijing, China. Her research interests include computational photography and deep learning.

    Liheng Bian received his PhD from the Department of Automation, Tsinghua University, Beijing, China, in 2018. He is currently an associate professor with the Beijing Institute of Technology. His research interests include computational imaging and computational sensing. More information about him can be found at https://bianlab.github.io/.

    References

    [1] M. P. Edgar, G. M. Gibson, M. J. Padgett. Principles and prospects for single-pixel imaging. Nat. Photonics, 13, 13-20(2019).

    [2] Z. Zhang, X. Ma, J. Zhong. Single-pixel imaging by means of Fourier spectrum acquisition. Nat. Commun., 6, 1-6(2015).

    [3] W. Withayachumnankul, D. Abbott. Terahertz imaging: compressing onto a single pixel. Nat. Photonics, 8, 593-594(2014).

    [4] R. I. Stantchev et al. Real-time terahertz imaging with a single-pixel detector. Nat. Commun., 11, 2535(2020).

    [5] W. Chen, X. Chen. Ghost imaging for three-dimensional optical security. Appl. Phys. Lett., 103, 221106(2013).

    [6] P. Zheng et al. Metasurface-based key for computational imaging encryption. Sci. Adv., 7, eabg0363(2021).

    [7] S. Ota et al. Ghost cytometry. Science, 360, 1246-1251(2018).

    [8] J. Li et al. Spectrally encoded single-pixel machine vision using diffractive networks. Sci. Adv., 7, eabd7690(2021).

    [9] P. Kilcullen, T. Ozaki, J. Liang. Compressed ultrahigh-speed single-pixel imaging by swept aggregate patterns. Nat. Commun., 13, 7879(2022).

    [10] M.-J. Sun et al. Single-pixel three-dimensional imaging with time-based depth resolution. Nat. Commun., 7, 12010(2016).

    [11] R. Stojek et al. Single pixel imaging at high pixel resolutions. Opt. Express, 30, 22730-22745(2022).

    [12] M. Wenwen et al. Sparse Fourier single-pixel imaging. Opt. Express, 27, 31490-31503(2019).

    [13] M. Lyu et al. Deep-learning-based ghost imaging. Sci. Rep., 7, 17865(2017).

    [14] P. Ryczkowski et al. Ghost imaging in the time domain. Nat. Photonics, 10, 167-170(2016).

    [15] K. Kulkarni et al. ReconNet: non-iterative reconstruction of images from compressively sensed measurements(2016).

    [16] O. Graydon. Imaging: retina-like single-pixel camera. Nat. Photonics, 11, 335(2017).

    [17] S. C. Yurtkulu, Y. H. Şahin, G. Unal. Semantic segmentation with extended DeepLabv3 architecture, 1-4(2019).

    [18] Y. Huang et al. Unsupervised interpolation recovery method for spectrum anomaly detection and localization. Space: Sci. Technol., 3, 0082(2023).

    [19] A. Gatti et al. Ghost imaging with thermal light: comparing entanglement and classical correlation. Phys. Rev. Lett., 93, 093602(2004).

    [20] B. Sun et al. 3D computational imaging with single-pixel detectors. Science, 340, 844-847(2013).

    [21] M.-J. Sun et al. A Russian dolls ordering of the Hadamard basis for compressive single-pixel imaging. Sci. Rep., 7, 3464(2017).

    [22] C. F. Higham et al. Deep learning for real-time single-pixel video. Sci. Rep., 8, 1-9(2018).

    [23] M. Zhao et al. Generative adversarial network-based single-pixel imaging. J. Soc. Inf. Disp., 30, 648-656(2022).

    [24] H. Fu, L. Bian, J. Zhang. Single-pixel sensing with optimal binarized modulation. Opt. Lett., 45, 3111-3114(2020).

    [25] H. Liu, L. Bian, J. Zhang. Image-free single-pixel segmentation. Opt. Laser Technol., 157, 108600(2023).

    [26] A. Vaswani et al. Attention is all you need, 6000-6010(2017).

    [27] J. Redmon, A. Farhadi. YOLO9000: better, faster, stronger, 6517-6525(2017).

    [28] A. Dosovitskiy et al. An image is worth 16 x 16 words: transformers for image recognition at scale(2021).

    [29] F. Chollet. Xception: deep learning with depthwise separable convolutions, 1800-1807(2017).

    [30] Q. Ning et al. Uncertainty-driven loss for single image super-resolution, 16398-16409(2021).

    [31] M. Everingham et al. The PASCAL visual object classes challenge 2010 (VOC2010) results(2010).

    [32] B. Lim et al. Enhanced deep residual networks for single image super-resolution, 1132-1140(2017).

    [33] X. Zheng et al. Fast and robust segmentation of white blood cell images by self-supervised learning. Micron, 107, 55-71(2018).

    [34] L. Beiser. Fundamental architecture of optical scanning systems. Appl. Opt., 34, 7307-7317(1995).

    [35] D. Jha et al. Kvasir-seg: a segmented polyp dataset. Lect. Notes Comput. Sci., 11962, 451-462(2020).

    [36] D. Jha et al. ResUNet++: an advanced architecture for medical image segmentation, 225-2255(2019).

    [37] S. Ren et al. Faster R-CNN: towards real-time object detection with region proposal networks(2015).

    [38] I. Goodfellow et al. Generative adversarial nets, 2672-2680(2014).

    [39] L. Bian et al. Experimental comparison of single-pixel imaging algorithms. J. Opt. Soc. Am. A, 35, 78-87(2018).

    [40] O. Ronneberger, P. Fischer, T. Brox. U-Net: convolutional networks for biomedical image segmentation. Lect. Notes Comput. Sci., 9351, 234-241(2015).

    [41] R. Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation, 580-587(2014).

    [42] W. Liu et al. SSD: single shot multibox detector. Lect. Notes Comput. Sci., 9905, 21-37(2016).

    [43] A. Bar et al. DETReg: unsupervised pretraining with region priors for object detection, 14585-14595(2022).

    [44] L. Bian et al. High-resolution single-photon imaging with physics-informed deep learning. Nat. Commun., 14, 5902(2023).