Real-time target recognition with all-optical neural networks for ghost imaging

Abstract

 

The generation and structural characteristics of random speckle patterns affect both the implementation and the imaging quality of computational ghost imaging, and their modulation is limited by traditional electronic hardware. We aim to address this limitation using the features of an all-optical neural network. This work proposes a real-time target recognition system for ghost imaging based on an all-optical diffraction deep neural network. We use a trained neural network to perform pure phase modulation on visible light and complete the target recognition task directly by detecting the maximum light intensity signal among different detector positions. We optimized the system by simulating the effects of parameters such as the number of network layers, the number of photosensitive pixels, and the unit area on the final recognition performance, and the target recognition accuracy reached 91.73%. The trained neural network was materialized by 3D printing, and experiments confirmed that the system performs real-time target recognition at a low sampling rate of 1.25%. The experiments also verified the feasibility and noise resistance of the system in practical application scenarios.

 

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Ghost imaging (GI) employs a dual-path optical structure, in which one beam passes through the object and its light is collected by a bucket detector, while the other beam is directed toward a detector with spatial resolution. Through correlation calculations between the two beams, clear images of the object can be obtained even without an object in the reference beam path [1-3]. The emergence of computational ghost imaging (CGI) subsequently simplified the system into a single-path optical structure [4-7]. However, the large size, complex structure, and relatively low modulation rate of the spatial light modulator (SLM) used in CGI, as well as the binary modulation and limited modulation rate (at most tens of kilohertz) of the digital micro-mirror device (DMD), have constrained the development of CGI performance. As the limitations of traditional integrated circuits become increasingly apparent and the demand for large-scale data processing grows, optical computing has emerged as a promising direction to overcome the bottleneck of electronic computing [8-10].

In 2017, researchers introduced the concept of "computing through transmission," directly exploiting the optical field variations that occur during propagation to realize specific computational functions [11]. All-optical neural networks (AONNs) have since developed rapidly across fields related to optical computing, such as photonics [12-14], microelectronics [15,16], algorithms [17-22], and computer systems [23-26]. In 2018, a research team at UCLA [27] first proposed an all-optical deep learning framework, realizing a nearly zero-energy, zero-latency deep learning system using light waves. Subsequently, advances in multi-layer all-optical artificial neural networks [28], all-optical spiking neural networks [29], and matrix diffraction deep neural networks [30] have brought large-scale optical neural networks closer to practical application. These physically grounded frameworks have accomplished many complex tasks such as image analysis, feature detection, and target recognition, leveraging the inherent advantages of optical information transmission at the speed of light and massively parallel processing [31-36]. The all-optical neural network, which uses photons as the physical carriers of the fundamental computational units of artificial neural network algorithms, represents a novel computing architecture capable of high performance. Owing to the inherent capability of light to transmit information at high speed and in a highly parallel manner [37-44], we aim to leverage AONNs to address the speckle-modulation speed limitation imposed by electronic hardware (SLMs, DMDs, and similar devices) in CGI systems, and to fully exploit the advantages of AONNs in high-speed, large-scale parallel computation.

2. Method

When a laser beam is incident on a rough scattering object (e.g., rotating ground glass), the multitude of minute regions on the object's surface serve as random phase elements for the incident light. Upon reflection or transmission, these elements interfere with each other to generate a stochastic distribution of speckle patterns. The speckle patterns proposed for GI can essentially be categorized into two classes: random speckles, in which the spatial intensity distribution of the speckle field follows either a Gaussian or a Bernoulli distribution, and structured speckles, in which the spatial intensity distribution satisfies a specific structural pattern. By subjecting the object to a series of $N$ different speckle patterns, corresponding bucket detection values $y_i$ are obtained, and the object image is subsequently reconstructed through second-order correlation calculations. Notably, the speckle size in the speckle field is one of the crucial factors influencing the quality of correlation imaging. The corresponding bucket detector values $y_i$ $(i = 1, 2, \ldots, N)$ can be represented as:

$$y_i = \iint S_i(x, y)\, T(x, y)\, dx\, dy,$$
where $S_i(x, y)$ can be directly measured using a charge-coupled device (CCD), or alternatively, it can be obtained by calculating the diffraction integral equation:
$$S_i(x, y) = \left| R_i(x, y) * h_z(x, y) \right|^2,$$
where $R_i(x, y)$ $(i = 1, 2, \ldots, N)$ represent the speckle patterns, $*$ denotes convolution, and $h_z(x, y)$ represents the Fresnel diffraction impulse response at distance $z$. $T(x, y)$ is the feature function that characterizes the target. For a target with a resolution of $p \times q$, we use a speckle pattern of the same pixel size, and the speckle field can be represented as a matrix:
$$A = \begin{bmatrix} S_i(1,1) & S_i(1,2) & \cdots & S_i(1,q) \\ S_i(2,1) & S_i(2,2) & \cdots & S_i(2,q) \\ \vdots & \vdots & \ddots & \vdots \\ S_i(p,1) & S_i(p,2) & \cdots & S_i(p,q) \end{bmatrix}.$$
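To make the sampling model concrete, the following minimal NumPy sketch computes bucket values for a toy binary target and reconstructs it by second-order correlation; the array sizes, speckle statistics, and variable names are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, N = 28, 28, 500                     # target resolution and number of patterns
T = np.zeros((p, q))
T[10:18, 10:18] = 1.0                     # toy binary target T(x, y)

S = rng.random((N, p, q))                 # random speckle patterns S_i(x, y)
y = np.einsum('ipq,pq->i', S, T)          # bucket values y_i = sum_xy S_i * T

# Second-order correlation reconstruction: G(x, y) = <y_i S_i> - <y_i><S_i>
G = np.einsum('i,ipq->pq', y, S) / N - y.mean() * S.mean(axis=0)
```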

 

If $N$ measurements are taken, the corresponding sampling rate is defined as $N/(p \times q)$. In this work, we use an AONN instead of electronic devices such as the SLM and DMD to perform pure phase modulation on the input light. Each element of the matrix $A$ above corresponds to a point of the speckle and is equivalent to a neuron in the AONN. We use multiple transmission or reflection layers to physically create an all-optical deep diffraction neural network, in which each layer is approximated as a thin optical element and each point represents an artificial neuron connected to the neurons of the next layer by optical diffraction. As in a standard deep neural network, the transmission or reflection coefficient of each neuron can be regarded as a multiplicative bias term, i.e., the wavefront of the transmitted field is modulated by the phase value of that neuron. The physical mechanism of light propagation thus plays the role of the weights and biases during network training. The free-space propagation module is implemented using the angular spectrum method. According to the Rayleigh-Sommerfeld diffraction theory, each neuron of a layer can be considered a secondary source of a wave consisting of the following optical modes:

$$w_i^l(x, y, z) = \frac{z - z_i}{r^2} \left( \frac{1}{2\pi r} + \frac{1}{j\lambda} \right) \exp\!\left( \frac{j 2\pi r}{\lambda} \right),$$
where $l$ denotes the $l$th layer of the network, $i$ denotes the neuron located at the $i$th position in this layer, $\lambda$ is the wavelength of light, and $r = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2}$ is the distance from the neuron to the observation point. The output function of the $i$th neuron is then denoted as:
$$n_i^l(x, y, z) = w_i^l(x, y, z)\, t_i^l(x_i, y_i, z_i) \sum_k n_k^{l-1}(x_i, y_i, z_i) = w_i^l(x, y, z)\, |A|\, e^{j \Delta\theta},$$
where $\sum_k n_k^{l-1}(x_i, y_i, z_i)$ is the input wave to the neuron, $|A|$ is the relative amplitude of the secondary wave, and $\Delta\theta$ is the additional phase delay that the secondary wave acquires from the input wave and the neuron's transmission coefficient. The transmission coefficient $t_i^l(x_i, y_i, z_i)$ of the neuron consists of an amplitude term and a phase term:
$$t_i^l(x_i, y_i, z_i) = \alpha_i^l(x_i, y_i, z_i)\, e^{j \phi_i^l(x_i, y_i, z_i)}.$$
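Since the free-space propagation module is implemented with the angular spectrum method, a minimal NumPy propagator is sketched below; the function name, the square-grid assumption, and the suppression of evanescent components are our own illustrative choices rather than the authors' implementation.

```python
import numpy as np

def angular_spectrum_propagate(u0, wavelength, dx, z):
    """Propagate a complex field u0 (n x n samples, pitch dx) over distance z."""
    n = u0.shape[0]
    fx = np.fft.fftfreq(n, d=dx)                 # spatial frequency grid
    FX, FY = np.meshgrid(fx, fx, indexing='ij')
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = (2 * np.pi / wavelength) * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)          # drop evanescent components
    return np.fft.ifft2(np.fft.fft2(u0) * H)
```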

 

Transmitted secondary waves diffract between layers and interfere with each other, forming a composite wave on the surface of the next layer. The network then adjusts the phase values of the neurons in each layer through learning iterations to perform a specific function. We use the Adam optimizer for stochastic gradient descent and update the network through the error back-propagation algorithm. The loss function is defined so as to maximize the normalized signal of the detector region corresponding to the target object. At the end of the training phase, the model was reconstructed and materialized using 3D printing, achieving pure phase modulation of the light field by converting neuron phase values into relative heights.
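As a hedged illustration of how such a network can be trained in simulation, the sketch below optimizes per-layer phase masks with Adam so that the intensity sum in the detector region of the correct class is maximized. Here `propagate` stands in for an angular-spectrum routine like the one above, and every name and hyperparameter is an assumption, not the authors' code.

```python
import torch

n_layers, n = 5, 200                      # illustrative layer count and grid size
phases = [torch.zeros(n, n, requires_grad=True) for _ in range(n_layers)]
optimizer = torch.optim.Adam(phases, lr=1e-4)

def forward(field, propagate):
    # Alternate free-space diffraction with pure phase modulation at each layer.
    for phi in phases:
        field = propagate(field)
        field = field * torch.exp(1j * phi)
    return propagate(field).abs() ** 2    # intensity at the detector plane

def train_step(field, label, regions, propagate):
    intensity = forward(field, propagate)
    signals = torch.stack([intensity[r].sum() for r in regions])  # s_0 ... s_9
    loss = -torch.log_softmax(signals, dim=0)[label]  # favor the target region
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```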

After materializing the neural network, we constructed an optical system as shown in Fig. 1(a), which consists of, in the direction of light propagation, a light source, the all-optical diffraction deep neural network structure (i.e., the optical mask layers), the object to be recognized, a lens, and a CCD. In this system, the light source together with the all-optical diffraction deep neural network structure is roughly equivalent to the structured light source used to identify the target in GI. Each pixel point, i.e., each neuron in the neural network, performs pure phase modulation on the incident or reflected light through its phase value. Each layer of the network is equivalent to a random speckle pattern, and the computation is completed through the propagation of light. Ultimately, the light intensity information $s = [s_0, s_1, \ldots, s_9]$ is received by a detector without spatial resolution. According to the trained network model, the position number of the maximum detected intensity corresponds to the classification category of the target.
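The readout itself reduces to an argmax over the ten region sums; a minimal sketch follows (the region geometry is assumed, not the paper's):

```python
import numpy as np

def classify(intensity, regions):
    # s = [s0, s1, ..., s9]: total intensity per designated detector region.
    s = np.array([intensity[r].sum() for r in regions])
    return int(np.argmax(s))          # position of the maximum = class label
```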

Fig. 1. (a) Schematic diagram of the structure of an optical recognition system for ghost imaging based on an all-optical neural network. (b) Visualization of a trained 5-layer phase mask.

3. Simulation and results

3.1 Simulation

Due to the characteristics of light propagation, we set up a fully connected network with a layer size of 2 cm × 2 cm, where each neuron is 20 μm × 20 μm. This work is based on the MNIST dataset, and the network training is performed on an Intel Xeon Gold 6154 CPU @ 3.00 GHz. The MNIST dataset consists of 70,000 images, of which 60,000 form the training set and 10,000 the test set. The training batch size is set to 32, and the optimizer is Adam, which combines momentum with an adaptive learning rate and is one of the most commonly used optimization algorithms in deep learning. The initial learning rate is set to 1e-4. The GI optical system is illustrated in Fig. 1(a), and the trained all-optical diffraction deep neural network is visualized in Fig. 1(b).
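For reference, the stated simulation settings can be collected into a single configuration block; the dictionary structure and key names below are our own, and only the values are taken from the text.

```python
# Simulation settings from the text above (key names are illustrative).
sim_config = {
    "layer_size_m": 2e-2,       # 2 cm x 2 cm fully connected diffractive layer
    "neuron_pitch_m": 20e-6,    # 20 um x 20 um neurons -> 1000 x 1000 per layer
    "dataset": "MNIST",         # 60,000 training / 10,000 test images
    "batch_size": 32,
    "optimizer": "Adam",        # momentum + adaptive learning rate
    "learning_rate": 1e-4,
    "epochs": 30,
}
```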

The performance of the network is evaluated using the loss function and the accuracy, where the loss function is defined based on the difference between predicted and true values. During training, the gradients of the loss function with respect to the model parameters are calculated using the back-propagation algorithm, and the parameters are updated by the optimizer to minimize the loss. As illustrated in Fig. 2, the loss and accuracy curves are shown for a simulated experiment of 30 epochs under the aforementioned parameters. The curves show that the training fits well, with neither overfitting nor underfitting. The accuracy curve indicates that after a certain number of iterations the recognition accuracy stabilizes; at this stage, the train accuracy and test accuracy reach 90.20% and 91.02%, respectively. Since a stable training model is achieved after around 20 epochs, we set the number of epochs to 20 in the subsequent work.

Fig. 2. Curves showing the changes in the Loss and Accuracy values of the optical system when training for 30 epochs.

Additionally, we explored the optimal neural network architecture for this model using the controlled-variable method. Varying only the number of network layers, the analysis of Fig. 3 shows that recognition accuracy improves as the number of layers increases. With a multi-layer optical neural network, the recognition accuracy can reach up to 91.46% (for 10 layers), but this improvement comes at the cost of longer training times. Since the recognition accuracy improves only marginally beyond 5 layers, we balanced performance against computational efficiency and selected a 5-layer neural network as the mask for pure phase modulation in our subsequent work. This configuration yields a total of 5 million neurons (1000 × 1000 × 5) and 4 trillion connections ((1000 × 1000)² × 4).

Fig. 3. Curves showing the changes in the Accuracy of the optical system when the number of mask layers is set from 1 to 10.

3.2 Results and discussion

To investigate the generalization capability of our target recognition system, we kept all other parameters constant while varying only the dataset, conducting training and validation on both the MNIST and Fashion-MNIST datasets. For MNIST, the handwritten-digit labels directly correspond to the position sequence 0-9; for Fashion-MNIST, we mapped the original category labels to the position sequence 0-9 to maintain consistency with the MNIST format. Fig. 4 presents visualized images of various targets (including handwritten digits, t-shirts, trousers, sneakers, etc.) alongside their corresponding CCD output signals. The position of the maximum value in the CCD output signal indicates the predicted class of the target.

Fig. 4. Visualization of the targets to be recognized (handwritten digits, t-shirts, trousers, sneakers, etc.) and the corresponding output signals.

The yellow bars in the column chart represent the output signal peaks, which correspond to the predicted labels of the input images. This implies that, after a certain degree of network training, the model has acquired the capability of recognizing and classifying the targets. On the MNIST dataset, the final recognition accuracy of the system reaches 91.52%. Under the same conditions, if the Fashion-MNIST dataset is used, the recognition accuracy is 80.13%, indicating that the model has a certain classification capability on both datasets, although its performance on Fashion-MNIST is slightly weaker than on MNIST. The difference in performance can be attributed to the nature of the datasets. The MNIST dataset contains handwritten digit images, which tend to have more regular patterns and clearer features. In contrast, the Fashion-MNIST dataset consists of clothing images, which are inherently more complex with greater variations in shapes and textures. This increased complexity in the Fashion-MNIST data likely contributes to the lower recognition accuracy. These findings suggest that the model possesses a degree of generalization capability, as it can perform reasonably well on both datasets despite their differences. However, the performance gap indicates that further optimization and training may be required for the model to handle more complex or irregular images effectively. Future work could focus on enhancing the model’s ability to extract and process more sophisticated features, potentially through architectural modifications or advanced training techniques.

In contrast to traditional optical imaging methods, the spatial resolution in GI under structured illumination is often constrained by the resolution of the SLM or DMD rather than by the diffraction limit; in other words, GI using structured illumination is typically limited by the pixel resolution of the device. To verify this relationship, we conducted simulation experiments varying the pixel size of the modulation device. Currently, SLMs, DMDs, and similar devices achieve pixel sizes on the order of tens of micrometers, so we varied the simulated pixel size from a few micrometers to tens of micrometers. Fig. 5(a) shows that after a certain number of training epochs the recognition accuracy stabilizes around 90%. Finer pixel resolution implies more refined modulation, which in turn leads to higher recognition accuracy: higher resolution provides more detailed and precise information, making the target easier to recognize and classify. However, it is important to note that the improvement in recognition accuracy with increasing pixel resolution is not linear and shows diminishing returns. While finer pixel resolutions generally lead to better performance, the relationship is not absolute, and other factors may also influence the system's overall effectiveness.

Fig. 5. (a) Curves showing the changes in Train Accuracy and Test Accuracy of the optical system when the pixel resolution is at the micron level. (b) Curves showing the changes in Train Accuracy and Test Accuracy of the optical system when the pixel resolution and the number of pixels change while their product remains unchanged.

Additionally, we adjusted the relationship between pixel resolution and the number of pixels while keeping the effective area of the device constant. Analysis of Fig. 5(b) shows that the system's recognition accuracy remains relatively stable when the effective area is kept constant (i.e., when the product of the number of pixels and the pixel size is unchanged). This suggests that changes in the number and size of pixels compensate for each other, maintaining a consistent photon density over the constant effective area and thus a relatively stable recognition accuracy.

While higher pixel resolutions can provide more detailed information, our results indicate that the recognition accuracy is not solely dependent on pixel resolution. The effective area of the device appears to be a more critical factor, as it determines the overall photon density received by the detector. In conclusion, our study reveals a complex interplay between pixel resolution, number of pixels, and effective area in determining the performance of GI systems. These insights could guide future developments in GI technology, potentially leading to more efficient and effective systems for various applications.

The detector used in GI to collect the intensity of light transmitted or reflected from the target typically lacks spatial resolution. Here, our signal reception uses a CCD with many photosensitive pixels; these pixels generate charges proportional to the intensity of the incident light and perform two-dimensional measurements on the output electrical signal to complete the photoelectric conversion. We do not rely on the spatial distribution characteristics of the CCD, but instead measure the total intensity over specified areas to achieve target recognition. First, while ensuring a uniform distribution among regions, we investigated how the degree of dispersion of the designated regions influences the system's final recognition accuracy: we selected ten regions on the CCD detection plane and incrementally increased the inter-region distances while keeping the effective area of each region constant.

The changes in the optical system’s training accuracy and test accuracy, as the spatial distribution of the light intensity signal positions in the specified areas detected by the CCD varies, are clearly shown in Table 1. The bold data in the table are the highest recognition accuracy rates under different dispersion degrees, all fluctuating around 89%. We observed negligible variations in recognition accuracy, suggesting that the spatial arrangement of the detection regions has a relatively weak effect on the system’s recognition accuracy. These results suggest that the network can compensate for positional variations through phase modulation, leading to minimal impact on the system’s overall recognition accuracy. This finding implies that the system demonstrates robustness to changes in the spatial arrangement of detection regions.
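To make the notion of dispersion concrete, the sketch below lays out ten equal-size detection regions in two rows of five with an adjustable inter-region gap; `make_regions`, the grid geometry, and all dimensions are invented for illustration and are not the paper's layout.

```python
def make_regions(region_px=150, gap_px=50):
    # Ten equal regions in two rows of five; gap_px controls the dispersion.
    regions, step = [], region_px + gap_px
    for k in range(10):
        row, col = divmod(k, 5)
        r0, c0 = row * step, col * step
        regions.append((slice(r0, r0 + region_px), slice(c0, c0 + region_px)))
    return regions
```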

Further, we kept the effective detection size within the CCD constant but varied the number of photosensitive pixels in a single detection region, i.e., we enhanced the resolution of individual detection regions within the CCD. The number of photosensitive pixels was adjusted from 200 × 200 to 400 × 400, 500 × 500, and 800 × 800 successively, with 20 training epochs in each case. As depicted in Fig. 6, when the number of photosensitive pixels increases from 40,000 to 250,000, the recognition accuracy of the system improves significantly, but the enhancement is limited and not strictly proportional to the increase in pixel count. When the number of photosensitive pixels increases to 640,000, the recognition accuracy decreases, which may be due to the reduced sensitivity caused by packing more photosensitive pixels into a unit area of the CCD. Light emitted from the source scatters in space with varying directions and angles, and experiences different degrees of attenuation, deformation, and external interference before reaching the CCD, so the number of photosensitive pixels in the CCD affects the sensitivity to the received light signals. Within a certain range, more photosensitive pixels tend to improve imaging resolution, enhancing signal intensity and signal-to-noise ratio; beyond that range, further increases may compromise the CCD's sensitivity to the processed signals, potentially degrading the system's recognition accuracy. The influence of photosensitive pixel density on recognition accuracy therefore requires comprehensive consideration of other factors and optimization for specific application requirements.
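One way to emulate different photosensitive pixel counts over a fixed effective area in simulation is block re-binning of the detector-plane intensity; a minimal sketch follows, assuming the output grid divides the input grid (function name and interface are our own).

```python
import numpy as np

def rebin(intensity, n_out):
    # Sum f x f blocks so the same area is covered by n_out x n_out pixels.
    n_in = intensity.shape[0]
    f = n_in // n_out
    assert n_out * f == n_in, "output grid must divide the input grid"
    return intensity.reshape(n_out, f, n_out, f).sum(axis=(1, 3))
```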

Fig. 6. Curves showing the changes in Train Accuracy and Test Accuracy of the optical system when the number of photosensitive pixels within a single designated CCD region varies from 40,000 to 640,000.

4. Physical experiment and results

To evaluate the feasibility of our designed AONN for target recognition, we conducted a physical demonstration experiment based on our simulation. The experimental setup is depicted in Fig. 7(a): a vector network analyzer (VNA) from Keysight Technologies and a 170-260 GHz VNA frequency extension module from Astroniks Technologies were used to transmit and receive 200 GHz millimeter waves. Due to the accuracy limitations of 3D printing, we adjusted the operating frequency to 200 GHz and used the MNIST dataset for training. As shown in Fig. 7(a), we used a 3D printer to print digital boards as the targets, then materialized the trained neural network through modeling and printing to build the experimental system.
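The conversion from trained phase values to printed heights can follow the standard thin-element relation $h = \phi \lambda / (2\pi (n - 1))$; the sketch below applies it at 200 GHz with an assumed refractive index for the printing material (the paper does not state one).

```python
import numpy as np

wavelength = 3e8 / 200e9              # 200 GHz -> wavelength of about 1.5 mm
n_material = 1.7                      # assumed refractive index of the resin

def phase_to_height(phi):
    # Wrap phase into [0, 2*pi) and convert to a relative surface height.
    return np.mod(phi, 2 * np.pi) * wavelength / (2 * np.pi * (n_material - 1))
```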

Fig. 7. (a) Diagram of the experimental optical path, the modeling and printing schematic of the object to be identified, and the 3D-printed assembly of the all-optical neural network. (b) Normalized light intensity distributions at different positions obtained after inputting the 3D-printed targets into the experimental optical path.

Figure 7(b) illustrates the processed experimental data. The digital image at the top of the figure simulates the identified object placed during the experiment. The horizontal axis of the bar graph below corresponds to the position number of the signal receiver, and the vertical axis corresponds to the received signal value, i.e., the bucket signal of the designated area (the total light intensity) recorded when different target objects are placed. The figure clearly shows that the position number of the maximum signal corresponds to the classification category of the target object. The sampling rate of the experimental system was 1.25%, confirming that the trained all-optical network maintains robust recognition capability at relatively low sampling rates after 3D printing. This result verifies that the target recognition task is achieved directly by detecting the maximum of the total light intensity signal in a real scene.

In real-world applications, noise is a significant factor that must be taken into account, so we experimentally verified the target recognition ability of the system under different noise backgrounds, taking the digits 1-5 as the target objects, as shown in Fig. 8. We observe that at signal-to-noise ratios (SNR) of 7-15 dB the optical system can still accurately recognize the target object; the target is identified directly by the position number of the maximum value, verifying that the system has good noise resistance.
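For the noise tests, additive white Gaussian noise at a prescribed SNR can be modeled as below; this is our own sketch of the standard SNR definition, not the authors' measurement procedure.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=np.random.default_rng(0)):
    # Scale noise power so that 10 * log10(P_signal / P_noise) = snr_db.
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(p_noise), signal.shape)
```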

Fig. 8. Light intensity distributions under different noise backgrounds when the targets to be identified are the digits 1-5.

However, compared with the simulation data, the positional contrast of the signal values is slightly attenuated in the experiment: the difference between the maximum signal and the residual signals is reduced. Although this small deviation is within an acceptable error range, it is worth analyzing, and we attribute it mainly to three factors. First, the inherent deviation of the experimental setup cannot be ignored; despite our efforts at precision, slight misalignments of the optical center are inevitable. Second, the intrinsic properties of the 3D printing materials may affect experimental accuracy; these materials are usually somewhat elastic, so the thin printed layers may bend slightly during operation. Finally, the limitations of 3D printing technology itself are a potential source of error, as current printers still struggle to replicate fine structures and sharp edges. Together, these factors may account for the observed deviations. Nevertheless, the deviations remain within an acceptable range and do not affect the overall validity of the experiment.

5. Conclusion

In summary, we have explored the use of an all-optical diffraction deep neural network to replace the SLM and DMD for random speckle modulation in GI, and experimentally achieved real-time target recognition at low sampling rates. This approach fully leverages the advantages of optical computing, especially in the face of the growing challenges brought by increasing data volumes. Additionally, the power consumption of the system is relatively low because, once trained and deployed, it requires no further energy input for computation; this energy efficiency makes it beneficial for long-term operation or mobile applications in harsh environments. Furthermore, different phase modulations can be achieved by adjusting the connection weights of the network, enabling the system to adapt to various real-world environments and task requirements. We hope that the low energy consumption and flexibility of optical computing will bring new opportunities to imaging and recognition, laying the foundation for more efficient, energy-saving, and flexible optical imaging systems.