
Advanced Photonics Nexus, Vol. 4, Issue 2, 026009 (2025)
1 Introduction
Convolutional neural networks (CNNs) represent a significant milestone in image classification, recognition, and tracking.1 CNNs such as AlexNet are composed of several convolutional layers that adaptively learn spatial representations from input images. Although powerful, the convolution operation is computationally expensive, leading to high latency and power consumption. In fact, it has been estimated that
For decades, it has been known that a
Advantageously, the convolution operation can also be performed using free-space optics and requires only a single element. The resultant image produced by any optics is the input convolved with the point spread function (PSF) of the optics.13
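The statement above, that the image produced by any (linear, shift-invariant) optics is the input convolved with the point spread function, can be sketched numerically. The following is a minimal FFT-based model, not the authors' simulation code; the image size and PSF are arbitrary illustrative choices.

```python
import numpy as np

def convolve_with_psf(image, psf):
    """Incoherent image formation: output = input (*) PSF.
    FFT-based circular convolution; the PSF is zero-padded to the image size."""
    pad = np.zeros_like(image)
    h, w = psf.shape
    pad[:h, :w] = psf
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(pad)))

# A delta-function PSF models an ideal imaging system: it reproduces the input.
img = np.random.rand(28, 28)
delta = np.zeros((3, 3))
delta[0, 0] = 1.0
out = convolve_with_psf(img, delta)
```

Any engineered PSF substituted for `delta` then yields the corresponding convolution in a single optical pass.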
However, a challenge to all optically implemented CNN approaches is that nonlinear layers are interspersed with the linear layers. For example, AlexNet consists of five convolutional layers followed by three fully connected layers.1 Specifically, each convolutional layer in the architecture utilizes the rectified linear unit (ReLU) as its nonlinear activation function, followed by the local response normalization and max pooling layer, ensuring an effective mechanism for spatial hierarchy extraction. Therefore, nonlinearity is consistently applied across all layers, serving as a foundational element of the network design to enhance its learning capability. Without the nonlinear activations (e.g., ReLU), the classification accuracy of the CNN drops by
In this work, we experimentally demonstrate a hybrid optical–electronic CNN consisting of a single optical convolution layer and a single electronic fully connected layer that achieves accuracy similar to that of AlexNet on handwritten digit classification tasks. To overcome the absence of optical nonlinearity, we apply knowledge distillation (KD) to remove the nonlinear layers and compress multiple layers into a single linear layer.17 KD (more details in Sec. 4) circumvents the need for nonlinearity without a significant reduction in performance by transferring knowledge from a larger, pretrained network (the “teacher” network) to a more compact network (the “student” network). Here, we use a modified AlexNet, denoted AlexNet-Mod, as the teacher network and a single convolutional layer coupled with a single fully connected layer as the student network. We use this architecture to demonstrate a hybrid meta-optical platform, wherein an optical front end based on a single meta-optic performs the linear convolution operation, followed by an electronic backend that contains a linear calibration layer and a fully connected layer. In this way, the most computationally expensive operation is performed optically to leverage the benefits of optical computing, namely, high spatial bandwidth and low power consumption. The use of a single meta-optic layer drastically simplifies the experimental setup and provides a compact geometry.
The optical front end of this network is realized using inverse-designed meta-optics. The meta-optics are arrays of subwavelength scatterers that act as phase masks, imparting spatially coded phase shifts to incident light. Here, we design meta-optics to perform the desired convolutional steps of the CNN by engineering the PSF. We fabricate and experimentally validate the performance of the designed optics using incoherent green-light illumination from a light-emitting diode (LED), centered at 525 nm. Further, we experimentally demonstrate the classification accuracy of the entire hybrid CNN on the MNIST data set. The hybrid CNN is described in Fig. 1, where we compare the architectures of multilayer electronic CNNs, compressed electronic CNNs (a linear single-layer CNN), and our hybrid system. The number of multiply-accumulate (MAC) operations in the entire network is reduced by over 2 orders of magnitude by compressing multiple convolutional layers into a single layer and implementing them optically. The classification accuracy of the compressed hybrid CNN is reduced by only 5% from AlexNet-Mod (98% accuracy) to achieve 93% classification accuracy on the MNIST data set.
Figure 1.Schematic of CNNs for image classification tasks. (a) All-electronic multi-layered CNN. (b) All-electronic compressed CNN. (c) Hybrid CNN that combines an optical meta-optic front end and electronic backend. (d) Number of MAC operations of each network configuration, with convolutional MACs in green and fully connected MACs in brown.
2 Results
In each convolutional layer of a CNN, an optimized kernel is convolved with the input to generate a feature map, which is then passed to the next layer. Using the KD approach, we optimize eight convolutional kernels (each
2.1 Compressing Multiple Convolutional Layers Using KD
To select the ideal network architecture for the linear optical–electronic hybrid system, underfitting and overfitting issues must be avoided. Smaller networks face underfitting concerns, particularly in optical settings that limit the system to just eight kernels and remove nonlinear functions.29 This is in stark contrast to AlexNet, which uses over 300 kernels.1 However, complex models are prone to overfitting, and practical issues such as fabrication noise and misalignments introduce further challenges for optically implementing a large number of kernels.
Therefore, to obtain a balanced network that could be implemented optically, we use KD to compress an AlexNet-Mod as a base model. AlexNet-Mod consists of five convolutional layers and three fully connected layers (eight total layers) that we compress to the desired structure of one convolutional layer and one fully connected layer (two layers). The KD approach assumes that the teacher network is already trained and performs the desired task with high accuracy; in this case, we use AlexNet-Mod as the teacher network, which achieves
2.2 Optical Convolution Using Meta-Optics
The optical component of this network is implemented using inverse-designed meta-optics. We used the Gerchberg–Saxton (GS) phase-retrieval method30 to design phase masks that correspond to the optimized convolutional kernels. Specifically, each suboptic is designed to have a PSF that resembles the convolutional kernel. As electronic convolutional kernels include both positive and negative values, we separated each kernel into positive and negative parts and designed meta-optics for each. The positive and negative images were computationally subtracted afterward to produce the net convolution. Two-dimensional phase maps of an example set of suboptics are shown in Fig. 2(c). The phase maps were implemented by silicon nitride pillars that are 750 nm tall but with varied widths corresponding to their relative phase delay. Scanning electron microscope (SEM) images of the fabricated optics are also shown for comparison, highlighting the fabrication quality. All 16 optics (corresponding to 8 convolutional kernels) were fabricated on a single substrate (more details in Sec. 4), allowing convolution from 16 different kernels in a single image capture. A photograph of the fabricated meta-optic is shown in Fig. 2(b). Each kernel meta-optic is
Figure 2.Schematic of the optical system. (a) PSF measurement setup using a monochromatic point light source (left) and optical convolution measurements using a micro-LED display (right). (b) Photograph of the fabricated meta-optics. The meta-optic contains 16 different suboptics, spatially distributed in a single layer, operating in parallel for classification tasks. (c) Phase maps and SEM images of exemplary suboptics corresponding to the positive and negative parts of a particular convolutional kernel. (d) Positive and negative parts of an example convolutional kernel (left), the corresponding PSF simulation (middle), and experiment (right). (e) Simulated electronic output (left) and optical experiment (right) convolved output for the example kernel, for the case of an input “7” from MNIST.
To verify the performance of the optics, we measured the PSF of each suboptic using a single-mode fiber as a point light source, as shown in Fig. 2(a). The PSFs from all 16 suboptics are simultaneously captured by the two-dimensional CMOS camera (more details in Sec. 4). Accordingly, the convolved images from all 16 suboptics are also captured at the same time when we replaced the single-mode fiber with the display, as described in Fig. 2(a). Exemplary electronic convolutional kernels, which represent the ground truth PSFs for the optically implemented kernels, are shown in Fig. 2(d). We also present the simulated PSFs from the meta-optics using angular spectrum propagation.31,32 The experimentally measured PSFs shown in Fig. 2(d) match the simulated PSFs well, confirming fabrication accuracy. However, due to the constraints on physically realizable PSFs, there are notable differences between the ground-truth PSFs and experimentally measured PSFs. To correct for these differences, as well as slight noise and misalignments in the optical system, we introduce a calibration layer to the computational backend, further discussed in Sec. 4.2.
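The positive/negative kernel splitting described above can be emulated numerically: an intensity-only optical system can realize each non-negative part, and electronic subtraction recovers the full signed convolution. This is a sketch of the scheme, not the experimental pipeline; the kernel values are illustrative.

```python
import numpy as np

def signed_convolution(image, kernel):
    """Emulate the two-suboptic scheme: split a signed kernel into
    non-negative positive and negative parts, convolve each separately
    (as an intensity-only optical system could), then subtract."""
    k_pos = np.maximum(kernel, 0.0)
    k_neg = np.maximum(-kernel, 0.0)
    conv = lambda img, k: np.real(np.fft.ifft2(
        np.fft.fft2(img) * np.fft.fft2(k, s=img.shape)))
    return conv(image, k_pos) - conv(image, k_neg)

img = np.random.rand(28, 28)
kernel = np.array([[1.0, -1.0], [0.5, -0.5]])  # illustrative signed kernel
out = signed_convolution(img, kernel)

# Direct convolution with the full signed kernel, for comparison.
direct = np.real(np.fft.ifft2(
    np.fft.fft2(img) * np.fft.fft2(kernel, s=img.shape)))
```

Because convolution is linear, the subtracted result is identical to convolving with the signed kernel directly.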
2.3 Hybrid Network Classification Results
To address optical noise and misalignment affecting image classification performance, an additional calibration layer is introduced to adjust optical representations for compatibility with the computational backend. Specifically, this calibration layer is a single fully connected neural network layer with an input dimension of 288 and an output dimension of 288. We fine-tune it with only 10% of the training data set and ensure that the computational backend does not need to be retrained. Therefore, the electronic backend of the hybrid network consists of the calibration layer followed by the original compressed electronic backend.
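The role of such a calibration layer can be illustrated with a toy fit: a single linear map trained to send distorted "measured" features back to the "ideal" features the backend expects. The synthetic data, the bias-free layer, and the least-squares fit below are all simplifying assumptions, not the authors' training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: "ideal" 288-dim electronic feature vectors and
# corresponding measured optical features, distorted by a random linear
# perturbation plus noise to mimic misalignment and sensor noise.
n_calib, dim = 600, 288  # a small calibration set, feature size 288
ideal = rng.normal(size=(n_calib, dim))
distortion = np.eye(dim) + 0.05 * rng.normal(size=(dim, dim))
measured = ideal @ distortion + 0.01 * rng.normal(size=(n_calib, dim))

# Fit a single linear (fully connected, bias-free for brevity) calibration
# map W so that measured @ W ~= ideal, leaving the backend untouched.
W, *_ = np.linalg.lstsq(measured, ideal, rcond=None)
calibrated = measured @ W
```

After calibration, the frozen backend can consume the optical features without retraining, which is the point of the design.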
We compare three CNN architectures (AlexNet-Mod, the compressed electronic network, and the hybrid optical–electronic network) in Table 1. The MAC for the convolutional layer depends on the image size
| Network architecture | Train (%) | Test (%) | MAC operations |
| --- | --- | --- | --- |
| AlexNet-Mod | 98.9 ± 0.33 | 98.4 ± 0.32 | 17,323,520 |
| Compressed electronic CNN (without KD) | 84.2 ± 0.47 | 82.1 ± 0.69 | 228,672 |
| Compressed electronic CNN (KD) | 97.2 ± 0.35 | 96.2 ± 0.29 | 228,672 |
| Hybrid CNN (KD) | 93.9 ± 0.25 | 93.4 ± 0.22 | 85,824 |
Table 1. Classification results.
Figure 3 illustrates the confusion matrices for three neural network configurations. Each matrix visually represents the model’s tested performance across different classes (in this case, digits labeled 0 through 9), with the true labels on the rows and the predicted labels on the columns. The multilayer electronic CNN, AlexNet-Mod, displays high values along the diagonal, exceeding 98.1% accuracy in each class. The compressed electronic CNN, while having a slight decline in diagonal values, still demonstrates robust classification accuracy with a minimum of 94.3%. The hybrid network exhibits a more diverse range of values along the diagonal, with some classes showing lower predictive accuracy compared with the compressed electronic network. We attribute this slight decline to noise in the optical experiment, which may be due to optics fabrication, camera sensor noise, and optical noise from vibrations that cannot be fully compensated for by the calibration layer. Despite these factors, the hybrid network still performs reasonably well, correctly predicting each class with a minimum of 87.6% accuracy. This indicates that the hybrid network maintains a reasonable level of accuracy in the presence of these noise sources and discrepancies.
Figure 3.Confusion matrices for different network architectures. (a) Classification results for AlexNet-Mod (multiple-layer electronic CNN). (b) Classification results for the all-electronic CNN compressed without using KD. (c) Classification results for the all-electronic CNN compressed with KD. (d) Classification results for the hybrid optical–electronic CNN.
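The confusion matrices of Fig. 3 follow the standard construction: rows index the true class, columns the predicted class, and the diagonal of the row-normalized matrix gives the per-class accuracy. A minimal sketch with toy labels (three classes for brevity):

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, n_classes=10):
    """Rows: true class; columns: predicted class (as in Fig. 3)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal of the row-normalized confusion matrix."""
    return np.diag(cm) / cm.sum(axis=1)

# Toy example: 10 samples across 3 classes.
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 1, 2, 2, 2, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
```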
3 Discussion
3.1 Ablation Study and Principal Component Analysis
To understand the contribution of each component of the hybrid optoelectronic CNN, we perform an ablation study and principal component analysis (PCA). In the ablation study, we evaluate the classification accuracy when using only the electronic backend structure for classification. We summarize the ablation study results in Table 2, where the numbers in parentheses represent the accuracy gain compared with “backend only” and “calibration + backend,” respectively. For a fair comparison, the backend layer here has the same structure as used in the hybrid network but is re-optimized for the best performance in the absence of any convolutional front end. Specifically, we compare the performance of a backend layer only (a single fully connected layer), the calibration and backend layers together (two fully connected layers), and the entire hybrid network. For a single fully connected layer alone, an accuracy of 89% was attained, which is less than that of the hybrid network; this highlights the utility of the optical front end. Further, for two fully connected layers (representing the calibration and backend layers) without any convolutional front end, the accuracy is reduced to 84%. This can be attributed to the fact that the two layers are fully connected and, in the absence of nonlinear activation functions, tend to converge toward a “saddle point” in the optimization landscape.33
| Configuration | Train accuracy | Test accuracy |
| --- | --- | --- |
| Backend only | 89% | 87% |
| Calibration + backend | 84% | 80% |
| Optics + calibration + backend | 94% (+5%/+10%) | 93% (+6%/+13%) |
Table 2. Network ablation study.
We hypothesize that the calibration layer only remaps the optical representations and does not improve the performance of the electronic backend alone. To examine this, we further analyze the effect of the calibration layer and the overall performance of the hybrid network as compared with an all-electronic network using PCA. PCA is a statistical method widely used in various fields, especially in analyzing the quality of neural networks,34 to project original, high-dimensional data into a new, simpler coordinate system for explicit interpretation. Specifically, PCA computes the eigenvalue decomposition of the covariance matrix or the singular value decomposition of input data to determine the principal direction (known as “principal components”). Principal component 1 denotes the axis of maximum variance, encapsulating the most substantial relationships among variables, whereas principal component 2, orthogonal to principal component 1, captures the second most significant variance direction. Notably, the first two principal components typically contain the most crucial information. In this study, we compress the output dimensions from electronic and optical convolutional layers into two principal components, respectively, to compare the classification efficacy of each approach; this is observed by comparing the clusters observed in PCA visualizations.35
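The two-component PCA projection described above can be sketched with an SVD of the mean-centered data. The synthetic 288-dimensional features below stand in for the real convolutional-layer outputs; two artificial "classes" separated along one direction illustrate the clustering behavior discussed next.

```python
import numpy as np

def pca_2d(features):
    """Project feature vectors onto their first two principal components,
    computed via SVD of the mean-centered data matrix."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape (n_samples, 2)

rng = np.random.default_rng(1)
# Two synthetic "classes" of 288-dim features, separated along one axis.
a = rng.normal(size=(50, 288)); a[:, 0] += 8.0
b = rng.normal(size=(50, 288)); b[:, 0] -= 8.0
proj = pca_2d(np.vstack([a, b]))
```

In this construction, principal component 1 captures the between-class separation, so the two classes form distinct clusters in the 2D projection, analogous to the digit clusters in Fig. 4.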
In Fig. 4, we use PCA to show that the raw experimental data do identify the fundamental components necessary for classification, but that the calibration function is necessary to shift optical representations to a form that is compatible with the predesigned electronic backend. As shown in Figs. 4(a) and 4(c), we observe that both the all-electronic CNN outputs and the uncalibrated hybrid network experimental results can effectively distinguish among different classes due to the clustering behavior of specific classes, e.g., light blue (number 6) and navy blue (number 7). This clustering behavior indicates that despite any observable shifts in the PCA plot, the fundamental capacity of the hybrid network to classify data remains comparable to that of all-electronic networks. However, due to the differences between the optical experimental output and the expected input to the electronic backend, directly using the original backend network results in a notable drop in accuracy, down to 16.3%. The calibration layer is designed to recalibrate the outputs from the optical convolution layer back to the original outputs, thereby enabling the use of the original backend without retraining. As shown in Figs. 4(b) and 4(c), the calibrated experimental result exhibits very similar clustering behavior to the all-electronic network, further demonstrating that the hybrid network classification is comparable to that of the all-electronic network.
Figure 4.PCA of the hybrid CNN. (a) PCA of the uncalibrated experimental hybrid CNN classification data. (b) PCA of the calibrated experimental data, which has been remapped and exhibits clustering behavior similar to that of the compressed electronic CNN data. (c) PCA of the compressed electronic CNN data.
3.2 PSF-Engineered Meta-Optics
We emphasize two advantages of our PSF engineering method for performing optical convolution. First, we highlight the simplicity of the optical system: this method requires only a display, a single layer of optics, and a camera, making it compact and simple to implement. Incoherent illumination is used, so this approach can be applied to real-world image classification scenarios. Second, we highlight the ease of integration with the optimized electronic system. One of the challenges faced by optical computing is that electronic computing is already extremely powerful, having had decades of research and development into algorithms and hardware.3 In our approach, the optics are designed to implement the electronically optimized kernels. These kernels may be modified to reflect improvements in electronic CNN models and architectures, and the optics can accordingly be adapted to implement convolutions with arbitrary kernel matrices.
Furthermore, the GS algorithm30 used to design the optics is a well-established technique. The GS algorithm is an iterative phase-retrieval algorithm to determine the phase (in the optic plane) that produces an intensity pattern in another desired plane (the focal plane). In other words, the meta-optics are phase-only holograms producing the desired PSFs as their images. Due to the fact that only amplitude, and not phase, contributes to the intensity pattern, the iteratively designed phase masks are not unique. More details on the implementation of the GS algorithm are available in the Supplementary Material. In the Supplementary Material, we also discuss an alternative design method based on automatic differentiation, which also produces viable optics, but we found the GS-designed optics to produce slightly brighter, clearer images.
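A minimal GS iteration can be sketched as follows, assuming simple FFT (Fraunhofer) far-field propagation rather than the full angular-spectrum model used for the actual meta-optics; the grid size and target PSF are illustrative.

```python
import numpy as np

def gerchberg_saxton(target_psf_amp, n_iter=200, seed=0):
    """Iterative GS phase retrieval: find a phase-only mask whose
    far-field intensity matches the target amplitude pattern.
    FFT is used as the (Fraunhofer) propagator between planes."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0, 2 * np.pi, target_psf_amp.shape)
    for _ in range(n_iter):
        far = np.fft.fft2(np.exp(1j * phase))              # to focal plane
        far = target_psf_amp * np.exp(1j * np.angle(far))  # impose target amplitude
        near = np.fft.ifft2(far)                           # back to optic plane
        phase = np.angle(near)                             # keep phase only
    return phase

# Illustrative target: a single off-axis spot as the desired |PSF|.
target = np.zeros((64, 64))
target[5, 9] = 1.0
mask = gerchberg_saxton(target)
psf = np.abs(np.fft.fft2(np.exp(1j * mask))) ** 2
```

Because only the far-field amplitude is constrained, the recovered phase mask is not unique, consistent with the discussion above.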
There are, however, two major limitations of the PSF engineering approach. One limitation is that there is no guarantee that the desired PSF is physically realizable. That is, a single phase mask that satisfies the desired amplitude constraints may not exist. However, by introducing an electronic calibration layer, the resultant PSF does not need to be perfect to effectively classify the data. Alternatively, to ensure physically realizable PSFs, one could adopt an end-to-end optimization scheme wherein the phase mask is simultaneously optimized with a backend. However, training this large phase mask (on the order of
A second limitation of the described approach is that we assume the PSF is spatially invariant. In reality, light from different spatial locations on the imaging object intersects the meta-optic at various angles of incidence, resulting in a different PSF for the off-axis rays. In contrast, we design the optics assuming normally incident illumination. To mitigate this discrepancy, we ensure the incoming angles of incidence are relatively small by placing the display far away from the optics (90 mm) relative to the focal length (2.4 mm). Therefore, for a displayed image size of
3.3 Outlook—Computational Effectiveness of the Hybrid CNN
The number of MAC operations required of a network serves as a metric of computational complexity that is independent of the employed hardware technology. In a modern digital system, one MAC operation consumes
The benefit of optically implementing the convolutional step becomes more significant as the number of input pixels is increased. The electronic computational complexity of each convolutional layer is determined by the height and width of the input images
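The MAC counts in Table 1 can be reproduced from the standard per-layer formulas. The layer sizes used below (eight 8 × 8 kernels on 28 × 28 MNIST inputs with 21 × 21 valid outputs, 288-dimensional flattened features, a 288 × 288 calibration layer, and a 288 × 10 output layer) are inferred assumptions chosen to be consistent with the table, not values quoted directly from it.

```python
def conv_macs(h_out, w_out, k, c_in, c_out):
    """MACs for one convolutional layer: each output pixel of each
    output channel requires k*k*c_in multiply-accumulate operations."""
    return h_out * w_out * k * k * c_in * c_out

def fc_macs(n_in, n_out):
    """MACs for a fully connected layer: one per weight."""
    return n_in * n_out

# Compressed electronic CNN: one conv layer plus one output layer.
compressed = conv_macs(21, 21, 8, 1, 8) + fc_macs(288, 10)

# Hybrid CNN: the convolution is performed optically (zero electronic MACs),
# leaving only the calibration layer and the output layer.
hybrid = fc_macs(288, 288) + fc_macs(288, 10)
```

Under these assumed sizes, `compressed` evaluates to 228,672 and `hybrid` to 85,824, matching Table 1; the optical front end eliminates the dominant convolutional cost, and the saving grows with input resolution.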
In summary, we demonstrate single-layer optical convolution with an electronic backend that achieves accuracy similar to that of AlexNet-Mod on MNIST handwritten digit classification, with a 99.5% reduction in computational complexity. To circumvent the nonlinearity of AlexNet-Mod, we use KD to compress the CNN into linear layers, which are then implemented in a hybrid format. As a further innovation, we implement the convolution optically via engineering the PSF of meta-optics, which results in a more compact and resilient optical front end than the commonly used
4 Materials and Methods
4.1 Transfer Knowledge to Linear Networks
The KD algorithm is designed to compress neural networks. KD accomplishes this by transferring knowledge from a larger, pretrained network (the “teacher model”) to a more compact network (the “student model”). In our implementation, we use AlexNet-Mod as the teacher network and a linear electronic network as the student network. The student network comprises only a single convolutional layer coupled with a single fully connected layer. Training this linear network directly tends to converge to suboptimal saddle points; with the KD approach, however, the student network converges faster and achieves better performance than it would without KD training.
The KD algorithm optimizes the linear student network by combining two types of losses: temperature loss and student loss. Similar to other conventional losses in image classification, KD uses the training labels as “hard labels” and computes the probability distribution vector
Then, KD treats the prediction of the teacher model as “soft labels” that inform the student model because training with hard labels only is a sensitive process, especially for compact linear networks. KD includes a softening parameter,
Therefore, the total loss is then calculated as a weighted summation of the two losses,
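The two-part loss described above can be sketched in NumPy as follows. This is a generic KD sketch; the temperature `T`, weight `alpha`, and the conventional T² scaling of the soft loss are standard choices, not values quoted from this work.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; larger T yields a softer distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=4.0, alpha=0.5):
    """Weighted sum of the soft-label (temperature) loss and the
    hard-label (student) loss, as in standard KD."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    # Temperature loss: cross-entropy against the teacher's softened
    # targets, scaled by T^2 to keep gradient magnitudes comparable.
    soft_loss = -(T ** 2) * np.sum(p_teacher * np.log(p_student_T + 1e-12))
    # Student loss: ordinary cross-entropy with the one-hot hard label.
    p_student = softmax(student_logits)
    hard_loss = -np.log(p_student[hard_label] + 1e-12)
    return alpha * soft_loss + (1 - alpha) * hard_loss

student = np.array([2.0, 0.5, 0.1])   # illustrative student logits
teacher = np.array([3.0, 0.2, -1.0])  # illustrative teacher logits
loss = distillation_loss(student, teacher, hard_label=0)
```

A student that reproduces the teacher's logits incurs a lower loss than a mismatched one, which is the gradient signal that drives the compression.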
4.2 Calibrating Optical Experimental Results with Limited Data
The calibration function is designed to minimize the discrepancy between the ideal model and the previously trained backend. This addition addresses a variety of differences that may occur between the optical and electronic counterparts, including scaling, translation, rotation, and optical noise. With the addition of the calibration layer, the weights of the fully connected layer are preserved, and the optical front end can be integrated into the existing network framework without retraining the backend or fine-tuning the optical alignment.
Specifically, the backend includes a calibration layer and the original backend layer used in the “compressed CNN,” both of which are fully connected layers. The input of the original backend layer is 288, and the output is 10, where 288 is the flattened image size after the optical convolutional layer. The input and output of the calibration layer are both 288 to align with the optical front end and the original backend. We summarize the network structure details in Supplementary Material S1. The loss function used in this process is defined as
To obtain accuracy, the model generates scores for each input image. These scores are the output of the last layer (size 10 by 1). We use the following equation to convert the scores to accuracy:
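The score-to-accuracy conversion is the standard argmax rule: a sample is counted as correct when its highest-scoring class matches the label, and accuracy is the fraction of correct samples. A minimal sketch with illustrative scores:

```python
import numpy as np

def accuracy(scores, labels):
    """accuracy = (1/N) * sum_i [ argmax_j scores[i, j] == labels[i] ]"""
    preds = np.argmax(scores, axis=1)
    return np.mean(preds == labels)

# Illustrative scores for 4 samples over 3 classes (real outputs are 10-dim).
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 0])
acc = accuracy(scores, labels)  # 3 of 4 predictions are correct
```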
4.3 Meta-Optics Design
The meta-optics are designed with our experimental setup in mind. Due to the sensitivity of our camera (GT-1930C) and available light sources, we design the optics specifically for 525 nm illumination. For all electromagnetic simulations, we use a simulation grid size of 586 nm to be both comparable to the wavelength of the light and evenly divisible by the size of the camera pixels (
4.4 Meta-Optics Fabrication
The convolutional meta-optics are fabricated in a silicon nitride layer on a quartz substrate. We first deposited silicon nitride on a double-side polished quartz wafer using plasma-enhanced chemical vapor deposition (Oxford; Plasma Lab 100). Then, we patterned a positive-tone resist (ZEP-520A) using e-beam lithography (JEOL; JBX6300FS). We used alumina as a hard mask for etching the silicon nitride layer, depositing it with an e-beam evaporator (CHA; SEC-600) and performing liftoff in 1-methyl-2-pyrrolidinone. We etched the silicon nitride layer with a plasma etcher (Oxford; PlasmaLab 100, ICP-180) using fluorine-based gases. To minimize stray light, we blocked all light except that passing through the 16 suboptics by patterning apertures around each one using photolithography (Heidelberg; DWL66+) and metal deposition followed by a liftoff process.
4.5 Optical Measurements
Two-dimensional PSFs of the 16 optical kernels are measured simultaneously with a simple measurement setup. A single-mode fiber-coupled light source acts as a point source, and the meta-optics are placed 92 mm from the source. The meta-optics sit on a three-axis linear stage and a kinematic mount with two rotation-adjustment knobs, so both the position and angle of the meta-optics can be well defined with respect to the designed setup. Then, we placed a high-resolution color camera (GT-1930C with
References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks" (2012).
[2] X. Li et al., "Performance analysis of GPU-based convolutional neural networks," 67–76 (2016).
[3] P. L. McMahon, "The physics of optical computing," Nat. Rev. Phys. 5, 717–734 (2023).
[16] K. Wei et al., Sci. Adv. 10 (2024).
[20] H. Zheng et al., "Meta-optic accelerators for object classifiers," Sci. Adv. 8, eabo6410 (2022).
[22] L. Huang et al., "Photonic advantage of optical encoders," Nanophotonics 13, 1191–1196 (2023).
[29] H. Cai et al., "Network augmentation for tiny deep learning" (2022).
[30] R. W. Gerchberg, "A practical algorithm for the determination of phase from image and diffraction plane pictures," Optik 35, 237–246 (1972).
[31] J. W. Goodman, Introduction to Fourier Optics (2005).
[39] Y. Zheng et al., "BI-MAML: balanced incremental approach for meta learning" (2020).
[42] S. B. Damelin and W. Miller, The Mathematics of Signal Processing (2011).
[43] B. Dube, "Phase retrieval II: iterative transform algorithms" (2023).
