
- Photonics Research
- Vol. 12, Issue 11, 2524 (2024)
Abstract
1. INTRODUCTION
High dynamic range (HDR) imaging is an essential imaging technology that aims to preserve as much effective information as possible from a natural scene, satisfying the needs of applications in extreme conditions where low dynamic range (LDR) imaging fails. For example, an LDR visual pipeline in autonomous driving can become saturated and over-exposed by highlights, resulting in information loss and degraded decision-making, whereas HDR imaging alleviates these problems and helps guarantee driving safety.
Dynamic range (DR) is expressed as the ratio between the highest and lowest luminance values [1]; in the digital image domain it is also defined, in logarithmic form, as the ratio of the highest to the lowest pixel value [2]:
$$\mathrm{DR} = 20\log_{10}\frac{I_{\max}}{I_{\min}}\ \mathrm{dB}. \qquad (1)$$
From moonlight illumination at night to bright sunshine, the DR of the natural world reaches approximately 280 dB. The widest range of luminance that human eyes can perceive is 120 dB [3], while the DR of a conventional 8-bit camera falls to only 48 dB according to Eq. (1). Generally, the process of capturing an HDR scene as one LDR image with a conventional LDR camera can be modeled by three major steps: dynamic range clipping, non-linear mapping, and quantization [4]. Dynamic range clipping imposes a hard cutoff at a top irradiance determined by the full well capacity of the sensor, while quantization causes missing information for pixels in under-exposed regions, as determined by the inherent noise of the sensor. The information loss in such an LDR imaging pipeline is therefore tremendous. Meanwhile, a growing number of applications, including robotics, driver assistance systems, self-driving, and drones, place stricter requirements on extended dynamic range. Thus, in order to capture and accurately represent the rich information in natural scene content, plenty of HDR imaging techniques have emerged and drawn much attention from researchers over the last decades.
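As a quick numerical check of Eq. (1), the figures quoted above follow directly from the logarithmic definition. The short Python sketch below reproduces them; the helper function is ours, introduced only for illustration.

```python
import math

def dynamic_range_db(max_level: float, min_level: float = 1.0) -> float:
    """Dynamic range in dB as the log-ratio of the highest to the lowest level, cf. Eq. (1)."""
    return 20.0 * math.log10(max_level / min_level)

print(dynamic_range_db(2 ** 8))   # ~48 dB: conventional 8-bit camera
print(dynamic_range_db(1e6))      # 120 dB: approximate range of human vision
print(dynamic_range_db(1e14))     # 280 dB: approximate range of natural scenes
```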
The mainstream approach, widely deployed on current smartphone devices, is to fuse a series of LDR images captured by a conventional sensor under different exposures [5–8]. By sequentially capturing LDR measurements and merging the exposure brackets [9], these conventional HDR methods successfully expand the dynamic range and perform well on static scenes. However, these multi-shot fusion methods suffer from motion artifacts and image alignment difficulties when dealing with moving scenes or moving targets. Although a batch of modified fusion solutions aims to eliminate these problems by leveraging optical flow [10] or deep learning [11,12], they still face the same inherent problems of multi-shot fusion: (1) high memory consumption due to the large data volume (at least 3 LDR images for a single HDR output); (2) high computational workload caused by complicated fusion algorithms. These drawbacks limit the usage of multi-shot fusion methods in applications whose performance requirements go beyond mobile-phone HDR.
To eliminate the aforementioned limitations of fusion methods, researchers have turned to achieving HDR via single-shot schemes. The prevailing single-shot HDR schemes are broadly categorized into two domains: software solutions concentrating on algorithmic enhancements, and hardware strategies that revise the sensor or the optical path.
In the software domain, inverse tone mapping and inverse LDR camera pipelines, which develop deep networks to explicitly learn to reverse the image formation process [4], provide a way to reconstruct an HDR image from a single LDR image. Yet these end-to-end deep learning methods still pose challenges in computational workload and generalization. Moreover, the HDR content is hallucinated from LDR images and thus cannot be a credible alternative for downstream vision tasks that require accurate content from the real world.
In the hardware domain, researchers have attempted to capture multiple exposures simultaneously within a single exposure period. To this end, some works redesign the pixel structure of the imaging sensor [13,14], and others fuse multi-exposure measurements from optically aligned multiple sensors [15,16]. These methods achieve single-shot HDR reconstruction at the cost of fabrication complexity, device cost, alignment difficulty, and footprint, which limits their practical applicability.
As mentioned above, reconstructing HDR images from a single LDR image captured by a conventional sensor is an ill-posed inverse problem. A line of studies has attempted to overcome this ill-posedness in the hardware domain by borrowing strategies from the field of computational imaging. Various optical components are placed as optical encoders in front of the sensor in order to modulate the light field flexibly. The optical encoder encodes the saturated details, providing physical priors for a subsequent algorithmic decoder to reconstruct an HDR image from a single LDR capture. One branch of these approaches uses diffractive optical elements (DOEs) with point spread function (PSF) optimization strategies [17,18], while another relies on spatially varying pixel exposures (SVEs) [2,19]. Loading custom patterns on spatial light modulators (SLMs) is a representative SVE strategy, in which liquid crystal on silicon (LCoS) [20,21] or a digital micromirror device (DMD) [22,23] modulates the light field and encodes the HDR information in the spatial or frequency domain. These existing optical-encoding-based single-shot HDR methods suffer drawbacks in at least one respect: limited further extension of DR, recovery of HDR scenes with large saturated regions, or time latency for a single HDR capture.
In this paper, we propose a novel framework for a specific HDR vision task, namely a pixel-wise optical encoder driven by video prediction (POE-VP), which consists of two modules: (1) the pixel-wise optical encoder module (POEM), the hardware part; and (2) the coded mask prediction module (CMPM), its software counterpart. In the POEM, we employ a DMD as an optical encoder to modulate the incident light intensity and selectively control the exposure, providing the physical foundation for single-shot pixel-wise highlight suppression. The modulation speed of the DMD can reach several hundred to several thousand hertz. To make full use of this performance, we introduce the CMPM into our framework, which activates the high-speed light modulation potential of the DMD. Drawing inspiration from predictive models that perform well in decision-making applications and motional scenes, we utilize a deep-learning-based video prediction (VP) model as the CMPM. We leverage its ability to infer motion in a video by extracting representative spatiotemporal correlations within the frame sequence [24], so as to predict the motion and variation of highlight regions in motional scenes. Thus, given the current and previous captured frames, the CMPM predicts a coded mask for the future frame in advance. The mask is then loaded on the DMD to suppress local highlight during the capture of the next frame, allowing high-speed, single-shot HDR capture. To evaluate the proposed system's ability to reveal lost information in saturated regions, we select moving license plate recognition (LPR) under highlight conditions as an HDR vision task on a motional scene. We establish the first dataset of paired HDR-LDR moving license plate videos for model training and framework simulation tests. To further demonstrate the methodology, a DMD camera prototype is fabricated and evaluated on an HDR moving object recognition scene. Driven by VP-based mask prediction, the proposed single-shot framework adapts to real motional scenes, performing well on both the simulated dataset and the real scene. To the best of our knowledge, this is the first work to integrate per-pixel exposure control and video prediction in the HDR imaging domain. The proposed single-shot framework overcomes the data-volume and artifact drawbacks of multi-shot techniques. Our approach faithfully recovers HDR information at relatively low time cost, addressing the trade-off between DR and time efficiency. Overall, it offers a new single-shot solution in the field of motional HDR imaging, especially for HDR vision tasks that demand both sufficient information and high time efficiency.
2. PRELIMINARIES ABOUT VIDEO PREDICTION
VP is a self-supervised extrapolation task that aims to infer the subsequent frames in a video from a sequence of previous frames used as context [24]. In deep-learning-based VP, the spatiotemporal correlations within consecutive input frames are learned to form a prediction of the future. Drawing inspiration from its application in motional scenes such as autonomous driving, we adopt VP in POE-VP for HDR motional scenes. For the VP algorithm used in our work, we place particular emphasis on inference speed and generalization ability, in order to achieve high-speed HDR and keep the framework adaptive to various over-exposure scenes. The VP model employed in our proposed framework is based on the architecture of the dynamic multi-scale voxel flow network (DMVFN) [25]. By estimating the dynamic optical flow between adjacent frames, the DMVFN models motions at various scales in order to predict future frames. The use of optical flow enhances the quality of the predicted frames as well as the generalization ability and stability over diverse scenes. In addition, the DMVFN contains a differentiable routing module that adaptively selects different architecture sizes according to the motion scale of the inputs. The routing module lowers the computational cost and significantly reduces the inference time. Specifically, for predicting a frame with resolution of
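To make the role of the VP model concrete, the following sketch shows how a DMVFN-style predictor would be queried in our framework: two past frames go in, an extrapolated future frame comes out. The function name, tensor layout, and the way the context frames are passed to the network are illustrative assumptions, not the actual DMVFN interface.

```python
import torch

def predict_next(model: torch.nn.Module, prev: torch.Tensor, curr: torch.Tensor) -> torch.Tensor:
    """Extrapolate frame t+1 from frames t-1 and t with a DMVFN-style predictor (sketch).

    prev, curr: float tensors of shape (1, 3, H, W) with values in [0, 1].
    The channel-concatenated input layout is an assumption, not the actual DMVFN API.
    """
    with torch.no_grad():
        context = torch.cat([prev, curr], dim=1)   # (1, 6, H, W): both context frames
        next_frame = model(context)                # (1, 3, H, W): extrapolated frame t+1
    return next_frame.clamp(0.0, 1.0)
```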
3. PRINCIPLE
As mentioned in Section 1, the proposed POE-VP consists of two modules, namely the POEM and the CMPM. The POEM is driven by the CMPM; in particular, the coded grayscale mask loaded on the DMD in the POEM is predictively computed by the VP model in the CMPM. The conceptual framework of POE-VP operating in an HDR motional license plate scene is illustrated in Fig. 1.
Figure 1.The simplified illustration of the conceptual framework of POE-VP, demonstrating with an HDR motional license plate recognition downstream vision task. POEM (shown in the first row in beige color) and CMPM (shown in the second row in purple color) are two core modules that build up POE-VP.
We employ the DMD to achieve per-pixel optical encoding, controlling the exposure at each spatial position. The imaging hardware used in POE-VP is summarized inside the orange dashed box in Fig. 1. The arrangement of these optical components in a practical prototype is discussed in Section 7.
The first row of Fig. 1 (in the beige background color) illustrates the circumstance in which no mask is loaded on the DMD. In this case, the whole imaging pipeline can be seen as a conventional LDR imaging system, which allows all of the original light from the HDR scene to enter the sensor without any modulation. The highlight in the HDR scene forms over-exposure regions on the imaging sensor, leading to LDR captures (Frame t−1/t) whose license plate cannot be recognized by any license plate detection and recognition (LPDR) approach, because of the information loss caused by saturation.
The second row of Fig. 1 (in the purple background color) depicts the operation of the CMPM. In the purple dashed box, the VP model, DMVFN, takes the previously captured LDR frame sequence (Frame t−1/t,
As shown in the third row of Fig. 1, as the objects in the scene keep moving, the predicted mask will be loaded on the DMD at a relatively high refresh rate concomitantly. Then, in the next frame’s capture process, the pattern of the mask (Predicted Mask Frame
According to the aforementioned mechanism, there are no predicted masks for the first 2 frames at the very beginning, because no previous frames are available as inputs. Therefore, when operating, our framework has an initialization process: the first 2 frames captured by the POEM are used to initialize the VP model instead of for HDR imaging. After initialization, the whole system operates under the aforementioned mechanism, capturing HDR images through POE-VP frame by frame at a certain rate.
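The operating mechanism above can be summarized as a simple control loop: initialize with two unmodulated captures, then, for every subsequent frame, predict the coming frame, derive a coded mask from its predicted bright regions, load the mask on the DMD, and capture. The sketch below simulates this loop on pre-recorded scene irradiance; the names and the simple thresholding rule are ours and stand in for the actual mask-generation equation of the CMPM.

```python
import numpy as np

def simulate_poe_vp(scene_frames, vp_predict, thresh=230, atten=0.2):
    """Illustrative simulation of the POE-VP capture loop (our sketch, not the paper's code).

    scene_frames : list of float arrays (H, W), linear scene irradiance per frame.
    vp_predict   : callable(prev, curr) -> predicted next frame, standing in for the VP model.
    The threshold/attenuation rule is a placeholder for the actual mask-generation equation.
    """
    def ldr_capture(scene, mask):
        # Pixel-wise optical modulation by the DMD followed by sensor range clipping.
        return np.clip(scene * mask, 0, 255)

    no_mask = np.ones_like(scene_frames[0], dtype=float)
    captures = [ldr_capture(scene_frames[0], no_mask),   # first two frames: no modulation,
                ldr_capture(scene_frames[1], no_mask)]   # used only to initialize the VP model

    for scene in scene_frames[2:]:
        pred = vp_predict(captures[-2], captures[-1])    # forecast the coming frame
        mask = np.where(pred >= thresh, atten, 1.0)      # attenuate predicted highlights
        captures.append(ldr_capture(scene, mask))        # coded single-shot LDR capture
    return captures
```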
To sum up, the key features of POE-VP are as follows. (1) We employ the DMD as the optical encoder to achieve pixel-wise exposure control, which is the hardware basis for expanding DR. The DMD moves part of the computation into the optical domain, work that would otherwise be done electronically in the algorithm domain. Compared with software-domain approaches, this pre-sensor optical processing lowers the back-end computational cost and enhances the processing speed thanks to the high refresh rate of the DMD. (2) We use VP to predictively compute the coded mask that drives the DMD, so that the grayscale mask patterns follow the motional changes in the real scene. As a result, there are marked distinctions between our masks and those of other SVE methods. Our spatiotemporally varying mask sequence contains the spatiotemporal correlations depicting the motion of the over-exposure regions. In contrast, SVE methods generally use periodic static masks that lack the temporal dimension, or iteratively generated masks whose adaptation lags in the temporal dimension. The principle of our per-pixel exposure control achieves the goal of single-shot HDR, alleviating motional artifacts and reducing data size.
4. IMPLEMENTATION OF LEARNING IMAGE SEQUENCE PREDICTION
A. Dataset Preparation
In the field of HDR, acquiring paired HDR-LDR datasets is a common challenge. A natural HDR scene is inevitably rendered as an LDR image by an LDR camera, so it is difficult to obtain an HDR ground truth of the real scene that corresponds to LDR measurements captured by a conventional LDR camera. Among existing HDR benchmark datasets, there is still a gap for single-exposure, motional scenes [26].
In our work, we prepare our dataset mainly based on the LSV-LP dataset, a large-scale, multi-source, video-based LPDR dataset [27]. LSV-LP consists of three categories: move versus static, static versus move, and move versus move. Move versus static means that the data collection device is moving while the vehicle stays still; static versus move means that the device is static while the vehicle is moving; and move versus move means that both the device and the vehicle are moving. We extract a subset from the move versus static category, including 68 videos with 18,322 frames, divided into a training set (62 videos, 18,185 frames) and a test set (6 videos, 137 frames). We also collect 26 videos with 2298 frames of a moving license plate on the road and organize them as another part of the test dataset. A mobile phone camera is used as the video collection device. While the phone keeps moving, the videos are recorded at 1080p and 60 fps (frames per second) from a driving vehicle and a parked vehicle. All frames are resized to
To set up a paired HDR-LDR license plate dataset, we first need to acquire HDR information. However, the LSV-LP dataset was collected with a conventional LDR camera, so the HDR information has already been cut off during collection. Therefore, it is necessary to preprocess the original LSV-LP dataset. The preprocessing approach is illustrated in Fig. 2. Based on these videos, we apply a processing strategy called artificial highlight to generate HDR information. Specifically, we apply a Hadamard product to each frame from the original LSV-LP dataset with a grayscale map. This map, namely the artificial highlight map, follows a 2D Gaussian distribution whose extreme point aligns with the center point of the license plate, written as
Figure 2.(a) The normalized artificial highlight map
Then, we imitate the main step of the LDR image formation pipeline, dynamic range clipping, to simulate capturing LDR measurements with a conventional LDR camera.
In each generated HDR frame, pixels whose values exceed 255 are clipped to 255, while the other pixels remain unchanged. The over-255 pixels are saturated and form over-exposure regions, producing a simulated LDR frame, as shown in Fig. 2(d). Via the aforementioned steps, we obtain the paired HDR-LDR license plate dataset, namely HDR-LDR-LP.
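Per frame, the preprocessing described above amounts to two array operations: a Hadamard product with a Gaussian-shaped highlight map, followed by clipping at 255. A minimal sketch is given below; the exact way E and R enter the map is our assumption and may differ from the paper's formulation.

```python
import numpy as np

def artificial_highlight(frame, center, E=5.0, R=50.0):
    """Generate a paired (HDR, LDR) sample from one LDR frame (illustrative sketch).

    frame  : uint8 array (H, W) or (H, W, 3), an original LSV-LP frame.
    center : (cx, cy), license-plate center where the highlight peaks.
    E, R   : peak amplification and Gaussian radius; how exactly they enter the map
             is our assumption, standing in for the paper's highlight-map equation.
    """
    h, w = frame.shape[:2]
    y, x = np.mgrid[0:h, 0:w]
    cx, cy = center
    gauss = np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * R ** 2))
    highlight_map = 1.0 + (E - 1.0) * gauss          # = E at the plate center, = 1 far away
    if frame.ndim == 3:
        highlight_map = highlight_map[..., None]
    hdr = frame.astype(np.float64) * highlight_map   # Hadamard product: HDR information
    ldr = np.clip(hdr, 0, 255).astype(np.uint8)      # dynamic range clipping: simulated LDR
    return hdr, ldr
```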
B. VP Model Training
The preprocessing parameters E and R for the training dataset are set to 5 and 50, respectively. The DMVFN model used in our framework is trained on
5. EVALUATION METRICS
Our framework emphasizes HDR imaging for specific vision tasks rather than HDR photography for human vision. As a result, the evaluation metrics in this paper put more weight on the amount of information that can be revealed to support downstream vision tasks in highlight conditions than on image quality and perceptual quality. The main quantitative metrics are introduced and defined as follows.
A. Recognition Accuracy
Recognition accuracy (Rec-Acc) is calculated per test video, defined as Eq. (4).
B. Information Entropy
To quantitatively measure the amount of information around the license plate region in both the HDR frame sequence and the LDR frame sequence, we introduce the information entropy S [28], defined over the gray-level histogram as $S = -\sum_{i=0}^{255} p_i \log_2 p_i$, where $p_i$ denotes the probability of gray level $i$ within the evaluated region.
C. Local Information Entropy
A certain pixel’s local information entropy is calculated in a
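For reference, both metrics can be computed from a grayscale histogram. The sketch below uses Shannon entropy over 8-bit gray levels; the 9×9 sliding window is an assumed neighborhood size, standing in for the one used in the paper.

```python
import numpy as np

def entropy(gray):
    """Shannon information entropy (bits) of an 8-bit grayscale image or patch."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def local_entropy(gray, k=9):
    """Per-pixel entropy over a k x k neighborhood (k is an assumed window size)."""
    pad = k // 2
    padded = np.pad(gray, pad, mode="reflect")
    out = np.zeros(gray.shape, dtype=np.float64)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            out[i, j] = entropy(padded[i:i + k, j:j + k])
    return out
```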
6. SIMULATION ASSESSMENT
A. Simulation Experiment
The test dataset prepared for the simulation experiment consists of two parts. The first part is the test set extracted from the original LSV-LP dataset, which contains 6 videos with 137 frames. We apply the preprocessing strategy mentioned above to this test set, setting the Gaussian map parameters E to 9 and R to 50. Furthermore, in order to evaluate the generalization ability of our framework, we capture additional videos of a license plate from a different province in China with an iPhone 13 Pro Max camera, forming the second part of the test dataset, which contains 26 videos. This second part is then divided into three subsets preprocessed with different E (9, 29, 49) and the same R (50). The three configurations of the over-exposure rate E simulate various highlight conditions, allowing us to evaluate the generalization performance and stability of POE-VP.
Figure 3 illustrates the simulation pipeline. The original frame sequence extracted from each video in the test dataset is sent into the preprocessing unit, yielding paired HDR information maps and simulated LDR frames. The simulated LDR frames are resized to
Figure 3.The illustration of the simulation pipeline. Exemplary frames of the “JingA 3MU55” video from the test dataset are used as an example. For this example, preprocessing parameters (
We evaluate our framework’s HDR performance via a downstream license plate recognition (LPR) task. We use the architecture in Ref. [29] as the ALPR Net, which employs YOLOv5 [30] and CRNN [31] as the license plate detection model and the license plate recognition model, respectively. For the comparative experiment, the simulated LDR frames are evaluated with the same ALPR Net.
Note that we directly send the simulated sensor captures into the ALPR Net, as our target is a recognition task. For HDR image reconstruction, the simulated sensor captures first need to be divided by the corresponding mask predictions to obtain the full HDR reconstruction matrix, in which pixel values over 255 appear. Then, by applying tone mapping [32], the HDR results can be displayed properly on an LDR screen for human observation.
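For completeness, the sketch below illustrates this reconstruction-and-display step as we understand it: undo the per-pixel attenuation by dividing by the mask, then apply a global Reinhard-style operator [32] for display. The function names and the specific tone-mapping variant are our choices, not the paper's exact settings.

```python
import numpy as np

def reconstruct_hdr(coded_capture, mask, eps=1e-6):
    """Undo the DMD attenuation: a pixel captured under mask m recovers scene ~ capture / m."""
    return coded_capture.astype(np.float64) / np.maximum(mask, eps)

def reinhard_tonemap(hdr, white=None):
    """Global Reinhard-style operator for displaying HDR values on an LDR screen."""
    l = hdr / 255.0                                    # normalize to a nominal [0, ~E] range
    if white is None:
        white = l.max()
    mapped = l * (1.0 + l / (white ** 2)) / (1.0 + l)  # extended Reinhard curve
    return np.clip(mapped * 255.0, 0, 255).astype(np.uint8)
```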
B. Simulation Results
Figure 4.(a) Simulation results of “JingA 3MU55” license video, which is from the R50E29 class of test datasets. Exemplary frames are shown with the frame index annotations (
We also compare the proposed framework with existing techniques, including classical multi-exposure fusion [8], modified multi-exposure fusion with artifact reduction [33], and deep optics single-shot HDR [17]. The qualitative results are shown in Fig. 5, and all methods are evaluated on the same dataset described in Section 4.A. Note that in the implementation of the multi-exposure fusion methods [8,33], we use three consecutive frames as a group to generate one fused HDR frame. We simulate the exposure difference among the three frames by multiplying them by three different global exposure coefficients, set to 1/8, 1/5, and 1 in the comparison simulations, corresponding to low, medium, and high exposure, respectively. The 3-frame groups are selected one by one, from the first three frames to the last three frames, covering all consecutive 3-frame groups in the video data. It can be observed that our POE-VP outperforms the other methods in highlight suppression and HDR information preservation (around the target regions). Specifically, as a single-shot HDR method, POE-VP eliminates the artifacts that are easily found in the classical exposure fusion method due to the multiple captures of a moving scene. Although the multi-exposure fusion with artifact reduction remarkably reduces the artifacts, it still faces a performance limit in some extreme illumination scenes. For example, in the bottom row of Fig. 5, the modified multi-exposure fusion suppresses the highlight around the character “M” in the license plate less effectively than our POE-VP. For the test videos in the case of minimal over-exposure (preprocess parameter
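For reproducibility of the fusion baselines, the exposure simulation described above can be sketched as follows; the clipping mirrors the LDR formation model of Section 4, and the helper names are ours.

```python
import numpy as np

EXPOSURE_COEFFS = (1 / 8, 1 / 5, 1.0)   # low, medium, high exposure

def exposure_bracket(frames, idx, coeffs=EXPOSURE_COEFFS):
    """Build one simulated 3-exposure group from consecutive frames idx, idx+1, idx+2."""
    group = []
    for frame, c in zip(frames[idx:idx + 3], coeffs):
        # Global exposure scaling followed by LDR clipping, as in the comparison setup.
        group.append(np.clip(frame.astype(np.float64) * c, 0, 255).astype(np.uint8))
    return group   # fed to the multi-exposure fusion baseline under comparison
```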
Figure 5.The exemplary frames of comparison simulations with the other three existing methods. The results of our POE-VP are highlighted in red. Parts of the license plate are zoomed in for a detailed comparison of motional artifacts between the multi-shot methods and the proposed single-shot POE-VP. Note that the HDR results of all four methods are tone mapped with the same approach for display.
A quantitative comparison is given in Table 2, which reports the average recognition accuracy of the LPR task. The results show a prominent superiority of our method over the others in higher-saturation cases. Besides, time efficiency is an important feature of our work, as the target vision task is a motional HDR task. We compare the lowest running time of the algorithms for a single frame. The comparison in Table 2 shows that POE-VP has higher time efficiency, which enables a faster capture speed.
Quantitative Results of Comparison Simulation
| Capture Mode | Method | Average Recognition Accuracy | Average Running Time/ms |
|---|---|---|---|
| Multi-shots (at least 3 shots) | Exposure fusion (classical) | 65.07% | 135.261 |
| Multi-shots (at least 3 shots) | Exposure fusion (artifact reduced) | 88.56% | 424.831 |
| Single-shot (with physical setup) | Deep optics | 21.64% | 1161.458 |
The numbers in bold represent the best performance.
7. REAL SCENE EXPERIMENT
A. DMD Camera Prototype
Figure 6.(a) The schematic of optical layout inside the DMD prototype. (b) The DMD prototype. (c) The photograph of the light box in the experimental setup. The red arrow shows the moving direction of the license plate. (d) The overall view of the experimental setup.
B. Experimental Setup
The real scene experimental setup is shown in Fig. 6(d). The moving license plate scene under the highlight condition we create is shown in Fig. 6(c), corresponding to the red dashed box in Fig. 6(d). We use two different real license plates as objects in the experiment. The license plate is fixed on a sliding optical bench, which moves it along the oblique optical rail. We use an LED luminescent array in the light box (VC-118-X 3nh) as highlight source A, tuned to its maximum illuminance. Besides, we employ another LED flash light (MSN-400pro) as highlight source B, also tuned to its maximum illuminance. The two light sources emit light towards the license plate and form a highly illuminated area, which saturates a conventional image sensor without the DMD’s modulation. The DMD is set in front of the light box, at a distance of 5 m. In this experiment, we only use part of the prototype’s field of view, that is, a region of interest (ROI) at the size of
Figure 7 illustrates the mechanism and imaging procedure of our framework implemented via the DMD prototype, including the trigger mode and timing scheme. By generating control signals via an arbitrary waveform generator (AWG, UNI-T UTG2000B), we synchronize the DMD and image sensor [34]. As shown in Fig. 7(e), the trigger signal of the image sensor is a relatively long rectangular wave with pulse width
Figure 7.(a) LDR capture sequence. The red arrow shows the moving direction of the license plate. (b) The grayscale masks predicted by VP. In the real scene experiment, we set mask coefficients (
C. Results and Evaluations
Figure 8.(a) Hardware experiment results of “JingN F9H28” license plate motional scene. (b) Hardware experiment results of “JingL J7629” license plate motional scene. Exemplary frames of 2 result videos are shown with the frame index annotations (
Quantitative Results of Two Real HDR License Plate Recognition Scenes
| License Plate | Captured Frame Count | LDR Rec-Acc | HDR Rec-Acc | LDR Entropy | HDR Entropy | Time/ms |
|---|---|---|---|---|---|---|
| JingN F9H28 | 28 | 0 |  | 1.2018 |  | 16.807 |
| JingL J7629 | 39 | 0 |  | 1.2038 |  | 14.317 |
| Average |  | 0 |  | 1.2028 |  | 15.562 |
The numbers in bold represent the best performance.
It can be seen that POE-VP overcomes the failure of LDR captures in the recognition task. The Rec-Acc of HDR-captured videos improves by 95.15% on average compared with that of the LDR videos. The information entropy calculated on HDR captures is on average 5.9021 higher than on LDR captures, a 490.70% increase. POE-VP successfully suppresses the highlight and preserves much richer information on the license plate from the original HDR scene. The local information entropy maps of exemplary frames are shown in Fig. 9 for further demonstration, visualizing the considerable increase in information entropy brought by POE-VP. In the heat maps, the deep blue color in the LDR license plate regions indicates information loss. In contrast, the warm-colored HDR license plate regions show that POE-VP preserves a relatively large amount of information under the same highlight condition. Using a 36-patch dynamic range test chart (Sine image), we estimate the DR of the prototype. Without optical encoding (fully white pattern on the DMD), the prototype scores only 70.35 dB in DR, while with optical encoding it expands the DR to 120.74 dB, a gain of 71.63%.
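The quoted relative gains follow directly from the tabulated averages and the measured DR values, e.g.:

```python
ldr_entropy, entropy_gain = 1.2028, 5.9021       # average entropies (LDR, HDR minus LDR)
print(f"{entropy_gain / ldr_entropy:.2%}")       # 490.70% relative increase

dr_off, dr_on = 70.35, 120.74                    # measured DR in dB, without / with encoding
print(f"{(dr_on - dr_off) / dr_off:.2%}")        # 71.63% DR gain
```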
Figure 9.(a) The local information entropy heat maps of the exemplary frames in the “JingN F9H28” license videos. (b) The local information entropy heat maps of the exemplary frames in the “JingL J7629” license videos. The HDR and LDR license plate results are shown alongside their corresponding entropy maps for comparison.
The above results indicate that POE-VP greatly enhances performance on motional object recognition tasks in highlight conditions. Furthermore, we also analyze the time consumption and the high-speed property of POE-VP. In the real scene experiment, we utilize a PC to receive previous captures from the image sensor, run VP, and output masks to control the DMD, as illustrated in Fig. 6(a). We run the VP and mask calculation algorithm on an NVIDIA GeForce RTX 3090 GPU on a PC and summarize the running time in Table 4. The average running time for one frame in the real scene experiment is 15.562 ms, which is nearly the theoretical bound of the time consumption of POE-VP working in the real scene. However, due to the limited data transfer rate of the USB interface and the PC-based DMD driving mode, the actual capturing speed cannot reach the ideal one, even though the DMD can refresh its patterns at rates over a thousand hertz. Thus, the example video results we demonstrate in
Last but not least, acquiring a single HDR output via our framework requires buffering only two previous frames. Compared with multi-frame fusion methods, which take at least 3 frames of different exposures, POE-VP achieves a smaller captured data size, reducing the data volume by at least 66.67%.
8. DISCUSSION AND CONCLUSION
In summary, POE-VP offers a new paradigm for single-shot HDR vision tasks in motional scenes. Implemented with a DMD, the POEM realizes pixel-wise spatially varying exposure control to suppress the highlight in each captured frame. As motional artifacts and time latency are two critical challenges in motional-scene HDR, we propose the CMPM, which utilizes VP for ahead-of-time prediction. The CMPM introduces temporal consistency between the masks of consecutive captures and the motional scene. Driven by VP, the DMD produces a mask sequence whose spatiotemporal correlation closely matches the capture sequence of the highlight motional scene. This spatiotemporally varying optical encoding mask is the core feature of POE-VP, distinguishing it from other SVE approaches in the HDR field. Single-shot HDR is thereby achieved, thanks to ahead-of-time mask prediction via VP and high-speed mask loading via the high-refresh-rate DMD. The dynamic range of the prototype reaches 120 dB under test. We choose highlight motional license plate recognition as the downstream HDR vision task to evaluate the performance of POE-VP. According to subjective and objective analyses of the simulation and real scene results, POE-VP shows clear advantages in expanding DR and preserving HDR information. Compared with conventional LDR captures, POE-VP enhances recognition performance, as reflected in the recognition accuracy and information entropy. Running on an NVIDIA GeForce RTX 3090 GPU, the model and algorithm of POE-VP take 15.562 ms in total for a single HDR capture. In practical engineering deployment, the frame rate can exceed 100 fps by using a high-bandwidth, high-transmission-rate edge computing accelerator connected over a data bus, which achieves near-real-time HDR for motional scenes. Thus, POE-VP alleviates the artifacts in motional HDR imaging, as discussed in the comparison simulation in Section 6. Furthermore, compared with mainstream multi-frame fusion methods, POE-VP also reduces the data volume, leading to a smaller data size. It should be mentioned that the improvements of POE-VP in dynamic range expansion and artifact suppression mainly benefit from its physical setup, i.e., the use of the DMD as a pixel-wise optical modulator together with its controller, which in turn imposes limitations on device size and adds cost compared with multi-exposure methods using a conventional camera in some practical cases, for example, HDR imaging of static scenes (with almost no artifacts) where device portability and cost matter more. In summary, limited by current DMD technology and optical design, POE-VP trades some miniaturization and hardware cost for prominent improvements in time efficiency, data size, artifact reduction, and dynamic range expansion. Besides, through its physical setup, POE-VP can always provide additional dynamic range for the sensor used in an imaging system, regardless of the sensor’s bit depth. Overall, it can serve as an enhancement component for various systems in practical applications, bringing additional dynamic range and fast speed at the price of additional size and cost.
Our POE-VP can be improved in several aspects. First, the DR expansion in this work focuses on high-illuminance conditions; at the opposite end of the DR axis, low-illuminance scenarios remain to be addressed, and extending low-light sensing capability is future work. Second, the POE-VP proposed in this paper is optimized and fine-tuned for a specific vision task. A generic HDR scheme, as well as photography-level HDR imaging for human visualization based on the POE-VP framework, is an attractive direction for further exploration. For instance, the parameters in Eq. (2) that control the mask generation would need finer optimization for more complicated HDR tasks or even photography-level HDR imaging; making them differentiable and training them end-to-end with the video prediction network and other imaging-process networks for specific HDR tasks or general HDR photography is a further research subject that we will pursue. Third, POE-VP inherits a limitation from the VP algorithm we use: when the moving object undergoes an abrupt, large-magnitude motion, e.g., a sudden 180° turn, the video prediction may fail to catch up with such a sudden scene variation within two frames, resulting in imperfect highlight suppression for those frames. To improve this, we will investigate the joint optimization of imaging optics, optical encoding strategies, and image processing algorithms.
APPENDIX A: DIGITAL MICROMIRROR DEVICE USED AS A PIXEL-WISE OPTICAL ENCODER
The digital micromirror device (DMD), a high-speed amplitude SLM, carries several thousand microscopic mirrors arranged in a rectangular array on its surface. Controlled by two pairs of electrodes, each of these micromirrors can be individually tilted
Here, we dive into the control mechanism of micromirrors to further discuss the pixel-wise features of POE-VP’s special exposure modulation strategy. In PWM mode, the mirrors are held in the state corresponding to the previous loaded data until the next bit is loaded [
By ensuring each mirror of the DMD aligns with its corresponding pixel on the sensor, the precision of optical coding can be enhanced. After aligning the DMD with the sensor, the process in which DMD modulates the light field via a coded mask and forms a measurement on sensor can be described mathematically as follows [
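The equation itself is not reproduced here; as an illustration of the pixel-aligned modulation model described above, the sketch below treats the measurement as the element-wise (Hadamard) product of the scene irradiance and the DMD mask, followed by clipping and quantization at the sensor. This reflects our reading of the forward model rather than the paper's exact formulation.

```python
import numpy as np

def coded_measurement(scene, mask, exposure=1.0, full_well=255):
    """Forward model of a DMD-coded capture with 1:1 mirror-to-pixel alignment (sketch).

    scene : linear scene irradiance, (H, W) float array.
    mask  : per-pixel transmittance in [0, 1], set by the DMD duty cycle (PWM).
    """
    coded = scene * mask * exposure                 # element-wise optical modulation
    return np.clip(np.round(coded), 0, full_well)   # clipping + quantization at the sensor
```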
APPENDIX B: VIDEO PREDICTION MODEL
The video prediction model, the dynamic multi-scale voxel flow network (DMVFN), utilizes 9 convolutional sub-blocks named multi-scale voxel flow blocks (MVFBs) to estimate the dynamic optical flow between adjacent frames. The
There are two further evaluations on the VP model that need to be mentioned here. The first evaluation is about the training dataset of the prediction model. The preprocess method in Section
Results of Models Trained on the Regular RGB Dataset (Test on RGB Images), Models Trained on the Synthesized RGB Dataset (Test on RGB Images), and Models Trained on the Synthesized RGB Dataset (Test on Grayscale Images)
Training Dataset | Test Data | PSNR/dB | MS-SSIM | LPIPS |
---|---|---|---|---|
Regular image sequences (RGB) | RGB | 25.37 | 0.8469 | 0.1397 |
Synthesized image sequences (RGB) | RGB | 25.71 | 0.8603 | 0.1379 |
Synthesized image sequences (RGB) | Grayscale | 25.78 | 0.8620 | 0.1338 |
Higher PSNR and MS-SSIM scores and lower LPIPS scores indicate better prediction quality.
In our hardware experiment, we convert the single-channel grayscale capture data into 3-channel images as input to the video prediction net by copying the single channel three times and concatenating the copies along the channel dimension. The second evaluation therefore concerns the generalization of DMVFN, which was trained on RGB data but tested on such grayscale data. We use the synthesized test dataset (RGB) mentioned in Section
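Concretely, the channel replication used for feeding grayscale captures to the RGB-trained network can be written as:

```python
import numpy as np

def gray_to_rgb(gray):
    """Replicate a single-channel capture into a 3-channel input for the RGB-trained VP net."""
    return np.repeat(gray[..., None], 3, axis=-1)   # (H, W) -> (H, W, 3), identical channels
```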
APPENDIX C: PROOF DETAIL OF LDR CAPTURES VIA THE POE-VP PROTOTYPE WITH DMD-OFF IN THE HARDWARE EXPERIMENT
When a fully white pattern is loaded on the DMD, all the micromirrors are switched “on” and tilted to the same angle, reflecting the light toward the sensor. In this case, the DMD acts like one large mirror composed of all the micromirrors at the same angle, each reflecting its incident light to the corresponding pixel on the imaging sensor. We conduct an additional experiment to verify that POE-VP with a fully white DMD pattern is equivalent to a traditional LDR camera, i.e., that the DMD without coded modulation does not affect the LDR image capture. We set up a high-illumination scene using the same LED luminescent array and light box mentioned in Section
Figure 10.(a) The high illuminance scene of the additional experiment. (b) The setting of the two compared LDR capture devices, with the POE-VP prototype in the orange dashed box and the LDR camera in the blue dashed box. (c) The LDR images captured by two devices. (d) The local information entropy maps of two LDR captures.
APPENDIX D: THEORETICAL ANALYSIS ON TIME CONSUMPTION AND REAL-TIME PROPERTY
As mentioned in the main paper, the limitation on time consumption in our real scene experiment is mainly caused by the limited running speed of the algorithms and the low data transfer rate. For an engineering implementation, by using a field programmable gate array (FPGA) linked to the image sensor and the DMD through a data bus, we can not only reduce the time cost of data transfer but also speed up the VP and mask calculation algorithms. In this section, we provide a detailed theoretical calculation of the possible time consumption for capturing a single HDR frame via the proposed POE-VP under engineering conditions, from which we derive the estimated theoretical frame rate of POE-VP.
In our estimated calculation, we select a 32-bit width advanced high performance bus (AHB) under a 200 MHz clock frequency for data transfer between the computing unit and the DMD, as well as between the computing unit and the image sensor. The time consumption brought by AHB data transfer for a single frame is calculated as follows:
$$t_{\mathrm{AHB}} = \frac{1280 \times 720 \times 8\ \mathrm{bit}}{32\ \mathrm{bit} \times 200\ \mathrm{MHz}} = 1.152\ \mathrm{ms},$$
i.e., the number of bits per frame divided by the product of the bus width and the clock frequency.
For the role of the edge computing unit, we adopt the FPGA (Zynq XC7Z035)-based CNN accelerator designed for pixel-to-pixel applications from Ref. [
The total time consumption is calculated as follows:
The corresponding theoretical frame rate is therefore about 104 fps (1/9.597 ms), i.e., over 100 fps.
Main Notations for Calculation of Time Consumption
| Notation | Value | Description |
|---|---|---|
|  | 1280 | Image height |
|  | 720 | Image width |
|  | 8-bit | Image bit depth |
|  | 200 MHz | Clock frequency of AHB |
|  | 32-bit | Bit width of AHB |
|  | 1.152 ms | Time consumption on AHB |
|  |  | Computational complexity |
|  |  | Computing power |
|  |  | Running time on FPGA |
|  | 300 Hz | Modulation speed of DMD |
|  | 3.333 ms | Time consumption on DMD |
|  | 9.597 ms | Total time consumption |
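The bus-transfer and DMD entries in the table follow from the stated parameters; a quick check is shown below (the FPGA inference time is omitted because its value is not reproduced here).

```python
H, W, DEPTH = 1280, 720, 8          # image height, width, bit depth
BUS_WIDTH, F_CLK = 32, 200e6        # AHB bit width and clock frequency (Hz)
DMD_RATE = 300                      # DMD modulation speed (Hz)

t_ahb = H * W * DEPTH / (BUS_WIDTH * F_CLK)   # = 1.152 ms per frame transfer
t_dmd = 1 / DMD_RATE                          # = 3.333 ms per mask refresh
print(f"AHB: {t_ahb * 1e3:.3f} ms, DMD: {t_dmd * 1e3:.3f} ms")
```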
References
[1] E. Reinhard, G. Ward, S. Pattanaik. High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting(2010).
[2] S. K. Nayar, T. Mitsunaga. High dynamic range imaging: spatially varying pixel exposures. IEEE Conference on Computer Vision and Pattern Recognition, 472-479(2000).
[3] F. Dufaux, P. Le Callet, R. Mantiuk. High Dynamic Range Video: From Acquisition, to Display and Applications(2016).
[4] Y.-L. Liu, W.-S. Lai, Y.-S. Chen. Single-image HDR reconstruction by learning to reverse the camera pipeline. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1651-1660(2020).
[5] P. E. Debevec, J. Malik. Recovering high dynamic range radiance maps from photographs. 24th Annual Conference on Computer Graphics and Interactive Techniques, 369-378(1997).
[6] F. Banterle, A. Artusi, K. Debattista. Advanced High Dynamic Range Imaging(2017).
[7] S. W. Hasinoff, F. Durand, W. T. Freeman. Noise-optimal capture for high dynamic range photography. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 553-560(2010).
[9] E. Onzon, F. Mannan, F. Heide. Neural auto-exposure for high-dynamic range object detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7710-7720(2021).
[10] Ce. Liu. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis(2009).
[12] Q. Yan, D. Gong, Q. Shi. Attention-guided network for ghost-free high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1751-1760(2019).
[13] T. Asatsuma, Y. Sakano, S. Iida. Sub-pixel architecture of CMOS image sensor achieving over 120 dB dynamic range with less motion artifact characteristics. International Image Sensor Workshop, R31(2019).
[14] S. Iida, Y. Sakano, T. Asatsuma. A 0.68 e-rms random-noise 121 dB dynamic-range sub-pixel architecture CMOS image sensor with LED flicker mitigation. IEEE International Electron Devices Meeting (IEDM), 10.2.1-10.2.4(2018).
[16] J. Han, C. Zhou, P. Duan. Neuromorphic camera guided high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1730-1739(2020).
[17] C. A. Metzler, H. Ikoma, Y. Peng. Deep optics for single-shot high-dynamic-range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1375-1385(2020).
[18] Q. Sun, E. Tseng, Q. Fu. Learning rank-1 diffractive optics for single-shot high dynamic range imaging. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1386-1396(2020).
[19] S. Hajisharif, J. Kronander, J. Unger. Adaptive dualiSO HDR reconstruction. EURASIP Journal on Image and Video Processing, 1-13(2015).
[25] X. Hu, Z. Huang, A. Huang. A dynamic multi-scale voxel flow network for video prediction. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6121-6131(2023).
[30] https://doi.org/10.5281/zenodo.3908559
[32] E. Reinhard, M. Stark, P. Shirley. Photographic tone reproduction for digital images. Seminal Graphics Papers: Pushing the Boundaries, 2, 661-670(2023).
[33] https://github.com/dario-loi/exposure-fusion
[36] D. Doherty, G. Hewlett. 10.4: Phased reset timing for improved digital micromirror device (DMD) brightness. SID Symposium Digest of Technical Papers 29, 125-128(1998).
[37] W. Sun, C. Tang, Z. Yuan. A 112-765 GOPS/W FPGA-based CNN accelerator using importance map guided adaptive activation sparsification for pix2pix applications. IEEE Asian Solid-State Circuits Conference (A-SSCC), 1-4(2020).
[40] M. Jaderberg, K. Simonyan, A. Zisserman. Spatial transformer networks. Advances in Neural Information Processing Systems, 1-9(2015).
[42] Z. Wang, E. Simoncelli, A. Bovik. Multiscale structural similarity for image quality assessment. 37th Asilomar Conference on Signals, Systems & Computers, 2, 1398-1402(2003).
[43] R. Zhang, P. Isola, A. A. Efros. The unreasonable effectiveness of deep features as a perceptual metric. IEEE Conference on Computer Vision and Pattern Recognition, 586-595(2018).
