• Photonics Research
  • Vol. 13, Issue 11, 3121 (2025)
Haoze Song1, Yibo Feng1, Xilong Dai1, Xinyue Su1, and Liheng Bian1,2,*
Author Affiliations
  • 1State Key Laboratory of CNS/ATM, Beijing Institute of Technology, Beijing 100081, China
  • 2Yangtze Delta Region Academy of Beijing Institute of Technology (Jiaxing), Jiaxing 314019, China
    DOI: 10.1364/PRJ.570419
    Haoze Song, Yibo Feng, Xilong Dai, Xinyue Su, Liheng Bian, "Single-photon dead-time imaging via temporal super-resolution," Photonics Res. 13, 3121 (2025)

    Abstract

    Single-photon imaging provides high photon sensitivity and the capability to capture ultrafast dynamics. However, temporal cutoff characteristics in single-photon avalanche diode (SPAD) arrays arise from in-frame dead time caused by the avalanche process and inter-frame dead time caused by the readout circuit, limiting the achievable frame rate when exposure time is reduced. We first studied a physics-based temporal model that incorporates in-frame and inter-frame dead time, and proposed two reconstruction strategies that achieve higher fidelity and temporal resolution. Then we designed a transformer network with temporal and spatial feature extractors, which achieved 2× temporal resolution, 2× spatial resolution, and an average peak signal-to-noise ratio improvement of 8.14 dB. We applied the technique to a series of observation experiments, including fan rotation, plasma discharge, and fluorescence quenching dynamics. These experiments validate the technique’s state-of-the-art temporal and spatial super-resolution SPAD imaging performance.

    1. INTRODUCTION

    Single-photon avalanche diode (SPAD) arrays provide single-photon sensitivity and eliminate readout noise [1–6]. SPAD technology has already been widely adopted and continues to advance in the fields of LiDAR [7–9], fluorescence lifetime microscopy [10], and quantum imaging [11,12]. However, the practical implementation of SPAD arrays in high-speed imaging systems is limited by two types of dead time introduced by the quenching and readout circuits. The first, in-frame dead time, refers to the interval after an avalanche event during which the quenching circuit deactivates the SPAD [13,14]. Photons arriving in this interval go undetected, disrupting continuous photon counting and degrading temporal fidelity. The second is the inter-frame dead time caused by the hardware readout time, as SPAD arrays cannot detect photons until data readout is complete [1]. Although SPADs support ns-scale acquisition, transferring 8-bit data typically requires around 3–4 μs. The combined impact of these two dead times causes interruptions in photon detection and compromises the continuity of data acquisition, which limits the performance of high-speed imaging systems.

    Recent hardware developments have focused on reducing dead time and improving the temporal resolution of SPAD arrays through architectural mechanisms. Active quenching is a circuit technique that, immediately after an SPAD avalanche, forcibly drops the diode’s bias to stop the current, drains the residual charge, and then quickly restores the bias, allowing the diode to be rebiased almost immediately. Malanga et al. demonstrated an integrated active-quenching circuit (AQC) that limits per-pixel dead time to 50 ps while occupying only 0.068 mm² [15]. Giudici et al. further introduced a fully digital active-recharge scheme that compresses dead time to the 4 ns range without enlarging the readout footprint [16]. These methods effectively improve imaging fidelity but cannot increase temporal resolution because they do not address the readout circuit. Different from conventional frame-driven readout, event-based SPAD arrays emit time-stamped events only when photons are present. Severini et al. reported a readout dead time of 330 ns [17]. Zheng et al. achieved a time resolution of 102 ps and a range of 417 ns [18]. However, event-driven SPAD arrays still have several limitations. High photon flux can overflow on-chip first-in, first-out (FIFO) buffers, resulting in dropped events. Asynchronous pixel triggers introduce timing jitter, which manifests as noise. Furthermore, the readout dead time continues to limit achievable frame rates. Alternative sensor architectures, such as macro-pixel designs that couple multiple SPADs to a single readout, have been proposed to increase the fill factor [19,20]. However, this approach introduces a shared dead time across the macro-pixel and makes it impossible to distinguish how many photons were detected or which specific sub-cell was triggered.

    Deep-learning-based interpolation techniques tackle temporal constraints introduced by dead time. Choi et al. introduced CAIN, a channel-attention network that produces high-quality video frame interpolation [21]. Lu et al. developed VFIformer, which employs a transformer architecture to capture long-range pixel correlations and delivers competitive results on public benchmarks [22]. Zhang et al. presented EMA-VFI, enabling both fixed and arbitrary time-step interpolation with competitive efficiency [23]. For spatial enhancement, deep-learning-based methods such as SwinIR and the hybrid attention transformer improve resolution across diverse scenes [24,25]. However, these methods rely on conventional RGB video datasets and therefore cannot address the spatial noise and detector dead time of SPAD arrays, limiting their ability to improve both the temporal and spatial resolution of SPAD imaging.

    In this study, we first establish a physics-based space–time model for SPAD imaging. A key feature of SPAD arrays is their parallel integrate-and-read architecture, which enables photon counting and data readout to operate simultaneously. Figure 1 shows the resulting inter-frame dead time and how it changes with the hardware integration interval. We then present two reconstruction strategies. The non-equivalent time integration strategy (NETIS) reconstructs two frames from a single measurement, while the equivalent time integration strategy (ETIS) reconstructs an intermediate frame from two consecutive measurements. Next, we design a transformer architecture integrating a temporal encoder, a spatial transformer encoder, and a decoder to extract complementary temporal and spatial features. We built three experimental setups to capture both macroscopic and microscopic scenes. Experimental results confirm state-of-the-art temporal and spatial super-resolution performance in single-photon imaging.


    Figure 1. SPAD pixel hardware timing and reconstruction strategy. (a) Non-equivalent time integration strategy: integration, readout, and inter-frame dead time. (b) Equivalent time integration strategy: integration, readout, and inter-frame dead time. (c) Illustration of measurement and reconstruction with NETIS and ETIS. (d) Results of reconstructed frames with NETIS and ETIS.

    2. METHOD

    Single-photon imaging systems that use SPAD arrays detect individual photons with high sensitivity. However, this technology introduces challenges that do not arise with conventional CMOS sensors. One key difference is the correlation between speed and sensitivity. In traditional cameras, shorter exposure times lead to higher frame rates. In our type of SPAD system, where the frame rate is determined by the hardware readout time, reducing the integration time below this readout-defined limit does not increase the fundamental frame acquisition rate. This decoupling arises from dead time. Hardware constraints introduce in-frame and inter-frame dead time during integration and readout, which prevents continuous acquisition and sets an upper bound on the frame rate. The detection mechanism adds further complexity. Each photon event results in a pixel being inactive for a brief period of time. The inter-frame dead time limits the system’s information capacity, reduces the achievable frame rate, and constrains the bit depth of each measurement. It often leads to trade-offs in image quality or temporal resolution that are not present in conventional sensors.

    Overcoming the temporal limitations and noise of SPAD data is crucial for high-performance imaging. Our framework starts with a physics-based spatial–temporal model that captures both dead time and noise sources. We then explore two reconstruction strategies. The first is the non-equivalent time integration strategy, which features a long hardware integration time and negligible inter-frame dead time, so that the output reflects the total photons detected in each integration period. The second is the equivalent time integration strategy (ETIS), which employs an integration time shorter than the readout time, resulting in inter-frame dead time and limiting frame rate improvements despite shorter exposures. Finally, we introduce the single-photon temporal and spatial resolution network (SPTSR-Net), a bespoke deep learning architecture that fuses temporal and spatial features to deliver high-fidelity reconstructions directly from raw SPAD outputs.

    A. Physics-Based Temporal-Spatial SPAD Imaging Model

    1. Dead Time Modeling

    Dead time is the period after photon detection during which the SPAD cannot detect new photons. This period reduces measurement accuracy and temporal resolution. We model two categories of dead time.

    In-Frame Dead Time. In SPAD imaging, the in-frame dead time refers to the interval after each photon detection during which the detector remains inactive. The system operates with a synchronization period of 20 ns, of which 10 ns is used for integration. After an avalanche event, the quenching circuit requires between 50 and 150 ns to reset the diode. Any photons arriving during this period are undetected. Consequently, the total number of independent detection intervals per frame is reduced. Denoting the frame duration by $T_{\mathrm{frame}}$, the synchronization period by $T_{\mathrm{syn}}$, and the in-frame dead time by $T_{\mathrm{dead}}$, the maximum number of detection intervals $N_{\mathrm{sub}}$ is given by $N_{\mathrm{sub}} \leq T_{\mathrm{frame}} / (T_{\mathrm{syn}} + T_{\mathrm{dead}})$. For $T_{\mathrm{frame}} = 5200$ ns, $T_{\mathrm{syn}} = 20$ ns, and $T_{\mathrm{dead}} = 60$ ns, this leads to $N_{\mathrm{sub}} \leq 65$. These parameters are based on the manufacturer’s hardware specifications for this device.
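
    As a quick numeric check of this relation, the following minimal Python sketch (variable names are ours) reproduces the number of detection intervals from the timing parameters above:

```python
# In-frame dead-time budget: independent detection intervals per frame.
T_frame = 5200  # frame duration, ns
T_syn = 20      # synchronization period, ns
T_dead = 60     # in-frame dead time after an avalanche, ns

N_sub = T_frame // (T_syn + T_dead)  # floor of T_frame / (T_syn + T_dead)
print(N_sub)  # 65
```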

    Inter-Frame Dead Time. This interval occurs between frames because of integration and readout cycles. We define the hardware integration time $T_{\mathrm{HIT}}$ and readout time $T_{\mathrm{HRT}}$. The resulting inter-frame dead time $T_{\mathrm{ifdt}}$ depends on these values.

    Non-Equivalent Time Integration Strategy (NETIS). In this strategy, the hardware integration time exceeds the readout time of 10.4 μs, while the inter-frame dead time remains below 10 ns. These minimal intervals ensure that each output reflects the total photon count for its integration period, enabling the continuous accumulation of photon events.

    Equivalent Time Integration Strategy (ETIS). In this strategy, the hardware integration time is shorter than the readout time. Consequently, the sensor disables photon detection until the readout of previous frames is complete. During this inactive period, arriving photons are not counted, which limits the frame rate despite reductions in integration time.

    The overall measurement is modeled as $\mathrm{Meas} = f(\mathrm{Image}, T_{\mathrm{dead}}, T_{\mathrm{ifdt}})$, where Meas denotes the recorded SPAD data, Image represents the true spatial–temporal scene intensity, and $f$ captures the sensor response, including dead times and noise.

    2. Spatial Noise Modeling

    Accurate noise modeling is crucial for effective data simulation for SPAD arrays. Building on our previous work [26], we represent the total noise $N$ as the sum of multiple independent noise sources, expressed as $N = N_{\mathrm{shot}} + N_{\mathrm{fp}} + N_{\mathrm{dcr}} + N_{\mathrm{ap}} + N_{\mathrm{ct}} + N_{\mathrm{dt}}$. Here, $N_{\mathrm{shot}}$ denotes shot noise arising from statistical fluctuations in photon arrival times, $N_{\mathrm{fp}}$ represents fixed-pattern noise, $N_{\mathrm{dcr}}$ corresponds to the dark count rate from thermally generated carriers triggering avalanche events, $N_{\mathrm{ap}}$ represents the probability of afterpulsing events, $N_{\mathrm{ct}}$ represents crosstalk events occurring when an avalanche in one SPAD cell induces secondary avalanches in adjacent cells, and $N_{\mathrm{dt}}$ accounts for dead-time noise, which describes the information loss and signal distortion caused by the SPAD array dead time [1,27].
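
    The sketch below illustrates how such a per-pixel counting model can be composed in Python. It is a simplified stand-in (shot noise, dark counts, afterpulsing, and nearest-neighbor crosstalk only) for the calibrated model of Ref. [26]; the function name and default parameters are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def spad_counts(flux, n_sub=65, dcr=1e-3, p_ap=0.01, p_ct=0.005):
    """Simplified per-pixel SPAD counting model (illustrative sketch).

    flux : expected detected photons per subframe per pixel, shape (H, W)
    n_sub: independent detection intervals per frame (in-frame dead time folded in)
    dcr  : dark-count probability per subframe
    p_ap : afterpulsing probability per detection
    p_ct : crosstalk probability per detection (right-hand neighbor only)
    """
    counts = np.zeros(flux.shape, dtype=np.int32)
    for _ in range(n_sub):
        # Shot noise + dark counts: at most one detection per subframe (1-bit behavior).
        p_det = 1.0 - np.exp(-(flux + dcr))
        hit = rng.random(flux.shape) < p_det
        counts += hit.astype(np.int32)
        # Afterpulsing: a detection may retrigger the same pixel once more.
        counts += ((rng.random(flux.shape) < p_ap) & hit).astype(np.int32)
        # Crosstalk: a detection may induce a count in the right-hand neighbor.
        ct = (rng.random(flux.shape) < p_ct) & hit
        counts[:, 1:] += ct[:, :-1].astype(np.int32)
    return counts
```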

    To mitigate these noise sources, we employ the same strategies as in our previous work [26]. First, we correct for fixed-pattern noise using the manufacturer’s photon detection efficiency map. We then acquired 60,000 single-photon (1-bit) dark-field images to calibrate the time-dependent noise sources. Based on a temporal and spatial correlation analysis of these dark frames, we distinguish and quantify the afterpulsing probability, crosstalk probability, and dark count rate for our noise model.
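
    A sketch of how dark-frame statistics could be turned into these calibration quantities is shown below; the estimators are standard first-order choices of our own, not necessarily the exact procedure of Ref. [26].

```python
import numpy as np

def calibrate_dark(dark_frames):
    """Estimate noise parameters from a stack of 1-bit dark-field frames.

    dark_frames : (N, H, W) binary array acquired with the sensor in darkness
    Returns a per-pixel dark-count rate plus simple afterpulsing and crosstalk indicators.
    """
    dark_frames = dark_frames.astype(bool)
    # Dark count rate: average firing probability per frame for each pixel.
    dcr_map = dark_frames.mean(axis=0)
    # Afterpulsing indicator: excess probability of firing in consecutive frames
    # relative to the independent-firing baseline (temporal correlation).
    joint_t = (dark_frames[:-1] & dark_frames[1:]).mean(axis=0)
    p_ap = np.clip(joint_t - dcr_map**2, 0.0, None)
    # Crosstalk indicator: excess coincidence with the right-hand neighbor
    # relative to the independent baseline (spatial correlation).
    joint_s = (dark_frames[:, :, :-1] & dark_frames[:, :, 1:]).mean(axis=0)
    p_ct = np.clip(joint_s - dcr_map[:, :-1] * dcr_map[:, 1:], 0.0, None)
    return dcr_map, p_ap, p_ct
```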

    B. Time Integration Strategies

    Time integration strategies in SPAD arrays improve the temporal resolution and dynamic range of captured images. The choice of strategy affects how temporal information is recorded and reconstructed, as well as influencing the frame rate and data quality. In this study, we consider two strategies: the non-equivalent time integration strategy and the equivalent time integration strategy. Understanding the characteristics and limitations of these strategies enables the effective design of data processing and reconstruction algorithms.

    1. Non-Equivalent Time Integration Strategy

    In NETIS, the hardware integration time $T_{\mathrm{HIT}}$ is set significantly longer than the readout time of 10.40 μs. The inter-frame dead time $T_{\mathrm{ifdt}}$ remains minimal, often below 10 ns, due to synchronization signals and fast quenching circuits. Photon events are collected continuously over the integration period.

    The main advantage of this mode is that it enables large photon counts to be accumulated, thereby improving the signal-to-noise ratio and allowing low-light information to be detected. The data acquired in this strategy represents the total number of photons detected during integration.

    The measurement in NETIS can be expressed as $F_m(x,y) = N\big(\int_{t_0}^{t_0 + T_{\mathrm{HIT}}} I(x,y,t)\,\mathrm{d}t\big)$, where $F_m(x,y)$ denotes the measured photon count at pixel $(x,y)$, $I(x,y,t)$ denotes the photon arrival intensity, and $N$ denotes the noise effect over the integration period. To recover temporal information from these aggregated measurements, we employ a temporal extraction function $H$ implemented as a deep learning model. This inversion step reconstructs multiple temporal frames from the compressed data: $(F_1(x,y), F_2(x,y), \ldots, F_n(x,y)) = H(F_m(x,y))$. Here, $H$ denotes our proposed deep learning model, and $F_i(x,y)$ denotes the reconstructed frame at time step $i$. We generate two frames from a single measurement in NETIS. This approach effectively compensates for the long hardware integration time by generating intermediate frames that fill in the missing temporal information. Despite its challenges, NETIS suits applications that demand high sensitivity and tolerate lower temporal resolution, especially under low photon flux.

    2. Equivalent Time Integration Strategy

    In ETIS, the hardware integration time THIT is shorter than or equal to the readout time. Reducing the integration period enables more accurate temporal information to be captured, but the readout circuitry limits the increase in frame rate. Photons arriving after the integration time but before the readout is complete cannot be counted, which introduces a significant inter-frame dead time.

    The measurement in ETIS for each frame $i$ can be expressed as $F_m^{(i)}(x,y) = N\big(\int_{t_i}^{t_i + T_{\mathrm{HIT}}} I(x,y,t)\,\mathrm{d}t\big)$, where $t_i$ denotes the start time of the $i$th integration period and $N$ denotes the noise effect in that frame. This strategy achieves a higher temporal sampling rate, enabling observation of fast-moving objects and dynamic scenes that appear blurred under NETIS temporal averaging. The trade-off lies in reduced photon counts per frame, which increases the relative impact of noise and may degrade image quality.

    To address these challenges we generated simulated datasets based on the physics-based temporal–spatial model for both integration strategies and designed a network architecture that improves temporal and spatial resolution. We generated simulated datasets from the UCF101 dataset [28], and the generation details are provided in Algorithm 1.

    Algorithm 1: SPAD Data Simulation

    1: Input: temporal dataset $I(x,y,t)$, illuminance $L$ (lux), calibrated dark count rate $D_c$, coded aperture mask $M(x,y)$, subframe number $N_s$
    2: Output: simulated measurement $Y(x,y,t)$
    3: In-frame dead-time effect: $N_{\mathrm{sub}} \leq T_{\mathrm{frame}} / (T_{\mathrm{syn}} + T_{\mathrm{dead}})$
    4: Calculate total photon flux: $P_f = (L \cdot A \cdot t \cdot \eta) / (h \cdot c)$
    5: Downsample spatial resolution from 128×64 to 64×32: $I_d(x,y,t) = \mathrm{Downsample}(I(x,y,t))$
    6: Scale temporal data using photon flux: $I_s(x,y,t) = P_f \cdot I_d(x,y,t)$
    7: Generate time-resolved counts with SPAD noise over $N_{\mathrm{sub}}$ subframes: $Y(x,y,t) = \mathrm{SPADNoise}(I_s(x,y,t), D_c, N_{\mathrm{sub}})$
    8: Return $Y(x,y,t)$
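
    A minimal Python sketch of Algorithm 1 is given below. The photon-flux conversion follows step 4 literally, the noise step here models only shot noise and dark counts (the fuller composition appears in Section 2.A.2), and all names and default values are our own assumptions.

```python
import numpy as np

H_PLANCK = 6.626e-34  # J*s
C_LIGHT = 2.998e8     # m/s
rng = np.random.default_rng(0)

def photon_flux(lux, area, t_int, eta):
    """Step 4: total photon budget per pixel, P_f = L*A*t*eta / (h*c)."""
    return lux * area * t_int * eta / (H_PLANCK * C_LIGHT)

def simulate_spad(video, p_flux, dcr, n_sub=65):
    """Steps 5-8: downsample, scale by photon flux, and generate noisy counts.

    video : normalized intensity sequence in [0, 1], shape (T, H, W) with even H, W
    p_flux: expected photons per pixel per frame (from photon_flux or set directly)
    dcr   : dark-count probability per subframe
    """
    t, h, w = video.shape
    # Step 5: 2x spatial downsampling by block averaging.
    down = video.reshape(t, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    # Step 6: scale to expected photons per subframe.
    mean_photons = p_flux * down / n_sub
    # Step 7: binary detection per subframe, accumulated over n_sub subframes.
    p_det = 1.0 - np.exp(-(mean_photons + dcr))
    return rng.binomial(n_sub, p_det)

# Example with a directly chosen photon budget (100 photons/pixel/frame).
counts = simulate_spad(np.random.rand(8, 128, 64), p_flux=100.0, dcr=1e-3)
print(counts.shape, counts.max())  # (8, 64, 32), counts never exceed n_sub
```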

    The two proposed strategies provide a clear trade-off between temporal resolution and signal quality. The ETIS is designed for dynamic scenes, using short integration times to provide high temporal resolution information. However, this approach results in low photon counts and a reduced signal-to-noise ratio (SNR). It is also more sensitive to the readout circuit. In contrast, the NETIS is optimized for static or slow-moving scenes. It uses long integration times to increase photon counts and SNR, at the expense of lower temporal resolution. Our models can adapt to different integration times and noise levels, providing high-quality reconstructions that achieve the balance between temporal resolution and image quality in SPAD imaging systems.

    C. Single-Photon Temporal and Spatial Resolution Network (SPTSR-Net) Architecture and Evaluation

    We propose the single-photon temporal and spatial resolution network (SPTSR-Net), a framework that jointly enhances temporal and spatial resolution in single-photon imaging. The network comprises a spatiotemporal encoder, a vision transformer backbone for feature refinement, and a U-Net decoder for image reconstruction, as shown in Fig. 2(a). The proposed SPTSR-Net takes single-photon frames of 64×64 pixels as input and outputs reconstructed frames of 128×128 pixels, thereby achieving a 2× spatial resolution enhancement.


    Figure 2. Qualitative comparison of single-photon temporal super-resolution. (a) Overall architecture of the proposed network. (b) PSNR and SSIM comparison across different methods. (c) Reconstruction results in the equivalent time integration strategy. (d) Reconstruction results in the non-equivalent time integration strategy.

    The temporal–spatial encoder takes a raw input $X \in \mathbb{R}^{B \times C \times T \times H \times W}$. Here, B, C, T, H, and W refer to the batch size, number of channels, temporal frames, image height, and image width, respectively.

    First, a 3D convolution with batch normalization and GELU activation extracts primary features: $F_1 = \mathrm{GELU}(\mathrm{BN}(\mathrm{Conv3D}(X)))$.

    Next, depthwise 3D convolution enriches temporal cues: $F_2 = \mathrm{DWConv3D}(F_1)$.

    Finally, a pointwise convolution block fuses channel information: $F_3 = \mathrm{PWConv}(\mathrm{GELU}(\mathrm{PWConv}(F_2)))$.

    A vision transformer refines these features. The encoder output is split into $N$ patches of dimension $D$: $F_p = \mathrm{PatchEmbed}(F_3) \in \mathbb{R}^{B \times N \times D}$.

    Each transformer block applies self-attention and a feed-forward network with residual connections: $F_{l+1} = F_l + \mathrm{DropPath}(\mathrm{Attn}(\mathrm{LN}(F_l)))$, followed by $F_{l+1} = F_{l+1} + \mathrm{DropPath}(\mathrm{FFN}(\mathrm{LN}(F_{l+1})))$.

    The U-Net decoder integrates skip connections and upsampled features to reconstruct the high-resolution frame. At decoder level $i$, upsampling proceeds as $G_i = \mathrm{ConvT}(F_{\mathrm{up}}^{i-1})$, $H_i = \mathrm{Concat}(F_{\mathrm{enc}}^{i}, F_{\mathrm{skip}}^{i}, G_i)$, and $F_{\mathrm{up}}^{i} = \mathrm{DoubleConv}(H_i)$.

    The final output is obtained by $Y = \sigma(\mathrm{Conv}(F_{\mathrm{final}}))$.
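
    The PyTorch sketch below traces this processing chain end to end. It is our own simplified re-implementation for illustration: channel widths, patch size, the handling of the temporal axis, and the decoder (which omits the U-Net skip connections and DropPath described above) are assumptions, not the released model.

```python
import torch
import torch.nn as nn

class TemporalSpatialEncoder(nn.Module):
    """3D conv -> depthwise 3D conv -> pointwise block, mirroring F1-F3."""
    def __init__(self, in_ch=1, ch=32):
        super().__init__()
        self.conv3d = nn.Conv3d(in_ch, ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(ch)
        self.dwconv = nn.Conv3d(ch, ch, kernel_size=3, padding=1, groups=ch)
        self.pw1 = nn.Conv3d(ch, ch, kernel_size=1)
        self.pw2 = nn.Conv3d(ch, ch, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x):                        # x: (B, C, T, H, W)
        f1 = self.act(self.bn(self.conv3d(x)))   # F1
        f2 = self.dwconv(f1)                     # F2
        return self.pw2(self.act(self.pw1(f2)))  # F3

class ViTBlock(nn.Module):
    """Pre-norm self-attention + FFN with residual connections."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f):                        # f: (B, N, D)
        h = self.ln1(f)
        f = f + self.attn(h, h, h, need_weights=False)[0]
        return f + self.ffn(self.ln2(f))

class SPTSRNetSketch(nn.Module):
    """Encoder -> transformer refinement -> upsampling decoder (2x spatial SR)."""
    def __init__(self, ch=32, dim=256, layers=4, out_frames=2):
        super().__init__()
        self.encoder = TemporalSpatialEncoder(ch=ch)
        self.embed = nn.Conv2d(ch, dim, kernel_size=4, stride=4)    # patch embedding
        self.blocks = nn.ModuleList([ViTBlock(dim) for _ in range(layers)])
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, ch, kernel_size=4, stride=4),   # undo patching
            nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2),    # 2x spatial upsampling
            nn.Conv2d(ch, out_frames, kernel_size=3, padding=1),
        )

    def forward(self, x):                        # x: (B, 1, T, H, W)
        f = self.encoder(x).mean(dim=2)          # collapse time: (B, ch, H, W)
        p = self.embed(f)                        # (B, D, H/4, W/4)
        b, d, hp, wp = p.shape
        tok = p.flatten(2).transpose(1, 2)       # (B, N, D) patch tokens
        for blk in self.blocks:
            tok = blk(tok)
        p = tok.transpose(1, 2).reshape(b, d, hp, wp)
        return torch.sigmoid(self.decoder(p))    # (B, out_frames, 2H, 2W)

# Example: two 64x64 single-photon frames in, two 128x128 frames out.
y = SPTSRNetSketch()(torch.rand(1, 1, 2, 64, 64))
print(y.shape)  # torch.Size([1, 2, 128, 128])
```

    Collapsing the temporal axis with a mean and omitting the skip connections keeps the sketch short; the actual network fuses encoded temporal features with a parallel raw-feature stream and feeds raw patch embeddings back into the transformer, as discussed in the ablation section.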

    Figure 2 presents qualitative simulation results. Owing to space constraints, we used different image samples in Figs. 2(c) and 2(d) rather than identical ones, which demonstrates the generalization of our ETIS and NETIS strategies across diverse reconstruction scenarios. Figure 2(c) shows reconstructions with ETIS, and Fig. 2(d) shows results with NETIS. SPTSR-Net effectively suppresses noise and preserves fine motion details. Quantitatively, SPTSR-Net achieves a peak SNR (PSNR) of 25.57 dB and an SSIM of 0.85, significantly outperforming baseline methods, as shown in Fig. 2(b). The insets highlight robustness in preserving intricate structures such as limb contours and ground textures. The model was trained for 150 epochs on a single NVIDIA RTX 4090 GPU.

    Inference Speed. All benchmarks were conducted on a single NVIDIA RTX 4090 GPU with mixed-precision inference enabled. For a 64×64 input, the ETIS architecture processed two temporal frames in 10 ms, corresponding to a throughput of 97 frames/s. Under identical conditions, the NETIS network processed a three-frame input in 8 ms per sample, achieving 125 frames/s. These results show that both networks meet real-time requirements, while NETIS reduces per-frame latency by approximately 25%.
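
    Latency numbers of this kind are typically measured with CUDA-event timing after a warm-up phase; the sketch below shows one such procedure (model and input shapes are assumed, mixed precision via autocast):

```python
import torch

def benchmark(model, x, runs=100, warmup=20):
    """Average GPU inference latency in milliseconds using CUDA events."""
    model = model.cuda().eval()
    x = x.cuda()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
        for _ in range(warmup):          # warm-up runs excluded from timing
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(runs):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / runs  # ms per forward pass

# e.g., benchmark(SPTSRNetSketch(), torch.rand(1, 1, 2, 64, 64))
```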

    Loss Functions. We adopt two complementary objectives that address different temporal assumptions. ETIS treats each frame pair symmetrically: $\mathcal{L}_{\mathrm{ETIS}} = \lambda_1 \lVert \hat{I} - I \rVert_1 + \lambda_2 (1 - \mathrm{SSIM}(\hat{I}, I)) + \lambda_3 \lVert \hat{I} - I \rVert_1$, with $(\lambda_1, \lambda_2, \lambda_3) = (1, 0.1, 0.05)$.

    NETIS distinguishes the current and future frames, adds a temporal consistency loss, and reduces motion blur via optical flow: $\mathcal{L} = \alpha \lVert \hat{I}_t - I_t \rVert_1 + \beta \lVert \hat{I}_{t+\Delta} - I_{t+\Delta} \rVert_1 + \eta \mathcal{L}_{\mathrm{flow}} + \delta \mathcal{L}_{\mathrm{div}}$, with $(\alpha, \beta, \eta, \delta) = (1, 1, 0.1, 0.05)$, where $\mathcal{L}_{\mathrm{flow}}$ warps $\hat{I}_{t+\Delta}$ back to time $t$ with the predicted flow and takes an L1 error against $\hat{I}_t$, and $\mathcal{L}_{\mathrm{div}}$ subtracts the mean absolute pixel difference between the two predicted frames.
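
    A PyTorch sketch of both objectives is given below. It reflects our reading of the equations above; the SSIM term assumes the third-party pytorch_msssim package, and the flow field is taken as a given input rather than predicted here.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed available)

def etis_loss(pred, target, w=(1.0, 0.1, 0.05)):
    """Symmetric ETIS objective; the third term repeats the L1 term as written in the text."""
    l1 = F.l1_loss(pred, target)
    return w[0] * l1 + w[1] * (1.0 - ssim(pred, target, data_range=1.0)) + w[2] * l1

def warp(img, flow):
    """Backward-warp img (B, C, H, W) with a dense flow field (B, 2, H, W) in pixels."""
    b, _, h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(img)   # base pixel coordinates (2, H, W)
    coords = grid.unsqueeze(0) + flow                     # displaced coordinates
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0               # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def netis_loss(pred_t, pred_t1, gt_t, gt_t1, flow, w=(1.0, 1.0, 0.1, 0.05)):
    """Asymmetric NETIS objective: per-frame L1, flow consistency, and diversity terms."""
    l_rec = w[0] * F.l1_loss(pred_t, gt_t) + w[1] * F.l1_loss(pred_t1, gt_t1)
    l_flow = F.l1_loss(warp(pred_t1, flow), pred_t)   # warp the future frame back to t
    l_div = -torch.mean(torch.abs(pred_t1 - pred_t))  # subtracted mean absolute difference
    return l_rec + w[2] * l_flow + w[3] * l_div
```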

    3. RESULTS

    All high-speed imaging experiments were performed with an MPD-SPC3 SPAD array with nanosecond-scale dead time and configurable integration timing. The specific timing parameters used in our experiments, such as the 20 ns reference clock period and the 50–150 ns selectable dead time range, are based on the manufacturer’s hardware specifications for this device. In the non-equivalent time integration strategy, frames are acquired in rapid succession with only a 20 ns interval, while in the equivalent time integration strategy, the exposure can be shortened below the readout time to capture fast dynamics at the expense of additional dead time. Using these strategies, we successfully captured and analyzed high-speed events.

    A. High-Speed Imaging and Temporal Enhancement of Fan Rotation

    To demonstrate the capabilities of our temporal super-resolution imaging enhancement techniques for capturing rapid mechanical motion, we performed an experiment involving a rotating fan, as illustrated in Fig. 3(a). This setup consists of a high-speed camera and an illumination source aimed at the rapidly rotating fan. Capturing the position and structure of the fan blades at high speeds usually results in noisy and blurred images, making accurate analysis challenging. We applied two temporal processing strategies to enhance the captured data.


    Figure 3. High-speed imaging of fan rotation with temporal enhancement. (a) Experimental setup. (b) Comparison of raw frames and reconstructions with ETIS and NETIS. Timestamps indicate the corresponding time. Angles indicate the measured fan orientation.

    The top row of Fig. 3(b) shows results with the ETIS method. Raw frames captured at specific time points, such as frame 0 at 0 μs and frame 1 at 10.4 μs, show significant noise due to the short exposure times. Although the shape of the fan is visible, the details remain blurred. The ETIS reconstruction, performed between the input frames at 5.2 μs, provides a clear improvement in image quality. Noise is greatly reduced, and the structure of the fan hub and blades appears much sharper. Quantitative tracking captures the change in relative angle from 70.99° to 65.52° over 10.4 μs in the first example and yields a reconstructed angle of 67.86°. The enhanced clarity of the reconstruction enables more precise angle measurements than the noisy raw frames allow.

    The bottom row of Fig. 3(b) shows results from the NETIS method. Raw frames recorded over a 10 μs interval display motion blur and noise. NETIS reconstructions captured at 5.2 μs and 10.4 μs suppress both noise and motion blur compared to the raw data and provide sharper details of the fan orientation. Measured angles of 71.34° at 5.2 μs and 67.35° at 10.4 μs in the first NETIS example contrast with 69.10° in the raw frame and demonstrate how the enhancement process extracts sharp temporal features from blurred input.

    Both ETIS and NETIS enhance the quality of high-speed video frames of the rotating fan. These improvements enable clearer visualization and more accurate measurement of the angular position in rapid mechanical dynamics occurring on microsecond timescales.

    B. High-Speed Imaging and Temporal Enhancement of Plasma Discharge

    To evaluate the performance of our temporal super-resolution imaging strategies on highly dynamic phenomena, we conducted experiments capturing the arc discharge within a plasma ball. The experimental setup, shown in Fig. 4(a), involved imaging the plasma ball with SPAD arrays. The plasma ball generates transient discharge patterns from the central electrode that are challenging for conventional imaging due to their rapid changes and complex structures. We employed two different temporal strategies to improve the reconstruction of the discharge dynamics shown in Fig. 4.


    Figure 4. High-speed imaging of plasma ball discharge with temporal super-resolution. (a) Experimental setup. (b) Raw frames and reconstructions with ETIS and NETIS. Timestamps denote the corresponding time.

    The top row of Fig. 4(b) shows the ETIS method results. Raw frames captured at frame 0 and frame 1, separated by 10.4 μs, show high noise levels and limited detail due to the short exposures. Frame 614 shows a very sparse signal, illustrating the photon-limited conditions when imaging such rapid events. The central reconstruction in each ETIS sequence is produced at 5.2 μs, between the input frames. These reconstructions reduce noise and sharpen spatial details, making the intricate branching of the plasma filaments visible and enabling effective observation of intermediate measurement results.

    The bottom row of Fig. 4(b) illustrates the NETIS method. Raw frames recorded during 0–10 μs serve as the input. The first enhanced image represents data integrated over 5.2 μs, providing a temporally super-resolved view of the discharge activity. The second enhanced image corresponds to 10.4 μs, providing a sharper snapshot at the end of that interval. Both enhanced images reduce noise while preserving the dynamic morphology of the filaments, providing a clearer view of the discharge.

    Both ETIS and NETIS address the issues of noise and low photon counts in raw, high-speed data. Their reconstructions provide much clearer visualizations of transient plasma discharge dynamics, enabling detailed analysis at the microsecond level.

    C. Temporal Super-Resolution Imaging of Fluorescence Quenching Dynamics

    We conducted an experiment on fluorescence quenching in dyed microspheres to demonstrate the capabilities of fast fluorescence imaging under photon-limited conditions. We observed this process with a microscope that guided emission light onto SPAD arrays, as shown in Fig. 5(a). Because the fluorescence decay was relatively slow, we used each set of 30 raw data intervals as our inputs and employed the equivalent time integration reconstruction method. Two frames were used to record the brightness decay, and our objective was to recover the continuous intensity drop with improved temporal resolution and reduced noise.


    Figure 5. Temporal–spatial super-resolution imaging of fluorescence quenching in microspheres. Here, F1 and F2 refer to the raw input frames Frame 1 and Frame 2, and “Recon.” denotes the reconstructed frame. (a) Experimental setup schematic. (b) Comparative analysis with an upper histogram showing F2–F1 Diff. from raw frames and Recon.–F1 Diff. from reconstructions. The lower plot tracks mean reconstructed intensity in regions A, B, and C for Recon. Frame 1, Recon. Frame 1.5, and Recon. Frame 2. (c) Raw frames from the SPAD arrays. (d) Recon. Frame 1 and Recon. Frame 2 are network outputs when Frame 1 and Frame 2 are used as inputs, respectively, and Recon. Frame 1.5 is reconstructed with both Frame 1 and Frame 2 as inputs. (e) Difference maps highlighting intensity change, with F2–F1 Diff. on the left and Recon.–F1 Diff. on the right.

    Figure 5(c) presents the raw data. Each frame shows the raw photon counts, demonstrating the fluorescence quenching process. After applying our temporal super-resolution technique, the corresponding reconstructed sequence is shown in Fig. 5(d). The first and second reconstructed frames correspond to the two measurements, while the middle frame is reconstructed from both. We effectively reduce noise, clearly reconstruct individual microspheres, and enable precise tracking of the fading signals in regions A, B, and C.

    The plots in Fig. 5(b) illustrate these results. The histogram compares pixel intensity change between the two sampling measurements for raw and reconstructed data. The reconstructed distribution is narrower but centered on the same mean, which shows that the method preserves the quenching process. The lower graph tracks average reconstructed intensity in regions A, B, and C across three relative time indices. Figure 5(e) complements this analysis. The difference map from the raw frames is noisy, whereas the map from the reconstructed frames clearly shows spatial patterns of diminishing fluorescence. Negative values indicate areas where quenching is strong, demonstrating that the method achieves temporal super-resolution in terms of both numerical values and visual clarity.

    In summary, our temporal super-resolution method significantly improves temporal resolution and reduces photon noise. This enables the reconstruction of dynamics and provides reliable statistics on fluorescence processes.

    D. Evaluation of Geometric Accuracy and Measurement Precision

    Figure 6 benchmarks the proposed method on three fundamental geometric reconstruction experiments, demonstrating improvements in temporal resolution between two synthetic noisy observations. The synthetic data refers to data generated through a physics-based simulation, rather than data captured from SPAD arrays. Figure 6(a) evaluates our method’s ability to perform temporal super-resolution for motion along the depth axis. In this synthetic experiment, we use the square’s vertical position in the image as a proxy for its depth, simulating the effect of it moving closer to or farther from the camera. For instance, given noisy inputs where the square’s position is recorded at 28 and 32 pixels from the border, our method accurately reconstructs the intermediate frame at its ground-truth position of 30 pixels.


    Figure 6. Evaluation of the proposed method on synthetic geometric shapes under severe noise. (a) Reconstructions of depth estimation for squares of varying sizes. (b) Reconstructions of translation for triangles with horizontally shifted bases. (c) Reconstructions of rotation for diamonds. Columns one, three, and five show noisy inputs with initial measurements, while columns two and four show reconstructions with sharper edges and better measurement accuracy.

    Figure 6(b) evaluates the accuracy of horizontal translation. Initial noisy measurements indicate the horizontal location at 74, 70, and 66 pixels. The reconstructed frames accurately refine the horizontal positions to 72 and 68 pixels, clearly demonstrating the capability of the method to enhance temporal resolution and precisely reconstruct horizontal translation.

    Figure 6(c) shows the accuracy of estimating rotation with diamond shapes. Initial noisy frames produce rotational angles of 65.64° and 78.53°. Our method accurately reconstructs these frames, providing corrected angles of 71.85° and 83.97°, respectively. This demonstrates the effectiveness of our approach in improving both temporal resolution and rotational accuracy in noisy conditions.

    E. Ablation Studies

    We conducted ablation experiments to evaluate the impact of transformer layer depth and temporal encoding strategies. Table 1 shows the performance measured by PSNR and SSIM. Four transformer configurations with varying layer depths and three temporal encoding methods were examined.

    Table 1. Results of Ablation Studies on Transformer Layer Depth and Temporal Encoding

    Ablation Aspect              Configuration   PSNR (dB)   SSIM
    Transformer layer depth      1 layer         25.0299     0.8362
                                 2 layers        25.2172     0.8411
                                 4 layers        25.5717     0.8489
                                 8 layers        25.5739     0.8500
    Temporal encoding strategy   3D encoder      25.5717     0.8489
                                 2D Conv.        25.5267     0.8466
                                 Reshape         25.3291     0.8448

    Increasing the number of transformer layers improved reconstruction quality, with the PSNR rising from 25.03 dB for one layer to 25.57 dB for four layers. The corresponding SSIM also improved from 0.84 to 0.85. Further increasing to eight layers only slightly enhanced performance, suggesting limited benefits from deeper architectures beyond four layers.

    Temporal encoding methods affected the results differently. The 3D encoder method produced the best results, achieving a PSNR of 25.57 dB and an SSIM of 0.85. The 2D convolution approach decreased the PSNR slightly to 25.53 dB, while the reshape method produced the lowest performance, with a PSNR of 25.33 dB. This demonstrates the effectiveness of explicit temporal modeling.

    We attribute the robustness of our network to three architectural principles designed to process spatial–temporal information. First, an efficient temporal encoder with depthwise separable convolutions captures temporal dynamics without excessive computational cost. Second, our core fusion encoder implements a dual-path strategy, fusing these encoded temporal features with a parallel stream of raw input features. This fusion ensures that both high-level motion information and low-level spatial details are preserved for the transformer blocks. Finally, within the transformer encoder, we introduce an additional long-range skip connection to continuously provide the initial raw patch embeddings. This persistent access to low-level information serves as a strong regularizer, crucial for reconstructing sharp edges and textures that may not be fully captured by global metrics. Together, these design choices create a robust architecture that produces high-fidelity reconstructions.

    In summary, the most effective model configuration is a combination of a four-layer transformer and the 3D temporal encoding strategy. This provides high-quality reconstruction while achieving the balance between complexity and performance.

    4. CONCLUSION AND DISCUSSION

    We present a physics-based temporal–spatial model combined with a transformer network trained on noise-calibrated simulations. We propose two integration strategies and achieve a 2× improvement in both temporal and spatial resolution. The average PSNR increases by 8.14 dB without any hardware modification. This approach tackles the challenge of the temporal cutoff in single-photon imaging, enabling enhanced fidelity and temporal and spatial resolution that significantly improve SPAD imaging performance. Our method could improve the performance of real-time single-photon microscopy [29], high-speed LiDAR [30,31], live-cell fluorescence imaging [32,33], and plasma diagnostics [34]. It increases robustness to SPAD array noise and provides a significant improvement over current reconstruction methods.

    However, this implementation relies on training sequences with calibrated statistics. Performance may degrade if the photon flux falls far below the calibrated range. Reconstruction also introduces computational overhead during high-resolution video processing. Furthermore, dependence on simulated data may result in certain complexities of real scenes being ignored, which could lead to differences in performance [35].

    Future work will refine the temporal–spatial integration strategy. We will investigate self-supervised learning using real photon data and extend the framework to multispectral, quantum, and autonomous sensing. Adaptive mechanisms that dynamically respond to fluctuating noise and photon flux will further enhance system robustness. Integrating the method with emerging SPAD technologies and diverse imaging platforms will broaden its applicability to multiple domains.

    References

    [1] C. Bruschini, H. Homulle, I. M. Antolovic. Single-photon avalanche diode imagers in biophotonics: review and outlook. Light Sci. Appl., 8, 87(2019).

    [2] G. Gariepy, F. Tonolini, R. Henderson. Detection and tracking of moving objects hidden from view. Nat. Photonics, 10, 23-26(2016).

    [3] J. Ma, S. Masoodian, D. A. Starkey. Photon-number-resolving megapixel image sensor at room temperature without avalanche gain. Optica, 4, 1474-1481(2017).

    [4] A. Kirmani, D. Venkatraman, D. Shin. First-photon imaging. Science, 343, 58-61(2014).

    [5] K. Zang, X. Jiang, Y. Huo. Silicon single-photon avalanche diodes with nano-structured light trapping. Nat. Commun., 8, 628(2017).

    [6] K. Morimoto, A. Ardelean, M.-L. Wu. Megapixel time-gated SPAD image sensor for 2D and 3D imaging applications. Optica, 7, 346-354(2020).

    [7] R. H. Hadfield, J. Leach, F. Fleming. Single-photon detection for long-range imaging and sensing. Optica, 10, 1124-1141(2023).

    [8] F. Piron, D. Morrison, M. R. Yuce. A review of single-photon avalanche diode time-of-flight imaging sensor arrays. IEEE Sens. J., 21, 12654-12666(2020).

    [9] O. Kumagai, J. Ohmachi, M. Matsumura. 7.3 a 189 × 600 back-illuminated stacked SPAD direct time-of-flight depth sensor for automotive LiDAR systems. IEEE International Solid-State Circuits Conference (ISSCC), 110-112(2021).

    [10] E. Slenders, M. Castello, M. Buttafava. Confocal-based fluorescence fluctuation spectroscopy with a SPAD array detector. Light Sci. Appl., 10, 31(2021).

    [11] H. Defienne, J. Zhao, E. Charbon. Full-field quantum imaging with a single-photon avalanche diode camera. Phys. Rev. A, 103, 042608(2021).

    [12] V. F. Gili, D. Dupish, A. Vega. Quantum ghost imaging based on a ‘looking back’ 2D SPAD array. Appl. Opt., 62, 3093-3099(2023).

    [13] E. Sarbazi, M. Safari, H. Haas. The impact of long dead time on the photocount distribution of SPAD receivers. IEEE Global Communications Conference (GLOBECOM), 1-6(2018).

    [14] Y. Albeck, E. Greenberg, T. Arusi-Parpar. Dead-time effect on SPAD detection efficiency. Appl. Opt., 64, 2163-2168(2025).

    [15] F. Malanga, G. Fratta, G. Acconcia. Integrated active quenching circuit for high-rate and distortionless SPAD-based time-resolved fluorescence applications. IEEE Trans. Biomed. Circuits Syst., 19, 442-453(2025).

    [16] A. Giudici, G. Acconcia, I. Labanca. 4 ns dead time with a fully integrated active quenching circuit driving a custom single photon avalanche diode. Rev. Sci. Instrum., 93, 043103(2022).

    [17] F. Severini, I. Cusini, F. Madonini. Spatially resolved event-driven 24 × 24 pixels SPAD imager with 100% duty cycle for low optical power quantum entanglement detection. IEEE J. Solid-State Circuits, 58, 2278-2287(2023).

    [18] L. Zheng, S. Luo, Y. Han. High-precision time resolution SPAD array readout circuit based on event-driven. IEEE J. Sel. Top. Quantum Electron., 31, 6700307(2025).

    [19] I. Vornicu, F. N. Bandi, R. Carmona-Galán. Compact macro-cell with OR pulse combining for low power digital-SiPM. IEEE Sens. J., 20, 12817-12826(2020).

    [20] I. Gyongy, A. T. Erdogan, N. A. W. Dutton. A direct time-of-flight image sensor with in-pixel surface detection and dynamic vision. IEEE J. Sel. Top. Quantum Electron., 30, 3800111(2024).

    [21] M. Choi, H. Kim, B. Han. Channel attention is all you need for video frame interpolation. Proceedings of the AAAI Conference on Artificial Intelligence, 10663-10671(2020).

    [22] L. Lu, R. Wu, H. Lin. Video frame interpolation with transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3532-3542(2022).

    [23] G. Zhang, Y. Zhu, H. Wang. Extracting motion and appearance via inter-frame attention for efficient video frame interpolation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5682-5692(2023).

    [24] J. Liang, J. Cao, G. Sun. SwinIR: image restoration using Swin transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, 1833-1844(2021).

    [25] M. Cheng, H. Ma, Q. Ma. Hybrid transformer and CNN attention network for stereo image super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1702-1711(2023).

    [26] L. Bian, H. Song, L. Peng. High-resolution single-photon imaging with physics-informed deep learning. Nat. Commun., 14, 5902(2023).

    [27] D. Bronzi, F. Villa, S. Tisa. SPAD figures of merit for photon-counting, photon-timing, and imaging applications: a review. IEEE Sens. J., 16, 3-12(2015).

    [28] K. Soomro, H. Idrees, M. Shah. UCF101: a dataset of 101 human action classes from videos in the wild. arXiv(2012).

    [29] E. Perego, S. Zappone, F. Castagnetti. Single-photon microscopy to study biomolecular condensates. Nat. Commun., 14, 8224(2023).

    [30] F. Villa, S. Tisa, F. Zappa. SPADs and SiPMS arrays for long-range high-speed light detection and ranging (LiDAR). Sensors, 21, 3839(2021).

    [31] X. Qian, W. Jiang, M. J. Deen. Single photon detectors for automotive LiDAR applications: state-of-the-art and research challenges. IEEE J. Sel. Top. Quantum Electron., 30, 3800520(2023).

    [32] J. L. Lagarto, F. Villa, S. Tisa. Real-time multispectral fluorescence lifetime imaging using single photon avalanche diode arrays. Sci. Rep., 10, 8116(2020).

    [33] P. Bruza, A. Petusseau, A. Ulku. Single-photon avalanche diode imaging sensor for subsurface fluorescence LiDAR. Optica, 8, 1126-1127(2021).

    [34] D. Faccio, G. Gariepy, G. S. Buller. SPAD array imaging and applications: from laser plasma diagnostics to tracking objects behind a wall. Imaging and Applied Optics, LM3D.3(2015).

    [35] S. Scholes, G. Mora-Martín, F. Zhu. Fundamental limits to depth imaging with single-photon detector array sensors. Sci. Rep., 13, 176(2023).
