
- Advanced Imaging
- Vol. 2, Issue 1, 011003 (2025)
1. Introduction
Scene captioning is an appealing visual-language task that involves understanding visual content, such as images or videos, and generating textual descriptions. While describing what they see is a natural task for most people, it is not trivial for machines[1], especially in video captioning (VC). For machines, a straightforward pipeline is “imaging-compression-reconstruction-and-then-captioning,” as shown in Fig. 1(a). Specifically, a high-definition (HD) video camera captures videos with high resolution in both the spatial and temporal domains, which are further compressed for efficient storage and transmission. Hence, recovering the original video frames is often necessary before generating captions[2].
Figure 1. Comparing our efficient captioning pipeline in (c) with the traditional (multi-stage) pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.
Although most VC methods[3,4] assume that the well-decompressed video is already available, they do not consider the drawbacks of the captioning step within the whole video processing pipeline. (i) Information redundancy: with increasing spatial and temporal resolutions, the captured raw videos and the reconstructed ones exhibit severe information redundancy, resulting in a heavy burden on storage and computation[5–7], as compared in Fig. 2. (ii) Information loss: to reduce the redundancy in the raw video, (near-)lossless software compression approaches are preferred. However, to handle temporal redundancy in the recovered video, existing VC approaches[8–10] often sample the video frames or feature maps to reduce computational costs, which may in turn discard key information, especially in fast-moving videos. (iii) Inefficiency: starting from the captured raw video, a long chain of processing steps must accumulate before the output caption is obtained. Meanwhile, the redundant information is “reduced-recovered-and-further-reduced” in the “compression-reconstruction-and-sampling” loop, which wastes computational resources across the whole pipeline.
Figure 2. Comparisons on GPU memory, inference time, and CIDEr score of typical VC methods, where red, blue, and yellow indicate our method, traditional multi-stage VC methods, and two-stage methods, respectively. The size of the circle is proportional to the CIDEr score (↑) marked in brackets.
To realize efficient VC and alleviate the computational and storage burden, this paper explores a novel pipeline that describes the scene directly from the data captured by an optical camera, i.e., without software-based compression or reconstruction before captioning. Two main questions arise: (i) how to efficiently obtain compressively sensed visual data of the live scene and (ii) how to build an end-to-end captioning model directly from the compressively sensed data.
To address the aforementioned challenges, we propose to incorporate a typical computational imaging technology, namely video snapshot compressive sensing (CS)[11,12], which physically obtains the compressed measurement during the imaging process. Concretely, let us consider a representative system, coded aperture compressive temporal imaging (CACTI)[12], as shown in Fig. 3. The optical instrument modulates the live scene via a set of dynamic masks, e.g., produced by a spatial light modulator (SLM)[12] or a digital micromirror device (DMD)[13], and then these frames are compressed into a two-dimensional (2D) snapshot measurement by a single exposure of the camera. For a video clip at 30 fps (frames per second), a CACTI system with a compression ratio of 20 requires a bandwidth of approximately 3.69 Mbps. In contrast, a traditional camera, as shown in Fig. 1(a), requires about 73.73 Mbps at the same frame rate and resolution, significantly higher than the CACTI system. Besides, content-based compression techniques such as H.264, H.265, and versatile video coding (VVC) can only start encoding after the HD video has been stored. Furthermore, the design of the SLM and the mechanical translations in CACTI systems lowers the operating-power demand[12]. Thus, video snapshot CS enjoys the advantages of low power for imaging sensors, low memory for storage, low bandwidth for transmission, etc.[2,14]
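For reference, these bandwidth figures are consistent with a raw 8-bit, 640 pixel × 480 pixel stream; since the exact resolution is not restated in this excerpt, the following back-of-the-envelope calculation should be read as an illustrative assumption rather than the paper's stated setting:

```latex
% Raw bandwidth of a conventional camera (assumed 640 x 480, 8 bit/pixel, 30 fps):
640 \times 480 \times 8~\text{bit} \times 30~\text{fps} \approx 73.73~\text{Mbps}
% CACTI transmits one snapshot per 20 frames (compression ratio 20):
\frac{73.73~\text{Mbps}}{20} \approx 3.69~\text{Mbps}
```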
Figure 3. An illustration of a typical video snapshot CS system, CACTI[12].
Then, given the coded measurements, different software decoding methods[5,7,13,15,16] can be employed to faithfully recover the video. Therefore, applying the two-stage strategy “reconstruction-and-then-captioning” (the yellow pipeline in Fig. 1) is a potential solution for coded measurements, but it still suffers from low efficiency (the yellow circles in Fig. 2), similar to traditional multi-stage VC methods.
To overcome this drawback and achieve efficient VC, we propose an end-to-end approach that operates directly on the measurement captured by video snapshot CS. This pipeline is technically feasible because supervised training data can be readily constructed: given the masks of the real optical system, we can accurately simulate the acquisition of the measurement (further introduced in Sec. 2) to build a large-scale training dataset composed of paired measurements, videos, and captions.
The final challenge now is to construct and train an end-to-end network in a supervised manner.
Nevertheless, this is not straightforward, as noted by previous works[17,18] and by our own attempts discussed in Sec. 4 (Table 2). This may be ascribed to the fact that, compared with high-quality videos, the captured measurement is heavily blurred, with fewer visual semantics and motion details, which greatly increases the difficulty of learning effective visual-language representations for caption generation. To break through these barriers, in this paper, we propose to build a teacher model whose knowledge is distilled to guide the learning of our end-to-end VC network. Specifically, as shown in Fig. 4, the teacher model focuses on extracting language-related visual features from the ground truth video with the help of a pretrained large vision-language model (VLM) based on contrastive language-image pretraining (CLIP)[4]. Therefore, the teacher model not only conveys spatial and temporal details from the ground truth video but also provides abundant prior knowledge from CLIP. With knowledge distillation (KD), the student model is able to produce a language-related latent representation, which is injected into a transformer-based decoder to generate the caption.
Figure 4. Learning and inference workflows of our proposed SnapCap. The cooperation of (a)–(c) is for training, and only (b) is needed for an end-to-end captioning during testing.
In a nutshell, the main contributions of this paper can be summarized as follows:
- •We propose a novel VC pipeline to realize efficient caption generation, directly from the data captured by video snapshot CS, without compression or reconstruction in the software processing phase. This work is also the first attempt at a reconstruction-free VC method based on the video snapshot CS technology.
- •We employ CLIP to construct a teacher model and utilize KD to guide the student model to learn language-related visual features, which are further fed into a transformer decoder for caption generation. The whole model is trained in an end-to-end manner.
- •Comprehensive experimental results on VC benchmarks demonstrate the efficiency and effectiveness of our proposed SnapCap, which achieves competitive VC scores compared with traditional multi-stage video-based captioning methods and runs substantially faster than two-stage approaches while producing much better caption results.
2. Preliminary and Related Works
2.1. Video Snapshot Compressive Sensing
Let us take a typical video CS system, CACTI[12], as an example. As shown in Fig. 3, we assume that the live scene consists of $B$ high-speed frames $\{X_b\}_{b=1}^{B}$, which are modulated by $B$ coding masks $\{C_b\}_{b=1}^{B}$.
Within one exposure time, the light reaching the sensor is integrated, which compresses these coded frames into a 2D measurement via summation, formally as
$$Y = \sum_{b=1}^{B} C_b \odot X_b + N,$$
where $\odot$ denotes the element-wise (Hadamard) product and $N$ is the measurement noise.
Therefore, given the coding masks of the real system, one can easily simulate the measurement using synthetic data, saving a significant amount of effort that would otherwise be required to capture a large amount of real data. In fact, the framework of training on simulation and testing on real data is widely used in methods developed for recovering the original high-speed frames from coded measurements[5,13,19–21]. A more thorough introduction to these methods can be found in Ref. [11].
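As a concrete illustration, the simulation described above can be sketched in a few lines of NumPy. This is a minimal sketch: the array shapes, mask statistics, and noise level are assumptions for illustration, not the settings used in the paper.

```python
import numpy as np

def simulate_measurement(frames: np.ndarray, masks: np.ndarray, noise_std: float = 0.01) -> np.ndarray:
    """Simulate a CACTI snapshot measurement.

    frames: (B, H, W) high-speed video frames X_b in [0, 1]
    masks:  (B, H, W) binary or grayscale coding masks C_b
    Returns a single 2D measurement Y of shape (H, W).
    """
    assert frames.shape == masks.shape
    coded = masks * frames                  # element-wise (Hadamard) modulation
    measurement = coded.sum(axis=0)         # integration over one exposure time
    measurement += noise_std * np.random.randn(*measurement.shape)  # sensor noise
    return measurement

# Example: B = 8 frames of size 256 x 256 compressed into one snapshot
frames = np.random.rand(8, 256, 256)
masks = (np.random.rand(8, 256, 256) > 0.5).astype(np.float64)
Y = simulate_measurement(frames, masks)
print(Y.shape)  # (256, 256)
```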
Recently, there has been a novel trend toward coupling video snapshot CS with high-level visual understanding tasks without recovering the original video. Hu et al.[22] realized video object detection directly from coded measurements using a deep convolutional neural network (CNN). For action recognition, Okawara et al.[18] constructed an end-to-end 3D-CNN model with coded measurements as input. Both methods show lower complexity and more efficient inference. However, their detection/recognition accuracy still falls behind that of methods using high-quality video. Compared with object detection and action recognition, VC is a more challenging task because, besides understanding visual content such as objects or actions, the VC model must also learn visual-language relations for cross-modality generation. Though challenging, we achieve performance comparable to most existing HD-video-based VC methods.
2.2. Video Captioning
In recent years, VC has attracted much attention from researchers aiming to understand and describe videos. Existing methods can be roughly classified into two groups: traditional methods and vision-language pretraining-inspired captioning methods. In the first group, most works[8,10,23,24] employ a 2D backbone, e.g., ResNet-101[25] or IncepResNetV2[26], to extract appearance information and a 3D network such as C3D[27] to extract motion features. Furthermore, the memory-based augmentation network (MAN)[28] introduces a memory mechanism in both the vision encoder and the language decoder to improve caption quality. Besides, considering that physical entities in the video often play vital roles in describing the scene, some researchers have proposed to incorporate an object detection module, achieving object-based VC. Specifically, STG-KD[29] designs an object branch to derive interaction information through a spatial-temporal graph and then distills it to the scene branch, where only the latter is used during evaluation. In the hierarchical modular network (HMN)[30], researchers introduce a hierarchical captioning framework that links vision representations and linguistic semantics via an entity module, a predicate module, and a sentence module. Some works also incorporate audio information[31] and knowledge graphs[9], or introduce reinforcement learning (RL)[32,33] for caption generation.
In the second group[34–37], researchers learn joint representations between images and texts or videos and texts by first pretraining on large-scale datasets, such as LAION-400M[38], HowTo100M[39], and WebVid-2.5M[40], and then fine-tuning the model on downstream tasks and datasets, or even performing zero-shot learning[41]. Among these works, CLIP4Caption[4] fine-tunes CLIP[42] via a video-text retrieval task to enhance CLIP’s vision and language representation abilities. Built on a frozen CLIP to extract visual information, IcoCap[43] proposes an image-video compounding strategy (ICS) to improve video content diversity and visual-semantic guided captioning (VGC) to achieve better captioning results. Recently, based on the H.264 codec, researchers proposed a decoding-free captioning method, CoCap[44], which generates language descriptions directly from the compressed domain.
2.3. Knowledge Distillation
KD[45] aims to transfer knowledge from a complex teacher model to a lightweight student model and has been widely explored in various applications, such as object detection[46,47], image recognition[48,49], image generation[50,51], and person re-identification[52]. Recently, an increasing number of works have used KD to transfer knowledge from large pretrained models to domain-specific ones for different tasks[53–55], achieving superior performance compared with training neural networks from scratch. In parameter-efficient and student-friendly knowledge distillation (PESF-KD)[56], researchers explore a student-friendly distillation strategy with smoother soft labels for efficient training. Beyond single-modality knowledge transfer, some researchers also propose to distill knowledge for cross-modality tasks based on semantically abundant data sources[57–59].
What we explore in this work is how to transfer the knowledge from the raw data (high-quality video) to the compressed data (coded measurement) via KD.
3. Methodology
To realize efficient captioning directly from the compressively sensed video snapshot captured by a computational camera, we propose a novel video snapshot captioning model, dubbed SnapCap, which generates descriptions without compression or reconstruction. In such a cross-modality generative task, the key is to extract language-related visual features that are further used for caption generation. Hence, our model consists of a visual extractor and a caption generator, whose structural details, as well as the learning and inference procedures, are introduced below.
3.1. Visual Encoder via Knowledge Distillation
Given the compressed measurement and its corresponding masks shown in Fig. 4(b), a straightforward way to obtain textual predictions is to train a captioning model like most VC methods[10,23,29] and then perform inference. However, the compressed data are heavily blurred and noisy, with far fewer details than HD video frames, so capturing effective visual features is very challenging and such a direct approach fails to yield satisfactory results[17,18]. Thanks to the accessible simulation (as introduced in Sec. 2.1), we can obtain abundant video-measurement data pairs and distill the knowledge from the video to the measurement. Hence, we build a teacher model to capture effective visual information from the ground truth video, which is employed to guide the feature extraction from the measurement, i.e., the student model.
Specifically, considering the vision-language association knowledge incorporated in the pretrained CLIP model[42], which is trained on large-scale image-text pairs[38], we apply the image encoder of CLIP to capture the information contained in the video, and this encoder serves as the teacher model. Nevertheless, given the large discrepancy between the inputs of the teacher and the student models, it is infeasible to directly copy the structure of the teacher model to the student. To solve this problem, we propose to map the video and the measurement into a shared latent space.
To be more specific, we pass the video frames through the first convolutional layer of the CLIP image encoder to obtain their feature maps in an efficient manner.
Then, considering that different CLIP structures use different parameter settings, we include a flexible two-layer convolutional operation after the encoder for feature map alignment.
In this manner, the feature maps of the video and the measurement are extracted into a shared latent space. Besides, the follow-up structure of the teacher model, e.g., the vision transformer (ViT) blocks in Fig. 4, can be copied to the student model as an initialization. With the two feature maps having the same dimension and holding similar semantic representations, we further extract language-related vision embeddings for the video and the measurement, respectively.
With such an efficient design, the abundant semantic information embodied in the video can be distilled into the measurement domain. Hence, a distillation loss is defined between the embeddings produced by the teacher model and the student model.
In addition to distilling knowledge from the videos through direct feature map alignment, inspired by SENN[60], treating the video as a regularization term can also help the encoder and the student model extract coherent semantics from the blurry measurement[15]. To this end, we design an efficient decoder, which mirrors the network architecture of the encoder, to recover videos from the latent representation, so that both the spatial and temporal details of the video can be conveyed to the measurement branch.
Both the distillation loss and the regularization term can help the student model to fully absorb the knowledge from the teacher model and obtain meaningful vision embeddings for captioning.
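Since the exact loss formulations appear as equations in the original paper and are not reproduced in this excerpt, the following PyTorch-style sketch only illustrates one plausible instantiation: an MSE-based distillation loss between teacher and student embeddings plus an MSE reconstruction regularization. The module names, loss forms, and weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def kd_and_reg_losses(video, measurement, masks,
                      teacher, student, encoder, decoder,
                      lambda_reg=1.0):
    """Sketch of the two knowledge-transfer objectives used to train the student.

    teacher : frozen CLIP image encoder operating on ground-truth frames
    student : ViT-style encoder operating on the encoded measurement
    encoder : maps (measurement, masks) into the shared latent space
    decoder : lightweight decoder reconstructing frames from the latent
    """
    with torch.no_grad():                      # the teacher is kept frozen
        z_teacher = teacher(video)             # language-related video embedding

    latent = encoder(measurement, masks)       # shared latent representation
    z_student = student(latent)                # embedding from the measurement

    # (1) feature-level distillation: pull student embeddings toward the teacher's
    loss_dis = F.mse_loss(z_student, z_teacher)

    # (2) regularization: the latent should also explain the original frames
    video_hat = decoder(latent)
    loss_reg = F.mse_loss(video_hat, video)

    return loss_dis + lambda_reg * loss_reg, loss_dis, loss_reg
```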
3.2. Caption Generator
After extracting the language-related visual representation from the student model, we design a transformer-based projector to map the vision embedding into the text space, from which a transformer-based language decoder generates the caption.
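The paper's exact projector and decoder configurations are not reproduced in this excerpt; the following is a minimal PyTorch sketch of the idea, i.e., a small transformer block that projects vision embeddings into the text embedding space before an auto-regressive language decoder consumes them. Layer counts and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    """Map language-related vision embeddings into the text space."""

    def __init__(self, vision_dim=768, text_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=vision_dim, nhead=num_heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_text = nn.Linear(vision_dim, text_dim)   # project to text space

    def forward(self, vision_tokens):            # (batch, num_tokens, vision_dim)
        x = self.blocks(vision_tokens)            # contextualize vision tokens
        return self.to_text(x)                    # (batch, num_tokens, text_dim)

# Usage: tokens from the student model are projected, then fed to the decoder
tokens = torch.randn(2, 50, 768)
text_ctx = VisionToTextProjector()(tokens)
print(text_ctx.shape)  # torch.Size([2, 50, 512])
```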
3.3. Learning and Inference
During training, given the original frames, we distill the knowledge from the video domain to the blurry coded-measurement domain via two objectives: treating the video as a regularization term and transferring the knowledge incorporated in the teacher model through the distillation process. Given the ground truth annotations, as in most previous VC works[8,23,30], we adopt a cross-entropy loss to supervise the caption learning process.
Taking a step further, considering that the optimization objective of including the videos as a regularization term is not exactly the same as that of feature map alignment, directly optimizing the parameters via the combined loss may cause convergence issues. To mitigate this, inspired by the masked auto-encoder (MAE), we first optimize the encoder and the decoder through the regularization loss. Then, without the reconstruction decoder, we update the parameters of the encoder, the student model, the projector, and the language decoder through the combined distillation and captioning losses.
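The two-step optimization schedule described above might be organized as in the following training-loop sketch. The optimizer choice, learning rate, and stage lengths are assumptions, and `caption_loss` stands in for the cross-entropy objective described above; this is not the authors' exact training script.

```python
import torch
import torch.nn.functional as F

def train_snapcap_sketch(loader, encoder, decoder, student, teacher,
                         projector, lang_decoder, caption_loss, epochs=(5, 30)):
    """Two-stage schedule: (1) reconstruction warm-up, (2) distillation + captioning."""
    # Stage 1: optimize the encoder and decoder with the regularization loss only.
    opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    for _ in range(epochs[0]):
        for video, measurement, masks, caption in loader:
            loss_reg = F.mse_loss(decoder(encoder(measurement, masks)), video)
            opt1.zero_grad()
            loss_reg.backward()
            opt1.step()

    # Stage 2: without the decoder, update encoder, student, projector, and
    # language decoder with the distillation loss plus the caption loss.
    params = (list(encoder.parameters()) + list(student.parameters())
              + list(projector.parameters()) + list(lang_decoder.parameters()))
    opt2 = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs[1]):
        for video, measurement, masks, caption in loader:
            with torch.no_grad():
                z_teacher = teacher(video)                 # frozen CLIP teacher
            z_student = student(encoder(measurement, masks))
            loss = F.mse_loss(z_student, z_teacher) \
                 + caption_loss(lang_decoder(projector(z_student)), caption)
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```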
As shown in Fig. 4(b), during the inference process, where only the coded measurement and masks are given, we input them into the encoder and the student model to perform a forward mapping and derive the language-related vision embedding.
Then the predicted caption is generated in an auto-regressive word-by-word manner.
In Algorithms 1 and 2, we provide the whole training and inference algorithms.
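Mirroring the inference algorithm, the measurement-only path can be sketched as follows. The module names, tokenizer interface, and greedy decoding loop are illustrative assumptions; the paper's decoder is transformer-based and auto-regressive, but its exact interface is not reproduced here.

```python
import torch

@torch.no_grad()
def snapcap_infer(measurement, masks, encoder, student, projector,
                  lang_decoder, tokenizer, max_len=30):
    """Generate a caption directly from a coded snapshot measurement."""
    latent = encoder(measurement, masks)       # Step 1: encode measurement + masks
    embedding = student(latent)                # Step 2: language-related vision embedding
    context = projector(embedding)             # Step 3: project into the text space

    # Step 4: greedy auto-regressive decoding, word by word
    tokens = [tokenizer.bos_token_id]
    for _ in range(max_len):
        logits = lang_decoder(context, torch.tensor([tokens]))
        next_token = int(logits[0, -1].argmax())
        if next_token == tokenizer.eos_token_id:
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens[1:])
```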
4. Experiments
In this section, we conduct experiments and report results to demonstrate the effectiveness of our proposed framework. We first detail some experimental settings including the datasets, compared methods, evaluation metrics, and devices. Then, we comprehensively evaluate the performance of our framework on both simulated coded measurements and real data. Finally, some ablation experiments are carried out to verify the roles of different components. Note that in all tables, we highlight the best results in boldface.
4.1. Experimental Settings
4.2. Comparison with VC Methods
To validate the effectiveness of our model, we conduct comparisons with state-of-the-art (SOTA) video-based captioning methods on both the MSRVTT and MSVD datasets. It should be noted that, given video frames, most SOTA methods employ one or more of spatial, motion, and detection features, and some further rely on external knowledge graphs or audio transcripts, to generate captions, which takes more inference time and consumes more storage. The quantitative results of these methods are listed in Table 1. We also report a TeaCap model, which consists of our teacher model, the pretrained CLIP ViT[42], and a language generator comprising a two-layer projector and a two-layer decoder. The parameters of the teacher model are frozen, while we train the language decoder with the caption loss in Eq. (15), similar to Ref. [4]. Compared with previous VC methods, our TeaCap exhibits competitive captioning results on both datasets.
Method | Input modality | MSRVTT |  |  |  | MSVD |  |  |  |
 |  | B | M | R | C | B | M | R | C |
Video frame-based methods | | | | | | | | | |
RecNet[23] | V | 39.1 | 26.6 | 59.3 | 42.7 | 52.3 | 34.1 | 69.8 | 80.3 |
SGN[8] | V + M | 40.8 | 28.3 | 60.8 | 49.5 | 52.8 | 35.5 | 72.9 | 94.3 |
HMN[30] | V + M + D | 43.5 | 29.0 | 62.7 | 51.5 | 59.2 | 37.7 | 75.1 | 104.0 |
CoCap[44] | V | 43.1 | 29.8 | 62.7 | 56.2 | 55.9 | 39.9 | 76.8 | 113.0 |
CoCap[44] | V (ViT-L/14) | 44.1 | 30.3 | 63.4 | 57.2 | 60.1 | 78.2 |  |  |
RSFD[31] | V + M + A | 43.4 | 29.3 | 62.3 | 53.1 | 51.2 | 35.7 | 72.9 | 96.7 |
IcoCap[43] | CLIP features | 47.0 | 31.1 | 64.9 | 60.2 | 59.1 | 39.5 | 76.5 | 110.3 |
Our TeaCap | V | 45.6 | 30.6 | 63.9 | 58.3 | 56.1 | 39.2 | 76.7 | 114.9 |
Coded measurement-based methods | | | | | | | | | |
Our SnapCap | Coded measurement | 44.2 | 30.1 | 63.2 | 56.7 | 54.9 | 38.2 | 75.4 | 108.9 |
Our SnapCap (ViT-L/14) | Coded measurement | 40.9 | 117.1 |  |  |  |  |  |  |
Table 1. A Comparison of the Proposed Efficient Measurement-Based Captioning and Different Video-Based VC Methods on MSRVTT and MSVD (B: BLEU-4, M: METEOR, R: ROUGE-L, C: CIDEr).
Furthermore, for our SnapCap model, we initialize the weights of the ViT blocks in the student model with pretrained CLIP ViT models. From Table 1, we find that our SnapCap achieves competitive performance compared with video frame-based methods. As shown by the training loss in Eq. (16), on the one hand, the student model distills knowledge from the teacher model (trained on a large-scale dataset); on the other hand, it also learns from the data via training.
Further, in Figs. 5 and 6, we visualize the coded measurement, video frames, and predicted descriptions by our SnapCap as well as the ground truth on MSRVTT[62] and MSVD[63] datasets, respectively, where our SnapCap is able to accurately describe the scene in language.
Figure 5. Qualitative results on the MSRVTT dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
Figure 6. Qualitative results on the MSVD dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.
4.3. Ablation Study
4.3.1. Comparison with two-stage methods
Given the coded measurement, a straightforward and intuitive way to obtain a caption is to reconstruct the frames first and then perform captioning. However, such a two-stage strategy typically consumes more time and computational resources, which poses a dilemma in resource-limited scenarios. In this part, we conduct experiments to demonstrate the superiority of our reconstruction-free, compression-free learning scheme in terms of inference speed, memory consumption, and captioning quality on the same 3090 GPU. For the two-stage methods, we load three pretrained reconstruction networks, BIRNAT[5], STFormer[6], and EfficientSCI[7], to first recover the video from the input measurements. Then, we apply a trained captioning model, TeaCap in Table 1, to generate descriptions. Hence, all these two-stage methods share the same captioning model. The results on the MSRVTT[62] dataset are listed in Table 2, where the inference time is averaged over the whole testing set with batch size 1. We also include the GPU usage of these methods as well as the reconstruction time (“Rec.” in the table) and caption time for comparison. It can be clearly seen that our proposed reconstruction-free video snapshot captioning model, SnapCap, has significant advantages in terms of inference speed. It also consumes less GPU memory yet achieves the best captioning performance among these methods.
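For reference, this kind of latency and memory comparison can be reproduced with a short measurement harness like the sketch below. It is a generic PyTorch profiling snippet, not the authors' exact script; `model` and `inputs` are placeholders for any of the compared pipelines.

```python
import time
import torch

def profile_inference(model, inputs, device="cuda", warmup=5, runs=50):
    """Average per-sample latency (batch size 1) and peak GPU memory in MB."""
    model = model.to(device).eval()
    inputs = [x.to(device) for x in inputs]
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up to exclude startup costs
            model(*inputs)
        torch.cuda.reset_peak_memory_stats(device)
        torch.cuda.synchronize(device)
        start = time.time()
        for _ in range(runs):
            model(*inputs)
        torch.cuda.synchronize(device)
    latency = (time.time() - start) / runs
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return latency, peak_mb
```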
Furthermore, in video CS systems such as CACTI, the compression ratio is a decisive factor in the quality of the videos recovered by software decoders. Usually, the smaller the compression ratio, the better the reconstruction quality, leading to better captioning performance. To evaluate the robustness of SnapCap, we conduct experiments with different compression ratios and report the CIDEr values in Fig. 7. From the figure, our SnapCap shows the least performance degradation as the compression ratio increases.
Figure 7. Caption quality (in terms of CIDEr value) comparison of different methods under multiple compression ratios.
1. Input the measurement and the masks into the encoder.
2. Input the latent representation of the measurement to the student model.
3. Input the visual embedding to the projector.
4. Generate the predicted caption word-by-word through the language decoder.
Algorithm 2. Inference Stage
4.3.2. Comparison with the traditional pipeline
As mentioned in Sec. 1 and shown in Fig. 1(a), traditional multi-stage VC methods[8] usually start from the “sampling & captioning” stage while ignoring the data acquisition, compression, and decoding phases. Here, we consider three representative VC methods based on the H.264 codec, as listed in the top part of Table 2. The CoCap[44] model generates the caption directly from H.264-compressed videos without decoding the original frames. Although it achieves results on the MSRVTT dataset competitive with our SnapCap using the same backbone network, CLIP ViT-B, it runs about 2 times longer than SnapCap. Moreover, this does not yet account for the resources consumed in obtaining the compressed videos from the live scene. In addition, for a fair comparison with CoCap[44], we also report the decoding time and captioning time of the traditional multi-stage VC methods HMN[30] and the refined semantic enhancement method towards frequency diffusion (RSFD)[31], where decoding is performed with the open-source package FFmpeg.
4.3.3. Effects of regularization and distillation
In Sec. 3, we introduce a novel VC pipeline that takes the compressed measurement and the masks as input to derive language-related vision features for captioning. During training, we propose to first optimize the encoder and the decoder through the regularization loss, and then update the parameters of the encoder, the student model, the projector, and the language decoder under the guidance of the teacher model as well as the captioning loss.
To demonstrate the effectiveness of the knowledge-transfer strategy through both the regularization manner and the direct feature map matching scheme, we conduct experiments by removing the regularization loss and the distillation loss step by step. In Table 3, we report the numerical results on the MSRVTT[62] dataset. It can be noted that, with neither loss, the model is almost unable to describe the scene; during training and inference, we observed a severe over-fitting problem. Hence, it is rather difficult to obtain meaningful features directly from the coded measurement, which was also observed in previous works[17,18]. In contrast, both proposed knowledge-transfer strategies effectively extract meaningful, language-related vision features for captioning.
Regularization loss | Distillation loss | B | M | R | C |
× | × | 24.7 | 21.7 | 52.0 | 16.8 |
√ | × | 32.1 | 22.6 | 55.6 | 29.3 |
× | √ | 33.0 | 24.9 | 57.0 | 31.6 |
√ | √ |  |  |  |  |
Table 3. Contributions of the Regularization Loss and the Distillation Loss on the MSRVTT Dataset.
4.4. Captioning on Real Data
Beyond simulation data, we also apply our framework to real data captured by CACTI systems. To be more specific, we test our model on two public color real snapshot compressive data, Ball Rotate[6] and Hammer[15], which were captured by Ref. [14], and four public grayscale real snapshot compressive data, Domino, Hand, Pendulum, and Water Balloon. The coded measurements and our predicted captions are presented in Figs. 8 and 9, respectively, where the reconstructions obtained by STFormer[6] (Ball Rotate and the four grayscale data) and BIRNAT[5] (Hammer) are also exhibited for reference. It can be clearly seen that our proposed VC pipeline is able to describe the scene accurately in language.
Figure 8. Comparison of captioning results (our model prediction and the two-stage model prediction) on two color real data. The top row shows Ball Rotate, and the bottom row shows Hammer. For better understanding, we also plot the reconstructed results of STFormer (top) and BIRNAT (bottom).
Figure 9. Comparison of captioning results (our model prediction and the two-stage model prediction) on four grayscale real data. From top to bottom: Domino, Hand, Pendulum, and Water Balloon. For better understanding, we also plot the reconstructed results of STFormer.
5. Limitations and Future Work
In this paper, we propose an efficient scene captioning model, SnapCap, to generate captions directly from the compressed measurement without reconstruction. Nevertheless, SnapCap has some limitations, noted as follows: (1) SnapCap only targets captioning, which is a sub-field of scene understanding, constraining the applications of SnapCap in real-world scenarios. (2) The CACTI system integrates light over several frames, which may result in motion blur for scenes with fast-moving objects and cause captioning degradation. (3) Under low-light conditions, it could be very hard for the CACTI system to integrate enough signal and output a usable measurement. Hence, SnapCap is not suitable for low-light environments.
Toward possible future work, we will explore the following. (1) Framework generalization: SnapCap is only used for scene captioning, which is a sub-field of scene understanding. Hence, we will explore extending SnapCap to related tasks, such as object detection and tracking in the scene. (2) Longer-duration and faster-changing scenes: in this paper, we verify the effectiveness of SnapCap on two short and relatively “static” video benchmarks, MSRVTT and MSVD, whose clips usually last less than 10 s with little motion variation. Therefore, we will continue applying SnapCap to longer video benchmarks with rapid motion variations. (3) Generalization to other computational imaging systems: SnapCap starts from a snapshot CS camera, which is one branch of computational imaging; other systems, such as single-pixel imaging and metasurface-based imaging, also exist. We will explore extending SnapCap to more computational imaging technologies.
6. Conclusion
In this paper, to achieve efficient VC without software-based compression or reconstruction, we propose a novel end-to-end framework to generate captions directly from the compressed measurement. Specifically, we employ a KD strategy with a pretrained large vision-language model, CLIP, to transfer the knowledge from the video domain to the measurement domain. In the experimental section, we compare our proposed SnapCap with traditional multi-stage VC methods and two-stage methods on two extensively used benchmarks. Both the quantitative and qualitative results demonstrate that our SnapCap is able to describe the scene efficiently and accurately. We also verify the feasibility of our model on real data.
References
[1] V. Ramanishka et al. Top-down visual saliency guided by captions, 7206-7215(2017).
[3] B. Yang et al. Non-autoregressive coarse-to-fine video captioning, 3119-3127(2021).
[4] M. Tang et al. Clip4caption: Clip for video caption, 4858-4862(2021).
[5] Z. Cheng et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging, 258-275(2020).
[7] L. Wang, M. Cao, X. Yuan. Efficientsci: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging, 18477-18486(2023).
[8] H. Ryu et al. Semantic grouping network for video captioning, 2514-2522(2021).
[9] X. Gu et al. Text with knowledge graph augmented transformer for video captioning, 18941-18951(2023).
[10] S. Chen, Y.-G. Jiang. Motion guided spatial attention for video captioning, 8191-8198(2019).
[13] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).
[14] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318-3325(2014).
[16] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539-2543(2016).
[19] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447-1457(2020).
[20] Z. Wu, J. Zhang, C. Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging, 4872-4881(2021).
[23] B. Wang et al. Reconstruction network for video captioning, 7622-7631(2018).
[24] S. Chen, Y.-G. Jiang. Motion guided region message passing for video captioning, 1543-1552(2021).
[25] K. He et al. Deep residual learning for image recognition, 770-778(2016).
[26] C. Szegedy et al. Inception-v4, inception-resnet and the impact of residual connections on learning(2017).
[27] D. Tran et al. Learning spatiotemporal features with 3d convolutional networks, 4489-4497(2015).
[29] B. Pan et al. Spatio-temporal graph for video captioning with knowledge distillation, 10870-10879(2020).
[30] H. Ye et al. Hierarchical modular network for video captioning, 17939-17948(2022).
[31] X. Zhong et al. Refined semantic enhancement towards frequency diffusion for video captioning, 3724-3732(2023).
[34] P. H. Seo et al. End-to-end generative pretraining for multimodal video captioning, 17959-17968(2022).
[35] J. Wang et al. GIT: a generative image-to-text transformer for vision and language(2022).
[36] H. Luo et al. Univl: a unified video and language pre-training model for multimodal understanding and generation(2020).
[37] H. Xu et al. mplug-2: a modularized multi-modal foundation model across text, image and video, 38728-38748(2023).
[38] C. Schuhmann et al. Laion-400m: open dataset of clip-filtered 400 million image-text pairs(2021).
[39] A. Miech et al. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2630-2640(2019).
[40] M. Bain et al. Frozen in time: a joint video and image encoder for end-to-end retrieval, 1728-1738(2021).
[41] Y. Tewel et al. Zero-shot video captioning by evolving pseudo-tokens(2023).
[42] A. Radford et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).
[44] Y. Shen et al. Accurate and fast compressed video captioning, 15558-15567(2023).
[45] X. Jiao et al. Tinybert: Distilling bert for natural language understanding(2019).
[48] K. Xu et al. Feature normalized knowledge distillation for image classification, 664-680(2020).
[50] X. Wang et al. Kdgan: Knowledge distillation with generative adversarial networks(2018).
[51] M. Li et al. Gan compression: efficient architectures for interactive conditional gans, 5284-5294(2020).
[53] J. Chen et al. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only, 699-710(2023).
[54] K. Wu et al. Tinyclip: Clip distillation via affinity mimicking and weight inheritance, 21970-21980(2023).
[55] J. Chang et al. Detrdistill: a universal knowledge distillation framework for detr-families, 6898-6908(2023).
[57] T. Zhang et al. Efficient RGB-T tracking via cross-modality distillation, 5404-5413(2023).
[58] S. Gupta et al. Cross modal distillation for supervision transfer, 2827-2836(2016).
[59] W. I. Cho et al. Speech to text adaptation: Towards an efficient cross-modal distillation(2020).
[60] D. Alvarez Melis, T. Jaakkola. Towards robust interpretability with self-explaining neural networks(2018).
[61] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding(2019).
[62] J. Xu et al. Msr-vtt: A large video description dataset for bridging video and language, 5288-5296(2016).
[63] D. Chen, W. B. Dolan. Collecting highly parallel data for paraphrase evaluation, 190-200(2011).
