• Advanced Imaging
  • Vol. 2, Issue 1, 011003 (2025)
Jianqiao Sun1, Yudi Su1, Hao Zhang1,*, Ziheng Cheng1, Zequn Zeng1, Zhengjue Wang2, Chunhui Qu1, Bo Chen1,*, and Xin Yuan3,*
Author Affiliations
  • 1National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, China
  • 2School of Telecommunications Engineering, Xidian University, Xi’an, China
  • 3School of Engineering, Westlake University, Hangzhou, China
    DOI: 10.3788/AI.2025.10021
    Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Chunhui Qu, Bo Chen, Xin Yuan, "SnapCap: efficient snapshot compressive scene captioning," Adv. Imaging 2, 011003 (2025)

    Abstract

    Describing a scene in language is a challenging multi-modal task, as it requires understanding various complex scenes and then transforming them into sentences. Among these scenes, the task of video captioning (VC) has attracted much attention from researchers. For machines, traditional VC follows the “imaging-compression-decoding-and-then-captioning” pipeline, where compression is pivotal for storage and transmission. However, such a pipeline has some inherent shortcomings, i.e., information redundancy, which results in low efficiency, and information loss during the sampling process for captioning. To address these problems, in this paper, we propose a novel VC pipeline to generate captions directly from the compressed measurement captured by a snapshot compressive sensing camera, and we dub our model SnapCap. To be more specific, benefiting from signal simulation, we have access to abundant measurement-video-annotation data pairs for our model. Besides, to better extract language-related visual representations from the compressed measurement, we propose to distill knowledge from videos via a pretrained model, contrastive language-image pretraining (CLIP), with plentiful language-vision associations to guide the learning of our SnapCap. To demonstrate the effectiveness of SnapCap, we conduct experiments on three widely used VC datasets. Both the qualitative and quantitative results verify the superiority of our pipeline over conventional VC pipelines.

    1. Introduction

    Scene captioning is an attractive visual-language task, involving understanding visual content such as images or videos, and generating textual descriptions. While describing what they see is a natural task for most people, it is not trivial for machines to do the same[1], especially in video captioning (VC). For machines, a straightforward pipeline is “imaging-compression-reconstruction-and-then-captioning,” as shown in Fig. 1(a). Specifically, a high-definition (HD) video camera captures videos with high resolution in both spatial and temporal domains, which are further compressed for efficient storage and transmission. Hence, recovering the original video frames is often necessary before generating captions[2].


    Figure 1. Comparing our efficient captioning pipeline in (c) with the traditional (multi-stage) pipeline in (a) and a potential two-stage solution in (b), indicated by red, blue, and yellow, respectively.

    Although most VC methods[3,4] assume that they have already obtained the well-decompressed video, they do not consider the potential drawbacks of the captioning step within the whole video processing pipeline. (i) Information redundancy: With increasing spatial and temporal resolutions, the captured raw videos and the reconstructed ones exhibit severe information redundancy, resulting in a heavy burden on storage and computation[5–7], as compared in Fig. 2. (ii) Information loss: To reduce the redundancy in the raw video, (near-)lossless software compression approaches are preferred. However, to handle temporal redundancy in the recovered video, existing VC approaches[8–10] often sample the video frames or video feature maps to reduce computational costs, which in turn may discard key information, especially in fast-moving videos. (iii) Low efficiency: Starting from the captured raw video, there is a long way to go to obtain the output caption, relying on the accumulated efforts of every step. The redundant information is “reduced-recovered-and-further-reduced” in the “compression-reconstruction-and-sampling” loop, which wastes computational resources throughout the pipeline.


    Figure 2. Comparisons on GPU memory, inference time, and CIDEr score of typical VC methods, where red, blue, and yellow indicate our method, traditional multi-stage VC methods, and two-stage methods, respectively. The size of the circle is proportional to the CIDEr score (↑) marked in brackets.

    To realize efficient VC and alleviate the computational and storage burden, this paper explores a novel pipeline that describes the scene directly from the data captured by an optical camera, i.e., without software-based compression or reconstruction prior to captioning. This raises two main questions: (i) how to efficiently obtain compressively sensed visual data of the live scene and (ii) how to build an end-to-end captioning model directly on the compressively sensed data.

    To address the aforementioned challenges, we propose to incorporate a typical computational imaging technology, namely video snapshot compressive sensing (CS)[11,12], which physically obtains the compressed measurement during the imaging process. Concretely, let us consider a representative system, coded aperture compressive temporal imaging (CACTI)[12], as shown in Fig. 3. The optical instrument modulates the live scene via a set of dynamic masks, e.g., produced by a spatial light modulator (SLM)[12] or a digital micromirror device (DMD)[13], and these frames are then compressed into a two-dimensional (2D) snapshot measurement by a single exposure of the camera. For a video clip at 30 fps (frames per second) and 640×480 resolution, a CACTI system requires a bandwidth of approximately 3.69 Mbps with a compression ratio of 20. In contrast, a traditional camera, as shown in Fig. 1(a), requires a bandwidth of about 73.73 Mbps, significantly higher than the CACTI system. Besides, content-based compression techniques such as H.264, H.265, and versatile video coding (VVC) can only perform video encoding after the HD video has been stored. Furthermore, the design of the SLM and the mechanical translations in CACTI decrease the demand for operating power[12]. Thus, video snapshot CS enjoys the advantages of low power for imaging sensors, low memory for storage, low bandwidth for transmission, etc.[2,14]
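    These bandwidth figures follow from simple arithmetic; a rough sanity check (assuming 8 bits per pixel and ignoring any container or readout overhead, which are our simplifications) is:

    ```python
    # Rough bandwidth check for a 640x480, 30 fps stream (8-bit pixels assumed).
    raw_bps = 640 * 480 * 30 * 8            # uncompressed stream: 73,728,000 bit/s ~= 73.73 Mbps
    cacti_bps = raw_bps / 20                # one snapshot per 20 frames -> ~3.69 Mbps
    print(raw_bps / 1e6, cacti_bps / 1e6)   # 73.728 3.6864
    ```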


    Figure 3. An illustration of a typical video snapshot CS system, CACTI[12].

    Then, given the coded measurements, various software decoding methods[5,7,13,15,16] can be employed to faithfully recover the video. Therefore, applying the two-stage strategy “reconstruction-and-then-captioning” (the yellow pipeline in Fig. 1) is a potential solution for coded measurements, but it still suffers from the low-efficiency problem (the yellow circles in Fig. 2), similar to traditional multi-stage VC methods.

    To overcome this drawback and achieve efficient VC, we propose an end-to-end approach that works directly on the measurement captured by video snapshot CS. This pipeline is technically feasible because supervised training data can be readily constructed. Given the masks of the real optical system, we can accurately simulate the acquisition of the measurement (further introduced in Sec. 2) to build a large-scale training dataset composed of paired measurements, videos, and captions.

    The final challenge now is to construct and train an end-to-end network in a supervised manner.

    Nevertheless, it is not an easy road, as noted by previous works[17,18] and our attempts discussed in Sec. 4 (Table 2). This may be ascribed to the fact that, compared with high-quality videos, the captured measurement is heavily blurred with fewer visual semantics and moving details, which greatly increases the difficulty of learning effective visual-language representations for caption generation. To break through these barriers, in this paper, we propose to build a teacher model whose knowledge is distilled to guide the learning of our end-to-end VC network. Specifically, as shown in Fig. 4, the teacher model focuses on extracting language-related visual features from the ground truth video with the help of a pretrained large vision-language model (VLM), contrastive language-image pretraining (CLIP)[4], which provides plentiful language-vision associations. Therefore, the teacher model not only conveys spatial and temporal details from the ground truth video but also provides abundant prior knowledge from CLIP. With knowledge distillation (KD), the student model is able to learn a language-related latent representation, which is injected into a transformer-based decoder to generate the caption.


    Figure 4. Learning and inference workflows of our proposed SnapCap. The cooperation of (a)–(c) is for training, and only (b) is needed for an end-to-end captioning during testing.

    In a nutshell, the main contributions of this paper can be summarized as follows:

    1. We propose a novel VC pipeline to realize efficient caption generation, directly from the data captured by video snapshot CS, without compression or reconstruction in the software processing phase. This work is also the first attempt at a reconstruction-free VC method based on the video snapshot CS technology.
    2. We employ CLIP to construct a teacher model and utilize KD to guide the student model to learn language-related visual features, which is further input into a transformer decoder for caption generation. The whole model is trained in an end-to-end manner.
    3. Comprehensive experimental results on VC benchmarks demonstrate the efficiency and effectiveness of our proposed SnapCap, which achieves competitive VC scores compared to traditional multi-stage video-based captioning methods and runs at least 3× faster than two-stage approaches while producing much better captions.

    2. Preliminary and Related Works

    2.1. Video Snapshot Compressive Sensing

    Let us take a typical video CS system, CACTI[12], as an example. As shown in Fig. 3, we assume that the live scene with $B$ high-speed frames $\{X_k \in \mathbb{R}^{H \times W}\}_{k=1}^{B}$ is modulated by $B$ coding masks $\{C_k \in \mathbb{R}^{H \times W}\}_{k=1}^{B}$.

    Within one exposure time, the light arriving at the sensor is integrated, thus compressing these coded frames and producing a 2D measurement $Y$ via summation, formally as
    $$Y=\sum_{k=1}^{B} X_k \odot C_k + N, \tag{1}$$
    where $\odot$ and $N$ denote the Hadamard (element-wise) product and the noise of the system, respectively. For color video CS systems, the light undergoes spectral sampling by the Bayer filter before it reaches the sensor. Consequently, considering the linear nature of this process, $X_k$ can be regarded as a mosaic frame.

    Therefore, given the coding masks of the real system, one can easily simulate the measurement $Y$ from synthetic data, saving a significant amount of the effort required to capture a large number of real data. In fact, the “train on simulation, test on real data” framework is widely used in methods developed for recovering the original high-speed frames from the coded measurement[5,13,19–21]. More introduction to these methods can be found in Ref. [11].
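    As a concrete illustration, Eq. (1) can be simulated in a few lines; the sketch below uses NumPy with randomly drawn binary masks standing in for the calibrated masks of a real system (all names and the noise level are illustrative assumptions):

    ```python
    import numpy as np

    def simulate_measurement(frames, masks, noise_std=0.0):
        """Simulate a snapshot measurement, Y = sum_k X_k * C_k + N, following Eq. (1).

        frames: (B, H, W) high-speed frames X_k
        masks:  (B, H, W) coding masks C_k
        """
        assert frames.shape == masks.shape
        measurement = (frames * masks).sum(axis=0)        # Hadamard product, then temporal summation
        if noise_std > 0:
            measurement += np.random.normal(0.0, noise_std, measurement.shape)
        return measurement

    # Toy example: B = 8 frames of size 256 x 256 with random binary masks.
    B, H, W = 8, 256, 256
    frames = np.random.rand(B, H, W).astype(np.float32)
    masks = (np.random.rand(B, H, W) > 0.5).astype(np.float32)
    Y = simulate_measurement(frames, masks, noise_std=0.01)   # (H, W) snapshot
    ```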

    Recently, there has been a novel trend toward coupling video snapshot CS with high-level visual understanding tasks, without recovering the original video. Hu et al.[22] realized video object detection directly from the coded measurement using a deep convolutional neural network (CNN). For action recognition, Okawara et al.[18] constructed an end-to-end 3D-CNN model with the coded measurement as input. Both methods show lower complexity and more efficient inference. However, their detection/recognition accuracy still falls behind methods using high-quality video. Compared with object detection and action recognition, VC is a more challenging task because, besides understanding visual contents such as objects or actions, the VC model must also learn visual-language relations for cross-modality generation. Though challenging, we have achieved performance comparable to most existing HD-video-based VC methods.

    2.2. Video Captioning

    In recent years, VC has attracted much attention from researchers aiming to understand and describe videos; existing methods can be roughly classified into two groups: traditional methods and vision-language pretraining-inspired captioning methods. In the first group, most works[8,10,23,24] employ a 2D backbone, e.g., ResNet-101[25] or Inception-ResNet-V2[26], to extract visual information, and a 3D network such as C3D[27] to extract motion features. Furthermore, the memory-based augmentation network (MAN)[28] introduces a memory mechanism in both the vision encoder and the language decoder to improve caption quality. Besides, considering that physical entities in the video often play vital roles in describing the scene, some researchers proposed to incorporate an object detection module, achieving object-based VC. Specifically, STG-KD[29] designs an object branch to derive interaction information through a spatial-temporal graph and then distills it to the scene branch, where only the latter is used during evaluation. In the hierarchical modular network (HMN)[30], researchers introduce a hierarchical captioning framework that links vision representations and linguistic semantics via an entity module, a predicate module, and a sentence module. Some works also incorporate audio information[31] and knowledge graphs[9], or introduce reinforcement learning (RL)[32,33] for caption generation.

    In the second group[34–37], researchers learn joint representations between images and texts or videos and texts by first pretraining on large-scale datasets, such as LAION-400M[38], HowTo100M[39], and WebVid-2.5M[40], and then fine-tuning the model on downstream tasks and datasets, or even performing zero-shot learning[41]. Among these works, CLIP4Caption[4] fine-tunes CLIP[42] through video-text retrieval to enhance CLIP’s vision and language representation abilities. Built on a frozen CLIP to extract visual information, IcoCap[43] proposes an image-video compounding strategy (ICS) to improve video content diversity and visual-semantic guided captioning (VGC) to achieve better captioning results. Recently, based on the H.264 codec, some researchers proposed a decoding-free captioning method, CoCap[44], which generates language descriptions directly in the compressed domain.

    2.3. Knowledge Distillation

    KD[45] aims to transfer knowledge from a complex teacher model to a lightweight student model and has been widely explored in various applications, such as object detection[46,47], image recognition[48,49], image generation[50,51], and person re-identification[52]. Recently, an increasing number of works have focused on using KD to transfer knowledge from large pretrained models to domain-specific ones for different tasks[53–55], achieving performance superior to traditional train-from-scratch neural networks. In parameter-efficient and student-friendly knowledge distillation (PESF-KD)[56], researchers explore a student-friendly distillation strategy with smoother soft labels for efficient training. Beyond single-modality knowledge transfer, some researchers also propose to distill knowledge for cross-modality tasks based on semantically abundant data sources[57–59].

    What we explore in this work is how to transfer the knowledge from the raw data (high-quality video) to the compressed data (coded measurement) via KD.

    3. Methodology

    To realize efficient captioning directly from the compressively sensed video snapshot captured by a computational camera, we propose a novel video snapshot captioning model, dubbed SnapCap, which generates descriptions without compression or reconstruction. In such a cross-modality generative task, the key is to extract language-related visual features that are further used for caption generation. Hence, our model consists of a visual extractor and a caption generator, whose architectural details, together with the learning and inference procedures, are introduced below.

    3.1. Visual Encoder via Knowledge Distillation

    Given the compressed measurement $Y$ and its corresponding masks $\{C_k \in \mathbb{R}^{H \times W}\}_{k=1}^{B}$ shown in Fig. 4(b), a straightforward method to obtain textual predictions is to train a captioning model like most VC methods[10,23,29] and then perform inference. However, owing to the fact that the compressed data $Y$ is always heavily blurry and noisy with far fewer details than HD video frames, such a direct manner fails to yield satisfactory results[17,18], and it is very challenging to capture effective visual features. Thanks to the accessible simulation (as introduced in Sec. 2.1), we can obtain abundant video-measurement data pairs and distill knowledge from the video to the measurement. Hence, we build a teacher model to capture effective visual information from the ground truth video, which is employed to guide the feature extraction from the measurement, i.e., the student model S(·).

    Specifically, considering the vision-language association knowledge incorporated in the pretrained model CLIP[42], which is trained on large-scale image-text pairs[38], we apply the image encoder of the CLIP to capture the information contained in the video, which is denoted as the teacher model T(·). Nevertheless, given that there is a large discrepancy between the inputs of the teacher and the student models, it is infeasible to directly copy the structure of the teacher model to the student one. To solve this problem, we propose to map the video and the measurement to a shared latent space.

    To be more specific, we feed the video frames into the first convolutional layer Conv1(·) of the CLIP image encoder to obtain the feature maps in an efficient manner as
    $$f_{conv}^{t}=\mathrm{Mean}\big(\mathrm{Conv1}(X_1,\ldots,X_B)\big)\in\mathbb{R}^{c\times h\times w}, \tag{2}$$
    where Mean(·) denotes the average pooling operation. Given that the measurement $Y$ is heavily blurred with fewer details, it is hard to extract meaningful semantics matching $f_{conv}^{t}$ with a single convolutional layer. Thus, we introduce an encoder f(·,·) consisting of multiple residual blocks to extract the latent representation from the measurement,
    $$f_{latent}=f(Y,C). \tag{3}$$

    Then, considering that different CLIP structures have various parameter settings, we include a flexible two-layer convolutional operation Conv2(·) after the encoder f(·,·) for feature map alignment,
    $$f_{conv}^{s}=\mathrm{Conv2}(f_{latent}),\quad f_{conv}^{s}\in\mathbb{R}^{c\times h\times w}. \tag{4}$$

    In this manner, the feature maps of the video and the measurement are extracted into a shared latent space. Besides, the follow-up structure of the teacher model, e.g., the “vision transformer (ViT) blocks” in Fig. 4, can be copied to the student model as an initialization. With $f_{conv}^{t}$ and $f_{conv}^{s}$ in the same dimension and holding similar semantic representations, we further extract language-related vision embeddings for the video and the measurement, respectively, which can be formulated as
    $$f^{t}=\mathrm{Mean}\big(T(X_1,\ldots,X_B)\big)\in\mathbb{R}^{d}, \tag{5}$$
    $$f^{s}=S(Y,C)\in\mathbb{R}^{d}. \tag{6}$$

    With such an efficient design, the abundant semantic information embodied in the video can be distilled into the measurement. Hence, the distillation loss between the teacher model and the student model can be written as
    $$\mathcal{L}_{conv}=\mathcal{L}_{MSE}\big(f_{conv}^{s},f_{conv}^{t}\big), \tag{7}$$
    $$\mathcal{L}_{emb}=\mathcal{L}_{MSE}\big(f^{s},f^{t}\big), \tag{8}$$
    $$\mathcal{L}_{dis}=\mathcal{L}_{conv}+\alpha\mathcal{L}_{emb}, \tag{9}$$
    where $\mathcal{L}_{MSE}$ is the mean-square-error distance between two terms and $\alpha$ is a coefficient.
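    A minimal PyTorch-style sketch of this distillation objective is given below; clip_conv1, teacher_T, encoder_f, conv2, and student_S are placeholder modules for Conv1(·), T(·), f(·,·), Conv2(·), and S(·) (the names and tensor shapes are our assumptions, not the authors' released code):

    ```python
    import torch
    import torch.nn.functional as F

    def distillation_losses(video, Y, C, clip_conv1, teacher_T, encoder_f, conv2, student_S, alpha=0.001):
        """Sketch of the distillation branch; `video` is (batch, B, 3, H, W), Y/C are measurement and masks."""
        with torch.no_grad():                            # the CLIP-based teacher is frozen
            f_conv_t = clip_conv1(video).mean(dim=1)     # (batch, c, h, w): per-frame Conv1 maps, averaged
            f_t = teacher_T(video).mean(dim=1)           # (batch, d): per-frame embeddings, averaged

        f_latent = encoder_f(Y, C)                       # latent representation of the measurement
        f_conv_s = conv2(f_latent)                       # (batch, c, h, w): aligned student feature map
        f_s = student_S(Y, C)                            # (batch, d): language-related vision embedding

        L_conv = F.mse_loss(f_conv_s, f_conv_t)          # feature-map alignment
        L_emb = F.mse_loss(f_s, f_t)                     # embedding alignment
        L_dis = L_conv + alpha * L_emb                   # distillation loss
        return L_dis, f_latent, f_s
    ```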

    In addition to distilling knowledge from the videos through direct feature map alignment, inspired by SENN[60], treating the video as a regularization term can also help f(·,·) and Conv2(·) extract coherent semantics from the blurry measurement[15]. To this end, we design an efficient decoder g(·), which mirrors the network architecture of f(·,·), to recover videos from the latent representation $f_{latent}$ so that both the spatial and temporal details of the video can be conveyed to the measurement, formulated as
    $$\hat{X}=g\big(f(Y,C)\big),\quad \mathcal{L}_{reg}=\mathcal{L}_{1}\big(\hat{X},X\big). \tag{10}$$
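    The regularization branch admits an equally short sketch, where decoder_g is a placeholder for g(·) mirroring the encoder:

    ```python
    import torch.nn.functional as F

    def regularization_loss(f_latent, video, decoder_g):
        """Recover frames from the shared latent representation and penalize the L1 error (Eq. (10))."""
        video_hat = decoder_g(f_latent)       # \hat{X} = g(f(Y, C))
        return F.l1_loss(video_hat, video)    # L_reg
    ```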

    Both the distillation loss and the regularization term can help the student model to fully absorb the knowledge from the teacher model and obtain meaningful vision embeddings for captioning.

    3.2. Caption Generator

    After extracting the language-related visual representation $f^{s}$ from the student model S(·), we design a transformer-based projector Proj(·) to map the vision embedding to the text space,
    $$t=\mathrm{Proj}(f^{s}),\quad t\in\mathbb{R}^{D}, \tag{11}$$
    where $D$ is the dimension of the text embedding space. At position $i$ of the sentence, the word can be generated as
    $$c_{<i}=\mathrm{PLM}(y_{<i}), \tag{12}$$
    $$z_{i}=\mathrm{Concat}(t,c_{<i}), \tag{13}$$
    $$p(y_{i})=\mathrm{Dec}(z_{i}), \tag{14}$$
    where $y_{<i}$ represents the generated words before position $i$, PLM(·) denotes a pretrained language model (PLM) such as BERT[61] that maps the words into the embedding space, Concat(·) is concatenation, and Dec(·) is a transformer-based language decoder that generates $y_{i}$.
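    One autoregressive decoding step of Eqs. (11)-(14) can be sketched as follows, where proj, plm, and dec are placeholders for Proj(·), the pretrained language model, and Dec(·), and the tensor shapes are assumptions for illustration:

    ```python
    import torch

    def decode_step(f_s, prev_token_ids, proj, plm, dec):
        """Predict the distribution over the next word at position i.

        f_s:            (batch, d) language-related vision embedding from the student model
        prev_token_ids: (batch, i) token ids of the words generated so far
        """
        t = proj(f_s).unsqueeze(1)                # (batch, 1, D): vision embedding mapped to text space
        c_prev = plm(prev_token_ids)              # (batch, i, D): word embeddings from the PLM (e.g., BERT)
        z = torch.cat([t, c_prev], dim=1)         # concatenate the vision prefix and word embeddings
        logits = dec(z)[:, -1, :]                 # (batch, vocab): distribution for position i
        return logits
    ```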

    3.3. Learning and Inference

    During training, given the original frames, we distill the knowledge from the video domain to the blurry coded measurement domain via two objectives: treating the video as a regularization term, $\mathcal{L}_{reg}$, and transferring the knowledge incorporated in the teacher model through the distillation losses $\mathcal{L}_{conv}$ and $\mathcal{L}_{emb}$. Given the ground truth annotations $y_{1:L}^{*}$, as in most previous VC works[8,23,30], we adopt the cross-entropy loss to supervise the learning process:
    $$\mathcal{L}_{cap}=-\sum_{i=1}^{L}\log p\big(y_{i}^{*}\,|\,f^{s},y_{<i}^{*}\big), \tag{15}$$
    where $L$ is the length of the prediction.
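    Given the decoder outputs, Eq. (15) reduces to a standard token-level cross-entropy; a minimal sketch with assumed tensor shapes is:

    ```python
    import torch.nn.functional as F

    def caption_loss(logits, target_ids):
        """Cross-entropy caption loss.

        logits:     (batch, L, vocab) next-word distributions produced by Dec(·)
        target_ids: (batch, L) ground truth word indices y*_{1:L}
        """
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    ```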

    Taking a step further, considering that the optimization objective of including the videos as a regularization term is not exactly the same as that of feature map alignment, directly optimizing the parameters via a combined loss may cause convergence issues. To mitigate this, inspired by the masked auto-encoder (MAE), we propose to first optimize the encoder f(·,·) and the decoder g(·) through $\mathcal{L}_{reg}$. Then, without the involvement of g(·), we update the parameters of the encoder f(·,·), student model S(·), projector Proj(·), and language decoder Dec(·) through the loss function
    $$\mathcal{L}_{total}=\mathcal{L}_{dis}+\beta\mathcal{L}_{cap}, \tag{16}$$
    where $\beta$ is another coefficient. As suggested by previous works[4,44], we employ a transformer architecture for Dec(·).
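    The two-phase optimization described above can be summarized by the following sketch; the dataloader and all modules are placeholders, and reg_loss_fn, dis_loss_fn, and cap_loss_fn are assumed wrappers around the loss sketches given earlier (epoch counts and learning rates follow Sec. 4.1 only nominally):

    ```python
    import torch

    def train_snapcap(loader, encoder_f, decoder_g, student_S, proj, dec,
                      reg_loss_fn, dis_loss_fn, cap_loss_fn, beta=1.0):
        # Phase 1: optimize f(.,.) and g(.) with the reconstruction regularizer L_reg only (Eq. (10)).
        opt1 = torch.optim.AdamW(list(encoder_f.parameters()) + list(decoder_g.parameters()), lr=3e-4)
        for _ in range(30):                                # learning rate decays from 3e-4 to 1e-6 in practice
            for video, Y, C, captions in loader:
                loss = reg_loss_fn(encoder_f(Y, C), video, decoder_g)
                opt1.zero_grad()
                loss.backward()
                opt1.step()

        # Phase 2: drop g(.), train f(.,.), S(.), Proj(.), and Dec(.) with L_dis + beta * L_cap (Eq. (16)).
        params = (list(encoder_f.parameters()) + list(student_S.parameters())
                  + list(proj.parameters()) + list(dec.parameters()))
        opt2 = torch.optim.AdamW(params, lr=3e-4)
        for _ in range(30):
            for video, Y, C, captions in loader:
                L_dis, _, f_s = dis_loss_fn(video, Y, C)   # distillation loss from the teacher
                L_cap = cap_loss_fn(f_s, captions)         # caption cross-entropy over decoder outputs, Eq. (15)
                loss = L_dis + beta * L_cap                # total loss, Eq. (16)
                opt2.zero_grad()
                loss.backward()
                opt2.step()
    ```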

    As shown in Fig. 4(b), during the inference process, where only the coded measurement $Y$ and masks $C$ are given, we input them into the encoder and the student model to perform the forward mapping and derive the language-related vision embedding $f^{s}=S(Y,C)$.

    Then the predicted caption is generated in an auto-regressive word-by-word manner.
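    A greedy decoding loop for this inference stage might look like the following sketch, reusing the decode_step sketch from Sec. 3.2 (the BOS/EOS token ids and maximum length are hypothetical):

    ```python
    import torch

    @torch.no_grad()
    def caption_measurement(Y, C, student_S, proj, plm, dec, bos_id=0, eos_id=1, max_len=20):
        f_s = student_S(Y, C)                                    # language-related vision embedding
        tokens = torch.full((Y.shape[0], 1), bos_id, dtype=torch.long, device=Y.device)
        for _ in range(max_len):
            logits = decode_step(f_s, tokens, proj, plm, dec)    # next-word distribution (Sec. 3.2 sketch)
            next_token = logits.argmax(dim=-1, keepdim=True)     # greedy word choice
            tokens = torch.cat([tokens, next_token], dim=1)
            if (next_token == eos_id).all():                     # stop once every sequence emits EOS
                break
        return tokens                                            # token ids to be detokenized into a sentence
    ```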

    In Algorithms 1 and 2, we provide the whole training and inference algorithms.


    4. Experiments

    In this section, we conduct experiments and report results to demonstrate the effectiveness of our proposed framework. We first detail some experimental settings including the datasets, compared methods, evaluation metrics, and devices. Then, we comprehensively evaluate the performance of our framework on both simulated coded measurements and real data. Finally, some ablation experiments are carried out to verify the roles of different components. Note that in all tables, we highlight the best results in boldface.

    4.1. Experimental Settings

    Datasets: We conduct experiments on Microsoft Research video-to-text (MSRVTT)[62] and the Microsoft Research Video Description corpus (MSVD)[63], two extensively used VC datasets. Specifically, the MSRVTT dataset consists of 10K video clips with 20 captions per video, which are split into 6513 training, 497 validation, and 2990 testing samples, following previous works[30]. For the MSVD dataset, we split it into 1200 training, 100 validation, and 670 testing videos, following previous works[30].

    Evaluation Metrics: Following previous VC works[30], we use BLEU@4 (B), METEOR (M), ROUGE (R), and CIDEr (C) as the evaluation metrics using the public tool.

    Measurement Simulation: Considering that no public benchmarks have been introduced to evaluate our method so far, we synthesize the coded measurements on MSRVTT and MSVD. Specifically, for a given scene, a measurement is generated by compressing and integrating every $B$ high-speed frames using the coding masks $\{C_k\}_{k=1}^{B}$, as defined in Eq. (1).

    Implementation Details: All experiments are conducted on a workstation with an Intel i7 CPU (16 cores, 2.50 GHz) and an NVIDIA GeForce RTX 3090 GPU. Similar to previous works[31,44], each video is segmented into T=12 clips, from which we generate T=12 measurements with the compression ratio B=8 as the model input. As described in Sec. 3.3, we first optimize the parameters of f(·,·) and g(·) through the regularization in Eq. (10) with the AdamW optimizer, using an initial learning rate of $3\times10^{-4}$ for the first 10 epochs, decayed to $1\times10^{-6}$ over the remaining 20 epochs. Second, to better extract meaningful and useful visual features for generating captions, we use an AdamW optimizer with a learning rate of $3\times10^{-4}$ for 30 epochs to train the student model and the caption generator through the loss function in Eq. (16), where the coefficients α and β are set to 0.001 and 1, respectively. The number of layers for the transformer-based projector Proj(·) and decoder Dec(·) is set to 2.
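    As an illustration of how the T = 12 measurements are produced from one video, the sketch below samples T x B frames, groups them into clips, and applies Eq. (1) to each clip; the uniform frame sampling is our assumption for illustration, not necessarily the exact segmentation protocol:

    ```python
    import numpy as np

    def video_to_measurements(frames, masks, T=12, B=8):
        """frames: (N, H, W) video frames; masks: (B, H, W) coding masks shared across clips."""
        idx = np.linspace(0, len(frames) - 1, T * B).astype(int)   # uniformly sample T*B frames
        clips = frames[idx].reshape(T, B, *frames.shape[1:])       # (T, B, H, W) clips
        return (clips * masks[None]).sum(axis=1)                   # (T, H, W): one snapshot per clip, Eq. (1)
    ```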

    4.2. Comparison with VC Methods

    To validate the effectiveness of our model, we compare it with state-of-the-art (SOTA) video-based captioning methods on both the MSRVTT and MSVD datasets. It should be noted that, given video frames, most SOTA methods employ one or more of spatial, motion, and detection features, and sometimes others, e.g., external knowledge graphs and audio transcripts, to generate captions, which takes more time for inference and consumes more storage. The quantitative results of these methods are listed in Table 1. We also adopt a TeaCap model, which consists of our teacher model, the pretrained CLIP ViT[42], and a language generator comprising a two-layer projector and a two-layer decoder. The parameters of the teacher model are frozen, while we train the language decoder through the caption loss in Eq. (15), similar to Ref. [4]. Compared to previous VC methods, our TeaCap exhibits competitive captioning results on both datasets.

    Method                    Input modality       MSRVTT                     MSVD
                                                   B     M     R     C        B     M     R     C
    Video frame-based methods
    Recent[23]                V^a                  39.1  26.6  59.3  42.7     52.3  34.1  69.8  80.3
    SGN[8]                    V + M                40.8  28.3  60.8  49.5     52.8  35.5  72.9  94.3
    HMN[30]                   V + M + D            43.5  29.0  62.7  51.5     59.2  37.7  75.1  104.0
    CoCap[44]                 V                    43.1  29.8  62.7  56.2     55.9  39.9  76.8  113.0
    CoCap[44]                 V (ViT-L/14)         44.1  30.3  63.4  57.2     60.1  41.4  78.2  121.5
    RSFD[31]                  V + M + A            43.4  29.3  62.3  53.1     51.2  35.7  72.9  96.7
    IcoCap[43]                CLIP features        47.0  31.1  64.9  60.2     59.1  39.5  76.5  110.3
    Our TeaCap^b              V                    45.6  30.6  63.9  58.3     56.1  39.2  76.7  114.9
    Coded measurement-based methods
    Our SnapCap               Coded measurement    44.2  30.1  63.2  56.7     54.9  38.2  75.4  108.9
    Our SnapCap (ViT-L/14)    Coded measurement    47.2  31.1  65.1  60.5     60.3  40.9  78.8  117.1

    Table 1. A Comparison of Proposed Efficient Measurement-Based Captioning and Different Video-Based VC Methods on MSRVTT and MSVD.

    Further, for our SnapCap model, we initialize the weights of the ViT blocks in the student model with pretrained CLIP ViT models. From Table 1, we can see that our SnapCap achieves performance competitive with video frame-based methods. As indicated by the training loss in Eq. (16), on the one hand, the student model distills knowledge from the teacher model (trained on a large-scale dataset); on the other hand, it can also learn knowledge from the data via training.

    Further, in Figs. 5 and 6, we visualize the coded measurement, video frames, and predicted descriptions by our SnapCap as well as the ground truth on MSRVTT[62] and MSVD[63] datasets, respectively, where our SnapCap is able to accurately describe the scene in language.


    Figure 5. Qualitative results on the MSRVTT dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.


    Figure 6. Qualitative results on the MSVD dataset. We exhibit the compressed measurement, predicted caption by SnapCap, and the ground truth annotations. For a better understanding, we also show the ground truth video frames.

    4.3. Ablation Study

    4.3.1. Comparison with two-stage methods

    Given the coded measurement, a straightforward and intuitive way to perform captioning is to reconstruct the frames first and then caption them. However, such a two-stage strategy typically consumes more time and computational resources, which is problematic in resource-limited scenarios. In this part, we conduct experiments to demonstrate the superiority of our reconstruction-free, compression-free learning scheme in terms of inference speed, memory consumption, and captioning quality on the same 3090 GPU. For the two-stage methods, we load three pretrained reconstruction networks, BIRNAT[5], STFormer[6], and EfficientSCI[7], to first recover the frames from the T=12 input measurements. Then, we apply a trained captioning model, TeaCap in Table 1, to generate descriptions. Hence, all these two-stage methods share the same captioning model. The results on the MSRVTT[62] dataset are listed in Table 2, where the inference time is averaged over the whole testing set with batch size 1. We also include the GPU usage of these methods as well as the reconstruction time (“Rec.” in the table) and caption time for comparison. It can be clearly seen that our proposed reconstruction-free video snapshot captioning model, SnapCap, has significant advantages in inference speed; it also consumes less GPU memory while achieving the best captioning performance among these methods.

    Further, in video CS systems such as CACTI, the compression ratio B is a determining factor in the quality of the videos recovered by software decoders. Usually, the smaller the B, the better the reconstruction quality, leading to better captioning performance. To evaluate the robustness of SnapCap, we conduct experiments with different values of B and report the CIDEr values in Fig. 7. From the figure, it can be seen that our SnapCap shows the least performance degradation as B increases.


    Figure 7. Caption quality (in terms of CIDEr value) comparison of different methods under multiple compression ratios (Refs. [8,16,23,31]).

    Data: Coded measurement $Y$ and masks $\{C_k\}_{k=1}^{B}$.
    Input: Trained encoder f(·,·), student model S(·), projector Proj(·), and the language decoder Dec(·).
    Output: Predicted captions.
    1.  Input the measurement $Y$ and masks $\{C_k\}_{k=1}^{B}$ into the encoder f(·,·) to get the latent representation $f_{latent}$ as in Eq. (3).
    2.  Input the latent representation of the measurement into the student model S(·) to get the visual embedding $f^{s}$ as in Eq. (6).
    3.  Input the visual embedding $f^{s}$ into the projector Proj(·) to obtain $t$ as in Eq. (11).
    4.  Generate the predicted caption word by word through the language decoder Dec(·) as in Eq. (14).

    Algorithm 2. Inference Stage

    4.3.2. Comparison with the traditional pipeline

    As mentioned in Sec. 1 and shown in Fig. 1(a), traditional multi-stage VC methods[8] usually start from the “sampling & captioning” stage while ignoring the data acquisition, compression, and decoding phases. Here, we consider three representative VC methods based on the H.264 codec, as listed in the top part of Table 2. The CoCap[44] model generates the caption directly from H.264-compressed videos without decoding the original frames. Although it achieves results on the MSRVTT dataset competitive with our SnapCap using the same backbone network, CLIP ViT-B, it takes about 2× longer to run than SnapCap. Furthermore, we have not yet counted the resources consumed to obtain the compressed videos from the live scene. In addition, for a fair comparison with CoCap[44], we also report the decoding time and captioning time of the traditional multi-stage VC methods HMN[30] and the refined semantic enhancement method toward frequency diffusion (RSFD)[31], where the decoding is performed with the open-source package FFmpeg.

    4.3.3. Effects of regularization and distillation

    In Sec. 3, we introduce a novel VC pipeline that takes the compressed measurement and the masks as input to derive language-related vision features for captioning. During training, we first optimize f(·,·) and g(·) through the Lreg loss, and then update the parameters of S(·), Proj(·), and Dec(·) under the guidance of the teacher model as well as the captioning loss.

    To demonstrate the effectiveness of the knowledge-transfer strategy through both the regularization and the direct feature map matching schemes, we conduct experiments by removing Lreg and Ldis step by step. In Table 3, we report the numerical results on the MSRVTT[62] dataset. It can be noted that, without either Lreg or Ldis, the model is almost unable to describe the scene; during training and inference, we observed severe over-fitting. Hence, it is rather difficult to obtain meaningful features directly from the coded measurement, which is also observed in previous works[17,18]. In contrast, both proposed knowledge-transfer strategies effectively extract meaningful, language-related vision features for captioning.

    Lreg    Ldis    B      M      R      C
    ×       ×       24.7   21.7   52.0   16.8
    ✓       ×       32.1   22.6   55.6   29.3
    ×       ✓       33.0   24.9   57.0   31.6
    ✓       ✓       44.2   30.1   63.2   56.7

    Table 3. Contributions of Regularization Loss and Distillation Loss on the MSRVTT Dataset.a

    4.4. Captioning on Real Data

    Besides simulated data, we also apply our framework to real data captured by the CACTI system. To be more specific, we test our model on two public color real snapshot compressive measurements, Ball Rotate[6] and Hammer[15], captured by Ref. [14], and four public grayscale real snapshot compressive measurements, Domino, Hand, Pendulum, and Water Balloon. The coded measurements and our predicted captions are presented in Figs. 8 and 9, respectively, where the reconstructions obtained by STFormer[6] (Ball Rotate and the four grayscale data) and BIRNAT[5] (Hammer) are also shown for reference. It can be clearly seen that our proposed VC pipeline is able to describe the scene accurately in language.


    Figure 8. Comparison of captioning results (our model prediction and two-stage model prediction) on two color real data. The top row shows Ball Rotate, and the bottom row shows Hammer. For better understanding, we also plot the reconstructed results of STFormer (top) and BIRNAT (bottom).


    Figure 9. Comparison of captioning results (our model prediction and two-stage model prediction) on four grayscale real data. From top to bottom: Domino, Hand, Pendulum, and Water Balloon. For better understanding, we also plot the reconstructed results of STFormer.

    5. Limitations and Future Work

    In this paper, we propose an efficient scene captioning model, SnapCap, to generate captions directly from the compressed measurement without reconstruction. However, SnapCap has some limitations, noted as follows: (1) SnapCap only targets captioning, a sub-field of scene understanding, which constrains its applications in real-world scenarios. (2) The CACTI system integrates light over several frames, which may result in motion blur for scenes with fast-moving objects and cause captioning degradation. (3) Under low-light conditions, it can be very hard for the CACTI system to integrate the signals and output a measurement. Hence, SnapCap is not suitable for low-light environments.

    For future work, we plan to explore the following: (1) Framework generalization: SnapCap is only used for scene captioning, which is a sub-field of the scene understanding area. Hence, we will explore extending SnapCap to more related tasks, such as object detection and tracking in the scene. (2) Longer-duration and faster-changing scenes: in this paper, we verify the effectiveness of SnapCap on two short and largely “static” video benchmarks, MSRVTT and MSVD, whose clips usually last less than 10 s with little motion. Therefore, we will continue applying SnapCap to longer-video benchmarks with rapid movement variations. (3) Generalization to other computational imaging systems: SnapCap starts from a snapshot CS camera, which is one branch of computational imaging; other systems include single-pixel imaging and metasurface-based imaging. We will explore extending SnapCap to more computational imaging technologies.

    6. Conclusion

    In this paper, to achieve efficient VC without software-based compression or reconstruction, we propose a novel end-to-end framework to generate captions directly from the compressed measurement. Specifically, we employ a KD strategy with a pretrained large vision-language model, CLIP, to transfer knowledge from the video domain to the measurement domain. In the experimental section, we compare our proposed SnapCap with traditional multi-stage VC methods and two-stage methods on two extensively used benchmarks. Both the quantitative and qualitative results demonstrate that our SnapCap is able to describe the scene efficiently and accurately. We also verify the feasibility of our model on real data.

    References

    [1] V. Ramanishka et al. Top-down visual saliency guided by captions, 7206-7215(2017).

    [2] Z. Zhang et al. From compressive sampling to compressive tasking: retrieving semantics in compressed domain with low bandwidth. PhotoniX, 3, 19(2022).

    [3] B. Yang et al. Non-autoregressive coarse-to-fine video captioning, 3119-3127(2021).

    [4] M. Tang et al. Clip4caption: Clip for video caption, 4858-4862(2021).

    [5] Z. Cheng et al. BIRNAT: Bidirectional recurrent neural networks with adversarial training for video snapshot compressive imaging, 258-275(2020).

    [6] L. Wang et al. Spatial-temporal transformer for video snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell., 45, 9072(2022).

    [7] L. Wang, M. Cao, X. Yuan. Efficientsci: Densely connected network with space-time factorization for large-scale video snapshot compressive imaging, 18477-18486(2023).

    [8] H. Ryu et al. Semantic grouping network for video captioning, 2514-2522(2021).

    [9] X. Gu et al. Text with knowledge graph augmented transformer for video captioning, 18941-18951(2023).

    [10] S. Chen, Y.-G. Jiang. Motion guided spatial attention for video captioning, 8191-8198(2019).

    [11] X. Yuan, D. J. Brady, A. K. Katsaggelos. Snapshot compressive imaging: Theory, algorithms, and applications. IEEE Signal Process Mag., 38, 65(2021).

    [12] P. Llull et al. Coded aperture compressive temporal imaging. Opt. Express, 21, 10526-10545(2013).

    [13] M. Qiao et al. Deep learning for video compressive sensing. APL Photonics, 5, 030801(2020).

    [14] X. Yuan et al. Low-cost compressive sensing for color video and depth, 3318-3325(2014).

    [15] Y. Liu et al. Rank minimization for snapshot compressive imaging. IEEE Trans. Pattern Anal. Mach. Intell., 41, 2990-3006(2018).

    [16] X. Yuan. Generalized alternating projection based total variation minimization for compressive sensing, 2539-2543(2016).

    [17] K. Dong et al. Retrieving object motions from coded shutter snapshot in dark environment. IEEE Trans. Image Process., 32, 3281-3294(2023).

    [18] S. Kumawat et al. Action recognition from a single coded image. IEEE Trans. Pattern Anal. Mach. Intell., 45, 4109-4121(2022).

    [19] X. Yuan et al. Plug-and-play algorithms for large-scale snapshot compressive imaging, 1447-1457(2020).

    [20] Z. Wu, J. Zhang, C. Mou. Dense deep unfolding network with 3D-CNN prior for snapshot compressive imaging, 4872-4881(2021).

    [21] Z. Wu et al. Adaptive deep pnp algorithm for video snapshot compressive imaging. Int. J. Comput. Vis., 131, 1662-1679(2023).

    [22] C. Hu et al. Video object detection from one single image through opto-electronic neural network. APL Photonics, 6, 046104(2021).

    [23] B. Wang et al. Reconstruction network for video captioning, 7622-7631(2018).

    [24] S. Chen, Y.-G. Jiang. Motion guided region message passing for video captioning, 1543-1552(2021).

    [25] K. He et al. Deep residual learning for image recognition, 770-778(2016).

    [26] C. Szegedy et al. Inception-v4, inception-resnet and the impact of residual connections on learning(2017).

    [27] D. Tran et al. Learning spatiotemporal features with 3d convolutional networks, 4489-4497(2015).

    [28] S. Jing et al. Memory-based augmentation network for video captioning. IEEE Trans. Multimedia, 26, 2367-2379(2023).

    [29] B. Pan et al. Spatio-temporal graph for video captioning with knowledge distillation, 10870-10879(2020).

    [30] H. Ye et al. Hierarchical modular network for video captioning, 17939-17948(2022).

    [31] X. Zhong et al. Refined semantic enhancement towards frequency diffusion for video captioning, 3724-3732(2023).

    [32] S. Liu et al. Bidirectional maximum entropy training with word co-occurrence for video captioning. IEEE Trans. Multimedia, 25, 4494-4507(2022).

    [33] W. Xu et al. Deep reinforcement polishing network for video captioning. IEEE Trans. Multimedia, 23, 1772-1784(2020).

    [34] P. H. Seo et al. End-to-end generative pretraining for multimodal video captioning, 17959-17968(2022).

    [35] J. Wang et al. GIT: a generative image-to-text transformer for vision and language(2022).

    [36] H. Luo et al. Univl: a unified video and language pre-training model for multimodal understanding and generation(2020).

    [37] H. Xu et al. mplug-2: a modularized multi-modal foundation model across text, image and video, 38728-38748(2023).

    [38] C. Schuhmann et al. Laion-400m: open dataset of clip-filtered 400 million image-text pairs(2021).

    [39] A. Miech et al. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2630-2640(2019).

    [40] M. Bain et al. Frozen in time: a joint video and image encoder for end-to-end retrieval, 1728-1738(2021).

    [41] Y. Tewel et al. Zero-shot video captioning by evolving pseudo-tokens(2023).

    [42] A. Radford et al. Learning transferable visual models from natural language supervision, 8748-8763(2021).

    [43] Y. Liang et al. IcoCap: Improving video captioning by compounding images. IEEE Trans. Multimedia, 26, 4389-4400(2023).

    [44] Y. Shen et al. Accurate and fast compressed video captioning, 15558-15567(2023).

    [45] X. Jiao et al. Tinybert: Distilling bert for natural language understanding(2019).

    [46] A. Yang et al. Context matters: distilling knowledge graph for enhanced object detection. IEEE Trans. Multimedia, 26, 487-500(2023).

    [47] Q. Qi, Y. Yan, H. Wang. Class-aware dual-supervised aggregation network for video object detection. IEEE Trans. Multimedia, 26, 2109-2123(2023).

    [48] K. Xu et al. Feature normalized knowledge distillation for image classification, 664-680(2020).

    [49] X. Li et al. A category-aware curriculum learning for data-free knowledge distillation. IEEE Trans. Multimedia, 26, 9603-9618(2024).

    [50] X. Wang et al. Kdgan: Knowledge distillation with generative adversarial networks(2018).

    [51] M. Li et al. Gan compression: efficient architectures for interactive conditional gans, 5284-5294(2020).

    [52] W. Zhu, B. Peng, W. Q. Yan. Dual knowledge distillation on multiview pseudo labels for unsupervised person re-identification. IEEE Trans. Multimedia, 26, 7359-7371(2024).

    [53] J. Chen et al. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only, 699-710(2023).

    [54] K. Wu et al. Tinyclip: Clip distillation via affinity mimicking and weight inheritance, 21970-21980(2023).

    [55] J. Chang et al. Detrdistill: a universal knowledge distillation framework for detr-families, 6898-6908(2023).

    [56] J. Rao et al. Parameter-efficient and student-friendly knowledge distillation. IEEE Trans. Multimedia, 26, 4230-4241(2023).

    [57] T. Zhang et al. Efficient RGB-T tracking via cross-modality distillation, 5404-5413(2023).

    [58] S. Gupta et al. Cross modal distillation for supervision transfer, 2827-2836(2016).

    [59] W. I. Cho et al. Speech to text adaptation: Towards an efficient cross-modal distillation(2020).

    [60] D. Alvarez Melis, T. Jaakkola. Towards robust interpretability with self-explaining neural networks(2018).

    [61] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding(2019).

    [62] J. Xu et al. Msr-vtt: A large video description dataset for bridging video and language, 5288-5296(2016).

    [63] D. Chen, W. B. Dolan. Collecting highly parallel data for paraphrase evaluation, 190-200(2011).
