Listen Then See: Video Alignment with Speaker Attention

Carnegie Mellon University

Abstract

Video-based Question Answering (Video QA) is a challenging task and becomes even more intricate when addressing Socially Intelligent Question Answering (SIQA). SIQA requires context understanding, temporal reasoning, and the integration of multimodal information, and, in addition, the processing of nuanced human behavior. The complexities involved are further exacerbated by the dominance of the primary modality (text) over the others, so the task's secondary modalities need to be helped to work in tandem with the primary one. In this work, we introduce a cross-modal alignment and subsequent representation fusion approach that achieves state-of-the-art results (82.06\% accuracy) on the Social IQ 2.0 dataset for SIQA. Our approach exhibits an improved ability to leverage the video modality by using the audio modality as a bridge to the language modality. This reduces the language overfitting and resulting video-modality bypassing prevalent in existing techniques, leading to enhanced performance. Our code and models are publicly available at: https://github.com/sts-vlcc/sts-vlcc

Architecture

[Figure: Overview of the proposed architecture]

The figure displays the proposed architecture. We apply the Speaking Turn Sampling (STS) module to obtain the aligned frame_i from speaking turn k and the corresponding subtitle from the transcript. We pass this pair to the frozen CLIP encoder to obtain the visual and text encodings, respectively. The resultant encodings are passed through the Vision Language Cross Contextualization (VLCC) module and subsequently through the projection layer to generate one of the inputs to the LLM. Simultaneously, we generate the text embeddings of size U for each question-answer pair, and the text embeddings of size V for the video subtitles.
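To make the data flow concrete, below is a minimal PyTorch sketch of the VLCC fusion and projection step. The module structure, the choice of cross-attention layout, and the dimensions (d_clip for CLIP features, d_llm for the LLM embedding space) are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the STS -> frozen CLIP -> VLCC -> projection pipeline.
# Names and sizes (VLCC, d_clip=512, d_llm=4096) are assumptions for illustration.
import torch
import torch.nn as nn

class VLCC(nn.Module):
    """Vision Language Cross Contextualization: frame encodings attend to subtitle encodings."""
    def __init__(self, d_clip=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_clip, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_clip)

    def forward(self, vis, txt):
        # vis, txt: (batch, num_speaking_turns, d_clip)
        fused, _ = self.cross_attn(query=vis, key=txt, value=txt)
        return self.norm(vis + fused)

class SpeakingTurnFusion(nn.Module):
    def __init__(self, d_clip=512, d_llm=4096):
        super().__init__()
        self.vlcc = VLCC(d_clip)
        self.proj = nn.Linear(d_clip, d_llm)  # projection layer into the LLM embedding space

    def forward(self, frame_enc, subtitle_enc):
        # frame_enc / subtitle_enc: frozen CLIP encodings of the STS-aligned (frame, subtitle) pairs
        return self.proj(self.vlcc(frame_enc, subtitle_enc))

# Dummy usage with random CLIP-sized encodings for 6 speaking turns
fusion = SpeakingTurnFusion()
frames, subs = torch.randn(1, 6, 512), torch.randn(1, 6, 512)
llm_tokens = fusion(frames, subs)  # (1, 6, 4096); concatenated with QA and subtitle text embeddings
print(llm_tokens.shape)
```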

Dataset

[Figure: Social IQ 2.0 dataset overview]

We use the Social IQ 2.0 dataset, which follows the guidelines for measuring social intelligence. It consists of 1,400 social in-the-wild videos annotated with 8,076 questions and 32,304 answers (4 answers per question: 3 incorrect, 1 correct). It is the only dataset that captures social intelligence in the VQA setup.

The dataset includes videos (mp4), audio (mp3, wav), and transcripts (vtt).
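For reference, a small loading sketch is shown below. The directory layout and the JSON field names (e.g. "vid_name", "q", "a0".."a3", "answer_idx") are assumptions made for illustration, not the official annotation schema; adjust them to the released files.

```python
# Hypothetical iteration over Social IQ 2.0 assets; paths and field names are assumptions.
import json
from pathlib import Path

ROOT = Path("siq2")  # hypothetical dataset root

def load_qa(split="train"):
    """Yield one dict per question: text, four candidate answers, correct index, video id."""
    with open(ROOT / "qa" / f"qa_{split}.jsonl") as f:
        for line in f:
            item = json.loads(line)
            yield {
                "video_id": item["vid_name"],
                "question": item["q"],
                "answers": [item[f"a{i}"] for i in range(4)],  # 1 correct, 3 incorrect
                "label": item["answer_idx"],
            }

def asset_paths(video_id):
    """Per-video modalities shipped with the dataset: mp4 video, wav audio, vtt transcript."""
    return {
        "video": ROOT / "video" / f"{video_id}.mp4",
        "audio": ROOT / "audio" / f"{video_id}.wav",
        "transcript": ROOT / "transcript" / f"{video_id}.vtt",
    }

for qa in load_qa("train"):
    paths = asset_paths(qa["video_id"])
    # ... feed (paths, qa) into the STS / VLCC pipeline
    break
```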

Speaking Turn

[Figure: Speaking Turn Sampling example, our method (green box) vs. baseline (red box)]

This is a demonstration of our Speaking Turn Sampling. The question asked in this video is "What is the tone of the people speaking?". Our method (green box) samples frames in which people are speaking, which are relevant to the question, whereas the baseline (red box) samples frames that do not contain relevant information for the task. In this example, our model predicts the correct answer and the baseline does not.
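A possible implementation of this sampling idea is sketched below, assuming that transcript cue boundaries define the speaking turns. The VTT parsing and the OpenCV-based frame grabbing are illustrative assumptions, not the paper's released code.

```python
# Illustrative Speaking Turn Sampling: one frame per subtitle cue (speaking turn),
# grabbed at the cue's temporal midpoint instead of uniform sampling.
import re
import cv2  # pip install opencv-python

# Assumes HH:MM:SS.mmm timestamps; hour-less cues are not handled in this sketch.
TS = re.compile(r"(\d+):(\d+):(\d+)\.(\d+) --> (\d+):(\d+):(\d+)\.(\d+)")

def speaking_turns(vtt_path):
    """Return (start_sec, end_sec, text) tuples, one per subtitle cue."""
    turns, cur = [], None
    for line in open(vtt_path, encoding="utf-8"):
        m = TS.match(line.strip())
        if m:
            h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
            cur = [h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
                   h2 * 3600 + m2 * 60 + s2 + ms2 / 1000, ""]
        elif cur is not None and line.strip():
            cur[2] = (cur[2] + " " + line.strip()).strip()
        elif cur is not None:
            turns.append(tuple(cur)); cur = None
    if cur is not None:
        turns.append(tuple(cur))
    return turns

def sample_frames(video_path, vtt_path):
    """One frame per speaking turn, paired with its subtitle text."""
    cap = cv2.VideoCapture(str(video_path))
    pairs = []
    for start, end, text in speaking_turns(vtt_path):
        cap.set(cv2.CAP_PROP_POS_MSEC, 1000 * (start + end) / 2)
        ok, frame = cap.read()
        if ok:
            pairs.append((frame, text))
    cap.release()
    return pairs  # aligned (frame_i, subtitle) pairs for speaking turns k = 1..K
```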

BibTeX

@inproceedings{agrawal2024listen,
  title={Listen Then See: Video Alignment with Speaker Attention},
  author={Agrawal, Aviral and Lezcano, Carlos Mateo Samudio and Heredia-Marin, Iqui Balam and Sethi, Prabhdeep Singh},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2018--2027},
  year={2024}
}