NA-VQA: Narrative Aligned Long Form Video Question Answering

Abstract

We introduce NA-VQA, a benchmark comprising 88 full-length movies and 4,442 question-answer pairs designed to evaluate deep temporal and narrative reasoning in long-form videos. Questions are categorized by evidence span length — Short, Medium, and Far — to systematically assess long-range dependencies. We also propose Video-NaRA, a framework that builds event-level chains and stores them in a structured memory for retrieval during reasoning. Experiments reveal that current models perform poorly on far-range questions, while our framework achieves up to 3% improvement on such challenging questions.
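To make the idea of event-level chains held in a structured memory more concrete, here is a minimal sketch. It is not the Video-NaRA implementation: the class names `Event` and `ChainMemory`, the timestamps, and the word-overlap scoring are all assumptions chosen for illustration, and the real framework may build and retrieve chains quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One event description with a timestamp (seconds into the video)."""
    time_s: float
    text: str


@dataclass
class ChainMemory:
    """Hypothetical structured memory: named chains of time-ordered events."""
    chains: dict[str, list[Event]] = field(default_factory=dict)

    def add(self, chain_id: str, event: Event) -> None:
        # Keep every chain sorted by time so it reads as a narrative sequence.
        chain = self.chains.setdefault(chain_id, [])
        chain.append(event)
        chain.sort(key=lambda e: e.time_s)

    def retrieve(self, question: str, top_k: int = 1) -> list[list[Event]]:
        # Toy scoring (an assumption): count words shared with the question.
        q_words = set(question.lower().split())

        def score(chain: list[Event]) -> int:
            words: set[str] = set()
            for e in chain:
                words.update(e.text.lower().split())
            return len(q_words & words)

        ranked = sorted(self.chains.values(), key=score, reverse=True)
        return ranked[:top_k]


# Usage with placeholder timestamps (not taken from the dataset).
memory = ChainMemory()
memory.add("sign", Event(310.0, "The employee attaches a handwritten sign to the shutters."))
memory.add("sign", Event(120.0, "The employee struggles with a jammed padlock."))
memory.add("portrait", Event(900.0, "The man sets up a camera for a family photograph."))
best = memory.retrieve("Why did the employee place the handwritten sign?")[0]
```

Retrieving whole chains rather than isolated events is the point of the sketch: the reasoning step receives the events in narrative order instead of as unrelated clips.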

Dataset Overview

88 Full-Length Movies
4,442 QA Pairs
~2 hrs Avg. Video Duration
7 Reasoning Types

Evidence Span Categories

Short

Evidence localized within a brief temporal window — tests local comprehension.

Medium

Evidence spread across a moderate temporal range — tests mid-range reasoning.

Far

Evidence distributed across the full video — tests long-range narrative reasoning.
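The three categories can be read as a bucketing of the temporal distance between a question's earliest and latest supporting events. The sketch below assumes that reading. The function names and the two thresholds `short_max_s` and `medium_max_s` are placeholders, since the page does not state the benchmark's actual cut-offs.

```python
def evidence_span_seconds(event_times_s):
    """Temporal distance between the earliest and latest supporting event."""
    return max(event_times_s) - min(event_times_s)


def span_category(event_times_s, short_max_s=120.0, medium_max_s=1200.0):
    """Map a question's evidence span to Short, Medium, or Far.

    short_max_s and medium_max_s are placeholder thresholds (assumptions),
    not the cut-offs used by NA-VQA.
    """
    span = evidence_span_seconds(event_times_s)
    if span <= short_max_s:
        return "Short"
    if span <= medium_max_s:
        return "Medium"
    return "Far"
```

For example, under these placeholder thresholds, supporting events at 10 s and 5,000 s would place a question in the Far category.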

Reasoning Type Coverage

NA-VQA is the only benchmark covering all 7 reasoning types:

Causal, Narrative, Character, Thematic, Goal-Based, Social, Hypothetical

Dataset Examples

Each sample includes video frames, a narrative question, a chain of supporting events, and a long-form answer.

Example a

Question: How does the man achieve the goal of creating a convincing family portrait despite the unusual circumstances surrounding the child?

Events

A man retrieves a camera with tripod from storage to capture a family photograph.
The woman holds the blonde toddler wrapped in a yellow blanket on the couch while the man sets up the camera.
The man anxiously sets up a camera in the kitchen for a family photograph.
The man gives animated instructions to the woman and the toddler about the photograph.
The man positions himself near the woman and the baby with the camera's cable release in hand to complete their portrayal as a proper family unit.
The man grins at the camera while seated with the woman and the baby for what appears to be a family portrait.

Answer: The man achieves the goal of creating a convincing family portrait through careful staging and props. He retrieves a camera with tripod from storage, demonstrating planning and intention to document their family appearance. He gives animated instructions about positioning and expressions while setting up the shot. The woman holds the blonde toddler wrapped in a yellow blanket on their floral-patterned couch, creating a traditional family tableau. Despite their nervous body language revealing their discomfort with the deception, they successfully create visual evidence of their pretend normal family life.

Example b

Question: Why did the store employee place the handwritten sign on the metal security shutters?

Events

A store employee arrives at the grocery store to find it locked behind security shutters.
The employee struggles with a jammed padlock due to their gloved hands.
The employee lies on the convenience store floor to create a makeshift sign.
The employee stands precariously on a garbage can outside the store.
The employee attaches a handwritten sign reading 'I ASSURE YOU WE'RE OPEN!' to the metal security shutters.
A handwritten sign reading 'I ASSURE YOU WE'RE OPEN!' hangs on the storefront.

Answer: The store employee placed the handwritten sign reading 'I ASSURE YOU WE'RE OPEN!' because they found the entrance locked behind security shutters with a jammed padlock. After struggling with the lock while wearing gloves and showing growing frustration, they improvised this solution to signal to potential customers that business was continuing despite the closed appearance of the storefront.
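The sample structure described above (frames, question, event chain, long-form answer) could be held in a record like the following. This is only a sketch: the class name `NarrativeSample` and its field names are assumptions rather than the released data schema, and `frame_paths` is left empty because the page does not give file locations. The content is Example b.

```python
from dataclasses import dataclass, field


@dataclass
class NarrativeSample:
    """Illustrative record; field names are assumptions, not the released schema."""
    question: str
    events: list[str]
    answer: str
    frame_paths: list[str] = field(default_factory=list)

    def evidence_text(self) -> str:
        # Number the supporting events in narrative order.
        return "\n".join(f"{i}. {e}" for i, e in enumerate(self.events, start=1))


sample = NarrativeSample(
    question=(
        "Why did the store employee place the handwritten sign "
        "on the metal security shutters?"
    ),
    events=[
        "A store employee arrives at the grocery store to find it locked behind security shutters.",
        "The employee struggles with a jammed padlock due to their gloved hands.",
        "The employee lies on the convenience store floor to create a makeshift sign.",
        "The employee stands precariously on a garbage can outside the store.",
        "The employee attaches a handwritten sign reading 'I ASSURE YOU WE'RE OPEN!' to the metal security shutters.",
        "A handwritten sign reading 'I ASSURE YOU WE'RE OPEN!' hangs on the storefront.",
    ],
    answer=(
        "The store employee placed the handwritten sign reading "
        "'I ASSURE YOU WE'RE OPEN!' because they found the entrance locked "
        "behind security shutters with a jammed padlock. After struggling with "
        "the lock while wearing gloves and showing growing frustration, they "
        "improvised this solution to signal to potential customers that "
        "business was continuing despite the closed appearance of the storefront."
    ),
)

print(sample.evidence_text())
```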
