Which Way Did It Move? — Diagnosing Directional Motion Blindness in Video-LLMs

Try it — same clip, two questions

A colored object drifts across the frame in one direction. We ask the Video-LLM two questions about the same clip: one about color, one about direction. State-of-the-art models nail the color but collapse on the direction. We call this failure mode directional motion blindness.

Ground truth · Rightward

Q1 “What color is the moving object?”

Q2 “Which way is the object moving?”

[Bar chart: color vs. motion-direction top-1 accuracy (0–100%) for LLaVA-Video-7B, GPT-4o, Qwen3-VL-4B, and LLaVA-Video-7B + DeltaDirect.]

Direction-QA top-1 accuracy on MODIRECT-SYNBENCH (Avg. across P-Syn / C-Syn / P-Real / C-Real). Color bars are visual approximations from the paper's teaser. Full per-domain numbers in Results.

Abstract

Recent Video-LLMs achieve strong results on a wide range of video tasks, yet our analysis reveals a surprisingly fundamental failure mode: directional motion blindness. Given a clip of a colored object moving across the frame, state-of-the-art models can answer "what color?" reliably but collapse to chance on "which way is it moving?". Through linear probes and a logit-lens analysis we show that the direction signal is faithfully encoded at every layer; it simply fails to bind into the answer — a phenomenon we term the direction binding gap. We introduce MoDirect-Inst, a controlled instruction-tuning corpus, and DeltaDirect, a parameter-efficient motion-change head that pulls direction through the vision-language interface during training and leaves the inference pipeline untouched. DeltaDirect raises LLaVA-Video-7B from 27.6% to 85.4% on real-world direction QA.

TL;DR

Video-LLMs already see motion direction in their internal representations — they just don't say it. DeltaDirect adds a small auxiliary motion head during training so that the VL interface lets the direction signal through.

Diagnosis

What's broken inside a Video-LLM that makes direction unanswerable? We trace it in three steps: rule out the obvious, follow the signal through the network, then ask what training can — and cannot — fix.

01

Two easy explanations — both fail.

Before opening the model, we test the obvious external hypotheses: maybe direction-supervised data is too rare, or maybe the prompt simply isn't asking the right way. Neither moves the needle.

Hypothesis 1: data is missing.
Only 0.91% of LLaVA-Video-178K is direction-related after keyword and semantic filtering, and human verification confirms this figure as an upper bound. Supervision is scarce, but scarcity alone tells us nothing about where in the model the failure sits.

Hypothesis 2: the prompt is wrong.
Combining input-side scaffolds (visual boundary cues, step-by-step location reasoning, and coordinate-grid prompts) yields a best-case lift on P-Syn of only 7.1 pp: vanilla 27.6% → scaffolded 34.7%, still close to the 25% chance line. Prompting cannot close the gap.

Neither lever moves the needle. The cause must be representational — so we open the model.

02

The signal survives every layer — the readout just won't bind it.

We fit linear probes at every stage of the pipeline, then run a logit-lens-style readout test at the end.

Direction probe · final readout: 95.3%. The signal is at the readout.

Direction-QA (MCQ): 27.6%. Barely above the 25% chance line.

Decoded, not bound — the Direction Binding Gap.
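To make the probe-versus-QA comparison concrete, here is a minimal sketch of a linear probe. It is not the authors' probing code: the hidden states are a toy stand-in (one random mean per direction class plus Gaussian noise), the ridge-regression fit and all function names are our own choices, and the binding gap is assumed to be probe accuracy minus QA accuracy. Four classes are used because the text places chance at 25%.

```python
import numpy as np

def add_bias(feats):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_linear_probe(feats, labels, n_classes, ridge=1e-3):
    """Ridge regression from features to one-hot direction labels."""
    X = add_bias(feats)
    Y = np.eye(n_classes)[labels]
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_accuracy(W, feats, labels):
    """Top-1 accuracy of the probe's argmax prediction."""
    return float(np.mean(np.argmax(add_bias(feats) @ W, axis=1) == labels))

# Toy stand-in for hidden states (assumption, not real model activations):
# one mean vector per direction class plus Gaussian noise.
rng = np.random.default_rng(0)
n_classes, dim = 4, 32
class_means = rng.normal(size=(n_classes, dim))

def sample_states(n):
    labels = rng.integers(0, n_classes, size=n)
    feats = class_means[labels] + 0.5 * rng.normal(size=(n, dim))
    return feats, labels

train_x, train_y = sample_states(800)
test_x, test_y = sample_states(400)

W = fit_linear_probe(train_x, train_y, n_classes)
probe_acc = probe_accuracy(W, test_x, test_y)

qa_acc = 0.276            # direction-QA accuracy quoted in the text
gap = probe_acc - qa_acc  # assumed definition of the binding gap
print(f"probe={probe_acc:.3f}  qa={qa_acc:.3f}  gap={gap:.3f}")
```

With real activations, `train_x` and `test_x` would be replaced by hidden states extracted at a chosen pipeline stage, and the same fit would be repeated per stage.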

The gap is universal.

The same probe–QA divergence appears in every Video-LLM we tested — nine open-source models across architectures and scales. High linear-probe accuracy (green) coexists with low direction-QA accuracy (red), and the binding gap (yellow) is uniformly wide.

This points to a shared structural limitation in how Video-LLMs route motion-direction evidence into language, not a model-specific artifact.

03

Tuning closes it — but only on the source domain. Magnitude is the lever.

Direction instruction tuning installs a source-domain binding pathway. It transfers an orientation across domains but not the amplitude needed to read it out.

Domains: P-Syn (source) · P-Real (OOD) · C-Syn (OOD) · C-Real (hardest)

P = primitive shape; C = cutout of a real object.
Syn = synthetic background; Real = real-world place.

Direction-supervised tuning data covers only P-Syn. Everything else is out-of-distribution — and that's where the binding gap reopens.

A binding pathway emerges on the source domain.

Logit lens on LLaVA-Video, Primitive-on-Syn. Before tuning, the lens (- -) stays at chance while the probe (—) is high — the model knows but doesn't say. After MoDirect-Inst tuning, the lens climbs in late layers and converges with the probe.

MCQ accuracy: 27.6% → 99.5%

The readout is now reading out direction, not just encoding it. But this binding does not generalize as visual complexity rises.
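For readers unfamiliar with the logit lens, the sketch below shows the general idea on toy data: normalize an intermediate hidden state, multiply by the unembedding matrix, and check which answer-option token receives the largest logit. Everything here is illustrative. The gain-free RMS norm, the orthonormal toy unembedding, the option token ids, and the two synthetic "layers" are assumptions, not details from the paper.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    """Gain-free RMS normalization (real models also apply a learned gain)."""
    return h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + eps)

def lens_predictions(states, unembed, option_token_ids):
    """states: (examples, dim). Argmax over the logits of the option tokens."""
    logits = rms_norm(states) @ unembed.T          # (examples, vocab)
    return np.argmax(logits[:, option_token_ids], axis=1)

def lens_accuracy_per_layer(states_by_layer, labels, unembed, option_token_ids):
    """Logit-lens accuracy for each layer's hidden states."""
    return [
        float(np.mean(lens_predictions(s, unembed, option_token_ids) == labels))
        for s in states_by_layer
    ]

rng = np.random.default_rng(0)
dim, vocab = 16, 8
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
basis = q.T                                   # rows are orthonormal
unembed = basis[:vocab]                       # toy unembedding matrix
hidden_axes = basis[vocab:vocab + 4]          # axes orthogonal to every unembed row
option_token_ids = np.array([1, 3, 4, 6])     # hypothetical ids of the four options
labels = rng.integers(0, 4, size=200)

# "Early" layer: direction is linearly decodable (one axis per class) but lies
# in a subspace the unembedding cannot read, so the lens sees no signal.
early = hidden_axes[labels]
# "Late" layer after binding: the state also points at the correct option token.
late = early + unembed[option_token_ids[labels]]

accs = lens_accuracy_per_layer([early, late], labels, unembed, option_token_ids)
print(accs)
```

The early layer is a deliberately extreme construction of "decoded, not bound": a linear probe could recover the class from `hidden_axes`, yet the option logits carry nothing but floating-point noise.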

(a) OOD binding gap reopens, (b) concept-vector orientations align across domains, (c) magnitudes shrink under visual complexity

We analyze direction concept-vectors at the readout state via difference-in-means, decomposing each into a unit orientation and a magnitude. (a) The source-domain gap closes to 0.3 pp but reopens to 12.1 pp on Cutout-on-Real. (b) Direction-axis orientations align across domains in late layers (cos > 0.9). (c) Yet their magnitudes shrink with visual complexity — the magnitude deficit behind the OOD binding gap.
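The decomposition described above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the paper's analysis code: we contrast one direction class with its opposite, and the two "domains" are synthetic states that share an axis but differ in amplitude.

```python
import numpy as np

def concept_vector(states, labels, pos, neg):
    """Difference-in-means vector between two direction classes."""
    return states[labels == pos].mean(axis=0) - states[labels == neg].mean(axis=0)

def decompose(vec):
    """Split a concept vector into a unit orientation and a scalar magnitude."""
    magnitude = float(np.linalg.norm(vec))
    return vec / magnitude, magnitude

rng = np.random.default_rng(0)
dim, n = 32, 2000
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)          # shared direction axis (toy assumption)

def toy_domain(amplitude):
    """Readout states whose left/right signal has the given amplitude."""
    labels = rng.integers(0, 2, size=n)            # 0 = left, 1 = right
    signs = 2.0 * labels - 1.0
    states = amplitude * signs[:, None] * axis + rng.normal(size=(n, dim))
    return states, labels

src_states, src_labels = toy_domain(3.0)   # stands in for the source domain
ood_states, ood_labels = toy_domain(1.0)   # stands in for a harder OOD domain

src_dir, src_mag = decompose(concept_vector(src_states, src_labels, 1, 0))
ood_dir, ood_mag = decompose(concept_vector(ood_states, ood_labels, 1, 0))
cos = float(src_dir @ ood_dir)

print(f"cosine={cos:.3f}  src_mag={src_mag:.2f}  ood_mag={ood_mag:.2f}")
```

In this toy setup the orientations agree while the OOD magnitude is smaller, which mirrors the pattern reported in panels (b) and (c).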

We test this directly: rescale each OOD readout's magnitude to the source-domain (P-Syn) level — orientation untouched, no extra training.

Intervention: rescale OOD magnitude → P-Syn magnitude

Outcome (direction-QA accuracy, %)

Vanilla LLaVA-Video-7B: 25.8

+ MoDirect-Inst (instruction tuning): 60.5

+ rescaled to P-Syn magnitude (no extra training): +15.5 pp over the 60.5 above

+15.5 pp on Cutout-on-Real. The OOD failure is a magnitude deficit.
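One way to read the rescaling intervention is sketched below. The text does not specify the exact operation, so this is an assumption: we scale only the component of each centered readout state along the domain's own difference-in-means axis, so the concept-vector magnitude matches a target level while its orientation stays fixed. The target magnitude and the toy states are hypothetical.

```python
import numpy as np

def concept_vector(states, labels, pos, neg):
    """Difference-in-means vector between two direction classes."""
    return states[labels == pos].mean(axis=0) - states[labels == neg].mean(axis=0)

def rescale_along_axis(states, unit_axis, scale):
    """Scale the centered component along unit_axis; leave the rest untouched."""
    centered = states - states.mean(axis=0)
    coeff = centered @ unit_axis                       # (examples,)
    return states + (scale - 1.0) * coeff[:, None] * unit_axis

rng = np.random.default_rng(0)
dim, n = 32, 2000
axis = rng.normal(size=dim)
axis /= np.linalg.norm(axis)

labels = rng.integers(0, 2, size=n)                    # 0 = left, 1 = right
signs = 2.0 * labels - 1.0
ood_states = 1.0 * signs[:, None] * axis + rng.normal(size=(n, dim))

ood_vec = concept_vector(ood_states, labels, 1, 0)
ood_mag = float(np.linalg.norm(ood_vec))
ood_dir = ood_vec / ood_mag

target_mag = 6.0                                       # hypothetical source-domain level
rescaled = rescale_along_axis(ood_states, ood_dir, target_mag / ood_mag)

new_vec = concept_vector(rescaled, labels, 1, 0)
new_mag = float(np.linalg.norm(new_vec))
new_cos = float(new_vec @ ood_dir / new_mag)
print(f"before={ood_mag:.2f}  after={new_mag:.2f}  cosine={new_cos:.4f}")
```

Because the operation is linear and acts only along `ood_dir`, the difference-in-means vector is multiplied by exactly `target_mag / ood_mag`, which is why no retraining is involved.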

The readout state shares a common direction axis across domains; what it lacks is amplitude. The actionable target is to make the projector output carry a stronger signed-displacement signal before it enters the LLM. This is exactly the design principle behind DeltaDirect.

Method · DeltaDirect

DeltaDirect attaches a lightweight motion-change head at the vision-language interface during training. The head consumes a frame-difference descriptor and is supervised by an MSE motion loss. At inference, the head is discarded — the projector and LLM alone produce the answer.

1

Frame-difference descriptor

Pool projector outputs of adjacent frames:

δ_t = Pool(F_{t+1} − F_t)

2

Motion head

A 2-layer MLP maps δ_t to a predicted motion-change descriptor m̂_t.

3

Auxiliary loss

L_MVP = MSE(m_t, m̂_t)

Added to the language modeling loss with weight λ.

4

Drop the head at inference

Only the (nudged) projector and LLM remain. Zero parameter overhead at deployment.
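The four steps above can be sketched as a forward pass in NumPy. This is a rough illustration rather than the released implementation. Mean pooling over tokens, the ReLU activation, the hidden width, the 2-D target descriptor m_t, and the value of λ are all assumptions made for the example; the combined objective is written as L_LM + λ · L_MVP, following the description in step 3.

```python
import numpy as np

def frame_difference_descriptor(F):
    """F: (T, N, D) projector outputs. Returns (T-1, D): pooled F[t+1] - F[t].

    Mean pooling over the N tokens is an assumption; the text only says Pool.
    """
    return (F[1:] - F[:-1]).mean(axis=1)

def motion_head(delta, W1, b1, W2, b2):
    """2-layer MLP mapping each descriptor to a predicted motion change."""
    hidden = np.maximum(delta @ W1 + b1, 0.0)   # ReLU is an assumed activation
    return hidden @ W2 + b2

def mse(target, pred):
    return float(np.mean((target - pred) ** 2))

def training_loss(lm_loss, m_true, m_pred, lam):
    """Language-modeling loss plus the weighted auxiliary motion loss."""
    return lm_loss + lam * mse(m_true, m_pred)

rng = np.random.default_rng(0)
T, N, D, H, M = 4, 6, 8, 16, 2      # frames, tokens, feature dim, hidden, target dim
F = rng.normal(size=(T, N, D))      # toy projector outputs
W1, b1 = 0.1 * rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(H, M)), np.zeros(M)

delta = frame_difference_descriptor(F)            # (T-1, D)
m_hat = motion_head(delta, W1, b1, W2, b2)        # (T-1, M)
m_true = rng.normal(size=(T - 1, M))              # hypothetical motion targets
loss = training_loss(lm_loss=1.5, m_true=m_true, m_pred=m_hat, lam=0.1)
print(delta.shape, m_hat.shape, round(loss, 4))

# At inference the head parameters (W1, b1, W2, b2) are simply not used:
# only the projector that produced F and the LLM remain.
```

In an actual training setup the gradient of the auxiliary term would flow back through `delta` into the projector, which is how the head can shape the projector output even though it is dropped afterwards.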

Results

Direction-QA top-1 accuracy on the two MoDirect benchmarks: SynBench (averaged over the four P/C × Syn/Real cells) and RealBench (averaged over SSv2, KTH, TOMATO). The comparison covers controlled and real-world generalization for every model in Table 1.

Legend: chance · open-source baseline · closed-source frontier · related method · + MoDirect-Inst tuning · + DeltaDirect (ours)

Citation

If you find this work useful, please cite us.

@article{lee2026deltadirect,
  title   = {Which Way Did It Move? Diagnosing and Overcoming
             Directional Motion Blindness in Video-LLMs},
  author  = {Lee, Jongseo and Lee, Hyuntak and Kim, Sunghun and
             Kim, Sooa and Chung, Jihoon and Choi, Jinwoo},
  journal = {arXiv preprint arXiv:2605.22823},
  year    = {2026}
}

arXiv:2605.22823 KHU-VLL/DeltaDirect
