A new benchmark called BRITE measures how well text‑to‑video generators handle implausible prompts and audio‑visual alignment.
BRITE combines three elements: deliberately odd prompts, detailed scoring of sound-image matching, and a question-answer format that lets humans trace errors. The creators built the dataset with a human-in-the-loop pipeline rather than a fully automated LLM pipeline, which is prone to hallucination. They tested five leading models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max) and found that all of them struggled to link objects to actions and to synchronise audio, even though they performed well on static composition.
The result matters because the field has focused on photorealism while neglecting coherence in dynamic scenes. BRITE gives researchers a reliable, interpretable tool to spot these weaknesses before they reach users.
If future models keep impressing on still frames but ignore timing, they'll remain half-baked for real-world applications.