The neural network architecture proposed in the paper leverages attention to integrate information effectively. (Attention is the mechanism by which the algorithm focuses on one element, or a few, at a time.) The model is self-supervised: it must infer masked-out objects in videos from the underlying dynamics, which forces it to extract more information from the data it sees. And the architecture ensures that visual elements in the videos correspond to physical objects, a step the coauthors argue is essential for higher-level reasoning.
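To make the masked-object idea concrete, here is a minimal sketch of that kind of self-supervised objective, written in PyTorch. The class name, dimensions, layer counts, and masking ratio are illustrative assumptions, not details from the paper: a transformer attends over per-frame object embeddings ("slots"), some of which are replaced by a learned mask token, and training asks the network to reconstruct the hidden objects from the visible ones.

```python
import torch
import torch.nn as nn

class MaskedObjectTransformer(nn.Module):
    """Toy sketch of masked-object self-supervision over object slots.
    All sizes and hyperparameters below are illustrative, not the paper's."""

    def __init__(self, slot_dim=64, num_heads=4, num_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=slot_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Learned placeholder that stands in for a masked-out object slot.
        self.mask_token = nn.Parameter(torch.randn(slot_dim))

    def forward(self, slots, mask):
        # slots: (batch, num_slots, slot_dim) object embeddings from a video clip
        # mask:  (batch, num_slots) boolean, True where a slot is hidden
        inputs = torch.where(mask.unsqueeze(-1), self.mask_token, slots)
        return self.encoder(inputs)

# Self-supervised objective: predict the hidden slots from the visible ones.
model = MaskedObjectTransformer()
slots = torch.randn(8, 10, 64)              # 8 clips, 10 object slots each
mask = torch.rand(8, 10) < 0.15             # hide roughly 15% of slots at random
recon = model(slots, mask)
loss = ((recon - slots)[mask] ** 2).mean()  # reconstruction error on masked slots only
loss.backward()
```

The attention layers let every visible object inform the reconstruction of every hidden one, which is the sense in which the objective pushes the model to learn the scene's dynamics rather than memorize individual frames.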
The researchers benchmarked their neural network on CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a dataset that draws on insights from psychology. CLEVRER contains over 20,000 five-second videos of colliding objects (in three shapes, two materials, and eight colors) generated by a physics engine, plus more than 300,000 questions and answers, all focused on four elements of logical reasoning: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”).
According to the DeepMind coauthors, their neural network equaled the performance of the best neurosymbolic models without pretraining or labeled data and with 40% less training data, challenging the notion that neural networks are more data-hungry than neurosymbolic models. Moreover, it scored 59.8% on the hardest counterfactual questions — better than both chance and all other models — and it generalized to other tasks including those in CATER, an object-tracking video dataset where the goal is to predict the location of a target object in the final frame.