Today marks the release of V-JEPA 2, a new world model trained on video that improves the ability of robots and AI agents to understand and predict interactions in the physical world. It is a step toward advanced machine intelligence (AMI) that can "think before acting."
V-JEPA 2 draws on the same physical intuition humans use every day: a hockey player skates to where the puck is going, and a pedestrian threads through a crowded sidewalk without colliding with anyone. V-JEPA 2 aims to give AI agents this kind of intelligence by serving as a world model they can use to understand, predict, and plan.
Building on its predecessor V-JEPA, released last year, V-JEPA 2 improves both understanding and prediction, enabling robots to interact effectively with unfamiliar objects and environments. The model is pretrained with self-supervised learning on more than a million hours of video, from which it learns how people interact with objects and how objects move through the world.
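To make the training idea concrete, here is a minimal sketch of a JEPA-style objective in PyTorch. It is an illustration under stated assumptions, not Meta's released code: the toy encoder, the mean-pooling shortcut standing in for the real transformer predictor, and all sizes below are placeholders. The essential point it captures is that the model predicts the representations of masked portions of a clip rather than their pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TOKENS, DIM = 64, 128  # a clip flattened into 64 patch tokens of width 128

class Encoder(nn.Module):
    """Toy stand-in for V-JEPA's video encoder (a vision transformer in practice)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

    def forward(self, x):
        return self.net(x)

encoder = Encoder()                    # online encoder: sees only visible tokens
target_encoder = Encoder()             # momentum (EMA) copy: sees the full clip
target_encoder.load_state_dict(encoder.state_dict())
for p in target_encoder.parameters():  # targets provide no gradients
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))

def jepa_step(tokens, mask):
    """tokens: (batch, NUM_TOKENS, DIM) patch embeddings of one video clip.
    mask: (NUM_TOKENS,) bool, True where the clip is hidden from the encoder."""
    with torch.no_grad():
        targets = target_encoder(tokens)[:, mask]  # latents the model must predict
    context = encoder(tokens[:, ~mask])            # encode visible context only
    pooled = context.mean(dim=1, keepdim=True)     # crude stand-in for the predictor's attention
    preds = predictor(pooled).expand(-1, int(mask.sum()), -1)
    return F.mse_loss(preds, targets)              # regress in representation space, not pixels

tokens = torch.randn(8, NUM_TOKENS, DIM)  # fake batch of 8 clips
mask = torch.rand(NUM_TOKENS) < 0.75      # hide roughly 75% of each clip
loss = jepa_step(tokens, mask)
loss.backward()                           # an EMA update of target_encoder would follow in a real loop
```

Predicting in representation space lets the model ignore unpredictable pixel-level detail and focus on what actually changes in a scene.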
In lab tests, robots driven by V-JEPA 2 have successfully performed tasks such as reaching for an object, picking it up, and placing it in a new location. Given an image of the desired goal state, the robot uses the model to imagine the consequences of candidate actions and executes the ones that bring it closest to that goal, a step toward more intuitive robotic behavior in complex environments.
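Below is a hedged sketch of one simple way such goal-image planning can work in latent space, using random-shooting model-predictive control. The encode and predict functions, the dimensions, and the distance-based scoring rule are assumptions for illustration; V-JEPA 2's actual planner is a large learned model, not these placeholders.

```python
import torch

DIM, ACT_DIM, HORIZON, NUM_SAMPLES = 128, 4, 5, 256

# Hypothetical stand-ins: the real encoder and action-conditioned predictor
# are neural networks; these placeholders only fix the interfaces.
def encode(obs):            # observation -> latent state, shape (DIM,)
    return torch.randn(DIM)

def predict(z, action):     # (latent, action) -> predicted next latent
    return z + 0.1 * torch.randn(DIM)

def plan_next_action(current_obs, goal_obs):
    """Random-shooting planner: sample candidate action sequences, roll each
    one forward in latent space, and keep the sequence whose imagined end
    state lies closest to the goal's representation."""
    z0, z_goal = encode(current_obs), encode(goal_obs)
    candidates = torch.randn(NUM_SAMPLES, HORIZON, ACT_DIM)
    best_cost, best_seq = float("inf"), None
    for seq in candidates:
        z = z0
        for action in seq:                    # imagine the rollout, step by step
            z = predict(z, action)
        cost = torch.norm(z - z_goal).item()  # distance to the goal latent
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq[0]  # execute only the first action, then re-plan

action = plan_next_action(current_obs=None, goal_obs=None)  # placeholders accept any input
```

Executing only the first action and re-planning after each new observation (receding-horizon control) keeps the robot robust to prediction errors that accumulate over long horizons.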
In conjunction with V-JEPA 2, three new benchmarks (IntPhys 2, MVPBench, and CausalVQA) are being released to help researchers evaluate how well existing models learn and reason about the physical world from video. The goal is to give the research community access to top-tier models and benchmarks and to accelerate progress toward AI systems that improve everyday life.