Keynotes


Dhruv Batra

Bio: Dhruv Batra is a co-founder and the Chief Scientist of Yutori. Previously, he was a Senior Director leading Embodied AI on the Fundamental AI Research (FAIR) team at Meta, and an Associate Professor in the School of Interactive Computing at Georgia Tech.

He works on understanding and advancing the limits of artificial intelligence (AI). More specifically, his research lies at the intersection of machine learning and computer vision, with forays into robotics and natural language processing.

He is a recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE) (2019), the Early Career Award for Scientists and Engineers by the US Army (ECASE-Army) (2018), the Office of Naval Research (ONR) Young Investigator Program (YIP) award (2017), the National Science Foundation (NSF) CAREER award (2014), Army Research Office (ARO) Young Investigator Program (YIP) award (2014), Outstanding Junior Faculty awards from Georgia Tech (2018) and Virginia Tech (2015), multiple research awards from industry (Google, Amazon, Facebook), Carnegie Mellon Dean's Fellowship (2007), best paper awards/nominations in every area of AI (ICRA 2024, ICLR 2023, CVPR 2022, ICCV 2019, EMNLP 2017) and teaching commendations. His research is supported by NSF, ARO, ARL, ONR, DARPA, Amazon, Google, Microsoft, and NVIDIA. Research from his lab has been extensively covered in the media (with varying levels of accuracy) at CNN, BBC, CNBC, Bloomberg Business, The Boston Globe, MIT Technology Review, Newsweek, The Verge, New Scientist, and NPR.

Title: Scouts: Multi-modal Agentic Search for Monitoring the Web


Jingrui He

Bio: Dr. Jingrui He is a Professor in the School of Information Sciences at the University of Illinois at Urbana-Champaign. She received her PhD from Carnegie Mellon University in 2010. Her research focuses on heterogeneous machine learning, active learning, neural bandits, and self-supervised learning, with applications in security, agriculture, social network analysis, healthcare, and finance. Dr. He is the recipient of the 2016 NSF CAREER Award, the 2020 OAT Award, and the 2025 Amazon Research Award, a three-time recipient of the IBM Faculty Award (2014, 2015, and 2018), and was selected for the IJCAI 2017 Early Career Spotlight. She has more than 190 publications at major conferences (e.g., ICML, NeurIPS, ICLR, KDD) and in journals (e.g., TMLR, TKDD, JMLR), and is the author of two books. Her papers have received the Distinguished Paper Award at FAccT 2022, as well as Best of the Conference selections at ICDM 2016, ICDM 2010, and SDM 2010. Dr. He is a Distinguished Member of the ACM and a Senior Member of AAAI and IEEE. She also served as Program Co-Chair of IEEE BigData 2023.

Title: Towards Multimodal Understanding on Rich Data: IID vs. Non-IID

Abstract: Multimodal data is ubiquitous in our daily lives, spanning text, images, videos, graphs, and time series. Some of it can be characterized as IID data, following independent and identical distributions, while some is non-IID, such as multimodal graphs. In understanding such data, we face multiple challenges, including view heterogeneity, interpretability, and uncovering the underlying mechanisms. In this talk, I will introduce some of our recent efforts to address these challenges for both IID and non-IID data in various learning scenarios, such as predictive modeling, anomaly detection, and the fusion of pre-trained foundation models. Towards the end, I will also share my thoughts on some future directions.


Jianwei Yang

Bio: Jianwei Yang is an AI Research Scientist at Meta, and was previously a Principal Researcher at Microsoft Research (MSR). His research lies at the intersection of computer vision and multimodal learning, with a focus on developing general-purpose multimodal agents capable of interacting with both humans and environments. He has co-organized several academic events, including the Workshops on Transformers for Vision, the Workshops on Computer Vision in the Wild, and the Tutorials on Recent Advances in Vision Foundation Models. He has also served as an Area Chair for top-tier conferences such as ICCV, NeurIPS, and ICLR. His work has been recognized with several honors, including selection as a Best Student Paper Finalist at CVPR 2022, first place in the V3Det Challenge at CVPR 2024, and the Best Paper Award at the CoRL 2024 LangRob Workshop.

Title: Toward General-Purpose Multimodal Agents: From GUIs to Robots and Beyond

Abstract: The development of multimodal AI agents marks a pivotal step toward creating systems capable of understanding, reasoning, and interacting with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations but also act adaptively to achieve goals within their environments. In this talk, I will present my recent research toward this grand goal. First, I'll discuss agentic models for the digital world, focusing on understanding and interacting with complex graphical user interfaces. Next, I'll move to the physical world, where we enable robots to learn from demonstration videos and act in the real world. Finally, I'll introduce Magma, a unified multimodal foundation model designed to bridge perception and action across both domains. I'll conclude with insights on scalable training, grounding, and the challenges ahead in building truly general-purpose agents.