

Timezone: America/Chicago

June 11, 2025


Registration Desk: Registration / Badge Pickup Wed 11 Jun 07:00 a.m.  


Tutorial: Zhuo Wu

Cognitive AI for the Future: Agentic Multimodal Models and RAG for Vision Language Applications, from Training to Deployment

Cognitive AI represents a transformative leap in how machines understand and interact with the world. Despite its potential, practical challenges remain in making these systems accessible and applicable across diverse domains. This tutorial addresses how multimodal models, combined with Retrieval-Augmented Generation (RAG) and agentic workflows, can enable cognitive AI systems to deliver personalized, context-aware solutions. With applications ranging from educational tools to assistive technologies for the elderly and disabled, this tutorial focuses on practical strategies for training, optimizing, and deploying these models and pipelines, making them both scalable and accessible to researchers and practitioners.
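
As a concrete taste of the RAG component, below is a minimal sketch of the retrieve-then-prompt loop. The hashed bag-of-words embedder, toy corpus, and prompt format are stand-ins of our own (a real system would use a multimodal encoder and a vector store), not code from the tutorial.

```python
# Minimal RAG sketch: embed a query, retrieve top-k context, build a prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedder (hashed bag-of-words); a real pipeline would use
    # a multimodal encoder such as a CLIP-style text/image tower.
    v = np.zeros(256)
    for tok in text.lower().split():
        v[hash(tok) % 256] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Knowledge base indexed ahead of time (toy captions).
corpus = [
    "wheelchair ramp located at the north entrance",
    "elevator out of service on floor two",
    "large-print menus available at the front desk",
]
index = np.stack([embed(d) for d in corpus])

def retrieve(query, k=2):
    scores = index @ embed(query)            # cosine similarity (unit vectors)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

query = "Where is the accessible entrance?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)   # this augmented prompt would be passed to the multimodal LLM
```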




Workshop: M&M: Multi-modal Models and Medicine Wed 11 Jun 08:00 a.m.  

Vishwesh Nath · Jeya Maria Jose Valanarasu · Zhihong Chen · Xueyan Mei · Weidi Xie · Vishal M. Patel · Bennett Landman

Healthcare today stands at the intersection of technology and innovation, driven by diverse data sources—from clinical reports and electronic health records to medical imaging, vital signs, and numerous forms of unstructured data. While deep learning has significantly advanced medical imaging, the vast potential of integrating these abundant, multi-modal data streams remains largely untapped. This integration promises revolutionary improvements in patient outcomes, yet navigating this landscape poses unique and complex challenges due to the fragmented and isolated nature of healthcare data. This workshop addresses the critical questions facing researchers and practitioners: How can we effectively align and integrate multi-modal medical data? How do we tackle safety, privacy, interpretability, and the scarcity of clinically driven benchmarks?


LatinX in Computer Vision Research Workshop Wed 11 Jun 08:00 a.m.  

Lidia Talavera-Martínez · Willams De Lima


This workshop aims to promote and increase the participation of the LatinX community in Computer Vision. The workshop will provide a platform for LatinX researchers at all levels to share academic, industrial, cultural, and social challenges; highlight prominent LatinX researchers and allies; offer resources and opportunities for career growth through sponsored registrations, mentoring, and resume sharing; and raise the visibility of women researchers within the LatinX community. While the event focuses primarily on researchers who identify as LatinX, everyone is invited to attend.


Workshop: Computer Vision for Mixed Reality Wed 11 Jun 08:00 a.m.  

Rakesh Ranjan

With the advent of passthrough devices such as the Quest 3, Apple Vision Pro, and more recently, Orion AR glasses, users can now engage in deeply immersive experiences that blend the virtual and real worlds, often referred to as Mixed Reality (MR). Unlike traditional Virtual Reality (VR), MR presents unique challenges in computer vision, such as capturing and reconstructing real-world environments with high fidelity and augmenting them with virtual elements in a realistic manner, in real-time.
This workshop aims to provide the research community with a deeper understanding of these MR-specific challenges and explore novel methods in areas like view synthesis, scene understanding, and efficient on-device AI, among others. Attendees will benefit from the insights of a diverse committee with expertise in 3D computer vision, graphics, human visual perception, and efficient machine learning.


Tutorial: Xiaoyang Wu

The 2nd Point Cloud Tutorial: All You Need To Know About 3D Point Cloud

The point cloud is a data structure prevalent in 3D vision, playing an important role in areas like 3D perception, 3D generation, autonomous driving, embodied AI, etc. However, there has not been a comprehensive resource covering the state-of-the-art approaches and engineering details of point cloud processing. This tutorial aims to provide a comprehensive understanding of point cloud processing and analysis. Participants will delve into various aspects of point cloud data, exploring fundamental layers, network engineering considerations, pre-training technology, and acceleration libraries for point cloud processing. Through a series of lectures, attendees will gain insights into the latest developments in the field and learn how to make informed choices when working with point cloud data. For the 2nd point cloud tutorial at CVPR 2025, we aim to move beyond traditional topics like backbone design and pre-training technologies covered in the 1st tutorial. This time, we will also explore challenges and opportunities in applications such as Autonomous Driving, Robotic Learning, and Egocentric Perception in AR/VR. With a diverse background spanning industry and academia, foundational research, and application-driven innovations, we offer a comprehensive perspective on the future of point cloud technology.
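
As a taste of the "fundamental layers" mentioned above, here is a minimal NumPy sketch of farthest point sampling (FPS), a standard downsampling operation in point cloud networks; it is an illustrative stand-in, not code from the tutorial itself.

```python
# Farthest point sampling: pick m well-spread points from an (N, 3) cloud.
import numpy as np

def farthest_point_sampling(points, m):
    n = points.shape[0]
    chosen = np.zeros(m, dtype=np.int64)   # indices of selected points
    dist = np.full(n, np.inf)              # distance to nearest chosen point
    chosen[0] = 0                          # start from an arbitrary point
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        chosen[i] = int(np.argmax(dist))   # farthest remaining point
    return chosen

cloud = np.random.rand(1024, 3)
idx = farthest_point_sampling(cloud, 64)
print(cloud[idx].shape)                    # (64, 3)
```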




Tutorial: Amrita Mazumdar

Volumetric Video in the Real World

Volumetric video, which encodes a time-varying 3D scene into a unified representation for novel-view synthesis of dynamic content, has long been a grand challenge for achieving immersive experiences. High-quality volumetric video enables new and immersive applications, such as 3D video conferencing, 3D telepresence, and virtual tutoring in XR. Recent volumetric representations enable fast and high-quality reconstruction of dynamic 3D scenes. As such, our tutorial summarizes practical challenges in generating and distributing volumetric video in the real world. Specifically, invited talks in this tutorial will cover: (1) compression and performance optimization for 4D reconstruction, such as dynamic Gaussian splatting, quantization, and autoencoders; (2) volumetric video reconstruction from single or sparse-view captures; (3) reconstruction of indoor and urban scenes with dynamic content; (4) reconstruction and playback of dynamic 4D humans in the real world; and (5) integration of volumetric video with vision-language models for other applications. Challenges across video domains, such as dynamic humans, automotive video, and synthetically generated video, will be thoroughly discussed.




The Second Workshop on: Computer Vision For Videogames (CV2) Wed 11 Jun 08:00 a.m.  

Iuri Frosio · Ekta Prashnani · David Durst · Rulon Raymond · Marguerite deCourcelle · Nicu Sebe · Georgios Yannakakis · Joohwan Kim



10th New Trends in Image Restoration and Enhancement Workshop and Challenges Wed 11 Jun 08:00 a.m.  

Radu Timofte · Zongwei Wu · Florin-Alexandru Vasluianu · Yawei Li

Image and video restoration, enhancement, and manipulation are key computer vision tasks with increasing importance across various fields.
The 10th edition of the NTIRE workshop seeks to provide a comprehensive overview of recent trends and advancements in these areas, facilitating interaction and potential collaboration between academic and industrial participants.
The NTIRE associated challenges gauge the state-of-the-art in topics such as super-resolution, efficiency, quality assessment, enhancement, normalization, removal of shadows, reflections and raindrops, HDR, light fields, raw restoration, reconstruction, event-based deblurring, cross-domain detection, depth estimation, night photography, and face restoration.
Building on the success of the previous editions, this event will feature presentations covering a wide selection of topics from 69 papers accepted for publication, organizers and winners of the 23 associated challenges, and invited talks provided by distinguished researchers.


Tutorial: Aditya Chattopadhyay

Foundations of Interpretable AI

In recent years, interpretability has emerged as a significant barrier to the widespread adoption of deep learning techniques, particularly in domains where AI decisions can have consequential impacts on human lives, such as healthcare and finance. Recent attempts at interpreting the decisions made by a deep network can be broadly classified into two categories: (i) methods that seek to explain existing models (post-hoc explainability), and (ii) methods that seek to build models that are explainable by design. This tutorial aims to provide a comprehensive overview of both approaches along with a discussion of their limitations.
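
To make the post-hoc category concrete, below is a minimal sketch of one classic post-hoc method, vanilla gradient saliency, on a toy CNN; the model and input are random stand-ins, and the tutorial covers far more rigorous approaches.

```python
# Vanilla gradient saliency: how much does each pixel affect the top logit?
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
).eval()

x = torch.rand(1, 3, 32, 32, requires_grad=True)
model(x)[0].max().backward()                # gradient of the top logit w.r.t. x
saliency = x.grad.abs().max(dim=1).values   # (1, 32, 32) per-pixel importance
print(saliency.shape)
```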




The 2nd Workshop on Foundation Models for Medical Vision Wed 11 Jun 08:00 a.m.  

Jun Ma · Yuyin Zhou · Vishal M. Patel · Julia Schnabel · Bo Wang

The rapid growth of foundation models in various domains has been transformative, bringing unprecedented capabilities and advances in automated understanding. Medical vision, a pivotal segment of computer vision, is poised to greatly benefit from these advancements. This workshop delves into the integration and application of foundation models specific to the realm of medical imaging. We will cover state-of-the-art techniques for diverse medical data, such as echocardiogram, fundus, pathology, and radiology, as well as the practical challenges of implementing these models in clinical settings. Through expert-led sessions, interactive discussions, and international competitions, we aim to offer attendees a comprehensive understanding of the potential impact foundation models could have on the future of medical diagnostics and patient care.


Tutorial: Nanye Ma

Scalable Generative Models in Computer Vision

Generative models have emerged as a transformative force in computer vision, enabling breakthroughs in image, video, and 3D content synthesis. Recent advancements in model architectures and generative frameworks have driven unprecedented scalability, allowing models to handle larger datasets, longer context lengths, and more complex distributions. This tutorial will provide a comprehensive discussion of these advancements, focusing on frontier techniques for scaling generative models and their applications to video synthesis, 3D reconstruction, and virtual world simulation. Attendees will gain insights into the design principles behind scalable models, learn about key technical innovations, and understand the broader implications for the future of computer vision. By addressing both theoretical and practical aspects, this tutorial aims to equip researchers with the knowledge to explore, build, and deploy next-generation scalable generative models.




Tutorial: Zhaoxi Chen

From Video Generation to World Model

In the past few years, the research community has witnessed remarkable advancements in generative models, especially in the realm of video generation. Generating compelling and temporally coherent videos is challenging yet in high demand. To overcome these challenges, early text-to-video (T2V) methods explored the potential of text-to-image (T2I) pretraining, such as Make-A-Video, MagicVideo, and Lavie. With the success of Diffusion Transformers (DiT), SORA was proposed, the first T2V model able to generate high-fidelity videos up to 40 seconds long. The availability of large-scale, high-quality video datasets proved indispensable. Later methods, including CogVideoX and MovieGen, have further explored the potential of 3D VAEs. However, even the largest current T2V models still fail to maintain physical plausibility in most generated videos. On the other hand, recent work such as Genie, Genie-2, and GameNGen has presented promising results on action-conditioned video generation, showing the great potential of controllable video generation as a path toward world models. Thus, in this tutorial, we will first give a comprehensive background on text-to-video generation by reviewing earlier and the most recent advanced T2V methods. We will then discuss the connections, future directions, and potential routes from current video generation models to the ultimate world model.




8th Workshop on Efficient Deep Learning for Computer Vision Wed 11 Jun 08:00 a.m.  

Yung-Hsiang Lu · Shuai Zhang · George K. Thiruvathukal

Efficient computer vision on mobile, automotive, and edge devices significantly impacts daily life, technology, and industry. This workshop will explore the latest advancements in multimodal LLMs, autonomous driving, Gaussian splatting avatars, and robotics. Additionally, discussions will delve into new optimization methods and applications, highlighting the 2025 IEEE Low Power Computer Vision Challenge (lpcv.ai), where winners of the three tracks will present their innovative solutions.


4th edition of Computer Vision for Metaverse Workshop Wed 11 Jun 08:10 a.m.  

Giuseppe Serra · Ali Abdari · Alex Falcon · Beatrice Portelli · Vanessa Sklyarova · Barbara Roessle · Daniel Jung · Shunlin Lu · Ji Hou · Bichen Wu · Djamila Aouada · Gyeongsik Moon


In the ever-growing areas of Augmented Reality (AR), Virtual Reality (VR), and the expansive Metaverse, computer vision brings together the digital and physical worlds seamlessly. Its ability to understand and interpret visual information pushes these immersive technologies to new levels, enhancing user experiences, driving creative innovations, and exploring new frontiers. On the other hand, Natural Language Processing (NLP) is pivotal for deciphering human language and facilitating applications like translation and summarization. Large Language Models (LLMs) are now capable of human-level conversational skills, drastically enhancing human-machine interactions. As exemplified by CLIP and other multimodal foundation models, textual information plays a significant role in understanding visual data. As a consequence, these large models may contribute significantly to improving AR, VR, and the Metaverse, enabling hands-free navigation, voice-based commands, and immersive communication between avatars.


Workshop: Navigating the Future: Ensuring Trustworthiness in Multi-Modal Open-World Intelligence Wed 11 Jun 08:15 a.m.  

Wei Ji · Hong Liu · Zhun Zhong · Zhe Zeng · Elisa Ricci · Andrew Wilson · Shin’ichi Satoh · Nicu Sebe

Today’s interconnected world presents unique challenges for intelligent systems in processing and integrating diverse data modalities, including text, audio, and visual data. However, traditional closed-world paradigms can fall short when faced with unseen classes and novel scenarios, which frequently emerge in complicated real-world environments. We propose the consideration of open-world learning as a way to build intelligent systems that are highly adaptable while also being robust and trustworthy, capable of tackling highly dynamic and creative tasks. Here, the integration of privacy-preserving techniques is crucial as data sources expand, particularly in high-stakes applications such as autonomous navigation systems for public safety. These systems must discern and adapt to evolving traffic patterns, weather conditions, and user behaviors in real time, underscoring the necessity of continuous learning and resilience against adversities. By exploring these critical challenges, this workshop aims to foster discussions that advance the development of trustworthy, multi-modal systems capable of thriving in open-world contexts.


2nd MetaFood Workshop Wed 11 Jun 08:25 a.m.  

Yuhao Chen · Petia Radeva · Jiangpeng He · Bhalaji Nagarajan · Fengqing Zhu


Today, computer vision algorithms achieve near-perfect, even better-than-human, performance when given clear, well-curated, and sufficiently large datasets. However, there remains a substantial gap when it comes to applying state-of-the-art computer vision algorithms to food data, particularly when dealing with food in its natural, uncontrolled environment, often referred to as “data in the wild.” This gap stems from the inherent challenges in the noisy, watermarked, and low-quality food data readily available on the internet. The MetaFood Workshop (MTF) invites the CVPR community to engage with food domain-related challenges. These challenges provide not only a demanding, real testing environment for the development of robust computer vision algorithms, but also an exciting opportunity to develop new algorithms in the fields of food data analysis and food digitization.


The 1st Workshop on Humanoid Agents Wed 11 Jun 08:30 a.m.  

Wentao Zhu · Fangchen Liu · Bike Zhang · He Wang · Li Yi · Koushil Sreenath · Yizhou Wang · Pieter Abbeel · Leonidas Guibas

Workshop: BEAM 2025: Benchmarking and Expanding AI Multimodal Approaches Wed 11 Jun 08:30 a.m.  

László A. Jeni · Morteza Ziyadi · Hao Yang · Xu Zhang · Yang Zou · Zhaowei Cai · Maria Zontak · Davide Modolo · Ashwin Swaminathan · Liuyue Xie · Mosam Dabhi · Xiang Yue · Ce Zheng · Rohan Choudhury · Ananya Bal

5th Workshop on 3D Scene Understanding for Vision, Graphics, and Robotics Wed 11 Jun 08:30 a.m.  

Yixin Chen · Baoxiong Jia · Yao Feng · Songyou Peng · Chuhang Zou · Sai Kumar Dwivedi · Yixin Zhu · Siyuan Huang · Derek Hoiem · Marc Pollefeys · Song-Chun Zhu

The developments in computer vision, graphics, and robotics have jointly spurred calls for next-generation AI systems that physically interact with their surroundings. Current research advances encompass 3D representations, large-scale foundation models, and end-to-end VLA approaches, but fundamental questions remain on how best to sustain environment comprehension, align efforts from diverse fields, and integrate scene understanding techniques to enhance physical interaction. The workshop seeks to unite current efforts, educate an interdisciplinary workforce with expertise across fields, and promote future developments in embodied and general AI.


Workshop: CV4Science 2025: Using Computer Vision for the Sciences Wed 11 Jun 08:30 a.m.  

Utkarsh Mall · Ye Zhu · Jacob Berv · Siavash Golkar · Katherine Bouman · Subhransu Maji · David Fouhey

This workshop aims to bring together researchers working on computer vision and diverse scientific domains to discuss the latest advancements, challenges, and opportunities at their intersections. The goal is to foster interdisciplinary collaboration, build community within the computer vision community, and highlight progress and researchers at the interface of computer vision and the sciences. AI advancements have become a transformative force, extending beyond their original domain to drive breakthroughs in scientific discovery—an impact highlighted by the 2024 Nobel Prizes in Physics and Chemistry. Computer vision, as one of the core areas in AI research, offers powerful tools for analyzing data, with applications spanning a wide range of scientific fields, from accelerating discoveries in astrophysics and biology to enhancing environmental monitoring and materials science.


The 2nd Workshop on Equivariant Vision: From Theory to Practice Wed 11 Jun 08:30 a.m.  

Congyue Deng · Evangelos Chatzipantazis · Jiahui Lei · YINSHUANG XU · Stefanos Pertigkiozoglou · Minghan Zhu · Huazhe Xu · Thomas W. Mitchel · Leonidas Guibas · Kostas Daniilidis

Exploiting symmetry in structured data is a powerful way to improve the generalization ability, data efficiency, and robustness of AI systems, which leads to the research direction of equivariant deep learning. Having shown its effectiveness, equivariance has been widely adopted in a large variety of subareas of computer vision, from 2D image analysis to 3D perception, as well as further applications such as medical imaging and robotics. The workshop will foster discussion and knowledge exchange among researchers actively working on equivariance, providing a platform to share methodologies and explore the latest advancements in this rapidly evolving field.
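
For readers new to the topic: a map f is equivariant to a group action g when f(g·x) = g·f(x). The snippet below numerically checks this for the textbook example, convolution under circular translations; it is an illustrative check of our own, not workshop code.

```python
# Convolution with circular padding commutes exactly with circular shifts.
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 16, 16)
w = torch.randn(1, 1, 3, 3)

def f(t):  # the equivariant map: a 3x3 convolution on a torus
    return F.conv2d(F.pad(t, (1, 1, 1, 1), mode="circular"), w)

def g(t):  # the group action: a circular translation
    return torch.roll(t, shifts=(3, 5), dims=(2, 3))

print(torch.allclose(f(g(x)), g(f(x)), atol=1e-5))  # True
```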


Workshop: Visual Perception and Learning in an Open World Wed 11 Jun 08:30 a.m.  

Shu Kong · Neehar Peri · Yu-Xiong Wang · Andrew Owens · Abhinav Shrivastava

Visual perception is crucial for a wide range of applications. Traditionally, visual perception models were developed under a closed-world paradigm, where data distributions and categorical labels were assumed to be fixed and known in advance. However, these closed-world models often prove brittle when deployed in the real open world, which is dynamic, vast, and unpredictable. Modern approaches to visual perception have shifted towards open-world models, such as pretraining foundation models on large datasets sourced from the open world (e.g., data collected from the Internet). These foundation models are then adapted to solve specific downstream tasks. While contemporary model training follows the principle of "open-world learning," our workshop seeks to address existing limitations, potential risks, new opportunities, and challenges.
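
As a concrete illustration of the "pretrain, then adapt" recipe described above, here is a minimal linear-probing sketch; the frozen encoder is a random stand-in for a foundation model and the data is synthetic, so this shows only the shape of the adaptation step.

```python
# Linear probing: freeze the pretrained encoder, train only a linear head.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
for p in encoder.parameters():
    p.requires_grad = False          # the "foundation model" stays frozen

probe = nn.Linear(128, 5)            # only this head is trained downstream
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(64, 3, 32, 32)   # synthetic downstream task
labels = torch.randint(0, 5, (64,))
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(probe(encoder(images)), labels)
    loss.backward()
    opt.step()
print(float(loss))
```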


Workshop: Data Driven Autonomous Driving Simulation (DDADS) Wed 11 Jun 08:30 a.m.  

Azadeh Dinparastdjadid · Žan Gojčič · Maximilian Igl · Maximilian Naumann · Thomas Gilles · Ekaterina Tolstaya · Sanja Fidler · Shimon Whiteson



The 4th Workshop on Federated Learning for Computer Vision Wed 11 Jun 08:30 a.m.  

Chen Chen · Guangyu Sun · Nathalie Baracaldo · Yang Liu · Peter Richtárik · Mi Zhang · Lingjuan Lyu · Nicholas Lane · Ang Li · Bo Li · Mahdi Morafah

This workshop aims to bring together researchers and practitioners with a common interest in federated learning for computer vision, and is an attempt at studying the different synergistic relations in this interdisciplinary area. This day-long event will facilitate interaction among students, scholars, and industry professionals from around the world to discuss future research challenges and opportunities.


Workshop: Sight and Sound Wed 11 Jun 08:30 a.m.  

Andrew Owens · Jiajun Wu · Kristen Grauman · Antonio Torralba · William Freeman · Andrew Zisserman · Hang Zhao · Ruohan Gao · Triantafyllos Afouras · Arsha Nagrani · Jean-Charles Bazin

Since pretty much every video has an audio track, the prospect of learning from paired audio-visual data — either with new forms of unsupervised learning, or by simply incorporating sound data into existing vision algorithms — is intuitively appealing, and this workshop will cover recent advances in this direction. But it will also touch on higher-level questions, such as what information sound conveys that vision doesn’t, the merits of sound versus other “supplemental” modalities such as text and depth, and the relationship between visual motion and sound. We’ll also discuss how these techniques are being used to create new audio-visual applications, such as in the fields of speech processing and video editing.


The Sixth Workshop on Fair, Data-efficient, and Trusted Computer Vision Wed 11 Jun 08:30 a.m.  

Nalini Ratha · Srikrishna Karanam · Kuan-Chuan Peng · Mayank Vatsa · Richa Singh · Ziyan Wu · Michele Merler · Kush Varshney

8th Multimodal Learning and Applications Workshop Wed 11 Jun 08:30 a.m.  

Michael Ying Yang · Pietro Morerio · Paolo Rota · Bodo Rosenhahn · Vittorio Murino


The aim of this workshop is to generate momentum around multimodal learning and applications and to encourage interdisciplinary interaction and collaboration between the computer vision, multimedia, remote sensing, and robotics communities, serving as a forum for research groups from academia and industry. We expect contributions involving, but not limited to, image, video, audio, depth, IR, IMU, laser, text, drawings, synthetic data, etc. Position papers with feasibility studies and cross-modality issues with a highly applicative flair are also encouraged. Multimodal data analysis is a very important bridge among vision, multimedia, remote sensing, and robotics; we therefore expect a positive response from these communities.


Workshop: Computational Cameras and Displays Wed 11 Jun 08:45 a.m.  

Kristina Monakhova · Mark Sheinin · Fei Xia · Vishwanath Saragadam

This workshop is designed to unite the computational camera and display communities in that it considers to what degree concepts from computational cameras can inform the design of emerging computational displays and vice versa, both focused on applications in computer vision. The Computational Cameras and Displays (CCD) workshop series serves as an annual gathering place for researchers and practitioners who design, build, and use computational cameras, displays, and imaging systems for a wide variety of uses. The workshop solicits posters and demo submissions on all topics relating to computational imaging systems.


Workshop: Global 3D Human Poses Wed 11 Jun 08:45 a.m.  

Tianjian Jiang · Manuel Kaufmann · Jie Song · Soyong Shin · Jiye Lee · Ye Yuan · Otmar Hilliges

The Global 3D Human Poses (G3P) workshop focuses on innovative techniques that incorporate trajectory data into pose estimation. By fostering collaboration among researchers and practitioners, the workshop will delve into new methodologies, address emerging challenges, and discuss the transformative potential of global pose estimation. Ultimately, the insights and innovations presented here are poised to push the boundaries of computer vision and pave the way for more robust, real-world applications in interactive systems and beyond.


Workshop on Autonomous Driving Wed 11 Jun 08:45 a.m.  

Vincent Casser · Alexander Liniger · Jose M. Alvarez · Maying Shen · Jannik Zürn · Chiyu “Max” Jiang · Nadine Chang · Dragomir Anguelov · John Leonard · Luc Van Gool


The CVPR 2025 Workshop on Autonomous Driving (WAD) brings together leading researchers and engineers from academia and industry to discuss the latest advances in autonomous driving. Now in its 8th year, the workshop has been continuously evolving with this rapidly changing field and now covers all areas of autonomy, including perception, behavior prediction and motion planning. In this full-day workshop, our keynote speakers will provide insights into the ongoing commercialization of autonomous vehicles, as well as progress in related fundamental research areas. Furthermore, we will host a series of technical benchmark challenges to help quantify recent advances in the field, and invite authors of accepted workshop papers to present their work.


FGVC12: 12th Workshop on Fine-grained Visual Categorization Wed 11 Jun 08:45 a.m.  

Nico Lang · Elijah Cole · Suzanne Stathatos · Lukas Picek · Klara Janouskova · Christine Kaeser-Chen · Justin Kay · Joakim Bruslund Haurum · Xiangteng He · Mehmet Aygun · Serge Belongie · Oisin Mac Aodha · Subhransu Maji · Sara Beery · Grant Horn

FGVC12 will explore topics of broad interest to the computer vision community, specifically addressing self-supervision, limited data, and human-in-the-loop learning through the challenging lens of fine-grained learning. This focus extends beyond traditional computer vision, offering methodologies applicable to real-world scenarios in domains like ecology, biology, medicine, and art history, thus fostering participation from researchers outside the CVPR community. The workshop will feature innovative challenges, building upon successful past competitions like iNaturalist, which have previously introduced new datasets and fostered novel solutions. FGVC12 will feature not only leading researchers from the field of computer vision, but also experts from domains such as biomedical data science and ecology to promote discussion of open problems in these disciplines.

FGVC12 acknowledges the support from our Gold Sponsor Google DeepMind.


3rd Workshop on Generative Models for Computer Vision Wed 11 Jun 08:45 a.m.  

Adam Kortylewski · Fangneng Zhan · Tian Han · Alan L. Yuille · Christian Theobalt

This workshop aims to foster collaboration between researchers in generative AI and computer vision to explore how visual recognition can benefit from recent advances in generative image modeling. The workshop will feature expert discussions on research results and future directions, specifically focusing on topics such as generative models as data source for training computer vision models, benchmarking with generative models, analysis-by-synthesis approaches, self-supervised learning, adversarial robustness, out-of-distribution generalization, and ethical considerations within generative modeling.


6th International Workshop on Large Scale Holistic Video Understanding Wed 11 Jun 08:50 a.m.  

Vivek Sharma · Shyamal Buch · Anurag Arnab · Ali Diba · Mohsen Fayyaz · Luc Van Gool · Joao Carreira · Manohar Paluri · Ehsan Adeli · Jürgen Gall · David A. Ross

In recent years, the ability of computer systems to classify and analyze online videos has greatly improved. Significant advancements have been made in specific video recognition tasks, such as action and scene recognition. However, the comprehensive understanding of videos, known as holistic video understanding (HVU), has not received the attention it deserves. Current video understanding systems are specialized, focusing on narrow tasks.

For real-world applications like video search engines, media monitoring systems, and defining a humanoid robot's environment, integrating state-of-the-art methods is essential. To address this need, we are hosting a workshop focused on HVU. This workshop will cover recognizing scenes, objects, actions, attributes, and events in real-world videos.

We are introducing our HVU dataset, organized hierarchically with a semantic taxonomy for holistic video understanding. While many existing datasets focus on human action or sport recognition, our new dataset aims to broaden the scope and draw attention to the potential for more comprehensive video understanding solutions.

Our workshop will gather ideas related to multi-label and multi-task recognition in real-world videos, using our dataset to test and showcase research efforts.


Workshop on Video Large Language Models Wed 11 Jun 09:00 a.m.  

Mubarak Shah · Larry S. Davis · Rene Vidal · Son Dinh Tran · Angela Yao · Salman Khan · Rita Cucchiara · Cees G. M. Snoek · Christoph Feichtenhofer · Chang Xu · Jayakrishnan Unnikrishnan · Afshin Dehghan · Mamshad Nayeem Rizve · Rohit Gupta · Swetha Sirnam · Ashmal Vayani · Omkar Thawakar · Muhammad Uzair Khattak · Dmitry Demidov

This workshop will explore the evolution, applications, and challenges of Video Large Language Models (VidLLMs), the latest advancement in multimodal LLMs. It will feature keynote talks from leading researchers, a panel discussion comparing VidLLMs with expert models, and a poster session. The workshop also includes three challenge tracks designed to evaluate VidLLMs' capabilities in compositional video retrieval, complex video reasoning and robustness, and multilingual video reasoning. These tracks aim to address key research areas such as training VidLLMs, their application in specialized computer vision tasks, and the challenges in evaluating their performance. Potential topics for invited papers include VidLLM methods/algorithms, data creation, evaluation and analysis, best practices, applications, and limitations, risks and safety.  


2nd GenAI Media Generation Challenge Workshop Wed 11 Jun 09:00 a.m.  

Sam Tsai · Ji Hou · Jialiang Wang · Yaqiao Luo · Simran Motwani · Xiaoliang Dai · Peizhao Zhang · Kunpeng Li · Peter Vajda · Tao Xu · Chih-Yao Ma

We are proud to announce the launch of the 2nd GenAI Media Generation Challenge (MAGIC), featuring a media generation track and an auto-evaluation track. Media Generation Festival: for the first time, we are organizing a media generation festival with no restrictions on prompts. We define several topics in which submitted media compete, and participants can submit their best generated videos or images for those topics. For each topic, a crowd-sourced voting mechanism determines the winners. Auto Evaluation Challenge: we are introducing an auto-evaluation challenge for both text-to-image and text-to-video tasks. Participants develop and submit automatic evaluation scores for a preselected set of images and videos that we provide and enter into the festival track. Submissions aim to predict the outcomes of the crowd-sourced voting in the festival; the auto-evaluation method that achieves the best correlation with the final results wins the challenge.
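
Since the auto-evaluation track is judged by correlation with the crowd vote, a submission would be scored along these lines; the numbers are made up, and the organizers' exact protocol and correlation metric may differ.

```python
# Rank correlation between an auto-evaluation entry and crowd-vote outcomes.
from scipy.stats import spearmanr

crowd_votes = [0.81, 0.42, 0.67, 0.15, 0.90]   # per-item human preference rates
auto_scores = [0.75, 0.50, 0.60, 0.20, 0.95]   # a submission's predicted scores

rho, _ = spearmanr(auto_scores, crowd_votes)
print(f"Spearman rho = {rho:.3f}")             # higher correlation ranks better
```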


Workshop: 3D Vision Language Model for Robotics Manipulation: Opportunities and Challenges Wed 11 Jun 09:00 a.m.  

Jiafei Duan · Muhammad Zubair Irshad · Ishika Singh · Vitor Guizilini · Rares Andrei Ambrus · Zsolt Kira

Workshop: Embodied Intelligence for Autonomous Systems on the Horizon Wed 11 Jun 09:00 a.m.  

Hongyang Li · Kashyap Chitta · Andrei Bursuc · Christos Sakaridis · Jonah Philion · Florent Bartoccioni · Ana-Maria Marcu · Huijie Wang

Autonomous systems, such as robots and self-driving cars, have rapidly evolved over the past decades. Despite this progress, several problems remain. Attempts have been made to develop more capable autonomous systems, for example by integrating foundation models and utilizing large-scale data, but the hardest problems have yet to be solved.

The motivation behind this workshop is to explore potential solutions and to discuss the challenges and opportunities associated with these approaches. We believe that this workshop offers a fresh perspective on the present and future of autonomous systems, one that is valuable to both the robotics and computer vision communities.


Workshop: EarthVision: Large Scale Computer Vision for Remote Sensing Imagery Wed 11 Jun 09:00 a.m.  

Ronny Haensch · Devis Tuia · Jan D. Wegner · Loic Landrieu · Charlotte Pelletier · Hannah Kerner · Nathan Jacobs


Earth Observation (EO) and remote sensing are ever-growing fields of investigation where computer vision, machine learning, and signal/image processing meet. The general objective of the domain is to provide large-scale and consistent information about processes occurring at the surface of the Earth by exploiting data collected by airborne and spaceborne sensors. Earth Observation covers a broad range of tasks, from detection to registration, data mining, and multi-sensor, multi-resolution, multi-temporal, and multi-modality fusion and regression, to name just a few. It is motivated by numerous applications such as location-based services, online mapping services, large-scale surveillance, 3D urban modeling, navigation systems, natural hazard forecast and response, climate change monitoring, virtual habitat modeling, food security, etc. The sheer amount of data calls for highly automated scene interpretation workflows.


Tutorial: Clement Fuji Tsang

Tackling 3D Deep Learning, Gaussian Splats and Physics Simulation with NVIDIA Kaolin Library, a Hands-On Lab

3D Deep Learning often demands extensive boilerplate work, such as managing data, camera conventions, and visualizing novel 3D representations. NVIDIA’s Kaolin Library, built on PyTorch, addresses these needs with convenience APIs, reusable research modules, and GPU-optimized operations. The library’s updates are designed to address the evolving needs of the research community. Recent examples include support for emerging representations like 3D Gaussian Splats (3DGS) and physics-based simulations for dynamic 3D modeling. Initially developed for internal use, Kaolin is shared externally under an open-source license. The tutorial will provide hands-on coding experience to equip attendees with practical skills for using Kaolin. In this session, we explore interactive tools for 3DGS viewing in Jupyter, show how to create optimizable physics simulations, and finally convert between common 3D representations to export results. GPU backends will be provided. By the end of the tutorial, attendees will be able to utilize Kaolin’s features to streamline their research workflows and accelerate their projects.




Mobile AI workshop and associated challenges, 5th edition Wed 11 Jun 09:00 a.m.  

Andrey Ignatov · Radu Timofte

Over the past years, mobile AI-based applications have become more and more ubiquitous. Various deep learning models can now be found on any mobile device, from smartphones running portrait segmentation, image enhancement, face recognition, and natural language processing models, to smart-TV boards with sophisticated image super-resolution algorithms. The performance of mobile NPUs and DSPs is also increasing dramatically, making it possible to run complex deep learning models and achieve fast runtime in the majority of tasks. While many research works targeting efficient deep learning models have been proposed recently, the resulting solutions are usually evaluated on desktop CPUs and GPUs, making it nearly impossible to estimate the actual inference time and memory consumption on real mobile hardware. To address this problem, we introduced the Mobile AI Workshop, where all deep learning solutions are developed for and evaluated on real mobile devices.
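
The gap the workshop targets starts with how latency is measured: wall-clock time on the target hardware rather than FLOP counts on a desktop. Below is a minimal timing sketch in plain PyTorch on CPU as a stand-in; a real entry would profile with on-device runtimes such as TFLite or NNAPI.

```python
# Measure mean wall-clock inference latency with warmup.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 3, padding=1)).eval()
x = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    for _ in range(5):                     # warmup excludes one-time costs
        model(x)
    t0 = time.perf_counter()
    for _ in range(20):
        model(x)
    dt = (time.perf_counter() - t0) / 20

print(f"mean latency: {dt * 1000:.1f} ms")
```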


2nd Workshop on Urban Scene Modeling: Where Vision meets Photogrammetry and Graphics (USM3D) Wed 11 Jun 09:00 a.m.  

Jack Langerman · Ruisheng Wang · Dmytro Mishkin · Ilke Demir · Renzhong Guo · Tolga Birdal · Sean Ma · Clement Mallet · Yang Wang · Shangfeng Huang


Classical 3D reconstruction has traditionally focused on low-level representations, and this workshop addresses the need for higher-level, structured and parametric representations like CAD models from images and point clouds, with implications for construction, manufacturing, urban planning, and related fields. The workshop aims to foster interdisciplinary collaboration between 3D vision researchers, photogrammetry, graphics, machine learning, and other domains where structured 3D representations are critical. To advance research in this area, the workshop introduces two large-scale datasets: S23DR, a collection of 3D models with corresponding multiview images, and Building3D, a city-scale dataset for building wireframe model generation from aerial LiDAR. By providing these resources and promoting collaboration, the workshop seeks to catalyze multi-view structured 3D reconstruction trends, bridge industry-academia gaps, and enable applications in urban planning, disaster management, and other critical areas.


Workshop: Foundation Models Meet Embodied Agents Wed 11 Jun 09:00 a.m.  

Manling Li · Ruohan Zhang · Jiayuan Mao · Wenlong Huang · Qineng Wang · Weiyu Liu · Xiaohan Zhang · Yonatan Bisk · Shenlong Wang · Yunzhu Li · Li Fei-Fei · Jiajun Wu

An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of Large Language Models as powerful tools for building Large Agent Models, which have shown remarkable success in supporting embodied agents with abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects). However, moving from Foundation Models to Embodied Agents poses significant challenges in understanding lower-level visual details and in long-horizon reasoning for reliable embodied decision-making. We will cover the advances of foundation models into Large Language Models, Vision-Language Models, and Vision-Language-Action Models. In this workshop, we will comprehensively review existing paradigms for foundation models for embodied agents, focus on their different formulations based on the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view of the robot’s decision-making process. More information at https://foundation-models-meet-embodied-agents.github.io/cvpr2025.
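
For reference, the MDP framework the workshop uses as its organizing lens can be stated compactly in standard textbook notation (not specific to any talk):

```latex
% An MDP is a tuple of states, actions, transition dynamics, reward, and
% discount factor; the agent seeks a policy maximizing expected return.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s_{t+1} \mid s_t, a_t), \qquad R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \right]
```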


Workshop: Uncertainty Quantification for Computer Vision Wed 11 Jun 09:00 a.m.  

Andrea Pilzer · Martin Trapp · Arno Solin · Gianni Franchi · Andrei Bursuc · Marcus Klasson · Angela Yao · TUAN-HUNG VU · Fatma Güney


The UNcertainty quantification for Computer Vision (UNCV) Workshop aims to raise awareness and generate discussion regarding how predictive uncertainty can, and should, be effectively incorporated into models within the vision community. In the era of Generative AI (GenAI), we find this more crucial than ever. The workshop will bring together experts from machine learning and computer vision to create a new generation of well-calibrated and effective methods that know when they do not know.
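
Calibration, the workshop's central notion of "knowing when you do not know," is commonly quantified by the expected calibration error (ECE). Below is a minimal NumPy sketch with equal-width bins on simulated predictions; illustrative only.

```python
# Expected calibration error: average |accuracy - confidence| over bins.
import numpy as np

def ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += (mask.sum() / len(confidences)) * gap  # weight by occupancy
    return err

conf = np.random.rand(1000)                         # predicted confidences
corr = (np.random.rand(1000) < conf).astype(float)  # simulated calibrated model
print(f"ECE = {ece(conf, corr):.3f}")               # near 0: well calibrated
```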


8th International Workshop on Visual Odometry and Computer Vision Applications Based on Location Clues Wed 11 Jun 09:00 a.m.  

Guoyu Lu · Friedrich Fraundorfer · Yan Yan · Nicu Sebe · Chandra Kambhamettu


Visual odometry has attracted substantial interest in computer vision, robotics and mechanical engineering communities, to name a few. This workshop aims to foster scalable algorithms and systems for accurate and real-time visual odometry, addressing the growing demands of location-aware applications. It will explore methods and applications leveraging location cues to enhance scene understanding, city navigation, and other context-rich problems, while emphasizing visual odometry and localization in mobile and robotics domains.


Generalization in Robotics Manipulation Workshop and Challenges Wed 11 Jun 09:00 a.m.  

Shizhe Chen · Ricardo Garcia Pinel · Jiafei Duan · Dieter Fox · Cordelia Schmid · Ivan Laptev · Sami Haddadin

Robotic manipulation is one of the most fascinating and challenging problems in robotics, with broad applications in manufacturing, customer service, healthcare, household tasks and more. While learning-based visual policies have achieved impressive results such as manipulating Rubik’s cubes, they are typically trained and tested in the same environments on specific tasks, lacking generalization capabilities to new scenes, objects and tasks. Recently, foundation models such as large language models (LLMs) and vision-language models (VLMs) have demonstrated strong abilities to encode vast amounts of world knowledge and generalize to new domains, offering a promising path forward for enhancing robots’ generalization capabilities. In this workshop, we aim to unite researchers from different communities to push the boundaries of generalizable robotic manipulation, including foundation models, perception, planning, embodied AI, simulators, sim2real, among others.


Synthetic Data for Computer Vision Workshop Wed 11 Jun 09:00 a.m.  

Jieyu Zhang · Cheng-Yu Hsieh · Zixian Ma · Rundong Luo · Shobhita Sundaram · Wei-Chiu Ma · Ranjay Krishna

4th Workshop on Computer Vision in the Wild Wed 11 Jun 09:00 a.m.  

Jianwei Yang · Chunyuan Li · Jiasen Lu · Reuben Tan · Qianhui Wu · Baolin Peng · Mu Cai · Xuehai He · Hao Zhang · Tianhe Ren · Feng Li · Shilong Liu · Xueyan Zou · Zhengyuan Yang · Xin Wang · Yong Jae Lee · Lei Zhang · Jianfeng Gao

As artificial intelligence continues to evolve, the intersection of vision and language models is becoming increasingly crucial for real-world applications. The 4th Workshop on Computer Vision in the Wild (CVinW) at CVPR 2025 aims to foster discussions and innovations that push the boundaries of computer vision systems in unconstrained environments. Building on the success of our previous workshops: CVPR 2024 CVinW Workshop, CVPR 2023 CVinW Workshop and ECCV 2022 CVinW Workshop, this edition will focus on the next generation of large multimodal models (LMMs) and vision-language-action (VLA) systems, with an emphasis on temporal reasoning, video understanding, and physical interaction.


Workshop: Demographic diversity in computer vision Wed 11 Jun 09:00 a.m.  

Polina Kirichenko · Vikram V. Ramaswamy · Kyle Buettner · Sina Malakouti · Tarun Kalluri · Manmohan Chandraker · Adriana Kovashka · Olga Russakovsky

AI systems should serve all people with diverse values and perspectives around the world. However, as datasets scale, it's widely documented that they exhibit social biases of various forms, which translate to AI systems that cause real-world harm to under-represented demographic groups. A focused investigation of demographic biases in modern foundation models, their real-world impact and mitigation is thus critical to ensure equitable access to future models and their applications. This workshop will highlight diverse voices from around the globe and foster discussion on building inclusive AI.


Workshop on 4D Vision: Modeling the Dynamic World Wed 11 Jun 09:20 a.m.  

Shangzhe Wu · Qianqian Wang · Gengshan Yang · Jiahui Lei · Ruoshi Liu · Yufei Ye · Congyue Deng · Tarasha Khurana · Aleksander Holynski · Carl Doersch

In recent years, we have seen remarkable progress in 3D computer vision, with increasingly robust and efficient models for reconstructing and generating 3D objects and scenes. 4D computer vision, as a natural extension of these efforts, is rapidly gaining traction. This workshop aims to establish a dedicated venue for discussions on this topic, bringing together researchers across various domains to exchange perspectives, identify challenges, and collectively accelerate progress in this space.


Workshop: Exploring the Next Generation of Data Wed 11 Jun 09:50 a.m.  

Nadine Chang · Maying Shen · Jose M. Alvarez · Sifei Liu · Rafid Mahmood · Despoina Paschalidou


Data is more crucial than ever, having enabled everything from the first generation of deep learning models to the new generation of foundation models. These foundation models are rapidly being incorporated into safety-critical applications that touch human life. Thus, the large volume of data they rely on must be of high quality for safe model development. Due to the sheer volume of raw data, we need a scalable ability to rank and select data by its inherent quality and value, for both generic and specific tasks. Recently, foundation models themselves have been used to discover even more data to feed into further foundation model training. This cyclic relationship between data and foundation models introduces another layer of complexity and bias to consider. Overall, the enormous challenge of discovering the next generation of data raises several considerations: defining quality data, bias-free data, scalability, generating data, ethical data gathering, continuous data gathering, and hallucination-free foundation models for data mining. In this workshop, 7 leading experts across academia and industry will discuss how to tackle this large challenge together.


Workshop: Emergent Visual Abilities and Limits of Foundation Models (EVAL-FoMo 2) Wed 11 Jun 12:45 p.m.  

Ashkan Khakzar · A. Koepke · Ameya Prabhu · Jindong Gu · Francesco Pinto · Arsha Nagrani · Boyi Li · Philip H.S. Torr · Trevor Darrell

TLDR: This workshop focuses on analysis and evaluations to understand and identify emerging visual capabilities and pinpoint visual limits in foundation models.

Visual information processing is being transformed by foundation models. Trained on massive datasets using self-supervised and generative methods, these models exhibit the emergence of sophisticated visual abilities (such as depth perception, object recognition, and part discovery) without explicit programming or supervision. This shift marks a new paradigm where neural models derive visual understanding from the intrinsic structures and patterns present in the data rather than from supervisory signals associated with a visual task. Yet questions remain about how to systematically analyze and evaluate these emergent capabilities. Recent studies have also highlighted the models' visual limitations, emphasizing the need for innovative evaluation methods to identify these shortcomings. By evaluating and understanding both the capabilities and limits of these models, we can better compare different learning algorithms and architectures in terms of how they represent the visual world.


Workshop: How to Stand Out in the Crowd? Wed 11 Jun 12:50 p.m.  

Anand Bhattad · Aditya Prakash · Unnat Jain · Angjoo Kanazawa · Georgia Gkioxari · Svetlana Lazebnik

In today’s AI landscape, visibility is harder than ever. The pace is breakneck, arXiv is overflowing, and the pressure to perform is real. So how do early-career researchers cut through the noise?

How do you define a research identity without chasing trends?
How do you publish with purpose, not just pace?
How do you explore emerging areas without getting lost in the noise?
How do you balance mentorship with momentum?

In its third year, this CVPR community-building workshop brings together voices across CV, NLP, ML, and Robot Learning -- Andrea, Carl, Dima, Gül, Jia-Bin, Laura, Ludwig, Saining, Sara, and Shuran -- to answer these questions and more. This is an open forum to share insights, frustrations, and hacks, because no one builds a research career alone.


The 4th Explainable AI for Computer Vision (XAI4CV) Workshop Wed 11 Jun 01:00 p.m.  

Sukrut Rao · Indu Panigrahi · Sunnie S. Y. Kim · Vikram V. Ramaswamy · Rajat Sahay · Avinab Saha · Dahye Kim · Miguel-Ángel Fernández-Torres · Lenka Tětková · Teresa Dorszewski · Bartlomiej Sobieski · Marina Gavrilova · Yuhui Zhang · Pushkar Shukla


Explainability of computer vision systems is critical for people to effectively use and interact with them. This workshop provides a forum for researchers and practitioners to discuss the challenges and opportunities in explainable AI (XAI) for CV, addressing a critical need given the increasing deployment of these systems, by: (1) initiating discussions across researchers and practitioners in academia and industry to identify successes, failures, and priorities in current XAI work; (2) examining the strengths, weaknesses, and underlying assumptions of proposed XAI methods and establishing best practices for their evaluation; and (3) discussing the various nuances of explainability and brainstorming ways to build explainable CV systems that benefit all involved stakeholders.


2nd Workshop on Neural Fields Beyond Conventional Cameras Wed 11 Jun 01:00 p.m.  

Ilya Chugunov · Tzofi Klinghoffer · Shengyu Huang · Wenzheng Chen · Daniel Gilo · Akshat Dave · Lingjie Liu · David B. Lindell · Or Litany · Ramesh Raskar

Neural fields have been widely adopted for learning novel view synthesis and 3D reconstruction from RGB images by modelling transport of light in the visible spectrum. This workshop focuses on neural fields beyond conventional cameras, including (1) learning neural fields from data from different sensors across the electromagnetic spectrum and beyond, such as lidar, cryo-electron microscopy (cryoEM), thermal, event cameras, acoustic, and more, and (2) modelling associated physics-based differentiable forward models and/or the physics of more complex light transport (reflections, shadows, polarization, diffraction limits, optics, scattering in fog or water, etc.). Our goal is to bring together a diverse group of researchers using neural fields across sensor domains to foster learning and discussion in this growing area.
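
The recurring pattern behind these works, fitting a neural field through a differentiable forward model of the sensor, can be sketched in a few lines. The toy "sensor" below is a box blur standing in for lidar, event, or scattering physics; purely illustrative.

```python
# Fit a coordinate MLP so that, after the sensor's forward model, it
# matches the observed measurement (analysis-by-synthesis).
import torch
import torch.nn as nn
import torch.nn.functional as F

field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

ys, xs = torch.meshgrid(torch.linspace(0, 1, 16),
                        torch.linspace(0, 1, 16), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)   # (256, 2) pixel grid

def forward_model(img):        # differentiable stand-in sensor: 3x3 box blur
    k = torch.ones(1, 1, 3, 3) / 9.0
    return F.conv2d(img, k, padding=1)

target = forward_model(torch.rand(1, 1, 16, 16))        # observed measurement
opt = torch.optim.Adam(field.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    scene = field(coords).reshape(1, 1, 16, 16)          # render the field
    loss = ((forward_model(scene) - target) ** 2).mean() # loss in sensor space
    loss.backward()
    opt.step()
print(float(loss))
```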


Workshop: AVA: Accessibility, Vision, and Autonomy Meet Wed 11 Jun 01:00 p.m.  

Eshed Ohn-Bar · Danna Gurari · Hernisa Kacorri · Kris Kitani · Chieko Asakawa · Jennifer Mankoff

The overarching goal of this workshop is to gather researchers, students, and advocates at the intersection of accessibility, computer vision, and autonomous systems. Building upon the success of the previous CVPR workshop (with cross-disciplinary talks, posters, and challenges), this iteration will focus on addressing the lack of shared development tools and vision-based benchmarks for accessibility systems. The workshop will feature a multimodal challenge with synthetic and real-world benchmarks. By fostering discussion and actively engaging people with disabilities, the workshop aims to build a stronger community for accessibility research within computer vision, uncover research opportunities, and encourage the development of more effective and usable real-world visual reasoning models.


Tutorial: Kaijie Zhu

Evaluating Large Multi-modal Models: Challenges and Methods

The proliferation of large multi-modal models (LMMs) has raised increasing concerns about their security and risks, which are mainly due to a lack of understanding of their capabilities and limitations. In this tutorial, we aim to fill this gap by presenting a holistic overview of LMM evaluation. First, we discuss recent advances in LMM evaluation from the perspectives of what, where, and how to evaluate. Then, we present several key challenges in LMM evaluation, such as data contamination and fixed complexity, and introduce how to overcome them. Furthermore, our discussion covers key evaluation metrics, including trustworthiness, robustness, and fairness, as well as performance across diverse downstream tasks in the natural and social sciences. We conclude with an overview of widely used code libraries and benchmarks that support these evaluation efforts. We hope that academic and industrial researchers continue working to make LMMs more secure, responsible, and accurate.




2nd Workshop on Human Motion Generation (HuMoGen) Wed 11 Jun 01:00 p.m.  

Rishabh Dabral



Workshop: CVPR 2025 Photorealistic Avatar Challenge Wed 11 Jun 01:00 p.m.  

Ross Cutler · Julien Valentin · Justus Thies · Babak Naderi · Vishak Gopal

Tutorial: Hao Fei

Evaluations and Benchmarks in Context of Multimodal LLM

Despite the variety of emerging benchmarks for evaluating Multimodal Large Language Models (MLLMs), the validity and effectiveness of MLLM evaluation remain open to further discussion. This tutorial addresses the need for comprehensive and scientifically valid benchmarks in MLLM development. The tutorial will offer a systematic overview of current MLLM benchmarks and discuss the performance enhancements necessary for achieving human-level AGI. We will introduce recent developments in MLLMs, survey benchmarks, and explore evaluation methods. Detailed discussions will cover vision-language capabilities, video-modality evaluations, and expert-level skills across multiple disciplines. We will further identify gaps in benchmarking multimodal generalists and introduce methods to comprehensively evaluate MLLMs. Finally, a special focus will be on addressing and mitigating the frequent hallucination phenomena in MLLMs to enhance model reliability.


Dr. Hao Fei is a Research Fellow at the National University of Singapore, working on Natural Language Processing, Vision and Language, Structural Modeling, and Large Language Models.



Workshop: Image Matching: Local Features and Beyond Wed 11 Jun 01:00 p.m.  

Fabio Bellavia · Jiri Matas · Dmytro Mishkin · Luca Morelli · fabio remondino · Amy Tabb · Eduard Trulls · Kwang Moo Yi



Workshop: Photo-realistic 3D Head Avatars Wed 11 Jun 01:00 p.m.  

Tobias Kirschstein · Simon Giebenhain · Tianye Li · Koki Nagano · Justus Thies · Matthias Nießner

Photorealistic 3D head avatars will play a crucial role in future computer games, visual effects, movie production, and virtual telepresence. In this workshop, we bring together leading academic researchers and industry experts to discuss the technology behind 3D head avatars, current applications, and future trends. In particular, we focus on two key desiderata of 3D head avatars: achieving the highest possible rendering quality and controlling the avatar with a driving signal. To this end, the workshop hosts a challenge on the NeRSemble 3D Head Avatar Benchmark. Challenge participants are invited to submit their methods for two tasks: dynamic novel view synthesis on heads, and monocular FLAME-driven 3D head avatar reconstruction. The authors of the best-performing submission will receive a GPU prize and present their method alongside invited speakers in the workshop.


Workshop: Pixel-level Video Understanding in the Wild Challenge Wed 11 Jun 01:30 p.m.  

Henghui Ding · Nikhila Ravi · Yunchao Wei · Jiaxu Miao · Zongxin Yang · Yi Yang · Si Liu · Yi Zhu · Elisa Ricci · Cees G. M. Snoek · Song Bai · Philip H.S. Torr



Workshop: Multimodal Foundation Models for Biomedicine: Challenges and Opportunities Wed 11 Jun 01:30 p.m.  

Xiaohan Wang

Workshop on Vision-based Assistants in the Real-world Wed 11 Jun 01:30 p.m.  

Apratim Bhattacharyya · Fadime Sener · Roland Memisevic · Bugra Tekin · Edoardo Remelli · Shugao Ma · Guodong Ding · Shweta Mahajan · Angela Yao

Multimodal Algorithmic Reasoning Workshop Wed 11 Jun 01:40 p.m.  

Anoop Cherian · Kuan-Chuan Peng · Suhas Lohit · Honglu Zhou · Le Xue · Kevin A. Smith · Tim Marks · Joshua B. Tenenbaum

In this workshop, we plan to gather researchers working in neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence to showcase their cutting-edge research, discuss the latest challenges, and bring to the forefront problems in perception and language modeling that are often overlooked but are pivotal in achieving true artificial general intelligence. An emphasis of this workshop is on the emerging topic of multimodal algorithmic reasoning, where a reasoning agent is required to automatically deduce new algorithms or procedures for solving real-world tasks: for example, algorithms that use multimodal foundation models for analysis, synthesis, and planning; new approaches to solving challenging vision-and-language mathematical (Olympiad-style) reasoning problems; deriving winning strategies in multimodal games; and procedures for using tools in robotic manipulation. We hope to dive deep into this exciting topic at the intersection of multimodal learning and cognitive science to understand what we have achieved thus far in machine intelligence and what we are lacking in relation to the human way of thinking, through talks from outstanding researchers and faculty that could inspire the audience to search for the missing rungs on the ladder to true intelligence.