CVPR 2025 Tutorials

Scalable Generative Models in Computer Vision

Tutorial

Nanye Ma

[ 202 B ]

Abstract

Generative models have emerged as a transformative force in computer vision, enabling breakthroughs in image, video, and 3D content synthesis. Recent advancements in model architectures and generative frameworks have driven unprecedented scalability, allowing models to handle larger datasets, longer context lengths, and more complex distributions. This tutorial will provide a comprehensive discussion of these advancements, focusing on frontier techniques for scaling generative models and their applications to video synthesis, 3D reconstruction, and virtual world simulation. Attendees will gain insights into the design principles behind scalable models, learn about key technical innovations, and understand the broader implications for the future of computer vision. By addressing both theoretical and practical aspects, this tutorial aims to equip researchers with the knowledge to explore, build, and deploy next-generation scalable generative models.

Volumetric Video in the Real World

Tutorial

Amrita Mazumdar

[ 205 B ]

Abstract

Volumetric video, which encodes a time-varying 3D scene into a unified representation for novel-view synthesis of dynamic contents, has long been a grand challenge for achieving immersive experiences. High-quality volumetric video enables new and immersive applications, such as 3D video conferencing, 3D telepresence, and virtual tutoring in XR. Recent volumetric representation enables fast and high-quality reconstruction of dynamic 3D scenes. As such, our tutorial summarizes practical challenges towards generating and distributing volumetric video in the real world. Specifically, invited talks in this tutorial will cover: (1) compression and performance optimization for 4D reconstruction, such as dynamic Gaussian splatting, quantization, and autoencoders; (2) volumetric video reconstruction from single or sparse-view captures; (3) reconstruction of indoor and urban scenes with dynamic content; (4) reconstruction and playback of dynamic 4D humans in the real-world; and (5) integration of volumetric video with vision-language models for other applications. Challenges across video domains, such as dynamic humans, automotive video, and synthetically generated video, will be thoroughly discussed.

Foundations of Interpretable AI

Tutorial

Aditya Chattopadhyay

[ 106 A ]

Abstract

In recent years, interpretability has emerged as a significant barrier to the widespread adoption of deep learning techniques, particularly in domains where AI decisions can have consequential impacts on human lives, such as healthcare and finance. Recent attempts at interpreting the decisions made by a deep network can be broadly classified in two categories, (i) methods that seek to explain existing models (post-hoc explainability), and (ii) methods that seek to build models that are explainable by design. This tutorial aims to provide a comprehensive overview of both approaches along with a discussion on their limitations.

From Video Generation to World Model

Tutorial

Zhaoxi Chen

[ 204 ]

Abstract

In the past few years, the research community has witnessed remarkable advancements in generative models, especially in the realm of video generation. Generating compelling and temporal coherent videos is challenging but demanding. To overcome these challenges, early text-to-video (T2V) methods have explored the potential of text-to-image (T2I) pretraining, such as Make-A-Video, MagicVideo, and Lavie. With the success of Diffusion Transformers (DiT), the first T2V model, which can support generating up to 40 seconds and high-fidelity videos, named SORA, was proposed. The availability of large-scale high-quality video datasets are proved to be indispensable. Later methods, including CogVideoX and MovieGen, have further explored the potential of 3D VAE. However, the current largest T2V model still fails to maintain the physical standard in most of the generative videos. On the other hand, recent work such as Genie, Genie-2, and GameNGen has presented promising results towards action conditioned video generation, showing the great potential of controllable video generation toward world models. Thus, in this tutorial, we first would like to give a comprehensive background on text-to-video generation – by reviewing the previous and most recent advanced T2V methods. Then, we would like to discuss the connection, future directions, and potential solution from the current …

Cognitive AI for the Future: Agentic Multimodal Models and RAG for Vision Language Applications, from Training to Deployment

Tutorial

Zhuo Wu

[ 401 AB ]

Abstract

Cognitive AI represents a transformative leap in how machines understand and interact with the world. Despite its potential, practical challenges remain in making these systems accessible and applicable across diverse domains. This tutorial addresses how multimodal models, combined with Retrieval-Augmented Generation (RAG) and agentic workflows, can enable cognitive AI systems to deliver personalized, context-aware solutions. With applications ranging from educational tools to assistive technologies for the elderly and disabled, this tutorial focuses on practical strategies for training, optimizing, and deploying these models and pipelines, making them both scalable and accessible to researchers and practitioners.

The 2nd Point Cloud Tutorial: All You Need To Know About 3D Point Cloud

Tutorial

Xiaoyang Wu

[ 202 A ]

Abstract

Point cloud is a data structure that is quite prevalent in 3D vision, which plays an important role in eras like 3D perception, 3D generation, autonomous driving, embodied AI, etc. However, there has not been a comprehensive resource that covers the state-of-the-art approaches and engineering details in point cloud processing. This tutorial aims to provide a comprehensive understanding of point cloud processing and analysis. Participants will delve into various aspects of point cloud data, exploring fundamental layers, network engineering considerations, pre-training technology, and acceleration library for point cloud processing. Through a combination of lectures, attendees will gain insights into the latest developments in the field and learn how to make informed choices when working with point cloud data. For the 2nd point cloud tutorial at CVPR 2025, we aim to move beyond traditional topics like backbone design and pre-training technologies covered in the 1st tutorial. This time, we will also explore challenges and opportunities in applications such as Autonomous Driving, Robotic Learning, and Egocentric Perception in AR/VR. With a diverse background spanning industry and academia, foundational research, and application-driven innovations, we offer a comprehensive perspective on the future of point cloud technology.

Tackling 3D Deep Learning, Gaussian Splats and Physics Simulation with NVIDIA Kaolin Library, a Hands-On Lab

Tutorial

Clement Tsang Tsang

[ 107 B ]

Abstract

3D Deep Learning often demands extensive boilerplate work such as managing data, camera conventions, and visualizing novel 3D representations. NVIDIA’s Kaolin Library, built on PyTorch, addresses these with tools like convenience APIs, reusable research modules, and GPU-optimized operations. The library’s updates are designed to address the evolving needs of the research community. Recent examples include supporting emerging representations like 3D Gaussian Splats (3DGS), and physics-based simulations for dynamic 3D modeling. Initially developed for internal use, Kaolin is shared externally under an open-source license. The tutorial will provide hands-on coding experience to equip attendees with practical skills for using Kaolin. In this session, we explore interactive tools 3DGS viewing in Jupyter, how to create optimizable physical simulations, and finally convert between common 3D representations to export results. GPU back ends will be provided. By the end of the tutorial, attendees will be able to utilize Kaolin’s features to streamline their research workflows and accelerate their projects.

Evaluating Large Multi-modal Models: Challenges and Methods

Tutorial

Kaijie Zhu

[ 109 ]

Abstract

The proliferation of large multi-modal models (LMMs) has raised increasing concerns about their security and risks, which are mainly due to a lack of understanding of their capabilities and limitations. In this tutorial, our aim is to fill this gap by presenting a holistic overview of LMM evaluation. First, we discuss the recent advance of LMMs evaluation from the perspectives of what, where, and how to evaluate. Then, we present several key challenges in LMM evaluation such as data contamination and fixed complexity. Based on these challenges, we introduce how to overcome these challenges. Furthermore, our discussion covers key evaluation metrics including trustworthiness, robustness, and fairness, as well as performance across diverse downstream tasks in natural and social sciences. We conclude with an overview of widely-used code libraries and benchmarks that support these evaluation efforts. We hope that academic and industrial researchers continue to work to make LMMs more secure, responsible, and accurate.

Evaluations and Benchmarks in Context of Multimodal LLM

Tutorial

Hao Fei

[ 208 A ]

Abstract

Despite existing various emerging benchmarks for evaluating Multimodal Large Language Models (MLLMs), the evaluation of MLLMs validity and effectiveness might remain open to further discussion. This tutorial addresses the need for comprehensive and scientifically valid benchmarks in MLLM development. The tutorial will offer a systematic overview of current MLLM benchmarks and discuss necessary performance enhancements for achieving human-level AGI. We will introduce recent developments in MLLMs, survey benchmarks, and explore evaluation methods. Detailed discussions will cover vision-language capabilities, video modality evaluations, and expert-level skills across multiple disciplines. We’ll further identify gaps in benchmarking the multimodal generalists, and introduce methods to comprehensively evaluate MLLMs. Finally, a special focus will be on addressing and mitigating the frequent hallucination phenomena in MLLMs to enhance model reliability.

Geospatial Computer Vision and Artificial Intelligence for Large-Scale Earth Observation Data

Tutorial

Orhun Aydin

[ 103 B ]

Abstract

Earth observation (EO) data has applications in agriculture, disaster management, and security. This tutorial explores integrating CV and EO data using diverse sensing types. Attendees will learn about open-source tools, multimodal reasoning, geospatial foundation models, and hands-on analysis of EO data for environmental and climate monitoring.

Sense, Perceive, Interact & Render on Android XR

Tutorial

Sean Fanello

[ 201 B ]

Abstract

This tutorial details the perception stack built for Android XR, including head, hand, face, and eye tracking. It covers data capture, rendering, photorealistic avatars, and scene understanding. Use cases highlight the stack's application in immersive and interactive experiences.

Efficient Text-to-Image/Video modeling

Tutorial

Srikumar Ramalingam

[ 202 A ]

Abstract

We are witnessing groundbreaking results in image-to-text and image-to-video models. However, the generation process with these models is iterative and computationally expensive. There is a growing need to make these algorithms faster for serving millions of users efficiently. This course focuses on techniques such as progressive parallel decoding, distillation methods, and Markov Random Fields to accelerate text-to-image and text-to-video models. The course also critiques popular evaluation techniques like FID and introduces efficient alternatives such as CMMD.

Multi-Modal Computer Vision and Foundation Models In Agriculture in conjunction with IEEE CVPR 2025

Tutorial

Chris Padwick

[ 107 B ]

Abstract

With the recent success of computer vision and deep learning in various applications, there has been significantly increasing attention paid to its use in agriculture. Agriculture-related vision problems are of great economic and social value. For example, robotics has recently been reinvigorated with work on Vision-Language-Action models. Building on these successes, researchers are using multi-modal computer vision foundation models to make progress on agricultural tasks and topics. Some relevant examples include: 1) Agricultural models that leverage data from different remote sensing platforms; 2) Multi-temporal yield prediction models using unsupervised domain adaptation; 3) Multi-modal models for identifying pests and weeds. This tutorial will encourage research in ML, CV, and agriculture, featuring leading researchers discussing the evolution and trends in this field.

Animal re-identification

Tutorial

Lukas Picek

[ 202 C ]

Abstract

This tutorial introduces the field of individual animal re-identification (ReID), crucial for ecological monitoring, conservation, and ethical wildlife research. Accurate animal ReID supports long-term monitoring of endangered species, combatting poaching, and understanding animal behavior. This half-day hybrid tutorial includes multiple talks and a panel discussion to encourage interaction and research directions.

Computer Vision over Homomorphically Encrypted Data

Tutorial

Vishnu Naresh Boddeti

[ 205 B ]

Abstract

Over the past decade, computer vision (CV) systems have become integral to healthcare, surveillance, and personal devices. The sensitive nature of data and models raises privacy concerns. Fully homomorphic encryption (FHE) allows computations on encrypted data, ensuring privacy. This tutorial explores integrating FHE into CV, addressing its challenges, mathematical foundations, key FHE schemes, SIMD capabilities, and hands-on demonstrations. It covers private and encrypted CV tasks and discusses open research directions.

Edge AI in Action: Technologies and Applications

Tutorial

Fabricio Narcizo

[ Davidson C3 ]

Abstract

Edge AI in Action is a hands-on tutorial exploring practical tools to develop and deploy AI models on resource-constrained devices. Topics include model optimization, deployment of LLMs and CV models, and integration with cloud-edge architectures. Demonstrations include devices like Raspberry Pi, iPhones, and Androids. Attendees will gain actionable insights into real-world Edge AI.

Robotics 101: An Odyssey from A Vision Perspective

Tutorial

Chonghao Sima

[ 202 B ]

Abstract

This full-day tutorial offers a vision-focused introduction to robotics. It covers foundational background, technical advancements, key challenges, and emerging directions. With diverse speakers from multiple domains, the tutorial is divided into two sessions: 'Perceive the World' and 'Interact with the World', addressing perception and interaction in robotics.

3D Shape Analysis: From Classical Optimization to Learning-based Matching

Tutorial

Viktoria Ehm

[ 204 ]

Abstract

3D shape analysis deals with extracting information from geometric data, with applications in driving, biomedicine, and AR/VR. This tutorial covers classical shape matching methods (linear and quadratic assignment problems), product graph formalisms, learning-based correspondence, spectral methods, and real-world applications. Challenges and future directions are also addressed.

Continuous Data Cycle via Foundation Models

Tutorial

Nadine Chang

[ 401 AB ]

Abstract

Foundation models are being continuously integrated into applications like autonomous driving and diagnostics. This tutorial explores the data-model feedback loop: how foundation models affect data curation and vice versa. Talks cover leveraging foundation models to build efficient data engines, enhancing model performance, and addressing data relevance, scale, and quality.

Recent Advances in Vision Foundation Models

Tutorial

Zhengyuan Yang

[ 401 AB ]

Abstract

This tutorial covers cutting-edge developments in vision foundation models. Topics include multimodal understanding and generation, scaling test-time compute, and applications for physical and virtual agents. The session will provide insights into the design and future directions of vision-based foundation models.

Intelligent Healthcare based on Cameras and Wireless Sensors

Tutorial

Wenjin Wang · Daniel McDuff

[ 212 ]

Abstract

This tutorial explores contactless health monitoring using cameras and RF sensors. Topics include measuring vital signs from skin or body imagery, emotion recognition, sleep staging, and activity recognition. It covers radar, WiFi, RFID, and acoustic-based RF sensing, highlighting multi-modal techniques that improve monitoring in healthcare, telemedicine, sports, and driver safety.

Full-Stack, GPU-based Acceleration of Deep Learning and Foundation Models

Tutorial

Jason Clemons, Hongxu (Danny) Yin, and Xinglong Sun

[ 205 A ]

Abstract

This tutorial offers insights across the hardware-software stack to accelerate deep neural networks, from convolutions to multimodal LLMs. Attendees will learn practical tools and trade-offs to optimize performance and inspire the next generation of scalable acceleration techniques.

Identifying Structure in Data: All you need to know about Dimensionality Reduction, Clustering and more

Tutorial

Constantin Seibold

[ 106 B ]

Abstract

This tutorial explores techniques for dataset curation, quality monitoring, dimensionality reduction (t-SNE, UMAP, h-NNE), and clustering (k-means, DBSCAN, FINCH). Attendees will learn how to use these methods to understand structure, reduce bias, detect outliers, and improve performance in AI and CV workflows.

Multimodal Mathematical Reasoning: Frontiers in Integrating Vision, Language, and Symbolic Representations

Tutorial

Tianyu Yang

[ 202 C ]

Abstract

This tutorial surveys the growing field of multimodal mathematical reasoning, combining CV, NLP, and symbolic logic. It addresses diagram interpretation, symbolic notation, and multi-step logic. Attendees will explore datasets, models, and evaluation, and discuss applications in education and science.

Power-efficient neural networks using low-precision data types and quantization

Tutorial

Thomas Pfeil

[ 205 B ]

Abstract

As neural networks grow, sustainability and cost become major challenges. This tutorial covers low-precision data types, quantization methods, and hands-on applications. Attendees will gain tools to maintain model performance while optimizing for efficiency on edge and large-scale deployments.