Edge AI in Action: Mastering On-Device Inference
The Principles of Diffusion Models: Real-Time Continuous & Discrete Diffusion
Overview
We present a concise, hands-on tutorial on fast diffusion-based generation across continuous and discrete data, featuring live demos that attendees can readily adapt for their own research.
Continuous Diffusion
The first part is based on The Principles of Diffusion Models, which unifies diffusion through variational, score-based, and flow-based viewpoints, then focuses on efficiency: ODE samplers (Euler/Heun-type), distillation of pretrained diffusion models into few-step generators (e.g., DMD), and flow-map alternatives, including Consistency Models, Consistency Trajectory Models, and MeanFlow. We focus on first principles, together with practical training recipes and live demos.
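As a rough illustration of the Euler/Heun-type ODE samplers mentioned above, the following minimal sketch integrates a probability-flow ODE on a toy problem. It is not taken from the tutorial materials. It assumes 1-D Gaussian data with standard deviation SIGMA_DATA and a variance-exploding parameterization in which the noise level sigma acts as time, so that the score is known in closed form and no network is needed. All names (velocity, sigma_grid, euler_sample, heun_sample, exact_solution) are hypothetical.

```python
import math

SIGMA_DATA = 0.5  # assumed std of the toy 1-D Gaussian data distribution


def velocity(x, sigma):
    """dx/dsigma of the probability-flow ODE for N(0, SIGMA_DATA^2) data.

    At noise level sigma the noised marginal is N(0, SIGMA_DATA^2 + sigma^2),
    so the score is -x / (SIGMA_DATA^2 + sigma^2) and dx/dsigma = -sigma * score.
    """
    return sigma * x / (SIGMA_DATA**2 + sigma**2)


def sigma_grid(n_steps, sigma_max=10.0, sigma_min=0.01):
    """Geometrically spaced noise levels from sigma_max down to sigma_min."""
    ratio = sigma_min / sigma_max
    return [sigma_max * ratio ** (i / n_steps) for i in range(n_steps + 1)]


def euler_sample(x, sigmas):
    """First-order sampler: one velocity evaluation per step."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * velocity(x, s_cur)
    return x


def heun_sample(x, sigmas):
    """Second-order sampler: Euler predictor plus trapezoidal corrector."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        h = s_next - s_cur
        d_cur = velocity(x, s_cur)
        x_pred = x + h * d_cur               # Euler predictor
        d_next = velocity(x_pred, s_next)    # slope at the predicted point
        x = x + 0.5 * h * (d_cur + d_next)   # average the two slopes
    return x


def exact_solution(x_start, sigma_start, sigma_end):
    """Closed-form ODE solution for the toy Gaussian case."""
    num = SIGMA_DATA**2 + sigma_end**2
    den = SIGMA_DATA**2 + sigma_start**2
    return x_start * math.sqrt(num / den)


if __name__ == "__main__":
    x_start = 7.0  # an arbitrary starting point at the highest noise level
    for n_steps in (10, 40):
        sigmas = sigma_grid(n_steps)
        exact = exact_solution(x_start, sigmas[0], sigmas[-1])
        err_euler = abs(euler_sample(x_start, sigmas) - exact)
        err_heun = abs(heun_sample(x_start, sigmas) - exact)
        print(f"steps={n_steps:3d}  euler_err={err_euler:.5f}  heun_err={err_heun:.5f}")
```

Heun uses two velocity evaluations per step instead of one, so the comparison that matters in practice is accuracy per function evaluation rather than per step. With a learned score network in place of the analytic velocity, each evaluation would be a forward pass.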
Discrete Diffusion
The second part focuses on discrete diffusion. We introduce its core theoretical foundations, with emphasis on Diffusion Duality, which shows how discrete diffusion processes can emerge from Gaussian diffusion and provides a principled way to design discrete analogues of continuous-space methods. Building on this framework, we present Discrete Consistency Distillation for few-step generation in discrete diffusion models, and walk through its training and practical implementation. We conclude by exploring two families of samplers: those enabling few-step generation and those supporting inference-time scaling.
The tutorial is intended for participants familiar with neural networks and PyTorch, with some background in classic generative modeling concepts.
Accelerated Diffusion Models: From Theory to Interactive World Models
Diffusion models and flow-based methods have revolutionized generative learning in the visual domain, setting new standards for image, video, and 3D content creation. However, as the field shifts toward interactive applications—such as real-time editing, world models, and embodied AI—the need for low-latency feedback has become critical. Currently, the high computational cost of iterative sampling hinders real-world deployment. While various acceleration techniques exist, the lack of a unified resource makes it difficult to bridge the gap between theory and practice.
Principled Interpretability in Vision Models: From Mechanistic Understanding to Interpretable Models by Design
As deep learning systems are increasingly deployed in high-stakes applications, understanding their behavior is critical for ensuring trust and safety. Interpretability provides essential tools to explain, debug, and improve these models. However, the field remains fragmented, spanning a wide range of methods and assumptions, while lacking standardized evaluation protocols. This tutorial aims to provide a unified overview of interpretability in deep learning, bridging post-hoc mechanistic understanding and methods to design inherently interpretable deep learning models. By the end of this tutorial, attendees will gain a solid understanding of modern interpretability methods for deep learning models, how to rigorously evaluate them, and open research directions in this critical area.
Workshop on Autonomous Driving
LatinX in Computer Vision Research Workshop
AI for Creative Visual Content Generation, Editing and Understanding
4th Workshop on Vision Based Industrial Inspection
The 2nd International Workshop & Challenge on Subtle Visual Computing @CVPR 2026
Workshop on "Bitter Lessons"
Autonomous Understanding Through Open-world Perception and Integrated Language models for On-road Tasks
IPA: Interactive Physical AI Workshop
Sense of Space: Multi-Sensory Modeling for Embodied Intelligence
Workshop on World Models Meet Active Sensing and Closed-Loop Planning
AERO-HPR: Human Perception and Recognition in Aerial Surveillance
2nd Workshop on Photorealistic 3D Head Avatars
PHAROS AI Factory for Medical Imaging & Healthcare
Computational Cameras and Displays
Workshop on Vision-based Assistants in the Real-World
Workshop on Agentic AI for Visual Media
The 3rd Workshop on AI for Content Generation, Quality Enhancement and Streaming
Multimodal Foundation Models for Biomedicine: Challenges and Opportunities
The 5th Explainable AI for Computer Vision (XAI4CV) Workshop
Computer vision for high-stakes, real-world applications necessitates robust explanation and transparency to ensure trust, accountability, and ethical deployment. Celebrating its 5th Anniversary, the Explainable AI for Computer Vision (XAI4CV) workshop provides a premier forum for the entire spectrum of XAI research, from interpretable-by-design models to challenges in multimodal foundational models. The program includes invited talks, spotlight papers, a poster session, and a tutorial. XAI4CV accepts paper and demo submissions to define the future of trustworthy visual AI.
Third Joint Egocentric Vision (EgoVis) Workshop
Visual General Intelligence
AI4RWC: The 2nd International Workshop on Vision Intelligence for Real-world Challenges
GRAIL-V: Grounded Retrieval & Agentic Intelligence for Vision-Language
The 3rd Workshop on Human Motion Generation - New Perspective on Simulation, Animation, and VR applications
The Second CVPR Workshop on Foundation and Large Vision Models in Remote Sensing (MORSE)
13th Workshop on Fine-grained Visual Categorization
The 5th DataCV Workshop and Challenge
Bridging Vision, Language, and Action: What’s Missing in Actionable Visual Perception for Robotics
Women in Computer Vision
Efficient Deep Learning for Computer Vision
The 1st Workshop on Deployment of Foundation Models for Embodied AI
1st Workshop on Video World Models: Interaction, Memory, and Efficiency
The 3rd Workshop on Foundation Models for Medical Vision
Foundation Models for Autonomous Driving
The 22nd Embedded Vision Workshop
The 2nd Workshop on Multimodal Spatial Intelligence
22nd Workshop on Perception Beyond the Visible Spectrum
From Lab Demos to Daily Tasks: Embodied Intelligence in the Wild
The 5th Workshop on Federated Learning for Computer Vision
Proposal for 12th Workshop on Medical Computer Vision, CVPR 2026
The 3rd AI for Visual Arts Workshop and Challenges
Multimodal Alignment for a Pluralistic Society
On Sensor Vision Workshop
Urban Scene Modeling: Structured, Semantic, and Synthetic 3D Habitats
Generative AI for Sign Language
Generative AI for XR and Identity-based Applications
AI for Content Creation
From Perception to Simulation: The Emergence of World Models in Multi-modal Reasoning
World models are rapidly reshaping artificial intelligence, evolving from systems that passively perceive the world into engines capable of simulating, reasoning, and planning within it. This tutorial examines how recent advances in generative modeling, self-supervised learning, and multimodal architectures are enabling machines to move beyond recognition and prediction toward mental simulation, counterfactual reasoning, and decision making.
We will explore the foundations of world models, approaches for learning dynamics from visual and multimodal data, and the integration of planning and reasoning. The tutorial highlights connections between video generation, diffusion models, discrete representations, and embodied AI, while addressing key challenges such as grounding, causality, physical consistency, and evaluation.
Designed for researchers, practitioners, and students, this session provides both conceptual insights and practical perspectives on building AI systems that reason about environments rather than merely interpreting them.
Building GenAI based Simulation Environment for End-to-End Autonomous Driving
End-to-end autonomous driving systems require simulation environments capable of exposing models to diverse, realistic, and safety-critical long-tail events that rarely appear in real-world data. Traditional simulators, which rely on scripted scenarios, simplified traffic logic, and static 3D assets, capture only a narrow slice of real traffic complexity and fail to exercise modern data-driven AV stacks in a meaningful, system-level manner. As end-to-end policies blur the boundaries between perception, prediction, and planning, new generative, data-first, closed-loop simulation workflows are needed to bridge the gap between real-world distributions and synthetic environments.
This tutorial aims to demonstrate how generative AI and world models can build end-to-end simulation pipelines that directly support learning-based AV systems. We focus on practical, reproducible methods involving city-scale digital twins, data-driven traffic behavior models, generative corner-case synthesis, and sensor-level simulation tailored for perception and end-to-end policies. Participants will gain both conceptual understanding and hands-on entry points (code, tools, datasets, and minimal templates) to design or extend their own generative simulation systems.
This tutorial walks through the complete pipeline for generative end-to-end AV simulation. We begin by defining what distinguishes end-to-end simulation from classical AV simulators and how policy-driven requirements reshape simulation design. We then introduce world modeling and city-scale digital twins, covering data-driven reconstruction of road layouts, traffic rules, and naturalistic human driving behavior. Next, we discuss generative modeling of rare and adversarial scenarios derived from crash reports, regulations, or textual descriptions. We follow with sensor and video simulation, comparing graphics engines, neural rendering, and video foundation models for producing realistic, multi-view, and temporally consistent sensor data.
Finally, we integrate these components into a full pipeline and discuss system-level evaluation, failure analysis, and open challenges in validating generative simulation and aligning with safety standards.
Speakers
To be announced.
Schedule
Title | Speaker | Time
Introduction & Motivation | TBD | TBD
Module 1: World Modeling & Digital Twins | TBD | TBD
Module 2: Generative Corner-Case & Scenario Synthesis | TBD | TBD
Break | - | TBD
Module 3: Sensor Simulation, Video Generation & End-to-End Pipelines | TBD | TBD
Module 4: Testing Open-Source AV Stacks | TBD | TBD
Closing Discussion | TBD | TBD
Organizers
Henry Liu, University of Michigan
Howie Sun, SaferDrive AI
Jun Gao, University of Michigan / NVIDIA
Shuo Feng, Tsinghua University
Xintao Yan, University of Hong Kong
Jiawei Wang, University of Michigan
Related Publications & Resources
Feng, S., Sun, H., Yan, X., Zhu, H., Zou, Z., Shen, S., & Liu, H. X. (2023). Dense reinforcement learning for safety validation of autonomous vehicles. Nature, 615(7953), 620–627.
Liu, H. X., & Feng, S. (2024). Curse of rarity for autonomous vehicles. Nature Communications, 15(1), 4808.
Yan, X., Zou, Z., Feng, S., Zhu, H., Sun, H., & Liu, H. X. (2023). Learning naturalistic driving environment with statistical realism. Nature Communications, 14(1), 2037.
Feng, S., Yan, X., Sun, H., Feng, Y., & Liu, H. X. (2021). Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. Nature Communications, 12(1), 748.
Sun, H., Yan, X., Qiao, Z., Zhu, H., Sun, Y., Wang, J., ... & Liu, H. X. (2025). TeraSim: Uncovering unknown unsafe events for autonomous vehicles through generative simulation. arXiv preprint arXiv:2503.03629.
Wang, J., Sun, H., Yan, X., Feng, S., Gao, J., & Liu, H. X. (2025). TeraSim-World: Worldwide safety-critical data synthesis for end-to-end autonomous driving. arXiv preprint arXiv:2509.13164.
Ren, X., Lu, Y., Cao, T., Gao, R., Huang, S., Sabour, A., ... & Ling, H. (2025). Cosmos-Drive-Dreams: Scalable synthetic driving data generation with world foundation models. arXiv preprint arXiv:2506.09042.
TeraSim: https://github.com/mcity/TeraSim
Cosmos-Drive: https://github.com/nv-tlabs/Cosmos-Drive-Dreams
Towards Safe Multi-Modal Learning: Evolving Threats and Safety Solutions
Monte Carlo physical simulation
Abstract: Accurately analyzing large amounts of geometric data is critical for many scientific and engineering applications. Techniques based on partial differential equations (PDEs) provide powerful tools for analyzing physical systems, but conventional solvers are not at a stage where they “just work” on problems of real-world complexity. A constant challenge is spatial discretization, which divides the domain into a high-quality volumetric mesh or background grid for PDE-based analysis. Unfortunately, this approach does not scale well to modern computer architectures, and as such, there remains a large divide between our ability to visualize and analyze the natural world.
3D Human Mesh Modeling and Recovery from RGB and LiDAR
The understanding of human pose and shape is the cornerstone of multiple AI applications, ranging from monitoring, AR/VR, sports and posture analysis, and human-robot interaction all the way to autonomous driving. Accurate human perception enables digital systems to interact appropriately with people in both indoor and outdoor environments.
Recent advances have pushed the field forward: modern methods now begin to achieve strong in-the-wild Human Mesh Recovery (HMR) performance, making them more reliable and useful for a wide variety of downstream tasks. With this growing interest, the community has seen the emergence of datasets and shape-recovery models, as well as an expanding range of input modalities, including RGB, depth, and LiDAR. At the same time, multiple human body models are being developed, each offering different levels of detail, interpretability, and expressivity.
While these developments open up exciting new opportunities, they also introduce new challenges. Designing and deploying human mesh recovery systems remains difficult due to dependency on the chosen body model, peculiarities of single-person and multi-person settings, challenges of occlusions and interactions with the 3D scene, and the reliance on data-hungry training pipelines.
This tutorial is therefore motivated by the need for a clear, structured, and accessible overview of the current HMR landscape. The increasing use of foundation models and large-scale pretrained systems makes it particularly timely to disseminate a clear picture of the underlying principles of human body modeling and HMR, so that these methods can be more easily adopted, extended, and applied to adjacent fields beyond core human pose estimation. Our goal is to lower the entry barrier for newcomers, provide a unifying perspective for practitioners, and foster collaboration between communities working on human modeling, 3D vision, graphics, and embodied AI. By providing access to these concepts, we aim to maximize the impact of recent advances and encourage their use in downstream applications.
DataMFM: Emerging Directions in Data for Multimodal Foundation Models
Cognitive Foundations for Multimodal Models
10th Affective & Behavior Analysis in-the-wild
Computer Vision for Biomechanics Workshop
Authenticity & Provenance in the age of Generative AI
Computer Vision for the Built World
Workshop Proposal: AI-assisted Long Video Creation
3rd Workshop on ScanNet++ Novel View Synthesis and 3D Semantic Understanding Challenge
The 2nd 3D-LLM/VLA Workshop: Bridging Language, Vision and Action in 3D Environments
The 5th Workshop on “What is Next in Multimodal Foundation Models?”
3rd Workshop on Efficient and On-Device Generation (EDGE), CVPR 2026
Synthetic & Adversarial ForEnsics
The 7th International Workshop and CVML Challenge on Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
Spatial Intelligence for Cultural Heritage
Sixth Workshop on Neural Architecture Search
Computer Vision with Small Data: Beyond Scale -- Toward Data-Efficient Dynamically-Aware Video Intelligence
End-to-End 3D Learning
OpenSUN3D: 6th Workshop on Open-World 3D Scene Understanding with Foundation Models
The 1st Workshop on Monitoring the World through an Imperfect Lens
Rediscovering Intelligence: Can AI Still Learn from Humans?
The 1st Workshop on Vision for Intelligent Task Assistants
Auto-Annotation with Expert-Crafted Guidelines
Machine-learned visual systems are transforming numerous fields such as autonomous driving, biodiversity assessment, and ecological monitoring, but they hunger for vast, high-quality annotated data. Asking domain experts to manually annotate large-scale data is unrealistic; the current paradigm to scale up data annotation is to have domain experts craft annotation guidelines using visual examples and descriptions for non-expert annotators to apply. This paradigm is commonly adopted by companies which provide data labeling services. Lacking domain knowledge, ordinary annotators often produce annotations that are erroneous, subjective, biased, and inconsistent. Further, this process is labor-intensive, tedious, and costly. This workshop aims to pioneer auto-annotation, developing AI agents that can interpret expert-crafted annotation guidelines and generate labels automatically. In essence, we seek to replace ordinary human annotators with AI.
Machine Unlearning for Vision
2nd Workshop on Multimodal Sign Language Recognition
MSLR 2026 is the second edition of a rapidly growing venue on multimodal sign language recognition and translation. The program combines invited talks, a peer-reviewed track published in CVPR Workshops, and the SignEval Challenge featuring updated datasets for isolated LIS and continuous SLR. We emphasize privacy-preserving sensing (e.g., radar), healthcare accessibility, and inclusive practices with sign interpreters. Building on the success at ICCV 2025, MSLR 2026 will consolidate a global, interdisciplinary community spanning computer vision, linguistics, healthcare, and Deaf studies.