

Tutorial

Full-Stack, GPU-based Acceleration of Deep Learning

Maying Shen · Danny Yin · Jason Clemons · Pavlo Molchanov · Jan Kautz · Jose M. Alvarez

Summit 447
[ Project Page ]
Tue 18 Jun 1:30 p.m. PDT — 5 p.m. PDT

Abstract:

This tutorial describes techniques that allow deep learning practitioners to accelerate the training and inference of large deep networks while also reducing their memory requirements, across a spectrum of off-the-shelf hardware and for important applications such as autonomous driving and large language models. Topics include, but are not limited to:

  • Deep learning specialized hardware overview. We review the architecture of the most widely used deep learning acceleration hardware, including the main computational processors and memory modules. We will also cover arithmetic intensity (see the roofline sketch after this list) and give an overview of the theoretical aspects of computing.

  • Best practices for acceleration. We provide an overview of best practices for designing efficient neural networks, including channel number selection (see the channel-rounding sketch below), compute-heavy operations, and reduction operations, among others.

  • Existing tools for model acceleration. This part focuses on existing tools for accelerating a trained neural network on GPU devices. We will discuss, in particular, operation folding (sketched after this list), TensorRT, ONNX graph optimization, and sparsity.

  • Foundation models. Here we will focus on best practices for training and deploying foundation models efficiently (a mixed-precision training sketch follows this list).

  • Research overview of recent techniques. In the last part, we will focus on recent advanced techniques for post-training model optimization, including pruning (sketched below), quantization, model distillation, and neural architecture search (NAS), among others.
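
To make the arithmetic-intensity discussion from the hardware overview concrete, here is a minimal Python sketch that estimates the arithmetic intensity of a matrix multiply and compares it against a GPU's compute-to-bandwidth ratio (the roofline "ridge point"). The peak-throughput and bandwidth figures are illustrative placeholders, not the specifications of any particular device.

```python
# Roofline-style back-of-the-envelope: is a given matmul compute-bound
# or memory-bound on a hypothetical GPU? Hardware numbers are placeholders.

def matmul_arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOPs per byte) of an (m x k) @ (k x n) FP16 matmul."""
    flops = 2 * m * n * k                                    # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A and B, write C
    return flops / bytes_moved

PEAK_TFLOPS = 100.0   # placeholder peak FP16 throughput, TFLOP/s
PEAK_BW_GBS = 1000.0  # placeholder memory bandwidth, GB/s
ridge_point = PEAK_TFLOPS * 1e12 / (PEAK_BW_GBS * 1e9)       # FLOPs/byte

ai = matmul_arithmetic_intensity(m=1, n=4096, k=4096)        # batch-1 inference layer
print(f"arithmetic intensity: {ai:.1f} FLOPs/byte vs. ridge point: {ridge_point:.0f}")
print("memory-bound" if ai < ridge_point else "compute-bound")
```

At batch size 1 this layer performs only about one FLOP per byte moved, far below the ridge point, so it is bandwidth-limited; batching raises the intensity toward the compute-bound regime.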
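
For the channel number selection point, one common practice is to round layer widths to multiples that map well onto GPU GEMM and Tensor Core kernels (e.g. multiples of 8 or 16). Below is a sketch of the make_divisible helper popularized by the MobileNet code bases; the 10% guard against over-shrinking is that code's convention, not a rule stated in this tutorial.

```python
def make_divisible(channels: int, divisor: int = 8) -> int:
    """Round a proposed channel count to the nearest multiple of `divisor`,
    never dropping below `divisor` or shrinking the layer by more than ~10%."""
    rounded = max(divisor, int(channels + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * channels:  # rounded down too far: bump up one step
        rounded += divisor
    return rounded

print([make_divisible(c) for c in (3, 37, 61, 100)])  # -> [8, 40, 64, 104]
```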
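
As a concrete example of operation folding, the sketch below fuses a BatchNorm2d into the preceding Conv2d so that inference executes a single convolution; TensorRT and ONNX graph optimizers apply the same rewrite automatically. Plain PyTorch is assumed.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Absorb an eval-mode BatchNorm's scale/shift into the conv weights."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sigma, per channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16)
bn.eval()  # folding is valid only against the running statistics
x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    assert torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-5)
```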
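
The abstract does not spell out the foundation-model recipe, but one widely used training-efficiency practice is mixed precision. The following is a generic PyTorch sketch, not the presenters' specific method; it assumes a CUDA device, and the model and hyperparameters are placeholders.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # placeholder for a large model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()             # rescales the loss so FP16 grads stay finite

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    opt.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():              # run matmul-heavy ops in half precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)                             # unscales gradients, then steps
    scaler.update()                              # adapts the scale factor over time
    return loss.item()

x = torch.randn(32, 1024, device="cuda")
print(train_step(x, torch.randn(32, 1024, device="cuda")))
```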
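
Finally, as a toy instance of the post-training techniques in the last bullet, the sketch below applies unstructured magnitude pruning: the smallest-magnitude fraction of a weight tensor is zeroed. Production pipelines (structured or 2:4 sparsity, sensitivity-aware schedules, fine-tuning after pruning) are considerably more involved.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a copy of `weight` with the smallest-|w| fraction set to zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

w = torch.randn(256, 256)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"fraction zeroed: {(w_sparse == 0).float().mean():.2%}")  # ~50%
```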
