TAPE: Task-Adaptive Prototype Evolution in Audio-Language Models for Fully Few-shot Class-incremental Audio Classification
Yunlong Gao ⋅ Wenxin Liang ⋅ Guanglu Wang ⋅ Senqi Guan ⋅ Linlin Zong ⋅ Dongyu Zhang ⋅ Xinyue Liu
Abstract
Fully Few-shot Class-incremental Audio Classification (FFCAC) is challenging since training samples are limited in both the base session and the incremental sessions. Existing few-shot learning methods suffer from catastrophic forgetting and overfitting when applied to FFCAC. Pre-trained Audio-Language Models (ALMs) have achieved success in many audio learning tasks. However, we find that directly applying an ALM to FFCAC is impractical, since misalignment between text and audio causes even more severe catastrophic forgetting and overfitting. We propose a Task-Adaptive Prototype Evolution (TAPE) framework that enables ALMs to tackle the challenges of FFCAC. It consists of two key components: (1) a Task-Adapter that isolates audio features in a metric space to mitigate catastrophic forgetting while preserving knowledge across sessions, and (2) a Prototype Evolution mechanism that dynamically refines class prototypes using query samples during inference, thereby enabling adaptive learning and reducing overfitting. To the best of our knowledge, we are the first to apply ALMs to the FFCAC task. We conduct experiments on three audio datasets: NSynth-100 (instrument recognition), FSC-89 (event detection), and LBS-100 (voice recognition). The experimental results show that our proposed approach, TAPE, significantly surpasses the baselines: on average, it improves upon the second-best method from 54.93\% to 82.76\% in Average Accuracy (AA $\uparrow$) and from 28.74\% to 12.56\% in Performance Dropping rate (PD $\downarrow$).