LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

1Institute for Computer Graphics and Vision, TU Graz, Austria.
2MIT-IBM Watson AI Lab, USA.
NeurIPS 2023

LaFTer first trains a classifier on a natural-language text dataset mined in a controlled manner from the set of target classes, by generating descriptions for each class label with an LLM and mixing them with handcrafted templates. The training objective is to classify each description to the correct (source) class name (top-left). In the second stage, LaFTer employs the text-only classifier to generate pseudo-labels on the unlabeled data and further finetunes the vision encoder in a parameter-efficient manner (bottom-left). Finally, the finetuned visual encoder and the text classifier are used for the eventual classification (bottom-middle). Combining our text-only pre-trained classifier with the proposed pseudo-labeling pipeline lets us consistently improve over the previous SOTA for label-free finetuning, UPL (right).
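To make the first stage concrete, below is a minimal sketch of training a text-only classifier on frozen CLIP text embeddings, assuming PyTorch and the OpenAI clip package. The class names and the two templates are hypothetical stand-ins for the LLM-generated descriptions used in the paper, and the hyperparameters are illustrative.

    import torch
    import torch.nn as nn
    import clip  # OpenAI CLIP: https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    # Hypothetical class names and templates; the paper mines descriptions
    # by prompting an LLM with each class name and mixing in handcrafted templates.
    class_names = ["airplane", "dog", "truck"]
    texts, labels = [], []
    for idx, name in enumerate(class_names):
        for template in ["a photo of a {}.", "a blurry photo of a {}."]:
            texts.append(template.format(name))
            labels.append(idx)

    # Embed every text once with the frozen CLIP text encoder.
    with torch.no_grad():
        feats = model.encode_text(clip.tokenize(texts).to(device)).float()
        feats = feats / feats.norm(dim=-1, keepdim=True)
    targets = torch.tensor(labels, device=device)

    # Text Classifier: trained to recover the (source) class from the text embedding.
    classifier = nn.Linear(feats.shape[1], len(class_names)).to(device)
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    for _ in range(200):
        loss = nn.functional.cross_entropy(classifier(feats), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because CLIP's text and image embeddings live in a shared space, the same classifier head can later be applied to image embeddings, which is what the pseudo-labeling stage exploits.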

Abstract

Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification, enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated with a Large Language Model (LLM) that describe the categories of interest and effectively substitute for labeled visual instances of those categories. Using our label-free approach, we attain significant performance improvements over the zero-shot performance of the base VL model and over other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvements of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe average gains of 1.3% over leading few-shot prompting baselines that do use 5-shot supervision.
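For context, the zero-shot baseline that LaFTer improves on classifies an image by comparing its CLIP embedding against the embeddings of prompts built from the class names. A minimal sketch, again assuming the OpenAI clip package; the class names and image path are hypothetical.

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    class_names = ["airplane", "dog", "truck"]  # hypothetical categories
    prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path

    with torch.no_grad():
        text_feats = model.encode_text(prompts)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        image_feats = model.encode_image(image)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        logits = image_feats @ text_feats.T  # cosine similarity as class scores
    print(class_names[logits.argmax(dim=-1).item()])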

Method


(Top) Given a set of class labels, we generate a dataset of short texts by prompting a Large Language Model (LLM) multiple times with each class name. We compute embeddings of these texts using the CLIP text encoder. This lets us train a neural network, the Text Classifier, to infer the class used to prompt the LLM from the embedding of the text it generated. Even though the Text Classifier has been trained exclusively on text, it performs well in classifying image embeddings generated by the CLIP vision encoder. (Bottom) We further take advantage of the Text Classifier by leveraging it in a pseudo-labeling setup to finetune the VL model.
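The sketch below illustrates the pseudo-labeling stage under stated assumptions: the frozen text-only classifier labels confident images, and only the LayerNorm affine parameters of the vision encoder are unfrozen as the parameter-efficient subset (an assumed choice for illustration, not necessarily the paper's exact recipe; the confidence threshold, the augmentation strategy, and the random-tensor data loader are likewise placeholders).

    import torch
    import torch.nn as nn
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    model.float()  # train in fp32 for numerical stability

    num_classes = 3
    # Stand-in for the stage-one text-only classifier; in practice, load its
    # trained weights. It stays frozen here and acts as the pseudo-labeler.
    classifier = nn.Linear(512, num_classes).to(device)
    for p in classifier.parameters():
        p.requires_grad_(False)

    # Parameter-efficient finetuning: freeze the vision encoder except for its
    # LayerNorm affine parameters (an assumed choice for illustration).
    for p in model.visual.parameters():
        p.requires_grad_(False)
    trainable = []
    for m in model.visual.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad_(True)
                trainable.append(p)
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)

    # Hypothetical unlabeled data: replace with a real DataLoader over target images.
    unlabeled_loader = torch.utils.data.DataLoader(
        torch.randn(32, 3, 224, 224), batch_size=8)

    threshold = 0.9  # assumed confidence threshold for keeping pseudo-labels
    for images in unlabeled_loader:
        images = images.to(device)
        with torch.no_grad():
            feats = model.encode_image(images)
            feats = feats / feats.norm(dim=-1, keepdim=True)
            probs = classifier(feats).softmax(dim=-1)
            confidence, pseudo_labels = probs.max(dim=-1)
            keep = confidence > threshold
        if keep.any():
            feats = model.encode_image(images[keep])  # gradients flow to LayerNorms
            feats = feats / feats.norm(dim=-1, keepdim=True)
            loss = nn.functional.cross_entropy(classifier(feats), pseudo_labels[keep])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

After finetuning, inference simply applies the updated vision encoder followed by the same text classifier, matching the eventual classification path described above.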

Main Results


Top-1 classification accuracy (%) with the CLIP pre-trained ViT-B/32 backbone on 12 image classification benchmarks. LaFTer denotes results obtained by first pre-training the visual classifier on text-only data and then performing unsupervised finetuning on the unlabeled image data. The highest accuracy is shown in bold; the second best is underlined.

BibTeX

@InProceedings{mirza2023lafter,
    author    = {Mirza, M. Jehanzeb and Karlinsky, Leonid and Lin, Wei and Kozinski, Mateusz and
                 Possegger, Horst and Feris, Rogerio and Bischof, Horst},
    title     = {LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections},
    booktitle = {Conference on Neural Information Processing Systems (NeurIPS)},
    year      = {2023}
}