Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

1ICG, TU Graz, Austria. 2CDL-EML. 3MIT-IBM Watson AI Lab, USA. 4JKU, Linz, Austria. 5IBM Research, Israel.
6Weizmann Institute of Science, Israel. 7University of Bonn, Germany.
ArXiv Preprint

Our MPVR utilizes a Meta Prompt comprising a system prompt (instruction), in-context example demonstrations (fixed throughout), and metadata (name and description) for the downstream task of interest. The Meta Prompt instructs an LLM to generate diverse task-specific LLM queries, which are then used to obtain category-specific VLM prompts (visual text descriptions) by querying the LLM again with each class name filled in. These category-specific VLM prompts are ensembled into a zero-shot classifier for recognizing the downstream task categories.
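The two-stage querying described above can be sketched in a few lines of Python. This is illustrative only: query_llm is a hypothetical stand-in for any chat LLM (e.g., GPT or Mixtral), and the meta-prompt wording, in-context example, and the "{}" class placeholder are assumptions, not the exact prompts used in the paper.

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError

def build_meta_prompt(name: str, description: str) -> str:
    # System prompt + fixed in-context examples + task metadata (name, description).
    return (
        "You write prompts for a language model.\n"
        "Given a dataset name and description, produce diverse questions that,\n"
        "when filled with a class name, elicit visual descriptions of that class.\n"
        "Example (dataset: EuroSAT, low-resolution satellite imagery):\n"
        "  - Describe how a satellite photo of a {} looks from above.\n"
        "  - What visual features identify a {} in aerial imagery?\n"
        f"Dataset: {name}. Description: {description}.\n"
    )

def generate_task_queries(name: str, description: str) -> list[str]:
    # Stage 1: the meta-prompt instructs the LLM to emit diverse task-specific LLM queries.
    response = query_llm(build_meta_prompt(name, description))
    return [line.lstrip("- ").strip() for line in response.splitlines() if line.strip()]

def generate_vlm_prompts(task_queries: list[str], class_name: str) -> list[str]:
    # Stage 2: each query, filled with the class name, is sent back to the LLM;
    # its answers serve as the category-specific VLM prompts (visual text descriptions).
    return [query_llm(q.replace("{}", class_name)) for q in task_queries]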

Abstract

Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, existing methods rely on hand-crafting the prompts to the LLMs that generate the VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of a short natural language description and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts, resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks from widely different domains when tested with multiple LLMs and VLMs. For example, MPVR improves zero-shot recognition over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) when leveraging the GPT and Mixtral LLMs, respectively.

Method


MPVR framework. In the first stage, a meta-prompt comprising a system prompt, in-context examples, and metadata specifying the downstream task is fed to the LLM, instructing it to generate multiple diverse task-specific LLM queries. These queries are then populated with each category of interest and sent back to the LLM to obtain the category-level prompts used to assemble a zero-shot classifier.
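A minimal sketch of the final ensembling step, assuming the OpenAI CLIP package (ViT-B/32, as in the results below) and a dict category_prompts mapping each class name to its LLM-generated descriptions; variable names and normalization details are illustrative, not the paper's exact implementation.

import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def build_zero_shot_classifier(category_prompts: dict[str, list[str]]) -> torch.Tensor:
    class_weights = []
    for class_name, prompts in category_prompts.items():
        tokens = clip.tokenize(prompts, truncate=True).to(device)
        text_emb = model.encode_text(tokens).float()
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # unit-normalize each prompt
        mean_emb = text_emb.mean(dim=0)                            # ensemble: average over prompts
        class_weights.append(mean_emb / mean_emb.norm())           # re-normalize the class weight
    return torch.stack(class_weights, dim=1)                       # (embed_dim, num_classes)

@torch.no_grad()
def predict(images: torch.Tensor, classifier: torch.Tensor) -> torch.Tensor:
    image_emb = model.encode_image(images.to(device)).float()
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ classifier).argmax(dim=-1)                 # predicted class indices

Averaging the normalized text embeddings is the standard prompt-ensembling recipe: image features are matched against the resulting per-class weights by cosine similarity, and the highest-scoring class is returned.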

Main Results


Top-1 accuracy (%) on 20 datasets obtained with the ViT-B/32 backbone from OpenAI CLIP. S-TEMP refers to results obtained with the default template ("a photo of a {class name}"), while DS-TEMP refers to results obtained with an ensemble of dataset-specific prompts. An empty placeholder for CUPL indicates that the respective baseline did not provide hand-crafted prompts for that dataset. For Waffle, mean results over 7 random runs are reported, following the original publication.
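For reference, the S-TEMP baseline in the table corresponds to building the classifier from a single hand-crafted template per class rather than an ensemble of LLM-generated prompts. A sketch using the build_zero_shot_classifier helper from above, with an assumed example class_names list:

class_names = ["golden retriever", "tabby cat"]  # assumed example labels
s_temp_prompts = {name: [f"a photo of a {name}."] for name in class_names}
s_temp_classifier = build_zero_shot_classifier(s_temp_prompts)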

BibTeX

@article{mirza2023mpvr,
    author    = {Mirza, M. Jehanzeb and Karlinsky, Leonid and Lin, Wei and Doveh, Sivan and
                 Micorek, Jakub and Kozinski, Mateusz and Kuehne, Hilde and Possegger, Horst},
    journal   = {ArXiv},
    title     = {{Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs}},
    year      = {2024}
    }