GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

1ICG, TU Graz, Austria. 2Sony Group Corporation, Japan. 3IBM Research, Israel. 4JKU Linz, Austria. 5Offenburg University, Germany. 6University of Amsterdam, Netherlands. 7UNSW Sydney, Australia. 8SonyAI, USA. 9MIT-IBM Watson AI Lab, USA. 10MIT CSAIL, USA.
ArXiv Preprint

The effect of prompt evolution on downstream task performance. The shaded regions show the absolute top-1 classification accuracy on ImageNet at each optimization step, obtained by ensembling the top-3 prompts found w.r.t. the accuracy on the 1-shot train set, while the solid lines show the exponential moving average. The left plot uses CLIP ViT-B/32 and the right LLaVA-OV, with Llama-3 as the LLM. Due to the high computational cost, we perform only 25 optimization steps for LLaVA-OV.
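
For concreteness, the sketch below shows how such curves can be produced: the top-3 prompt templates are ensembled by averaging their normalized CLIP text embeddings per class, and the per-step accuracies are then smoothed with an exponential moving average. The pooling scheme and the decay value are illustrative assumptions, not settings taken from the paper.

import torch

@torch.no_grad()
def ensemble_text_features(model, tokenize, templates, classnames, device):
    # Average the normalized text embeddings of the top-3 templates per class,
    # then re-normalize; the result is used as the zero-shot classifier.
    feats = []
    for name in classnames:
        tokens = tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        feats.append(emb.mean(dim=0))
    feats = torch.stack(feats)
    return feats / feats.norm(dim=-1, keepdim=True)

def exponential_moving_average(per_step_accuracies, decay=0.9):
    # Smoothed curve plotted as the solid line; the decay value is assumed.
    smoothed, running = [], per_step_accuracies[0]
    for acc in per_step_accuracies:
        running = decay * running + (1 - decay) * acc
        smoothed.append(running)
    return smoothed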

Abstract

In this work, we propose a novel method (GLOV) that enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task. In each optimization step, the ranked prompts are fed back as in-context examples (together with their accuracies) to equip the LLM with knowledge of the type of prompts preferred by the downstream VLM. Furthermore, we explicitly steer the LLM generation in each optimization step by adding an offset vector, computed as the difference between the embeddings of the positive and negative solutions found by the LLM in previous optimization steps, to an intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVA) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.
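
The following is a minimal sketch of the outer GLOV loop for a dual-encoder VLM, written against the public OpenAI clip package; query_llm, the meta-prompt string, and the 1-shot tensors (train_images, train_labels, classnames) are hypothetical placeholders standing in for the components described above.

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def fitness(template, images, labels, classnames):
    # Top-1 accuracy of one prompt template (with a "{}" class placeholder)
    # on the 1-shot train split.
    texts = clip.tokenize([template.format(c) for c in classnames]).to(device)
    t = model.encode_text(texts)
    t = t / t.norm(dim=-1, keepdim=True)
    v = model.encode_image(images.to(device))
    v = v / v.norm(dim=-1, keepdim=True)
    preds = (v @ t.T).argmax(dim=-1).cpu()
    return (preds == labels).float().mean().item()

def query_llm(meta_prompt, in_context):
    # Hypothetical helper: builds the meta-prompt from the system instruction,
    # task description, and ranked (accuracy, prompt) in-context examples,
    # queries the LLM, and parses out new candidate VLM prompts.
    raise NotImplementedError

# train_images, train_labels, classnames: assumed to come from the 1-shot split.
history = []  # (accuracy, prompt) pairs, reused as ranked in-context examples
for step in range(30):
    candidates = query_llm("<task description>", in_context=sorted(history, reverse=True))
    for prompt in candidates:
        history.append((fitness(prompt, train_images, train_labels, classnames), prompt))

top3 = [p for _, p in sorted(history, reverse=True)[:3]]  # ensembled at test time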

Method


Overview of GLOV. GLOV consists of a Meta-Prompt, which comprises a system instruction, a task description, and in-context examples (VLM prompts) that are evaluated (and ranked) on few-shot training data in each iteration. The Meta-Prompt instructs the LLM to generate several candidate solutions in each optimization iteration, conditioned on the in-context examples, which are fed together with their accuracy values to highlight their effectiveness. Furthermore, to steer the LLM generation toward the language preferred by the VLM, we add the scaled difference of the sentence embeddings from the positive and negative text prompts (autoregressively) to an intermediate layer of the LLM, as sketched below. This process is repeated until a stopping condition is met (e.g., the maximum number of iterations). Note that H+ and H- refer to the sentence embeddings of the positive and negative text prompts, respectively.
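
Below is a minimal sketch of this guidance step for an LLM served through Hugging Face transformers: a scaled difference of mean-pooled sentence embeddings from a positive and a negative prompt is added to the output of an intermediate decoder layer via a forward hook. The layer index, the pooling, the guidance strength alpha, and the fixed placeholder prompts are assumptions for illustration; in GLOV, the positive and negative prompts come from the ranked solutions of previous optimization steps.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
llm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16, device_map="auto")

layer_idx = 15   # which decoder layer to read from and steer (assumed)
alpha = 0.1      # guidance strength (assumed value)

@torch.no_grad()
def sentence_embedding(text):
    # Mean-pooled hidden state of `text` at the chosen layer (assumed pooling).
    ids = tok(text, return_tensors="pt").to(llm.device)
    hidden = llm(**ids, output_hidden_states=True).hidden_states[layer_idx]
    return hidden.mean(dim=1)  # shape (1, d)

# Placeholders; in GLOV these are the best- and worst-performing prompts
# found in earlier optimization steps.
best_prompt = "A close-up photo of a {}, a type of object."
worst_prompt = "{}"
offset = sentence_embedding(best_prompt) - sentence_embedding(worst_prompt)  # H+ - H-

def steer(module, inputs, output):
    # Decoder layers may return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * offset.to(hidden.device, hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = llm.model.layers[layer_idx].register_forward_hook(steer)
try:
    meta_prompt = "<system instruction, task description, and ranked in-context prompts>"
    ids = tok(meta_prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(**ids, max_new_tokens=128, do_sample=True)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # remove the hook so later forward passes are unguided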

BibTeX

@article{mirza2024glov,
  author  = {Mirza, M. Jehanzeb and Zhao, Mengjie and Mao, Zhuoyuan and Doveh, Sivan and Lin, Wei and Gavrikov, Paul and Dorkenwald, Michael and Yang, Shiqi and Jha, Saurav and Wakaki, Hiromi and Mitsufuji, Yuki and Possegger, Horst and Feris, Rogerio and Karlinsky, Leonid and Glass, James},
  title   = {GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models},
  journal = {ArXiv},
  year    = {2024}
}