Jehanzeb Mirza

Agentic AI | Multimodal Reasoning | Test-Time Learning

Xero, USA | Formerly MIT CSAIL

Hi, I am Jehanzeb Mirza. I am a Staff Research Scientist at Xero, where I build agentic AI systems for financial applications (tool use, structured reasoning, evaluation, and optimization). Previously, I was a Postdoctoral Researcher at MIT CSAIL in the Spoken Language Systems Group led by Dr. James Glass. I received my Ph.D. in Computer Science (Computer Vision) from TU Graz, Austria, where I was advised by Professor Horst Bischof, and Professor Serge Belongie served as an external referee.

My research spans multimodal foundation models (vision, language, audio) and test-time learning, with an emphasis on robust reasoning and decision-making. I am particularly interested in building reliable AI agents that can interface with tools and operate over complex structured data.

Selected work has been featured by MIT News and CSAIL research spotlights. I’m always happy to connect with student collaborators and researchers working on multimodal learning, LLM/VLM reasoning, and agentic systems—feel free to email me for feedback or collaboration.

Contact

  • jmirza [at] mit.edu
  • Office: 32-G442
  • MIT, Cambridge, USA

Education

  • Ph.D. in Computer Vision (2021 - 2024)
    TU Graz, Austria.
  • MS in Electrical Engineering and Information Technology (2017 - 2020)
    KIT, Germany.
  • BS in Electrical Engineering (2013 - 2017)
    NUST, Pakistan.

Recent News

02/26: 2 papers accepted at CVPR 2026.
01/26: I left MIT and joined Xero as a Staff Research Scientist.
01/26: 1 paper accepted at ICLR 2026.
10/25: Our recent ICCV work was covered by MIT News.
10/25: I was recognized as a NeurIPS 2025 Exceptional Reviewer.
09/25: My recent research was covered by MIT CSAIL.
09/25: 1 paper accepted at NeurIPS 2025.
08/25: 1 paper accepted at TMLR 2025.
07/25: 1 paper accepted at COLM 2025.
06/25: 2 papers accepted at ICCV 2025.
04/25: Our workshops "Long Multi-Scene Video Foundations" and "MMFM" were accepted at ICCV 2025.
03/25: Talk at the EI Seminar, MIT CSAIL.
02/25: 2 papers accepted at CVPR 2025 workshops.
01/25: 3 papers accepted at ICLR 2025.
12/24: Our workshop "What's Next in Multi-Modal Foundation Models" was accepted at CVPR 2025.
11/24: I joined MIT CSAIL as a Postdoctoral Researcher.
11/24: 1 paper accepted at 3DV 2025.
09/24: 1 paper accepted at NeurIPS 2024.
07/24: 1 paper accepted at BMVC 2024.
07/24: 2 papers accepted at ECCV 2024.
04/24: I successfully defended my Ph.D. thesis.
12/23: Our workshop "What's Next in Multi-Modal Foundation Models" was accepted at CVPR 2024.
10/23: Invited talk at Cohere.
10/23: Invited talk at the VIS Lab, University of Amsterdam.
09/23: 1 paper accepted at NeurIPS 2023.
09/23: Invited talk at the Center for Robotics, Paris Tech.
07/23: 1 paper accepted at ICCV 2023.
04/23: I will be attending ICVSS 2023.
03/23: 2 papers accepted at CVPR 2023.
02/23: Reviewing for CVPR, ICCV, and TPAMI.
03/22: 2 papers accepted at CVPR 2022.

Experience

Staff Research Scientist - Xero (USA): Agentic AI systems for finance: tool use, structured reasoning, evaluation, and optimization. (01.26 - Present).
Postdoctoral Researcher - MIT CSAIL (Cambridge, USA): Multimodal learning with speech/audio, vision, and language. (11.24 - 12.25).
Research Assistant - TU Graz (Graz, Austria): Self-supervised learning, test-time adaptation, and vision-language understanding. (01.21 - 10.24).
Research Scientist Internship - Sony AI (Tokyo, Japan): Multimodal learning with vision, language, and audio. (05.24 - 08.24).
Internship - Intel (Karlsruhe, Germany): Robustness of 2D/3D perception systems in adverse conditions for autonomous driving. (03.19 - 08.20).

Selected Publications


  • TTRV: Test-Time Reinforcement Learning for Vision Language Models. CVPR 2026.
  • Teaching VLMs to Localize Specific Objects from In-context Examples. ICCV 2025.
  • GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models. TMLR 2025.
  • Are Vision Language Models Texture or Shape Biased and Can We Steer Them? ICLR 2025.
  • Mining your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models. ICLR 2025.
  • ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs. NeurIPS 2024.
  • Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs. ECCV 2024.
  • Towards Multimodal In-Context Learning for Vision & Language Models. ECCVW 2024.
  • LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections. NeurIPS 2023.
  • MATE: Masked Autoencoders are Online 3D Test-Time Learners. ICCV 2023.
    *M. Jehanzeb Mirza, *Inkyu Shin, *Wei Lin, Andreas Schriebl, Kunyang Sun, Jaesung Choe, Mateusz Kozinski, Horst Possegger, In So Kweon, Kun-Jin Yoon, Horst Bischof (*Equal Contribution)
  • ActMAD: Activation Matching to Align Distributions for Test-Time-Training. CVPR 2023.
  • Video Test-Time Adaptation for Action Recognition. CVPR 2023.
    *Wei Lin, *M. Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof (*Equal Contribution)
  • The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization. CVPR 2022.