MATE: Masked Autoencoders are Online 3D Test-Time Learners

1Institute for Computer Graphics and Vision, TU Graz, Austria.
2Korea Advanced Institute of Science and Technology (KAIST), South Korea.
3Southeast University, China.
ICCV 2023

(*Equal Contribution)
[Figure] Overview of our Test-Time Training methodology. We adapt the encoder to a single out-of-distribution (OOD) test sample online by updating its weights with a self-supervised reconstruction task, and then use the updated weights to make a prediction on that sample. To enable this, the encoder, decoder, and classifier are first jointly trained on the classification and reconstruction tasks (not shown in the figure).
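To make the adapt-then-predict protocol concrete, below is a minimal PyTorch-style sketch of one per-sample adaptation step. All names (tokenize, reconstruction_loss, the module arguments) and hyperparameters (masking ratio handling, learning rate, number of steps) are illustrative assumptions, not the paper's exact implementation; the deep copies reset the weights between samples, whereas an online variant would keep updating the same weights.

import copy
import torch

def random_mask(tokens, ratio=0.9):
    # Randomly split the token sequence into visible and masked subsets.
    # tokens: (batch, n_tokens, dim)
    n = tokens.shape[1]
    perm = torch.randperm(n)
    n_visible = max(1, int(n * (1.0 - ratio)))
    vis_idx, mask_idx = perm[:n_visible], perm[n_visible:]
    return tokens[:, vis_idx], tokens[:, mask_idx], (vis_idx, mask_idx)

def adapt_and_predict(sample, encoder, decoder, classifier,
                      tokenize, reconstruction_loss, steps=1, lr=1e-5):
    # Adapt copies of the encoder/decoder to one OOD test sample via
    # masked reconstruction, then classify it with the updated encoder.
    enc, dec = copy.deepcopy(encoder), copy.deepcopy(decoder)
    optimizer = torch.optim.AdamW(
        list(enc.parameters()) + list(dec.parameters()), lr=lr)
    tokens = tokenize(sample)                      # point cloud -> patch tokens
    for _ in range(steps):
        visible, masked, info = random_mask(tokens, ratio=0.9)
        latent = enc(visible)                      # encode visible tokens only
        recon = dec(latent, masked, info)          # reconstruct masked patches
        loss = reconstruction_loss(recon, sample)  # e.g. Chamfer distance
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():                          # predict with adapted weights
        return classifier(enc(tokens)).argmax(dim=-1)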

Abstract

MATE is the first Test-Time Training (TTT) method designed for 3D data. It makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, which is tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We evaluate MATE on several 3D object classification datasets and show that it significantly improves the robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We also show that MATE is very efficient in terms of the fraction of points it needs for adaptation: it can adapt effectively given as few as 5% of the tokens of each test sample, making it extremely lightweight. Finally, our experiments show that MATE achieves competitive performance even when adapting only sparsely on the test data, which further reduces its computational overhead and makes it well suited for real-time applications.
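The sparse-adaptation setting mentioned above can be pictured as updating the network only on every k-th sample of the test stream. The sketch below reuses the hypothetical component names and the random_mask helper from the adaptation sketch above; the interval adapt_every and the optimizer settings are assumptions for illustration.

def sparse_online_ttt(stream, encoder, decoder, classifier,
                      tokenize, reconstruction_loss, adapt_every=10):
    # Run one reconstruction update only on every `adapt_every`-th test
    # sample; classify every sample with the continually updated encoder.
    optimizer = torch.optim.AdamW(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-5)
    predictions = []
    for i, sample in enumerate(stream):
        tokens = tokenize(sample)
        if i % adapt_every == 0:                   # adapt only sparsely
            visible, masked, info = random_mask(tokens, ratio=0.9)
            recon = decoder(encoder(visible), masked, info)
            loss = reconstruction_loss(recon, sample)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        with torch.no_grad():
            predictions.append(classifier(encoder(tokens)).argmax(dim=-1))
    return predictions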

Method

[Figure] The input point cloud is first tokenized and then randomly masked; in our setup, we mask 90% of the point cloud. For joint training, the visible tokens from the training data are fed to the encoder to obtain latent embeddings. These embeddings are passed to the classification head to compute the classification loss, and are also concatenated with the masked tokens and fed to the decoder to compute the reconstruction loss. Both losses are optimized jointly. At test time, adaptation to an out-of-distribution sample uses only the MAE reconstruction task. Finally, after adapting the encoder on this single sample, evaluation is performed with the updated encoder weights.
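A single joint-training step can then be summarized as a weighted sum of the two losses. The sketch below again reuses the hypothetical names from the adaptation sketch; the Chamfer distance is a common reconstruction loss for point-cloud MAEs, and the weighting lambda_rec is an assumption rather than the paper's exact hyperparameter.

import torch.nn.functional as F

def joint_training_step(points, labels, encoder, decoder, classifier,
                        tokenize, chamfer_distance, optimizer, lambda_rec=1.0):
    # One step of joint classification + masked-reconstruction training.
    tokens = tokenize(points)
    visible, masked, info = random_mask(tokens, ratio=0.9)
    latent = encoder(visible)                      # embeddings of visible tokens
    cls_loss = F.cross_entropy(classifier(latent), labels)
    recon = decoder(latent, masked, info)          # decode visible + mask tokens
    rec_loss = chamfer_distance(recon, points)
    loss = cls_loss + lambda_rec * rec_loss        # optimize both jointly
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()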

Corruptions

[Figure] We adapt to 15 different types of corruptions that commonly occur in 3D point cloud data.
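As an illustration of one such corruption, additive Gaussian noise can be applied to a point cloud as follows; the severity-to-sigma mapping is an assumption for demonstration, not the benchmark's exact parameterization.

import numpy as np

def gaussian_noise(points, severity=1):
    # Jitter each point with zero-mean Gaussian noise; higher severity
    # means a larger standard deviation (illustrative scaling).
    sigma = 0.01 * severity
    noisy = points + np.random.normal(0.0, sigma, size=points.shape)
    return noisy.astype(points.dtype)

# Usage: corrupt a random 1024-point cloud at severity 3.
cloud = np.random.rand(1024, 3).astype(np.float32)
corrupted = gaussian_noise(cloud, severity=3)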

BibTeX

@article{mirza2023mate,
    author    = {Mirza, M. Jehanzeb and Shin, Inkyu and Lin, Wei and Schriebl, Andreas and Sun, Kunyang and
                 Choe, Jaesung and Kozinski, Mateusz and Possegger, Horst and Kweon, In So and Yoon, Kuk-Jin and Bischof, Horst},
    title     = {MATE: Masked Autoencoders are Online 3D Test-Time Learners},
    journal   = {arXiv preprint arXiv:2211.11432},
    year      = {2023}
}