Hello! I am Massimiliano (Massi), an Assistant Professor (RTD-a) in the Multimedia and Human Understanding Group at the Department of Information Engineering and Computer Science of the University of Trento. The goal of my work is to develop algorithms that increase the generalization capabilities of deep architectures to new visual domains and semantic concepts, focusing on problems such as domain generalization, incremental learning, and compositionality in computer vision. I am a member of ELLIS.
Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployment, it is necessary to deeply investigate their safety and fairness so as not to disseminate and perpetuate any kind of bias. However, existing works focus on detecting closed sets of biases defined a priori, limiting the studies to well-known concepts. In this paper, we tackle the challenge of open-set bias detection in text-to-image generative models, presenting OpenBias, a new pipeline that identifies and quantifies the severity of biases agnostically, without access to any precompiled set. OpenBias has three stages. In the first phase, we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly, the target generative model produces images using the same set of captions. Lastly, a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5, 2, and XL, emphasizing new biases never investigated before. Via quantitative experiments, we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement.
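A minimal sketch of the three-stage pipeline described above. The helper names (query_llm, generate_image, ask_vqa) are hypothetical placeholders for an LLM, the target generative model, and a VQA model; this is not the official OpenBias implementation.

```python
# Sketch of open-set bias detection: propose biases, generate images, verify with VQA.
from collections import Counter, defaultdict

def detect_biases(captions, query_llm, generate_image, ask_vqa):
    # Stage 1: an LLM proposes candidate biases per caption, e.g.
    # {"bias": "gender", "question": "What is the gender of the chef?",
    #  "choices": ["male", "female", "unspecified"]}
    proposals = [query_llm(f"List possible biases for: '{c}'") for c in captions]

    # Stage 2: the target generative model produces images from the same captions.
    images = [generate_image(c) for c in captions]

    # Stage 3: a VQA model checks whether each proposed bias is realized;
    # the answer distribution quantifies its severity.
    answers = defaultdict(Counter)
    for image, caption_proposals in zip(images, proposals):
        for p in caption_proposals:
            answers[p["bias"]][ask_vqa(image, p["question"], p["choices"])] += 1

    # A skewed answer distribution for a bias indicates higher severity.
    return {bias: dict(counts) for bias, counts in answers.items()}
```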
Harnessing Large Language Models for Training-free Video Anomaly Detection
Luca Zanella, Willi Menapace, Massimiliano Mancini, and 2 more authors
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment, as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
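A rough sketch of the training-free scoring loop described above. The captioner, LLM, and cross-modal similarity function are hypothetical stand-in interfaces, not the exact LAVAD implementation.

```python
# Sketch: caption frames, clean captions via cross-modal similarity, score with an LLM.
import numpy as np

def score_video(frames, caption_model, clip_similarity, llm, window=16):
    # Generate a textual description for every frame with a VLM captioner.
    captions = [caption_model(f) for f in frames]

    # Clean noisy captions: for each frame, keep the caption (from any frame)
    # that is most similar to the frame in the shared image-text space.
    cleaned = [captions[int(np.argmax([clip_similarity(f, c) for c in captions]))]
               for f in frames]

    # Temporal aggregation + anomaly scoring via LLM prompting: summarize a
    # window of descriptions and ask for an anomaly score in [0, 1].
    scores = []
    for i in range(len(frames)):
        context = " ".join(cleaned[max(0, i - window):i + 1])
        prompt = (f"Scene description: {context}\n"
                  "On a scale from 0 (normal) to 1 (anomalous), rate this scene. "
                  "Answer with a single number.")
        scores.append(float(llm(prompt)))  # assumes the LLM returns a parseable number
    return scores
```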
MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning
Matteo Farina, Massimiliano Mancini, Elia Cunegatti, and 3 more authors
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
While excellent in transfer learning, Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first gradient-free pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent, sophisticated combinatorial competitors in the vast majority of cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.
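An illustrative sketch of a magnitude-and-flow importance score in the spirit of point (i) above: a weight matters more if it is large and connects salient neurons. This is a simplified reading with hypothetical aggregation choices, not the released MULTIFLOW code (see the repository linked above for the actual implementation).

```python
# Sketch: weight importance combining magnitude with the saliency of connected neurons.
import torch

def flow_scores(weight: torch.Tensor) -> torch.Tensor:
    """weight: (out_features, in_features) matrix of a linear layer."""
    magnitude = weight.abs()
    # Saliency of input / output neurons: aggregate magnitude of their connections.
    in_saliency = magnitude.mean(dim=0, keepdim=True)    # (1, in_features)
    out_saliency = magnitude.mean(dim=1, keepdim=True)   # (out_features, 1)
    # Importance = own magnitude modulated by the saliency of the neurons it links.
    return magnitude * in_saliency * out_saliency

def prune_layer(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    scores = flow_scores(weight)
    k = max(1, int(sparsity * weight.numel()))
    threshold = scores.flatten().kthvalue(k).values
    return weight * (scores > threshold)  # zero out the least important weights
```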
Vision-by-Language for Training-Free Compositional Image Retrieval
Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, and 1 more author
In International Conference on Learning Representations (ICLR), 2024
Given an image and a target modification (e.g., an image of the Eiffel tower and the text "without people and at night-time"), Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database. While supervised approaches rely on annotating triplets (i.e., query image, textual modification, and target image), which is costly, recent research sidesteps this need by using large-scale vision-language models (VLMs), performing Zero-Shot CIR (ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training task-specific, customized models over large amounts of image-text pairs. In this work, we propose to tackle CIR in a training-free manner via our Compositional Image Retrieval through Vision-by-Language (CIReVL), a simple, yet human-understandable and scalable pipeline that effectively recombines large-scale VLMs with large language models (LLMs). By captioning the reference image using a pre-trained generative VLM and asking an LLM to recompose the caption based on the textual target modification for subsequent retrieval via, e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we find competitive, in-part state-of-the-art performance, improving over supervised methods. Moreover, the modularity of CIReVL offers simple scalability without re-training, allowing us to both investigate scaling laws and bottlenecks for ZS-CIR while easily scaling up to, in parts, more than double previously reported results. Finally, we show that CIReVL makes CIR human-understandable by composing image and text in a modular fashion in the language domain, thereby making it intervenable and allowing us to post-hoc re-align failure cases. Code will be released upon acceptance.
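A compact sketch of the training-free pipeline described above: caption, recompose, retrieve. The captioner and LLM interfaces are placeholders, and the retrieval step assumes CLIP-style encoders with pre-normalized database features; this is not the authors' exact code.

```python
# Sketch: caption the reference image, let an LLM rewrite it, retrieve by text-image similarity.
import torch

def cirevl_retrieve(reference_image, modification_text, caption_model, llm,
                    clip_model, tokenizer, database_image_features):
    """database_image_features: (N, D) L2-normalized image embeddings of the gallery."""
    # 1) Describe the reference image with a pre-trained generative VLM.
    caption = caption_model(reference_image)

    # 2) Ask an LLM to rewrite the caption according to the target modification.
    target_caption = llm(
        f"Image description: '{caption}'. "
        f"Rewrite it so that it reflects this change: '{modification_text}'. "
        "Answer with the new description only.")

    # 3) Retrieve from the database by text-to-image similarity (e.g. CLIP).
    with torch.no_grad():
        text_features = clip_model.encode_text(tokenizer([target_caption]))
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores = text_features @ database_image_features.T  # (1, N) similarities
    return scores.argsort(descending=True).squeeze(0)  # ranked database indices
```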
Vocabulary-free Image Classification
Alessandro Conti, Enrico Fini, Massimiliano Mancini, and 3 more authors
Advances in Neural Information Processing Systems (NeurIPS), 2023
Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, a pre-defined set of categories, a.k.a. the vocabulary, is assumed at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained, language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task, as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories.
Open World Compositional Zero-Shot Learning
Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and 1 more author
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
Compositional Zero-Shot Learning (CZSL) requires recognizing state-object compositions unseen during training. In this work, instead of assuming prior knowledge about the unseen compositions, we operate in the open world setting, where the search space includes a large number of unseen compositions, some of which might be unfeasible. In this setting, we start from the cosine similarity between visual features and compositional embeddings. After estimating the feasibility score of each composition, we use these scores either to directly mask the output space or as margins for the cosine similarity between visual features and compositional embeddings during training. Our experiments on two standard CZSL benchmarks show that all the methods suffer severe performance degradation when applied in the open world setting. While our simple CZSL model achieves state-of-the-art performance in the closed world scenario, our feasibility scores boost the performance of our approach in the open world setting, clearly outperforming the previous state of the art.
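A hedged sketch of the feasibility-based masking at inference time: predictions are cosine similarities to composition embeddings, and compositions whose feasibility score falls below a threshold are masked out. Variable names and the thresholding rule are illustrative, not the exact released implementation.

```python
# Sketch: mask unfeasible state-object compositions in the open-world output space.
import torch
import torch.nn.functional as F

def predict_open_world(visual_feats, composition_embeds, feasibility, threshold=0.0):
    """
    visual_feats:       (B, D) image features
    composition_embeds: (C, D) embeddings of all state-object compositions
    feasibility:        (C,)   feasibility score per composition
    """
    # Cosine similarity between images and every composition in the open world.
    sims = F.normalize(visual_feats, dim=-1) @ F.normalize(composition_embeds, dim=-1).T

    # Mask compositions deemed unfeasible (e.g. "ripe dog") by pushing their
    # scores to -inf so they can never be predicted.
    mask = feasibility >= threshold
    sims = sims.masked_fill(~mask, float("-inf"))
    return sims.argmax(dim=-1)  # predicted composition index per image
```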
Modeling the Background for Incremental Learning in Semantic Segmentation
Fabio Cermelli, Massimiliano Mancini, Samuel Rota Bulò, and 2 more authors
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
Despite their effectiveness in a wide range of tasks, deep architectures suffer from some important limitations. In particular, they are vulnerable to catastrophic forgetting, i.e., they perform poorly when they are required to update their model as new classes become available but the original training set is not retained. This paper addresses this problem in the context of semantic segmentation. Current strategies fail on this task because they do not consider a peculiar aspect of semantic segmentation: since each training step provides annotation only for a subset of all possible classes, pixels of the background class (i.e., pixels that do not belong to any other class) exhibit a semantic distribution shift. In this work we revisit classical incremental learning methods, proposing a new distillation-based framework which explicitly accounts for this shift. Furthermore, we introduce a novel strategy to initialize the classifier's parameters, thus preventing biased predictions toward the background class. We demonstrate the effectiveness of our approach with an extensive evaluation on the Pascal-VOC 2012 and ADE20K datasets, significantly outperforming state-of-the-art incremental learning methods.
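A simplified sketch of the background-shift idea above: because "background" at the current step may hide previously learned classes, the probability assigned to background-labelled pixels aggregates the background and all old classes. This is an illustrative reading under the assumption that targets contain only valid class indices, not the authors' exact loss implementation.

```python
# Sketch: cross-entropy that treats "background" as "background or any old class".
import torch
import torch.nn.functional as F

def background_aware_ce(logits, targets, old_class_ids, bg_id=0):
    """
    logits:  (B, C, H, W) scores over background + old + new classes
    targets: (B, H, W)    labels from the current step (old classes appear as background)
    """
    probs = F.softmax(logits, dim=1)
    # Probability of "background or any old class" for every pixel.
    bg_or_old = probs[:, [bg_id] + list(old_class_ids)].sum(dim=1)      # (B, H, W)
    # Probability of the annotated class for the remaining pixels.
    pixel_prob = probs.gather(1, targets.unsqueeze(1)).squeeze(1)       # (B, H, W)
    # For background-labelled pixels, use the aggregated probability instead.
    pixel_prob = torch.where(targets == bg_id, bg_or_old, pixel_prob)
    return -(pixel_prob.clamp(min=1e-7).log()).mean()
```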
Towards Recognizing Unseen Categories in Unseen Domains
Massimiliano Mancini, Zeynep Akata, Elisa Ricci, and 1 more author
In Proceedings of the European Conference on Computer Vision (ECCV), 2020
Current deep visual recognition systems suffer from severe performance degradation when they encounter new images from classes and scenarios unseen during training. Hence, the core challenge of Zero-Shot Learning (ZSL) is to cope with the semantic shift, whereas the main challenge of Domain Adaptation and Domain Generalization (DG) is the domain shift. While ZSL and DG have historically been tackled in isolation, this work pursues the ambitious goal of solving them jointly, i.e., recognizing unseen visual concepts in unseen domains. We present CuMix (Curriculum Mixup for recognizing unseen categories in unseen domains), a holistic algorithm to tackle ZSL, DG, and ZSL+DG. The key idea of CuMix is to simulate the test-time domain and semantic shift using images and features from unseen domains and categories, generated by mixing up the multiple source domains and categories available during training. Moreover, a curriculum-based mixing policy is devised to generate increasingly complex training samples. Results on standard ZSL and DG datasets and on ZSL+DG using the DomainNet benchmark demonstrate the effectiveness of our approach.
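An illustrative sketch of mixing samples across source domains and categories to simulate domain and semantic shift, with a curriculum that increases mixing strength over training. The annealing rule shown here is a simplification, not the authors' released CuMix code.

```python
# Sketch: curriculum-weighted mixup of images and labels across source domains.
import torch
import numpy as np

def cumix_batch(images, labels_onehot, epoch, max_epochs, alpha=2.0):
    """
    images:        (B, C, H, W) batch drawn from multiple source domains
    labels_onehot: (B, K)       one-hot (or soft) labels
    """
    # Curriculum: early batches stay close to the original samples,
    # later ones are mixed more aggressively.
    ratio = min(1.0, epoch / max_epochs)
    lam = 1.0 - ratio * float(np.random.beta(alpha, alpha))

    # Mix each sample with another one, typically from a different domain/class.
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_images, mixed_labels
```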
Adagraph: Unifying predictive and continuous domain adaptation through graphs
Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, and 1 more author
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019