Fusion of Vision and Text Features for Breast Cancer Classification using a Few-Shot Approach

Authors: Saptarshi Pani, Gouranga Maity, Irina Shpakovskaya, Dmitrii Kaplun and Ram Sarkar

Conference: IEEE CBMS 2025

Tags: Breast Cancer, Few-shot learning, Histopathology images, Vision Language Model, Vision Transformer

Abstract:

Breast cancer diagnosis from histopathological images is challenging because annotated medical data are scarce, particularly for rare cancer stages. Traditional deep learning models struggle to generalize in such low-data scenarios. To address this problem, we propose a few-shot classification framework for breast histopathological images based on metric-based learning. Our approach uses Vision Transformers (ViTs) for feature extraction, which capture global contextual information better than conventional Convolutional Neural Networks (CNNs). Additionally, we integrate BLIP-2, a Vision Language Model (VLM), to incorporate manual text prompts and contextual textual descriptions, enhancing the model's interpretability and adaptability. The extracted visual and textual features are fused using a novel feature fusion module, and samples are classified based on cosine distance. We evaluate our approach on the BreakHis and BACH datasets to assess its effectiveness in few-shot learning (FSL). In the 5-shot setting, our model achieves 57.12% on BACH and around 89% on BreakHis. Performance improves as the number of support samples increases. These findings suggest that combining transformer-based architectures with VLMs enhances the performance of FSL-based medical image classification systems.
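
The metric-based classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fusion module, so `fuse` below is a hypothetical placeholder that simply normalizes and concatenates the two feature vectors. Averaging support embeddings into class prototypes is likewise an assumption borrowed from common metric-based few-shot practice. The short hand-written vectors stand in for ViT and BLIP-2 features, and all function names are illustrative.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(visual, textual):
    """Placeholder fusion (assumption): normalize each modality, then concatenate.
    The paper's fusion module is not described in the abstract."""
    return l2_normalize(visual) + l2_normalize(textual)


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def class_prototypes(support):
    """Average the fused support vectors of each class (assumed prototype rule)."""
    grouped = defaultdict(list)
    for vector, label in support:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest in cosine distance."""
    return min(prototypes, key=lambda label: cosine_distance(query, prototypes[label]))


# Toy 2-way, 2-shot episode with made-up feature values.
support = [
    (fuse([1.0, 0.1], [0.9, 0.0]), "benign"),
    (fuse([0.9, 0.2], [1.0, 0.1]), "benign"),
    (fuse([0.1, 1.0], [0.0, 0.9]), "malignant"),
    (fuse([0.2, 0.9], [0.1, 1.0]), "malignant"),
]
prototypes = class_prototypes(support)
query = fuse([0.95, 0.15], [0.8, 0.05])
prediction = classify(query, prototypes)
print(prediction)
```

In this sketch, adding more support samples per class only changes the prototype averages, which is one simple way to picture why performance can improve as the number of shots grows.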