
Re-thinking Text Clustering for Images with Text

EasyChair Preprint no. 10256

16 pages · Date: May 24, 2023


Text-VQA refers to the set of problems that reason about the text present in an image to answer specific questions regarding the image content. Previous works in text-VQA have largely followed the common strategy of feeding various input modalities (OCR, objects, question) to an attention-based learning framework. Such approaches treat the OCR tokens as independent entities and ignore the fact that these tokens often come correlated in an image, representing a larger ‘meaningful’ entity. The ‘meaningful’ entity potentially represented by a group of OCR tokens can be discerned primarily from the layout of the text in the image, along with the broader context in which it appears. In the proposed work, we aim to cluster the OCR tokens using a novel spatially-aware and knowledge-enabled clustering technique that uses an external knowledge graph to improve the answer prediction accuracy of the text-VQA problem. Our proposed algorithm is generic enough to be applied to any multimodal transformer architecture used for text-VQA training. We showcase the objective and subjective effectiveness of the proposed approach by improving the performance of the M4C model on multiple text-VQA datasets.
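The abstract does not spell out the clustering algorithm itself, but the core idea of grouping spatially correlated OCR tokens into a larger entity can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification: it clusters tokens purely by bounding-box proximity using union-find, and omits the paper's knowledge-graph component entirely. All names (`boxes_adjacent`, `cluster_tokens`, the `gap` threshold) are illustrative, not from the paper.

```python
# Hypothetical sketch: group OCR tokens into larger 'meaningful' entities
# by spatial proximity of their bounding boxes. The paper's actual method
# is also knowledge-enabled (external knowledge graph), which is omitted here.

def boxes_adjacent(a, b, gap=10):
    """True if two (x0, y0, x1, y1) boxes lie within `gap` pixels of each other."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 + gap < bx0 or bx1 + gap < ax0 or
                ay1 + gap < by0 or by1 + gap < ay0)

def cluster_tokens(tokens, gap=10):
    """tokens: list of (text, box) pairs. Returns clusters of token indices."""
    parent = list(range(len(tokens)))

    def find(i):
        # Path-halving union-find lookup.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Merge every pair of spatially adjacent tokens.
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if boxes_adjacent(tokens[i][1], tokens[j][1], gap):
                union(i, j)

    clusters = {}
    for i in range(len(tokens)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For example, two tokens on the same shop sign ("HAPPY", "BIRTHDAY") whose boxes nearly touch would land in one cluster, while a distant "EXIT" token would form its own.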

Keyphrases: OCR VQA, Scene Text Clustering, Text VQA

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:

@booklet{EasyChair:10256,
  author = {Shwet Kamal Mishra and Soham Joshi and Viswanath Gopalakrishnan},
  title = {Re-thinking Text Clustering for Images with Text},
  howpublished = {EasyChair Preprint no. 10256},
  year = {EasyChair, 2023}}