
Ontology-Driven Scientific Literature Classification using Clustering and Self-Supervised Learning

EasyChair Preprint no. 7288

19 pages • Date: January 5, 2022

Abstract

The rapid growth of scientific literature in computer engineering (CE) and computer science (CS) makes it difficult for researchers to explore publication records by standard scientific category. This creates a need for automatic classification of text documents into scientific categories using content and contextual information. Document classification is a major application of supervised learning, which requires a labeled data set to train the classifier. However, the research publication records available on the Google Scholar and dblp services are not labeled. Two considerations shape how labels can be obtained. First, manually annotating a large body of scientific research using standard scientific terminology requires domain expertise and is extremely time-consuming. Second, hierarchical labeling of records facilitates more effective and context-aware retrieval of documents. In this paper, we propose an ontology-driven classification technique that combines zero-shot learning with agglomerative clustering to automatically label a scientific literature data set related to CE and CS.
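
As a rough illustration of how clustering and ontology categories might be combined to label unlabeled records, the following minimal sketch groups documents with average-linkage agglomerative clustering and then assigns each cluster the most similar ontology category. It is not the authors' implementation. Bag-of-words cosine similarity stands in for the zero-shot model, and the ontology entries, documents, and function names (`bow`, `cosine`, `agglomerative`, `label_clusters`) are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts for lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the two most similar clusters (j > i, so pop(j) is safe).
        clusters[i] += clusters.pop(j)
    return clusters


def label_clusters(docs, ontology, n_clusters):
    """Give every document the ontology category closest to its cluster."""
    vectors = [bow(d) for d in docs]
    labels = [None] * len(docs)
    for members in agglomerative(vectors, n_clusters):
        centroid = Counter()
        for m in members:
            centroid.update(vectors[m])
        # Assumption: similarity to a category description replaces the
        # zero-shot classifier used in the paper.
        best_label = max(
            ontology, key=lambda name: cosine(centroid, bow(ontology[name])))
        for m in members:
            labels[m] = best_label
    return labels


# Hypothetical ontology categories and documents, for illustration only.
ontology = {
    "machine learning": "learning neural network training classifier model",
    "computer architecture": "processor cache memory pipeline hardware",
}
docs = [
    "training a neural network classifier",
    "deep neural network model training",
    "cache memory hierarchy in a processor",
    "processor pipeline and cache hardware design",
]
labels = label_clusters(docs, ontology, n_clusters=2)
print(labels)
```

Labeling whole clusters rather than single documents lets sparse or ambiguous records inherit a category from their neighbors, which is one plausible reason to pair clustering with a zero-shot labeler.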

We study and compare the effectiveness of multiple text classifiers, including logistic regression, support vector machines (SVM), gradient boosting with Word2vec and bag-of-words (BOW) embeddings, recurrent neural networks (RNN) with GloVe embeddings, and feed-forward neural networks with BOW embeddings. Our study shows that the RNN with GloVe embeddings outperforms the other models, with an F1 score above 0.85 at all granularity levels.
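
To make the idea of an F1 score per granularity level concrete, here is a minimal sketch that truncates hierarchical labels to a given depth and computes an F1 score at each depth. Several things here are assumptions, not details from the paper: the slash-separated label paths, the category names, the use of macro averaging, and the helper names (`macro_f1`, `truncate`, `f1_by_level`).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging, assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def truncate(path, level):
    """Keep the first `level` components of a slash-separated label path."""
    return "/".join(path.split("/")[:level])


def f1_by_level(true_paths, pred_paths, max_level):
    """Macro F1 at each granularity level from 1 (coarsest) to max_level."""
    return {
        level: macro_f1(
            [truncate(p, level) for p in true_paths],
            [truncate(p, level) for p in pred_paths],
        )
        for level in range(1, max_level + 1)
    }


# Hypothetical hierarchical labels, for illustration only.
true_paths = ["CS/AI/NLP", "CS/AI/Vision",
              "CE/Hardware/Memory", "CE/Hardware/Processors"]
pred_paths = ["CS/AI/NLP", "CS/AI/NLP",
              "CE/Hardware/Memory", "CE/Hardware/Processors"]
scores = f1_by_level(true_paths, pred_paths, max_level=3)
print(scores)
```

In this toy data the single mistake is confined to the finest level, so the coarser levels still score perfectly. A claim that holds "at all granularity levels" would require the score to stay above the threshold at every depth, including the finest.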

Our proposed technique will help both junior and experienced researchers identify emerging technologies and domains for their research.

Keyphrases: document classification, granularity level, hierarchical document classification, machine learning, machine learning application, natural language processing, scientific literature, self-supervised learning, text classification, unsupervised learning

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@Booklet{EasyChair:7288,
  author = {Zhengtong Pan and Patrick Soong and Setareh Rafatirad},
  title = {Ontology-Driven Scientific Literature Classification using Clustering and Self-Supervised Learning},
  howpublished = {EasyChair Preprint no. 7288},

  year = {EasyChair, 2022}}