Extracting Semantic Entity Triplets by Leveraging LLMs

EasyChair Preprint 14914, 3 pages. Date: September 16, 2024

Abstract

As Large Language Models (LLMs) become increasingly powerful and accessible, concerns about the automatic generation of academic papers are rising. Several instances of undeniable LLM usage in reputable journals have been reported, and it is likely that significantly more articles were partially or entirely written by LLMs without yet being detected, posing a threat to the veracity of academic journals. The current consensus among researchers is that detecting LLM-generated text is ineffective, or easy to evade, in a general setting. We therefore explore an alternative approach that targets the stochastic nature of LLMs. Because LLMs are stochastic text generators, hallucinations in long texts are a persistent problem, and the generated output regularly contains counterfactual components. Semantic entity triplets can be used to assess a text's factual accuracy and to filter the publication corpus accordingly. Previous work built a classical triplet-extraction pipeline based on spaCy; however, that method retrieves relatively few triplets, and those it finds tend to be so generic as to be domain-agnostic. We overcome these limitations by applying few-shot prompting to the recently released Meta-Llama-3-8B-Instruct. The results show that we can extract more triplets per paragraph than the classical extraction method. Moreover, the triplets are more specific, and we find no evidence of hallucination when comparing the extracted subjects and objects to the original reference text.

Keyphrases: Natural Language Processing, Noun Extraction, entity extraction, large language models, machine learning
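To illustrate the few-shot prompting approach described in the abstract, the sketch below shows one plausible shape for it: a prompt template with an in-context example, and a parser that recovers (subject, predicate, object) triplets from the model's completion. The prompt wording, the example, and the output format are assumptions for illustration only, since the abstract does not specify them; the actual call to Meta-Llama-3-8B-Instruct (e.g., via the Hugging Face `transformers` library) is omitted to keep the snippet self-contained.

```python
import re

# Hypothetical few-shot prompt; the paper's actual prompt is not given in
# the abstract, so both the instruction and the example are assumptions.
FEW_SHOT_PROMPT = """Extract (subject, predicate, object) triplets from the text.

Text: Marie Curie discovered polonium in 1898.
Triplets: (Marie Curie, discovered, polonium)

Text: {text}
Triplets:"""


def build_prompt(text: str) -> str:
    """Fill the few-shot template with the paragraph to analyze."""
    return FEW_SHOT_PROMPT.format(text=text)


def parse_triplets(completion: str) -> list[tuple[str, str, str]]:
    """Parse '(subject, predicate, object)' tuples from a model completion.

    Only parenthesized groups with exactly three comma-separated parts
    are kept, so malformed output is silently skipped.
    """
    return [
        tuple(part.strip() for part in match.split(","))
        for match in re.findall(r"\(([^)]+)\)", completion)
        if match.count(",") == 2
    ]


# Example with a mocked model completion (no model call performed here):
mock_completion = "(spaCy, is, an NLP library), (LLMs, generate, text)"
print(parse_triplets(mock_completion))
# → [('spaCy', 'is', 'an NLP library'), ('LLMs', 'generate', 'text')]
```

In a real pipeline, `build_prompt` would be sent to the instruct model and `parse_triplets` applied to its generated continuation; the extracted subjects and objects could then be checked against the reference text, as the abstract describes.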