Evaluating LLMs for Arabic Code Summarization: Challenges and Insights from GPT-4

EasyChair Preprint 15569
6 pages • Date: December 13, 2024

Abstract

GPT-4, the backbone of ChatGPT, has demonstrated remarkable performance in both natural language and source code tasks. Recently, Large Language Models (LLMs) like GPT-4 have significantly advanced software engineering tasks such as code summarization. These advancements boost developer productivity and help address often neglected tasks like code documentation. While code summarization and commenting are essential for maintaining code quality and facilitating communication among developers, writing comments manually is time-consuming. Although several studies have proposed and evaluated deep learning-based approaches and LLMs to automate comment generation, these efforts primarily focus on the English language, leaving a gap for other languages, particularly Arabic. In this study, we evaluate the ability of GPT-4 to generate accurate Arabic comments. We support our evaluation with both manual and automatic analysis to measure the correctness and nature of the generated comments. Our findings reveal that while GPT-4 generally produces correct Arabic summaries, they often do not align with the developer's intent, as reflected in the BERT-Similarity, ROUGE, and BLEU scores. We also show that GPT-4's comments are more verbose due to the morphological richness of the Arabic language and a systematic approach that tends to describe each code component in detail. Finally, the readability of these comments is moderate, with scores ranging from 30.29 to 100.

Keyphrases: Arabic language, Code Summarization, GPT-4, LLMs