Title: How Smart Is Smart Enough? Benchmarking LLMs with Embedding-Based Similarity in Python Code Generation

Conference: ACIIDS 2026

Tags: AI, LLM, Python, Software Development

Abstract: The growing capabilities of generative AI, particularly Large Language Models (LLMs), are reshaping software development by enabling automated code generation. This study presents a comparative evaluation of state-of-the-art models (OpenAI GPT-4.5 Preview, GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo, GPT-o1, GPT-o3 Mini; Google's Gemini 1.5 Pro, 1.5 Flash, 2.0 Flash, 2.0 Flash Lite; Anthropic's Claude 3 Opus, 3 Sonnet, 3 Haiku, 3.5 Sonnet, 3.5 Haiku, 3.7 Sonnet; and Meta's LLaMA 3.0 and 3.1 8B Instruct) across ten Python programming tasks of varying complexity. Model outputs were assessed using an embedding-based semantic similarity metric against expert-crafted reference solutions. Results show that top performers like GPT-4.5 Preview and GPT-4o Mini achieve consistently high similarity scores, while LLaMA 3.1 8B ranks lowest. Interestingly, complex tasks yielded higher similarity, possibly due to more structured outputs. The findings highlight the strengths and limitations of current LLMs and advocate for complementary evaluation criteria, such as execution correctness and efficiency, in practical use.
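The core of the evaluation described above is comparing an embedding of a model's generated code against an embedding of a reference solution. The abstract does not specify the embedding model or similarity function used, but a common choice is cosine similarity between fixed-length embedding vectors. The sketch below illustrates that comparison with hypothetical, hand-written vectors standing in for real code embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    Returns a value in [-1, 1]; for typical text/code embeddings
    (non-negative or near-parallel vectors) scores fall close to [0, 1].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in practice these would come from an
# embedding model applied to the generated code and the
# expert-crafted reference solution.
generated_vec = np.array([0.12, 0.80, 0.31, 0.05])
reference_vec = np.array([0.10, 0.75, 0.40, 0.02])

score = cosine_similarity(generated_vec, reference_vec)
```

A model output identical in embedding space to the reference would score 1.0; lower scores indicate semantic divergence. As the abstract notes, such similarity scores are best read alongside complementary criteria like execution correctness, since two snippets can be semantically close yet differ in runtime behavior.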
