Title: How Smart Is Smart Enough? Benchmarking LLMs with Embedding-Based Similarity in Python Code Generation

Conference: ACIIDS 2026

Tags: AI, LLM, Python, Software Development

Abstract: The growing capabilities of generative AI, particularly Large Language Models (LLMs), are reshaping software development by enabling automated code generation. This study presents a comparative evaluation of state-of-the-art models (OpenAI GPT-4.5 Preview, GPT-4o, GPT-4o Mini, GPT-4 Turbo, GPT-3.5 Turbo, GPT-o1, GPT-o3 Mini; Google's Gemini 1.5 Pro, 1.5 Flash, 2.0 Flash, 2.0 Flash Lite; Anthropic's Claude 3 Opus, 3 Sonnet, 3 Haiku, 3.5 Sonnet, 3.5 Haiku, 3.7 Sonnet; and Meta's LLaMA 3.0 and 3.1 8B Instruct) across ten Python programming tasks of varying complexity. Model outputs were assessed using an embedding-based semantic similarity metric against expert-crafted reference solutions. Results show that top performers like GPT-4.5 Preview and GPT-4o Mini achieve consistently high similarity scores, while LLaMA 3.1 8B ranks lowest. Interestingly, complex tasks yielded higher similarity, possibly due to more structured outputs. The findings highlight the strengths and limitations of current LLMs and advocate for complementary evaluation criteria, such as execution correctness and efficiency, in practical use.
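The core of the evaluation described above is comparing an embedding of a model's generated code against an embedding of a reference solution. The abstract does not specify the embedding model or similarity function used, but a common choice is cosine similarity between fixed-length embedding vectors. The sketch below illustrates that comparison with hypothetical, hand-written vectors standing in for real code embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    Returns a value in [-1, 1]; for typical text/code embeddings
    (non-negative or near-parallel vectors) scores fall close to [0, 1].
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in practice these would come from an
# embedding model applied to the generated code and the
# expert-crafted reference solution.
generated_vec = np.array([0.12, 0.80, 0.31, 0.05])
reference_vec = np.array([0.10, 0.75, 0.40, 0.02])

score = cosine_similarity(generated_vec, reference_vec)
```

A model output identical in embedding space to the reference would score 1.0; lower scores indicate semantic divergence. As the abstract notes, such similarity scores are best read alongside complementary criteria like execution correctness, since two snippets can be semantically close yet differ in runtime behavior.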
