Journal article
An overview of evaluation and enhancement methods for code generation by large language models
Information and Software Technology, Vol.197, pp.1-21
2026
Abstract
Context:
Recent advances in Large Language Models (LLMs) have led to the rapid deployment of automated generation tools capable of producing source code. As these models increasingly transition from experimental tools to established elements of software development, a critical question arises: to what extent do the models and the code they generate satisfy, or can be made to satisfy, the rigorous, multifaceted quality standards required for professional, real-world engineering?
Objective:
The primary aim of this study is to answer this question by exploring existing evaluation frameworks and enhancement strategies for LLMs and the code they generate. By examining how generated code quality is currently assessed and improved, we seek to determine whether current research methodologies provide balanced coverage of the software quality spectrum or whether significant disparities exist.
Method:
We propose a code quality dimension taxonomy adapted from the ISO/IEC 25010 standard, encompassing four principal attributes: Functional Correctness (FC), Security (SE), Performance Efficiency (PE), and Maintainability (MA). Using this framework, we conduct a literature review analysing existing research in evaluation frameworks and enhancement strategies across these dimensions.
Results:
Our analysis reveals a substantial imbalance in research focus. FC and, increasingly, SE have well-established evaluation frameworks and improvement strategies. In contrast, PE and MA remain significantly underexamined, with few standardised benchmarks and a lack of targeted fine-tuning approaches for these critical software quality dimensions.
Conclusion:
The survey identifies a pressing need for broader research into PE- and MA-oriented evaluation and enhancement. We propose several promising directions: (i) the creation of formal benchmarks; (ii) the development of reinforcement learning techniques leveraging static and dynamic code feedback; and (iii) the use of multi-agent frameworks for iterative, critique-based improvement grounded in verifiable diagnostic artefacts.
Details
- Title
- An overview of evaluation and enhancement methods for code generation by large language models
- Authors
- Jacob Truong - Monash University; Van Nguyen - Monash University; Thanh Thi Nguyen (Corresponding Author) - University of the Sunshine Coast
- Publication details
- Information and Software Technology, Vol.197, pp.1-21
- Publisher
- Elsevier BV
- Date published
- 2026
- DOI
- 10.1016/j.infsof.2026.108185
- ISSN
- 1873-6025
- Copyright note
- © 2026 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
- Data Availability
- Data will be made available on request.
- Organisation Unit
- School of Science, Technology and Engineering
- Language
- English
- Record Identifier
- 991239299102621
- Output Type
- Journal article