An overview of evaluation and enhancement methods for code generation by large language models

Jacob Truong; Van Nguyen; Thanh Thi Nguyen

doi:10.1016/j.infsof.2026.108185

Journal article

Open access

Peer reviewed

Information and Software Technology, Vol.197, pp.1-21

2026

DOI: https://doi.org/10.1016/j.infsof.2026.108185

Published Version; Open Access; CC BY 4.0

Keywords: Code generation; Efficiency; Functional correctness; Large language models; Maintainability; Security

Abstract

Context: Recent advances in Large Language Models (LLMs) have led to the rapid deployment of automated tools capable of generating source code. As these models increasingly transition from experimental tools to established elements of software development, a critical question arises: to what extent do the models and the code they generate satisfy, or can be made to satisfy, the rigorous, multifaceted quality standards required for professional, real-world engineering?

Objective: This study aims to answer this question by exploring existing evaluation frameworks and enhancement strategies for LLMs and the code they generate. By examining how generated code quality is currently assessed and improved, we aim to determine whether current research methodologies provide balanced coverage of the software quality spectrum or whether significant disparities exist.

Method: We propose a code quality dimension taxonomy adapted from the ISO/IEC 25010 standard, encompassing four principal attributes: Functional Correctness (FC), Security (SE), Performance Efficiency (PE), and Maintainability (MA). Using this framework, we conduct a literature review analysing existing research on evaluation frameworks and enhancement strategies across these dimensions.

Results: Our analysis reveals a substantial imbalance in research focus. FC and, increasingly, SE have well-established evaluation frameworks and improvement strategies. In contrast, PE and MA remain significantly underexamined, with few standardised benchmarks and a lack of targeted fine-tuning approaches for these critical software quality dimensions.

Conclusion: The survey identifies a pressing need for broader research into PE- and MA-oriented evaluation and enhancement. We propose several promising directions: (i) the creation of formal benchmarks; (ii) the development of reinforcement learning techniques leveraging static and dynamic code feedback; and (iii) the use of multi-agent frameworks for iterative, critique-based improvement grounded in verifiable diagnostic artefacts.

Details

Title: An overview of evaluation and enhancement methods for code generation by large language models
Authors: Jacob Truong - Monash University
Van Nguyen - Monash University
Thanh Thi Nguyen (Corresponding Author) - University of the Sunshine Coast
Publication details: Information and Software Technology, Vol.197, pp.1-21
Publisher: Elsevier BV
Date published: 2026
DOI: 10.1016/j.infsof.2026.108185
ISSN: 1873-6025
Data Availability: Data will be made available on request.
Organisation Unit: School of Science, Technology and Engineering
Language: English
Record Identifier: 991239299102621
Output Type: Journal article
