Utility and Trustworthiness of Generated Code
Learning Objectives
- You know of some of the issues related to the utility and trustworthiness of code generated by large language models.
Large language models do not produce perfect code. The generated code may contain errors, lack context awareness, and raise security and ethical concerns. The expertise of the developer also plays a significant role in how effectively large language models can be used and how trustworthy the resulting code is.
As a developer using large language models, you are responsible for the code you produce, even if you use large language models to help with the task. Do not push code to production unless you fully stand behind it.
As an example of the imperfections, although large language models are already quite good at solving programming problems, their performance in solving real-world issues extracted from GitHub still has plenty of room for improvement. At the time of writing these materials, the best model on SWE-bench could solve over 40% of the issues included in the lite version of the benchmark.
Similarly, large language models have problems with contextual awareness. If the code that the large language model is asked to create already exists elsewhere in the codebase, outside of the model's context, the model is likely to generate new code rather than reuse the existing implementation.
GitClear examined this phenomenon in their report “Coding on Copilot: 2023 Data Suggests Downward Pressure on Code Quality”, which notes that the proportion of copy-pasted and newly added code has recently increased relative to other types of code changes (deletion, moving, and updating).
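As a hypothetical illustration of this kind of duplication, suppose the codebase already contains a small text helper in a module that is not part of the model's context. Without seeing that module, an assistant may regenerate an equivalent function instead of importing the existing one. The file and function names below are invented for the example.

```python
# text_utils.py -- an existing helper elsewhere in the (hypothetical) codebase
def normalize_title(title: str) -> str:
    """Lowercase a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


# report.py -- code suggested later, without text_utils.py in the model's
# context. Instead of `from text_utils import normalize_title`, the assistant
# produces a near-duplicate helper, adding copy-paste-style code to the codebase.
def clean_title(title: str) -> str:
    return " ".join(title.lower().split())
```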
In the same vein, large language models may produce code that contains security vulnerabilities or is otherwise insecure. This means that code generated by large language models should not be trusted blindly: the developer using the code should verify that it is secure and reliable.
For additional details, see e.g. “Large Language Models for Code: Security Hardening and Adversarial Testing”.
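To make this concrete, the sketch below shows a common pattern that generated code can fall into: building an SQL query with string interpolation, which opens the door to SQL injection. The table, data, and function names are invented for illustration; the parameterized variant avoids the problem.

```python
import sqlite3

# A small in-memory database with invented example data.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE users (username TEXT, email TEXT)")
connection.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")


def find_user_insecure(username: str):
    # Vulnerable: the input is interpolated directly into the SQL statement,
    # so an input such as "' OR '1'='1" changes the meaning of the query.
    query = f"SELECT * FROM users WHERE username = '{username}'"
    return connection.execute(query).fetchall()


def find_user_parameterized(username: str):
    # Safer: the input is passed as a bound parameter and never parsed as SQL.
    query = "SELECT * FROM users WHERE username = ?"
    return connection.execute(query, (username,)).fetchall()


print(find_user_insecure("' OR '1'='1"))       # returns every row in the table
print(find_user_parameterized("' OR '1'='1"))  # returns no rows
```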
Even if the code generated by large language models is correct and secure, the use of the generated code may be problematic. Large language models may have been trained on code that is copyrighted or has specific licenses that influence how the code can be reused, modified, or distributed. This information is not always available to the end user, however. The developer using the code generated by the large language model should ensure that the code can be used in the intended way.
This aspect was briefly discussed in the chapter Ownership of Data and Outputs of the Introduction to Large Language Models course. For additional details on the legal aspects of using code generated by large language models, see e.g. “An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets”.
Interpreting and acting on code suggested by large language models requires a certain level of expertise. When novice programmers use large language models for coding, they may have a hard time understanding the generated code. This can lead to blindly trusting the code, even if it does not solve the problem or is insecure. As an example, novice programmers may end up drifting down a rabbit hole of accepting suggestion after suggestion, even if the suggestions do not help with the problem.
For additional details, see e.g. ""It’s Weird That it Knows What I Want”: Usability and Interactions with Copilot for Novice Programmers”.
In addition, the use of large language models (and AI assistants) may increase trust in the written code, even if the produced code has flaws. This relates to the automation bias phenomenon, which was discussed in the part on Learning Concerns of the Introduction to Large Language Models course.
In a somewhat older study (older in terms of large language model evolution), developers with access to an AI assistant wrote less secure code than those who did not have access to the AI assistant. Despite the difference, the developers who had access to the AI assistant were also more likely to believe that their code was secure.
Although the study highlights issues in code and possible unfounded trust in the created code, one should keep in mind that the model used to generate the code was less powerful than the current state of the art. In addition, as developers learn about the capabilities of large language models, they — or we — will hopefully also learn to better question the code produced by the models.
For additional details, see “Do Users Write More Insecure Code with AI Assistants?”.
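As a final, hedged illustration (a sketch written for these materials, not code from the study), consider a symmetric-encryption helper of the kind such security-focused tasks often ask for. The first function below looks plausible but uses AES in ECB mode without authentication, which leaks plaintext patterns; the second uses the cryptography library's Fernet, which provides authenticated encryption with safer defaults.

```python
# Requires the third-party `cryptography` package.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_ecb(key: bytes, message: bytes) -> bytes:
    """Superficially plausible but insecure: ECB mode, no authentication."""
    padder = padding.PKCS7(128).padder()
    padded = padder.update(message) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return encryptor.update(padded) + encryptor.finalize()


def encrypt_fernet(key: bytes, message: bytes) -> bytes:
    """A safer default: Fernet provides authenticated encryption."""
    return Fernet(key).encrypt(message)


if __name__ == "__main__":
    # Under ECB, identical plaintext blocks produce identical ciphertext
    # blocks, which is one concrete way the weakness shows up.
    ecb_ciphertext = encrypt_ecb(b"0" * 32, b"A" * 16 + b"A" * 16)
    print(ecb_ciphertext[:16] == ecb_ciphertext[16:32])  # True -> pattern leak

    fernet_ciphertext = encrypt_fernet(Fernet.generate_key(), b"A" * 32)
    print(fernet_ciphertext[:20])  # opaque token, no repeating structure
```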