Software Security and Large Language Models

Vulnerabilities and Code Generation


Learning Objectives

  • You know that large language models can produce vulnerable code.
  • You know that the prompt structure influences the vulnerability of the generated code.
  • You know that using large language models can lead users to create more vulnerable code while increasing their confidence in its security.

Large language models have been trained on publicly (and privately) available code. Given the prevalence of vulnerabilities in existing code, as just discussed in Common Weaknesses in Code, it is no surprise that the code that large language models have been trained on may also contain vulnerabilities. Consequently, large language models may also produce vulnerable code.

As an example, the 2022 article “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions” outlines a study where the authors used large language models to produce programs for tasks relevant to the CWE Top 25 Most Dangerous Software Weaknesses list. Of the 1,689 generated programs, approximately 40% contained vulnerabilities.


While the above-mentioned article was published in 2022, the security of code produced by large language models has since received more attention. There have been calls for benchmarks for secure code generation, and answers to those calls, such as the one outlined in the article “CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models”. The article also highlights that the prompt structure influences the vulnerability of the generated code.

As models improve, they are less likely to produce code that contains trivial flaws. As an example, state-of-the-art large language models are less likely to produce code that is vulnerable to SQL injection, which consistently ranks among the most dangerous software weaknesses.
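To illustrate what such a flaw looks like in practice, the following sketch contrasts a query built with string concatenation against a parameterized query. The snippet uses Python’s built-in sqlite3 module; the users table and its columns are made up for the example.

```python
import sqlite3

def find_user_vulnerable(connection: sqlite3.Connection, username: str):
    # Vulnerable: the input is concatenated into the SQL string, so a value
    # such as "' OR '1'='1" changes the structure of the query itself.
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return connection.execute(query).fetchall()

def find_user_safe(connection: sqlite3.Connection, username: str):
    # Safer: a parameterized query keeps the input as data rather than SQL.
    query = "SELECT id, username FROM users WHERE username = ?"
    return connection.execute(query, (username,)).fetchall()
```

A code assistant that suggests the latter pattern by default avoids the flaw; one that suggests the former reproduces it.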


The developers who use large language models and code generation tools are responsible for their code (and for the code produced with the help of such tools). In the same way that it is hard to imagine a scenario where a doctor would blame a computer for a misdiagnosis, it should be hard to imagine a scenario where a developer would blame their tools for a vulnerability in their code.

When using large language models and LLM-powered tools, it is important to keep in mind that such tools can also influence users’ beliefs about the security of their code.

At the extreme, as highlighted in the article “Do Users Write More Insecure Code with AI Assistants?”, users who work with an AI code assistant may end up creating more vulnerable code while still tending to believe that their code is secure. Simply being aware of this phenomenon can help developers be more cautious when using large language models.
