Coding with Large Language Models

Test Generation


Learning Objectives

  • You know of different types of testing approaches.
  • You know that large language models can be used for generating tests.
  • You know that, in some cases, existing code generation tools can be more efficient than large language models for generating tests.

Just as large language models can be used for source code generation and code completion, they can also be used for generating or completing test code.

Software testing

Creating tests is a part of software testing. Software testing is the process of evaluating code or an application to verify that it works as expected. Software testing can be done at multiple levels and in multiple ways, ranging from testing the functionality of a single function to testing the functionality of the entire application.

Here, we briefly visit some of the different types of tests that can be generated using large language models.

Unit tests in Python

Unit tests are automatically runnable tests that can be used to verify whether a single function or a component works as expected. Unit tests are typically written by the developers of the application, and they are typically written in the same programming language as the application itself.

When working with Python, unit tests are typically written using the unittest library, which provides a set of tools for writing and running unit tests. The library is part of the Python standard library, meaning that it is available by default when Python is installed.

Let’s look at an example of a unit test. The example below defines a function called fun, which returns the string hello world!. The function is defined within a file called app.py.

def fun():
    return 'hello world!'

To test the function, we write a unit test that verifies whether the function returns the expected value. Unit tests are typically placed in separate files that correspond to the files containing the functionality being tested. For a file app.py, we would create a file called test_app.py.

The test file imports the tested function, in this case fun, from the file app.py. Then, a concrete unit test is written that verifies whether the function returns the expected value. An example of a test file test_app.py is shown below. The test in the file verifies that the function fun returns the string "hello world!".

import unittest

from app import fun

class TestApp(unittest.TestCase):
    def test_fun(self):
        self.assertEqual(fun(), 'hello world!')

Assuming that the files are in the same folder and that we have opened that folder in a terminal (you can, for example, use the ls or tree commands to check the contents of the folder), we can run the unit tests using the command python3 -m unittest. The command runs the tests defined within the test file test_app.py and outputs the results of the tests.

tree
.
├── app.py
└── test_app.py

1 directory, 2 files

python3 -m unittest
.
---------------------
Ran 1 test in 0.000s

OK

The above output shows that we have one test, and that the test passed. Let’s add another test case to the file test_app.py that checks that the output is "hello world" and see what happens.

import unittest

from app import fun

class TestApp(unittest.TestCase):
    def test_fun(self):
        self.assertEqual(fun(), 'hello world!')

    def test_fun_invalid(self):
        self.assertEqual(fun(), 'hello world')

When we run the tests again, we see that there is a failing test.

python3 -m unittest
.F
======================================================================
FAIL: test_fun_invalid (test_app.TestApp)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "~/python-projects/test_app.py", line 10, in test_fun_invalid
    self.assertEqual(fun(), 'hello world')
AssertionError: 'hello world!' != 'hello world'
- hello world!
?            -
+ hello world


----------------------------------------------------------------------
Ran 2 tests in 0.000s

FAILED (failures=1)

Generating unit tests

Large language models can be used to generate unit tests. As an example, we could ask for a set of unit tests for the sorting algorithm that we created earlier with a large language model.

Write Python unit tests for the following function.

def quicksort(array):
    if len(array) <= 1:
        return array
    pivot = array[len(array) // 2]
    left = [x for x in array if x < pivot]
    middle = [x for x in array if x == pivot]
    right = [x for x in array if x > pivot]
    return quicksort(left) + middle + quicksort(right)

Certainly! Here's a set of unit tests for the quicksort function using Python's unittest framework:

...

class TestQuickSort(unittest.TestCase):

    def test_empty_array(self):
        self.assertEqual(quicksort([]), [])

    def test_single_element(self):
        self.assertEqual(quicksort([5]), [5])

    def test_two_elements(self):
        self.assertEqual(quicksort([5, 2]), [2, 5])

    def test_multiple_elements(self):
        self.assertEqual(quicksort([12, 4, 5, 6, 7, 3, 1, 15]), [1, 3, 4, 5, 6, 7, 12, 15])

    def test_duplicate_elements(self):
        self.assertEqual(quicksort([3, 3, 3, 5, 1, 5, 2]), [1, 2, 3, 3, 3, 5, 5])

    def test_negative_elements(self):
        self.assertEqual(quicksort([-5, -2, -3, 0, 2, 3]), [-5, -3, -2, 0, 2, 3])

    def test_float_elements(self):
        self.assertEqual(quicksort([-2.5, 3.5, 0, 3.5]), [-2.5, 0, 3.5, 3.5])
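
To actually run the generated tests, the test file also needs the relevant imports. As an example, assuming the function is saved in a file called quicksort.py and the generated tests in a file called test_quicksort.py (both file names are just one possible arrangement), the test file could start as follows, after which the tests can be run with python3 -m unittest as before.

import unittest

# Assumed file layout: quicksort.py contains the quicksort function,
# and this file is named test_quicksort.py.
from quicksort import quicksort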


Test generation in IDE integrations

IDE plugins such as GitHub Copilot come with built-in support for generating tests. For example, in VSCode, you can ask GitHub Copilot to create tests for an open file using the command /test.

Microsoft also provides a learning module on developing unit tests using GitHub Copilot tools.

Integration tests

Integration tests are used to verify whether multiple components work together as expected. Some integration tests can be written using the unittest library used above, but in practice, a separate library is often used. When working with Python, integration tests are typically written using the pytest library or similar, which provides an easy way to mock components that are not part of the test.

Mocking means replacing a component with a fake component that behaves in a predefined way. For example, if we want to test a function that uses a database, we can mock the database so that the function uses a fake database instead of the real database. This way, we can test the function without having to worry about the state of the real database.
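
As a minimal sketch of the idea, the hypothetical example below replaces a database object with a mock from Python's built-in unittest.mock module. The function count_users and the fetch_all method are made up for illustration; they are not part of any real library.

from unittest.mock import Mock

# Hypothetical function under test: it counts users using a database object.
def count_users(database):
    return len(database.fetch_all("users"))

def test_count_users_with_mocked_database():
    # Replace the real database with a mock that returns predefined data.
    fake_database = Mock()
    fake_database.fetch_all.return_value = ["alice", "bob", "carol"]

    assert count_users(fake_database) == 3
    fake_database.fetch_all.assert_called_once_with("users")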

We could, for example, use large language models to write integration tests for a wrapper of an application programming interface. For this, we use the following SimpleJokeAPI class that can be used to retrieve jokes from a simple joke API.

import requests

class SimpleJokeAPI:
  BASE_URL = "https://simple-joke-api.deno.dev"

  @classmethod
  def get_random_joke(cls):
    """Retrieve a random joke."""
    response = requests.get(f"{cls.BASE_URL}/random")
    if response.status_code == 200:
      joke_data = response.json()
      return joke_data
    else:
      response.raise_for_status()
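
For reference, the class could be used roughly as follows; the exact contents of the returned dictionary depend on what the API responds with.

# Assuming the class above is available (e.g. defined in the same file),
# a random joke can be fetched and printed like this.
joke = SimpleJokeAPI.get_random_joke()
print(joke)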

For such a class, we could write an integration test that mocks the HTTP request using the pytest library. This way, we could test the functionality of the class without having to worry about the state of the real API.

Write a pytest-based integration test for the following class SimpleJokeAPI, mocking the HTTP get request with monkeypatch. The example code below also shows how the class is used.

import requests

class SimpleJokeAPI:
  BASE_URL = "https://simple-joke-api.deno.dev"

  @classmethod
  def get_random_joke(cls):
    """Retrieve a random joke."""
    response = requests.get(f"{cls.BASE_URL}/random")
    if response.status_code == 200:
      joke_data = response.json()
      return joke_data
    else:
      response.raise_for_status()

...

Here's an example test suite:

import pytest
import requests
from your_module import SimpleJokeAPI  # adjust the import path as necessary

class MockResponse:
    def __init__(self, json_data, status_code):
        self.json_data = json_data
        self.status_code = status_code

    def json(self):
        return self.json_data

    def raise_for_status(self):
        if 400 <= self.status_code < 600:
            raise requests.HTTPError(f"{self.status_code} Client Error")

def test_get_random_joke(monkeypatch):
    # This function will be used to replace `requests.get`
    def mock_get(*args, **kwargs):
        return MockResponse({"setup": "Why did the chicken cross the road?", "punchline": "To get to the other side."}, 200)

    # Apply the monkeypatch for requests.get to mock_get
    monkeypatch.setattr(requests, "get", mock_get)

    joke = SimpleJokeAPI.get_random_joke()

    assert joke["setup"] == "Why did the chicken cross the road?"
    assert joke["punchline"] == "To get to the other side."

def test_get_random_joke_failure(monkeypatch):
    # This function will mock a failed get request
    def mock_get(*args, **kwargs):
        return MockResponse({}, 400)

    # Apply the monkeypatch for requests.get to mock_get
    monkeypatch.setattr(requests, "get", mock_get)

    with pytest.raises(requests.HTTPError):
        SimpleJokeAPI.get_random_joke()

The output above highlights that large language models can be used for writing integration tests in addition to unit tests, as well as for mocking components that are not part of the test.
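
To run such pytest-based tests, pytest needs to be installed (e.g. with pip install pytest), and the from your_module import SimpleJokeAPI line needs to be adjusted to match the module where the class is actually defined. Assuming, as an example, that the tests are saved in a file called test_joke_api.py in the current folder, the tests could then be run with the following command.

python3 -m pytest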

Similar benefits can also be achieved with other testing frameworks, such as Robot Framework and Behave.

Not guaranteed to be correct

As usual, it is important to understand the limitations of the approach. The tests generated by large language models are not guaranteed to be correct, and they are not guaranteed to cover all possible cases.

Nevertheless, the generated tests can be used as a starting point for further development.

System tests and end-to-end testing

System testing and end-to-end testing are used to verify that the application works as a whole, with all of the components that it would have when deployed to production. When conducting system tests, we simulate user actions in the application to reach specific goals, mimicking the interactions that a user would actually perform.

As an example, for web applications, this includes opening up a page in a browser, navigating the page, perhaps filling in a form, submitting the form, and checking that the data is shown as expected. As systems often have multiple use cases — i.e. how the user interacts with the system and how the system reacts to the user — multiple system tests are written and used.

There are multiple libraries for writing end-to-end tests, including Selenium, Puppeteer, Cypress, and Playwright.
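
As a rough illustration of what an end-to-end test can look like, the sketch below uses Playwright's Python API to open a page in a headless browser and check its title. The URL points to the example.com placeholder site, and Playwright and its browsers would need to be installed first (pip install playwright followed by playwright install).

from playwright.sync_api import sync_playwright

def test_front_page_title():
    with sync_playwright() as p:
        # Launch a headless browser, open the page, and check its title.
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")
        assert "Example Domain" in page.title()
        browser.close()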

Perhaps not surprisingly, large language models can be used to generate system tests and end-to-end tests. Here, however, creating the correct prompts may become trickier, as the tests are typically longer and more complex than unit tests and integration tests, and the inputs do not necessarily follow a specific pattern.

Using existing code generation tools that record user interactions and turn them into test code, such as Playwright's Codegen, is very likely still more efficient than prompting large language models for the tests.
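
As a sketch, with the Playwright Python package installed, the code generator can be started from the command line; it opens a browser, records the interactions performed in it, and generates corresponding test code. The URL below is a placeholder.

playwright codegen https://example.com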