Landing on the Moon with Large Language Models

Landing with Learning AI


Reinforcement learning

The rule-based system that the large language model ended up with was able to land the lunar lander using a handful of simple rules. In more complex systems, providing such rules might not be feasible, or the system itself might be understood so poorly that determining good rules is not possible.

This is where reinforcement learning can be a meaningful option. Reinforcement learning is an area of machine learning that seeks to identify actions in a system that lead to the best possible outcomes. In other words, reinforcement learning seeks to find an optimal behavior that maximizes a reward.

To achieve this, reinforcement learning uses an agent (“the AI”) that interacts with the system over and over again, learning from its actions and the feedback that the system provides. For the lunar lander, for example, an agent would learn to decide whether or not to use thrust, based on the state of the lunar lander.
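To make this concrete, the interaction between an agent and a system (often called an environment) can be sketched as a loop along the following lines. The environment and agent interfaces here are illustrative assumptions, not part of the game code.

# A minimal sketch of the reinforcement learning loop; the env and agent
# interfaces are illustrative assumptions, not part of the game code.
def run_episode(env, agent):
    state = env.reset()                              # start a new episode
    done = False
    while not done:
        action = agent.act(state)                    # the agent chooses an action
        next_state, reward, done = env.step(action)  # the system provides feedback
        agent.learn(state, action, reward, next_state)
        state = next_state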

Reinforcement learning is about finding an optimal behavior that maximizes a reward by interacting with the system over and over again.

Q-learning

Q-learning is a reinforcement learning algorithm that seeks to learn the expected outcome of an action for each state of a system. There is no explicit need for an understanding of what the system is, as long as the system can be presented as a set of states, a set of actions available in those states, and a reward for the states.
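Concretely, Q-learning maintains a table of values, one per state–action pair, and nudges each value toward the observed reward plus the best value available in the next state. A minimal sketch of the update, using a dictionary as the table and illustrative names, could look as follows.

# Sketch of the Q-learning update rule; q_table is assumed to be a dictionary
# keyed by (state, action) pairs, and ALPHA and GAMMA are the learning rate
# and the discount factor.
def update_q(q_table, state, action, reward, next_state, actions, ALPHA=0.1, GAMMA=0.99):
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    current = q_table.get((state, action), 0.0)
    q_table[(state, action)] = current + ALPHA * (reward + GAMMA * best_next - current)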

As an example, for our lunar lander, the set of actions could include thrusting and not thrusting, while the states represent specific combinations of the lander’s location, its speed, and the amount of fuel it has left. To limit the number of states, one would typically discretize the state space, effectively bucketing nearby values together and defining a range into which the values are expected to fall.
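As a sketch, discretizing a continuous value into buckets could look like the following; the bin size and limits are illustrative assumptions, not the values used later in the game.

# Sketch of bucketing a continuous value; the bin size and limits are
# illustrative assumptions.
def discretize(value, bin_size, low, high):
    index = int((value - low) // bin_size)
    buckets = int((high - low) // bin_size)
    return min(max(index, 0), buckets - 1)   # clamp to the valid range

# For example, a vertical velocity of -2.3, with bins of width 0.5 over the
# range [-10, 10), falls into bucket 15.
print(discretize(-2.3, bin_size=0.5, low=-10, high=10))   # 15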

For this to work, we would need both to adjust the game and to create new functionality. First, we would need the Q-learning algorithm, a way to represent the game states for it, and a reward that the Q-learning algorithm uses to optimize the agent’s behavior.

Second, we would need to be able to train the AI controller separately from the game’s user interface, as the AI controller (that is, the Q-learning agent) needs to play the game over and over again to learn, and playing with the user interface open would be slower.

Finally, once the training is finished, we need to be able to inject the AI controller into the game to see concretely how the AI controls the lunar lander.
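Put together, the workflow we are aiming for could look roughly like the following sketch. The class names match those used in the game code, while the method calls are assumptions about the functionality we hope to end up with rather than existing code.

# Rough sketch of the intended workflow; the method names are assumptions
# about the functionality we hope to end up with.
ai = AIController(LunarLander())           # create the Q-learning agent
ai.train(episodes=1000)                    # train without the user interface open
game = LunarLanderGame(ai_controller=ai)   # inject the trained controller
game.start_game()                          # watch the AI control the lander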

First attempt

We start by asking the large language model for the implementation. As the starting point, we take the version of the game that we had at the end of the Implementation, so that there is no existing AI functionality in the code that the model could simply reuse.

The following code outlines a design for the Lunar Lander game. Modify the code so that the AI uses the Q-learning algorithm for landing the lunar lander safely. Here are the key steps that are needed.

First, we would need the Q-learning algorithm, a way to represent the game states for the Q-learning algorithm, and a reward that the Q-learning algorithm uses to optimize the agent behavior.

Second, we would need the possibility to train the AI controller separately from the game user interface, as the AI controller (or, the Q-learning agent) needs to play the game over and over again to learn the rules, and playing with the user interface open would be slower.

Finally, once the training is finished, we need to be able to inject the AI controller to the game to see concretely how the AI controls the lunar lander.

###

(the code)

To integrate Q-learning into the Lunar Lander game, we'll need to follow these steps:

...

The rest of the output is as follows.

Model output

  1. Add Q-learning Algorithm and State Representation: Implement the Q-learning algorithm and define the state and action spaces.
  2. Training the AI Controller: Create a separate training loop without rendering the game to speed up the learning process.
  3. Using the Trained AI Controller: Inject the trained AI controller into the game to see how it performs.

Here’s the modified code with these steps included:

import pygame
import numpy as np
import random
import pickle

# Constants for the Q-learning algorithm
ACTIONS = [0, 1]  # 0: No thrust, 1: Thrust
STATE_SPACE = (800, 600, 100, 10)  # Discretized state space: position_x, position_y, fuel, velocity_y
ALPHA = 0.1  # Learning rate
GAMMA = 0.99  # Discount factor
EPSILON = 0.1  # Exploration rate

# Class to manage the overall game state
class LunarLanderGame:
    def __init__(self, ai_controller=None):
        pygame.init()
        self.screen = pygame.display.set_mode((800, 600))
        pygame.display.set_caption("Lunar Lander")
        self.lander = LunarLander()
        self.surface = LunarSurface()
        self.ui = GameUI(self.screen, self.lander, self.surface)
        self.player_controller = PlayerController(self.lander)
        self.ai_controller = ai_controller if ai_controller else AIController(self.lander)
        self.score_manager = ScoreManager()
        self.is_running = True

    def start_game(self):
        # Main game loop
        while self.is_running:
            self.handle_events()
            self.update_game_state()
            self.render()
            pygame.time.delay(30)  # Control frame rate

    def handle_events(self):
        # Handle player input and other events
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                self.is_running = False
            self.player_controller.handle_event(event)

    def update_game_state(self):
        # Update lander and check for collisions
        self.lander.update()
        collision_status = self.surface.check_collision(self.lander)
        if collision_status is not None:
            self.is_running = False
            self.ui.display_game_over(collision_status)
        else:
            self.ai_controller.control_lander()

    def render(self):
        # Render the game UI
        self.ui.render()

# Class to represent the Lunar Lander
class LunarLander:
    def __init__(self):
        self.position = [390, 0]  # Center the lander horizontally at the top
        self.velocity = [0, 0]
        self.fuel = 100
        self.thrusting = False

    def update(self):
        # Update position and velocity based on thrust and gravity
        self.apply_gravity()
        if self.thrusting and self.fuel > 0:
            self.apply_thrust()
        self.position[1] += self.velocity[1]

    def apply_gravity(self):
        # Simulate gravity affecting the lander
        self.velocity[1] += 0.1  # Simple gravity effect

    def apply_thrust(self):
        # Apply thrust to counteract gravity
        self.velocity[1] -= 0.5
        self.fuel -= 0.1

# Class to represent the lunar surface
class LunarSurface:
    def __init__(self):
        # Initialize lunar surface (e.g., terrain)
        self.height = 50  # Height of the terrain

    def check_collision(self, lander):
        # Check for collision with the lunar surface
        leg_bottom = lander.position[1] + 30  # Bottom of the legs
        if leg_bottom >= 600 - self.height:
            if lander.velocity[1] < 3:
                return "win"
            else:
                return "lose"
        return None

# Class to manage the graphical user interface
class GameUI:
    def __init__(self, screen, lander, surface):
        self.screen = screen
        self.lander = lander
        self.surface = surface
        self.font = pygame.font.Font(None, 36)

    def render(self):
        # Render the lander, surface, and status information
        self.screen.fill((0, 0, 0))  # Fill the screen with black

        # Draw the lunar lander
        self.draw_lander()

        # Draw the terrain as a green rectangle
        pygame.draw.rect(self.screen, (0, 255, 0), (0, 600 - self.surface.height, 800, self.surface.height))

        # Draw the HUD
        self.render_hud()

        pygame.display.flip()  # Update the screen

    def draw_lander(self):
        # Draw the lander body as a white square
        pygame.draw.rect(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1], 20, 20))

        # Draw the dome as a white circle
        pygame.draw.circle(self.screen, (255, 255, 255), (self.lander.position[0] + 10, self.lander.position[1]), 10)

        # Draw the legs as white lines
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1] + 20), (self.lander.position[0] - 10, self.lander.position[1] + 30))
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0] + 20, self.lander.position[1] + 20), (self.lander.position[0] + 30, self.lander.position[1] + 30))

        # Draw the thrust flame as an upside-down triangle if thrusting
        if self.lander.thrusting:
            flame_color = (255, 0, 0)  # Red color for the flame
            flame_points = [
                (self.lander.position[0] + 10, self.lander.position[1] + 40),  # Bottom point
                (self.lander.position[0], self.lander.position[1] + 20),  # Top-left point
                (self.lander.position[0] + 20, self.lander.position[1] + 20)  # Top-right point
            ]
            pygame.draw.polygon(self.screen, flame_color, flame_points)

    def render_hud(self):
        # Render the HUD with altitude, velocity, and fuel level
        altitude = 600 - self.surface.height - (self.lander.position[1] + 30)  # Adjusted for leg length
        velocity = self.lander.velocity[1]
        fuel = self.lander.fuel

        altitude_text = self.font.render(f"Altitude: {altitude:.2f}", True, (255, 255, 255))
        velocity_text = self.font.render(f"Velocity: {velocity:.2f}", True, (255, 255, 255))
        fuel_text = self.font.render(f"Fuel: {fuel:.2f}", True, (255, 255, 255))

        self.screen.blit(altitude_text, (10, 10))
        self.screen.blit(velocity_text, (10, 40))
        self.screen.blit(fuel_text, (10, 70))

    def display_game_over(self, status):
        # Display the game over screen with win/lose message
        font = pygame.font.Font(None, 74)
        if status == "win":
            text = font.render("Landed safely", True, (0, 255, 0))
        else:
            text = font.render("Crash", True, (255, 0, 0))
        self.screen.blit(text, (200, 250))
        pygame.display.flip()
        pygame.time.delay(2000)  # Display for 2 seconds

# Class to handle player input
class PlayerController:
    def __init__(self, lander):
        self.lander = lander

    def handle_event(self, event):
        # Handle player input events to control the lander
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_UP:
                self.lander.thrusting = True
        elif event.type == pygame.KEYUP:
            if event.key == pygame.K_UP:
                self.lander.thrusting = False

# Class to provide AI control for the lander using Q-learning
class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON

    def get_state(self):
        # Discretize the continuous state space
        position_x = int(self.lander.position[0] // 40)
        position_y = int(self.lander.position[1] // 30)
        fuel = int(self.lander.fuel // 1)
        velocity_y = int(self.lander.velocity[1] // 0.5)
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                reward = 1 if collision_status == "win" else -100 if collision_status == "lose" else -1
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

# Class to manage scoring and feedback
class ScoreManager:
    def __init__(self):
        self.score = 0

    def calculate_score(self, lander):
        # Calculate score based on landing success, fuel efficiency, and time taken
        pass

    def display_score(self):
        # Display the score to the player
        pass

# Training the AI
if __name__ == "__main__":
    ai_controller = AIController(LunarLander())
    ai_controller.train(episodes=1000)
    ai_controller.save_q_table('q_table.pkl')

    # Load the trained AI and start the game
    ai_controller.load_q_table('q_table.pkl')
    game = LunarLanderGame(ai_controller=ai_controller)
    game.start_game()
    pygame.quit()

Explanation of the Changes:

  1. Q-learning Algorithm:

    • Implemented the Q-learning algorithm inside the AIController class.
    • Defined the state space and action space.
    • Added methods to choose actions, update Q-values, and train the AI.
  2. Training the AI:

    • Added a train method to perform training over multiple episodes without rendering the game.
    • Included methods to save and load the Q-table for persistent learning.
  3. Using the Trained AI:

    • After training, the AI controller is injected into the game to control the lunar lander.

This setup allows the AI to learn the optimal policy for landing the lunar lander safely through repeated simulations and then apply this learned policy in the actual game.

Reflection

On the surface, the changes look meaningful. The model also added functionality for saving the AI to a file, which we did not ask for; as training a model can take a while, however, this is certainly useful.

However, when we try to run the provided code, we see an error.

Traceback (most recent call last):
  File "location/lunar_lander.py", line 252, in <module>
    ai_controller.train(episodes=1000)
  File "location/lunar_lander.py", line 217, in train
    action = self.choose_action(state)
  File "location/lunar_lander.py", line 198, in choose_action
    return np.argmax(self.q_table[state])  # Exploit
IndexError: index 100 is out of bounds for axis 2 with size 100

Finally! We have a problem to solve. Let’s see if the large language model can fix this.

Fixing index out of bounds

To address the error, we provide the large language model the code and the error, and ask the large language model to explain the issue and to provide a solution.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely.

###

(code)

###

When the program is run, the following error message is shown. Explain the reason for the error and fix the program.

###

Traceback (most recent call last):
File "location/lunar_lander.py", line 252, in <module>
ai_controller.train(episodes=1000)
File "location/lunar_lander.py", line 217, in train
action = self.choose_action(state)
File "location/lunar_lander.py", line 198, in choose_action
return np.argmax(self.q_table[state]) # Exploit
IndexError: index 100 is out of bounds for axis 2 with size 100

The error message indicates that there is an `IndexError` when trying to access an element in the Q-table using `self.q_table[state]`. Specifically, the error occurs because one of the indices in the tuple `state` is out of bounds for the corresponding dimension of the Q-table.

...

The model output continues as follows.

Model output

Let’s analyze the state space and how it is discretized:

STATE_SPACE = (800, 600, 100, 10)  # Discretized state space: position_x, position_y, fuel, velocity_y

This means the Q-table has dimensions (800, 600, 100, 10, len(ACTIONS)).

In the get_state method, the state is discretized as follows:

def get_state(self):
    # Discretize the continuous state space
    position_x = int(self.lander.position[0] // 40)
    position_y = int(self.lander.position[1] // 30)
    fuel = int(self.lander.fuel // 1)
    velocity_y = int(self.lander.velocity[1] // 0.5)
    return (position_x, position_y, fuel, velocity_y)

Let’s break down the discretization:

  • position_x is divided by 40, which means it ranges from 0 to 19 (800 // 40).
  • position_y is divided by 30, which means it ranges from 0 to 19 (600 // 30).
  • fuel is divided by 1, which means it ranges from 0 to 99 (100 // 1).
  • velocity_y is divided by 0.5, which means it can potentially be very large, but it should be limited to the range 0 to 9 (10 // 0.5).

Given these ranges, the dimensions of the Q-table should be adjusted to match the discretized state space. The correct dimensions should be (20, 20, 100, 20, len(ACTIONS)).

Let’s fix the code by updating the STATE_SPACE and adjusting the get_state method to ensure the indices stay within bounds:

STATE_SPACE = (20, 20, 100, 20)  # Updated discretized state space

class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(int(self.lander.position[1] // 30), 19)
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int(self.lander.velocity[1] // 0.5), 0), 19)
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                reward = 1 if collision_status == "win" else -100 if collision_status == "lose" else -1
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

Now, the state indices should stay within the bounds of the Q-table, and the IndexError should be resolved.

Reflection

The explanation is meaningful and the changes look sensible. However, when we change the code according to the suggestions, we see another error.

Traceback (most recent call last):
  File "location/lunar_lander.py", line 252, in <module>
    ai_controller.train(episodes=1000)
  File "location/lunar_lander.py", line 223, in train
    self.update_q_value(state, action, reward, next_state)
  File "location/lunar_lander.py", line 206, in update_q_value
    best_next_action = np.argmax(self.q_table[next_state])
IndexError: index -21 is out of bounds for axis 1 with size 20

Let’s continue with the same strategy, and ask the large language model to fix the issue.

Fixing index out of bounds, part 2

Again, we provide the large language model the code and the error, asking the large language model to explain the issue and to provide a solution.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely.

###

(code)

###

When the program is run, the following error message is shown. Explain the reason for the error and fix the program.

###

Traceback (most recent call last):
File "location/lunar_lander.py", line 252, in <module>
ai_controller.train(episodes=1000)
File "location/lunar_lander.py", line 223, in train
self.update_q_value(state, action, reward, next_state)
File "location/lunar_lander.py", line 206, in update_q_value
best_next_action = np.argmax(self.q_table[next_state])
IndexError: index -21 is out of bounds for axis 1 with size 20

The error you're encountering is due to an index out-of-bounds issue in the Q-table. Specifically, the `velocity_y` component of the state is being discretized in such a way that it can produce negative values, which are invalid indices for the Q-table.

...

Model output

Let’s review the discretization of velocity_y in the get_state method:

velocity_y = min(max(int(self.lander.velocity[1] // 0.5), 0), 19)

The issue here is that self.lander.velocity[1] can be negative, and when divided by 0.5 and cast to an integer, it can produce negative values. The max function should ensure that the value is at least 0, but it seems there might be a logical error or an edge case that is not being handled correctly.

To fix this, let’s ensure that the velocity is always within the bounds of 0 to 19. One way to do this is to clamp the values more explicitly:

velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)

Here, adding 10 to the velocity before discretizing ensures that even negative velocities are shifted into a positive range, which can then be safely discretized and clamped.

Let’s update the get_state method with this change:

def get_state(self):
    # Discretize the continuous state space
    position_x = min(int(self.lander.position[0] // 40), 19)
    position_y = min(int(self.lander.position[1] // 30), 19)
    fuel = min(int(self.lander.fuel // 1), 99)
    velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Adjusted for velocity range
    return (position_x, position_y, fuel, velocity_y)

This should resolve the out-of-bounds error. Additionally, you might want to ensure that other components of the state are also within valid ranges, although they seem to be handled correctly in the current code.

Here is the updated get_state method in the context of the complete AIController class:

class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(int(self.lander.position[1] // 30), 19)
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Adjusted for velocity range
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                reward = 1 if collision_status == "win" else -100 if collision_status == "lose" else -1
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

With this adjustment, the program should run without the index out-of-bounds error.

Reflection

When we update the code to match the suggestion from the large language model, the error is still present.

Traceback (most recent call last):
  File "location/lunar_lander.py", line 252, in <module>
    ai_controller.train(episodes=1000)
  File "location/lunar_lander.py", line 223, in train
    self.update_q_value(state, action, reward, next_state)
  File "location/lunar_lander.py", line 206, in update_q_value
    best_next_action = np.argmax(self.q_table[next_state])
IndexError: index -21 is out of bounds for axis 1 with size 20

After a few more attempts at providing the error and asking the large language model to provide a fix, we notice that we need to be more specific in our request.

Fixing index out of bounds, part 3

Looking at the output from the last attempt, the model suggests that the issue lies in how the velocity is calculated. However, the error message shows that the out-of-bounds index occurs on axis 1. When we consider the state space (position_x, position_y, fuel, velocity_y), index 1 corresponds to position_y.

This suggests that the issue is likely not related to velocity, but to how the position_y is calculated or handled.
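A quick way to confirm this reading of the error is to remember that indexing a NumPy array with a tuple consumes one axis per element, in order. A minimal illustration with an array shaped like our Q-table:

import numpy as np

# Indexing with a tuple consumes one axis per element: axis 0 is position_x,
# axis 1 is position_y, axis 2 is fuel, and axis 3 is velocity_y.
q_table = np.zeros((20, 20, 100, 20, 2))
state = (0, -21, 0, 0)          # a position_y index of -21
try:
    q_table[state]
except IndexError as error:
    print(error)                # index -21 is out of bounds for axis 1 with size 20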

Let’s try again, this time providing an additional suggestion to the model.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely.

###

(code)

###

When the program is run, the following error message is shown. The error message suggests that the index error is in index 1 (axis 1). When we consider the state space `(position_x, position_y, fuel, velocity_y)`, at index 1, we have the `position_y`. This suggests that the issue might be related to `position_y`. Explain the reason for the error and fix the program.

###

Traceback (most recent call last):
File "location/lunar_lander.py", line 252, in <module>
ai_controller.train(episodes=1000)
File "location/lunar_lander.py", line 223, in train
self.update_q_value(state, action, reward, next_state)
File "location/lunar_lander.py", line 206, in update_q_value
best_next_action = np.argmax(self.q_table[next_state])
IndexError: index -21 is out of bounds for axis 1 with size 20

The error message indicates that there is an issue with indexing into the Q-table, specifically with the `position_y` component of the state space. The `IndexError: index -21 is out of bounds for axis 1 with size 20` suggests that the index for `position_y` is becoming negative, which is not valid given that the Q-table's dimensions are limited to the range defined by `STATE_SPACE`.

...

The output continues as follows.

Model output

The problem lies in the get_state method of the AIController class. The position_y is being discretized by dividing the lander’s vertical position by 30 and then taking the integer part. However, if the lander’s vertical position is negative (which can happen at the start of the game or during the game due to physics simulation), this will result in a negative index.

To fix this, we should ensure that the position_y index is always non-negative and within the bounds of the Q-table. We can achieve this by taking the maximum of 0 and the calculated index.

Here’s the corrected get_state method:

class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON
        
    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Ensure velocity_y is within range
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                reward = 1 if collision_status == "win" else -100 if collision_status == "lose" else -1
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

The key change is in the line that calculates position_y:

position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range

This ensures that position_y is always within the valid range [0, 19], preventing any negative indices and thus avoiding the IndexError.

Reflection

Finally, the error is resolved and we can run the program. Training takes a while, so it is sensible to modify the collision_status check in the train method to print the progress of the training. We could ask the model for this as well, but it’s faster to simply make the change ourselves.

The modified train method is as follows.

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                reward = 1 if collision_status == "win" else -100 if collision_status == "lose" else -1
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    # new line below
                    print(episode, collision_status)
                    done = True

Now, when we run the program, we see that the collision status is printed for each episode. However, we also see that the lander always crashes. Let’s try to fix this.

Attention to rewards

One of the key aspects of reinforcement learning and Q-learning is coming up with a meaningful reward function. At the moment, the reward function in the Lunar Lander game is quite simplistic, with rewards of 1 for a win, -100 for a loss, and -1 for every other step. This might not provide enough incentive for the lander to learn to land safely.
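To get a rough sense of why the constant step penalty can be a problem, consider the following back-of-the-envelope arithmetic; the episode length here is an assumption for the sake of illustration.

# Back-of-the-envelope arithmetic, assuming an episode lasts roughly 100 steps.
steps = 100
episode_return = -1 * steps + 1   # per-step penalties plus the +1 for landing safely
print(episode_return)             # -99: even a successful episode accumulates a large penalty

In other words, the -1 per step dominates the +1 for a safe landing, so the penalties can override the reward from success.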

At this point, simply asking the model to fix the program so that the lander learns to land did not, even after a few attempts, produce a satisfactory solution. Therefore, we explicitly ask for a more specific fix related to the reward function.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely. Currently, the algorithm does not work as expected as the lander does not learn to land safely. One of the problems is that the reward function provides a penalty in all of the scenarios where the AI does not win, which means that the penalty can override the rewards from the successes. Modify the reward so that the reward is positive when the lander lands safely and negative when the lander crashes. In other scenarios, do not provide a reward. Please explain the issue and provide a fix.

###

(code)

The issue with the current implementation of the reward function in your `AIController` is that it provides a negative reward in all scenarios where the AI does not win, which means that these penalties can override the rewards from successful landings. This can impede the learning process because the AI is consistently penalized, even when it is making progress towards a successful landing.

...

The output of the model continues as follows.

Model output

To address this, we need to modify the reward function to provide a positive reward when the lander lands safely, a negative reward when it crashes, and no reward in other scenarios. This way, the AI will be incentivized to learn the actions that lead to a safe landing without being overly penalized for intermediate steps.

Here is the modified code for the train method in the AIController class:

class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON
        
    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Ensure velocity_y is within range
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                
                # Modify the reward function
                if collision_status == "win":
                    reward = 100  # Positive reward for a successful landing
                elif collision_status == "lose":
                    reward = -100  # Negative reward for a crash
                else:
                    reward = 0  # No reward for intermediate steps
                
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    print(episode, collision_status)
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

Explanation of Changes:

  1. Positive Reward for Safe Landing: When the lander lands safely (collision_status == "win"), a positive reward of 100 is given. This encourages the AI to learn actions that lead to a safe landing.
  2. Negative Reward for Crash: When the lander crashes (collision_status == "lose"), a negative reward of -100 is given. This discourages the AI from actions that result in a crash.
  3. No Reward for Intermediate Steps: In other scenarios where neither a win nor a loss has occurred, no reward (reward = 0) is given. This prevents the AI from being penalized for intermediate steps and allows it to focus on achieving the end goal of a safe landing.

By making these changes, the AI should be able to learn more effectively how to land the lunar lander safely.

Reflection

Now, when we run the program, we see from the output that the AI succeeds in landing the lunar lander safely every now and then. This is a significant improvement over the previous version, where the AI always crashed.

However, we also notice that the AI never lands safely in the actual game. Let’s ask for a fix.

Landing in the actual game

One of the key issues is that while the AI controller learns to control the lunar lander during training, the lunar lander that is controlled in the LunarLanderGame is not passed to the AI controller. This means that the AI controller is not controlling the actual lunar lander in the game, leading to the lander crashing every time. To fix this, we need to ensure that the AI controller is controlling the lunar lander in the game.
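The mismatch can be sketched roughly as follows; the snippet only mirrors how the instances are created in the current code and is not new functionality.

# Sketch of the instance mismatch in the current code.
trained_ai = AIController(LunarLander())      # the controller starts with its own lander
trained_ai.train(episodes=1000)               # train() keeps creating fresh landers

game = LunarLanderGame(ai_controller=trained_ai)
# game.lander is a brand-new LunarLander, but trained_ai.lander still refers to
# the last lander created during training, so the AI steers the wrong object.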

Again, asking the model to fix the program without explicitly pointing out the issue did not, even after a few attempts, produce a satisfactory solution. Therefore, we explicitly ask for a more specific fix.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely. Currently, the AI learns to land safely every now and then, but the AI never lands safely in the actual game after the training. The key issue is that when the LunarLanderGame is started, the trained AI controller is never given the new lunar lander to control. Please provide a fix for the issue.

###

(code)

The issue you're encountering is that the `AIController` is not correctly associated with the new `LunarLander` instance created in the `LunarLanderGame` class. When the game starts, the `LunarLander` instance created in the `LunarLanderGame` constructor is not the same instance that the `AIController` was trained with. The `AIController` needs to be updated with the new `LunarLander` instance each time a new game starts.

...

The output from the model continued as follows.

Model output

To fix this, you should update the AIController to use the new LunarLander instance created in the LunarLanderGame class. Here’s how you can do it:

  1. Modify the LunarLanderGame class to pass the new LunarLander instance to the AIController.
  2. Ensure that the AIController is updated to control the new LunarLander instance.

Here’s the updated code:

import pygame
import numpy as np
import random
import pickle

# Constants for the Q-learning algorithm
ACTIONS = [0, 1]  # 0: No thrust, 1: Thrust
STATE_SPACE = (20, 20, 100, 20)  # Discretized state space: position_x, position_y, fuel, velocity_y
ALPHA = 0.1  # Learning rate
GAMMA = 0.99  # Discount factor
EPSILON = 0.1  # Exploration rate

# Class to manage the overall game state
class LunarLanderGame:
    def __init__(self, ai_controller=None):
        pygame.init()
        self.screen = pygame.display.set_mode((800, 600))
        pygame.display.set_caption("Lunar Lander")
        self.lander = LunarLander()
        self.surface = LunarSurface()
        self.ui = GameUI(self.screen, self.lander, self.surface)
        self.player_controller = PlayerController(self.lander)
        self.ai_controller = ai_controller
        if self.ai_controller:
            self.ai_controller.set_lander(self.lander)
        self.score_manager = ScoreManager()
        self.is_running = True

    def start_game(self):
        # Main game loop
        while self.is_running:
            self.handle_events()
            self.update_game_state()
            self.render()
            pygame.time.delay(30)  # Control frame rate

    def handle_events(self):
        # Handle player input and other events
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                self.is_running = False
            self.player_controller.handle_event(event)

    def update_game_state(self):
        # Update lander and check for collisions
        self.lander.update()
        collision_status = self.surface.check_collision(self.lander)
        if collision_status is not None:
            self.is_running = False
            self.ui.display_game_over(collision_status)
        else:
            if self.ai_controller:
                self.ai_controller.control_lander()

    def render(self):
        # Render the game UI
        self.ui.render()

# Class to represent the Lunar Lander
class LunarLander:
    def __init__(self):
        self.position = [390, 0]  # Center the lander horizontally at the top
        self.velocity = [0, 0]
        self.fuel = 100
        self.thrusting = False

    def update(self):
        # Update position and velocity based on thrust and gravity
        self.apply_gravity()
        if self.thrusting and self.fuel > 0:
            self.apply_thrust()
        self.position[1] += self.velocity[1]

    def apply_gravity(self):
        # Simulate gravity affecting the lander
        self.velocity[1] += 0.1  # Simple gravity effect

    def apply_thrust(self):
        # Apply thrust to counteract gravity
        self.velocity[1] -= 0.5
        self.fuel -= 0.1

# Class to represent the lunar surface
class LunarSurface:
    def __init__(self):
        # Initialize lunar surface (e.g., terrain)
        self.height = 50  # Height of the terrain

    def check_collision(self, lander):
        # Check for collision with the lunar surface
        leg_bottom = lander.position[1] + 30  # Bottom of the legs
        if leg_bottom >= 600 - self.height:
            if lander.velocity[1] < 3:
                return "win"
            else:
                return "lose"
        return None

# Class to manage the graphical user interface
class GameUI:
    def __init__(self, screen, lander, surface):
        self.screen = screen
        self.lander = lander
        self.surface = surface
        self.font = pygame.font.Font(None, 36)

    def render(self):
        # Render the lander, surface, and status information
        self.screen.fill((0, 0, 0))  # Fill the screen with black

        # Draw the lunar lander
        self.draw_lander()

        # Draw the terrain as a green rectangle
        pygame.draw.rect(self.screen, (0, 255, 0), (0, 600 - self.surface.height, 800, self.surface.height))

        # Draw the HUD
        self.render_hud()

        pygame.display.flip()  # Update the screen

    def draw_lander(self):
        # Draw the lander body as a white square
        pygame.draw.rect(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1], 20, 20))

        # Draw the dome as a white circle
        pygame.draw.circle(self.screen, (255, 255, 255), (self.lander.position[0] + 10, self.lander.position[1]), 10)

        # Draw the legs as white lines
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1] + 20), (self.lander.position[0] - 10, self.lander.position[1] + 30))
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0] + 20, self.lander.position[1] + 20), (self.lander.position[0] + 30, self.lander.position[1] + 30))

        # Draw the thrust flame as an upside-down triangle if thrusting
        if self.lander.thrusting:
            flame_color = (255, 0, 0)  # Red color for the flame
            flame_points = [
                (self.lander.position[0] + 10, self.lander.position[1] + 40),  # Bottom point
                (self.lander.position[0], self.lander.position[1] + 20),  # Top-left point
                (self.lander.position[0] + 20, self.lander.position[1] + 20)  # Top-right point
            ]
            pygame.draw.polygon(self.screen, flame_color, flame_points)

    def render_hud(self):
        # Render the HUD with altitude, velocity, and fuel level
        altitude = 600 - self.surface.height - (self.lander.position[1] + 30)  # Adjusted for leg length
        velocity = self.lander.velocity[1]
        fuel = self.lander.fuel

        altitude_text = self.font.render(f"Altitude: {altitude:.2f}", True, (255, 255, 255))
        velocity_text = self.font.render(f"Velocity: {velocity:.2f}", True, (255, 255, 255))
        fuel_text = self.font.render(f"Fuel: {fuel:.2f}", True, (255, 255, 255))

        self.screen.blit(altitude_text, (10, 10))
        self.screen.blit(velocity_text, (10, 40))
        self.screen.blit(fuel_text, (10, 70))

    def display_game_over(self, status):
        # Display the game over screen with win/lose message
        font = pygame.font.Font(None, 74)
        if status == "win":
            text = font.render("Landed safely", True, (0, 255, 0))
        else:
            text = font.render("Crash", True, (255, 0, 0))
        self.screen.blit(text, (200, 250))
        pygame.display.flip()
        pygame.time.delay(2000)  # Display for 2 seconds

# Class to handle player input
class PlayerController:
    def __init__(self, lander):
        self.lander = lander

    def handle_event(self, event):
        # Handle player input events to control the lander
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_UP:
                self.lander.thrusting = True
        elif event.type == pygame.KEYUP:
            if event.key == pygame.K_UP:
                self.lander.thrusting = False

# Class to provide AI control for the lander using Q-learning
class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = EPSILON
        
    def set_lander(self, lander):
        self.lander = lander

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Ensure velocity_y is within range
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                
                # Modify the reward function
                if collision_status == "win":
                    reward = 100  # Positive reward for a successful landing
                elif collision_status == "lose":
                    reward = -100  # Negative reward for a crash
                else:
                    reward = 0  # No reward for intermediate steps
                
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    print(episode, collision_status)
                    done = True

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

# Class to manage scoring and feedback
class ScoreManager:
    def __init__(self):
        self.score = 0

    def calculate_score(self, lander):
        # Calculate score based on landing success, fuel efficiency, and time taken
        pass

    def display_score(self):
        # Display the score to the player
        pass

# Training the AI
if __name__ == "__main__":
    ai_controller = AIController(LunarLander())
    ai_controller.train(episodes=1000)
    ai_controller.save_q_table('q_table.pkl')

    # Load the trained AI and start the game
    ai_controller.load_q_table('q_table.pkl')
    game = LunarLanderGame(ai_controller=ai_controller)
    game.start_game()
    pygame.quit()

In this updated version, the LunarLanderGame class now has the following changes:

  • When initializing LunarLanderGame, if an ai_controller is provided, it sets the lander instance in the ai_controller using the set_lander method.
  • The AIController class now includes a set_lander method to update the lander instance.

This ensures that the AIController always controls the correct LunarLander instance created in the LunarLanderGame class.

Reflection

The AI controller is now correctly associated with the LunarLander instance each time a new game starts, allowing the AI to control the lander in the actual game. After training, the lander lands successfully every once in a while, but not always.

Let’s continue tuning the AI a bit further. At this point, one could ask the model for suggestions on improving the AI’s performance in the game, but we’ll be more direct and ask for a specific change.

Reducing randomness over time

Reinforcement learning algorithms often use an exploration-exploitation strategy to balance between exploring new actions and exploiting known actions. In the current implementation, the AI controller uses an epsilon-greedy strategy to choose actions, where the constant EPSILON determines the probability of exploration. In 10% of the cases, the agent picks a random action, while in the remaining 90% it exploits the learned Q-values.

A more suitable approach could be an epsilon-decreasing strategy, where the training would start with a high exploration rate and gradually reduce it over time as the AI learns more about the environment. This would allow the AI to explore more in the beginning and exploit the learned Q-values more as it becomes more confident in its actions.
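As a sketch, an epsilon-decreasing schedule typically multiplies epsilon by a decay factor after each episode until it reaches some floor. The particular numbers below are illustrative assumptions rather than recommendations.

# Sketch of an epsilon-decreasing schedule; the initial value, decay rate,
# and floor are illustrative assumptions.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995

for episode in range(1000):
    # ... run one training episode using the current epsilon ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print(round(epsilon, 2))   # 0.01 -- exploration has decayed to the floor after 1000 episodes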

Let’s ask the large language model to adjust the AI controller to use an epsilon-decreasing strategy for exploration.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely. Currently, the AI uses an epsilon-greedy strategy to decide when to pick a random option and when to choose an option from the learned Q-table. Modify the code so that it uses an epsilon-decreasing strategy instead. Choose suitable initial epsilon values and reduction rates, accounting for the number of episodes used in the training. After providing the modified code, summarize the modifications.

###

(code)

To implement an epsilon-decreasing strategy in the Q-learning algorithm for the Lunar Lander game, we need to adjust the `AIController` class to reduce the value of epsilon over time as the agent gains more experience. This approach helps the agent explore the environment sufficiently in the early stages of training and gradually exploit the learned policy more as training progresses.

...

The output from the model continued as follows.

Model output

Here are the modifications made to the code:

  1. Initial Epsilon and Reduction Rate: We define an initial epsilon value and a reduction rate for epsilon. These values are chosen to ensure a balance between exploration and exploitation over the training episodes.

  2. Update Epsilon in Training Loop: We update the epsilon value after each episode to decrease it gradually.

The modified code for the AIController class is as follows:

# Class to provide AI control for the lander using Q-learning
class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = 1.0  # Initial exploration rate
        self.epsilon_min = 0.01  # Minimum exploration rate
        self.epsilon_decay = 0.995  # Decay rate for epsilon
        
    def set_lander(self, lander):
        self.lander = lander

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Ensure velocity_y is within range
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                
                # Modify the reward function
                if collision_status == "win":
                    reward = 100  # Positive reward for a successful landing
                elif collision_status == "lose":
                    reward = -100  # Negative reward for a crash
                else:
                    reward = 0  # No reward for intermediate steps
                
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    print(episode, collision_status)
                    done = True
            
            # Decrease epsilon after each episode
            if self.epsilon > self.epsilon_min:
                self.epsilon *= self.epsilon_decay

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

Summary of Modifications:

  1. Initial Epsilon Value: Set self.epsilon to 1.0 to start with a high exploration rate.
  2. Minimum Epsilon Value: Added self.epsilon_min to ensure the exploration rate does not go below a certain threshold (0.01).
  3. Epsilon Decay Rate: Added self.epsilon_decay to control the rate at which epsilon decreases (0.995).
  4. Epsilon Update in Training Loop: After each episode, epsilon is multiplied by the decay rate, gradually reducing the exploration as training progresses.

These modifications help the AI to explore more in the early stages and exploit the learned policy more as it becomes more experienced, improving the overall learning efficiency.
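
As a quick sanity check of the constants the model chose (our own back-of-the-envelope calculation, not part of the model output), we can compute how many episodes it takes for epsilon to decay from 1.0 to the 0.01 floor.

import math

# Solve 1.0 * 0.995**n = 0.01 for n
episodes_to_floor = math.log(0.01) / math.log(0.995)
print(round(episodes_to_floor))  # about 919

With 1000 training episodes, the agent keeps at least some exploration for almost the entire run and only reaches the floor near the end, so the chosen values seem like a reasonable fit.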

Reflection

Well, the modifications are sensible, but the lander still does not land safely every time. At this point, it could be worthwhile to further adjust the state space. However, the state space definition and the get_state method are currently linked through magic numbers, so changing one would also require changing the other. It’s time for some refactoring.

Refactoring

One of the problems that reduces the generalizability of the code is the use of “magic numbers”. As an example, the get_state method in the AIController class uses specific values such as 40, 30, 1, and 0.5 to discretize the continuous state space. These magic numbers make the code less readable and harder to maintain.

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // 40), 19)
        position_y = min(max(int(self.lander.position[1] // 30), 0), 19)  # Ensure position_y is within range
        fuel = min(int(self.lander.fuel // 1), 99)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // 0.5), 0), 19)  # Ensure velocity_y is within range
        return (position_x, position_y, fuel, velocity_y)
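
For illustration, one way to avoid such magic numbers would be a small generic helper that derives the bucket width from the value range and the number of buckets. This is only a sketch; the helper name discretize and its parameters are hypothetical and not part of the existing code.

def discretize(value, low, high, buckets):
    # Map a continuous value in [low, high] to a bucket index in [0, buckets - 1]
    index = int((value - low) / (high - low) * buckets)
    return min(max(index, 0), buckets - 1)

# Hypothetical usage for the lander's vertical position, assuming the
# constants WINDOW_HEIGHT and STATE_SPACE used elsewhere in the code:
# position_y = discretize(self.lander.position[1], 0, WINDOW_HEIGHT, STATE_SPACE[1])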

As the discretization is based on the state space that is initialized at the beginning of the code, it would be better to derive the bucket sizes from those constants, for example with a helper like the one sketched above, instead of relying on magic numbers. Let’s ask the large language model to refactor the code for better readability.

The following code outlines code for the Lunar Lander game that uses the Q-learning algorithm for learning how to land the lunar lander safely. The code uses a handful of magic numbers and does not account for changes in the state space or for the window size, epsilon, epsilon decay, or minimum epsilon. Similarly, the get_state method assumes that the state space and screen size is always the same. Refactor the following code by extracting constants so that modifications to the game functionality can be made directly through the constants without a need to modify other places in the code.

###

(code)

To make the code more maintainable and adaptable, we can extract constants for various aspects of the game. This includes window size, lander properties, gravity, thrust, terrain height, and Q-learning parameters. These constants will be defined at the beginning of the script, making it easier to adjust the game settings without modifying the code logic.

...

The output of the model continued as follows.

Model output

Here’s the refactored code:

import pygame
import numpy as np
import random
import pickle

# Game Constants
WINDOW_WIDTH = 800
WINDOW_HEIGHT = 600
LANDER_WIDTH = 20
LANDER_HEIGHT = 20
LANDER_INIT_X = WINDOW_WIDTH // 2 - LANDER_WIDTH // 2
LANDER_INIT_Y = 0
LANDER_INIT_FUEL = 100
GRAVITY = 0.1
THRUST = 0.5
THRUST_FUEL_CONSUMPTION = 0.1
TERRAIN_HEIGHT = 50
LEG_LENGTH = 30
FRAME_DELAY = 30

# Q-learning Constants
ACTIONS = [0, 1]  # 0: No thrust, 1: Thrust
STATE_SPACE = (20, 20, 100, 20)  # Discretized state space: position_x, position_y, fuel, velocity_y
ALPHA = 0.1  # Learning rate
GAMMA = 0.99  # Discount factor
INITIAL_EPSILON = 1.0  # Initial exploration rate
EPSILON_MIN = 0.01  # Minimum exploration rate
EPSILON_DECAY = 0.995  # Decay rate for epsilon

# Class to manage the overall game state
class LunarLanderGame:
    def __init__(self, ai_controller=None):
        pygame.init()
        self.screen = pygame.display.set_mode((WINDOW_WIDTH, WINDOW_HEIGHT))
        pygame.display.set_caption("Lunar Lander")
        self.lander = LunarLander()
        self.surface = LunarSurface()
        self.ui = GameUI(self.screen, self.lander, self.surface)
        self.player_controller = PlayerController(self.lander)
        self.ai_controller = ai_controller
        if self.ai_controller:
            self.ai_controller.set_lander(self.lander)
        self.score_manager = ScoreManager()
        self.is_running = True

    def start_game(self):
        # Main game loop
        while self.is_running:
            self.handle_events()
            self.update_game_state()
            self.render()
            pygame.time.delay(FRAME_DELAY)  # Control frame rate

    def handle_events(self):
        # Handle player input and other events
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                self.is_running = False
            self.player_controller.handle_event(event)

    def update_game_state(self):
        # Update lander and check for collisions
        self.lander.update()
        collision_status = self.surface.check_collision(self.lander)
        if collision_status is not None:
            self.is_running = False
            self.ui.display_game_over(collision_status)
        else:
            if self.ai_controller:
                self.ai_controller.control_lander()

    def render(self):
        # Render the game UI
        self.ui.render()

# Class to represent the Lunar Lander
class LunarLander:
    def __init__(self):
        self.position = [LANDER_INIT_X, LANDER_INIT_Y]
        self.velocity = [0, 0]
        self.fuel = LANDER_INIT_FUEL
        self.thrusting = False

    def update(self):
        # Update position and velocity based on thrust and gravity
        self.apply_gravity()
        if self.thrusting and self.fuel > 0:
            self.apply_thrust()
        self.position[1] += self.velocity[1]

    def apply_gravity(self):
        # Simulate gravity affecting the lander
        self.velocity[1] += GRAVITY

    def apply_thrust(self):
        # Apply thrust to counteract gravity
        self.velocity[1] -= THRUST
        self.fuel -= THRUST_FUEL_CONSUMPTION

# Class to represent the lunar surface
class LunarSurface:
    def __init__(self):
        # Initialize lunar surface (e.g., terrain)
        self.height = TERRAIN_HEIGHT

    def check_collision(self, lander):
        # Check for collision with the lunar surface
        leg_bottom = lander.position[1] + LEG_LENGTH  # Bottom of the legs
        if leg_bottom >= WINDOW_HEIGHT - self.height:
            if lander.velocity[1] < 3:
                return "win"
            else:
                return "lose"
        return None

# Class to manage the graphical user interface
class GameUI:
    def __init__(self, screen, lander, surface):
        self.screen = screen
        self.lander = lander
        self.surface = surface
        self.font = pygame.font.Font(None, 36)

    def render(self):
        # Render the lander, surface, and status information
        self.screen.fill((0, 0, 0))  # Fill the screen with black

        # Draw the lunar lander
        self.draw_lander()

        # Draw the terrain as a green rectangle
        pygame.draw.rect(self.screen, (0, 255, 0), (0, WINDOW_HEIGHT - self.surface.height, WINDOW_WIDTH, self.surface.height))

        # Draw the HUD
        self.render_hud()

        pygame.display.flip()  # Update the screen

    def draw_lander(self):
        # Draw the lander body as a white square
        pygame.draw.rect(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1], LANDER_WIDTH, LANDER_HEIGHT))

        # Draw the dome as a white circle
        pygame.draw.circle(self.screen, (255, 255, 255), (self.lander.position[0] + LANDER_WIDTH // 2, self.lander.position[1]), LANDER_WIDTH // 2)

        # Draw the legs as white lines
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0], self.lander.position[1] + LANDER_HEIGHT), (self.lander.position[0] - 10, self.lander.position[1] + LEG_LENGTH))
        pygame.draw.line(self.screen, (255, 255, 255), (self.lander.position[0] + LANDER_WIDTH, self.lander.position[1] + LANDER_HEIGHT), (self.lander.position[0] + LANDER_WIDTH + 10, self.lander.position[1] + LEG_LENGTH))

        # Draw the thrust flame as an upside-down triangle if thrusting
        if self.lander.thrusting:
            flame_color = (255, 0, 0)  # Red color for the flame
            flame_points = [
                (self.lander.position[0] + LANDER_WIDTH // 2, self.lander.position[1] + LANDER_HEIGHT + 20),  # Bottom point
                (self.lander.position[0], self.lander.position[1] + LANDER_HEIGHT),  # Top-left point
                (self.lander.position[0] + LANDER_WIDTH, self.lander.position[1] + LANDER_HEIGHT)  # Top-right point
            ]
            pygame.draw.polygon(self.screen, flame_color, flame_points)

    def render_hud(self):
        # Render the HUD with altitude, velocity, and fuel level
        altitude = WINDOW_HEIGHT - self.surface.height - (self.lander.position[1] + LEG_LENGTH)  # Adjusted for leg length
        velocity = self.lander.velocity[1]
        fuel = self.lander.fuel

        altitude_text = self.font.render(f"Altitude: {altitude:.2f}", True, (255, 255, 255))
        velocity_text = self.font.render(f"Velocity: {velocity:.2f}", True, (255, 255, 255))
        fuel_text = self.font.render(f"Fuel: {fuel:.2f}", True, (255, 255, 255))

        self.screen.blit(altitude_text, (10, 10))
        self.screen.blit(velocity_text, (10, 40))
        self.screen.blit(fuel_text, (10, 70))

    def display_game_over(self, status):
        # Display the game over screen with win/lose message
        font = pygame.font.Font(None, 74)
        if status == "win":
            text = font.render("Landed safely", True, (0, 255, 0))
        else:
            text = font.render("Crash", True, (255, 0, 0))
        self.screen.blit(text, (200, 250))
        pygame.display.flip()
        pygame.time.delay(2000)  # Display for 2 seconds

# Class to handle player input
class PlayerController:
    def __init__(self, lander):
        self.lander = lander

    def handle_event(self, event):
        # Handle player input events to control the lander
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_UP:
                self.lander.thrusting = True
        elif event.type == pygame.KEYUP:
            if event.key == pygame.K_UP:
                self.lander.thrusting = False

# Class to provide AI control for the lander using Q-learning
class AIController:
    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = INITIAL_EPSILON  # Initial exploration rate
        
    def set_lander(self, lander):
        self.lander = lander

    def get_state(self):
        # Discretize the continuous state space
        position_x = min(int(self.lander.position[0] // (WINDOW_WIDTH / STATE_SPACE[0])), STATE_SPACE[0] - 1)
        position_y = min(max(int(self.lander.position[1] // (WINDOW_HEIGHT / STATE_SPACE[1])), 0), STATE_SPACE[1] - 1)
        fuel = min(int(self.lander.fuel // (LANDER_INIT_FUEL / STATE_SPACE[2])), STATE_SPACE[2] - 1)
        velocity_y = min(max(int((self.lander.velocity[1] + 10) // (20 / STATE_SPACE[3])), 0), STATE_SPACE[3] - 1)
        return (position_x, position_y, fuel, velocity_y)

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(ACTIONS)  # Explore
        else:
            return np.argmax(self.q_table[state])  # Exploit

    def control_lander(self):
        state = self.get_state()
        action = self.choose_action(state)
        self.lander.thrusting = (action == 1)

    def update_q_value(self, state, action, reward, next_state):
        best_next_action = np.argmax(self.q_table[next_state])
        td_target = reward + GAMMA * self.q_table[next_state][best_next_action]
        td_error = td_target - self.q_table[state][action]
        self.q_table[state][action] += ALPHA * td_error

    def train(self, episodes):
        for episode in range(episodes):
            self.lander = LunarLander()
            state = self.get_state()
            done = False
            while not done:
                action = self.choose_action(state)
                self.lander.thrusting = (action == 1)
                self.lander.update()
                next_state = self.get_state()
                collision_status = LunarSurface().check_collision(self.lander)
                
                # Modify the reward function
                if collision_status == "win":
                    reward = 100  # Positive reward for a successful landing
                elif collision_status == "lose":
                    reward = -100  # Negative reward for a crash
                else:
                    reward = 0  # No reward for intermediate steps
                
                self.update_q_value(state, action, reward, next_state)
                state = next_state
                if collision_status is not None:
                    print(episode, collision_status)
                    done = True
            
            # Decrease epsilon after each episode
            if self.epsilon > EPSILON_MIN:
                self.epsilon *= EPSILON_DECAY

    def save_q_table(self, filename):
        with open(filename, 'wb') as f:
            pickle.dump(self.q_table, f)

    def load_q_table(self, filename):
        with open(filename, 'rb') as f:
            self.q_table = pickle.load(f)

# Class to manage scoring and feedback
class ScoreManager:
    def __init__(self):
        self.score = 0

    def calculate_score(self, lander):
        # Calculate score based on landing success, fuel efficiency, and time taken
        pass

    def display_score(self):
        # Display the score to the player
        pass

# Training the AI
if __name__ == "__main__":
    ai_controller = AIController(LunarLander())
    ai_controller.train(episodes=1000)
    ai_controller.save_q_table('q_table.pkl')

    # Load the trained AI and start the game
    ai_controller.load_q_table('q_table.pkl')
    game = LunarLanderGame(ai_controller=ai_controller)
    game.start_game()
    pygame.quit()

In this refactored code, constants are defined at the beginning. This makes it easier to adjust parameters like window size, lander properties, gravity, thrust, terrain height, and Q-learning parameters without changing the core logic of the game.

Reflection

The changes are sensible and make it easier to adjust the game parameters without having to modify multiple locations in the code. The code is now more readable and maintainable, and it is easier to experiment with different settings. However, the lander still does not land safely every time, so it might be worth exploring other aspects of the game to improve the AI’s performance.

Playing out

At this point, the code is in a state where we can simply experiment with modifications and see how they affect the outcomes. As mentioned earlier, the reward function is important; one possibility would be to modify it so that a successful landing yields a small additional bonus based on the remaining fuel.

# Modify the reward function
if collision_status == "win":
    reward = 100 + self.lander.fuel / 10  # Positive reward for a successful landing, plus a small bonus for remaining fuel
elif collision_status == "lose":
    reward = -100  # Negative reward for a crash
else:
    reward = 0  # No reward for intermediate steps
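
Another direction to experiment with, sketched below as a purely hypothetical variant, would be to give a small negative reward on every intermediate step, nudging the agent to land sooner rather than hover on its remaining fuel.

# Alternative reward shaping (hypothetical): penalize each intermediate step slightly
if collision_status == "win":
    reward = 100 + self.lander.fuel / 10  # successful landing, plus a fuel bonus
elif collision_status == "lose":
    reward = -100  # negative reward for a crash
else:
    reward = -0.1  # small per-step penalty instead of 0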

Similarly, we could change the epsilon value to 0 when starting to play the game, so that the AI controller would not have any randomness in its decisions.

if __name__ == "__main__":
    ai_controller = AIController(LunarLander())
    ai_controller.train(episodes=1000)
    ai_controller.save_q_table('q_table.pkl')

    # Load the trained AI and start the game
    ai_controller.load_q_table('q_table.pkl')
    ai_controller.epsilon = 0 # disable exploration when playing with UI
    game = LunarLanderGame(ai_controller=ai_controller)
    game.start_game()
    pygame.quit()

Furthermore, a bigger change in terms of how the agent learns to behave would be to change the initialization strategy. Right now, in the constructor of the AIController class, the q_table is filled with zeros. Since np.argmax returns the first index when all values are tied, this means that in every unexplored state the greedy action is action 0, i.e. by default the lander does not thrust.

    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.zeros(STATE_SPACE + (len(ACTIONS),))
        self.epsilon = INITIAL_EPSILON  # Initial exploration rate
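
A quick standalone check of this behavior, separate from the game code:

import numpy as np

# With all-zero Q-values for the two actions, np.argmax breaks the tie by
# returning the first index, i.e. action 0 ("no thrust")
print(np.argmax(np.zeros(2)))  # 0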

One could change the initialization of the q_table to randomly chosen ones and zeros, so that the initial “best” action in each state is effectively random.

    def __init__(self, lander):
        self.lander = lander
        self.q_table = np.random.choice([0, 1], size=STATE_SPACE + (len(ACTIONS),))
        self.epsilon = INITIAL_EPSILON  # Initial exploration rate

In the same vein, we could modify the initialization code so that it loads the previously trained Q-table directly, without training a new one.

if __name__ == "__main__":
    ai_controller = AIController(LunarLander())
    # ai_controller.train(episodes=1000)
    # ai_controller.save_q_table('q_table.pkl')

    # Load the trained AI and start the game
    ai_controller.load_q_table('q_table.pkl')
    ai_controller.epsilon = 0 # disable exploration when playing with UI
    game = LunarLanderGame(ai_controller=ai_controller)
    game.start_game()
    pygame.quit()

Reinforcement learning and large language models

Large language model developers also leverage reinforcement learning when fine-tuning large language models. Reinforcement learning from human feedback (RLHF) focuses on learning a reward function from human feedback, which can then be used to guide the fine-tuning of the models.
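
As a rough illustration of the reward-learning part (a minimal sketch of the idea, not how any particular model is actually trained), a reward model can be fitted on pairs of responses where a human has marked one as preferred, by pushing the score of the preferred response above the score of the other.

import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise loss: the smaller the loss, the more the reward model
    # agrees with the human preference
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

print(preference_loss(2.0, -1.0))  # small loss: reward model agrees with the human
print(preference_loss(-1.0, 2.0))  # large loss: reward model disagrees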

