Welcome, tech enthusiasts and future AI gurus! Have you ever dreamed of creating a virtual driver that doesn’t just play a game, but truly masters it? Today, we’re diving into one of the most exciting challenges in the AI world: teaching an agent to drive a race car using Reinforcement Learning.
We’ll be tackling the CarRacing-v3 environment from Gymnasium (the successor to OpenAI Gym). It’s a notoriously tricky environment, but that’s precisely what makes the victory so sweet. This guide is designed to transform a daunting task into a thrilling and rewarding journey. By the end, you’ll have a fully coded AI pilot and the knowledge to tune it for the winner’s circle.
Let’s start our engines!
Why Reinforcement Learning for an AI Car Racing Agent?
The Car Racing environment is simple to look at but complex to master. The agent receives a pixel-based view of the track (a 96x96x3 image) and, by default, must output three continuous actions: steering, acceleration, and braking. (Later in this guide we switch the environment to its discrete action mode, which is what a DQN needs.) Our goal is to cover as many track tiles as possible in the shortest amount of time.
The core difficulties are:
- High-Dimensional State Space: Processing raw pixels is computationally intensive and requires a smart approach to feature extraction.
- Sparse Rewards: The default reward system can be punishing, making it hard for a new agent to figure out what it’s doing right.
- Precise Control: The car is sensitive and can easily spin out of control, demanding nuanced actions from our agent.
Our mission is to build a smart agent that can elegantly overcome these hurdles.
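To make the goal concrete, here is a small sketch of the environment’s default scoring as described in the Gymnasium documentation: -0.1 for every frame and +1000/N for every track tile visited, where N is the number of tiles in the track. The function name episode_return and the tile count of 300 are illustrative assumptions; real tracks are generated randomly and their tile counts vary.

```python
def episode_return(tiles_visited, total_tiles, frames):
    """Default CarRacing score: +1000/N per visited tile, -0.1 per frame."""
    tile_reward = 1000.0 * tiles_visited / total_tiles
    time_penalty = 0.1 * frames
    return tile_reward - time_penalty


# Hypothetical track with 300 tiles, fully covered in 732 frames.
full_lap = episode_return(tiles_visited=300, total_tiles=300, frames=732)

# Same track, but the car only covers half of it and uses every one of 1000 frames.
half_lap = episode_return(tiles_visited=150, total_tiles=300, frames=1000)

print(f"Full lap in 732 frames:  {full_lap:.1f}")
print(f"Half lap in 1000 frames: {half_lap:.1f}")
```

The time penalty is small compared to the tile reward, so an agent first has to learn to stay on the track; speed only starts to matter once it can complete most of a lap.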
Setting Up Your Development Environment for AI Racing
Before we write a single line of our agent’s code, let’s tweak the environment. A common and powerful strategy in Reinforcement Learning is to make the learning landscape a bit more… forgiving. This is often called Reward Shaping.
Instead of directly modifying the library’s source code (which can be risky and break with updates), we’ll create a custom “wrapper”. This wrapper subclasses gym.Wrapper and wraps the original environment, which lets us safely override its behavior.
First, let’s get our dependencies in order.
pip install numpy tensorflow shapely "gymnasium[all]"
Now, let’s define our custom environment wrapper. This class will add a new method, car_on_track, and override the step method to create a “safe zone” around the track. If the car wanders too far, we’ll give it a significant penalty, teaching it to stay on or near the asphalt.
# Save this file as safedriving_wrapper.py so the training script can import it.
import gymnasium as gym
from shapely import affinity
from shapely.geometry import Point, Polygon


class SafeDrivingWrapper(gym.Wrapper):
    """
    A wrapper to create a 'safe zone' around the track and penalize the agent
    for going too far off-road.
    """

    def __init__(self, env, border_width=0.5):
        super().__init__(env)
        self.border_width = border_width

    def car_on_track(self):
        """Returns True if the car is inside any (enlarged) track tile."""
        x, y = self.unwrapped.car.hull.position
        point = Point(x, y)
        for poly in self.unwrapped.road_poly:
            polygon = Polygon(poly[0])
            # Create a larger polygon representing the track + safe border
            if self.border_width > 0:
                border_scale = 1 + self.border_width
                polygon = affinity.scale(polygon, xfact=border_scale, yfact=border_scale)
            if polygon.contains(point):
                return True
        return False

    def step(self, action):
        next_state, reward, terminated, truncated, info = self.env.step(action)
        # Apply a heavy penalty if the car is outside the safe zone
        if not self.car_on_track():
            reward -= 100
            terminated = True  # End the episode immediately
        return next_state, reward, terminated, truncated, info

# We will apply this wrapper later when we instantiate the environment.
By adjusting the border_width parameter, you can control the size of this safe zone. This simple change can speed up learning considerably, because the agent gets clear, immediate feedback the moment it leaves the road.
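To see what border_width actually does to a single tile, here is a dependency-free sketch that mimics the two Shapely calls used in the wrapper: affinity.scale (which by default scales about the center of the geometry’s bounding box) and Polygon.contains. The helpers scale_about_center and contains, and the rectangular tile, are illustrative assumptions and not part of Shapely or Gymnasium.

```python
def scale_about_center(vertices, factor):
    """Scale a polygon about the center of its bounding box."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    return [(cx + (x - cx) * factor, cy + (y - cy) * factor) for x, y in vertices]


def contains(vertices, point):
    """Ray-casting test: is the point strictly inside the polygon?"""
    px, py = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > py) != (y2 > py):
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


# A made-up 2 x 1 road tile and a car slightly off its right edge.
tile = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
car = (2.2, 0.5)

border_width = 0.5
widened = scale_about_center(tile, 1 + border_width)

print("On original tile:", contains(tile, car))
print("On widened tile: ", contains(widened, car))
```

Note that each tile is scaled about its own center, so it grows in every direction, along the track as well as across it. With border_width=0.5 every tile becomes 50% larger in each dimension, which is a fairly generous margin.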
Building the Brain: A Deep Q-Network (DQN) for Car Racing
Our weapon of choice for this challenge is the Deep Q-Network (DQN). In simple terms, a DQN uses a deep neural network to approximate the Q-function, which estimates the value of taking a certain action in a particular state. It’s the cornerstone algorithm that famously learned to play Atari games directly from pixel data.
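The learning signal behind a DQN is the one-step Q-learning target: the reward just received plus the discounted value of the best action in the next state. The sketch below works this out for plain Python lists; the function name td_target and the sample numbers are illustrative assumptions.

```python
GAMMA = 0.99  # Discount factor, same value used later in this guide


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """One-step Q-learning target for a single transition."""
    if done:
        # No future to bootstrap from once the episode has ended
        return reward
    return reward + gamma * max(next_q_values)


# Q-values the network might predict for the next state (made-up numbers)
next_q = [0.5, 2.0, -1.0]

ongoing = td_target(reward=1.0, next_q_values=next_q, done=False)
terminal = td_target(reward=-100.0, next_q_values=next_q, done=True)

print(f"Target while the episode continues: {ongoing:.2f}")
print(f"Target on the final step:           {terminal:.2f}")
```

During training, the network’s prediction for the action that was actually taken is pulled toward this target, while the predictions for all other actions are left alone.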
Here’s our game plan for implementing it.
Environment setup & hyperparameters
First, let’s import the necessary libraries and set up our environment, now using our new SafeDrivingWrapper. We’ll also define the hyperparameters that will control our training process.
import random
from collections import deque
import gymnasium as gym
import numpy as np
import tensorflow as tf
from safedriving_wrapper import SafeDrivingWrapper
from tensorflow.keras import layers
# Environment setup with our custom wrapper
env_raw = gym.make("CarRacing-v3", continuous=False, render_mode="human")
env = SafeDrivingWrapper(env_raw)
# Hyperparameters
STATE_SHAPE = (96, 96, 1) # Grayscale image
ACTION_SIZE = env.action_space.n
MEMORY_SIZE = 10000
BATCH_SIZE = 32
GAMMA = 0.99 # Discount factor
ALPHA = 0.00025 # Learning rate
EPSILON = 1.0 # Exploration rate
EPSILON_DECAY = 0.99999
EPSILON_MIN = 0.01
TARGET_UPDATE_FREQUENCY = 25 # Episodes
WEIGHTS_FILE = "car_racing.weights.h5"
SAVE_WEIGHTS = True
LOAD_EXISTING_WEIGHTS = True
SAVE_WEIGHTS_INTERVAL = 100 # Episodes
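An EPSILON_DECAY of 0.99999 looks almost like 1.0, so it’s worth checking how slowly exploration actually fades. The sketch below assumes the decay is applied once per training update; the helper names steps_until_min and epsilon_after are illustrative.

```python
import math

EPSILON = 1.0
EPSILON_DECAY = 0.99999
EPSILON_MIN = 0.01


def steps_until_min(start=EPSILON, decay=EPSILON_DECAY, floor=EPSILON_MIN):
    """Smallest number of decay steps n with start * decay**n <= floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


def epsilon_after(n, start=EPSILON, decay=EPSILON_DECAY, floor=EPSILON_MIN):
    """Exploration rate after n decay steps, clipped at the floor."""
    return max(floor, start * decay ** n)


print("Updates until epsilon reaches its minimum:", steps_until_min())
for n in (10_000, 100_000, 300_000):
    print(f"epsilon after {n:>7} updates: {epsilon_after(n):.3f}")
```

Under this assumption it takes several hundred thousand updates before the agent is mostly exploiting what it has learned. If your training runs are shorter than that, a slightly smaller decay value is worth trying.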
The DQN agent class
Next, we’ll define the DQNAgent class. This class will contain our neural network architecture (a CNN, perfect for image processing), the logic for choosing actions (the epsilon-greedy policy), and the training mechanism (Experience Replay).
class DQNAgent:
    def __init__(self):
        self.memory = deque(maxlen=MEMORY_SIZE)
        self.epsilon = EPSILON
        self.model = self._build_model()
        self.target_model = self._build_model()
        self.update_target_model()
        self.optimizer = tf.keras.optimizers.Adam(learning_rate=ALPHA)

    def _build_model(self):
        """Builds the CNN model."""
        model = tf.keras.Sequential([
            layers.Input(STATE_SHAPE),
            layers.Conv2D(32, (8, 8), strides=4, activation='relu'),
            layers.Conv2D(64, (4, 4), strides=2, activation='relu'),
            layers.Conv2D(64, (3, 3), strides=1, activation='relu'),
            layers.Flatten(),
            layers.Dense(512, activation='relu'),
            layers.Dense(ACTION_SIZE, activation='linear')  # Q-values
        ])
        return model

    def update_target_model(self):
        """Copies weights from the main model to the target model."""
        self.target_model.set_weights(self.model.get_weights())

    def act(self, state):
        """Chooses an action using an epsilon-greedy policy."""
        if np.random.rand() <= self.epsilon:
            return random.randrange(ACTION_SIZE)
        q_values = self.predict(self.model, tf.expand_dims(state, axis=0))
        return np.argmax(q_values[0].numpy())

    def remember(self, state, action, reward, next_state, done):
        """Stores an experience in the replay buffer."""
        self.memory.append((state, action, reward, next_state, done))

    def replay(self):
        """Trains the model using a random batch of experiences from memory."""
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)
        states, actions, rewards, next_states, dones = zip(*batch)
        states = np.array(states)
        next_states = np.array(next_states)
        # Predict Q-values for current states and next states
        q_values_current = self.predict(self.model, states).numpy()
        q_values_next = self.predict(self.target_model, next_states).numpy()
        # Overwrite only the Q-value of the action that was actually taken
        for i in range(BATCH_SIZE):
            if dones[i]:
                q_values_current[i, actions[i]] = rewards[i]
            else:
                q_values_current[i, actions[i]] = rewards[i] + GAMMA * np.amax(q_values_next[i])
        self.train_step(states, q_values_current)
        # Decay epsilon
        if self.epsilon > EPSILON_MIN:
            self.epsilon *= EPSILON_DECAY

    def save_weights(self):
        if SAVE_WEIGHTS:
            self.model.save_weights(WEIGHTS_FILE)

    def load_weights(self):
        if LOAD_EXISTING_WEIGHTS:
            try:
                self.model.load_weights(WEIGHTS_FILE)
                self.update_target_model()
                self.epsilon = EPSILON_MIN
                print("Pre-trained weights found. Resuming training...")
            except FileNotFoundError:
                print("No pre-trained weights found. Starting from scratch...")

    @tf.function
    def predict(self, model, states):
        return model(states, training=False)

    @tf.function
    def train_step(self, states, targets):
        with tf.GradientTape() as tape:
            q_values = self.model(states, training=True)
            loss = tf.keras.losses.MeanSquaredError()(targets, q_values)
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
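If you plan to experiment with the network, it helps to know how the image shrinks as it passes through the three convolutional layers. Keras Conv2D uses 'valid' padding by default, so each layer’s output size along one axis is floor((size - kernel) / stride) + 1. The sketch below applies that formula to the layers above; the helper name conv_output_size is an illustrative assumption.

```python
def conv_output_size(size, kernel, stride):
    """Output length along one axis for a convolution with 'valid' padding."""
    return (size - kernel) // stride + 1


# (kernel, stride, filters) for the three Conv2D layers in _build_model
conv_layers = [(8, 4, 32), (4, 2, 64), (3, 1, 64)]

size = 96  # Input images are 96 x 96
sizes = []
for kernel, stride, filters in conv_layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)
    print(f"Conv {kernel}x{kernel}, stride {stride}: {size} x {size} x {filters}")

flattened = size * size * conv_layers[-1][2]
dense_params = flattened * 512 + 512  # Weights plus biases of Dense(512)

print("Flattened features:", flattened)
print("Parameters in Dense(512):", dense_params)
```

Most of the network’s parameters sit in the first dense layer, so changing the strides or the input resolution has a much bigger effect on model size than adding a few more filters.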
The Training Loop: How Your AI Actually Learns to Drive
This is where the magic happens. We’ll set up a loop that runs for a set number of episodes. In each episode, the agent will interact with the environment, store its experiences, and learn from them. We’ll also add another layer of reward shaping here—giving small bonuses for speed and penalties for being too slow—to encourage more aggressive driving.
def preprocess_state(state):
    """Converts the image to grayscale and normalizes it to [0, 1]."""
    state_tensor = tf.convert_to_tensor(state)
    state_tensor = tf.image.rgb_to_grayscale(state_tensor)
    return tf.cast(state_tensor, tf.float32) / 255.0


def train_agent(episodes=1000):
    agent = DQNAgent()
    agent.load_weights()
    for episode in range(episodes):
        state, _ = env.reset()
        state = preprocess_state(state)
        total_reward = 0
        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # The episode ends when the car leaves the safe zone or the time limit is hit
            done = terminated or truncated
            next_state = preprocess_state(next_state)
            # Custom Reward Shaping
            car_on_track = env.car_on_track()
            if car_on_track:
                speed = np.linalg.norm(env.unwrapped.car.hull.linearVelocity)
                # Reward for high speed, penalize for low speed
                speed_bonus = max(0, speed * 0.1)
                low_speed_penalty = -1 if speed < 1.0 else 0
                reward += speed_bonus + low_speed_penalty
            else:
                # This penalty is now handled by the wrapper, but we can add more
                reward -= 2
            agent.remember(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
            # Learn from a random batch after every step
            agent.replay()
        if episode % TARGET_UPDATE_FREQUENCY == 0:
            agent.update_target_model()
        if episode > 0 and episode % SAVE_WEIGHTS_INTERVAL == 0:
            agent.save_weights()
        print(f"Episode {episode+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.4f}")


# Let's train!
train_agent()
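Since the reward shaping is the part you’ll tweak most often, one option is to pull it out of the loop into a small pure function that can be tested without starting the simulator. The sketch below mirrors the speed bonus and penalties used in the training loop; the function name shaped_reward is an illustrative assumption.

```python
def shaped_reward(reward, speed, on_track):
    """Applies the speed bonus and penalties from the training loop."""
    if on_track:
        speed_bonus = max(0.0, speed * 0.1)
        low_speed_penalty = -1.0 if speed < 1.0 else 0.0
        return reward + speed_bonus + low_speed_penalty
    # Off the track: extra penalty on top of whatever the wrapper applied
    return reward - 2.0


# A few hand-picked situations (the base rewards are made-up examples)
fast = shaped_reward(reward=-0.1, speed=20.0, on_track=True)
crawling = shaped_reward(reward=-0.1, speed=0.5, on_track=True)
off_road = shaped_reward(reward=-100.1, speed=5.0, on_track=False)

print(f"Fast on track:     {fast:.2f}")
print(f"Crawling on track: {crawling:.2f}")
print(f"Off the road:      {off_road:.2f}")
```

Printing a handful of cases like this makes it easier to spot when one term dominates. Here, for instance, the speed bonus at a speed of 20 is far larger than the environment’s own per-frame penalty.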
Visualizing Success: Watch Your Trained AI Agent in Action!
While the code above can achieve decent results in around 500-1000 episodes, true mastery comes from fine-tuning. Once you have a baseline agent that can navigate the track, it’s time to become the pit crew chief. This is what separates a Sunday driver from a world champion.
Here are some pro-tips for optimization:
- Patience is a Virtue: Training a good agent takes time. Don’t be afraid to let your model train for thousands of episodes. The best results often come after a long night of computation.
- Hyperparameter Tuning: Experiment with ALPHA (learning rate), GAMMA (discount factor), and MEMORY_SIZE. These values have a massive impact on performance.
- Reward Shaping: Go back and refine your custom reward function. Can you add a bonus for overtaking? A penalty for sharp, jerky movements? The logic inside the training loop is your playground.
- Network Architecture: Is your CNN deep enough? Could adding more layers or filters help it recognize more complex patterns? Maybe a different activation function?
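As one example of the reward-shaping idea from the list above, here is a minimal sketch of a penalty for jerky driving. It assumes discrete actions and simply charges a small cost whenever the agent switches to a different action than on the previous step. The function name smoothness_penalty, the weight of 0.05, and the sample action sequence are all illustrative assumptions that you would need to tune.

```python
def smoothness_penalty(previous_action, action, weight=0.05):
    """Small negative reward whenever the agent changes its action."""
    if previous_action is None or previous_action == action:
        return 0.0
    return -weight


# A made-up sequence of discrete action indices from one short stretch of driving
actions = [3, 3, 1, 2, 2]

total_penalty = 0.0
previous = None
for action in actions:
    total_penalty += smoothness_penalty(previous, action)
    previous = action

print(f"Total smoothness penalty: {total_penalty:.2f}")
```

Keep a term like this small relative to the tile reward. If it’s too large, the agent may learn that never changing its action is the safest strategy.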
Next Steps: How to Improve Your AI Racing Agent
You now have the complete, robust roadmap to building a sophisticated AI that can conquer the Car Racing challenge. This project isn’t just about the final result; it’s about the journey of building, testing, and iterating. It’s a hands-on lesson in the power and intricacies of Reinforcement Learning.
So, grab your favorite IDE, fire up your machine, and start coding. I encourage you to take this foundation, experiment wildly, and push the limits of what’s possible.
The complete, ready-to-run code for this project is available on my GitHub repository.
https://github.com/felsangom/GymnasiumAI
Share your results! Drop a link to your project or share your best score in the comments below. Let’s see who can build the fastest AI driver!
A Portuguese version of this guide is available here: Car Racing com Inteligência Artificial: Um guia completo para iniciantes (e não tão iniciantes) 🚗💨