AWS DeepRacer is the fastest way to get rolling with machine learning, literally. Get hands-on with a fully autonomous 1/18th-scale race car driven by reinforcement learning, a 3D racing simulator, and a global racing league.
AWS DeepRacer is a 1/18th-scale race car that gives you an interesting and fun way to get started with reinforcement learning (RL). RL is an advanced machine learning (ML) technique that takes a very different approach to training models than other ML methods. Its superpower is that it learns very complex behaviors without requiring any labeled training data, and it can make short-term decisions while optimizing for a longer-term goal.
With AWS DeepRacer, you now have a way to get hands-on with RL, experiment, and learn through autonomous driving. You can get started with the virtual car and tracks in the cloud-based 3D racing simulator, and for a real-world experience, you can deploy your trained models onto AWS DeepRacer and race your friends, or take part in the global AWS DeepRacer League. Developers, the race is on.
The AWS DeepRacer console is now available: log in to the AWS console and search for DeepRacer.
Where does the reward come from?
Reinforcement learning algorithms are geared toward optimizing the cumulative reward.
The model will learn which action (and then subsequent actions) will result in the highest cumulative reward on the way to the goal.
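To make the idea of cumulative reward concrete, here is a toy comparison; the per-step reward values are invented for illustration and are not DeepRacer output:

```python
# Two hypothetical trajectories with an invented reward at each step.
# A reinforcement learning agent prefers whichever trajectory yields the
# higher cumulative reward, even if its individual steps pay less early on.
greedy_path  = [1.0, 1.0, 0.0, 0.0]   # large early rewards, then nothing
patient_path = [0.5, 0.5, 1.0, 1.0]   # smaller at first, larger later

print(sum(greedy_path))   # 2.0
print(sum(patient_path))  # 3.0 -> the higher cumulative reward wins
```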
The critical part of making your reinforcement learning model work is the reward function. In general, you design your reward function to act like an incentive plan.
Reward function parameters for AWS DeepRacer
In AWS DeepRacer, the reward function is a Python function that is given parameters describing the current state and returns a numeric reward value.
The parameters passed to the reward function describe various aspects of the state of the vehicle, such as its position and orientation on the track, its observed speed, steering angle and more.
We will explore some of these parameters and how they describe the vehicle as it drives around the track:
Link to Documentation: https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-console-train-evaluate-models.html
Cloning a previously trained model as the starting point of a new round of training can improve training efficiency. To do this, modify the hyperparameters so the new round makes use of the knowledge already learned.
| Hyperparameters | Description |
|---|---|
| Gradient descent batch size | The number of recent vehicle experiences sampled at random from an experience buffer and used for updating the underlying deep-learning neural network weights. Random sampling helps reduce correlations inherent in the input data. Use a larger batch size to promote more stable and smooth updates to the neural network weights, but be aware that training may take longer. |
| Number of epochs | The number of passes through the training data to update the neural network weights during gradient descent. The training data corresponds to random samples from the experience buffer. Use a larger number of epochs to promote more stable updates, but expect slower training. When the batch size is small, you can use a smaller number of epochs. |
| Learning rate | During each update, a portion of the new weight can be from the gradient-descent (or ascent) contribution and the rest from the existing weight value. The learning rate controls how much a gradient-descent (or ascent) update contributes to the network weights. Use a higher learning rate to include more gradient-descent contributions for faster training, but be aware of the possibility that the expected reward may not converge if the learning rate is too large. |
| Entropy | A degree of uncertainty used to determine when to add randomness to the policy distribution. The added uncertainty helps the AWS DeepRacer vehicle explore the action space more broadly. A larger entropy value encourages the vehicle to explore the action space more thoroughly. |
| Discount factor | A factor that specifies how much future rewards contribute to the expected reward. The larger the discount factor, the farther out the vehicle looks when choosing a move, and the slower the training. With a discount factor of 0.9, the vehicle includes rewards from on the order of 10 future steps when making a move. With a discount factor of 0.999, it considers rewards from on the order of 1,000 future steps. The recommended discount factor values are 0.99, 0.999, and 0.9999. |
| Loss type | Type of the objective function used to update the network weights. A good training algorithm should make incremental changes to the agent's strategy so that it gradually transitions from taking random actions to taking strategic actions to increase reward. But if it makes too big a change then the training becomes unstable and the agent ends up not learning. The Huber loss and Mean squared error loss types behave similarly for small updates. But as the updates become larger, Huber loss takes smaller increments compared to Mean squared error loss. When you have convergence problems, use the Huber loss type. When convergence is good and you want to train faster, use the Mean squared error loss type. |
| Number of experience episodes between each policy-updating iteration | The size of the experience buffer used to draw training data from for learning policy network weights. An experience episode is a period in which the agent starts from a given starting point and ends up completing the track or going off the track. It consists of a sequence of experiences. Different episodes can have different lengths. For simple reinforcement-learning problems, a small experience buffer may be sufficient and learning is fast. For more complex problems that have more local maxima, a larger experience buffer is necessary to provide more uncorrelated data points. In this case, training is slower but more stable. The recommended values are 10, 20 and 40. |
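The discount-factor rows in the table can be sanity-checked numerically: with a constant reward of 1 per step, the discounted sum converges to 1/(1 − gamma), which is why 0.9 corresponds to roughly 10 future steps and 0.999 to roughly 1,000. A minimal sketch:

```python
# Effective planning horizon implied by a discount factor:
# with a constant reward of 1 per step, the discounted sum
# 1 + gamma + gamma^2 + ... converges to 1 / (1 - gamma).
def effective_horizon(gamma):
    return 1.0 / (1.0 - gamma)

for gamma in (0.9, 0.99, 0.999, 0.9999):
    print(gamma, round(effective_horizon(gamma)))
```

This is only a rule of thumb for how far ahead rewards meaningfully influence the agent; the training itself still sums rewards step by step.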
In total, there are 13 parameters you can use in your reward function.
For more information on these parameters and the values they can take, read the detailed documentation.
This example, created by Nick Sefiddashti, shows how to add a reward based on speed and distance from the center of the track.
```python
def reward_function(on_track, x, y, distance_from_center, car_orientation, progress,
                    steps, throttle, steering, track_width, waypoints, closest_waypoint):
    if on_track:
        # Reward scales with throttle and progress; the denominator shrinks as the
        # car drifts from the center line and reaches zero at the track edge, so
        # guard against division by zero if you adapt this formula.
        reward = (1 / ((0.5 * track_width) - abs(distance_from_center))) * throttle * progress
    else:
        reward = 0
    return float(reward)
```
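To see how this reward scales, the core expression can be pulled out into a small standalone helper; the function name and input values below are hypothetical, chosen only for illustration:

```python
# The core term of the reward above: the denominator shrinks as the car
# drifts from the center line, so the value grows toward the track edge,
# scaled by throttle and progress.
def edge_scaled_reward(distance_from_center, throttle, progress, track_width):
    return (1 / ((0.5 * track_width) - abs(distance_from_center))) * throttle * progress

# Hypothetical values: 0.6 m wide track, full throttle, 50% progress
print(edge_scaled_reward(0.0, 1.0, 0.5, 0.6))   # at the center line
print(edge_scaled_reward(0.25, 1.0, 0.5, 0.6))  # near the edge -> larger value
```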
This example, created by AWS and enhanced by Nick Sefiddashti, shows how to add a reward for steering around the track.
```python
def reward_function(on_track, x, y, distance_from_center, car_orientation, progress,
                    steps, throttle, steering, track_width, waypoints, closest_waypoint):
    '''
    @on_track (boolean) :: The vehicle is off-track if the front of the vehicle
        is outside of the white lines
    @x (float range: [0, 1]) :: Fraction of where the car is along the x-axis.
        1 indicates max 'x' value in the coordinate system.
    @y (float range: [0, 1]) :: Fraction of where the car is along the y-axis.
        1 indicates max 'y' value in the coordinate system.
    @distance_from_center (float [0, track_width/2]) :: Displacement from the
        center line of the track as defined by way points
    @car_orientation (float: [-3.14, 3.14]) :: yaw of the car with respect to
        the car's x-axis in radians
    @progress (float: [0, 1]) :: % of track complete
    @steps (int) :: number of steps completed
    @throttle (float) :: 0 to 1 (0 indicates stop, 1 max throttle)
    @steering (float) :: -1 to 1 (-1 is right, 1 is left)
    @track_width (float) :: width of the track (> 0)
    @waypoints (ordered list) :: list of waypoints in order; each waypoint is a
        set of coordinates (x, y, yaw) that define a turning point
    @closest_waypoint (int) :: index of the closest waypoint (0-indexed) given
        the car's x, y position as measured by the euclidean distance
    @@output: @reward (float [-1e5, 1e5])
    '''
    # Example center-line-following reward function
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    reward = 1e-3
    if 0.0 <= distance_from_center <= marker_1:
        if throttle > 0.75:
            reward = 1 if abs(steering) < 0.25 else 0.9
        else:
            reward = 0.8 if abs(steering) < 0.25 else 0.7
    elif distance_from_center <= marker_2:
        if throttle > 0.75:
            reward = 0.6 if abs(steering) < 0.25 else 0.5
        else:
            reward = 0.4 if abs(steering) < 0.25 else 0.3
    elif distance_from_center <= marker_3:
        if throttle > 0.75:
            reward = 0.2 if abs(steering) < 0.25 else 0.1
        else:
            reward = 0.05 if abs(steering) < 0.25 else 0
    else:
        reward = 1e-3  # likely crashed / close to off track
    return float(reward)
```
Stay On Track
In this example, we give a high reward when the car stays on the track and penalize the car if it deviates from the track boundaries. This example uses the all_wheels_on_track, distance_from_center, and track_width parameters to determine whether the car is on the track and gives a high reward if so. Since this function doesn't reward any specific kind of behavior besides staying on the track, an agent trained with this function may take longer to converge to any particular behavior.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to stay inside the two borders of the track
    '''
    # Read input parameters
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']

    # Give a very low reward by default
    reward = 1e-3

    # Give a high reward if no wheels go off the track and
    # the agent is somewhere in between the track borders
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 1.0

    # Always return a float value
    return float(reward)
```
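A quick, self-contained sanity check of this behavior; the function is restated so the snippet runs on its own, and the parameter values are hypothetical:

```python
# Stay-on-track reward, restated for a standalone check.
def reward_function(params):
    all_wheels_on_track = params['all_wheels_on_track']
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    reward = 1e-3
    if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
        reward = 1.0
    return float(reward)

# Hypothetical states: well inside a 0.6 m wide track vs. off the track
on_track = {'all_wheels_on_track': True, 'distance_from_center': 0.1, 'track_width': 0.6}
off_track = {'all_wheels_on_track': False, 'distance_from_center': 0.35, 'track_width': 0.6}
print(reward_function(on_track))   # 1.0
print(reward_function(off_track))  # 0.001
```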
Follow Center Line
In this example, we measure how far the car is from the center of the track and give a higher reward if the car is close to the center line. This example uses the track_width and distance_from_center parameters, and returns a decreasing reward the farther the car is from the center of the track. This example is more specific about what kind of driving behavior to reward, so an agent trained with this function is likely to learn to follow the track very well. However, it is unlikely to learn any other behavior, such as accelerating or braking for corners.
```python
def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track
    return float(reward)
```
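The banded fall-off can be checked in isolation; the helper below restates the marker logic with a hypothetical 0.6 m track width:

```python
# Center-line reward bands, restated as a standalone helper for inspection.
def centerline_reward(distance_from_center, track_width):
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width
    if distance_from_center <= marker_1:
        return 1.0
    elif distance_from_center <= marker_2:
        return 0.5
    elif distance_from_center <= marker_3:
        return 0.1
    return 1e-3

# For a 0.6 m wide track the reward drops in bands away from the center line
for d in (0.0, 0.1, 0.2, 0.4):
    print(d, centerline_reward(d, 0.6))
```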
No incentive
An alternative strategy is to give a constant reward on each step, regardless of how the car is driving.
This example doesn't use any of the input parameters — instead it returns a constant reward of 1.0 on each step.
The agent's only incentive is to successfully finish the track, and it has no incentive to drive faster or follow any particular path. It may behave erratically.
However, since the reward function doesn't constrain the agent's behavior, it may be able to explore unexpected strategies and behaviors that turn out to perform well.
```python
def reward_function(params):
    '''
    Example of no incentive
    '''
    # Always return 1 if the car does not crash
    return 1.0
```