Previously, I reviewed core concepts in Reinforcement Learning (RL) and introduced important parts of the OpenAI Gym API. You can review that introduction here:
Reinforcement learning notebooks for posting on the internet.
As these notebooks are instructional and experimental, I don't recommend running them locally. I test and run them on Google Colab or Kaggle before posting -- you can run them there too!
Getting the full range of OpenAI Gym environments installed is somewhat painful. Many online notebook services like Colab and Kaggle don't allow you to install some of the OpenAI environments, so I'm going to stick to Atari for now. If you're interested in trying to set up OpenAI Gym with more flexibility, you might start with this interesting write-up.
In order to write agents that actually take the game screen into account when making decisions, we'll need to update our run_job utility from last time:
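The original helper isn't reproduced here, but here is a minimal sketch of what the updated run_job might look like. The signature, the step count, and the exact structure of the history entries are my own assumptions; the real utility also records video and the model's raw predictions.

import gym


def run_job(model, env_name='Robotank-v0', steps=1000):
    """Run `model` in `env_name` and record what happened at every step.

    Sketch only: the real utility also captures video and stores the
    model's predictions alongside each action.
    """
    env = gym.make(env_name)
    observation = env.reset()
    history = []

    for _ in range(steps):
        # The agent now sees both the raw game screen and the env itself
        action = model.decision_function(obs=observation, env=env)
        next_observation, reward, done, info = env.step(action)

        history.append({
            'observation': observation,  # the RGB game screen
            'action': action,
            'reward': reward,
            'info': info,
        })

        observation = env.reset() if done else next_observation

    return {'env': env, 'history': history}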
And our RandomAgent will need to be modified accordingly:
class RandomAgentContainer:
    """A model that takes random actions and doesn't learn"""

    def __init__(self):
        pass

    def decision_function(self, obs=None, env=None):
        if env:
            return env.action_space.sample()
        else:
            return 0

    def train_on_history(self, history):
        pass


model = RandomAgentContainer()
Now we can use our RandomAgent to explore all the information that our job creates.
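For example (the environment id and step count here are just illustrative):

result = run_job(model, 'Robotank-v0', steps=1000)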
Our job produced video of the game being played, as well as a history of images, actions, predictions, and rewards. It also saved the environment object from OpenAI Gym.
Let's begin by trying to understand our observations in a video:
render_video(0, result['env']);
If, like me, you have never played Robot Tank on an Atari... you can read the manual! You can learn about the heads-up display and more.
So anyway, it looks like the image has a bunch of noise in it. Let's see if we can extract the useful parts...
So using the ticks on the axes and with a little trial and error we can find the bounding boxes for the radar panel and the periscope.
imshow(observation_sample[0][139:172, 60:100, :])
So, we can certainly crop this image and worry less about the noise...
radar_bounding_box = ((139, 172), (60, 100), (None))
From reading the manual we know that one of the four indicators bracketing this radar display is "R" for "radar." In other words, we can't rely on radar as the only input, because all of those indicators represent subsystems that can be disabled.
Let's also take a bounding box for the periscope:
imshow(observation_sample[0][80:124, 12:160, :])
peri_bounding_box = ((80, 124), (12, 160), (None))
Let's also check the info field because it sometimes has observation data.
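Assuming run_job stashed the info dict returned by env.step under an 'info' key (as in the sketch above), a quick peek looks like:

result['history'][0].get('info')  # typically just ALE bookkeeping, e.g. a lives counter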
I'm going to intentionally ignore the V, C, R, T boxes; we can always reintroduce them later if we think a performance gain is in the offing. You saw in the manual how they work, so I don't think it's a cause for concern...
Hmm... not helpful at all. But that's what you'd naturally think to do... It turns out that extracting action meanings has its own method in the Gym API.
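On Atari environments the underlying ALE exposes the labels through get_action_meanings() on the unwrapped env, something like:

result['env'].unwrapped.get_action_meanings()
# ['NOOP', 'FIRE', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'UPRIGHT', 'UPLEFT',
#  'DOWNRIGHT', 'DOWNLEFT', 'UPFIRE', 'RIGHTFIRE', 'LEFTFIRE', 'DOWNFIRE',
#  'UPRIGHTFIRE', 'UPLEFTFIRE', 'DOWNRIGHTFIRE', 'DOWNLEFTFIRE']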
So "NOOP" presumably means "no-op" i.e. "do nothing." The rest apparently constitute all the permutations of actions available to the client. This is what we would expect the action space to be. These are also discrete actions so we can code our model to take exactly one action per step.
Finally, let's visualize the rewards:
set([r['reward'] for r in result['history']])  # Unique rewards across history
Looks like the reward function is simply "1 for scoring a hit, 0 otherwise." We can confirm by visualizing the observations at reward time.
# Observations when reward was given
reward_incidents = list(filter(lambda i: i['reward'], result['history']))

i = 0
imshow(reward_incidents[i]['observation'])

i = 1
imshow(reward_incidents[i]['observation'])

i = 2
imshow(reward_incidents[i]['observation'])
They're all images that seem to be captured right after the tank scores a hit.
Writing the tank commander
The Brain
The TankCommander agent needs to learn how to decide which action to take. So, we first need to give it a mechanism for learning. In this case, we're going to use a special kind of graph. In this graph there are three kinds of nodes:
Input nodes
Which take our inputs and send signals to the nodes that they are connected to
Regular nodes
These nodes can have many connections from other nodes (including input nodes.) Some connections are strong, and some connections are weak. This node uses the signals and the signal strength from all the connections to decide what signal to send along all of its own outgoing connections.
Output nodes
These nodes are only different from regular nodes in that we read their signal.
The nodes are organized into "layers" that can share many connections and have a function in common for how they decide to aggregate and send signals.
Now, I've simplified a lot, but the graph that we're talking about, if properly organized, is a deep learning neural net. To create one we can use tensorflow like so:
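The exact architecture isn't reproduced here; as a rough sketch (the input shape comes from the periscope crop above, while the layer sizes and counts are my own guesses), a small convolutional network in tf.keras could look like this:

from tensorflow.keras import layers, models

periscope_shape = (44, 148, 3)   # rows 80:124, cols 12:160, RGB
n_actions = 18                   # one output per discrete action

model = models.Sequential([
    # Input nodes: the pixels of the cropped periscope view
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=periscope_shape),
    layers.MaxPooling2D((2, 2)),
    # "Regular" nodes: hidden layers that learn which connections matter
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    # Output nodes: one signal per action; the strongest one wins
    layers.Dense(n_actions, activation='linear'),
])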
Finally, we compile the model, giving it some special parameters. To teach our graph to drive a tank, we need to call the fit method, e.g. model.fit(data, correct_prediction). Each time this happens, our model goes back and updates the strength (aka the 'weight') of the regular connections. The fancy name for this is "backpropagation." Anyway, the parameters of model.compile help determine how backpropagation is executed.
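As a sketch (the optimizer, loss, and nudge size here are illustrative choices, not necessarily the original ones), compiling the model and then teaching it one "lesson" looks like this:

import numpy as np

model.compile(optimizer='adam', loss='mean_squared_error')

# One training step: show the net a frame plus a slightly-corrected prediction
frame = np.zeros((1, 44, 148, 3), dtype=np.float32)  # stand-in for a periscope crop
target = model.predict(frame)                        # the net's current guess...
target[0][3] += 0.1                                  # ...nudged at one action's index
model.fit(frame, target, verbose=0)                  # backpropagation happens here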
The Experiences
Now that we have a brain, the brain needs experiences to train on. As we know from the manual and our little exploration of the data, each episode maps to one full game of Robot Tank. However, in each game the player gets several tanks, and if a tank is destroyed, the player simply "respawns" in a new tank. So the episode is actually not the smallest unit of experience; rather, the smallest unit is the in-game "life" of a single tank.
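One way to carve a full episode into per-tank lives (assuming each recorded step carries ALE's 'ale.lives' counter in its info dict, as in the run_job sketch above; split_into_lives is a hypothetical helper) is to start a new slice whenever the lives count drops:

def split_into_lives(history):
    """Split a full-episode history into one list of steps per tank."""
    lives, current, prev_lives = [], [], None
    for step in history:
        current.append(step)
        n = step['info'].get('ale.lives')
        if prev_lives is not None and n is not None and n < prev_lives:
            # This tank was destroyed -- close out its life and start another
            lives.append(current)
            current = []
        prev_lives = n
    if current:
        lives.append(current)
    return lives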
For each step in a tank's life, the training recipe is:
Take our original action, prediction, and periscope image as data.
Nudge our prediction only at the action index, since we can only learn from the actions we have taken.
Call model.fit(image, prediction_with_nudge).
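Put together, one sketch of that recipe (assuming, as the snippet further below suggests, that each step stores an (action, prediction) pair under its 'action' key; train_on_life and the size of displacement are my own stand-ins) might look like this:

import numpy as np

def train_on_life(model, game_life, displacement=0.1):
    """Apply the three steps above to every recorded step of one tank's life."""
    for obs in game_life:
        # What we did, and what the network predicted at the time
        action, prediction = obs['action']
        # Crop the periscope view and add a batch dimension
        image = obs['observation'][80:124, 12:160, :][np.newaxis, ...]
        # Copy the prediction, then nudge it only at the index of the action we took
        update = np.array(prediction, dtype=np.float32)
        update[0][action] += displacement
        model.fit(image, update, verbose=0)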
There's a catch, though: if the commander only ever reinforces the actions it already favors, it can get stuck. To visualize the problem, imagine you are tasked with riding a bike down a mountain blindfolded. As you miraculously ride down the mountain without killing yourself, you may arrive at a point where you seem to have reached the bottom: in any direction you try to go, you must pedal uphill. You take your blindfold off only to realize that you've barely gone a mile, and that you still have far to go before you reach the base of the mountain. The fancy term for this is a "local minimum."
To address this we can just force the commander to randomly take actions sometimes:
for obs in game_life:
    action, prediction = obs.get('action')
    if self.epsilon and (random.uniform(0, 1.0) < self.epsilon):
        action = random.randrange(18)
    # Copy
    update = list(prediction)
    # Update only the target action
    update[0][action] = update[0][action] + displacement
With only 120k steps, our tank already seemed to have settled on a strategy, as shown in this long, long GIF.
As you can see, turning left is powerful in Robotank -- a whole squadron killed!
...but then it dies. I think this is pretty strong for a stumpy model trained only on the periscope viewport. Time permitting, I may continue to tinker with this one -- increasing the epsilon value, tinkering with the graph parameters, and adding views could all help nudge the tank commander towards a more nuanced strategy.
Anyway, you've learned a bit about implementing DL with RL in Python. You learned:
DL basic concepts
Exploratory data analysis for RL
Selectively applying rewards to specific actions, and finding the smallest divisible unit of experience
Introducing random actions to help explore the "action space"