$ \def\RR{\mathbb{R}} \def\tr{\mathrm{tr}} \def\th{\mathrm{th}} \def\EE{\mathbb{E}} \def\coloneqq{\colon=} \def\grad{\nabla} $


I’ve been playing around with deep reinforcement learning for a little while, but have always found it hard to get state-of-the-art algorithms working. This is partly because getting any algorithm to work requires some good choices of hyperparameters, and I have to do all of these experiments on my Macbook.

In this tutorial I will go over how to implement the asynchronous advantage actor-critic algorithm (or A3C for short). The original paper can be found here but I hope that I can contribute by making everything a little easier to understand. The paper describes 4 algorithms: one step Q-learning, $n$-step Q-learning, one step SARSA and A3C. I also implemented one step Q-learning and got this to work on Space Invaders, but the reason I focus on A3C is because it is the best performing algorithm from the paper.

The exciting thing about the paper, at least for me, is that you don’t need to rely on a GPU for speed. In fact, the whole idea is to use multiple cores of a CPU, run in parallel, which gives a speedup proportional to the number of cores used. Since I normally run things on my laptop, which only has one CPU (albeit with two cores), I don’t actually bother implementing things properly in parallel, and instead use threads. Consequently, my implementation won’t actually give a speedup in this way, but we’ll see that it still manages to work on a laptop, just a little more slowly.

I’ll use tensorflow to make things a little easier, as we’ll need to work with convolutional neural networks, but in principle everything could be implemented using just numpy if you were willing to write the backprop code yourself.

My aim is to make the algorithm easy to understand, and also make it so that there are as few choices of hyperparameters and as little deep learning magic as possible.

The A3C algorithm

As with a lot of recent progress in deep reinforcement learning, the innovations in the paper weren’t really dramatically new algorithms, but rather ways of making relatively well-known algorithms work well with a deep neural network. As I will soon explain in more detail, the A3C algorithm can essentially be described as policy gradients with a function approximator, where the function approximator is a deep neural network, and the authors use a clever method to try to ensure that the agent explores the state space well.

In the paper there are actually two versions of the A3C algorithm: one just uses a feedforward convolutional neural network, while the other includes a recurrent layer. I’ll focus on the first one, in order to simplify everything as much as possible. I also only focus on the discrete action case here. Perhaps I will write a follow up to this including the recurrent layer, as well as the extension to continuous action spaces.

Brief reminder of reinforcement learning

I gave an introduction to reinforcement learning and the policy gradient method in my first post on reinforcement learning, so it might be worth reading that first, but I will briefly summarise what we need here anyway.

First we need to discuss actions and states. Actions are the things that the agent can do at a given time, and the state is the agent’s representation of everything that it knows in the game. The agent has to decide what to do using only its current state as information – since at each time step, that’s all it knows.

For example, in Space Invaders, the possible actions $a$ might be: 0 for moving left, 1 for moving right and 2 for shooting. If we are playing ATARI games from pixel values, then the input $x$ just looks like a screenshot of the game. This can be represented as a 3D array by considering each of the red, green and blue colour channels as a 2D image. Since the colours don’t seem to matter, a standard preprocessing trick is to take the average of the three colour channels, thus getting back to a 2D image. However, a single game screen gives insufficient information to the agent, since it doesn’t show which direction objects are moving in. A common solution is to show the agent several frames at a time: the current frame, along with the previous 3 frames, say.
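As a concrete sketch of this preprocessing (with made-up pixel values), averaging the colour channels and stacking four frames might look like:

```python
import numpy as np

# A made-up 210 x 160 RGB screen with random pixel values.
frame = np.random.randint(0, 256, size=(210, 160, 3)).astype('float32')

# Average the three colour channels to get a 2D grayscale image.
gray = frame.mean(axis=2)

# Stack the current frame with the previous three (faked here) so the
# agent can see motion. The state is then 210 x 160 x 4.
frames = [gray, gray, gray, gray]
state = np.stack(frames, axis=2)

assert gray.shape == (210, 160)
assert state.shape == (210, 160, 4)
```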

Now, an ATARI game screen is $210 \times 160$ pixels, so we would have $210 \times 160 \times 4 = 134,400$ entries in the array representing one screen, which seems like a lot of information in order to decide where to go. Andrej Karpathy has a great blog post about playing Pong from pixels, and manages to get the most basic policy gradient algorithm to work on this, using just a feedforward neural network. He uses a bit of preprocessing to make it easier though – for example, he makes sure that only the paddles and ball have nonzero values in the image (by zeroing out the background), and also uses a difference of frames (current frame minus previous frame) instead of showing the agent several frames at a time. We will use a convolutional neural network, which will hopefully make it easier to do a little less preprocessing. In Space Invaders, you would expect to have to work harder to get the agent to identify the aliens and bullets.

Overview of policy gradients

The idea of a policy is to parametrise the conditional probability of performing a given action $a$ given some state $x$. That is, given a state $x$, we want to compute, for each possible action $a$, the probability $P(a \mid x; \theta)$ of performing action $a$ given that we are in state $x$. We write $\theta$ inside the probability to denote that the probability is determined by the parameter $\theta$. In the following $\theta$ will be the weights of our deep neural network, so will actually be a million numbers or so.

Note that $P(a \mid x; \theta)$ is actually quite concrete. It just means that while the agent is playing the game, or training, whenever we see state $x$ we will choose the action by first computing $P(a \mid x; \theta)$ for each action $a$ and then sampling the action from this probability distribution. As it happens, if you really have a Markov Decision Process (i.e. the future is independent of the past, given the current state), then if an optimal policy exists, there is a deterministic optimal policy. However, using a stochastic policy (i.e. choosing actions with some randomness) is helpful. For one, it allows the agent to explore the state space without always taking the same trajectory, which is important for the exploration-exploitation tradeoff. In Space Invaders, for example, it turns out that if you choose to shoot at each timestep then you get a reward of something like 180 every time. If you weren’t forced to explore other actions, you may well think this was good enough and exploit this (over, say, always moving left, which would give you zero reward).
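Concretely, sampling an action from the policy’s output distribution might look like this (the probabilities here are made up):

```python
import numpy as np

np.random.seed(0)

# Made-up policy output for one state: P(left), P(right), P(shoot).
policy = np.array([0.2, 0.3, 0.5])

# Choose an action by sampling from this distribution.
action = np.random.choice(len(policy), p=policy)
assert action in (0, 1, 2)

# Over many samples the empirical frequencies approach the policy.
samples = np.random.choice(len(policy), size=100000, p=policy)
freqs = np.bincount(samples, minlength=3) / float(len(samples))
assert abs(freqs[2] - 0.5) < 0.01
```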

To fit with standard notation, we define $\pi(a \mid x; \theta) = P(a \mid x; \theta)$. We are now looking to find a good set of parameters $\theta$, such that, if we follow the policy $\pi(a \mid x; \theta)$, then we have a high expected reward.

Denote by $\tau$ a trajectory of the MDP. That is, $\tau$ is a sequence $x_0, a_0, r_1, x_1, a_1, r_2, \ldots, x_{T-1}, a_{T-1}, r_T$, where $r_{t+1}$ is the reward for having been in state $x_t$, taken action $a_t$, and being transitioned to state $x_{t+1}$. In the episodic setting, we have terminal states and we assume that the sequence terminates after some finite time $T$ (e.g. when the agent dies or completes the game). We let $R_\tau = r_1 + \cdots + r_T$, or sometimes we use a discount factor $0 < \gamma \le 1$ and let $R_\tau = r_1 + \gamma r_2 + \cdots + \gamma^{T-1} r_T$.

The policy gradient method aims to maximise $\EE_\tau[R_\tau \mid \pi; \theta]$. In words, this is the expected reward of a trajectory when the agent follows policy $\pi$. The method tries to maximise this by gradient ascent with respect to the parameters $\theta$. Thus we compute $\grad_\theta \EE_\tau[R_\tau \mid \pi; \theta]$ and then iteratively improve $\theta$ by $\theta \mapsto \theta + \alpha \cdot \grad_\theta \EE_\tau [R_\tau \mid \pi; \theta]$.

We saw last time that we can compute $\grad_\theta \EE_\tau[R_\tau \mid \pi; \theta]$ as the expected value of

$$\hat{g}(\tau) = R_\tau \sum_{t=0}^{T-1} \grad_\theta \log \pi(a_t \mid x_t; \theta).$$

So we fix an initial parameter $\theta$ and then iterate the following: sample a trajectory $\tau$ from the environment using $\pi(a \mid x; \theta)$; compute $\hat{g}(\tau)$; update $\theta$ by $\theta \mapsto \theta + \alpha \hat{g}(\tau)$. No individual update will be in exactly the correct direction, but if we update enough times then on average the steps will be correct.
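To make this loop concrete, here is a minimal sketch of the procedure on a toy one-step problem (a three-armed bandit with made-up rewards, not an ATARI game), using a softmax policy whose score function $\grad_\theta \log \pi(a \mid \theta)$ has the well-known closed form $\mathrm{one\_hot}(a) - p$:

```python
import numpy as np

np.random.seed(0)

# Toy one-step 'environment': three actions with these (made-up)
# expected rewards. The optimal action is index 1.
true_rewards = np.array([0.1, 0.9, 0.3])

# Softmax policy over three logits.
theta = np.zeros(3)

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

alpha = 0.1  # learning rate
for _ in range(5000):
    p = policy(theta)
    # Sample a trajectory (here just one action) from the policy.
    a = np.random.choice(3, p=p)
    # Observe a noisy reward.
    r = true_rewards[a] + 0.1 * np.random.randn()
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - p.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    # Gradient ascent step: theta <- theta + alpha * g_hat.
    theta += alpha * r * grad_log_pi

# After training, the policy should favour the best action.
best = int(np.argmax(policy(theta)))
assert best == 1
```

Each update uses a single sampled trajectory, so individual steps are noisy, but the average drift favours the high-reward action.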

The only thing we have to be able to do is compute $\grad_\theta \log \pi(a \mid x; \theta)$ for any given action $a$ and state $x$. This is where tensorflow comes in, which is basically an automatic differentiation machine.

Policy gradients with lower variance

Although $\hat{g}(\tau)$ is an unbiased estimator of the gradient of the expected reward, it has high variance, which means that we take a lot of steps in poor directions, even though on average the step will be in the correct direction. The next helpful idea is to try and reduce the variance of our estimate of the gradient.

One can replace $R_\tau$ in the expression for $\hat{g}(\tau)$ by $R_t = \sum_{t’=t}^T r_{t’}$ and get the same expectation. This is because if $b_t$ is a function of the states, actions and rewards up to time $t$, then

$$\EE_{s_{t+1\colon\infty},\, a_{t+1\colon\infty}}\left[ b_t \right] = b_t,$$

and is thus a constant. Here, the subscript $s_{t+1\colon \infty}, a_{t+1\colon \infty}$ just means those states and actions in the given range. Thus the equation above just says that the expectation of $b_t$ over future states and actions is just $b_t$ (which sounds obvious when you say it like that!). Now note that, repeating the calculation from my first post,

$$\EE_\tau\left[ \grad_\theta \log \pi(a_t \mid x_t; \theta) \right] = 0.$$

This follows by the same trick as last time, where we both multiply and divide by $P(\tau \mid \theta)$ inside the expectation. When we now multiply by the $b_t$ (which is constant with respect to the future of the trajectory), we thus also get zero expectation. Hence the following two expectations are equal:

$$\EE_\tau\left[ \sum_{t=0}^{T-1} \grad_\theta \log \pi(a_t \mid x_t; \theta) \, R_\tau \right] = \EE_\tau\left[ \sum_{t=0}^{T-1} \grad_\theta \log \pi(a_t \mid x_t; \theta) \, (R_\tau - b_t) \right].$$

If we take $b_t = r_1 + \cdots + r_{t-1}$, then $R_\tau - b_t = R_t$, and we arrive at the estimator

$$\hat{g}(\tau) = \sum_{t=0}^{T-1} \grad_\theta \log \pi(a_t \mid x_t; \theta) \, R_t$$

of the gradient.

Choices for the estimator

There’s a really nice discussion of things you can replace $R_t$ with in this paper. The authors there investigate using biased estimators instead of just the unbiased estimators that we have been talking about. That is, the expected value of the estimator doesn’t actually agree with the gradient $\grad_\theta \EE_\tau (R_\tau)$, but they show that you can get lower variance estimators this way.

In fact, if you just use $R_t$ for the estimator, you recover the REINFORCE algorithm. If you use $R_t - b_t$, where $b_t$ is some function of the state $s_t$ (that you have to learn) it is possible to reduce the variance, as we discussed above.

The A3C algorithm changes this estimator by replacing $R_t$ by something called the advantage function.

Value functions

To discuss the advantage function, we first have to define some useful value functions. Firstly, the state-value function $V_\pi(s)$ is defined as the expected sum of rewards when starting in state $s$ and following policy $\pi$. Formally,

$$V_\pi(s) = \EE_\tau\left[ R_\tau \mid s_0 = s; \pi \right].$$

Even more formally, let $s_0 = s$ and let $s_t, a_t$ be generated by the following stochastic process:

$$a_t \sim \pi(\cdot \mid s_t; \theta), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t)$$

for all $t \ge 0$. Then $R_\tau = r_1 + \cdots + r_T$ as usual, where $r_{t+1}$ is the reward received by the agent for the transition $s_t, a_t, s_{t+1}$, and $V_\pi(s)$ is the expectation of $R_\tau$ under this process.

Next we define the closely related state-action value function $Q_\pi(s, a)$. This is the expected sum of rewards when starting in state $s$, taking action $a$ and from then on following the policy $\pi$. We have

$$Q_\pi(s, a) = \EE_\tau\left[ R_\tau \mid s_0 = s, a_0 = a; \pi \right].$$

More formally, let $s_0 = s, a_0 = a$, and let the subsequent states and actions be generated by the following stochastic process (for $t \ge 0$):

$$s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad a_{t+1} \sim \pi(\cdot \mid s_{t+1}; \theta).$$

Then as before $R_\tau = r_1 + \cdots + r_T$, where $T$ is the length of the episode (assumed finite), and $Q_\pi(s, a) = \EE_\tau(R_\tau)$ in the stochastic process just described.

One can then see that $Q_\pi$ and $V_\pi$ satisfy the following equation:

$$Q_\pi(s, a) = \EE_{s_1 \sim P(\cdot \mid s, a)}\left[ r_1 + V_\pi(s_1) \right].$$
In words: the left hand side is the expected total reward when starting in state $s$, taking action $a$ and then following policy $\pi$ for the rest of the episode. The right hand side is the expected first reward for starting in state $s$ plus the expected value (computed over all possible states $s_1$) of being in state $s_1$ when following $\pi$.

One then defines the advantage function as $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. Intuitively, this is the advantage of taking action $a$ over following the policy $\pi$ at this time step.
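As a small numeric illustration (with made-up values), note that $V_\pi(s)$ is the expectation of $Q_\pi(s, a)$ over the policy’s action distribution, so the advantage measures each action against that average:

```python
import numpy as np

# Made-up action values Q_pi(s, a) for a single state, and the policy's
# action probabilities at that state.
q = np.array([1.0, 3.0, 2.0])
pi = np.array([0.2, 0.5, 0.3])

# V_pi(s) is the expected action value under the policy.
v = float(np.dot(pi, q))  # 0.2*1 + 0.5*3 + 0.3*2 = 2.3

# The advantage of each action over simply following the policy.
advantage = q - v

assert abs(v - 2.3) < 1e-9
assert advantage[1] > 0 > advantage[0]
```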

One could try and estimate the value function $b_t(s_t) \approx V_\pi(s_t)$, and then note that $R_t$ is an estimate of $Q_\pi(s_t, a_t)$ and $b_t(s_t)$ is an estimate of $V_\pi(s_t)$, so that $R_t - b_t(s_t)$ is an estimate of $A_\pi(s_t, a_t)$. This leads us to consider the estimator

$$\hat{g}(\tau) = \sum_{t=0}^{T-1} \grad_\theta \log \pi(a_t \mid s_t; \theta) \, \left( R_t - b_t(s_t) \right),$$

where $b_t(s_t)$ is an estimate of $V_\pi(s_t)$, which we learn simultaneously.

Advantage actor-critic

This leads us to the following algorithm, which could be called advantage actor-critic (i.e. no ‘asynchronous’ yet). The critic refers to the estimate of the value function, while the actor refers to the estimate of the policy.

initialise parameters theta_pi for policy pi
initialise parameters theta_v for value estimate V
choose max_t (e.g. 20)
choose max_training_steps (e.g. 100 million)
choose discount factor gamma

# Update parameters once each time through this loop
for T in range(max_training_steps):
    s = current state of emulator
    initialise arrays states, actions, rewards
    for t in range(max_t):
        sample action a from pi(a | s; theta_pi)
        append a to actions
        append s to states
        perform action a and get new state s' and reward r
        append r to rewards
        s = s'
        break if terminal
    # Now train the parameters
    R = 0
    if not terminal:
        R = V(s, theta_v)  # the estimated value of the latest state
    # Compute discounted rewards
    append R to rewards array
    rewards = discounted(rewards, gamma)
    # Compute the gradients
    for each i:
        R_i = rewards[i]
        val_i = V(s_i, theta_v)
        compute grad_theta_pi log pi(a_i | s_i; theta_pi) * (R_i - val_i)
        compute grad_theta_v (R_i - val_i)^2

    update theta_pi, theta_v by the computed gradients

Here discounted(rewards, gamma) takes an array $[r_1, r_2, \ldots, r_n]$ and a discount factor $0 < \gamma \le 1$ and computes the array $[R_1, \ldots, R_n]$, where $R_t = \sum_{t’=t}^n \gamma^{t’-t} r_{t’}$. Thus $R_t$ is the discounted sum of rewards from time $t$ onwards. Note that the last value in the array, $r_n$, isn’t actually the reward that the agent receives at time $n$, and is in fact either 0 (if the frame was terminal) or is the agent’s estimate of the value of the state that the agent just transitioned to (and hasn’t been able to act on).
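A minimal implementation of discounted might look like this (a sketch, not necessarily the exact code in the repository):

```python
import numpy as np

def discounted(rewards, gamma):
    # Walk backwards through the rewards, accumulating the discounted
    # sum from each time step onwards.
    result = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        result[i] = running
    return result

# With gamma = 0.5: R_3 = 1, R_2 = 1 + 0.5 * 1 = 1.5, R_1 = 1 + 0.5 * 1.5 = 1.75.
returns = discounted([1.0, 1.0, 1.0], 0.5)
assert np.allclose(returns, [1.75, 1.5, 1.0])
```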

Asynchronous advantage actor-critic

The above algorithm has the problem that the agent only observes one part of state space at a time and so the updates might only affect the agent’s performance positively in these parts of state space while decreasing the agent’s performance in other areas. Also, since the updates come from consecutive states played by the agent, the updates are correlated, which is usually a bad thing in machine learning. Even in supervised learning, (for example, classifying images), it is good practice to shuffle training examples while training so as to avoid correlated updates to the network. In the reinforcement learning setting, it can be disastrous!

In the DQN (deep Q-learning) algorithm, the authors get around this problem of correlated updates by using experience replay: a large buffer of all transitions $s_t, a_t, s_{t+1}, r_{t+1}$ that the agent continually adds to. During training, you sample from this replay buffer randomly and use these samples to make updates. This ensures you see uncorrelated transitions, as opposed to transitions in sequence.

With the A3C algorithm, we no longer use experience replay, but instead just use many agents, all exploring the state space simultaneously. The hope is that the different agents will be in different parts of the state space, and thus give uncorrelated updates to the gradients.

Implemented as stated in the asynchronous paper, each agent should run in a separate process so that the training occurs in parallel. A central network holds shared parameters theta_pi, theta_v, and the agents update these asynchronously. This just means that the updates are not synchronised, i.e. each agent updates the shared parameters as soon as it can. Between updates, each agent takes a copy of the shared network, with parameters theta_local_pi, theta_local_v, say, and then runs t_max steps of the simulator, acting via the local policy. After t_max steps, or if it sees a terminal frame, that agent computes the gradients (in its own process) and applies them to the shared parameters. If this all works in parallel, you expect to see a linear speedup with the number of agents.

The code

Let’s implement the algorithm now. The full code can be found as the files custom_gym.py, a3c.py, agent.py in this github repository. There are a few simplifications in the code below, as I felt it was easier to explain without including things like saving and restoring in tensorflow.

Wrapper around Open AI Gym

First we’ll write a small wrapper around Open AI Gym to give us more control over what the agent sees. We’ll see in the tricks section that it’s important to preprocess the images. This includes something called ‘frame skipping’, which is where we only show the agent every $n$th frame for some small $n$. Open AI Gym lets us hold multiple instances of an environment. In the main training program, we create an environment using:

import gym
env = gym.make('SpaceInvaders-v0')

Our ‘CustomGym’ takes in an environment and gives a few extra features on top. It still provides the same functions as a gym environment though. The code in this section all goes into custom_gym.py

# A wrapper class for Open AI Gym.
class CustomGym:
  def __init__(self, game_name, skip_actions=4, num_frames=4, w=84, h=84):
    # game_name: the name of the Open AI Gym environment
    # skip_actions: the number of frames to repeat an action for
    # num_frames: the number of frames to stack in one state
    # w: width of the state input
    # h: height of the state input.

    self.env = gym.make(game_name)
    self.num_frames = num_frames
    self.skip_actions = skip_actions
    self.w = w
    self.h = h

    # For some of the ATARI games the action spaces are much larger than we want
    # and contain repeated actions, so I worked out simplified actions for these
    # games.
    if game_name == 'SpaceInvaders-v0':
      self.action_space = [1,2,3]
    elif game_name == 'Pong-v0':
      self.action_space = [1,2,3]
    elif game_name == 'Breakout-v0':
      self.action_space = [1,4,5]
    else:
      # Otherwise, use the actions specified by Open AI.
      self.action_space = range(self.env.action_space.n)

    # Store the rest.
    self.action_size = len(self.action_space)
    self.state = None
    self.game_name = game_name

Now we have a function to preprocess the frames. Note that it uses numpy (imported as np) and imresize from scipy.misc.

# Preprocess the input and stack the frames.
def preprocess(self, obs, is_start=False):
  # First convert to grayscale
  grayscale = obs.astype('float32').mean(2)
  # Resize the image to w x h and scale to between 0 and 1
  s = imresize(grayscale, (self.w, self.h)).astype('float32') * (1.0/255.0)
  # Next reshape the image to a 4D array with 1st and 4th dimensions of
  # size 1
  s = s.reshape(1, s.shape[0], s.shape[1], 1)
  # Now stack the frames. If this is the first frame, then repeat it
  # num_frames times. Otherwise, prepend it and drop the oldest frame.
  if is_start or self.state is None:
    self.state = np.repeat(s, self.num_frames, axis=3)
  else:
    self.state = np.append(s, self.state[:,:,:,:self.num_frames-1], axis=3)
  return self.state

Next we have the ‘render’, ‘step’ and ‘reset’ functions. The only one of interest is ‘step’, where we want to repeat the given action skip_actions times.

  # Render the current frame
  def render(self):
    self.env.render()
  # Reset the environment and return the state.
  def reset(self):
    return self.preprocess(self.env.reset(), is_start=True)

  # Step the environment with the given action.
  def step(self, action_idx):
    action = self.action_space[action_idx]
    accum_reward = 0
    prev_s = None
    for _ in range(self.skip_actions):
      s, r, term, info = self.env.step(action)
      accum_reward += r
      if term:
        break
      prev_s = s

    # Takes maximum value for each pixel value over the current and previous
    # frame. Used to get round Atari sprites flickering (Mnih et al. (2015))
    if self.game_name == 'SpaceInvaders-v0' and prev_s is not None:
      s = np.maximum.reduce([s, prev_s])
    return self.preprocess(s), accum_reward, term, info

Note that we have a trick at the end of ‘step’ to get around flickering in Space Invaders. It turns out that if you skip 4 frames at a time for Space Invaders then the bullets can sometimes be invisible. So when skipping frames in Space Invaders, we take the maximum value out of the 3rd and 4th frames (if we skip 4 frames at a time).
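For illustration, here is the maximum trick on two tiny made-up frames:

```python
import numpy as np

# Two consecutive (made-up) frames: a bullet pixel visible in only one.
frame_a = np.array([[0.0, 0.8],
                    [0.0, 0.0]])
frame_b = np.array([[0.0, 0.0],
                    [0.0, 0.9]])

# The element-wise maximum keeps anything visible in either frame.
merged = np.maximum.reduce([frame_a, frame_b])
assert merged[0, 1] == 0.8 and merged[1, 1] == 0.9
```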


We now look at the code in the ‘agent.py’ file. First, we build the network in tensorflow. Note that we use variable scopes, which are a useful way of separating out the tensors in your graph. Also we use tf.contrib.layers, which provides higher-level layer functions so we don’t have to create the weight and bias variables by hand. It turns out to be important to initialise the weights and biases well. One well-known method is called ‘Xavier initialisation’, which chooses the variance of the distribution (either uniformly or normally distributed) to be $1/N_\mathrm{in}$, where $N_\mathrm{in}$ is the number of neurons that are input to a given neuron in this layer. Another commonly used alternative is $2 / (N_\mathrm{in} + N_\mathrm{out})$, where $N_\mathrm{out}$ is the number of neurons output from a given neuron.

For fully connected layers, this is easy: the number of neurons input to a given layer is the number of neurons in the previous layer. You can ignore the batch size, since the number of neurons is independent of batch size. For convolutional layers, where the input is a 3D array (ignoring the 1st dimension, which is the batch size), the number of neuron inputs to a given layer is the product of the filter size by the number of input filters. So if the filter size of the current layer is $[4,4]$ and there are 16 filters on the previous layer, then the number of inputs would be $4 \times 4 \times 16$.

There are slightly different heuristics if the nonlinearity is ‘relu’ rather than ‘sigmoid’, but this should work pretty well.
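To illustrate the fan-in arithmetic (a sketch, using the layer sizes that appear later in the post):

```python
import numpy as np

def xavier_stddev(fan_in):
    # Standard deviation for Xavier initialisation with variance 1 / N_in.
    return np.sqrt(1.0 / fan_in)

# Fully connected layer: the fan-in is just the previous layer's size.
assert abs(xavier_stddev(256) - 1.0 / 16) < 1e-12

# Convolutional layer with a [4, 4] kernel over 16 input filters:
fan_in = 4 * 4 * 16
assert fan_in == 256
```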

# Builds the DQN model as in Mnih et al. (DQN paper). We edit the output so that
# there is a softmax output for the policy from the fc1 layer, and a linear
# output from the fc1 layer for the value function.
def build_model(self, h, w, channels):
  state = tf.placeholder('float32', shape=(None, h, w, channels), name='state')
  # First convolutional layer
  with tf.variable_scope('conv1'):
    conv1 = tf.contrib.layers.convolution2d(inputs=state,
      num_outputs=16, kernel_size=[8,8], stride=[4,4], padding="VALID",
      activation_fn=tf.nn.relu,
      weights_initializer=tf.contrib.layers.xavier_initializer())

  # Second convolutional layer
  with tf.variable_scope('conv2'):
    conv2 = tf.contrib.layers.convolution2d(inputs=conv1, num_outputs=32,
      kernel_size=[4,4], stride=[2,2], padding="VALID",
      activation_fn=tf.nn.relu,
      weights_initializer=tf.contrib.layers.xavier_initializer())

  # Flatten the network
  with tf.variable_scope('flatten'):
    flatten = tf.contrib.layers.flatten(inputs=conv2)

  # Fully connected layer with 256 hidden units
  with tf.variable_scope('fc1'):
    fc1 = tf.contrib.layers.fully_connected(inputs=flatten, num_outputs=256,
      activation_fn=tf.nn.relu,
      weights_initializer=tf.contrib.layers.xavier_initializer())

  # The policy output
  with tf.variable_scope('policy'):
    policy = tf.contrib.layers.fully_connected(inputs=fc1,
      num_outputs=self.action_size, activation_fn=tf.nn.softmax,
      weights_initializer=tf.contrib.layers.xavier_initializer())

  # The value output
  with tf.variable_scope('value'):
    value = tf.contrib.layers.fully_connected(inputs=fc1, num_outputs=1,
      activation_fn=None,
      weights_initializer=tf.contrib.layers.xavier_initializer())

  return state, policy, value

Next we make the tensors we need for training.

import tensorflow as tf
class Agent():
    def __init__(self, session, action_size, optimizer=tf.train.AdamOptimizer(1e-4)):
        # session: the tensorflow session
        # action_size: the number of actions
        self.action_size = action_size
        self.optimizer = optimizer
        self.sess = session

        with tf.variable_scope('network'):
            # Store the state, policy and value for the network
            self.state, self.policy, self.value = self.build_model(84, 84, 4)

            # Get the weights for the network
            self.weights = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='network')

            # Placeholders for the action, advantage and target value
            self.action = tf.placeholder('int32', [None], name='action')
            self.target_value = tf.placeholder('float32', [None], name='target_value')
            self.advantages = tf.placeholder('float32', [None], name='advantages')

        with tf.variable_scope('optimizer'):
            # Compute the one hot vectors for each action given.
            action_one_hot = tf.one_hot(self.action, self.action_size, 1.0, 0.0)

            # Clip the policy output to avoid zeros and ones -- these don't play well with taking log.
            min_policy = 0.000001
            max_policy = 0.999999
            self.log_policy = tf.log(tf.clip_by_value(self.policy, min_policy, max_policy))

            # For a given state and action, compute the log of the policy at
            # that action for that state. This also works on batches.
            self.log_pi_for_action = tf.reduce_sum(tf.multiply(self.log_policy, action_one_hot), reduction_indices=1)

            # Takes in R_t - V(s_t) as in the async paper. Note that we feed in
            # the advantages so that V(s_t) is treated as a constant for the
            # gradient. This is because V(s_t) is the baseline (called 'b' in
            # the REINFORCE algorithm). As long as the baseline is constant wrt
            # the parameters we are optimising (in this case those for the
            # policy), then the expected value of grad_theta log pi * b is zero,
            # so the choice of b doesn't affect the expectation. It reduces the
            # variance though.
            # We want to do gradient ascent on the expected discounted reward.
            # The gradient of the expected discounted reward is the gradient of
            # log pi * (R - estimated V), where R is the sampled reward from the
            # given state following the policy pi. Since we want to maximise
            # this, we define the policy loss as the negative and get tensorflow
            # to do the automatic differentiation for us.
            self.policy_loss = -tf.reduce_mean(self.log_pi_for_action * self.advantages)

            # The value loss is much easier to understand: we want our value
            # function to accurately estimate the sampled discounted rewards,
            # so we just impose a square error loss.
            # Note that the target value should be the discounted reward for the
            # state as just sampled.
            self.value_loss = tf.reduce_mean(tf.square(self.target_value - self.value))

            # We follow Mnih's paper and introduce the entropy as another loss
            # to the policy. The entropy of a probability distribution is just
            # the expected value of - log P(X), denoted E(-log P(X)), which we
            # can compute for our policy at any given state with
            # sum(policy * -log(policy)), as below. This will be a positive
            # number, since self.policy contains numbers between 0 and 1, so the
            # log is negative. Note that entropy is smaller when the probability
            # distribution is more concentrated on one action, so a larger
            # entropy implies more exploration. Thus we penalise small entropy,
            # or equivalently, add -entropy to our loss.
            self.entropy = tf.reduce_sum(tf.multiply(self.policy, -self.log_policy))

            # Try to minimise the loss. There is some rationale for choosing the
            # weighted linear combination here that I found somewhere else that
            # I can't remember, but I haven't tried to optimise it.
            # Note the negative entropy term, which encourages exploration:
            # higher entropy corresponds to less certainty.
            self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

            # Compute the gradient of the loss with respect to all the weights,
            # and create a list of tuples consisting of the gradient to apply to
            # the weight.
            grads = tf.gradients(self.loss, self.weights)
            grads, _ = tf.clip_by_global_norm(grads, 40.0)
            grads_vars = list(zip(grads, self.weights))

            # Create an operator to apply the gradients using the optimizer.
            # Note that apply_gradients is the second part of minimize() for the
            # optimizer, so will minimize the loss.
            self.train_op = optimizer.apply_gradients(grads_vars)
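As a quick sanity check on the entropy term defined above, we can verify with numpy (outside of tensorflow) that a uniform distribution has higher entropy than a peaked one:

```python
import numpy as np

def entropy(p):
    # E[-log P(X)], computed as sum(p * -log(p)).
    return float(np.sum(p * -np.log(p)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked = np.array([0.97, 0.01, 0.01, 0.01])

# The uniform (most exploratory) distribution has the higher entropy,
# which is why subtracting the entropy from the loss encourages
# exploration.
assert entropy(uniform) > entropy(peaked)
assert abs(entropy(uniform) - np.log(4)) < 1e-9
```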

Finally, we have some helper functions for getting the policy, value and training.

    def get_value(self, state):
        return self.sess.run(self.value, {self.state: state}).flatten()

    def get_policy_and_value(self, state):
        policy, value = self.sess.run([self.policy, self.value], {self.state: state})
        return policy.flatten(), value.flatten()

    # Train the network on the given states and rewards
    def train(self, states, actions, target_values, advantages):
        # Training
        self.sess.run(self.train_op, feed_dict={
            self.state: states,
            self.action: actions,
            self.target_value: target_values,
            self.advantages: advantages
        })
The main training program

In the file ‘a3c.py’, we run the main training loop. Note that we use a Queue in order to count the number of training steps we have done globally. This is a nice way of incrementing a single number across many threads. I found that if I tried to do this with a central variable in tensorflow, then the number wouldn’t always increment correctly across threads, and you would find sequences like $1, 2, 3, 4, 4, 4, 7, 8$.
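The pattern is simply to keep a single number in the queue: every increment removes it and puts back the incremented value, so no two threads can interleave an increment. A small self-contained sketch (using the Python 3 module name queue as a fallback, since the post itself is written for Python 2’s Queue):

```python
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2, as used in the post

# A single number lives in the queue. Because get() blocks while the
# queue is empty, only one thread at a time can hold the number.
counter = queue.Queue()
counter.put(0)

def increment(n):
    for _ in range(n):
        t = counter.get()
        counter.put(t + 1)

threads = [threading.Thread(target=increment, args=(1000,)) for _ in range(8)]
for th in threads:
    th.start()
for th in threads:
    th.join()

total = counter.get()
counter.put(total)
assert total == 8000
```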

def async_trainer(agent, env, sess, thread_idx, T_queue, summary, saver):
    print "Training thread", thread_idx
    T = T_queue.get()
    T_queue.put(T)
    t = 0

    last_verbose = T
    last_time = time()
    last_target_update = T

    terminal = True
    while T < T_MAX:
        t_start = t
        batch_states = []
        batch_rewards = []
        batch_actions = []
        baseline_values = []

        if terminal:
            terminal = False
            state = env.reset()

        while not terminal and len(batch_states) < I_ASYNC_UPDATE:
            # Save the current state
            batch_states.append(state)

            # Choose an action randomly according to the policy
            # probabilities. We do this anyway to prevent us having to compute
            # the baseline value separately.
            policy, value = agent.get_policy_and_value(state)
            action_idx = np.random.choice(agent.action_size, p=policy)

            # Take the action and get the next state, reward and terminal.
            state, reward, terminal, _ = env.step(action_idx)

            # Update counters
            t += 1
            T = T_queue.get()
            T_queue.put(T + 1)

            # Clip the reward to be between -1 and 1
            reward = np.clip(reward, -1, 1)

            # Save the rewards, actions and baseline values
            batch_rewards.append(reward)
            batch_actions.append(action_idx)
            baseline_values.append(value[0])

        target_value = 0
        # If the last state was terminal, just put R = 0. Else we want the
        # estimated value of the last state.
        if not terminal:
            target_value = agent.get_value(state)[0]
        last_R = target_value

        # Compute the sampled n-step discounted reward
        batch_target_values = []
        for reward in reversed(batch_rewards):
            target_value = reward + DISCOUNT_FACTOR * target_value
            batch_target_values.append(target_value)
        # Reverse the batch target values, so they are in the correct order
        # again.
        batch_target_values.reverse()
        # Compute the advantage of each state: the n-step return minus the
        # baseline value the critic predicted for that state.
        batch_advantages = np.array(batch_target_values) - np.array(baseline_values)

        # Apply asynchronous gradient update
        agent.train(np.vstack(batch_states), batch_actions, batch_target_values,
            batch_advantages)

    global training_finished
    training_finished = True
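To make the backward recursion in the training loop concrete, here is a standalone sketch of the n-step target and advantage computation. The helper name `n_step_targets` and the toy numbers are mine, not part of the actual code:

```python
import numpy as np

def n_step_targets(rewards, bootstrap_value, discount):
    # Work backwards from the value estimate of the last state, exactly
    # as in the training loop: R <- r + gamma * R.
    target = bootstrap_value
    targets = []
    for r in reversed(rewards):
        target = r + discount * target
        targets.append(target)
    targets.reverse()
    return targets

# Toy example with gamma = 0.5: three rewards, bootstrap value 2.
targets = n_step_targets([1.0, 0.0, 1.0], 2.0, 0.5)
print(targets)  # [1.5, 1.0, 2.0]

# The advantage is the n-step target minus the critic's baseline.
baselines = [1.0, 1.0, 1.0]
advantages = np.array(targets) - np.array(baselines)
print(advantages)  # 0.5, 0.0, 1.0
```

Note that the last state's target (here $1 + 0.5 \times 2 = 2$) uses the bootstrap value directly, while earlier states fold in more of the actual sampled rewards.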

We have now essentially written the code for the advantage actor-critic algorithm; this is where we make it asynchronous. We run the ‘async_trainer’ function in separate threads, each with its own CustomGym instance, so that the agents can be in different parts of the state space at any one time. In fact, we create one more CustomGym than the number of trainer threads, which is used to evaluate training progress.

def a3c(game_name, num_threads=8, restore=None, save_path=None):
    # Create the environments
    envs = []
    for _ in range(num_threads):
        envs.append(CustomGym(game_name))

    # Also create an environment for evaluating the agent
    evaluation_env = CustomGym(game_name)

    # Create the tensorflow session and use it in context
    with tf.Session() as sess:
        # Create the agent
        agent = Agent(session=sess, action_size=envs[0].action_size,
            optimizer=tf.train.AdamOptimizer(INITIAL_LEARNING_RATE))

        # Create a saver, and only keep 2 checkpoints.
        saver = tf.train.Saver(max_to_keep=2)

        T_queue = Queue.Queue()

        # Either restore the parameters or initialise them from scratch.
        if restore is not None:
            saver.restore(sess, save_path + '-' + str(restore))
            last_T = restore
            print "T was:", last_T
            T_queue.put(last_T)
        else:
            sess.run(tf.global_variables_initializer())
            T_queue.put(0)

        summary = Summary(save_path, agent)

        # Create a process for each worker
        processes = []
        for i in range(num_threads):
            processes.append(threading.Thread(target=async_trainer, args=(agent,
            envs[i], sess, i, T_queue, summary, saver, save_path,)))

        # Create a process to evaluate the agent
        processes.append(threading.Thread(target=evaluator, args=(agent,
        evaluation_env, sess, T_queue,)))

        # Start all the processes
        for p in processes:
            p.daemon = True
            p.start()

        # Until training is finished
        while not training_finished:
            sleep(0.01)

        # Join the processes, so we get this thread back.
        for p in processes:
            p.join()

There is a simple function to estimate the reward of the agent at a given point in time.

def estimate_reward(agent, env, episodes=10, max_steps=10000):
    episode_rewards = []
    episode_vals = []
    t = 0
    for i in range(episodes):
        episode_reward = 0
        state = env.reset()
        terminal = False
        while not terminal:
            policy, value = agent.get_policy_and_value(state)
            action_idx = np.random.choice(agent.action_size, p=policy)
            state, reward, terminal, _ = env.step(action_idx)
            t += 1
            episode_vals.append(value)
            episode_reward += reward
            if t > max_steps:
                return episode_rewards, episode_vals
        episode_rewards.append(episode_reward)
    return episode_rewards, episode_vals

We also have the evaluator function, which runs in a separate thread and evaluates the agent every VERBOSE_EVERY training steps, estimating its reward at the current point in training. If it is given a saver and a save_path, then it also saves the parameters. The settings for this can be found in the github version of the code, but for simplicity I don’t include them here.

def evaluator(agent, env, sess, T_queue):
    # Read T and put the same T back on.
    T = T_queue.get()
    T_queue.put(T)
    last_time = time()
    last_verbose = T
    while T < T_MAX:
        T = T_queue.get()
        T_queue.put(T)
        if T - last_verbose >= VERBOSE_EVERY:
            print "T", T
            current_time = time()
            print "Train steps per second", float(T - last_verbose) / (current_time - last_time)
            last_time = current_time
            last_verbose = T

            print "Evaluating agent"
            episode_rewards, episode_vals = estimate_reward(agent, env, episodes=5)
            avg_ep_r = np.mean(episode_rewards)
            avg_val = np.mean(episode_vals)
            print "Avg ep reward", avg_ep_r, "Average value", avg_val

Finally, we import some libraries and set up the all-important hyperparameters.

import os
import sys, getopt
import threading
import tensorflow as tf
import numpy as np
from time import time, sleep
import Queue
from custom_gym import CustomGym
import random
from agent import Agent

# Train for this many steps
T_MAX = 100000000
# Use this many threads
NUM_THREADS = 8
# Initial learning rate for Adam
INITIAL_LEARNING_RATE = 1e-4
# The discount factor
DISCOUNT_FACTOR = 0.99
# Evaluate the agent and print out average reward every this many steps
VERBOSE_EVERY = 40000
# Update the parameters in each thread after this many steps in that thread
I_ASYNC_UPDATE = 5
# Use this global variable to exit the training loop in each thread once we've finished.
training_finished = False

Now just call a3c with the desired game name.
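For example (the game name and save path here are placeholders of my choosing, not values from the original code):

```python
if __name__ == '__main__':
    a3c('SpaceInvaders-v0', num_threads=NUM_THREADS,
        save_path='checkpoints/model')
```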


Tricks to get it to work

This part is coming soon!

While you wait… Here is the agent playing Space Invaders: