Deterministic Policy Gradient Methods

Hey guys, today we will be talking about deterministic policy gradient algorithms, first introduced by David Silver and colleagues - here - in 2014.

We talked about stochastic policy gradient algorithms in another post - here - including Monte Carlo policy gradient with and without a baseline, the Actor-Critic method, and finite difference methods. Do go through that post as well.

You can find the Deterministic Policy Gradient code we use for this post - here (C++).

The advantage of deterministic policy gradient algorithms is that we now need to sample only over the state space to get an estimate of the gradient for learning. That is, if the policy were stochastic, so that multiple actions could be selected in each state, the sample complexity for estimating the gradient would be of order |S| x |A|.

For deterministic policy gradient algorithms, the integral over the action space in the gradient estimate vanishes!
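For reference, here are the two gradient theorems side by side: the stochastic policy gradient averages over both the state distribution and the action distribution, while the deterministic policy gradient from the paper averages over states only.

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\right]$$

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\right]$$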

We will be replicating results from David Silver's paper today, on a continuous-action mountain car problem. Note that the paper does not mention learning rates, and I found this algorithm to be highly sensitive to the learning rate: changing it even by a factor of 0.75 produced inferior results. I had to do a parameter sweep, and this was feasible only because of the simplicity of the environment. The algorithm has three learning rates: one for the actor and two in the value function estimation part. We will be using Richard Sutton's tile coding for the state representation, with 10 tilings.
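To illustrate the idea, here is a minimal grid-style tile coder for the two-dimensional mountain car state (position, velocity). It is a simplified sketch with assumed bounds and offsets, not Sutton's implementation or the exact code linked above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal grid-style tile coder for mountain car (position, velocity).
// Each tiling is a uniform grid shifted by a fraction of a tile width;
// bounds and grid sizes below are assumptions for illustration.
std::vector<std::size_t> tileFeatures(double pos, double vel,
                                      int numTilings = 10,
                                      int tilesPerDim = 9) {
    const double posLo = -1.2, posHi = 0.6;      // standard mountain car bounds
    const double velLo = -0.07, velHi = 0.07;
    const double posWidth = (posHi - posLo) / tilesPerDim;
    const double velWidth = (velHi - velLo) / tilesPerDim;

    std::vector<std::size_t> active;
    for (int t = 0; t < numTilings; ++t) {
        // Offset tiling t by t/numTilings of a tile width in each dimension.
        double pOff = pos - posLo + t * posWidth / numTilings;
        double vOff = vel - velLo + t * velWidth / numTilings;
        int pIdx = std::min(tilesPerDim, static_cast<int>(pOff / posWidth));
        int vIdx = std::min(tilesPerDim, static_cast<int>(vOff / velWidth));
        int tilesPerTiling = (tilesPerDim + 1) * (tilesPerDim + 1);
        active.push_back(static_cast<std::size_t>(
            t * tilesPerTiling + pIdx * (tilesPerDim + 1) + vIdx));
    }
    return active;  // one active binary feature per tiling
}
```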


[Figure: Time steps vs. training episodes, alpha_actor = 0.00075]

Clearly, at the best learning rate the performance of this algorithm is extremely good, considering that we are working in continuous action spaces. We have used the DPG + Q-learning variant; feel free to modify the equations to incorporate gradient temporal difference learning algorithms.
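For concreteness, here is a rough sketch of a single DPG + Q-learning (COPDAC-Q) update with a linear actor mu(s) = theta^T phi(s), a scalar action, and binary tile-coded features. The structure and variable names are my own simplification of the updates in the paper, not the exact code linked above:

```cpp
#include <cstddef>
#include <vector>

// One COPDAC-Q update: DPG actor + Q-learning critic with compatible
// function approximation, Q^w(s,a) = (a - mu(s)) * w^T phi(s) + v^T phi(s).
struct CopdacQ {
    std::vector<double> theta;  // actor weights
    std::vector<double> w;      // advantage (compatible feature) weights
    std::vector<double> v;      // state-value weights
    double alphaTheta, alphaW, alphaV, gamma;

    // Dot product of dense weights with binary tile features.
    static double dot(const std::vector<double>& weights,
                      const std::vector<std::size_t>& active) {
        double sum = 0.0;
        for (std::size_t i : active) sum += weights[i];
        return sum;
    }

    void update(const std::vector<std::size_t>& phiS,   // phi(s)
                const std::vector<std::size_t>& phiS2,  // phi(s')
                double action, double reward, bool terminal) {
        double mu = dot(theta, phiS);                    // mu_theta(s)
        double q  = (action - mu) * dot(w, phiS) + dot(v, phiS);  // Q^w(s,a)
        // Q^w(s', mu(s')) reduces to V^v(s'): the advantage term is zero there.
        double qNext = terminal ? 0.0 : dot(v, phiS2);
        double delta = reward + gamma * qNext - q;       // Q-learning TD error

        double gradMuDotW = dot(w, phiS);                // grad_theta mu(s)^T w
        for (std::size_t i : phiS) {
            theta[i] += alphaTheta * gradMuDotW;               // actor ascent
            w[i]     += alphaW * delta * (action - mu);        // critic (advantage)
            v[i]     += alphaV * delta;                        // critic (baseline)
        }
    }
};
```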

Exploration was turned off during testing. During learning, exploration was facilitated by a Gaussian behaviour policy with mean theta^T phi(s) and a variance that started at 1 and decreased over time.
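Something along these lines; the linear variance decay below is only for illustration, since all the post requires is that the variance starts at 1 and shrinks as training proceeds:

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sample an exploratory action around the deterministic actor output.
// The decay schedule is an assumption, not necessarily the one used in the linked code.
double exploratoryAction(const std::vector<double>& theta,
                         const std::vector<std::size_t>& phiS,
                         int episode, int totalEpisodes,
                         std::mt19937& rng) {
    double mu = 0.0;
    for (std::size_t i : phiS) mu += theta[i];   // mu_theta(s) = theta^T phi(s)

    double variance = 1.0 - static_cast<double>(episode) / totalEpisodes;
    if (variance < 0.01) variance = 0.01;        // keep a small floor while learning
    std::normal_distribution<double> noise(mu, std::sqrt(variance));
    return noise(rng);                           // at test time, just return mu
}
```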

Pros:
1) Great asymptotic performance, mostly better than stochastic policy gradient algorithms as the action space dimensionality grows.
2) The policy is deterministic! This is highly desirable, say in robotics, where we may not want stochastic actions during deployment.
3) Faster learning compared to stochastic policy gradient algorithms, due to the reduced sample complexity of estimating the gradient.

Cons:
1) Highly sensitive to the learning rates. A parameter sweep becomes infeasible for bigger MDPs.
2) If the optimal policy is inherently stochastic, it cannot be represented. This might be the case in POMDPs, where a probability distribution over actions might be useful.

The scaled-up extension of this work is the Deep Deterministic Policy Gradient (DDPG) algorithm. We will not be discussing it in this post, as there are plenty of good tutorials available online, including code, for example - here.

Hope you found the code and the post useful. Feel free to drop me a mail if you have any questions. Cheers :)