10.6 Q&A
Q: What is the relationship between actor-critic and policy gradient methods?
A: Actor-critic methods are policy gradient methods, and the two terms are sometimes used interchangeably. Every policy gradient algorithm must estimate action values in some way. When the action values are estimated by temporal-difference learning with value function approximation, the resulting policy gradient algorithm is called actor-critic. The name "actor-critic" highlights the algorithmic structure that combines a policy update (the actor) with a value update (the critic). This structure of interacting policy and value updates is also the fundamental structure underlying reinforcement learning algorithms in general.
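For concreteness, a minimal sketch of this structure is given below, assuming the generic notation $\pi(a|s,\theta)$ for the parameterized policy, $q(s,a,w)$ for the approximated action value, and step sizes $\alpha_\theta,\alpha_w$ (the symbols here are illustrative rather than tied to a specific algorithm in this chapter):
\[
  \underbrace{\theta_{t+1} = \theta_t + \alpha_\theta\,
    \nabla_\theta \ln \pi(a_t \mid s_t, \theta_t)\, q(s_t, a_t, w_t)}_{\text{actor: policy update}}
\]
\[
  \underbrace{w_{t+1} = w_t + \alpha_w
    \big[r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]
    \nabla_w q(s_t, a_t, w_t)}_{\text{critic: TD value update}}
\]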
Q: Why is it important to introduce an additional baseline into actor-critic methods?
A: The policy gradient is invariant to an additive state-dependent baseline: subtracting a baseline from the action value does not change the expected gradient. We can therefore choose the baseline to reduce the variance of the gradient estimate. A common choice is the state value, which replaces the action value by the advantage function. The resulting algorithm is called advantage actor-critic.
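The invariance can be sketched as follows, where $b(s)$ denotes any state-dependent baseline and $d(s)$ the state distribution (the notation is assumed here for illustration):
\[
  \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta)\, b(S)\big]
  = \sum_{s} d(s)\, b(s) \sum_{a} \nabla_\theta \pi(a \mid s, \theta)
  = \sum_{s} d(s)\, b(s)\, \nabla_\theta \underbrace{\sum_{a} \pi(a \mid s, \theta)}_{=1} = 0,
\]
so
\[
  \nabla_\theta J(\theta)
  = \mathbb{E}\big[\nabla_\theta \ln \pi(A \mid S, \theta)\,\big(q_\pi(S,A) - b(S)\big)\big],
\]
and choosing $b(s) = v_\pi(s)$ yields the advantage $q_\pi(s,a) - v_\pi(s)$.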
Q: Can importance sampling be used in value-based algorithms in addition to policy-based ones?
A: Yes. Importance sampling is a general technique for estimating the expectation of a random variable under one distribution using samples drawn from another distribution. It is useful in reinforcement learning because many quantities of interest are defined as expectations. For example, in value-based methods, action and state values are defined as expectations; in policy gradient methods, the true gradient is also an expectation. As a result, importance sampling can be applied in both value-based and policy-based algorithms. In fact, it is already applied in the value-based component of Algorithm 10.3.
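As a small standalone illustration of the technique itself (the discrete distributions p0 and p1 below are chosen arbitrarily and are not part of any algorithm in this chapter), the following sketch estimates an expectation under p0 using only samples drawn from p1, by reweighting each sample with the ratio p0(x)/p1(x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Values of the random variable and two distributions over them.
values = np.array([-1.0, 0.0, 1.0, 2.0])
p0 = np.array([0.1, 0.2, 0.3, 0.4])   # target distribution (expectation we want)
p1 = np.array([0.4, 0.3, 0.2, 0.1])   # behavior distribution (what we sample from)

# Draw samples from p1 only.
idx = rng.choice(len(values), size=100_000, p=p1)
samples = values[idx]

# Importance weights p0(x)/p1(x) correct for the distribution mismatch.
weights = p0[idx] / p1[idx]
estimate = np.mean(weights * samples)

print("true E_p0[X]       =", np.dot(p0, values))  # 1.0
print("importance-sampled =", estimate)            # close to 1.0
```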
Q: Why is the deterministic policy gradient method off-policy?
A: In the deterministic case, the true gradient is an expectation over states only; it does not involve the action as a random variable drawn from the policy. As a result, when we use samples to approximate the true gradient, the actions in the experience samples need not be generated by the target policy, so any behavior policy can be used. Therefore, the deterministic policy gradient method is off-policy.
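This can be seen from the form of the deterministic policy gradient, sketched here with a deterministic policy $a = \mu(s,\theta)$ and a state distribution $\rho$ (notation assumed for illustration); note that the expectation is taken over states only:
\[
  \nabla_\theta J(\theta)
  = \mathbb{E}_{S \sim \rho}\!\left[
      \nabla_\theta \mu(S, \theta)\,
      \nabla_a q_\mu(S, a)\big|_{a = \mu(S, \theta)}
    \right].
\]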