Jul 16, 2018 deep-learning, reinforcement-learning
WHY?
Policy gradient methods usually require an integral over all possible actions.
WHAT?
The purpose of reinforcement learning is to learn a policy that maximizes the objective function

$$J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\pi_{\theta}(a|s)r(s,a)\,da\,ds = \mathbb{E}_{s\sim\rho^{\pi}, a\sim\pi_{\theta}}[r(s,a)]$$

Policy gradient methods train the policy network directly to maximize this objective.

- Stochastic Policy Gradient

$$\nabla_{\theta}J(\pi_{\theta}) = \int_{\mathcal{S}}\rho^{\pi}(s)\int_{\mathcal{A}}\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)\,da\,ds = \mathbb{E}_{s\sim\rho^{\pi}, a\sim\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q^{\pi}(s,a)]$$

Since this assumes a stochastic policy, it is called the stochastic policy gradient. If a sample return is used to estimate the action-value function, the algorithm is called REINFORCE (see the first sketch after this list).

- Stochastic Actor-Critic

We can train another network, the critic, to learn the action-value function by TD learning:

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{s\sim\rho^{\pi}, a\sim\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)Q^{w}(s,a)]$$

$$\epsilon^{2}(w) = \mathbb{E}_{s\sim\rho^{\pi}, a\sim\pi_{\theta}}[(Q^{w}(s,a) - Q^{\pi}(s,a))^{2}]$$

- Off-Policy Actor-Critic (OffPAC)

On-policy learning limits exploration. Off-policy learning uses one policy $\beta$ to behave and another policy $\pi_{\theta}$ to evaluate:

$$J_{\beta}(\pi_{\theta}) = \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\pi_{\theta}(a|s)Q^{\pi}(s,a)\,da\,ds$$

$$\nabla_{\theta}J_{\beta}(\pi_{\theta}) \approx \int_{\mathcal{S}}\int_{\mathcal{A}}\rho^{\beta}(s)\nabla_{\theta}\pi_{\theta}(a|s)Q^{\pi}(s,a)\,da\,ds = \mathbb{E}_{s\sim\rho^{\beta}, a\sim\beta}\left[\frac{\pi_{\theta}(a|s)}{\beta_{\theta}(a|s)}\nabla_{\theta}\log\pi_{\theta}(a|s)Q^{\pi}(s,a)\right]$$

This off-policy actor-critic (OffPAC) requires importance sampling.

- Deterministic Policy Gradient

In a continuous action space, the integral over the whole action space is intractable. The deterministic policy gradient uses a deterministic policy $\mu_{\theta}(s)$ instead of $\pi_{\theta}(a|s)$ and moves the policy in the direction of the gradient of $Q$. The deterministic policy gradient is a limiting case of the stochastic policy gradient as the policy variance goes to zero.

$$\theta^{k+1} = \theta^{k} + \alpha\,\mathbb{E}_{s\sim\rho^{\mu^{k}}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu^{k}}(s,a)|_{a=\mu_{\theta}(s)}]$$

$$J(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\mu}(s)r(s, \mu_{\theta}(s))\,ds$$

$$\nabla_{\theta}J(\mu_{\theta}) = \mathbb{E}_{s\sim\rho^{\mu}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]$$

- Off-Policy Deterministic Actor-Critic (OPDAC)

As in the stochastic case, off-policy learning is needed to ensure adequate exploration. We can use Q-learning to train the critic (see the second sketch after this list):

$$J_{\beta}(\mu_{\theta}) = \int_{\mathcal{S}}\rho^{\beta}(s)Q^{\mu}(s, \mu_{\theta}(s))\,ds$$

$$\nabla_{\theta}J_{\beta}(\mu_{\theta}) \approx \mathbb{E}_{s\sim\rho^{\beta}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}]$$

$$\delta_{t} = r_{t} + \gamma Q^{w}(s_{t+1}, \mu_{\theta}(s_{t+1})) - Q^{w}(s_{t}, a_{t})$$

$$w_{t+1} = w_{t} + \alpha_{w}\delta_{t}\nabla_{w}Q^{w}(s_{t}, a_{t})$$

$$\theta_{t+1} = \theta_{t} + \alpha_{\theta}\nabla_{\theta}\mu_{\theta}(s_{t})\nabla_{a}Q^{w}(s_{t}, a_{t})|_{a=\mu_{\theta}(s_{t})}$$

The deterministic policy removes the integral over actions, and Q-learning removes the need for importance sampling.

- Compatible Off-Policy Deterministic Actor-Critic (COPDAC)

Since a function approximator $Q^{w}(s,a)$ may not follow the true gradient, the paper gives two conditions for a compatible action-value function:

1. $\nabla_{a}Q^{w}(s,a)|_{a=\mu_{\theta}(s)} = \nabla_{\theta}\mu_{\theta}(s)^{T}w$
2. $w$ minimizes the mean squared error of $\epsilon(s;\theta,w) = \nabla_{a}Q^{w}(s,a)|_{a=\mu_{\theta}(s)} - \nabla_{a}Q^{\mu}(s,a)|_{a=\mu_{\theta}(s)}$

The resulting algorithm is called compatible off-policy deterministic actor-critic (COPDAC); a sketch of its Q-learning variant is given after this list. A baseline function can be used to reduce the variance of the gradient estimator. If gradient Q-learning is used for the critic, the algorithm is called COPDAC-GQ.
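To make the stochastic policy gradient concrete, here is a minimal REINFORCE sketch (not from the paper), assuming a linear-softmax policy over a small discrete action set; `n_features`, `n_actions`, and the step sizes are illustrative placeholders.

```python
import numpy as np

# Minimal REINFORCE sketch: a linear-softmax policy over discrete actions,
# updated with the score-function estimator and the sampled return G_t.
n_features, n_actions = 4, 3     # illustrative dimensions

def policy(theta, s):
    """pi_theta(.|s): softmax over linear action preferences."""
    prefs = s @ theta
    prefs -= prefs.max()         # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for the linear-softmax policy."""
    p = policy(theta, s)
    one_hot = np.eye(n_actions)[a]
    return np.outer(s, one_hot - p)

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """episode: list of (s, a, r); the sampled return G_t estimates Q^pi(s_t, a_t)."""
    grad, G = np.zeros_like(theta), 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                        # return from time t onward
        grad += G * grad_log_pi(theta, s, a)     # G_t * grad log pi(a_t|s_t)
    return theta + alpha * grad
```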
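The OPDAC updates above translate almost line by line into code. Below is a minimal sketch, assuming a linear deterministic policy $\mu_{\theta}(s) = \theta^{T}s$ and a critic that is linear in hand-crafted state-action features so that $\nabla_{a}Q^{w}$ has a closed form; the feature map, dimensions, and step sizes are illustrative, not the paper's experimental setup.

```python
import numpy as np

# Minimal OPDAC-style sketch with a linear policy and a linear-in-features critic.
d, m = 4, 2                          # state and action dimensions (illustrative)

def mu(theta, s):
    """Deterministic policy mu_theta(s) = theta^T s."""
    return theta.T @ s

def psi(s, a):
    """Critic features: state, action, and state-action interaction terms."""
    return np.concatenate([s, a, np.outer(s, a).ravel()])

def q(w, s, a):
    """Q_w(s, a) = w^T psi(s, a)."""
    return w @ psi(s, a)

def grad_a_q(w, s):
    """grad_a Q_w(s, a); for this critic, linear in a, it depends only on s."""
    w_a = w[d:d + m]
    w_sa = w[d + m:].reshape(d, m)
    return w_a + w_sa.T @ s

def opdac_step(theta, w, s, a, r, s_next, gamma=0.99, alpha_w=1e-2, alpha_theta=1e-3):
    """One off-policy deterministic actor-critic update from a behavior-policy transition."""
    delta = r + gamma * q(w, s_next, mu(theta, s_next)) - q(w, s, a)    # Q-learning TD error
    # actor: move theta along grad_theta mu(s) * grad_a Q_w(s, a)|_{a = mu_theta(s)}
    theta_new = theta + alpha_theta * np.outer(s, grad_a_q(w, s))
    w_new = w + alpha_w * delta * psi(s, a)                             # critic: grad_w Q_w = psi(s, a)
    return theta_new, w_new

# Usage: theta, w = opdac_step(np.zeros((d, m)), np.zeros(d + m + d * m), s, a, r, s_next)
```

Since this critic is linear in $a$, $\nabla_{a}Q^{w}$ depends only on the state, which keeps the actor update cheap.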
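For COPDAC, the compatible critic combines an advantage term built from $\nabla_{\theta}\mu_{\theta}(s)^{T}w$ with a baseline $V^{v}(s)$. The sketch below again assumes a linear policy $\mu_{\theta}(s) = \theta^{T}s$, so $\nabla_{\theta}\mu_{\theta}(s)^{T}w$ reduces to $w^{T}s$ and the baseline is $v^{T}s$; it is a sketch of the Q-learning variant (COPDAC-Q) under those assumptions, not the paper's exact implementation.

```python
import numpy as np

# Minimal COPDAC-Q sketch with a linear policy and the compatible critic
# Q(s, a) = (a - mu(s))^T (w^T s) + v^T s.
d, m = 4, 2                    # state and action dimensions (illustrative)

def mu(theta, s):
    return theta.T @ s

def q_compat(theta, w, v, s, a):
    """Compatible critic: advantage term (a - mu(s))^T (w^T s) plus baseline v^T s."""
    return (a - mu(theta, s)) @ (w.T @ s) + v @ s

def copdac_q_step(theta, w, v, s, a, r, s_next,
                  gamma=0.99, alpha_theta=1e-3, alpha_w=1e-2, alpha_v=1e-2):
    """One COPDAC-Q update; at s_next the critic is evaluated at a = mu(s_next)."""
    delta = (r + gamma * q_compat(theta, w, v, s_next, mu(theta, s_next))
             - q_compat(theta, w, v, s, a))
    adv_feat = np.outer(s, a - mu(theta, s))                  # grad_w Q for the linear policy
    theta_new = theta + alpha_theta * np.outer(s, w.T @ s)    # grad_theta mu * (w^T s)
    w_new = w + alpha_w * delta * adv_feat
    v_new = v + alpha_v * delta * s                           # grad_v Q = s
    return theta_new, w_new, v_new
```

Exploration comes from whatever behavior policy generated the transition, for example $\mu_{\theta}$ plus noise.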
Critic
A great review of policy gradient algorithms.

Silver, David, et al. "Deterministic Policy Gradient Algorithms." ICML, 2014.