cherry.algorithms.a2c

Description

Helper functions for implementing A2C.

A2C simply computes the gradient of the policy as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \right]$$

policy_loss

policy_loss(log_probs, advantages)

[Source]

Description

The policy loss of the Advantage Actor-Critic.

This function simply performs an element-wise multiplication and a mean reduction.
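
As a point of reference, the computation described above roughly amounts to the snippet below. This is a hedged sketch, not cherry's source: in particular, the leading negation (so that the result can be minimized with gradient descent) is an assumption.

import torch

def a2c_policy_loss_sketch(log_probs, advantages):
    # Element-wise product of log-probabilities and advantages, mean-reduced.
    # The minus sign is an assumption: it turns the policy-gradient objective
    # into a loss suitable for gradient descent.
    return -(log_probs * advantages).mean()

# Hypothetical usage with random tensors.
log_probs = torch.randn(32, 1)
advantages = torch.randn(32, 1)
loss = a2c_policy_loss_sketch(log_probs, advantages)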

References

  1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

Returns

Example

advantages = replay.advantage()
log_probs = replay.log_prob()
loss = a2c.policy_loss(log_probs, advantages)

state_value_loss

state_value_loss(values, rewards)

[Source]

Description

The state-value loss of the Advantage Actor-Critic.

This function is equivalent to an MSELoss.
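
Since the description states the equivalence with an MSELoss, a minimal sketch of that equivalence with plain PyTorch (hypothetical tensors) is:

import torch
import torch.nn.functional as F

values = torch.randn(32, 1)   # predicted state values (hypothetical)
rewards = torch.randn(32, 1)  # regression targets (hypothetical)

# The described loss reduces to a mean-squared error between the two.
loss = F.mse_loss(values, rewards)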

References

  1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

Returns

Example

values = replay.value()
rewards = replay.reward()
loss = a2c.state_value_loss(values, rewards)

cherry.algorithms.ppo

Description

Helper functions for implementing PPO.

policy_loss

policy_loss(new_log_probs, old_log_probs, advantages, clip=0.1)

[Source]

Description

The clipped policy loss of Proximal Policy Optimization.
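
For reference, the clipped surrogate of Schulman et al. 2017 is usually computed as in the sketch below; the exact reduction and sign convention used by cherry are assumptions here.

import torch

def ppo_policy_loss_sketch(new_log_probs, old_log_probs, advantages, clip=0.1):
    # Probability ratio between the new and the old (data-collecting) policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped surrogate objectives.
    surrogate = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # Pessimistic bound, negated so it can be minimized.
    return -torch.min(surrogate, clipped).mean()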

References

  1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

Returns

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach(),
                   clip=0.2)

state_value_loss

state_value_loss(new_values, old_values, rewards, clip=0.1)

[Source]

Description

The clipped state-value loss of Proximal Policy Optimization.
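
A common form of the clipped value loss, found in many PPO implementations, is sketched below; whether cherry uses exactly this pessimistic maximum (and the 0.5 factor) is an assumption.

import torch

def ppo_value_loss_sketch(new_values, old_values, rewards, clip=0.1):
    # Clip the new value predictions to stay close to the old ones.
    clipped_values = old_values + torch.clamp(new_values - old_values, -clip, clip)
    # Squared errors of both the raw and the clipped predictions.
    unclipped_loss = (new_values - rewards) ** 2
    clipped_loss = (clipped_values - rewards) ** 2
    # Element-wise pessimistic bound, averaged over the batch.
    return 0.5 * torch.max(unclipped_loss, clipped_loss).mean()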

References

  1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

Returns

Example

values = v_function(batch.state())
value_loss = ppo.state_value_loss(values,
                                  batch.value().detach(),
                                  batch.reward(),
                                  clip=0.2)

cherry.algorithms.trpo

Description

Helper functions for implementing Trust-Region Policy Optimization.

Recall that TRPO strives to solve the following constrained objective:

$$\max_\theta \;\; \mathbb{E}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s, a) \right] \quad \text{subject to} \quad \mathbb{E}\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\middle\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta$$

policy_loss

policy_loss(new_log_probs, old_log_probs, advantages)

[Source]

Description

The policy loss for Trust-Region Policy Optimization.

This is also known as the surrogate loss.
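
A minimal sketch of that surrogate, assuming the standard importance-weighted form (the sign convention is again an assumption):

import torch

def trpo_surrogate_loss_sketch(new_log_probs, old_log_probs, advantages):
    # Importance weights between the new and the old (data-collecting) policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Negated so that minimizing the loss maximizes the surrogate objective.
    return -(ratio * advantages).mean()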

References

  1. Schulman et al. 2015. “Trust Region Policy Optimization.” ICML 2015.

Arguments

Returns

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach())

hessian_vector_product

hessian_vector_product(loss, parameters, damping=1e-05)

[Source]

Description

Returns a callable that computes the product of the Hessian of loss (w.r.t. parameters) with another vector, using Pearlmutter's trick.

Note that both parameters and the argument of the returned callable can be a tensor or a list of tensors.

References

  1. Pearlmutter, B. A. 1994. “Fast Exact Multiplication by the Hessian.” Neural Computation.

Arguments

Returns

Example

pass
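
The example above is left as a pass placeholder. The sketch below relies only on the signature documented here (a callable built from a loss and its parameters); the toy quadratic loss and the shapes are hypothetical.

import torch as th
from cherry.algorithms import trpo

# Hypothetical parameters and a differentiable toy loss (Hessian = 2 * I).
parameters = [th.randn(3, requires_grad=True), th.randn(2, requires_grad=True)]
loss = sum((p ** 2).sum() for p in parameters)

# Build the Hessian-vector product callable.
hvp = trpo.hessian_vector_product(loss, parameters, damping=1e-5)

# Multiply the Hessian by an arbitrary vector with matching shapes.
vector = [th.randn_like(p) for p in parameters]
product = hvp(vector)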

conjugate_gradient

conjugate_gradient(Ax, b, num_iterations=10, tol=1e-10, eps=1e-08)

[Source]

Description

Computes $x = A^{-1} b$, i.e. the solution of the linear system $Ax = b$, using the conjugate gradient algorithm.

Credit

Adapted from Kai Arulkumaran's implementation, with additions inspired by John Schulman's implementation.

References

  1. Nocedal and Wright. 2006. "Numerical Optimization, 2nd edition". Springer.
  2. Shewchuk et al. 1994. “An Introduction to the Conjugate Gradient Method without the Agonizing Pain.” CMU.

Arguments

Returns

Example

pass
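
The example is likewise a pass placeholder. Below is a sketch based on the documented signature; treating Ax as a callable that returns the matrix-vector product (as in typical TRPO code, e.g. fed with the hessian_vector_product callable above) is an assumption.

import torch as th
from cherry.algorithms import trpo

# Small symmetric positive-definite system A x = b (hypothetical).
A = th.tensor([[4.0, 1.0],
               [1.0, 3.0]])
b = th.tensor([1.0, 2.0])

# Assumption: the first argument computes the product A @ x for a given x.
Ax = lambda x: A @ x

x = trpo.conjugate_gradient(Ax, b, num_iterations=10)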

cherry.algorithms.sac

Description

Helper functions for implementing Soft Actor-Critic.

You should update the function approximators in the following order:

  1. Entropy weight update.
  2. Action-value update.
  3. State-value update. (Optional, cf. below.)
  4. Policy update.

Note that most recent implementations of SAC omit step 3 above by using the Bellman residual instead of modelling a state-value function. For an example of such an implementation, refer to this link. A sketch of this update order is given below.
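
To make the ordering concrete, here is a hedged sketch of a single update built from the helpers documented on this page. The networks, optimizers, and target computations (policy, qf, vf, target_vf, the log_alpha parameter, the use of rsample) are illustrative assumptions, not cherry's prescribed training loop.

import torch as th
from cherry.algorithms import sac

def sac_update(batch, policy, qf, vf, target_vf,
               policy_opt, qf_opt, vf_opt, alpha_opt,
               log_alpha, target_entropy, gamma=0.99):
    densities = policy(batch.state())
    actions = densities.rsample()  # reparameterized, keeps the policy loss differentiable
    log_probs = densities.log_prob(actions)
    alpha = log_alpha.exp().item()

    # 1. Entropy weight update.
    alpha_loss = sac.entropy_weight_loss(log_alpha, log_probs.detach(), target_entropy)
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()

    # 2. Action-value update, bootstrapping from the target value network.
    value = qf(batch.state(), batch.action().detach())
    next_value = target_vf(batch.next_state()).detach()
    qf_loss = sac.action_value_loss(value, next_value, batch.reward(),
                                    batch.done(), gamma)
    qf_opt.zero_grad()
    qf_loss.backward()
    qf_opt.step()

    # 3. State-value update (optional, cf. the note above).
    v_value = vf(batch.state())
    q_value = qf(batch.state(), actions.detach())
    vf_loss = sac.state_value_loss(v_value, log_probs.detach(),
                                   q_value.detach(), alpha=alpha)
    vf_opt.zero_grad()
    vf_loss.backward()
    vf_opt.step()

    # 4. Policy update. Gradients also flow into qf here; they are discarded
    #    by qf_opt.zero_grad() on the next call (cf. the note in policy_loss).
    q_curr = qf(batch.state(), actions)
    pi_loss = sac.policy_loss(log_probs, q_curr, alpha=alpha)
    policy_opt.zero_grad()
    pi_loss.backward()
    policy_opt.step()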

policy_loss

policy_loss(log_probs, q_curr, alpha=1.0)

[Source]

Description

The policy loss of the Soft Actor-Critic.

New actions are sampled from the target policy, and those are used to compute the Q-values. While we should back-propagate through the Q-values to the policy parameters, we shouldn't use that gradient to optimize the Q parameters. This is often avoided either by using a target Q function or by zeroing out the gradients of the Q-function parameters.
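
One hedged way to realize the second option (zeroing out the Q gradients) is sketched below; the tiny stand-in networks, shapes, and optimizer are hypothetical, and only sac.policy_loss comes from this page.

import torch as th
from torch import nn, optim
from cherry.algorithms import sac

# Hypothetical stand-ins for the policy head and the Q-function.
policy_head = nn.Linear(4, 2)
q_function = nn.Linear(4 + 2, 1)
policy_optimizer = optim.Adam(policy_head.parameters(), lr=3e-4)

state = th.randn(8, 4)
mean = policy_head(state)
density = th.distributions.Normal(mean, th.ones_like(mean))
action = density.rsample()  # reparameterized sample keeps the policy differentiable
log_prob = density.log_prob(action).sum(dim=-1, keepdim=True)
q_curr = q_function(th.cat([state, action], dim=-1))

loss = sac.policy_loss(log_prob, q_curr, alpha=0.1)
policy_optimizer.zero_grad()
loss.backward()
# Discard the gradients that flowed into the Q parameters, so that a later
# Q update does not reuse them (the alternative is a target Q network).
for p in q_function.parameters():
    if p.grad is not None:
        p.grad.zero_()
policy_optimizer.step()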

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

Returns

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_curr = q_function(batch.state(), actions)
loss = policy_loss(log_probs, q_curr, alpha=0.1)

action_value_loss

action_value_loss(value, next_value, rewards, dones, gamma)

[Source]

Description

The action-value loss of the Soft Actor-Critic.

value should be the value of the current state-action pair, estimated via the Q-function. next_value is the expected value of the next state; it can be estimated via a V-function, or alternatively by computing the Q-value of the next observed state-action pair. In the latter case, make sure that the action is sampled according to the current policy, not the one used to gather the data.
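
The description suggests a regression onto a one-step Bellman target; the exact form below (in particular how dones masks the bootstrap term) is an assumption rather than a restatement of the implementation.

import torch

def action_value_loss_sketch(value, next_value, rewards, dones, gamma):
    # One-step Bellman target; the bootstrap is masked out at episode ends.
    target = rewards + gamma * (1.0 - dones) * next_value
    # Mean-squared error between the Q-estimate and its (detached) target.
    return ((value - target.detach()) ** 2).mean()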

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

Returns

Example

value = qf(batch.state(), batch.action().detach())
next_value = target_vf(batch.next_state())
loss = action_value_loss(value,
                         next_value,
                         batch.reward(),
                         batch.done(),
                         gamma=0.99)

state_value_loss

state_value_loss(v_value, log_probs, q_value, alpha=1.0)

[Source]

Description

The state-value loss of the Soft Actor-Critic.

This update is computed "on-policy": states are sampled from a replay but the state values, action values, and log-densities are computed using the current value functions and policy.
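
Following the SAC papers, the state-value target is presumably the soft Q-value, i.e. the Q-estimate minus the entropy penalty; the sketch below reflects that assumption.

import torch

def state_value_loss_sketch(v_value, log_probs, q_value, alpha=1.0):
    # Soft state-value target: Q-value minus the weighted log-density.
    target = q_value - alpha * log_probs
    # Mean-squared error between the V-estimate and the (detached) target.
    return ((v_value - target.detach()) ** 2).mean()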

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

Returns

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_value = qf(batch.state(), actions)
v_value = vf(batch.state())
loss = state_value_loss(v_value,
                        log_probs,
                        q_value,
                        alpha=0.1)

entropy_weight_loss

entropy_weight_loss(log_alpha, log_probs, target_entropy)

[Source]

Description

Loss of the entropy weight, to automatically tune it.

The target entropy needs to be tuned manually. However, a popular heuristic for TanhNormal policies is to use the negative of the action-space dimensionality (e.g. -4 for a quad-rotor whose actions are its 4 motor voltages).
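
For reference, the automatic temperature tuning of Haarnoja et al. 2018 is usually implemented as below; treating this as cherry's exact implementation is an assumption.

import torch

def entropy_weight_loss_sketch(log_alpha, log_probs, target_entropy):
    # Pushes the policy entropy toward target_entropy by adapting log_alpha.
    return -(log_alpha * (log_probs + target_entropy).detach()).mean()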

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

Returns

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
target_entropy = -np.prod(env.action_space.shape).item()
loss = entropy_weight_loss(alpha.log(),
                           log_probs,
                           target_entropy)