# cherry.algorithms.a2c

Description

Helper functions for implementing A2C.

A2C simply computes the gradient of the policy as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A(s_t, a_t) \right]$$

## policy_loss

policy_loss(log_probs, advantages)


[Source]

Description

The policy loss of the Advantage Actor-Critic.

This function simply performs an element-wise multiplication and a mean reduction.
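In plain Python, the computation can be sketched as follows. This is a pedagogical sketch, not cherry's actual implementation; it assumes the usual convention of negating the objective so that minimizing the loss performs gradient ascent on the advantage-weighted log-likelihood.

```python
def a2c_policy_loss(log_probs, advantages):
    # Element-wise product of log-densities and advantages, mean-reduced.
    # The sign is flipped so that minimizing this loss maximizes the
    # expected advantage-weighted log-likelihood.
    products = [lp * adv for lp, adv in zip(log_probs, advantages)]
    return -sum(products) / len(products)
```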

References

1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

• log_probs (tensor) - Log-density of the selected actions.
• advantages (tensor) - Advantage of the action-state pairs.

Returns

• (tensor) - The policy loss for the given arguments.

Example

advantages = replay.advantage()
log_probs = replay.log_prob()
loss = a2c.policy_loss(log_probs, advantages)


## state_value_loss

state_value_loss(values, rewards)


[Source]

Description

The state-value loss of the Advantage Actor-Critic.

This function is equivalent to a MSELoss.
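Concretely, the mean-squared-error computation looks like this (a plain-Python sketch, independent of cherry's tensor-based API):

```python
def a2c_state_value_loss(values, rewards):
    # Mean squared error between predicted state values and observed rewards.
    return sum((v - r) ** 2 for v, r in zip(values, rewards)) / len(values)
```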

References

1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

• values (tensor) - Predicted values for some states.
• rewards (tensor) - Observed rewards for those states.

Returns

• (tensor) - The value loss for the given arguments.

Example

values = replay.value()
rewards = replay.reward()
loss = a2c.state_value_loss(values, rewards)


# cherry.algorithms.ppo

Description

Helper functions for implementing PPO.

## policy_loss

policy_loss(new_log_probs, old_log_probs, advantages, clip=0.1)


[Source]

Description

The clipped policy loss of Proximal Policy Optimization.
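The clipped surrogate takes the pessimistic minimum of the unclipped and clipped importance-weighted advantages. A plain-Python sketch of that computation (not cherry's implementation):

```python
import math

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip=0.1):
    losses = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(min(ratio, 1.0 + clip), 1.0 - clip)
        # Pessimistic (smaller) surrogate, negated for minimization.
        losses.append(-min(ratio * adv, clipped_ratio * adv))
    return sum(losses) / len(losses)
```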

References

1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

• new_log_probs (tensor) - The log-density of actions from the target policy.
• old_log_probs (tensor) - The log-density of actions from the behaviour policy.
• advantages (tensor) - Advantage of the actions.
• clip (float, optional, default=0.1) - The clipping coefficient.

Returns

• (tensor) - The clipped policy loss for the given arguments.

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach(),
                   clip=0.2)


## state_value_loss

state_value_loss(new_values, old_values, rewards, clip=0.1)


[Source]

Description

The clipped state-value loss of Proximal Policy Optimization.
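One common formulation (PPO2-style value clipping; the sketch below assumes it, and cherry's exact scaling may differ) clips the new value predictions to stay within `clip` of the old ones and takes the pessimistic squared error:

```python
def ppo_state_value_loss(new_values, old_values, rewards, clip=0.1):
    total = 0.0
    for new, old, reward in zip(new_values, old_values, rewards):
        # Keep the updated value prediction within `clip` of the old one.
        clipped = old + max(min(new - old, clip), -clip)
        # Pessimistic choice: the larger of the two squared errors.
        total += max((new - reward) ** 2, (clipped - reward) ** 2)
    return total / len(new_values)
```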

References

1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

• new_values (tensor) - State values from the optimized value function.
• old_values (tensor) - State values from the reference value function.
• rewards (tensor) - Observed rewards.
• clip (float, optional, default=0.1) - The clipping coefficient.

Returns

• (tensor) - The clipped value loss for the given arguments.

Example

values = v_function(batch.state())
value_loss = ppo.state_value_loss(values,
batch.value().detach(),
batch.reward(),
clip=0.2)


# cherry.algorithms.trpo

Description

Helper functions for implementing Trust-Region Policy Optimization.

Recall that TRPO strives to solve the following constrained objective:

$$\max_\theta \; \mathbb{E}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\text{old}}(a \mid s)} \, A(s, a) \right] \quad \text{subject to} \quad \mathbb{E}\left[ \text{KL}\left( \pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta$$

## policy_loss

policy_loss(new_log_probs, old_log_probs, advantages)


[Source]

Description

The policy loss for Trust-Region Policy Optimization.

This is also known as the surrogate loss.
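The surrogate is the importance-weighted advantage, negated for minimization; unlike PPO, no clipping is applied, since the trust region is enforced separately. A plain-Python sketch (not cherry's implementation):

```python
import math

def trpo_surrogate_loss(new_log_probs, old_log_probs, advantages):
    # Importance ratios pi_new / pi_old, recovered from log-densities.
    ratios = [math.exp(n - o) for n, o in zip(new_log_probs, old_log_probs)]
    weighted = [r * a for r, a in zip(ratios, advantages)]
    return -sum(weighted) / len(weighted)
```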

References

1. Schulman et al. 2015. “Trust Region Policy Optimization.” ICML 2015.

Arguments

• new_log_probs (tensor) - The log-density of actions from the target policy.
• old_log_probs (tensor) - The log-density of actions from the behaviour policy.
• advantages (tensor) - Advantage of the actions.

Returns

• (tensor) - The policy loss for the given arguments.

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach())


## hessian_vector_product

hessian_vector_product(loss, parameters, damping=1e-05)


[Source]

Description

Returns a callable that computes the product of the Hessian of loss (w.r.t. parameters) with another vector, using Pearlmutter's trick.

Note that parameters and the argument of the callable can be tensors or lists of tensors.
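cherry computes the product exactly with Pearlmutter's trick via automatic differentiation. The finite-difference sketch below only illustrates the underlying idea of multiplying by the Hessian without ever materializing it; the function `grad_f` is a made-up example, not part of this API.

```python
def grad_f(x):
    # Gradient of f(x, y) = x**2 * y + y**2, i.e. (2xy, x**2 + 2y).
    return [2.0 * x[0] * x[1], x[0] ** 2 + 2.0 * x[1]]

def hessian_vector_product(x, vector, eps=1e-5):
    # Hv is approximately (grad(x + eps*v) - grad(x)) / eps:
    # a directional derivative of the gradient along v.
    g0 = grad_f(x)
    g1 = grad_f([xi + eps * vi for xi, vi in zip(x, vector)])
    return [(a - b) / eps for a, b in zip(g1, g0)]
```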

References

1. Pearlmutter, B. A. 1994. “Fast Exact Multiplication by the Hessian.” Neural Computation.

Arguments

• loss (tensor) - The loss of which to compute the Hessian.
• parameters (tensor or list) - The tensors to take the gradient with respect to.
• damping (float, optional, default=1e-5) - Damping of the Hessian-vector product.

Returns

• hvp(other) (callable) - A function to compute the Hessian-vector product, given a vector or list other.

Example

A minimal sketch (the KL computation, policy, and probe vector below are assumed for illustration; they are not prescribed by this API):

kl = kl_divergence(new_densities, old_densities).mean()
parameters = list(policy.parameters())
hvp = trpo.hessian_vector_product(kl, parameters)
product = hvp([torch.ones_like(p) for p in parameters])


## conjugate_gradient

conjugate_gradient(Ax, b, num_iterations=10, tol=1e-10, eps=1e-08)


[Source]

Description

Computes $x = A^{-1}b$ using the conjugate gradient algorithm.
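On a tiny dense system, the algorithm looks like this (a plain-Python sketch of standard conjugate gradient, independent of cherry's tensor-based API):

```python
def conjugate_gradient(Ax, b, num_iterations=10, tol=1e-10):
    # Iteratively solves A x = b for symmetric positive-definite A,
    # using only matrix-vector products Ax(v).
    x = [0.0] * len(b)
    r = list(b)          # residual b - A @ x, with x = 0 initially
    p = list(r)          # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(num_iterations):
        Ap = Ax(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Solve [[4, 1], [1, 3]] @ x = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
Ax = lambda v: [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]
x = conjugate_gradient(Ax, [1.0, 2.0])
```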

Credit

Adapted from Kai Arulkumaran's implementation, with additions inspired by John Schulman's implementation.

References

1. Nocedal and Wright. 2006. "Numerical Optimization, 2nd edition". Springer.
2. Shewchuk et al. 1994. “An Introduction to the Conjugate Gradient Method without the Agonizing Pain.” CMU.

Arguments

• Ax (callable) - Given a vector x, computes A@x.
• b (tensor or list) - The reference vector.
• num_iterations (int, optional, default=10) - Number of conjugate gradient iterations.
• tol (float, optional, default=1e-10) - Tolerance for proposed solution.
• eps (float, optional, default=1e-8) - Numerical stability constant.

Returns

• x (tensor or list) - The solution to Ax = b, as a list if b is a list else a tensor.

Example

A minimal sketch (assuming hvp was built with hessian_vector_product and grad is the policy-gradient vector; both are placeholders here):

step = trpo.conjugate_gradient(hvp, grad, num_iterations=10)


# cherry.algorithms.sac

Description

Helper functions for implementing Soft Actor-Critic.

You should update the function approximators according to the following order.

1. Entropy weight update.
2. Action-value update.
3. State-value update. (Optional, cf. below.)
4. Policy update.

Note that most recent implementations of SAC omit step 3 above by using the Bellman residual instead of modelling a state-value function. For an example of such an implementation, refer to this link.

## policy_loss

policy_loss(log_probs, q_curr, alpha=1.0)


[Source]

Description

The policy loss of the Soft Actor-Critic.

New actions are sampled from the target policy, and those are used to compute the Q-values. While we should back-propagate through the Q-values to the policy parameters, we shouldn't use that gradient to optimize the Q parameters. This is often avoided either by using a target Q function, or by zeroing out the gradients of the Q function parameters.
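The computation follows the standard formulation from Haarnoja et al. 2018, minimizing the expectation of α·log π − Q. A plain-Python sketch (not cherry's implementation):

```python
def sac_policy_loss(log_probs, q_curr, alpha=1.0):
    # Trade off entropy (via the log-densities) against the learned Q-values:
    # minimizing this pushes the policy toward high-Q, high-entropy actions.
    terms = [alpha * lp - q for lp, q in zip(log_probs, q_curr)]
    return sum(terms) / len(terms)
```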

References

1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

• log_probs (tensor) - Log-density of the selected actions.
• q_curr (tensor) - Q-values of state-action pairs.
• alpha (float, optional, default=1.0) - Entropy weight.

Returns

• (tensor) - The policy loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_curr = q_function(batch.state(), actions)
loss = policy_loss(log_probs, q_curr, alpha=0.1)


## action_value_loss

action_value_loss(value, next_value, rewards, dones, gamma)


[Source]

Description

The action-value loss of the Soft Actor-Critic.

value should be the value of the current state-action pair, estimated via the Q-function. next_value is the expected value of the next state; it can be estimated via a V-function, or alternatively by computing the Q-value of the next observed state-action pair. In the latter case, make sure that the action is sampled according to the current policy, not the one used to gather the data.
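In effect this is a one-step temporal-difference target with a mean-squared-error penalty. A plain-Python sketch (not cherry's implementation; it assumes dones is 0/1-valued):

```python
def sac_action_value_loss(values, next_values, rewards, dones, gamma):
    total = 0.0
    for q, next_v, reward, done in zip(values, next_values, rewards, dones):
        # Bootstrap from the next state's value unless the episode ended.
        target = reward + (1.0 - done) * gamma * next_v
        total += (q - target) ** 2
    return total / len(values)
```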

References

1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

• value (tensor) - Action values of the actual transition.
• next_value (tensor) - State values of the resulting state.
• rewards (tensor) - Observed rewards of the transition.
• dones (tensor) - Which states were terminal.
• gamma (float) - Discount factor.

Returns

• (tensor) - The action-value loss for the given arguments.

Example

value = qf(batch.state(), batch.action().detach())
next_value = target_vf(batch.next_state())
loss = action_value_loss(value,
                         next_value,
                         batch.reward(),
                         batch.done(),
                         gamma=0.99)


## state_value_loss

state_value_loss(v_value, log_probs, q_value, alpha=1.0)


[Source]

Description

The state-value loss of the Soft Actor-Critic.

This update is computed "on-policy": states are sampled from a replay but the state values, action values, and log-densities are computed using the current value functions and policy.
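The target for the state value is the soft value Q − α·log π, penalized with a mean squared error. A plain-Python sketch (not cherry's implementation):

```python
def sac_state_value_loss(v_value, log_probs, q_value, alpha=1.0):
    total = 0.0
    for v, lp, q in zip(v_value, log_probs, q_value):
        target = q - alpha * lp  # soft state-value target
        total += (v - target) ** 2
    return total / len(v_value)
```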

References

1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

• v_value (tensor) - State values for some observed states.
• log_probs (tensor) - Log-density of actions sampled from the current policy.
• q_value (tensor) - Action values of the actions for the current policy.
• alpha (float, optional, default=1.0) - Entropy weight.

Returns

• (tensor) - The state value loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_value = qf(batch.state(), actions)
v_value = vf(batch.state())
loss = state_value_loss(v_value,
                        log_probs,
                        q_value,
                        alpha=0.1)


## entropy_weight_loss

entropy_weight_loss(log_alpha, log_probs, target_entropy)


[Source]

Description

Loss of the entropy weight, to automatically tune it.

The target entropy needs to be manually tuned. However, a popular heuristic for TanhNormal policies is to use the negative of the action-space dimensionality. (e.g. -4 for a quad-rotor controlled through its four rotor voltages.)
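The loss from Haarnoja et al. 2018 is the negated expectation of log α · (log π + target entropy); its gradient drives α up whenever the policy's entropy falls below the target. A plain-Python sketch (not cherry's implementation):

```python
def entropy_weight_loss(log_alpha, log_probs, target_entropy):
    # When -mean(log_probs) (the empirical entropy) drops below the target,
    # the bracketed term turns negative and minimizing raises log_alpha.
    terms = [log_alpha * (lp + target_entropy) for lp in log_probs]
    return -sum(terms) / len(terms)
```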

References

1. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

• log_alpha (tensor) - Log of the entropy weight.
• log_probs (tensor) - Log-density of policy actions.
• target_entropy (float) - Target of the entropy value.

Returns

• (tensor) - The entropy weight loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
target_entropy = -np.prod(env.action_space.shape).item()
loss = entropy_weight_loss(alpha.log(),
                           log_probs,
                           target_entropy)