cherry.algorithms

cherry.algorithms.a2c

Description

Helper functions for implementing A2C.

A2C simply computes the gradient of the policy as follows:

∇θ J(θ) = E[ ∇θ log πθ(a|s) · A(s, a) ]

where A(s, a) is the advantage of taking action a in state s.

policy_loss(log_probs, advantages)

[Source]

Description

The policy loss of the Advantage Actor-Critic.

This function simply performs an element-wise multiplication and a mean reduction.
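
Concretely, the computation amounts to the following sketch (assuming the usual convention that the result is a loss to be minimized, i.e. the negated policy-gradient objective):

import torch as th

def a2c_policy_loss_sketch(log_probs, advantages):
    # element-wise product followed by a mean reduction;
    # negated so that minimizing the loss performs policy-gradient ascent
    return -th.mean(log_probs * advantages)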

References

  1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

  • log_probs (tensor) - Log-density of the selected actions.
  • advantages (tensor) - Advantage of the action-state pairs.

Returns

  • (tensor) - The policy loss for the given arguments.

Example

advantages = replay.advantage()
log_probs = replay.log_prob()
loss = a2c.policy_loss(log_probs, advantages)

state_value_loss(values, rewards)

[Source]

Description

The state-value loss of the Advantage Actor-Critic.

This function is equivalent to an MSELoss.
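
In other words, a minimal sketch of the same computation is:

import torch.nn.functional as F

def a2c_state_value_loss_sketch(values, rewards):
    # mean squared error between predicted state values and observed returns
    return F.mse_loss(values, rewards)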

References

  1. Mnih et al. 2016. “Asynchronous Methods for Deep Reinforcement Learning.” arXiv [cs.LG].

Arguments

  • values (tensor) - Predicted values for some states.
  • rewards (tensor) - Observed rewards for those states.

Returns

  • (tensor) - The value loss for the given arguments.

Example

values = replay.value()
rewards = replay.reward()
loss = a2c.state_value_loss(values, rewards)

cherry.algorithms.ppo

Description

Helper functions for implementing PPO.

policy_loss(new_log_probs, old_log_probs, advantages, clip = 0.1)

[Source]

Description

The clipped policy loss of Proximal Policy Optimization.
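
For reference, the clipped surrogate objective can be sketched as follows (a sketch of the standard formulation from the paper, not necessarily cherry's verbatim implementation):

import torch as th

def ppo_policy_loss_sketch(new_log_probs, old_log_probs, advantages, clip=0.1):
    # probability ratio between the target and behaviour policies
    ratio = th.exp(new_log_probs - old_log_probs)
    # unclipped and clipped surrogate objectives
    objective = ratio * advantages
    clipped = th.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantages
    # pessimistic bound, negated to obtain a loss to minimize
    return -th.mean(th.min(objective, clipped))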

References

  1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

  • new_log_probs (tensor) - The log-density of actions from the target policy.
  • old_log_probs (tensor) - The log-density of actions from the behaviour policy.
  • advantages (tensor) - Advantage of the actions.
  • clip (float, optional, default=0.1) - The clipping coefficient.

Returns

  • (tensor) - The clipped policy loss for the given arguments.

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach(),
                   clip=0.2)

state_value_loss(new_values, old_values, rewards, clip = 0.1)

[Source]

Description

The clipped state-value loss of Proximal Policy Optimization.
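
A common formulation of this clipped value loss, sketched here under the assumption that rewards serve as the value targets (as in the arguments below):

import torch as th

def ppo_state_value_loss_sketch(new_values, old_values, rewards, clip=0.1):
    # keep the new predictions within clip of the old (reference) predictions
    clipped_values = old_values + th.clamp(new_values - old_values, -clip, clip)
    # pessimistic (element-wise maximum) of clipped and unclipped squared errors
    unclipped_loss = (rewards - new_values) ** 2
    clipped_loss = (rewards - clipped_values) ** 2
    return th.mean(th.max(unclipped_loss, clipped_loss))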

References

  1. Schulman et al. 2017. “Proximal Policy Optimization Algorithms.” arXiv [cs.LG].

Arguments

  • new_values (tensor) - State values from the optimized value function.
  • old_values (tensor) - State values from the reference value function.
  • rewards (tensor) - Observed rewards.
  • clip (float, optional, default=0.1) - The clipping coefficient.

Returns

  • (tensor) - The clipped value loss for the given arguments.

Example

values = v_function(batch.state())
value_loss = ppo.state_value_loss(values,
                                  batch.value().detach(),
                                  batch.reward(),
                                  clip=0.2)

cherry.algorithms.trpo

Description

Helper functions for implementing Trust-Region Policy Optimization.

Recall that TRPO strives to solve the following constrained objective:

maxθ E[ (πθ(a|s) / πθ_old(a|s)) · A(s, a) ]   subject to   E[ KL(πθ_old(·|s) ‖ πθ(·|s)) ] ≤ δ

conjugate_gradient(Ax, b, num_iterations = 10, tol = 1e-10, eps = 1e-08)

[Source]

Description

Computes the solution x of Ax = b using the conjugate gradient algorithm.

Credit

Adapted from Kai Arulkumaran's implementation, with additions inspired by John Schulman's implementation.

References

  1. Nocedal and Wright. 2006. "Numerical Optimization, 2nd edition". Springer.
  2. Shewchuk. 1994. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.” CMU.

Arguments

  • Ax (callable) - Given a vector x, computes A@x.
  • b (tensor or list) - The reference vector.
  • num_iterations (int, optional, default=10) - Number of conjugate gradient iterations.
  • tol (float, optional, default=1e-10) - Tolerance for proposed solution.
  • eps (float, optional, default=1e-8) - Numerical stability constant.

Returns

  • x (tensor or list) - The solution to Ax = b, as a list if b is a list else a tensor.

Example

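A hypothetical usage sketch for the usual TRPO case of solving Fx = g, where F is the Hessian of the mean KL divergence and g is the policy gradient. The variables kl, surrogate_loss, and policy are illustrative and not provided by cherry:

params = list(policy.parameters())
# Hessian-vector product of the mean KL divergence (see hessian_vector_product below)
Fvp = trpo.hessian_vector_product(kl, params)
# policy gradient of the surrogate loss, as a list of tensors
grads = list(torch.autograd.grad(surrogate_loss, params, retain_graph=True))
# approximately solve F @ step = grads
step = trpo.conjugate_gradient(Fvp, grads, num_iterations=10)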

hessian_vector_product(loss, parameters, damping = 1e-05)

[Source]

Description

Returns a callable that computes the product of the Hessian of loss (w.r.t. parameters) with another vector, using Pearlmutter's trick.

Note that parameters and the argument of the callable can be tensors or lists of tensors.

References

  1. Pearlmutter, B. A. 1994. “Fast Exact Multiplication by the Hessian.” Neural Computation.

Arguments

  • loss (tensor) - The loss of which to compute the Hessian.
  • parameters (tensor or list) - The tensors to take the gradient with respect to.
  • damping (float, optional, default=1e-5) - Damping of the Hessian-vector product.

Returns

  • hvp(other) (callable) - A function to compute the Hessian-vector product, given a vector or list other.

Example

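A hypothetical sketch, assuming loss is a scalar that depends on the model's parameters (the surrounding names are illustrative):

params = list(model.parameters())
hvp = trpo.hessian_vector_product(loss, params, damping=1e-5)
# multiply the Hessian of loss with a random vector shaped like the parameters
vector = [torch.randn_like(p) for p in params]
hessian_times_vector = hvp(vector)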

line_search(params_init, params_update, model, stop_criterion, initial_stepsize = 1.0, backtrack_factor = 0.5, max_iterations = 15)

[Source]

Description

Performs a backtracking line-search over the model parameters, given a parameter update direction and a stopping criterion.

Credit

Adapted from Kai Arulkumaran's implementation, with additions inspired by John Schulman's implementation.

References

  1. Nocedal and Wright. 2006. "Numerical Optimization, 2nd edition". Springer.

Arguments

  • params_init (tensor or iterable) - Initial parameter values.
  • params_update (tensor or iterable) - Update direction.
  • model (Module) - The model to be updated.
  • stop_criterion (callable) - Given a model, decides whether to stop the line-search.
  • initial_stepsize (float, optional, default=1.0) - Initial stepsize of search.
  • backtrack_factor (float, optional, default=0.5) - Backtracking factor.
  • max_iterations (int, optional, default=15) - Max number of backtracking iterations.

Returns

  • new_model (Module) - The updated model if line-search is successful, else the model with initial parameter values.

Example

def ls_criterion(new_policy):
    new_density = new_policy(states)
    new_kl = kl_divergence(old_density, new_density).mean()
    new_loss = - qvalue(new_density.sample()).mean()
    return new_loss < policy_loss and new_kl < max_kl

with torch.no_grad():
    policy = trpo.line_search(
        params_init=policy.parameters(),
        params_update=step,
        model=policy,
        stop_criterion=ls_criterion
    )

policy_loss(new_log_probs, old_log_probs, advantages)

[Source]

Description

The policy loss for Trust-Region Policy Optimization.

This is also known as the surrogate loss.
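
The surrogate is the importance-weighted advantage; a minimal sketch of the computation (not necessarily cherry's exact code) is:

import torch as th

def trpo_policy_loss_sketch(new_log_probs, old_log_probs, advantages):
    # importance-sampling ratio between the target and behaviour policies
    ratio = th.exp(new_log_probs - old_log_probs)
    # negated so that minimizing the loss maximizes the surrogate objective
    return -th.mean(ratio * advantages)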

References

  1. Schulman et al. 2015. “Trust Region Policy Optimization.” ICML 2015.

Arguments

  • new_log_probs (tensor) - The log-density of actions from the target policy.
  • old_log_probs (tensor) - The log-density of actions from the behaviour policy.
  • advantages (tensor) - Advantage of the actions.

Returns

  • (tensor) - The policy loss for the given arguments.

Example

advantage = ch.pg.generalized_advantage(GAMMA,
                                        TAU,
                                        replay.reward(),
                                        replay.done(),
                                        replay.value(),
                                        next_state_value)
new_densities = policy(replay.state())
new_logprobs = new_densities.log_prob(replay.action())
loss = policy_loss(new_logprobs,
                   replay.logprob().detach(),
                   advantage.detach())

cherry.algorithms.sac

Description

Helper functions for implementing Soft-Actor Critic.

You should update the function approximators according to the following order.

  1. Entropy weight update.
  2. Action-value update.
  3. State-value update. (Optional, cf. below)
  4. Policy update.

Note that most recent implementations of SAC omit step 3 above by using the Bellman residual instead of modelling a state-value function. For an example of such an implementation, refer to this link. A sketch of one full update step is given below.
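
The following sketch shows one full update step in that order, using the Bellman-residual variant (step 3 skipped). Names such as qf, target_qf, log_alpha, and the optimizers are illustrative and not provided by cherry:

densities = policy(batch.state())
actions = densities.rsample()  # reparameterized sample so gradients reach the policy
log_probs = densities.log_prob(actions)

# 1. entropy weight update
alpha_loss = sac.entropy_weight_loss(log_alpha, log_probs.detach(), target_entropy)
alpha_optimizer.zero_grad()
alpha_loss.backward()
alpha_optimizer.step()
alpha = log_alpha.exp().item()

# 2. action-value update (Bellman residual with a target Q-function)
value = qf(batch.state(), batch.action().detach())
next_densities = policy(batch.next_state())
next_actions = next_densities.rsample()
next_value = target_qf(batch.next_state(), next_actions)
next_value = next_value - alpha * next_densities.log_prob(next_actions)
qf_loss = sac.action_value_loss(value, next_value.detach(),
                                batch.reward(), batch.done(), gamma=0.99)
qf_optimizer.zero_grad()
qf_loss.backward()
qf_optimizer.step()

# 4. policy update (gradients also flow into qf here; they are cleared by
# qf_optimizer.zero_grad() at the next iteration)
q_curr = qf(batch.state(), actions)
pi_loss = sac.policy_loss(log_probs, q_curr, alpha=alpha)
policy_optimizer.zero_grad()
pi_loss.backward()
policy_optimizer.step()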

SAC (AlgorithmArguments) dataclass

[Source]

Description
Arguments
  • batch_size (int, optional, default=512) - The number of samples to get from the replay.
  • discount (float, optional, default=0.99) - Discount factor.
  • use_automatic_entropy_tuning (bool, optional, default=True) - Whether to learn the entropy weight automatically.
  • policy_delay (int, optional, default=2) - Update the policy every policy_delay iterations.
  • target_delay (int, optional, default=2) - Update the target networks every target_delay iterations.
  • target_polyak_weight (float, optional, default=0.01) - Weight of the Polyak average when updating the target networks.
Example

__init__(self, batch_size: int = 512, discount: float = 0.99, use_automatic_entropy_tuning: bool = True, policy_delay: int = 2, target_delay: int = 2, target_polyak_weight: float = 0.01) -> None special
action_value_loss(value, next_value, rewards, dones, gamma) staticmethod

[Source]

Description

The action-value loss of the Soft Actor-Critic.

value should be the value of the current state-action pair, estimated via the Q-function. next_value is the expected value of the next state; it can be estimated via a V-function, or alternatively by computing the Q-value of the next observed state-action pair. In the latter case, make sure that the action is sampled according to the current policy, not the one used to gather the data.
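
In terms of the Bellman backup, the computation is roughly the following (a sketch, not necessarily cherry's exact code):

import torch.nn.functional as F

def sac_action_value_loss_sketch(value, next_value, rewards, dones, gamma):
    # one-step soft Bellman target; terminal transitions do not bootstrap
    target = rewards + (1.0 - dones) * gamma * next_value
    return F.mse_loss(value, target.detach())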

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

  • value (tensor) - Action values of the actual transition.
  • next_value (tensor) - State values of the resulting state.
  • rewards (tensor) - Observed rewards of the transition.
  • dones (tensor) - Which states were terminal.
  • gamma (float) - Discount factor.

Returns

  • (tensor) - The action-value loss for the given arguments.

Example

value = qf(batch.state(), batch.action().detach())
next_value = target_vf(batch.next_state())
loss = action_value_loss(value,
                         next_value,
                         batch.reward(),
                         batch.done(),
                         gamma=0.99)

actions_log_probs(density) staticmethod
batch_size: int dataclass-field
discount: float dataclass-field
entropy_weight_loss(log_alpha, log_probs, target_entropy) staticmethod

[Source]

Description

Loss of the entropy weight, to automatically tune it.

The target entropy needs to be tuned manually. However, a popular heuristic for TanhNormal policies is to use the negative of the action-space dimensionality (e.g. -4 when controlling the four motor voltages of a quad-rotor).
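
The underlying computation typically looks as follows (a sketch of the standard formulation, not necessarily cherry's verbatim code):

import torch as th

def entropy_weight_loss_sketch(log_alpha, log_probs, target_entropy):
    # only log_alpha receives gradients; the policy term is treated as a constant
    return -th.mean(log_alpha * (log_probs + target_entropy).detach())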

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

  • log_alpha (tensor) - Log of the entropy weight.
  • log_probs (tensor) - Log-density of policy actions.
  • target_entropy (float) - Target of the entropy value.

Returns

  • (tensor) - The entropy weight loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
target_entropy = -np.prod(env.action_space.shape).item()
loss = entropy_weight_loss(alpha.log(),
                           log_probs,
                           target_entropy)

policy_delay: int dataclass-field
policy_loss(log_probs, q_curr, alpha = 1.0) staticmethod

[Source]

Description

The policy loss of the Soft Actor-Critic.

New actions are sampled from the target policy, and those are used to compute the Q-values. While we should back-propagate through the Q-values to the policy parameters, we shouldn't use that gradient to optimize the Q parameters. This is often avoided by either using a target Q-function, or by zeroing out the gradients of the Q-function parameters.
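
The resulting objective is roughly the following (a sketch; as discussed above, q_curr should be computed so that its gradients do not end up updating the Q-function parameters):

import torch as th

def sac_policy_loss_sketch(log_probs, q_curr, alpha=1.0):
    # minimize the entropy-weighted log-probability minus the Q-value,
    # i.e. maximize the expected soft value of the policy
    return th.mean(alpha * log_probs - q_curr)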

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

  • log_probs (tensor) - Log-density of the selected actions.
  • q_curr (tensor) - Q-values of state-action pairs.
  • alpha (float, optional, default=1.0) - Entropy weight.

Returns

  • (tensor) - The policy loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_curr = q_function(batch.state(), actions)
loss = policy_loss(log_probs, q_curr, alpha=0.1)

state_value_loss(v_value, log_probs, q_value, alpha = 1.0) staticmethod

[Source]

Description

The state-value loss of the Soft Actor-Critic.

This update is computed "on-policy": states are sampled from a replay but the state values, action values, and log-densities are computed using the current value functions and policy.
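
Concretely, the target is the soft value of the freshly sampled actions (a sketch, not necessarily cherry's exact code):

import torch.nn.functional as F

def sac_state_value_loss_sketch(v_value, log_probs, q_value, alpha=1.0):
    # soft state-value target: Q-value minus the entropy-weighted log-probability
    target = q_value - alpha * log_probs
    return F.mse_loss(v_value, target.detach())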

References

  1. Haarnoja et al. 2018. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv [cs.LG].
  2. Haarnoja et al. 2018. “Soft Actor-Critic Algorithms and Applications.” arXiv [cs.LG].

Arguments

  • v_value (tensor) - State values for some observed states.
  • log_probs (tensor) - Log-density of actions sampled from the current policy.
  • q_value (tensor) - Action values of the actions for the current policy.
  • alpha (float, optional, default=1.0) - Entropy weight.

Returns

  • (tensor) - The state value loss for the given arguments.

Example

densities = policy(batch.state())
actions = densities.sample()
log_probs = densities.log_prob(actions)
q_value = qf(batch.state(), actions)
v_value = vf(batch.state())
loss = state_value_loss(v_value,
                        log_probs,
                        q_value,
                        alpha=0.1)

target_delay: int dataclass-field
target_polyak_weight: float dataclass-field
update(self, replay, policy, action_value, target_value, log_alpha, target_entropy, policy_optimizer, features_optimizer, value_optimizer, alpha_optimizer, features = None, target_features = None, update_policy = True, update_target = False, update_value = True, update_entropy = True, device = None, **kwargs)
unpack_config(obj, config) inherited

Returns a DotMap, picking parameters first from config and, when they are missing, from obj.

Arguments
  • obj (dataclass) - Algorithm to help fill missing values in config.
  • config (dict) - Partial configuration to get values from.
use_automatic_entropy_tuning: bool dataclass-field

cherry.algorithms.drq.DrQ (AlgorithmArguments) dataclass

[Source]

Description
Arguments
  • batch_size (int, optional, default=512) - The number of samples to get from the replay.
  • discount (float, optional, default=0.99) - Discount factor.
  • use_automatic_entropy_tuning (bool, optional, default=True) - Whether to learn the entropy weight automatically.
  • policy_delay (int, optional, default=2) - Update the policy every policy_delay iterations.
  • target_delay (int, optional, default=2) - Update the target networks every target_delay iterations.
  • target_polyak_weight (float, optional, default=0.995) - Weight of the Polyak average when updating the target networks.
Example

__init__(self, batch_size: int = 512, discount: float = 0.99, use_automatic_entropy_tuning: bool = True, policy_delay: int = 2, target_delay: int = 2, target_polyak_weight: float = 0.995) -> None special

update(self, replay, policy, action_value, target_action_value, features, target_features, log_alpha, target_entropy, policy_optimizer, action_value_optimizer, features_optimizer, alpha_optimizer, update_policy = True, update_target = False, update_value = True, update_entropy = True, augmentation_transform = None, device = None, **kwargs)