cherry.td

cherry.td.discount(gamma, rewards, dones, bootstrap=0.0)

Description

Discounts rewards at a rate of gamma.

References
  1. Sutton, Richard, and Andrew Barto. 2018. Reinforcement Learning, Second Edition. The MIT Press.
Arguments
  • gamma (float) - Discount factor.
  • rewards (tensor) - Tensor of rewards.
  • dones (tensor) - Tensor indicating episode termination. An entry is 1 if the transition led to a terminal (absorbing) state, and 0 otherwise.
  • bootstrap (float, optional, default=0.0) - Bootstrap the last reward with this value.
Returns
  • tensor - Tensor of discounted rewards.
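The returned tensor corresponds to the standard backward recursion for discounted returns, with dones cutting the accumulation at episode boundaries and bootstrap standing in for the return past the final transition. A minimal pure-PyTorch sketch of that recursion (an illustration of the semantics under that assumption, not the library's implementation) could look like:

import torch as th

def discount_sketch(gamma, rewards, dones, bootstrap=0.0):
    # Backward recursion: R_t = r_t + gamma * (1 - done_t) * R_{t+1},
    # with R past the last transition initialized to the bootstrap value.
    R = bootstrap
    discounted = th.zeros_like(rewards)
    for t in reversed(range(rewards.size(0))):
        R = rewards[t] + gamma * (1.0 - dones[t]) * R
        discounted[t] = R
    return discounted

With dones all zero and bootstrap set to an estimate of the value of the state reached after the last transition, this yields bootstrapped n-step returns.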
Example
import torch as th
import cherry as ch

# 23 transitions with reward 8; only the last transition is terminal.
rewards = th.ones(23, 1) * 8
dones = th.zeros_like(rewards)
dones[-1] += 1.0
discounted = ch.td.discount(0.99,
                            rewards,
                            dones,
                            bootstrap=1.0)

cherry.td.temporal_difference(gamma, rewards, dones, values, next_values)

Description

Returns the temporal difference residual.

References
  1. Sutton, Richard S. 1988. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3 (1): 9–44.
  2. Sutton, Richard, and Andrew Barto. 2018. Reinforcement Learning, Second Edition. The MIT Press.
Arguments
  • gamma (float) - Discount factor.
  • rewards (tensor) - Tensor of rewards.
  • dones (tensor) - Tensor indicating episode termination. An entry is 1 if the transition led to a terminal (absorbing) state, and 0 otherwise.
  • values (tensor) - Values for the states producing the rewards.
  • next_values (tensor) - Values of the states reached after the transitions that produced the rewards (i.e. the next states).
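Concretely, the residual for each transition is the one-step TD error delta = reward + gamma * (1 - done) * next_value - value. A minimal pure-PyTorch sketch of that computation (a reference for the semantics under that assumption, not the library's implementation):

def temporal_difference_sketch(gamma, rewards, dones, values, next_values):
    # One-step TD residual; terminal transitions (done == 1) drop the
    # bootstrapped value of the next state.
    return rewards + gamma * (1.0 - dones) * next_values - values

These residuals are commonly used as one-step advantage estimates or as the error signal in TD(0) value-function updates (Sutton & Barto, 2018).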
Example
import cherry as ch

# vf is a state-value network; replay is an experience replay of transitions.
values = vf(replay.state())
next_values = vf(replay.next_state())
td_errors = ch.td.temporal_difference(0.99,
                                      replay.reward(),
                                      replay.done(),
                                      values,
                                      next_values)