# cherry.td

### cherry.td.discount(gamma, rewards, dones, bootstrap=0.0)

##### Description

Discounts rewards at a rate of gamma, resetting the discounted sum at episode boundaries indicated by dones.
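
Concretely, reading the arguments below in the usual way, the returned tensor follows the backward recursion for discounted returns, where d_t is the done flag and the bootstrap value seeds the recursion (a sketch of the convention, not a statement of the library's exact masking):

$$
R_t = r_t + \gamma \, (1 - d_t) \, R_{t+1}, \qquad R_{T+1} = \text{bootstrap}
$$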

##### References
1. Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction, Second Edition. The MIT Press.
##### Arguments
• gamma (float) - Discount factor.
• rewards (tensor) - Tensor of rewards.
• dones (tensor) - Tensor indicating episode termination. Entry is 1 if the transition led to a terminal (absorbing) state, 0 otherwise.
• bootstrap (float, optional, default=0.0) - Value used to bootstrap the return past the last reward (e.g. an estimate of the value of the final state).
##### Returns
• tensor - Tensor of discounted rewards.
##### Example
import cherry as ch
import torch as th

rewards = th.ones(23, 1) * 8
dones = th.zeros_like(rewards)
dones[-1] += 1.0  # mark the last transition as terminal
discounted = ch.td.discount(0.99,
                            rewards,
                            dones,
                            bootstrap=1.0)
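
For intuition, here is a minimal re-implementation sketch of the recursion above. It assumes dones reset the running return and bootstrap seeds it; it is illustrative, not a verbatim copy of the library's code:

import torch as th

def discount_sketch(gamma, rewards, dones, bootstrap=0.0):
    # Running return, seeded with the bootstrap value.
    R = th.zeros_like(rewards[0]) + bootstrap
    discounted = th.zeros_like(rewards)
    for t in reversed(range(rewards.size(0))):
        # A terminal transition (dones[t] == 1) cuts the recursion.
        R = rewards[t] + gamma * (1.0 - dones[t]) * R
        discounted[t] = R
    return discounted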


### cherry.td.temporal_difference(gamma, rewards, dones, values, next_values)

##### Description

Returns the one-step temporal difference residual for each transition.
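
In symbols, the residual for each transition is the following, with d_t the done flag (consistent with the dones argument below), so that the bootstrapped next value is dropped at terminal transitions:

$$
\delta_t = r_t + \gamma \, (1 - d_t) \, V(s_{t+1}) - V(s_t)
$$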

##### References
1. Sutton, Richard S. 1988. “Learning to Predict by the Methods of Temporal Differences.” Machine Learning 3 (1): 9–44.
2. Sutton, Richard S., and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction, Second Edition. The MIT Press.
##### Arguments
• gamma (float) - Discount factor.
• rewards (tensor) - Tensor of rewards.
• dones (tensor) - Tensor indicating episode termination. Entry is 1 if the transition led to a terminal (absorbing) state, 0 otherwise.
• values (tensor) - Values of the states that produced the rewards.
• next_values (tensor) - Values of the states reached after each transition (i.e. the value of each next state).
##### Example
# vf is a learned state-value function; replay is a cherry.ExperienceReplay
# of collected transitions.
values = vf(replay.state())
next_values = vf(replay.next_state())
td_errors = ch.td.temporal_difference(0.99,
                                      replay.reward(),
                                      replay.done(),
                                      values,
                                      next_values)
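
For reference, the call above reduces to one line of tensor arithmetic. This sketch uses placeholder tensors; the shapes and values are assumptions for illustration, following the (length, 1) convention used elsewhere on this page:

import torch as th

gamma = 0.99
rewards = th.randn(23, 1)
dones = th.zeros(23, 1)
dones[-1] = 1.0  # last transition is terminal
values = th.randn(23, 1)
next_values = th.randn(23, 1)

# delta = r + gamma * (1 - done) * V(next_state) - V(state);
# the done mask removes the bootstrapped next value at terminal transitions.
td_errors = rewards + gamma * (1.0 - dones) * next_values - values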