
The mathematics behind TD

The temporal difference (TD) model (Sutton & Barto, 1990) is an extension of the ideas underlying the RW model (Rescorla & Wagner, 1972). Most notably, the TD model abandons the construct of a "trial", favoring instead time-based formulations. Also notable is the introduction of eligibility traces, which allow the model to bridge temporal gaps and deal with the credit assignment problem.

Implementation note: As of calmr version 0.6.2, stimulus representation in TD is based on complete serial compounds (i.e., time-specific stimulus elements entirely discriminable from each other), and the eligibility traces are of the replacing type.

General Note: There are several descriptions of the TD model available; however, all of the ones I found were opaque when it came to implementation. Hence, the following description of the model focuses on implementation details.

1 - Maintaining stimulus representations

TD maintains stimulus traces as eligibility traces. The eligibility of stimulus $i$ at time $t$, $e_i^t$, is given by:

$$e_i^t = e_i^{t-1} \sigma \gamma + x_i^t \tag{Eq. 1}$$

where $\sigma$ and $\gamma$ are decay and discount parameters, respectively, and $x_i^t$ is the activation of stimulus $i$ at time $t$ (1 or 0 for present and absent stimuli, respectively).

Internally, $e_i$ is represented as a vector of length $d$, where $d$ is the number of stimulus compounds (not in the general sense of the word compound, but in terms of complete serial compounds, or CSC). For example, a 2s stimulus in a model with a time resolution of 0.5s will have $d = 4$, and the second entry in that vector represents the eligibility of the compound active after the stimulus has been present for 1s.

Similarly, $x_i^t$ denotes the specific compound of stimulus $i$ active at time $t$, not the general activation of $i$ at that time. For example, suppose two 2s stimuli, $A$ and $B$, are presented with an overlap of 1s, with $A$'s onset occurring first. Can you guess what stimulus compounds will be active at $t = 2$ with a time resolution of 0.5s?
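The trace update in Eq. 1 can be sketched in a few lines. calmr is implemented in R; the Python below is purely illustrative, and the parameter values ($\sigma = 0.9$, $\gamma = 0.95$) are arbitrary assumptions, not calmr defaults.

```python
# Sketch of the eligibility trace update (Eq. 1) for a single 2s stimulus
# at a 0.5s time resolution, i.e., d = 4 CSC elements.
# sigma and gamma values are illustrative assumptions, not calmr defaults.
sigma, gamma = 0.9, 0.95  # decay and discount parameters

def update_trace(e, x):
    """Eq. 1: e_i^t = e_i^{t-1} * sigma * gamma + x_i^t, element-wise."""
    return [ei * sigma * gamma + xi for ei, xi in zip(e, x)]

d = 4               # number of CSC elements for a 2s stimulus at 0.5s resolution
e = [0.0] * d       # eligibility vector, one entry per CSC element
# While the stimulus is present, exactly one CSC element is active per step.
for step in range(d):
    x = [1.0 if i == step else 0.0 for i in range(d)]
    e = update_trace(e, x)
print(e)  # earlier elements have decayed more than later ones
```

After the four steps, the most recently active element has eligibility 1, and each earlier element has decayed by an extra factor of $\sigma\gamma$ per step.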

2 - Generating expectations

The TD model generates stimulus expectations based on the presented stimuli, not on the strength of eligibility traces. The expectation of stimulus $j$ at time $t$, $V_j^t$, is given by:

$$V_j^t = w_j^{t'} x^t = \sum_i^K w_{i,j}^t x_i^t \tag{Eq. 2}$$

where $w_j^t$ is the vector of stimulus weights at time $t$ pointing towards $j$, $'$ denotes transposition, and $w_{i,j}$ denotes an entry in a square matrix containing the association from $i$ to $j$. As with the eligibility traces above, the entries in each matrix are the weights of specific stimulus compounds.

Internally, $w_j^t$ is constructed on a trial-by-trial, step-by-step basis, depending on the stimulus compounds active at the time.
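Eq. 2 is just an inner product between the weights pointing towards $j$ and the CSC activation vector. A minimal Python sketch, with made-up weight and activation values (calmr itself is implemented in R):

```python
# Sketch of Eq. 2: V_j^t is the dot product of the weight vector pointing
# towards j with the CSC activation vector at time t.
# Weight and activation values are illustrative assumptions.

def expectation(w_j, x):
    """Eq. 2: V_j^t = sum_i w_{i,j}^t * x_i^t."""
    return sum(wi * xi for wi, xi in zip(w_j, x))

w_j = [0.2, 0.5, 0.1]   # weights from three CSC elements towards stimulus j
x = [1.0, 0.0, 1.0]     # CSC elements 1 and 3 are active at this time step
V_j = expectation(w_j, x)
```

Only the compounds actually active at $t$ contribute to the expectation, which is why expectations are driven by presented stimuli rather than by eligibility traces.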

3 - Learning associations

As its name implies, the TD model updates associations based on a temporally discounted prediction of upcoming stimuli. This temporal difference error term is given by:

$$\delta_j^t = \lambda_j^t + \gamma V_j^t - V_j^{t-1} \tag{Eq. 3}$$

where $\lambda_j^t$ is the value of stimulus $j$ at time $t$, which also determines the asymptote for stimulus weights towards $j$.

The temporal difference error term is used to update $w$ via:

$$w_{i,j}^t = w_{i,j}^t + \alpha_i \beta(x_j^t) \delta_j^t e_i^t \tag{Eq. 4}$$

where $\alpha_i$ is a learning rate parameter for stimulus $i$, and $\beta(x_j)$ is a function that returns one of two learning rate parameters ($\beta_{on}$ or $\beta_{off}$) depending on whether $j$ is being presented or not at time $t$.
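Eqs. 3 and 4 together can be sketched as a single update step. Again, this is illustrative Python, not calmr's R implementation, and all parameter values ($\gamma$, $\alpha$, $\beta_{on}$, $\beta_{off}$) are arbitrary assumptions:

```python
# Sketch of Eqs. 3-4: compute the TD error for stimulus j and use it,
# gated by each element's eligibility trace, to update the weights
# pointing towards j. All parameter values are illustrative assumptions.
gamma = 0.95
alpha = [0.1, 0.1, 0.1]       # learning rates, one per CSC element
beta_on, beta_off = 0.5, 0.4  # learning rates for j present vs. absent

def td_update(w_j, e, lam_j, V_now, V_prev, j_present):
    # Eq. 3: temporally discounted prediction error
    delta_j = lam_j + gamma * V_now - V_prev
    beta = beta_on if j_present else beta_off
    # Eq. 4: each weight moves in proportion to its element's eligibility
    return [w + a * beta * delta_j * ei for w, a, ei in zip(w_j, alpha, e)]

w_j = [0.0, 0.0, 0.0]
e = [0.855, 1.0, 0.0]  # eligibility traces at this step (see Eq. 1)
w_j = td_update(w_j, e, lam_j=1.0, V_now=0.0, V_prev=0.0, j_present=True)
```

Note how the eligibility traces solve the credit assignment problem: elements active shortly before the outcome still receive a (decayed) share of the update, while elements with zero eligibility are untouched.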

4 - Generating responses

As with many associative learning models, the transformation between stimulus expectations and responding is left unspecified, in the hands of the user. The TD model does not return a response vector, but it suffices to assume that responding is the identity function on the expected stimulus values, as follows:

$$r_j^t = V_j^t \tag{Eq. 5}$$

References

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory. (pp. 64–69). Appleton-Century-Crofts.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. W. Moore (Eds.), Learning and computational neuroscience (pp. 497–537). MIT Press.