The mathematics behind TD
The temporal difference (TD) model (Sutton & Barto, 1990) extends the ideas underlying the Rescorla-Wagner (RW) model (Rescorla & Wagner, 1972). Most notably, the TD model abandons the construct of a “trial”, favoring instead a time-based formulation. Also notable is the introduction of eligibility traces, which allow the model to bridge temporal gaps and deal with the credit assignment problem.
Implementation note: As of calmr version 0.6.2, stimulus representation in TD is based on complete serial compounds (i.e., time-specific stimulus elements entirely discriminable from each other), and the eligibility traces are of the replacing type.
General note: There are several descriptions of the TD model out there; however, all of the ones I found were opaque when it comes to implementation. Hence, the following description of the model focuses on implementation details.
1 - Maintaining stimulus representations
TD maintains stimulus traces as eligibility traces. The eligibility of stimulus $i$ at time $t$, $e_i^t$, is given by:
$$e_i^t = \sigma \gamma e_i^{t-1} + x_i^t$$
where $\sigma$ and $\gamma$ are decay and discount parameters, respectively, and $x_i^t$ is the activation of stimulus $i$ at time $t$ (1 or 0 for present and absent stimuli, respectively).
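The trace update described above can be sketched in a few lines of Python. This is only an illustration, not calmr's code: the function name `update_eligibility` and its arguments are hypothetical, and the `replacing` flag reflects one common reading of "replacing traces", in which an active compound has its trace reset to 1 instead of accumulating.

```python
def update_eligibility(e_prev, x, decay, discount, replacing=True):
    """Return the eligibility vector of one stimulus after one time step.

    e_prev : eligibility of each compound at the previous step
    x      : activation of each compound at the current step (0 or 1)
    """
    e_next = []
    for e_old, x_now in zip(e_prev, x):
        decayed = decay * discount * e_old
        if replacing and x_now == 1:
            # Replacing trace (assumed reading): an active compound is reset to 1
            e_next.append(1.0)
        else:
            # Decayed trace plus the current activation
            e_next.append(decayed + x_now)
    return e_next


# A 4-compound stimulus whose compounds become active one after another
e = [0.0, 0.0, 0.0, 0.0]
for x in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]):
    e = update_eligibility(e, x, decay=0.8, discount=0.5)
```

After the loop, the third compound carries the freshest trace, while the first two have decayed by the product of the decay and discount parameters once per elapsed step.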
Internally, the eligibility of a stimulus, $e_i$, is represented as a vector of length $d$, where $d$ is the number of stimulus compounds (not in the general sense of the word compound, but in terms of complete serial compounds, or CSC). For example, a 2s stimulus in a model with a time resolution of 0.5s will have $d = 4$, and the second entry in that vector represents the eligibility of the compound active after the stimulus has been present for 1s.
Similarly, $x_i^t$ entails the specific compound of stimulus $i$ at time $t$, and not the general activation of $i$ at that time. For example, suppose two 2s stimuli, $A$ and $B$, are presented with an overlap of 1s, with $A$’s onset occurring first. Can you guess what stimulus compounds will be active at a given time $t$ with a time resolution of 0.5s?¹
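To make the CSC idea concrete, here is a minimal Python sketch that builds the one-hot activation vector of a single stimulus at a given time step. The function name `csc_activation` and the 0-based step indexing are assumptions made for illustration; calmr may index time differently.

```python
def csc_activation(onset_step, duration, resolution, step):
    """One-hot activation vector of a stimulus under a CSC representation.

    onset_step : time step at which the stimulus starts (0-based, assumed)
    duration   : stimulus duration in seconds
    resolution : length of one time step in seconds
    step       : current time step (0-based, assumed)
    """
    d = round(duration / resolution)  # number of serial compounds
    x = [0] * d
    position = step - onset_step      # steps elapsed since stimulus onset
    if 0 <= position < d:
        x[position] = 1               # only one compound is active at a time
    return x


# A 2 s stimulus at a 0.5 s resolution has d = 4 compounds
x = csc_activation(onset_step=0, duration=2.0, resolution=0.5, step=1)
```

Calling the function once per stimulus, each with its own onset, gives the set of compounds active at any step of an overlapping design.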
2 - Generating expectations
The TD model generates stimulus expectations² based on the presented stimuli, not on the strength of eligibility traces. The expectation of stimulus $j$ at time $t$, $V_j^t$, is given by:
$$V_j^t = w_j^{t\top} x^t = \sum_i w_{i,j}^t x_i^t$$
where $w_j^t$ is a matrix of stimulus weights at time $t$ pointing towards $j$, $\top$ denotes transposition, and $w_{i,j}$ is the entry of a square matrix that holds the association from $i$ to $j$. As with the eligibility traces above, the entries in each matrix are the weights of specific stimulus compounds.
Internally, $w_j^t$ is constructed on a trial-by-trial, step-by-step basis, depending on the stimulus compounds active at the time.
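As a rough illustration of the expectation step, the sketch below sums the weights of the currently active compounds for every target. The function name `expectations` and the nested-list layout of the weight matrix are assumptions for this example, not calmr's internal representation.

```python
def expectations(w, x):
    """Expected value of every target j given the active compounds.

    w : square matrix (list of rows); w[i][j] is the weight from i to j
    x : activation of each compound at the current step (0 or 1)
    """
    k = len(x)
    return [sum(w[i][j] * x[i] for i in range(k)) for j in range(k)]


# Three compounds; only the first two are currently active
w = [
    [0.0, 0.2, 0.5],
    [0.1, 0.0, 0.3],
    [0.4, 0.6, 0.0],
]
x = [1, 1, 0]
v = expectations(w, x)
```

Because the third compound is inactive, its row of weights contributes nothing to any expectation, regardless of how strong its eligibility trace might be.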
3 - Learning associations
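As its name suggests, the TD model updates associations based on a temporally discounted prediction of upcoming stimuli. This temporal difference error term, $\delta_j^t$, is given by:
$$\delta_j^t = \lambda_j^t + \gamma V_j^t - V_j^{t-1}$$
where $\lambda_j^t$ is the value of stimulus $j$ at time $t$, which also determines the asymptote for stimulus weights towards $j$, $\gamma$ is the discount parameter, and $V_j^t$ is the expectation of stimulus $j$ at time $t$.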
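A small numeric sketch may help here. It assumes the common form of the error, in which the stimulus value and the discounted current expectation are compared against the expectation from the previous step; the function name `td_error` and this exact time indexing are assumptions, not a statement about calmr's code.

```python
def td_error(value_j, v_now, v_prev, discount):
    """Temporal difference error for one target stimulus j.

    value_j  : value of stimulus j at the current step (0 when absent)
    v_now    : expectation of j at the current step
    v_prev   : expectation of j at the previous step
    discount : discount parameter applied to the current expectation
    """
    return value_j + discount * v_now - v_prev


# The stimulus arrives but was only partly expected: positive error
surprise = td_error(value_j=1.0, v_now=0.0, v_prev=0.6, discount=0.95)

# No stimulus, and the discounted expectation fell short: negative error
shortfall = td_error(value_j=0.0, v_now=0.8, v_prev=0.5, discount=0.5)
```

The error shrinks towards zero as the previous expectation approaches the stimulus value, which is the sense in which that value acts as an asymptote.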
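The temporal difference error term, $\delta_j^t$, is used to update the weights $w$ via:
$$w_{i,j}^t = w_{i,j}^{t-1} + \alpha_i \beta(x_j^t) \delta_j^t e_i^{t-1}$$
where $\alpha_i$ is a learning rate parameter for stimulus $i$, $e_i^{t-1}$ is the eligibility of stimulus $i$, and $\beta(x_j^t)$ is a function that returns one of two learning rate parameters ($\beta_{on}$ or $\beta_{off}$) depending on whether $j$ is being presented or not at time $t$.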
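The update can be sketched as a double loop over source and target compounds. The function name `update_weights`, its argument names, and the nested-list weight matrix are hypothetical choices for illustration; the sketch also assumes the caller passes in whichever eligibility vector the update is meant to use.

```python
def update_weights(w, e, x, delta, alphas, beta_on, beta_off):
    """Return a new weight matrix after one learning step.

    w      : w[i][j], weight from compound i to compound j
    e      : eligibility of each source compound i
    x      : activation of each target compound j (selects beta_on or beta_off)
    delta  : temporal difference error for each target j
    alphas : learning rate of each source compound i
    """
    k = len(e)
    new_w = [row[:] for row in w]  # copy so the input matrix is left untouched
    for i in range(k):
        for j in range(k):
            beta = beta_on if x[j] == 1 else beta_off
            new_w[i][j] = w[i][j] + alphas[i] * beta * delta[j] * e[i]
    return new_w


# Compound 0 is eligible, compound 1 is presented and was not expected
w = [[0.0, 0.0], [0.0, 0.0]]
new_w = update_weights(
    w, e=[1.0, 0.0], x=[0, 1], delta=[0.0, 1.0],
    alphas=[0.5, 0.5], beta_on=0.4, beta_off=0.2,
)
```

Only the weight from the eligible compound to the presented one changes in this example; a compound with zero eligibility learns nothing, whatever the size of the error.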
4 - Generating responses
As with many associative learning models, the transformation between stimulus expectations and responding is left unspecified, in the hands of the user. The TD model does not return a response vector, but it suffices to assume that responding, $r_j^t$, is the identity function on the expected stimulus values, as follows:
$$r_j^t = V_j^t$$