Data analysis - Materials and Methods - Investigating in mice the roles of different striatal subregions in reinforcement learning and reward prediction error

Chapter 2: Materials and Methods

4. Data analysis

4.1. Q-learning model. A standard reinforcement learning model was applied to estimate the reward prediction error (RPE) in the 2-choice dynamic foraging task. As is typical in modeling work, the reinforcement learning model comprises a value-updating component (i.e. how information is updated) and a choice component (i.e. how a choice is made). For the value-updating rule, we used a simplified Q-learning model, which belongs to the family of temporal difference models, to characterize the dynamic process of RPE in the 2-choice dynamic foraging task (Sutton & Barto, 1998; Watkins & Dayan, 1992). Such a rule proposes that an RPE is generated on each trial whenever the outcome deviates from the subject’s expected reward, and that this RPE drives the value update. Thus, the value of choosing the high-reward aperture was updated on each trial according to the following rule (Rutledge et al., 2009).

$$Q_{high}(t+1) = Q_{high}(t) + \alpha\,\delta(t)$$

$$\delta(t) = R_{high}(t) - Q_{high}(t)$$

where Q_high(t) is the expected value associated with choosing the high-reward rate aperture on trial t and δ(t) is the RPE, representing the discrepancy between the expectation and the reward just received. R_high(t) denotes the actual outcome received from the high-reward rate aperture on trial t. The parameter α represents the learning rate, which determines how strongly each RPE updates the expected value.
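As an illustration of the value-updating rule, the following minimal Python sketch applies the update Q_high(t+1) = Q_high(t) + α·δ(t) with δ(t) = R_high(t) − Q_high(t) to a short reward sequence. The function names, the initial value of 0, and the example rewards are assumptions chosen for illustration; they are not taken from the study.

```python
def q_learning_update(q_high, reward, alpha):
    """Apply one trial of the simplified Q-learning rule.

    Returns the updated expected value and the RPE for this trial.
    """
    rpe = reward - q_high           # delta(t) = R_high(t) - Q_high(t)
    new_q = q_high + alpha * rpe    # Q_high(t+1) = Q_high(t) + alpha * delta(t)
    return new_q, rpe


def run_trials(rewards, alpha, q_init=0.0):
    """Track the expected value and the RPE across a sequence of outcomes.

    q_init is a hypothetical starting value, not a value from the study.
    """
    q = q_init
    q_trace, rpe_trace = [], []
    for reward in rewards:
        q_trace.append(q)           # expectation held before the outcome
        q, rpe = q_learning_update(q, reward, alpha)
        rpe_trace.append(rpe)
    return q_trace, rpe_trace


if __name__ == "__main__":
    # Hypothetical outcomes: 1 = rewarded, 0 = unrewarded.
    q_trace, rpe_trace = run_trials([1, 1, 0, 1], alpha=0.5)
    for t, (q, rpe) in enumerate(zip(q_trace, rpe_trace), start=1):
        print(f"trial {t}: Q_high = {q:.4f}, RPE = {rpe:+.4f}")
```

A larger α makes the expected value follow recent outcomes more closely, so the same outcome sequence produces different trial-by-trial RPEs for different learning rates.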

Because the onsets of stimuli and outcomes were modeled trial-by-trial as separate RPE terms, δ(t), at the time of each feedback display during each trial, the magnitude of the RPE was determined by the learning rate (α) estimated from the trial-by-trial data in each testing section of the 2-choice dynamic foraging task.

Reinforcement learning also requires a balance between exploration and exploitation. For the choice rule in the reinforcement learning model, the probability of choosing the high-reward aperture on the next trial, P_high(t + 1), was assumed to be determined by the so-called softmax rule or Boltzmann exploration (Kaelbling et al., 1996), a logistic form that assigns a weight to each action according to its estimated action value.
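A common two-option form of the softmax rule is P_high(t + 1) = exp(β·Q_high(t)) / (exp(β·Q_high(t)) + exp(β·Q_low(t))). This exact form, and the symbol Q_low for the value of the other aperture, are assumptions made here for illustration and may differ from the equation used in the study. Under that assumption, a minimal Python sketch is:

```python
import math


def p_high(q_high, q_low, beta):
    """Probability of choosing the high-reward aperture (two-option softmax).

    Algebraically equal to
        exp(beta * q_high) / (exp(beta * q_high) + exp(beta * q_low)),
    written here as a logistic function of the value difference.
    q_low is a hypothetical name for the value of the other aperture.
    """
    return 1.0 / (1.0 + math.exp(-beta * (q_high - q_low)))


if __name__ == "__main__":
    # Hypothetical action values, for illustration only.
    for beta in (0.0, 1.0, 5.0):
        print(f"beta = {beta}: P_high = {p_high(0.7, 0.3, beta):.3f}")
```

With β = 0 the value difference has no influence and the choice probability is 0.5; as β grows, choices follow the value difference more deterministically.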

The parameter β represents choice perseveration (or exploration/exploitation), a term referring to the tendency to choose actions according to their reward values. A β of zero means that the subject chooses the high-reward rate aperture at chance. To estimate the learning rate (α) and the choice perseveration (β), we used a hierarchical modeling approach, Markov Chain Monte Carlo (MCMC)-based Bayesian parameter estimation, to fit the reinforcement learning model to the trial-by-trial data from the 2-choice dynamic foraging task (Lee & Wagenmakers, 2014; Wetzels, Lee, & Wagenmakers, 2010). The advantage of the Bayesian approach is that it accounts for inter-subject variability and other random effects more rigorously by using latent parameters. In particular, from the Bayesian perspective, parameters are described by informative probability distributions rather than point estimates.

A probit transformation was used to simplify the construction of the Bayesian hierarchical model. Because the Bayesian hierarchical model requires an equal number of input trials, we truncated the cumulated trials to a common length, using the smallest number of cumulated trials in the lesion and sham groups as the cutoff. The structure of this Bayesian hierarchical model is depicted in Figure 2.3. As shown in Figure 2.3, the parameters α and β for subject i (αi and βi) were each assumed to be normally distributed, with means and standard deviations drawn from the group-level distributions (i.e. μa and σa for α, and μb and σb for β). We used WinBUGS [the MS Windows version of BUGS (Bayesian inference Using Gibbs Sampling)] and the WinBUGS Development Interface (Lunn, Thomas, Best, & Spiegelhalter, 2000) to approximate the distributions of the parameters by sampling values with the MCMC technique.

Each of the three chains consisted of 28000 iterations, of which the first 8000 (burn-in) were discarded to ensure that only samples from the stationary distribution were used and that the estimates were unaffected by the starting values. This left 60000 samples across the three chains, which were thinned by retaining every fifth sample, yielding 12000 samples. All interpretations and tests were based on these 12000 samples.

Parameters of the lesion and sham groups were compared by computing the difference between the values of the two posterior distributions in each run obtained from the hierarchical Bayesian estimation. One way to evaluate the strength of evidence for a difference in group-mean parameters is to check the posterior probability that the difference is greater (or less) than zero (Fridberg et al., 2010). Another way is to use the Bayes factor (BF), the ratio of the marginal likelihoods of the two models (or hypotheses) of interest, to index the strength of evidence for the alternative hypothesis against the null hypothesis (Kass & Raftery, 1995; Raftery, 1995). A large BF (> 3) would (at least) “positively” favor the alternative hypothesis, whereas a BF between 1 and 3 would “weakly” favor it, as shown in Table 2.1. To evaluate the differences in group-mean parameters, a method based on the Savage-Dickey density ratio was used to compute the BF values (Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).
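The sample bookkeeping and the two group-comparison measures described above can be sketched in Python as follows. This is only an illustrative sketch, not the WinBUGS workflow used in the study: the normal prior on the group difference, the kernel density estimate with a normal-reference bandwidth, all function names, and the synthetic samples are assumptions. The Savage-Dickey ratio itself is standard: BF10 equals the prior density of the difference at zero divided by its posterior density at zero.

```python
import math
import random


def retained_samples(n_iter, burn_in, n_chains, thin):
    """Number of MCMC samples kept after burn-in removal and thinning."""
    return (n_iter - burn_in) * n_chains // thin


def prob_greater_than_zero(diffs):
    """Posterior probability that the group difference exceeds zero."""
    return sum(d > 0 for d in diffs) / len(diffs)


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def savage_dickey_bf10(diffs, prior_sd):
    """Approximate BF10 for H1: difference != 0 against H0: difference = 0.

    Assumes a Normal(0, prior_sd) prior on the difference (hypothetical) and
    estimates the posterior density at zero with a Gaussian kernel density
    estimate using the normal-reference rule-of-thumb bandwidth.
    """
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    bandwidth = 1.06 * sd * n ** (-0.2)
    posterior_at_zero = sum(normal_pdf(0.0, d, bandwidth) for d in diffs) / n
    prior_at_zero = normal_pdf(0.0, 0.0, prior_sd)
    return prior_at_zero / posterior_at_zero


if __name__ == "__main__":
    n = retained_samples(n_iter=28000, burn_in=8000, n_chains=3, thin=5)
    # Synthetic stand-ins for posterior samples of a group-mean parameter.
    random.seed(0)
    lesion = [random.gauss(0.30, 0.05) for _ in range(n)]
    sham = [random.gauss(0.45, 0.05) for _ in range(n)]
    diffs = [a - b for a, b in zip(lesion, sham)]
    print("samples kept:", n)
    print("P(lesion - sham > 0):", prob_greater_than_zero(diffs))
    print("BF10:", savage_dickey_bf10(diffs, prior_sd=1.0))
```

The resulting BF depends on the prior placed on the difference, so the prior width in such a sketch should be matched to the priors of the fitted hierarchical model rather than chosen freely.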

4.2. Matching law analysis. To assess the degree to which animals in the 2-choice dynamic foraging task allocated their overall choices in accord with the rewards received, a matching law analysis was also conducted (Baum, 1974; Rutledge et al., 2009). This analysis provides a simple empirical quantification of the relationship between the rate of response and the rate of reinforcement:

$$\log\frac{C_{left}}{C_{right}} = s\,\log\frac{R_{left}}{R_{right}} + \log b$$

In the above formula, C_left and C_right denote the number of choices to the left and right apertures, respectively. Likewise, R_left and R_right are the respective numbers of rewards received from the left and right apertures. The slope s is thought to measure the sensitivity of choice allocation to reward frequency, and log b is the intercept (the bias term of the generalized matching law; Baum, 1974). In this study, we used least-squares regression to fit the above formula to steady-state choice behavior (the last 30 trials of each testing phase) in the 2-choice dynamic foraging task. Blocks in which one aperture was never rewarded (i.e. R_left or R_right = 0) were excluded from the analysis, because the log ratio in the above formula is undefined for them.
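A least-squares fit of this kind can be sketched in Python, assuming the generalized matching law in its log-ratio form, log(C_left / C_right) = s·log(R_left / R_right) + log b, where the intercept log b is a bias term. The function name, the tuple layout of the input, the use of natural logarithms, and the example counts are assumptions for illustration. Skipping blocks with a zero choice count is an extra assumption needed to keep the logarithm defined.

```python
import math


def fit_matching_law(blocks):
    """Fit log(C_left/C_right) = s * log(R_left/R_right) + log b.

    blocks: iterable of (c_left, c_right, r_left, r_right) counts per block.
    Returns (s, log_b) from an ordinary least-squares fit.
    """
    xs, ys = [], []
    for c_left, c_right, r_left, r_right in blocks:
        if r_left == 0 or r_right == 0:
            continue  # an aperture was never rewarded: excluded
        if c_left == 0 or c_right == 0:
            continue  # assumption: log ratio of choices would be undefined
        xs.append(math.log(r_left / r_right))
        ys.append(math.log(c_left / c_right))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two usable blocks")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("reward ratios must differ between blocks")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical steady-state counts (choices and rewards per aperture).
    blocks = [(20, 10, 16, 4), (10, 20, 4, 16), (15, 15, 5, 5), (30, 0, 9, 0)]
    s, log_b = fit_matching_law(blocks)
    print(f"sensitivity s = {s:.3f}, bias log b = {log_b:.3f}")
```

A slope near 1 corresponds to strict matching, whereas a slope below 1 indicates undermatching, that is, choice ratios less extreme than reward ratios.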

4.3. Statistical analysis and software. The behavioral data were analyzed with Student’s t-test or one-way analysis of variance (ANOVA) where appropriate. An adjusted t-test was applied if Levene’s test for equality of variances reached significance. Statistical analyses were performed using SPSS 20.0 (SPSS Inc., Chicago, IL, USA).
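The adjusted t-test is presumably the unequal-variance (Welch) test that SPSS reports as "equal variances not assumed"; this identification is an assumption, not a statement from the study. Under that assumption, the test statistic and its Welch-Satterthwaite degrees of freedom can be sketched in Python as follows, with a hypothetical function name and made-up data:

```python
import math


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    se2_a, se2_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    # Made-up values for two groups with unequal variances.
    t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
    print(f"t = {t:.3f}, df = {df:.2f}")
```

Unlike Student's t-test, this version does not pool the two variances, so the degrees of freedom are generally non-integer and smaller than n_a + n_b − 2.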
