Using Testlet Response Theory to analyze data from a survey of attitude change among breast cancer survivors


Received 1 October 2008, Accepted 4 March 2010 Published online 25 May 2010 in Wiley Interscience (www.interscience.wiley.com) DOI: 10.1002/sim.3945


Xiaohui Wang,^a^∗†‡ Su Baldwin,^b^§ Howard Wainer,^b,c^¶ Eric T. Bradlow,^c^‡‡ Bryce B. Reeve,^d^∗∗ Ashley W. Smith,^d^†† Keith M. Bellizzi^e^‡ and Kathy B. Baumgartner^f^§§

In this paper we examine alternative measurement models for fitting data from health surveys. We show why a testlet-based latent trait model that includes covariate information, embedded within a fully Bayesian framework, can allow multiple simultaneous inferences and aid interpretation. We illustrate our approach with a survey of breast cancer survivors that reveals how the attitudes of those patients change after diagnosis toward a focus on appreciating the here-and-now, and away from consideration of longer-term goals. Using the covariate information, we also show the extent to which individual-level variables such as race, age and Tamoxifen treatment are related to a patient’s change in attitude.

The major contribution of this research is to demonstrate the use of a hierarchical Bayesian IRT model with covariates in this application area; hence a novel case study, and one that is certainly closely aligned with but distinct from the educational testing applications that have made IRT the dominant test scoring model. Copyright ©2010 John Wiley & Sons, Ltd.

Keywords: testlet; Bayesian; MCMC; local independence; breast cancer survey

1. Introduction

Surveys are typically constructed with two related but different goals in mind. The first, and the easiest to accomplish, is the retrieval of facts from the surveyed population (e.g. ‘How many cigarettes did you smoke in the last 24 hours?’). The second goal is to retrieve information about a more vague concept, which may be conceived as a ‘higher-order’ construct or factor as in [1], which cannot be directly addressed with a single, factual question. Instead we try to understand its nature with a sequence of questions, each of which, individually, sheds only a little light on the underlying variable of interest, but taken as a whole give us a fairly complete understanding of the construct of interest. For example, if we are interested in studying the level of health in a society we might ask questions that reflect differing amounts of robustness, and/or measure different aspects of health (e.g. physical, mental, short term versus long term, etc.).

aDepartment of Statistics, University of Virginia, Charlottesville, VA, U.S.A.

bNational Board of Medical Examiners, Philadelphia, PA, U.S.A.

cThe Wharton School of the University of Pennsylvania, Philadelphia, PA, U.S.A.

dNational Cancer Institute, Bethesda, MD, U.S.A.

eUniversity of Connecticut, Storrs, CT, U.S.A.

fUniversity of Louisville, Louisville, KY, U.S.A.

∗Correspondence to: Xiaohui Wang, Department of Statistics, University of Virginia, Charlottesville, VA, U.S.A.

†E-mail: [email protected]

‡Assistant Professor.

§Measurement Scientist.

¶Distinguished Research Scientist.

‖Professor.

∗∗Psychometrician and Program Director.

††Behavioral Scientist.

‡‡Co-Director.

§§Associate Professor.

Contract/grant sponsor: National Science Foundation; contract/grant number: DMS0631639

Contract/grant sponsor: National Security Agency; contract/grant number: GG11184


In addition to questions about the issues of interest, surveys also usually try to characterize the surveyed population according to various demographic/background questions. Such information is included to help us understand any systematic variations in the responses to the questions asked as a function of those background variables (e.g. ‘Do Hispanics have more robust health than Blacks?’). These results can then be used for issues such as targeting interventions in underserved populations, or merely for descriptive purposes and reporting.

The last, and most technical issue that one may face (in surveys and otherwise) is the inevitable relationship between how the data are collected and how they are analyzed: the impact of data collection on the measurement model.

For example, survey questions are often administered as clumps of items with related content, termed ‘testlets’ in the educational testing literature by [2] and ‘blocks’ in the survey literature. These are groups of questions, each of which focuses on a single topic (e.g. pulmonary health, cardiac health, renal health). Instruments composed of testlets limit the choice of appropriate measurement models to those that do not depend on the assumption of conditional local independence among all of the items within the survey. Traditional models that comprise item response theory (IRT)^¶¶ rely on this assumption and, if it is violated, tend to yield overly optimistic estimates of precision, as indicated in [6].

Instead, models that take the nested item structure into account are likely to provide more realistic estimates of the true information that is obtained from the respondents’ answers. It is this aspect of survey design that we focus on here, and most importantly its application to a survey of significant individual-level and societal value, that of breast cancer survivors.

We highlight the importance and practicality of our approach via a data set consisting of survey responses from women enrolled in the Health, Eating, Activity, and Lifestyle (HEAL) Study, a population-based, multi-center, multi-ethnic, prospective study of women newly diagnosed with in situ or Stages I to IIIA breast cancer, see [7, 8]. In 2005, 858 women completed a health-related quality of life questionnaire approximately 39 months post cancer diagnosis [9].

We excluded 53 women diagnosed with recurrent breast cancer or a second primary breast cancer by the date of these analyses to focus on a population of individuals facing for the first time a potential life-threatening event. This defined a cohort of 805 women. Subjects were also screened to ensure that complete data were available, as dealing with missing data was not our primary goal. This led to the exclusion of an additional 87 participants, which yielded a final sample size of 718 women for the analysis.

This study will examine the women’s responses to the Posttraumatic Growth Inventory (PTGI) developed by Tedeschi and Calhoun [10], which assesses the changes or transformation that people report experiencing following a traumatic event (i.e. breast cancer for this sample). The scale consists of 21 items with responses on a 0–5 point ordinal Likert scale, ranging from ‘no change’ to ‘very great change’.

In addition to the 21 PTGI survey items, there were also background questions about race, ethnicity, age, income, employment and marital status, whether the women were taking the drug Tamoxifen (a drug commonly used to help prevent breast cancer and its recurrence in women near or beyond menopause^∗∗∗), and how long it had been since they were diagnosed. These questions were anticipated to be useful in assessing the differing model-based item characteristics across varying backgrounds.

The survey was designed so that the 21 items from which it was composed fell into five broad categories (testlets):

I. Relating to others

II. New possibilities

III. Personal strength

IV. Spiritual change

V. Appreciation of life

The exact survey questions and the testlets they fell into are shown in Table I below.

Because each of the 21 items falls into one of these five categories of change, each item is a manifest indicator of a latent factor of interest. The testlet items in this survey were not laid out contiguously (e.g. the items in testlet I are not administered as the first 7 items). Prior research [11] has suggested that this may reduce the testlet dependence structure compared to one that was so ordered. While, on the one hand, this may decrease the need for a testlet-IRT model as described here, it also makes the results we find conservative, because most surveys lay out items that measure the same construct contiguously. Furthermore, it raises the question of the tradeoff between a contiguous design, in which the factor structure is maintained with a likely increase in testlet variance, and a non-contiguous design, in which testlet dependence is likely reduced.

¶¶IRT is the dominant measurement paradigm in contemporary educational testing [3–5].

As the data were collected through three different clinic centers, it is important to account for any differences among them. We include dummy variables in our analyses for this purpose.

∗∗∗A detailed description can be found at http://www.breastcancer.org/tre_sys_tamox_idx.html.


Table I. The survey items shown within their testlet structure.

| Testlet number | Item number | Text of item |
|---|---|---|
| I | 6 | Knowing that I can count on people in times of trouble |
| | 7 | A sense of closeness with others |
| | 9 | A willingness to express my emotion |
| | 12 | Having compassion for others |
| | 15 | Putting effort into my relationships |
| | 18 | I learned a great deal about how wonderful people are |
| | 20 | I accept needing others |
| II | 19 | I developed new interests |
| | 21 | I established a new path for my life |
| | 13 | I am able to do better things with my life |
| | 14 | New opportunities are available which would not have been otherwise |
| | 2 | I am more likely to try to change things which need changing |
| III | 4 | A feeling of self-reliance |
| | 8 | Knowing I can handle difficulties |
| | 10 | Being able to accept the way things work out |
| | 17 | I discovered that I am stronger than I thought I was |
| IV | 5 | A better understanding of spiritual matters |
| | 16 | I have a stronger religious faith |
| V | 1 | My priorities about what is important in life |
| | 3 | An appreciation for the value of my own life |
| | 11 | Appreciating each day |

1.1. Hypotheses

Despite recent reports of reductions in breast cancer deaths due to better screening and treatments, breast cancer remains the most frequent cancer diagnosed in females and the second leading cause of deaths from cancer after lung cancer, see [12]. Diagnosis and treatment of breast cancer have been shown to have a great impact on a woman’s psychological, sexual, and physical functioning [13, 14]. Despite this negative impact, studies have also shown some improvements in the quality of life of long-term survivors of breast cancer [15]. The PTGI studied here was intended to examine the changes in psycho-social functioning among breast cancer survivors who were approximately three years post diagnosis.

The 21-item PTGI scale employed in this research, and its relationship with socio-economic and demographic factors has been of considerable recent interest. These studies, as described below, used a combination of factor analytic methods to assess the five subscale dimensions, and regression methods to assess covariate differences on posttraumatic change.

However, none of the approaches used in these studies allows for a coherent and simultaneous set of inferences while regarding the scale items, their inherent groupings, and the people it differentially impacts. The hierarchical Bayesian approach employed here addresses that concern.

For instance, in a post hoc evaluation of the psychometric properties of the PTGI using a sample of undergraduate students, it was identified as having five factors [10]: relationships with others, new possibilities, appreciation of life, spirituality, and personal strength (as given in Table I). The five-factor solution, however, did not hold up in the validation studies of [16, 17]. In [17], whose data consist of middle- to old-aged cardiovascular disease patients, it was suggested that a single summary score of the PTGI’s 21 items was to be preferred for its parsimony because it did not seem to lose information.

Using our data, an exploratory factor analysis suggested a four factor solution when using the criterion of accounting for 90 per cent of total variation, and the maximum likelihood method showed that six factors were significant at p=.05.

This result of between four and six factors is consistent with the aforementioned extant research. The Bayesian testlet response model utilized here, we believe, can also shed further light on the issue of dimensionality as the testlet-specific variance components will indicate the extent of excess local-scale dependence above and beyond treating the scale as a one-factor 21-item battery. The testlet model we present can be re-parameterized as a bi-factor model by putting the loadings on each dimension of the bi-factor model proportional to the loadings on the general dimension, see [18].

In this manner, the testlet model is a special case of a general multidimensional IRT model in [19] under the condition of independence of all dimensions, which was supported by our exploratory factor analysis.

We anticipated, a priori, that an examination of the factor structure of the PTGI scale in this study would yield a similar result: that a single growth factor would dominate, given the similarity in the characteristics of the traumatic event to that of the patients in [16, 17]. However, we were concerned that there would be local dependency among items that clustered in the original factors of [10], especially the spirituality factor, which has been shown to have the greatest impact following breast cancer in [15]. Therefore, we postulate that:

H1 The testlet variances for the 5 underlying subscales will be substantial, yet most significant for the spirituality factor.

Previous studies have also found that younger women report more posttraumatic change following breast cancer than older women, see [20–22]. Thus, we hypothesize that:

H2 Younger women will report more posttraumatic change in this study.

Earlier research has also found a relationship between ethnicity and posttraumatic change. An independent effect of ethnicity on reports of change was shown in [23, 24] for survivors of sexual assault and breast cancer, respectively. Specifically, African-American and Hispanic women reported higher levels of change than Non-Hispanic White women. Therefore, we expect that:

H3 A significant relationship between ethnicity and posttraumatic change exists; in particular higher levels for African American and Hispanics.

Previous studies have also examined the association between marital status and posttraumatic change, see [25, 26].

Breast cancer survivors who were married or in a committed relationship reported more change than women who were single or widowed, see [20]. Being married has also been found to be associated with more change among bereaved parents due to the loss of a child as indicated in [26]. In [20], it is suggested that women’s partners can offer a beneficial social support system to help cope with their disease. In the current analysis, we therefore hypothesized that:

H4 Having a long-term partner would be associated with more posttraumatic change.

There is no literature, however, that has examined the relationship between taking Tamoxifen (one of our measured variables as described previously) and posttraumatic change. Clinical variables were incorporated here as covariates to account for potential confounding due to cancer status and treatment effects. One could argue that women who are on adjuvant^††† therapy (e.g. Tamoxifen) are more actively involved in the management of their disease than those who are not. This might mean that the process of coping with breast cancer is different for women who are on Tamoxifen, because they are reminded of their disease every day by their use of adjuvant therapy. These women may continue to experience more of an impact (both positive and negative) from their disease as a result. Thus, whereas there were no strong research-based hypotheses on the effects of Tamoxifen on posttraumatic change, we were interested in accounting for such clinical variables as adjuvant treatment. Hence, we posit that:

H5 Women taking Tamoxifen will report a larger posttraumatic change in their attitudes.

Within the IRT framework, each survey respondent’s position on a latent trait (i.e. change in attitudes following breast cancer) can be modeled without difficulty while dealing with the local dependence engendered by the survey construction.

The contribution of this research is to incorporate the background information as covariates into the measurement model, to help explain the results and test our hypotheses without relying on the assumption of local independence. Research incorporating covariates into the measurement model to explain differences in ability is abundant. For example, [27] proposed a linear two-level mixed effects model that included covariates of ability.

Owing to its linear nature, however, this model cannot be directly applied to IRT models. In [28], a more general multi-level IRT model, estimated through a Bayesian procedure, is proposed. However, that model was based on the assumption of local independence. The measurement model used in this paper is a natural extension of the original TRT model in [29], with covariates also incorporated.

In the remainder of this paper we will describe both the method of analysis that we used and the results obtained.

In addition, at various points in the discussion, we will point out the extent to which methods that fail to account for the structure of the data yield different results, and the substantive benefits of using the background information to help understand the effects of demographic and other factors on the PTGI results.

2. The measurement model

The PTGI was designed to assess perceived changes in one’s life following a traumatic event. In our mathematical framework, this can be operationalized by estimating the position of those surveyed on an underlying latent dimension (let us call this ‘changeability’^‡‡‡) where those with a higher value on the latent dimension (reflecting more change) are more likely to give higher ordinal response scores as in [30]. Thus it is sensible to analyze these data with a measurement model that assumes the existence of such a dimension, recognizes the ordinal nature of the measured responses and the item testlet structure as specified earlier, and then subsequently to examine the extent to which this model yields a good fit to the data. The model we used in the analysis of the data from this survey instrument accomplishes this and is derived from a more general set of models called testlet response theory (TRT), originally introduced in [29].

†††Adjuvant: a substance that, when added to a medicine, speeds or improves its action; something that aids another, such as an auxiliary remedy.

TRT is a family of models akin to parallel models in IRT, but for which the fundamental unit of analysis is the item nested within a testlet. It assumes conditional local independence, conditional on the unidimensional latent trait, the item testlet structure, and a set of parameters that govern the potential extra dependence of items nested within the same testlet. For items that are in different item categories (e.g. I=relating to others and II=new possibilities), TRT behaves as IRT; however, for items within the same testlet, the model accounts for a likely stronger association. These models have been extended in many ways since they were first introduced and are described in detail in [31].

The specific TRT model that we use differs from standard IRT not just in the relaxation of the assumption of local independence but also in the incorporation of covariates directly into the analysis, as in Chapter 11 of [31]. Including covariates as an integral part of the analysis has two principal advantages: one is science-based and the other relates to estimation. First, this directly allows us to go beyond the standard IRT result of ‘how strongly was this question endorsed’ or ‘to what extent does this item reflect changeability’ to ‘why’. For the goals of science this is of obvious importance.

Second, by incorporating the covariates directly into the model, one does not have to do a potentially inappropriate post-analysis regression using point estimates of respondent latent locations as a dependent variable, which ignores that they are estimates.

The estimation of the model is embedded within a fully Bayesian framework, where each parameter is associated with a prior and a hyperprior distribution. An important advantage of this framework is that it allows for sharing of information across items and people in a way that improves the precision. We next describe the details of our Bayesian model specification.

2.1. The model specification

The important feature of the model that we utilize is that it accounts for the ordinal nature of the data by utilizing a polytomous probit response model [30, 32] designed for TRT in [33]. In particular, we model the probability that respondent $i$ ($i=1,\ldots,718$ here) gives response score $r$ ($r=0,\ldots,5$) to survey item $j$ ($j=1,\ldots,21$ here), $Y_{ij}$, as

$$P(Y_{ij}=r)=\Phi(g_r-t_{ij})-\Phi(g_{r-1}-t_{ij}), \qquad (1)$$

where $\Phi$ denotes the cumulative standard normal distribution function, $g_r$ denotes a latent cutoff for score $r$ such that $Y_{ij}=r$ if $g_{r-1}<t_{ij}<g_r$, and $t_{ij}$ is as given next and represents the TRT aspect of the work.

The model replaces the standard IRT linear predictor of changeability score, $t_{ij}=a_j(\theta_i-b_j)$, with its corresponding TRT version $t_{ij}=a_j(\theta_i-b_j-\gamma_{id(j)})$, where $\theta_i$ is the $i$th person’s location on the latent dimension and the $a_j$’s and $b_j$’s are analogous to the usual discrimination and difficulty parameters for item $j$ in educational testing. In particular, $b_j$ represents the general propensity for a given item (out of the 21 survey items) to receive higher ordinal scores than others, whereas $a_j$ is akin to a factor loading describing the degree to which item $j$ is correlated with the underlying changeability score. The additional term in the model, $\gamma_{id(j)}$, which incorporates the extra dependence when two items $j$ and $j'$ are in the same testlet, i.e. $d(j)=d(j')$, is indeed a person-by-item effect that is the same for a given person $i$ and all the items within one testlet. By utilizing this structure, two item responses for person $i$, $j$ and $j'$ ($j\neq j'$), which are both in testlet $d(j)=d(j')$, share the term $\gamma_{id(j)}$ in their latent linear predictor $t_{ij}$ and hence are more highly correlated under the model than items $j$ and $j'$ for which $d(j)\neq d(j')$ (as $\gamma_{id(j)}$ is assumed to be independent of $\gamma_{id(j')}$ for different testlets). For instance, in this study $d(1)=d(3)$, as both lie in category V, whereas $d(11)\neq d(19)$, as survey item 11 is in category V while item 19 is in category II.
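As a minimal sketch of how equation (1) and the TRT linear predictor combine, the following Python fragment computes the six category probabilities for one person-item pair. The function names, the cutoff values, and the parameter values are all hypothetical and chosen only for illustration; they are not estimates from this study, and this is not the SCORIGHT implementation.

```python
import math

def std_normal_cdf(x):
    """Cumulative standard normal distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear_predictor(theta, a, b, gamma=0.0):
    """TRT predictor t_ij = a_j * (theta_i - b_j - gamma_{i d(j)}).
    With gamma = 0 this is the standard IRT predictor."""
    return a * (theta - b - gamma)

def category_probs(t, inner_cutoffs):
    """P(Y = r) = Phi(g_r - t) - Phi(g_{r-1} - t) for every score r.
    inner_cutoffs holds the finite cutoffs g_0 < g_1 < ...; the outer
    cutoffs -inf and +inf are added here."""
    g = [-math.inf] + list(inner_cutoffs) + [math.inf]
    return [std_normal_cdf(g[k + 1] - t) - std_normal_cdf(g[k] - t)
            for k in range(len(g) - 1)]

# Hypothetical values (assumptions for illustration only).
cutoffs = [0.0, 0.6, 1.2, 1.9, 2.7]          # g_0 = 0 fixed, rest made up
t_irt = linear_predictor(theta=0.5, a=1.2, b=-0.4)
t_trt = linear_predictor(theta=0.5, a=1.2, b=-0.4, gamma=0.3)
probs_irt = category_probs(t_irt, cutoffs)   # scores 0..5 without testlet effect
probs_trt = category_probs(t_trt, cutoffs)   # scores 0..5 with testlet effect
```

Because the probabilities are differences of a cumulative distribution function evaluated at increasing cutoffs, they telescope to one for any value of the linear predictor.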

The likelihood of the complete data, Y , is then derived as

$$P(Y\mid\Omega)=\prod_i\prod_j\prod_r P(Y_{ij}=r)^{I(Y_{ij}=r)}, \qquad (2)$$

the product over all persons and items, where $I(Y_{ij}=r)=1$ if person $i$ gave Likert scale score $r$ to ‘change survey item’ (item $j$) and 0 otherwise, and $\Omega$ indicates the full parameter space. We note that (2) above is a model-based approach to local dependence identification which is related to the work of [34, 35], which develop methods to detect local independence.
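The likelihood in (2) can be sketched on the log scale as follows. The indicator exponent simply selects, for each person and item, the probability of the category that was actually observed. The data layout, argument names, and toy values below are assumptions made for this illustration only.

```python
import math

def phi(x):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_likelihood(y, theta, a, b, gamma, testlet_of, cutoffs):
    """Log of (2). y[i][j] is the observed score of person i on item j,
    gamma[i][d] the effect of testlet d for person i, testlet_of[j] = d(j),
    and cutoffs[j] the finite cutoffs g_0 < g_1 < ... for item j."""
    total = 0.0
    for i, row in enumerate(y):
        for j, r in enumerate(row):
            t = a[j] * (theta[i] - b[j] - gamma[i][testlet_of[j]])
            g = [-math.inf] + list(cutoffs[j]) + [math.inf]
            # g[r + 1] is g_r and g[r] is g_{r-1} in the notation of (1)
            total += math.log(phi(g[r + 1] - t) - phi(g[r] - t))
    return total

# Toy example: 2 persons, 3 items in 2 testlets, scores 0..2 (all made up).
y = [[0, 2, 1], [1, 1, 2]]
theta = [-0.3, 0.8]
a = [1.0, 1.5, 0.8]
b = [0.2, -0.1, 0.5]
gamma = [[0.1, -0.2], [0.0, 0.3]]
testlet_of = [0, 0, 1]
cutoffs = [[0.0, 1.0]] * 3
ll = log_likelihood(y, theta, a, b, gamma, testlet_of, cutoffs)
```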

‡‡‡This play on words, ‘changeability’ equal to the juxtaposition of ‘change’ as asked by the 0–5 Likert items in this survey, and ‘ability’ is intentional to highlight its similarity to the ability dimension commonly used in educational testing.


To fully specify a Bayesian model, prior distributions need to be specified for the parameters that govern the likelihood given in (2). The Bayesian hierarchical structure that was employed here (and implemented in the computer program SCORIGHT^§§§; details are in [31]) was

i. $\theta_i\sim N(\beta' x_i,\,1)$,

ii. $[\log(a_j),\,b_j]\sim N_2((\mu_a,\mu_b),\,\Sigma)$, (3)

iii. $\gamma_{id(j)}\sim N(0,\,\sigma^2_{d(j)})$,

where N₂(x, y) denotes a bivariate normal distribution with mean x and covariance matrix y. (3.ii) is noteworthy because it explicitly allows a_j and b_j to be correlated, reflecting the common empirical pattern of items that are more discriminating (higher a_j) tending to be endorsed less (have a lower b_j).
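To make the hierarchical structure in (3) concrete, the sketch below draws one person location, one pair of item parameters, and one set of testlet effects from distributions of that form. It is illustrative only: the argument names and every numerical value are assumptions, the covariance matrix is passed as its three distinct entries, and the hyperpriors are not modelled.

```python
import math
import random

def draw_from_priors(rng, x_i, beta, mu_a, mu_b, var_a, var_b, cov_ab, testlet_vars):
    """One draw of (theta_i, a_j, b_j, testlet effects) following (3)."""
    # (3.i): theta_i is normal with mean beta' x_i and variance 1
    theta = rng.gauss(sum(bk * xk for bk, xk in zip(beta, x_i)), 1.0)
    # (3.ii): (log a_j, b_j) is bivariate normal; use the 2x2 Cholesky factor
    l11 = math.sqrt(var_a)
    l21 = cov_ab / l11
    l22 = math.sqrt(var_b - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    a = math.exp(mu_a + l11 * z1)
    b = mu_b + l21 * z1 + l22 * z2
    # (3.iii): one zero-mean normal effect per testlet, each with its own variance
    gammas = [rng.gauss(0.0, math.sqrt(v)) for v in testlet_vars]
    return theta, a, b, gammas

rng = random.Random(2010)
theta, a, b, gammas = draw_from_priors(
    rng, x_i=[0.5, -1.0], beta=[0.3, 0.1],      # hypothetical covariates and slopes
    mu_a=0.0, mu_b=0.0, var_a=0.25, var_b=1.0, cov_ab=-0.2,
    testlet_vars=[0.5, 0.2, 0.3, 0.9, 0.4],     # five testlets, made-up variances
)
```

Drawing the discrimination on the log scale keeps $a_j$ positive, and the off-diagonal covariance entry is what lets the two item parameters be correlated.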

The Bayesian TRT model specification was completed by using a conjugate prior distribution for the covariate slopes, a non-informative prior on the vector of means, an inverse-gamma prior on the testlet variances, and an inverse-Wishart hyperprior on $\Sigma$, to ensure proper posteriors. Extensive testing suggests relatively little sensitivity to the exact hyperprior information for this project; yet it is something (especially with sparse data for some testlets) that all researchers should approach with caution. Further details are available upon request.

Obviously, the term $\gamma_{id(j)}$ is what yields the TRT model, and thus when $\gamma_{id(j)}=0$ we have a modified version of Samejima’s model in [36], albeit a Bayesian one. Furthermore, if we allow $\sigma^2_a\to\infty$, $\sigma^2_b\to\infty$, and $\sigma^2_{d(j)}\to\infty$, we would obtain the standard two parameter probit model (without Bayesian shrinkage), and hence the standard (frequentist) IRT models are nested within this model.

2.2. Model identification

Without constraints, the model is in general non-identifiable. There are three sources of identification problems:

Additive aliasing I: $a_j(\theta_i-b_j-\gamma_{id(j)})=a_j\big((\theta_i-\gamma_{id(j)}-d)-(b_j-d)\big)$

Multiplicative aliasing II: $a_j(\theta_i-b_j-\gamma_{id(j)})=\left(\frac{a_j}{s}\right)\big[s(\theta_i-b_j-\gamma_{id(j)})\big]$

Additive aliasing III: $g_r-t_{ij}=(g_r+d)-(t_{ij}+d)$

To solve the additive aliasing problem I, we constrain $\theta$ to have a mean of 0. Therefore, we have to constrain $x$ to be a set of mean-centered covariates with corresponding slopes $\beta$, and the variance of $\theta$ is set to 1. To solve the multiplicative aliasing problem, constraints also need to be imposed on the variance of either $\theta_i$ or $a_j$. We handle this problem by applying a prior distribution to $\theta_i$ with variance equal to 1 so that the posterior distribution is not subject to multiplicative aliasing.

To solve the additive aliasing problem III, for each item $j$, we fix $g_0=0$, with $g_{-1}=-\infty$ and $g_r=+\infty$ for the highest score $r$. Therefore, we only need to estimate $g_1,\ldots,g_{r-1}$, which need to be estimated for every item. By applying the non-informative prior on $g_k$, $k=1,\ldots,r-1$, the fully conditional density of $g_k$ given $T_{ij}$, $y_{ij}$, $\theta_i$, $a_j$, $b_j$, $\gamma_{id(j)}$ and $\{g_m, m\neq k\}$ can be seen (up to a proportionality constant) to be a uniform distribution.
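The three aliasing identities can be checked numerically. The short sketch below uses arbitrary, hypothetical parameter values and confirms that each transformation leaves the quantity entering the likelihood unchanged, which is why constraints are needed.

```python
import math

def t_pred(theta, a, b, gamma):
    """TRT linear predictor a * (theta - b - gamma)."""
    return a * (theta - b - gamma)

# Arbitrary illustrative values (assumptions, not estimates).
theta, a, b, gamma, g_r = 0.7, 1.3, -0.2, 0.4, 1.1
d, s = 2.5, 3.0

t = t_pred(theta, a, b, gamma)

# Additive aliasing I: shifting theta and b by the same constant d
t_additive = t_pred(theta - d, a, b - d, gamma)

# Multiplicative aliasing II: rescaling the latent scale by s and a by 1/s
t_multiplicative = t_pred(s * theta, a / s, s * b, s * gamma)

# Additive aliasing III: shifting a cutoff and the predictor by d
cutoff_gap = g_r - t
cutoff_gap_shifted = (g_r + d) - (t + d)
```

In all three cases the transformed parameters differ from the originals but produce the same response probabilities, so the data alone cannot distinguish them.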

2.3. Bayesian computation

As the computational aspects of Bayesian IRT and TRT models are well-documented elsewhere (e.g. [31, 32, 37]), and are not the main focus of this research, we provide only a brief description of our computational approach. Inferences for the unknown model parameters given in (1)–(3) are derived from posterior samples obtained using Markov chain Monte Carlo, MCMC, procedures as in [38, 39].

The model was fit to the responses from the 21-item breast cancer survivor survey using MCMC methods, with two chains run from overdispersed starting values. For each chain, we used a burn-in period of M = 10 000 iterations, which is consistent with other research that has demonstrated this to be sufficient (e.g. [40]). The next 20 000 iterations were used for inference, retaining every twentieth draw (thinning) to reduce the high autocorrelation.

The convergence of our MCMC sampler was assessed by noting that the potential scale reduction factors (see [41, 42]) for the prior and hyperprior parameters are all very close to 1.0. All the reported inferences here are thus based on the 2000 combined MCMC draws after convergence.
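The bookkeeping behind the 2000 retained draws, and one common textbook form of the potential scale reduction factor, can be sketched as follows. The chains here are simulated stand-ins rather than output from the fitted model, and the formula shown is a generic version that may differ in detail from the one used in [41, 42] or in SCORIGHT.

```python
import random
import statistics

def potential_scale_reduction(chains):
    """Generic potential scale reduction factor for equal-length chains
    of a single scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.mean(c) for c in chains]
    within = statistics.mean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5

# Settings reported in the text.
n_chains, burn_in, kept_iterations, thin = 2, 10_000, 20_000, 20
retained_per_chain = kept_iterations // thin      # draws kept per chain
total_retained = n_chains * retained_per_chain    # combined draws

# Stand-in chains: independent standard normal draws (an assumption).
rng = random.Random(0)
chains = [[rng.gauss(0.0, 1.0) for _ in range(retained_per_chain)]
          for _ in range(n_chains)]
r_hat = potential_scale_reduction(chains)
```

Values of the factor close to 1 indicate that the between-chain variability is small relative to the within-chain variability.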

§§§SCORIGHT is a general purpose computer program for doing Bayesian computation for item response data that can be dichotomous, polytomous or a mixed version of the two, where items can be nested within testlets. SCORIGHT is available at no charge from the authors upon request.


Figure 1. Continuous residual $\varepsilon_{ij}$ versus linear predictor $t_{ij}$: the dotted lines are 99 per cent predictive bounds under the TRT model, i.e. $\varepsilon_{ij}$ should have an IID normal distribution.

2.4. Posterior predictive checks

We ran a series of posterior predictive checks as demonstrated in [43] to detect any systematic differences between the model and the observed data and to verify that inferences from our Bayesian TRT model would be applicable to the examination of posttraumatic change among breast cancer survivors. The general idea is to check the model based on some discrepancy variables $T(y,\Omega)$ by comparing the observed values versus the replicated values with respect to the posterior distribution of $\Omega$. In most cases, the posterior predictive p-value can be calculated as follows:

$$P\{T(y^{rep},\Omega)>T(y,\Omega)\mid y\}=\int P\{T(y^{rep},\Omega)>T(y,\Omega)\mid\Omega\}\,f(\Omega\mid y)\,d\Omega.$$

In our framework, it can be estimated by $\sum_{l=1}^{2000} I\big(T(y^{rep},\Omega^{l})>T(y,\Omega^{l})\big)/2000$. We carried out posterior predictive checks using the following model-fit diagnostics.^¶¶¶

As $Y_{ij}$ (the PTGI ordinal response score) is discrete, we first utilize the vector of latent continuous model residuals as in [44]. The latent continuous residuals $\varepsilon^{l}_{ij}$ are drawn conditionally on simulated values of $\Omega$ and the replicated data $y^{rep}$. If the model fits the data adequately, then the residual plot of $\varepsilon^{l}_{ij}$ versus $E(t^{l}_{ij})$ should show no pattern and the qq-normal plot of $\varepsilon^{l}_{ij}$ should fall on a straight line.

Figures 1 and 2 demonstrate for one simulated case that the residuals do not show any pattern with respect to the linear predictors and that the qq-plot versus the normal distribution falls very close to a straight line, well within the confidence bands. All of the other simulated cases we examined follow the same pattern as the one shown in Figures 1 and 2.

Instead of looking at a single simulated case, Figure 3 shows 20 random draws from the 2000 simulated cases. The left panel of Figure 3 displays a qq-plot of 20 random draws of $\varepsilon$ based on the original data, and the right panel displays a similar plot of 20 random draws of the residual $\varepsilon$ based on the replicated data. The two plots are indistinguishable, indicating that the distribution of the realized residuals fits the assumptions of the model well.

Figure 4 shows the averaged continuous residuals over all patients for each question j to assess model fit not in aggregate, but rather at the individual-item level. The upper panel of Figure 4 shows such averaged continuous residual versus question number for 20 random draws of the observed data. The lower panel shows the similar plot for the replicated data. Again, the two are similar, indicating a reasonable fit.

¶¶¶A much larger set of diagnostics was run and is available upon request; the findings indicate excellent fit to the same degree as presented here.


Figure 2. QQ-plot of $\varepsilon_{ij}$ in Figure 1. The envelope shows confidence bands.

Figure 3. The left panel shows a qq-plot of 20 random draws of $\varepsilon$ for the observed data; and the right panel shows 20 random draws for the replicated data. The similarity supports the fit of our model.

Finally, Figure 5 gives the scatter plot of the deviance for the observed data and for the replicated data. The Bayesian p-value is 22 per cent under 2000 simulations, suggesting no lack of fit of our model. Clearly, there are many other aspects of model fit that could be checked, but all diagnostics we ran suggest no lack of fit, which might be due to the small sample size of data.
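The Monte Carlo estimate of a posterior predictive p-value, such as the one reported for the deviance, is simply the share of posterior draws in which the replicated discrepancy exceeds the realized one. A minimal sketch follows; the function name is hypothetical and the four pairs of values are toy numbers, not the deviances plotted in Figure 5.

```python
def posterior_predictive_p_value(realized, replicated):
    """Fraction of draws l with T(y_rep, Omega^l) > T(y, Omega^l)."""
    if len(realized) != len(replicated) or not realized:
        raise ValueError("need two non-empty sequences of equal length")
    exceed = sum(1 for t_obs, t_rep in zip(realized, replicated) if t_rep > t_obs)
    return exceed / len(realized)

# Toy discrepancies for four posterior draws (made-up numbers).
realized = [10.0, 12.0, 9.5, 11.0]
replicated = [9.0, 12.5, 9.0, 10.0]
p_value = posterior_predictive_p_value(realized, replicated)
```

With 2000 posterior draws the same calculation would use 2000 pairs; values that are not close to 0 or 1 give no indication of misfit on the chosen discrepancy.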


Figure 4. The upper panel contains averaged values of $\varepsilon_{ij}$ over all patients for each question $j$; 20 plots correspond to 20 random draws of parameters from the posterior distribution; the dashed lines are 95 per cent predictive bounds under the model. The lower panel is for draws from the predictive distribution.

Figure 5. Scatter plot of predictive versus realized deviance. The p-value is estimated to be about 22 per cent.

2.5. Model selection

As mentioned, due to the nested structure of the data, we utilized TRT models that capture excess local dependence instead of standard IRT models. Whether this more general model is needed is an empirical question. To assess this, we resort to two standard criteria.

The first is based on the mean absolute prediction error (MAPE), i.e. $|y - y^{\mathrm{rep},l}|$ averaged over all posterior draws, $l = 1, \ldots, 2000$.

We find that the posterior MAPE for the TRT model is 0.98 and for the IRT model it is 1.11 (Bayesian p-value<0.01).

Therefore, on this criterion, the TRT model is preferred by 13 per cent, a considerable amount given the associated standard error of 1.35 per cent.
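As an illustration of the first criterion, the sketch below averages absolute prediction errors over responses and posterior draws, and then recomputes the relative difference between the two reported MAPE values. The function name and the toy responses are hypothetical; taking the TRT value as the base of the percentage is an assumption that happens to reproduce the 13 per cent quoted above.

```python
def mape(observed, replicated_draws):
    """Mean absolute prediction error over all responses and posterior draws."""
    total = 0.0
    count = 0
    for replicate in replicated_draws:
        for y, y_rep in zip(observed, replicate):
            total += abs(y - y_rep)
            count += 1
    return total / count


# Toy data (hypothetical): four responses on the 0-5 scale, two replicated draws.
observed = [0, 3, 5, 2]
replicated_draws = [[1, 3, 4, 2], [0, 2, 5, 4]]
toy_mape = mape(observed, replicated_draws)  # (1+0+1+0 + 0+1+0+2) / 8

# Reported posterior MAPE values from the text.
mape_trt, mape_irt = 0.98, 1.11
relative_gain = (mape_irt - mape_trt) / mape_trt  # assumption: TRT as the base

print(f"toy MAPE: {toy_mape:.3f}")
print(f"relative gain of TRT over IRT: {100 * relative_gain:.1f} per cent")
```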


Table II. Deviance results for the cancer survey data.

Model   $\bar{D}$   $D(\bar{\Omega})$   $p_D$    DIC
IRT     36 944      36 141              803      37 748
TRT     34 225      32 061              2 164    36 390

Figure 6. Posterior mean tracelines for the six response categories obtained for Item 1.

The second criterion is based on a Bayesian model selection rule: the Deviance Information Criterion (DIC) as in [45]. DIC combines the goal of improved model fitting with penalizing unnecessary model complexity. It is given by

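$$\mathrm{DIC} = \bar{D} + p_D,$$

where

$$\bar{D} = E_{\Omega \mid Y}\left[-2 \log P(Y \mid \Omega)\right],$$

$$p_D = E_{\Omega \mid Y}\left[-2 \log P(Y \mid \Omega)\right] + 2 \log P(Y \mid \bar{\Omega}),$$

with $\Omega$ denoting the full set of model parameters and $\bar{\Omega}$ its posterior mean. Here $\bar{D}$ serves as a Bayesian measure of model adequacy and $p_D$ serves as the penalty term that measures the complexity of the model; $p_D$ is the difference between the posterior mean deviance and the deviance at the posterior mean. The better model is the one with the smaller value of DIC.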

Table II shows the deviance for the breast cancer survivor data. A comparison of the DIC for the IRT and TRT models suggests that the TRT model is better in terms of overall fit. If we consider an IRT model as a fixed effects model, we would expect $p_D$ to be approximately equal to the true number of independent parameters. Combining the $\theta_i$s of all patients, the item parameters, and the cutoffs for each item, the total number of parameters is equal to 823, which is close to the $p_D$ (803) of the IRT model.
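The arithmetic behind Table II can be checked directly from the two deviance columns. The short sketch below recomputes the penalty term and the DIC for both models; the dictionary keys and function names are our own labels. The recomputed DIC values differ from the printed ones by one unit, presumably because the table entries were rounded before printing.

```python
# Deviance summaries as printed in Table II.
table_ii = {
    "IRT": {"mean_deviance": 36944, "deviance_at_mean": 36141, "reported_dic": 37748},
    "TRT": {"mean_deviance": 34225, "deviance_at_mean": 32061, "reported_dic": 36390},
}


def effective_parameters(mean_deviance, deviance_at_mean):
    """Penalty term: posterior mean deviance minus deviance at the posterior mean."""
    return mean_deviance - deviance_at_mean


def dic(mean_deviance, deviance_at_mean):
    """DIC as the posterior mean deviance plus the penalty term."""
    return mean_deviance + effective_parameters(mean_deviance, deviance_at_mean)


results = {}
for model, row in table_ii.items():
    p_d = effective_parameters(row["mean_deviance"], row["deviance_at_mean"])
    value = dic(row["mean_deviance"], row["deviance_at_mean"])
    results[model] = (p_d, value)
    print(f"{model}: p_D = {p_d}, DIC = {value} (printed: {row['reported_dic']})")
```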

3. Model results

One standard inference that is obtainable from an IRT (or TRT) model for polytomous data is the set of probability curves (often called tracelines) describing how, for each item, the probability of giving response category r varies with the latent trait $\theta$. For example, the probability of each possible response, r = 0, 1, ..., 5, to item 1 can be represented graphically by the six posterior (pointwise) mean tracelines shown in Figure 6. From this, we can see that the responses to categories 0, 1, and 2 are all close together, suggesting that these categories, as a function of latent changeability, do not really differentiate much among responders, whereas response categories 3, 4, and 5 are better differentiated.

Figure 7. The expected score curve for item 1 allows us to summarize the analog to difficulty for a polytomous item with the point on the x-axis at which the expected score reaches its 50 per cent point.

We derived similar figures for all 21 items. The study of these figures yields insights into the functioning of the survey instrument (items that have good ‘separation’ in the observed range of changeability scores are ‘good’ items in some sense). They also provide information regarding the impact of that item on the respondents. For binary items this is easily characterized by the item discrimination parameter a_j, but for polytomous items the relationship is more complex.

Item tracelines also provide the raw material for an often useful summary. We can take the various component tracelines and combine them to yield an expected score curve for item j, $E(Y_{\cdot j}) = \sum_{r} r \cdot P(Y_{\cdot j} = r) = 0 \cdot P(0) + 1 \cdot P(1) + \cdots + 5 \cdot P(5)$.

The expected score curve for Item 1 is shown in Figure 7. We have drawn on Figure 7 a horizontal line at the 50 per cent point of expected score (an expected score of 2.5) and indicated the value of the changeability score, $\theta = -0.09$, to which it corresponds. We see that a respondent whose changeability score is approximately 0 will have an expected response score to this item of 2.5. Items that are less frequently cited as characterizing change would be offset to the right (i.e. a person would need to be more changeable to endorse them) and those items that are more frequently cited would be offset to the left. We will summarize all items by this parameter (the value of $\theta$ that yields an expected score of 2.5) to ease comparisons among them (Table III).
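To make the construction of the 50 per cent point concrete, the sketch below builds tracelines for one six-category item, combines them into an expected score curve, and solves for the trait value at which the expected score is 2.5. It assumes an ordinal probit (graded response) form with a slope, a location and five ordered cutoffs; that form, the function names and the parameter values are all hypothetical illustrations and not the fitted values from the survey.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probabilities(theta, a, b, cutoffs):
    """Traceline heights P(Y = r) at theta for r = 0, ..., len(cutoffs)."""
    t = a * (theta - b)
    cumulative = [normal_cdf(g - t) for g in cutoffs] + [1.0]  # P(Y <= r)
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


def expected_score(theta, a, b, cutoffs):
    """Expected score: the sum over r of r times P(Y = r)."""
    probs = category_probabilities(theta, a, b, cutoffs)
    return sum(r * p for r, p in enumerate(probs))


def fifty_percent_point(a, b, cutoffs, lo=-6.0, hi=6.0, tol=1e-8):
    """Bisection for the theta at which the expected score is half the top score."""
    target = len(cutoffs) / 2.0  # top score is 5 for six categories, so 2.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_score(mid, a, b, cutoffs) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical item: slope 1.2, location -0.3, cutoffs symmetric about zero.
CUTOFFS = [-1.5, -0.8, 0.0, 0.8, 1.5]
point = fifty_percent_point(1.2, -0.3, CUTOFFS)
print(f"50 per cent point: {point:.3f}")
```

With cutoffs symmetric about zero, the expected score equals 2.5 exactly at the item location, so the bisection recovers the location; for asymmetric cutoffs the 50 per cent point and the location would generally differ.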

Similar analyses were run for all of the items and the 50 per cent-points thus obtained are shown in Table III along with an abbreviated version of the text of the items. Table III is ordered by the 50 per cent points from the most highly endorsed to the least, and the items are spaced by leaving horizontal gaps in the table, based upon the apparent gaps in the item statistics, see [46]. We see that the items where respondents indicated the most change since being diagnosed (1, 3, and 11) are all from the ‘appreciation of life’ category. The items that were next in terms of change had to do with an increased self-awareness (2 and 17) and a greater appreciation for other people (6, 7, 12, 18). At the other end, we see that these breast cancer survivors acknowledge much less change in items that reflect moving on to ‘new activities’ (14, 19, 21), and only a little more change was acknowledged in topics of religion (16) and acceptance of one’s fate (10).

Showing these results as an annotated stem and leaf diagram (in Table IV) makes clearer both the separation among survey items and the differences in the location of the testlets, one of our primary goals.

With these insights in hand, and the methods used to obtain them, we now move on to the second major goal of this research, describing the ‘whys’.


Table III. The survey items and their 50 per cent points on the changeability scale.

Item number   50 per cent point   Item
3             −0.58               An appreciation for the value of my own life
11            −0.31               Appreciating each day
1             −0.29               My priorities about what is important in life
6             −0.09               Knowing that I can count on people in times of trouble
18            −0.05               I learned a great deal about how wonderful people are
12            −0.01               Having compassion for others
17             0.00               I discovered that I am stronger than I thought I was
5              0.01               A better understanding of spiritual matters
8              0.02               Knowing I can handle difficulties
7              0.06               A sense of closeness with others
2              0.08               I am more likely to try to change things that need changing
4              0.18               A feeling of self-reliance
10             0.24               Being able to accept the way things work out
16             0.29               I have a stronger religious faith
13             0.30               I am able to do better things with my life
9              0.36               A willingness to express my emotion
15             0.39               Putting effort into my relationships
20             0.56               I accept needing others
21             0.60               I established a new path for my life
19             0.75               I developed new interests
14             1.03               New opportunities are available which would not have been otherwise

Table IV. The 50 per cent points on the changeability scale for each item within different testlets.

Testlet number     50 per cent point   Item number     Two testlet names
V                  −0.6                3               Appreciation of life
                   −0.5
                   −0.4
V, V               −0.3                11, 1
                   −0.2
I, I               −0.1                6, 18
I, III, IV, III     0                  12, 17, 5, 8
I, II               0.1                7, 2
III, III            0.2                4, 10
IV, II              0.3                16, 13
I, I                0.4                9, 15
                    0.5
I, II               0.6                20, 21
II                  0.7                19              New possibilities
                    0.8
                    0.9
II                  1                  14

4. Using covariates to understand why

We can use the covariates associated with each person as one source of explanatory information to help us understand what underlying factors may account for why some breast cancer survivors responded the way they did.

Specifically, we might suspect that age (Hypothesis 2) and the type of medical treatment (Hypothesis 5) are related to changeability. Our Bayesian analysis of $\theta_i \sim N(\beta' x_i, 1)$, as given in (3), allows us to provide these results directly.

This study was not designed to allow us to discover the direction of the causal arrow; for example we do not know whether patients changed more because they took Tamoxifen or whether changeable patients were more likely to take Tamoxifen.


Figure 8. The posterior density of the regression weight on the covariate ‘Tamoxifen’. Virtually all of the mass is above zero.

Table V. The posterior means of coefficients for the covariates.

Variable                 Coefficient   S.E. of coefficient   Posterior prob
Clinic center 1          −0.02         0.24                  0.47
Clinic center 2           0.00         0.11                  0.50
Age                      −0.03         0.01                  0.00
Tamoxifen                 0.31         0.08                  0.00
Months since diagnosis   −0.03         0.02                  0.04
White                    −0.21         0.22                  0.16
Working                   0.16         0.10                  0.04
Hispanic                  0.19         0.13                  0.06
Income                    0.05         0.03                  0.10
Married                  −0.07         0.10                  0.23

4.1. Using the posterior distributions of the covariate coefficients

The posterior distribution of a parameter can be used to construct a histogram-based estimate for the unknown parameter.

In Figure 8, we show the posterior distribution of the coefficient associated with the covariate ‘Took Tamoxifen’.

The regression-based analysis shown in Table V strongly suggests that this is a significant predictor of changeability.

Looking at the posterior distribution confirms this conclusion by showing us that virtually all the mass of the posterior lies above the value of zero. That is, the posterior probability that the coefficient on ‘Took Tamoxifen’ is greater than zero is essentially one, which confirms Hypothesis 5.

Examining the posterior distributions of the coefficients of three other covariates that were shown to be significant in exploratory regressions (age, months since diagnosis, white) shows a similar result.

Next, we look at the posterior distributions for two covariates (Hispanic and Married) that were not significant in an exploratory regression. Figure 9 contains the posterior for the coefficient associated with the covariate ‘Married’. We see immediately that the value of zero lies near the middle of the distribution, and hence we can conclude that marital status has little to do with a woman’s ordinal Likert responses to the PTGI survey. Again the Bayesian analysis supports the inferences we made from the exploratory post hoc regression and disconfirms Hypothesis 4.

Finally, in Figure 10, we examine the posterior distribution for the covariate ‘Hispanic’ and we see that 91 per cent of the distribution lies to the right of zero. We would be remiss if we concluded that the event of being Hispanic was unrelated to survey responses. Here the conclusions from a Bayesian analysis are at odds with those drawn from simple traditional analysis.

The posterior distribution for ‘Working’ (not shown) closely resembles that of ‘Hispanic’ and suggests including it in our inferences, whereas the posterior for ‘Income’ resembles that of ‘Married’ and can likely be excluded.
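The quantity read off each posterior density in this section is the share of posterior mass above zero. The sketch below computes that share from a set of draws. Because the actual MCMC draws are not available here, it uses normal stand-in draws centred on the Table V coefficients with the Table V standard errors; that normal shape, the seed and the function name are assumptions, so the resulting shares need not reproduce the percentages reported in the figure captions.

```python
import random


def mass_above_zero(draws):
    """Share of posterior draws that are strictly greater than zero."""
    return sum(1 for d in draws if d > 0.0) / len(draws)


# (coefficient, standard error) pairs taken from Table V.
table_v = {
    "Tamoxifen": (0.31, 0.08),
    "Hispanic": (0.19, 0.13),
    "Married": (-0.07, 0.10),
}

rng = random.Random(2010)
shares = {}
for name, (mean, se) in table_v.items():
    # Assumption: normal stand-in draws, not the paper's MCMC output.
    draws = [rng.gauss(mean, se) for _ in range(20000)]
    shares[name] = mass_above_zero(draws)
    print(f"{name}: {100 * shares[name]:.1f} per cent of the mass above zero")
```

With real output, `draws` would simply be the retained MCMC samples of the coefficient in question.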


Figure 9. The posterior density of the regression weight on the covariate ‘Married’; 38 per cent of the mass is above zero.

Figure 10. The posterior density of the regression weight on the covariate ‘Hispanic’; 91 per cent of the mass is above zero.

To summarize our empirical findings, we can report that:

H2 Confirmed; young women do report more posttraumatic change than older women.

H3 Partially confirmed; Hispanic women report greater posttraumatic change than other women, but African-American women do not.

H4 Disconfirmed; married women do not report greater posttraumatic change than other women.

H5 Confirmed; women taking Tamoxifen report greater posttraumatic change than other women.

5. What was the effect of local dependence?

The model we fit allows dependence within testlets by incorporating the testlet effects, $\gamma_{id(j)}$, described in Section 2.1, but does not require the kind of all-or-nothing decision that usually follows the examination of eigenvalues that is the hallmark of principal components or factor analysis. Instead our model characterizes the size of the excess local dependence that would accompany a multiple factor structure. We can then examine the distribution of the parameter(s) that represents local dependence and see how large it is. In addition we can look at the posterior distribution of the parameters that represent the individual respondents and see how different it would be if we ignored the testlet structure.

This is precisely what we did.


Figure 11. The posterior densities of the testlet parameter for all five testlets.

Figure 12. The posterior densities of the person parameter (changeability) for person 1, shown for both models—the usual IRT model that assumes local independence, and the testlet model that allows within-testlet dependence.

As an empirical demonstration of our approach, it is sensible to ask how different the results would have been had these same data been fit with the analogous IRT (not TRT) model that assumes local independence. The answer, for point estimates of the parameters, is ‘not very different’. This is reassuring, as IRT is often used when its assumptions are clearly being violated. The place where unmodeled local dependence affects results is in the estimates of the parameter uncertainties. If we assume independence when it is not true, we do not have as much information as we might have thought.

In Figure 11 we show the posterior distributions of the testlet variance parameters. If the variance of $\gamma$ is zero there is no local dependence. The extent to which it is greater than zero is a measure of the local dependence. The way that the variance of $\gamma$ is estimated ensures that it must always be positive, but to be meaningful it must be substantially greater than zero. As we can see, whereas it is greater than zero for all five testlets, it is only substantially greater for testlet IV (p<0.001). Thus we confirm Hypothesis 1 that the greatest amount of local dependence is found in the Spirituality testlet, testlet IV. Note that the scale of the variance of $\gamma$ is the same as the scale of the variance of $\theta$, which equals one, so that the effect of $\gamma$ is between one-tenth and half of the effect of individual differences in changeability.

Obviously there is local dependence within testlets. But how much will this dependence affect our inferences if we neglect to model it? To illustrate this let us consider the posterior distributions of $\theta$, the latent changeability trait, for a typical respondent for two models that are identical except that one assumes local independence (IRT) and one does not (TRT). As is evident (in Figure 12), they are both centered in the same place, but the TRT model is more platykurtic.


The IRT model tells us that we have measured with greater precision than is, in fact, the case. For this survey the variance of the posterior is underestimated by about 50 per cent when averaged over all respondents. This result firmly supports one of the conclusions of [10] and the size of the effect is on a par with those found in the prior research by [33].
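The mechanism behind this loss of precision can be illustrated with a deliberately simplified linear-Gaussian analogue, which is not the ordinal TRT model fit in this paper. Suppose each response is the sum of a person effect with unit prior variance, a testlet effect shared by the items in the same testlet, and independent unit-variance noise. The sketch below compares the posterior variance of the person effect when the testlet effect is ignored with its value when the testlet effect is modeled. The testlet sizes are counted from Table IV; the function names and the testlet variance values are hypothetical.

```python
def posterior_variance_ignoring_testlets(testlet_sizes, noise_var=1.0, prior_var=1.0):
    """Posterior variance of the person effect if all items are treated as independent."""
    n_items = sum(testlet_sizes)
    return 1.0 / (1.0 / prior_var + n_items / noise_var)


def posterior_variance_with_testlets(testlet_sizes, testlet_var, noise_var=1.0, prior_var=1.0):
    """Posterior variance of the person effect when a shared testlet effect is modeled."""
    precision = 1.0 / prior_var
    for n in testlet_sizes:
        # The mean of a testlet's n items varies around the person effect
        # with variance testlet_var + noise_var / n.
        precision += 1.0 / (testlet_var + noise_var / n)
    return 1.0 / precision


# Items per testlet (I to V), counted from Table IV.
TESTLET_SIZES = (7, 5, 4, 2, 3)

naive = posterior_variance_ignoring_testlets(TESTLET_SIZES)
for testlet_var in (0.1, 0.3, 0.5):  # hypothetical testlet variances
    honest = posterior_variance_with_testlets(TESTLET_SIZES, testlet_var)
    print(f"testlet variance {testlet_var}: naive/honest variance ratio = {naive / honest:.2f}")
```

In this analogue the two variances coincide when the testlet variance is zero, and the variance computed under independence falls increasingly short of the correct one as the testlet variance grows.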

6. Discussion

This study describes in some detail how the results from a survey instrument can be understood using the most modern of contemporary statistical machinery. In a single coherent analysis we have incorporated:

(i) A method to estimate each person’s location on the latent dimension that underlies the 21 items on the survey;

(ii) The testlet structure of the survey instrument;

(iii) The covariates that were gathered to illuminate some of the reasons why the different respondents answered the way they did.

We found, unsurprisingly, that breast cancer survivors report that since their diagnosis they have become more focused on events and people temporally close at hand, and fewer report following new paths in life or new interests. We cannot say that the diagnosis is a cause of this response without a proper control group (and we have no access to data that would allow us to compute estimates of change prior to diagnosis), but that would seem to be a plausible working hypothesis. We also cannot tell which direction the causal arrow is pointing after uncovering the relationship between ‘taking Tamoxifen’ and reporting a change in attitudes. Such a causal link could be examined with suitable longitudinal data and a measurement model such as the one we described here. Our hope is that analyses such as this extend the toolbox that survey researchers consider when they would like to answer questions on survey design, factor structure, and the ‘whys’ simultaneously.

Furthermore, an area of important and on-going research is how to determine the correlates of prominent testlet effects.

If this effect were better understood, it could be controlled a priori during test development. SCORIGHT is designed to help understand the relationship between testlet effects and their covariates by allowing the testlet variances to be predicted from covariates such as the number of items in a testlet. We believe that investigation of testlet effects is of substantive importance for any scientific endeavor.

Acknowledgements

This work was supported by the National Board of Medical Examiners and the Wharton Interactive Media Initiative. The research of Xiaohui Wang was partially supported by the National Science Foundation grant DMS0631639 and the National Security Agency grant GG11184. We are delighted to take this opportunity to express our gratitude. The data were supplied by the National Cancer Institute.

References

1. Spearman C. ‘General intelligence’ objectively determined and measured. American Journal of Psychology 1904; 15:201--293.

2. Wainer H, Kiely G. Item clusters and computerized adaptive testing: the case for testlets. Journal of Educational Measurement 1987; 24:189--205.

3. Lord FM. Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum Associates: Hillsdale, NJ, 1980.

4. Rasch G. Probabilistic Models for some Intelligence and Attainment Tests. Denmarks Paedagogiske Institute: Copenhagen, 1960 (Republished in 1980 by the University of Chicago Press, Chicago).

5. Thissen D, Wainer H. Test Scoring. Lawrence Erlbaum Associates: Hillsdale, NJ, 2001.

6. Wainer H, Thissen D. How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice 1996; 15(1):22--29.

7. McTiernan A, Rajan KB, Tworoger SS, Irwin M, Bernstein L, Baumgartner R, Gilliland F, Stanczyk FZ, Yasui Y, Ballard-Barbash R. Adiposity and sex hormones in postmenopausal breast cancer survivors. Journal of Clinical Oncology 2003; 21(10):1961--1966.

8. Irwin ML, McTiernan A, Bernstein L, Gilliland FD, Baumgartner R, Baumgartner K, Ballard-Barbash R. Physical activity levels among breast cancer survivors. Medicine and Science in Sports and Exercise 2004; 36(9):1484--1491.

9. Bowen DJ, Alfano CM, McGregor BA, Kuniyuki A, Bernstein L, Meeske K, Baumgartner KB, Fetherolf J, Reeve BB, Smith AW, Ganz PA, McTiernan A, Barbash RB. Possible socioeconomic and ethnic disparities in quality of life in a cohort of breast cancer survivors. Breast Cancer Research and Treatment 2007; 106(1):85--95.

10. Tedeschi RG, Calhoun LG. The Posttraumatic growth inventory: measuring the positive legacy of trauma. Journal of Traumatic Stress 1996; 9(3):455--472.

11. Bradlow ET, Fitzsimons GJ. Subscale distance and item clustering effects in self-administered surveys: a new metric. Journal of Marketing Research 2001; XXXVIII:254--261.

12. Jemal A, Siegel R, Ward E, Murray T, Xu J, Thun MJ. Cancer statistics, 2007. CA: A Cancer Journal for Clinicians 2007; 57:43--66.

13. Arndt V, Stegmaier C, Ziegler H, Brenner H. A population-based study of the impact of specific symptoms on quality of life in women with breast cancer 1 year after diagnosis. Cancer 2006; 107:2496--2503.


14. Hartl K, Janni W, Kastner R, Sommer H, Strobl BR, Stauber M. Impact of medical and demographic factors on long-term quality of life and body image of breast cancer patients. Annals of Oncology 2003; 14:1064--1071.

15. Ganz PA, Desmond KA, Leedham B, Rowland JH, Meyerowitz BE, Belin TR. Quality of life in long-term, disease-free survivors of breast cancer: a follow-up study. Journal of the National Cancer Institute 2002; 94:39--49.

16. Ho SMY, Chan CLW, Ho RTH. Posttraumatic growth in Chinese cancer survivors. Psycho-Oncology 2003; 13:377--389.

17. Sheikh AI, Marotta SA. A cross-validation study of the posttraumatic growth inventory. Measurement and Evaluation in Counseling and Development 2005; 38:66--78.

18. Li Y, Bolt DM, Fu J. A comparison of alternative models for testlets. Applied Psychological Measurement 2006; 30:3--21.

19. de la Torre J, Patz RJ. Making the most of what we have: a practical application of multidimensional IRT in test scoring. Journal of Educational and Behavioral Statistics 2005; 30:295--311.

20. Bellizzi KM. Expressions of generativity and posttraumatic growth in adult cancer survivors. International Journal of Aging and Human Development 2004; 58:247--267.

21. Cordova MJ, Cunningham LLC, Carlson CR, Andrykowski MA. Posttraumatic growth following breast cancer: a controlled comparison study. Health Psychology 2001; 20(3):176--185.

22. Bower JE, Meyerowitz BE, Desmond KA, Bernaards CA, Rowland JH, Ganz PA. Perceptions of positive meaning and vulnerability following breast cancer: predictors and outcomes among long-term breast cancer survivors. Annals of Behavioral Medicine 2005; 29(3):236--245.

23. Frazier P, Tashiro T, Berman M, Steger M, Long J. Correlates of levels and patterns of positive life changes following sexual assault. Journal of Consulting and Clinical Psychology 2004; 72(1):19--30.

24. Tomich PL, Helgeson VS. Is finding something good in the bad always good? Benefit finding among women with breast cancer. Health Psychology 2004; 23(1):16--23.

25. Bellizzi KM, Blank TO. Predicting posttraumatic growth in breast cancer survivors. Health Psychology 2006; 25(1):47--56.

26. Polatinsky S, Esprey Y. An assessment of gender differences in the perception of benefit resulting from the loss of a child. Journal of Traumatic Stress 2000; 13:709--718.

27. Adams RJ, Wilson M, Wu M. Multilevel item response models: an approach to errors in variables regression. Journal of Educational and Behavioral Statistics 1997; 22:47--76.

28. Fox J-P, Glas CAW. Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika 2001; 66:271--288.

29. Bradlow ET, Wainer H, Wang X. A Bayesian random effects model for testlets. Psychometrika 1999; 64:153--168.

30. Samejima F. Homogeneous case of the continuous response level. Psychometrika 1973; 38:203--219.

31. Wainer H, Bradlow ET, Wang X. Testlet Response Theory. Cambridge University Press: New York, 2007.

32. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 1993; 88:669--679.

33. Wang X, Bradlow ET, Wainer H. A general Bayesian model for testlets: theory and applications. Applied Psychological Measurement 2002; 26(1):109--128.

34. Ip EH. Testing for local dependency in dichotomous and polytomous item response model. Psychometrika 2001; 66:109--132.

35. Zhang J, Stout W. The theoretical detect index of dimensionality and its application to approximate simple structure. Psychometrika 1999; 64:213--249.

36. Samejima F. Estimation of latent ability using a response pattern of graded scores. Psychometrika Monographs (Whole No. 17), 1969.

37. Patz RJ, Junker BW. A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics 1999; 24:146--178.

38. Gelfand AE, Hills SE, Racine-Poon A, Smith AFM. Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association 1990; 85:972--985.

39. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. Chapman & Hall: London, 1995.

40. Sinharay S. Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement 2005; 42:375--394.

41. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science 1992; 7:457--511.

42. Brooks S, Gelman A. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 1998; 7(4):434--455.

43. Gelman A, Meng X, Stern H. Posterior predictive model assessment via realized discrepancies. Statistica Sinica (with Discussion) 1996; 6:733--807.

44. Albert JH, Chib S. Bayesian residual analysis for binary regression models. Biometrika 1995; 78:637--644.

45. Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 2002; 64(4):583--639.

46. Wainer H, Schacht S. Gapping. Psychometrika 1978; 43:203--212.
