Chapter 3. Material and Methods
3.5 Statistical analysis
In this study, we used descriptive and inferential statistics, equivalence tests, crossover model analysis, and Rasch analysis to assess differences between the touch-screen and paper versions of the EORTC QLQ-C30 and the EORTC QLQ-PR25 only. For the urinary symptom domain item “Has wearing an incontinence aid been a problem for you?”, patients answered the question only if they wore an incontinence aid.
3.5.1 Sample size estimation
Sample size estimation was based on the hypothesis of no clinically important difference between the domain scores of the two administration modes (paper and touch-screen) under a crossover design. The minimum clinically important difference (MID) of the domain score for the EORTC QLQ-C30 was set at 5 points, and the standard error of the domain score was set at 8 based on empirical data. To detect equivalence within a margin of 5 points with 80% power at a 5% significance level, a sample size of 80 was obtained using the statistical software PASS.
3.5.2 Descriptive and inferential statistics
Descriptive statistics, equivalence tests, crossover regression model analysis, and Rasch analysis were used to assess the equivalence of the measurement properties of the two modes, the touch-screen and paper versions of the EORTC QLQ-C30 and the EORTC QLQ-PR25.
We assessed differences in demographic characteristics between the two crossover groups using the chi-square test for categorical data and the independent t-test for continuous data. To assess the feasibility of the touch-screen versus paper administration modes of the HRQL questionnaires, time to completion was summarized as mean and standard deviation. Patients’ acceptance of and preference for the touch-screen version were summarized as counts and percentages.
Results stratified by age (<= 70 years and > 70 years) and computer experience (yes and no) were presented in the same way. Global agreement was defined as agreement within 1 response category in either direction [137, 147].
3.5.3 Equivalence test of two modes – a minimum clinically important difference approach
According to the scoring manuals of the EORTC QLQ-C30 and the EORTC QLQ-PR25, item and scale scores were linearly transformed to a 0–100 scale, with higher scores reflecting either more symptoms (e.g., urinary, bowel, hormonal treatment-related symptoms) or higher levels of functioning (e.g., sexual).
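The linear transformation can be sketched as follows. This is a minimal illustration that assumes the standard EORTC convention (raw score = mean of the scale's items, rescaled by the item range, and reversed for functional scales) and complete responses. The function name `scale_score` and its parameters are hypothetical and are not taken from the scoring manuals.

```python
from statistics import mean

def scale_score(item_responses, item_range=3, functional=False):
    """Linearly transform raw item responses of one scale to a 0-100 score.

    item_responses: responses to the items of one scale (e.g. coded 1-4).
    item_range: maximum minus minimum possible response (3 for 1-4 items).
    functional: True for functional scales, whose raw score is reversed so
                that a higher transformed score means better functioning.
    Assumes no missing responses (handling of missing items is not shown).
    """
    raw = mean(item_responses)            # raw score: mean of the item responses
    fraction = (raw - 1) / item_range     # position within the possible range, 0..1
    if functional:
        fraction = 1 - fraction           # reverse for functional scales
    return fraction * 100

# Hypothetical usage: a two-item symptom scale answered 2 and 3
symptom = scale_score([2, 3])
# The same answers on a functional scale
function_ = scale_score([2, 3], functional=True)
```

With this convention, a symptom scale answered at the lowest category throughout maps to 0 and at the highest category to 100.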
Based on previous research on the EORTC QLQ-C30, a change of about 5 to 10 points denotes “a little” change, a change of about 10 to 20 points denotes “moderate” change, and a change greater than 20 points denotes “very much” change [31, 148]. Therefore, in our test of measurement equivalence between the two modes, we defined the minimum clinically important difference (MID) as 5 points, and we use the symbol $\delta$ to represent this five-point score.
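An equivalence test was applied to compare the two modes. The equivalence hypotheses are
$H_0: |\mu_T - \mu_P| \ge \delta$ versus $H_1: |\mu_T - \mu_P| < \delta$,
where $\mu_T$ and $\mu_P$ denote the mean domain scores under the touch-screen and paper modes, and $\delta$ represents the five-point score. Rejecting the null hypothesis indicates that the two modes are equivalent [4, 149-152].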
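One common way to carry out such an equivalence test on paired data is the two one-sided tests (TOST) procedure. The sketch below is illustrative only: it assumes paired domain scores per patient, a margin of 5 points, and a large-sample normal approximation in place of the t distribution. The function name `paired_tost` and the choice of 0.05 for each one-sided test are assumptions, not details given in the text.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_tost(touch, paper, margin=5.0, alpha=0.05):
    """Two one-sided tests for equivalence of paired scores.

    Returns (mean difference, TOST p-value, equivalence decision).
    Uses a normal approximation, which is an assumption suited to
    moderately large samples; the differences must not all be equal.
    """
    diffs = [t - p for t, p in zip(touch, paper)]
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))      # standard error of the mean difference
    z = NormalDist()
    # H0a: true difference <= -margin; rejected when the statistic is large
    p_lower = 1 - z.cdf((d_bar + margin) / se)
    # H0b: true difference >= +margin; rejected when the statistic is very negative
    p_upper = z.cdf((d_bar - margin) / se)
    p_value = max(p_lower, p_upper)           # both one-sided tests must reject
    return d_bar, p_value, p_value < alpha

# Hypothetical domain scores for eight patients
touch = [50, 60, 70, 80, 55, 65, 75, 45]
paper = [52, 58, 71, 79, 54, 66, 74, 46]
d_bar, p_value, equivalent = paired_tost(touch, paper)
```

Equivalence is declared only when both one-sided null hypotheses are rejected, which is why the larger of the two p-values is reported.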
3.5.4 Mode effect assessment – a cross-over regression analysis
The crossover regression model recommended by Pocock was used to assess whether the measurement properties of the two modes differed. We first fitted a model with the mode effect, the order effect, and their interaction. The interaction term accounts for the carry-over effect, if one exists; gender and age were also included in the model for adjustment. After testing the mode-by-order interaction, we refitted the model without the interaction term if the carry-over effect was not significant. The mode effect was then assessed with the t-test for the regression coefficient of mode [32, 153]. In this analysis, all item and scale scores were linearly transformed to a 0–100 scale, with higher scores reflecting either more symptoms (e.g., urinary, bowel, hormonal treatment-related symptoms) or higher levels of functioning (e.g., sexual).
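For orientation, the full model described above can be written as below. This is a sketch under assumed notation: $Y_{ij}$ is the score of patient $i$ at administration $j$, and mode and order are taken to be 0/1 indicators. The symbols and coding are illustrative and are not taken from the cited references.

```latex
% Full model with mode-by-order interaction (carry-over term)
Y_{ij} = \beta_0 + \beta_1\,\mathrm{mode}_{ij} + \beta_2\,\mathrm{order}_{ij}
       + \beta_3\,(\mathrm{mode}_{ij} \times \mathrm{order}_{ij})
       + \beta_4\,\mathrm{gender}_i + \beta_5\,\mathrm{age}_i + \varepsilon_{ij}

% Step 1: carry-over effect,   H_0: \beta_3 = 0
% Step 2: if not rejected, refit without the interaction term:
Y_{ij} = \beta_0 + \beta_1\,\mathrm{mode}_{ij} + \beta_2\,\mathrm{order}_{ij}
       + \beta_4\,\mathrm{gender}_i + \beta_5\,\mathrm{age}_i + \varepsilon_{ij}

% Step 3: mode effect,         H_0: \beta_1 = 0  (t-test on \beta_1)
```

Under this coding, $\beta_1$ is the adjusted mean difference between the touch-screen and paper modes.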
3.5.5 Equivalence test of two modes – a summated response difference approach
In addition to the equivalence test based on the minimum clinically important difference (MID), which used linearly transformed domain scores, we examined equivalence at the item level to make the analysis clearer and more complete. The proportion of agreement between the two assessment modes was presented for each item, and two kinds of agreement were defined. Exact agreement was defined as identical responses in the two modes. Global agreement was defined as agreement within 1 response category in either direction [137, 147].
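The two agreement measures can be computed per item as in the following sketch. It assumes responses are coded as consecutive integers (e.g. 1 to 4) and that each patient answered the item in both modes. The function name `agreement_proportions` and the example data are hypothetical.

```python
def agreement_proportions(paper, touch):
    """Return (exact, global) agreement proportions for one item.

    paper, touch: integer-coded responses of the same patients, in the
    same order, under the paper and touch-screen modes.
    """
    pairs = list(zip(paper, touch))
    n = len(pairs)
    # Exact agreement: identical response in both modes
    exact = sum(1 for p, t in pairs if p == t) / n
    # Global agreement: responses within 1 category in either direction
    global_ = sum(1 for p, t in pairs if abs(p - t) <= 1) / n
    return exact, global_

# Hypothetical responses of five patients to one item
paper_item = [1, 2, 3, 4, 2]
touch_item = [1, 3, 3, 2, 2]
exact, global_ = agreement_proportions(paper_item, touch_item)
```

Global agreement is by definition at least as large as exact agreement, since every identical pair also lies within one category.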
We also developed a second equivalence test. First, we calculated the possible difference scores for each item, with 0 indicating no difference between the two modes; for example, an item with 4 response categories has possible difference scores of 0, 1, 2, and 3. Second, we computed the possible total difference score for each domain; for example, a domain of 5 items with 4 response categories each has a total difference score ranging from 0 to 15. Third, we computed 15% of the maximum total difference score for each domain (denoted as $\delta_S$); in the previous example, the value is 2.25 (= 15 × 0.15). We then used this value as the maximum difference allowed for equivalence in our test.
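Based on the above, the equivalence hypotheses are
$H_0: \mu_D \ge \delta_S$ versus $H_1: \mu_D < \delta_S$,
where $\mu_D$ denotes the mean total difference score between the two modes for a domain and $\delta_S$ represents 15% of the maximum total difference score for that domain. Rejecting the null hypothesis indicates that the two modes are equivalent [4, 149-152].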
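The margin used in this approach follows directly from the number of items and response categories in a domain. The sketch below reproduces the worked example (5 items with 4 response categories each) and shows how one patient's total difference score would be obtained. The function names and the response data are hypothetical.

```python
def equivalence_margin(categories_per_item, fraction=0.15):
    """Return (maximum total difference score, equivalence margin).

    categories_per_item: number of response categories of each item in
    the domain; an item with k categories can differ by at most k - 1.
    """
    max_total = sum(k - 1 for k in categories_per_item)
    return max_total, fraction * max_total

def total_difference(paper, touch):
    """Sum of absolute item-level differences for one patient in one domain."""
    return sum(abs(p - t) for p, t in zip(paper, touch))

# Worked example from the text: 5 items, 4 response categories each
max_total, margin = equivalence_margin([4] * 5)

# Hypothetical responses of one patient to the 5 items in each mode
patient_total = total_difference([1, 2, 3, 4, 2], [1, 3, 3, 2, 2])
```

For the worked example this gives a maximum total difference of 15 and a margin of 2.25, matching the values stated above.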
3.5.6 Intraclass correlation coefficient – reliability measurement
Lachin (2004) demonstrated that a coefficient of variation does not measure reliability. The best measure of reliability for continuous data is the intraclass correlation coefficient (ICC) [154]. We had 99 subjects and measured 2 replicates from each subject. The correlation between two replicates from the same subject is referred to as the intraclass correlation coefficient, denoted by $\rho_I$. A mixed model was used to estimate $\rho_I$. The model was as follows:
$y_{ij} = \mu + a_i + e_{ij}$, $i = 1, \ldots, 99$, $j = 1, 2$.
A mixed model, which allows both fixed-effect and random-effect factors as independent variables, was used, where $a_i \sim N(0, \sigma_A^2)$ is the random subject effect and $e_{ij} \sim N(0, \sigma^2)$ is the within-subject error. It can then be shown that $\rho_I = \sigma_A^2 / (\sigma_A^2 + \sigma^2)$; i.e., $\rho_I$ is the ratio of the between-person variance to the sum of the between-person and within-person variances.
The intraclass correlation ranges from -1.0 to 1.0. It is large and positive when there is little variation within the pairs but the means of the pairs differ. It is large and negative when the variation within a pair is much greater than that between the pairs. The present research uses the following classification scheme: Poor: 0–0.39, Fair: 0.40–0.59, Good: 0.60–0.79, Excellent: 0.80–1.0. This scheme combines the classification categories used by Bartko (1976) [155] and Stokdijk (2000) [156].
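As an illustration of the variance ratio, the sketch below estimates the ICC from paired replicates using one-way ANOVA mean squares. This is an assumption made for simplicity: the study itself used a mixed model, and the two approaches need not give identical estimates. The function names are hypothetical, and treating the scheme's boundaries as cut points (with negative values labelled "Poor") is likewise an assumption.

```python
from statistics import mean

def icc_oneway(pairs):
    """ICC from a list of (replicate 1, replicate 2) tuples, one per subject.

    Estimates the between-subject and within-subject variances from
    one-way ANOVA mean squares and returns their ratio form.
    """
    k = 2                                    # replicates per subject
    n = len(pairs)
    grand = mean(v for pair in pairs for v in pair)
    subject_means = [mean(pair) for pair in pairs]
    ms_between = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for pair, m in zip(pairs, subject_means)
                    for v in pair) / (n * (k - 1))
    var_between = (ms_between - ms_within) / k   # between-person variance estimate
    return var_between / (var_between + ms_within)

def classify_icc(icc):
    """Map an ICC to the Poor/Fair/Good/Excellent scheme (assumed cut points)."""
    if icc >= 0.80:
        return "Excellent"
    if icc >= 0.60:
        return "Good"
    if icc >= 0.40:
        return "Fair"
    return "Poor"

# Hypothetical paper and touch-screen scores for four patients
scores = [(10, 12), (20, 18), (30, 32), (40, 38)]
icc = icc_oneway(scores)
label = classify_icc(icc)
```

The estimate becomes negative when the within-subject mean square exceeds the between-subject mean square, which corresponds to the "large and negative" case described above.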
3.5.7 Differential item functioning analysis from Rasch model
We used a rating scale model, one of the Rasch family of models for polytomous response data, to assess the equivalence of the two modes. Differential item functioning (DIF) analysis was applied for this purpose. DIF refers to an item lacking measurement equivalence across groups or settings [34]. In this study, sets of item difficulties were compared between methods (paper-and-pencil vs. touch-screen) to detect DIF. A difference of 0.5 logits between item difficulties in the two groups was used as the criterion for determining whether an item exhibited DIF [35-36].
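Once item difficulties have been estimated separately for each mode, applying the 0.5-logit criterion is a simple comparison. The sketch below assumes the difficulties are already available as dictionaries keyed by item label and flags items whose absolute difference exceeds the threshold. Whether a difference of exactly 0.5 counts as DIF is not stated in the text, so the strict inequality is an assumption, and the item labels and values are hypothetical.

```python
def flag_dif(paper_difficulty, touch_difficulty, threshold=0.5):
    """Return {item: difficulty contrast} for items exceeding the threshold.

    paper_difficulty, touch_difficulty: item difficulties in logits,
    estimated separately for each administration mode.
    The contrast is touch-screen difficulty minus paper difficulty.
    """
    flagged = {}
    for item, d_paper in paper_difficulty.items():
        contrast = touch_difficulty[item] - d_paper
        if abs(contrast) > threshold:        # assumed strict inequality
            flagged[item] = contrast
    return flagged

# Hypothetical item difficulties (logits) for three items
paper_est = {"Q1": -0.20, "Q2": 0.35, "Q3": 1.10}
touch_est = {"Q1": -0.05, "Q2": 1.00, "Q3": 1.00}
dif_items = flag_dif(paper_est, touch_est)
```

In this made-up example only Q2 is flagged, because its difficulty differs by about 0.65 logits between the two modes.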
All analyses were performed using SAS 9.2 and SPSS version 15.0. All Rasch analyses were performed using WINSTEPS version 3.68 [157]. A two-sided p-value of less than 0.05 was considered to indicate statistical significance.