The HIV Study - Microarray Experiments - The Behavior of the t-Statistic in Multiple Hypothesis Testing Problems

2.2 Microarray Experiments

2.2.2 The HIV Study

The human immunodeficiency virus (HIV) study, described by van’t Wout et al. (2003), contained 8 samples, 4 from HIV-positive patients and 4 from HIV-negative controls.

Each sample was assayed on a microarray measuring expression levels for the same m = 7680 genes.

Thus, we have an m × n matrix X = (x_ij) for the HIV study, where the m = 7680 rows denote genes and the n = 8 columns denote microarrays. Each row of X (i.e., each gene) yielded a two-sample t-statistic t_i comparing HIV-positive patients with HIV-negative controls, which was then transformed to a z-value z_i:

z_i = Φ⁻¹(G₀(t_i)), i = 1, 2, ..., m,

where Φ is the standard normal c.d.f. and G₀ is the c.d.f. of a standard Student’s t distribution with 6 degrees of freedom. Hence, we obtain m = 7680 test statistics z_i, whose distribution is displayed in Figure 1(b) (Efron, 2004, 2005, 2006, 2007; Gottardo et al., 2006).
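The transformation from t_i to z_i can be illustrated with a minimal sketch. It assumes the pooled-variance form of the two-sample t-statistic (which gives n − 2 = 6 degrees of freedom for n = 8), and it evaluates G₀ with the closed-form c.d.f. that Student's t admits for even degrees of freedom. The function names and the expression values are hypothetical and are not taken from the HIV data.

```python
import math
from statistics import NormalDist, mean, variance


def two_sample_t(group1, group2):
    """Pooled-variance two-sample t-statistic (assumed form, d.f. = n1 + n2 - 2)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1)
                  + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def t_cdf_6df(t):
    """G0: c.d.f. of Student's t with 6 d.f., using the closed form for even d.f."""
    sin_theta = t / math.sqrt(6 + t * t)   # theta = arctan(t / sqrt(6))
    cos2_theta = 6 / (6 + t * t)
    series = 1 + cos2_theta / 2 + 3 * cos2_theta ** 2 / 8
    return 0.5 + 0.5 * sin_theta * series


def z_value(t):
    """z = Phi^{-1}(G0(t)); inv_cdf requires 0 < G0(t) < 1, so |t| must not be extreme."""
    return NormalDist().inv_cdf(t_cdf_6df(t))


# Hypothetical expression levels for one gene (4 HIV-positive, 4 HIV-negative).
hiv_positive = [1.2, 0.8, 1.5, 1.1]
hiv_negative = [0.9, 1.0, 0.7, 1.2]
t_i = two_sample_t(hiv_positive, hiv_negative)
print("t_i =", t_i, " z_i =", z_value(t_i))
```

Applying z_value to every row of X would give the m z-values whose histogram is compared with N(0,1).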

The data from both the breast cancer study and the HIV study came from two-color cDNA microarrays. Such data usually undergo quality assessment and preprocessing (e.g., normalization) before they are used in multiple hypothesis testing (Dudoit et al., 2003; Gottardo et al., 2006; Gentleman et al., 2005).

Efron (2007) noted that in microarray experiments we usually presuppose that most of the genes are null, the goal being to identify a few significant non-null genes.

Therefore, we expect z_i to follow a standard normal distribution closely for null genes (Efron, 2007). In other words, under the null hypothesis, z_i should have a standard normal distribution if gene i has the same expression distribution for BRCA1 and BRCA2 patients, or for HIV-positive patients and HIV-negative controls. As described by Efron (2007), the heavy curves in Figure 1 indicate the N(0,1) theoretical null densities and the light curves indicate empirical null densities fit to the central z-values, as done by Efron (2004).

However, the histograms of z-values in Figure 1 show that the distribution of the z_i’s from the breast cancer study is wider than N(0,1), whereas that from the HIV study is narrower than N(0,1) (Efron, 2006, 2007). Efron (2007) pointed out that correlation in multiple hypothesis testing can make the observed z_i’s behave as N(0, σ²), where σ differs noticeably from 1. In the next section, we discuss correlation and other possible reasons for this phenomenon.

Figure 1: Histograms of z-Values From Two Microarray Experiments. (a) Breast cancer study, 3226 genes. (b) HIV study, 7680 genes. (This figure and descriptions are quoted from Efron (2007)).

3 The Empirical Distribution of the z_i’s

In this section, we discuss possible reasons why the distribution of the z_i’s differs markedly from N(0,1) in microarray experiments. First, Efron (2007) indicated that there were gene correlations in both the breast cancer data and the HIV data. Moreover, disease is caused by abnormal genes, and genes are intrinsically correlated in biology. Hence, we may say that there are gene correlation structures in the breast cancer data and the HIV data.

Second, Hedenfalk et al. (2001) reported that patients with primary breast cancer who had a family history of breast cancer, ovarian cancer, or both were asked to provide a blood sample to be tested for BRCA1 and BRCA2 mutations in hereditary breast cancer. If some of the patients come from the same family, some of their gene expression levels may be correlated. Hence, the patients may be correlated through kinship.

Furthermore, Efron (2004) indicated that the first four and the last four microarrays among the BRCA2 patients were mutually correlated. Moreover, since HIV infection is relatively rare, HIV patients often share common characteristics; for example, patients may be men who have sex with men, injecting drug users, or individuals infected through mother-to-child transmission. From the above, we may say that there are correlation structures among patients (i.e., microarrays).

Finally, if the data (x_ij) are independent and identically distributed (i.i.d.) normal random variables, the two-sample t-statistic follows a t-distribution under the null hypothesis and may be applied in multiple hypothesis testing. Conversely, if the data (x_ij) are i.i.d. random variables from some other distribution, the two-sample t-statistic may not follow the t-distribution.

Hence, as discussed above, we consider three possible reasons: (1) correlation between genes; (2) correlation among microarrays; (3) various distributional assumptions. In the next section, we discuss models for these possible reasons in more detail. We also use these models to simulate data and then compare the simulation results.

4 The Models and Simulation Study

To generate dependent data, we consider two kinds of time series models: the autoregressive (AR) model and the moving average (MA) model. We first introduce these two models.

Definition 1 An autoregressive model of order p, abbreviated AR(p), is defined to be

X_t = φ₁X_{t−1} + φ₂X_{t−2} + ... + φ_pX_{t−p} + Z_t,

where X_t is stationary, φ₁, φ₂, ..., φ_p (φ_p ≠ 0) are constants, and Z_t is a Gaussian white noise series with mean 0 and variance σ² (Chan, 2001; Shumway and Stoffer, 2005).
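As an illustration of Definition 1, the following minimal sketch simulates an AR(p) series by direct recursion. The function name simulate_ar, the zero start-up values, and the burn-in length are assumptions made here for illustration; the burn-in is only a practical device to approximate the stationary distribution and presumes that the chosen coefficients give a stationary process (for AR(1), |φ₁| < 1).

```python
import random
from statistics import variance


def simulate_ar(phi, n, sigma=1.0, burn_in=500, seed=None):
    """Simulate n values of X_t = phi[0]*X_{t-1} + ... + phi[p-1]*X_{t-p} + Z_t."""
    rng = random.Random(seed)
    p = len(phi)
    x = [0.0] * p                       # hypothetical start-up values, discarded below
    for _ in range(burn_in + n):
        z_t = rng.gauss(0.0, sigma)     # Gaussian white noise Z_t
        x.append(sum(phi[k] * x[-1 - k] for k in range(p)) + z_t)
    return x[p + burn_in:]              # keep only the last n values


# AR(1) with phi_1 = 0.6: the stationary variance is sigma^2 / (1 - phi_1^2).
series = simulate_ar([0.6], n=20000, sigma=1.0, seed=1)
print("sample variance:", variance(series), " theoretical:", 1.0 / (1 - 0.6 ** 2))
```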

Definition 2 A moving average model of order q, abbreviated MA(q), is defined to be

X_t = Z_t + θ₁Z_{t−1} + θ₂Z_{t−2} + ... + θ_qZ_{t−q},

where there are q lags in the moving average, θ₁, θ₂, ..., θ_q (θ_q ≠ 0) are constants, and Z_t is a Gaussian white noise series with mean 0 and variance σ² (Chan, 2001; Shumway and Stoffer, 2005).
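Definition 2 can be sketched in the same way. The function names simulate_ma and autocorr are hypothetical. For an MA(1) model the lag-1 autocorrelation is θ₁/(1 + θ₁²) and autocorrelations at lags beyond q vanish, which the example checks informally with θ₁ = 0.5.

```python
import random


def simulate_ma(theta, n, sigma=1.0, seed=None):
    """Simulate n values of X_t = Z_t + theta[0]*Z_{t-1} + ... + theta[q-1]*Z_{t-q}."""
    rng = random.Random(seed)
    q = len(theta)
    z = [rng.gauss(0.0, sigma) for _ in range(n + q)]   # q extra noise terms for the first lags
    return [z[t] + sum(theta[k] * z[t - 1 - k] for k in range(q))
            for t in range(q, n + q)]


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    mean_x = sum(x) / len(x)
    num = sum((x[t] - mean_x) * (x[t + lag] - mean_x) for t in range(len(x) - lag))
    den = sum((v - mean_x) ** 2 for v in x)
    return num / den


# MA(1) with theta_1 = 0.5: theoretical lag-1 autocorrelation is 0.5 / 1.25 = 0.4.
series = simulate_ma([0.5], n=20000, seed=2)
print("lag 1:", autocorr(series, 1), " lag 2:", autocorr(series, 2))
```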

Suppose a microarray experiment includes n = n₁ + n₂ patients, n₁ from group 1 and n₂ from group 2. For each patient, a microarray measures expression levels for the same m genes. We want to identify the genes that are differentially expressed between the two groups. Let X = (x_ij) be the m × n matrix of gene expression levels, where i = 1, ..., m indexes genes and j = 1, ..., n indexes microarrays.

In the simulation study, we choose m = 100000 genes and n = 14 (n₁ = n₂ = 7) microarrays. We then apply the multiple testing procedure to the simulated data, which yields m = 100000 z_i’s. In Figures 2 to 10, we plot the empirical distribution of the z_i’s for Models 1 to 12 as dashed lines and the N(0,1) distribution as solid lines. Specific characteristics of the data are described below.
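As a baseline for such comparisons, the following minimal sketch simulates the fully independent case: every x_ij is i.i.d. N(0,1), so each t_i has exactly a t-distribution with n − 2 = 12 degrees of freedom and the z_i’s should look like N(0,1). The sketch assumes the pooled-variance t-statistic and uses a smaller m than the study (5000 instead of 100000) to keep the run short; all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev, variance


def t_cdf_even_df(t, df):
    """c.d.f. of Student's t for an even number of degrees of freedom (finite series)."""
    sin_theta = t / math.sqrt(df + t * t)
    cos2_theta = df / (df + t * t)
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= (2 * k - 1) / (2 * k) * cos2_theta
        total += term
    return 0.5 + 0.5 * sin_theta * total


def simulate_null_z(m, n1, n2, seed=None):
    """z-values for m null genes with i.i.d. N(0,1) data; n1 + n2 - 2 must be even."""
    rng = random.Random(seed)
    df = n1 + n2 - 2
    std_normal = NormalDist()
    z = []
    for _ in range(m):
        g1 = [rng.gauss(0.0, 1.0) for _ in range(n1)]
        g2 = [rng.gauss(0.0, 1.0) for _ in range(n2)]
        pooled = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / df
        t_i = (mean(g1) - mean(g2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
        z.append(std_normal.inv_cdf(t_cdf_even_df(t_i, df)))
    return z


z_values = simulate_null_z(m=5000, n1=7, n2=7, seed=0)
print("mean of z:", mean(z_values), " sd of z:", stdev(z_values))
```

Departures of the empirical standard deviation from 1 under the dependent or non-normal models can then be judged against this independent baseline.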

4.1 Models of correlation between genes

In the following models, we consider that there is some correlation between genes, but there is no dependence between microarrays.
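One way to generate data with this structure is sketched below. It is only an illustrative assumption and not one of the specific models of this section: within each microarray (column), the expression values follow a stationary AR(1) process along the gene index with a hypothetical coefficient φ = 0.8, while different columns are generated independently. The innovation standard deviation is scaled so that every x_ij has marginal variance 1.

```python
import math
import random


def simulate_gene_correlated_matrix(m, n, phi, seed=None):
    """m x n matrix: AR(1) dependence across genes (rows), independent microarrays (columns)."""
    if abs(phi) >= 1:
        raise ValueError("phi must satisfy |phi| < 1 for a stationary AR(1)")
    rng = random.Random(seed)
    innov_sd = math.sqrt(1 - phi * phi)       # keeps the marginal variance equal to 1
    columns = []
    for _ in range(n):
        col = [rng.gauss(0.0, 1.0)]           # start in the stationary distribution
        for _ in range(m - 1):
            col.append(phi * col[-1] + rng.gauss(0.0, innov_sd))
        columns.append(col)
    # Transpose so that rows are genes and columns are microarrays.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


def adjacent_gene_correlation(X):
    """Pooled correlation between neighbouring genes, assuming mean-zero entries."""
    num = sum(a * b for r0, r1 in zip(X, X[1:]) for a, b in zip(r0, r1))
    den = sum(v * v for row in X for v in row)
    return num / den


X = simulate_gene_correlated_matrix(m=3000, n=14, phi=0.8, seed=3)
print("correlation between adjacent genes:", adjacent_gene_correlation(X))
```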
