Conclusion and discussion - Predictive Analysis of Gene Expression Microarrays Using a Microarray Reference Database and the Simple Student's t-Test

Our results suggest that expressed genes are those whose expression distribution has multiple modes, and that a simple t-test can be used for classification and for building a classifier. A classifier built on the simple t-test is easier to construct than other classifiers and does not require complicated data selection rules. In future work, we plan to investigate why some arrays are always misclassified by every classifier, even by PAM. For example, we could define criteria that distinguish well-separated genes from not-well-separated genes among the 5005 genes with two or more modes; the difference between the two kinds of genes is shown in Figure 5.1.
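The notion of a gene with two or more modes can be illustrated with a short sketch. This is not the implementation used in this study, which relied on the R function density; it is a minimal Python approximation that assumes a Gaussian kernel, the rule-of-thumb bandwidth 0.9 * min(sd, IQR/1.34) * n^(-1/5) scaled by an "adjust" factor, and a simple count of local maxima on a grid. The function names and the toy data are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values, adjust=1.0):
    """Bandwidth 0.9 * min(sd, IQR / 1.34) * n^(-1/5), scaled by `adjust`."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)  # assumes the data are not all identical
    return adjust * 0.9 * spread * n ** -0.2


def gaussian_kde_on_grid(values, adjust=1.0, grid_size=512):
    """Evaluate a Gaussian kernel density estimate on an evenly spaced grid."""
    h = rule_of_thumb_bandwidth(values, adjust)
    lo, hi = min(values) - 3 * h, max(values) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    norm = 1.0 / (len(values) * h * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        for x in grid
    ]
    return grid, density


def count_modes(values, adjust=1.0):
    """Count local maxima of the smoothed density.

    A tie with the right neighbour is allowed so that a flat top counts once.
    """
    _, d = gaussian_kde_on_grid(values, adjust)
    return sum(1 for i in range(1, len(d) - 1) if d[i - 1] < d[i] >= d[i + 1])


def toy_sample(mu, sigma, n):
    """Deterministic toy data: evenly spaced quantiles of a normal distribution."""
    dist = statistics.NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Hypothetical gene that is silent in some arrays (around 4) and expressed in others (around 10).
two_state_gene = toy_sample(4.0, 0.5, 100) + toy_sample(10.0, 0.5, 100)
# Hypothetical gene with a single expression level across all arrays.
one_state_gene = toy_sample(7.0, 1.0, 200)
```

Calling count_modes(two_state_gene) and count_modes(one_state_gene) separates the two toy genes. Raising adjust smooths the curve more strongly and can merge neighbouring modes, which is why the choice of smoothing parameters matters when deciding which genes count as multiple-mode.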

References

ACBB (Applied Computational Biology and Bioinformatics). Simpleaffy: easy analysis routines for Affymetrix data. http://bioinformatics.picr.man.ac.uk/research/software/simpleaffy/index.html

ACBB (Applied Computational Biology and Bioinformatics). Using Simpleaffy for Affymetrix QC. http://bioinformatics.picr.man.ac.uk/research/software/simpleaffy/qcstats.html

Affymetrix GeneChip Quick Guide (in Chinese). http://ipmb.sinica.edu.tw/affy/document/user_guide_c.pdf

Bolstad B. (2008). Some FAQ about computing the RMA expression measure. http://bmbolstad.com/misc/ComputeRMAFAQ/ComputeRMAFAQ.html

Bittner M, et al. (2000). “Molecular classification of cutaneous malignant melanoma by gene expression profiling.” Nature 406, 536-540.

Bolstad BM, Irizarry RA, Astrand M, and Speed TP. (2003). “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data”. Bioinformatics 19(2):185-193

Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S. (2005). “Bioinformatics and Computational Biology Solutions Using R and Bioconductor.” Springer. Chapters 12, 13 and 24

Harbron C, Chang KM, and South MC. (2007). “RefPlus: an R package extending the RMA Algorithm”. Bioinformatics 23(18):2493-2494, doi:10.1093/bioinformatics/btm357.

Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B and Speed TP. (2003). “Summaries of Affymetrix GeneChip probe level data”. Nucleic Acids Research 31(4):e15

Irizarry RA, Hobbs B, Collin F, et al. (2003). “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics 4(2):249-264

Ivliev AE. (2008). “Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data.” Nucleic Acids Research, 2008

Katz S, Irizarry RA, Lin X, Tripputi M and Porter MW. (2006). “A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database”. BMC Bioinformatics 7:464

PAM: Prediction Analysis for Microarrays. Class Prediction and Survival Analysis for Genomic Expression Data Mining. http://www-stat.stanford.edu/~tibs/PAM/

Prediction Analysis for Microarrays, for the R package. http://www-stat.stanford.edu/%7Etibs/PAM/Rdist/index.html

Pawitan Y, et al. (2005). “Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts.” Breast Cancer Research 2005

Tibshirani R, Hastie T, Narasimhan B and Chu G. (2002). “Diagnosis of multiple cancer types by shrunken centroids of gene expression”. PNAS 99:6567-6572 (May 14).

Wilson C, Pepper SD, Miller CJ. (2009). “QC and Affymetrix data”. http://bioinformatics.picr.man.ac.uk/downloads/QCandSimpleaffy.pdf

Zilliox MJ and Irizarry RA. (2007). “A gene expression bar code for microarray data”. Nature Methods, 2007

Figure 2.1. The design of probes for Microarray HGU133A chip

Figure 2.2. The distribution of expression intensity from different genes.

Figure 2.3. The idea behind choosing the cut-off point.

Figure 3.1. The distributions of the deleted arrays among all 2445 arrays for four quality assessment metrics. The red dots represent the deleted arrays. The black dots are the arrays retained after all the quality assessment steps.

Figure 3.2. Summary of different combinations of “n” and “adjust” when fitting a smoothed density function with the R function density(n, adjust). Lines of different colors represent smooth curves obtained with different “adjust” values.

Figure 4.1. The histogram of the mean expression value of all genes for the reference set. (Panel title: Histogram of ref.j; x-axis: ref.j; y-axis: Frequency.)

Figure 4.2. The histogram of the sample standard deviation of the expression value for all genes in the reference set. (Panel title: Histogram of SD(ref.j); x-axis: SD(ref.j); y-axis: Frequency.)

Figure 4.3. The histogram of the absolute difference in mean expression between the disease group in the testing set and the reference set. (Panel title: Histogram of |d.j-ref.j|; x-axis: |d.j-ref.j|; y-axis: Frequency.)

Figure 4.4. The histogram of the absolute difference in mean expression between the non-disease group in the testing set and the reference set. (Panel title: Histogram of |nd.j-ref.j|; x-axis: |nd.j-ref.j|; y-axis: Frequency.)

Figure 4.5. Summary of classification results. Each point marks one successful classification. Green points and black points are the results of the simple t-test using the 5005 multiple-mode genes and all 22283 genes, respectively.

Figure 5.1. The difference between well-separated genes and not-well-separated genes.

Table 3.1. Summary of QC steps

                          GEO     AE   Total
Before QC                1886    559    2445
Scale factor             -359   -112    -471
Average background        -56      0     -56
3'/5' ratios             -302    -98    -400
Percent present calls      -4    -13     -17
After QC                 1165    336    1501
Remove same type          943    336    1279
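The bookkeeping in Table 3.1 can be checked with a few lines of arithmetic. The sketch below only re-adds the counts printed in the table; the variable names are hypothetical and it says nothing about how the individual QC thresholds were chosen.

```python
# Counts transcribed from Table 3.1; arrays removed at each QC step are negative.
before_qc = {"GEO": 1886, "AE": 559}
removed = {
    "Scale factor": {"GEO": -359, "AE": -112},
    "Average background": {"GEO": -56, "AE": 0},
    "3'/5' ratios": {"GEO": -302, "AE": -98},
    "Percent present calls": {"GEO": -4, "AE": -13},
}

# Arrays remaining per source after all four QC steps.
after_qc = {
    source: before_qc[source] + sum(step[source] for step in removed.values())
    for source in before_qc
}
total_after_qc = sum(after_qc.values())
```

The per-source results agree with the "After QC" row, and their sum agrees with its Total column.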

Table 3.2. The distribution of the number of arrays in each tissue type

Number of arrays in one tissue type    1   2~5   6~10   11~20   21~30   31~50   51~70   total
Number of tissue types                 8    14      9      16      15       8       4      74

Table 3.3. The number of arrays in each tissue type

Tissue n Tissue n Tissue n Tissue n

beta cell islets 1 Theca cell 4 umbilical cord blood 13 brain 29

medulla oblongata 1 Normal_Ovary 5 thymus 14

unknown tissue

substantia nigra 15 skeletal muscle 33 Normal Colon

1 adipose tissue 8 Undifferentiated human ES cells

lines 17 duodenal tissue 40 Normal Thalamus

1 prostate 8 Human optic nerve head astrocytes 18

line) 8 hypothalamus 22

peripheral

9 Bronchial Epithelium 23 lateralis muscle 48

Normal Heart

PBSC CD34

selected cells 10 T cells resting 23

Human

macrophages 11 cerebellum 24 bone marrow 56 spinal cord 2 Normal Bladder 11 Normal Kidney 25 lung 63 salivary gland 2 testis 11 uterus 25 whole blood 67

Pituitary 2 tonsil 11 esophageal epithelium 26 Normal Amygdala

3 synovial

membrane 11 Frontal Cortex 26

intestinal

xenograft tissue 3 B-cells 12

blood (cell type :

12 placental basal plate 27

occipital lobe 4

peripheral blood

CD8 T cells 12 blood CD4 T cells 27

Table 4.1. The results of the leave-one-out cross-validation using various classification rules

Number of correctly classified arrays

                   Method 1   Method 2   Method 3   Method 4   Method 5   Method 6   Method 7 (PAM)   Total
Use all probes           74         80         80         80         70         80               88     108
(%)                   68.52      74.07      74.07      74.07      64.81      74.07            81.48     100
Use 5005 probes          70         77         77         79         69         78               87     108
(%)                   64.81      71.30      71.30      73.15      63.89      72.22            80.56     100
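Each percentage row in Table 4.1 is the number of correctly classified arrays divided by the 108 arrays in the leave-one-out cross-validation. A minimal sketch of that computation follows; the counts are transcribed from the table and the helper name is hypothetical.

```python
TOTAL_ARRAYS = 108  # arrays in the leave-one-out cross-validation (Table 4.1)

# Correctly classified arrays for Methods 1-7 (Method 7 is PAM), from Table 4.1.
correct_all_probes = [74, 80, 80, 80, 70, 80, 88]
correct_5005_probes = [70, 77, 77, 79, 69, 78, 87]


def accuracy_percent(correct, total=TOTAL_ARRAYS):
    """Percentage of correctly classified arrays, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


pct_all = [accuracy_percent(c) for c in correct_all_probes]
pct_5005 = [accuracy_percent(c) for c in correct_5005_probes]
```

Comparing pct_all with pct_5005 position by position shows how little accuracy changes when the classifier is restricted to the 5005 multiple-mode probes.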
