Our results demonstrate that expressed genes are those whose expression distributions have multiple modes, and that a simple t-test can be used to build a classifier. A classifier based on the simple t-test is easier to construct than other classifiers and does not require complicated data selection rules. In future work, we plan to investigate why some arrays are always classified incorrectly by every classifier, even by PAM. For example, we could try to establish criteria that distinguish well separated genes from not well separated genes among the 5005 genes with two or more modes; the difference between the two kinds of genes is shown in Figure 5.1.
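To make the idea of a t-test-based classifier concrete, the following is a minimal sketch in Python. It is not the exact rule used in this study: the function names, the use of the Welch form of the t statistic, the choice of keeping the k genes with the largest |t|, and the nearest-class-mean assignment are all assumptions made for illustration, and the expression values are toy numbers.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t statistic for one gene (unequal variances allowed)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def select_genes(disease, normal, k):
    """Rank genes by |t| and keep the k most discriminating ones.

    disease, normal: dict mapping gene name -> list of expression values.
    """
    scores = {g: abs(welch_t(disease[g], normal[g])) for g in disease}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def classify(array, disease, normal, genes):
    """Assign an array to the class whose mean profile is closer
    (squared distance over the selected genes)."""
    d = sum((array[g] - mean(disease[g])) ** 2 for g in genes)
    n = sum((array[g] - mean(normal[g])) ** 2 for g in genes)
    return "disease" if d < n else "non-disease"


# Toy data (hypothetical): g1 separates the groups, g2 does not.
disease = {"g1": [8.1, 8.4, 7.9], "g2": [5.0, 5.2, 4.9]}
normal = {"g1": [5.0, 5.3, 4.8], "g2": [5.1, 4.9, 5.0]}

genes = select_genes(disease, normal, k=1)
label = classify({"g1": 7.8, "g2": 5.0}, disease, normal, genes)
print(genes, label)
```

In this toy example only one gene is kept, so the new array is assigned by its distance to the two group means of that gene.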
References
ACBB (Applied Computational Biology and Bioinformatics). Simpleaffy: easy analysis routines for Affymetrix data.
http://bioinformatics.picr.man.ac.uk/research/software/simpleaffy/index.html
ACBB (Applied Computational Biology and Bioinformatics). Using Simpleaffy for Affymetrix QC.
http://bioinformatics.picr.man.ac.uk/research/software/simpleaffy/qcstats.html
Affymetrix GeneChip Quick Guide (in Chinese).
http://ipmb.sinica.edu.tw/affy/document/user_guide_c.pdf
Bolstad B. (2008). Some FAQ about computing the RMA expression measure.
http://bmbolstad.com/misc/ComputeRMAFAQ/ComputeRMAFAQ.html
Bittner M, et al. (2000). “Molecular classification of cutaneous malignant melanoma by gene expression profiling.” Nature 406, 536-540.
Bolstad BM, Irizarry RA, Astrand M, and Speed TP. (2003). “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data”. Bioinformatics 19(2):185-193
Gentleman R, Carey VJ, Huber W, Irizarry RA, Dudoit S. (2005). “Bioinformatics and Computational Biology Solutions Using R and Bioconductor.” Springer. Chapters 12, 13 and 24
Harbron C, Chang KM, and South MC. (2007). “RefPlus: an R package extending the RMA Algorithm”. Bioinformatics 23(18):2493-2494, doi:10.1093/bioinformatics/btm357.
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B and Speed TP. (2003). “Summaries of Affymetrix GeneChip probe level data”. Nucleic Acids Research 31(4):e15
Irizarry RA, Hobbs B, Collin F, et al. (2003). “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics 4(2):249-264
Ivliev AE. (2008). “Microarray retriever: a web-based tool for searching and large scale retrieval of public microarray data.” Nucleic Acids Research, 2008
Katz S, Irizarry RA, Lin X, Tripputi M and Porter MW. (2006). “A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database”. BMC Bioinformatics 2006, 7:464
PAM: Prediction Analysis for Microarrays. Class Prediction and Survival Analysis for Genomic Expression Data Mining.
http://www-stat.stanford.edu/~tibs/PAM/
Prediction Analysis for Microarrays, for the R package.
http://www-stat.stanford.edu/%7Etibs/PAM/Rdist/index.html
Pawitan Y, et al. (2005). “Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts.” Breast Cancer Research 2005
Tibshirani R, Hastie T, Narasimhan B and Chu G. (2002). “Diagnosis of multiple cancer types by shrunken centroids of gene expression”. PNAS 2002 99:6567-6572 (May 14).
Wilson C, Pepper SD, Miller CJ. (2009). “QC and Affymetrix data”
http://bioinformatics.picr.man.ac.uk/downloads/QCandSimpleaffy.pdf
Zilliox MJ and Irizarry RA. (2007). “A gene expression bar code for microarray data”. Nature Methods, 2007
Figure 2.1. The design of probes for Microarray HGU133A chip
Figure 2.2. The distribution of expression intensity from different genes.
Figure 2.3. The idea behind choosing the cut-off point.
Figure 3.1. The distributions of the deleted arrays among all 2445 arrays for the four quality assessment metrics. The red dots represent the deleted arrays. The black dots are the arrays kept after all the quality assessment steps.
Figure 3.2. Summary of different combinations of “n” and “adjust” when fitting a smoothed density function with the R function density(n, adjust). Lines of different colors represent smooth curves obtained with different “adjust” values.
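The effect of the smoothing settings on the number of modes can be sketched as follows. This is an illustrative Python re-implementation, not the R code used in the study: it assumes that “n” is the number of grid points and “adjust” is a multiplier on a rule-of-thumb bandwidth (Silverman's rule, as in the default of R's density()), and the function names and toy expression values are hypothetical.

```python
from math import exp, pi, sqrt
from statistics import quantiles, stdev


def gaussian_kde(values, n=512, adjust=1.0):
    """Evaluate a Gaussian kernel density estimate on n equally spaced points.

    The base bandwidth follows Silverman's rule of thumb; `adjust` scales it.
    """
    m = len(values)
    q1, _, q3 = quantiles(values, n=4)
    # Fall back to the standard deviation if the interquartile range is zero.
    spread = min(stdev(values), (q3 - q1) / 1.34) or stdev(values)
    bw = adjust * 0.9 * spread * m ** (-0.2)
    lo, hi = min(values) - 3 * bw, max(values) + 3 * bw
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    norm = 1.0 / (m * bw * sqrt(2 * pi))
    dens = [norm * sum(exp(-0.5 * ((g - v) / bw) ** 2) for v in values)
            for g in grid]
    return grid, dens


def count_modes(dens):
    """Count strict local maxima of the density curve."""
    return sum(1 for i in range(1, len(dens) - 1)
               if dens[i - 1] < dens[i] > dens[i + 1])


# Toy gene (hypothetical): unexpressed in eight arrays, expressed in four.
expr = [3.7, 3.8, 3.9, 4.0, 4.0, 4.1, 4.2, 4.3, 8.9, 9.0, 9.1, 9.2]

_, d1 = gaussian_kde(expr, n=512, adjust=1.0)
_, d3 = gaussian_kde(expr, n=512, adjust=3.0)
modes_default = count_modes(d1)
modes_smooth = count_modes(d3)
print(modes_default, modes_smooth)
```

A larger “adjust” gives a smoother curve and can merge two modes into one, which is why the choice of smoothing settings affects which genes are counted as having two or more modes.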
Figure 4.1. The histogram of the mean expression value of all genes for the reference set (x-axis: ref.j; y-axis: frequency).
Figure 4.2. The histogram of the sample standard deviation of the expression value for all genes in the reference set (x-axis: SD(ref.j); y-axis: frequency).
Figure 4.3. The histogram of the absolute difference in mean expression between the disease group in the testing set and the reference set (x-axis: |d.j-ref.j|; y-axis: frequency).
Figure 4.4. The histogram of the absolute difference in mean expression between the non-disease group in the testing set and the reference set (x-axis: |nd.j-ref.j|; y-axis: frequency).
Figure 4.5. Summary of classification results. Each point represents one successful classification. Green points are the results of the simple t-test using the 5005 multiple-mode genes, and black points are the results using all 22283 genes.
Figure 5.1. The difference between well separated genes and not well separated genes.
Table 3.1. Summary of QC steps
                           GEO     AE  Total
Before QC                 1886    559   2445
Scale factor              -359   -112   -471
Average background         -56      0    -56
3'/5' ratios              -302    -98   -400
Percent present calls       -4    -13    -17
After QC                  1165    336   1501
Remove same type           943    336   1279
Table 3.2. The distribution of the number of arrays in each tissue type
Number of arrays in one tissue type      1    2~5   6~10  11~20  21~30  31~50  51~70  Total
Number of tissue types                   8     14      9     16     15      8      4     74
Table 3.3. The number of arrays in each tissue type
Tissue                                    n
beta cell islets                          1
medulla oblongata                         1
unknown tissue
Normal Colon                              1
Normal Thalamus                           1
Normal Heart                              2
spinal cord                               2
salivary gland                            2
Pituitary                                 2
Normal Amygdala                           3
intestinal xenograft tissue               3
occipital lobe                            4
Theca cell                                4
Normal_Ovary                              5
adipose tissue                            8
prostate                                  8
line)                                     8
peripheral                                9
PBSC CD34 selected cells                 10
Human macrophages                        11
Normal Bladder                           11
testis                                   11
tonsil                                   11
synovial membrane                        11
B-cells                                  12
blood (cell type :                       12
peripheral blood CD8 T cells             12
umbilical cord blood                     13
thymus                                   14
substantia nigra                         15
Undifferentiated human ES cells lines    17
Human optic nerve head astrocytes        18
hypothalamus                             22
Bronchial Epithelium                     23
T cells resting                          23
cerebellum                               24
Normal Kidney                            25
uterus                                   25
esophageal epithelium                    26
Frontal Cortex                           26
placental basal plate                    27
blood CD4 T cells                        27
brain                                    29
skeletal muscle                          33
duodenal tissue                          40
lateralis muscle                         48
bone marrow                              56
lung                                     63
whole blood                              67
Table 4.1. The results of the leave-one-out cross validation using various classification rules (number of correctly classified arrays)
                  Method 1  Method 2  Method 3  Method 4  Method 5  Method 6  Method 7 (PAM)  Total
Use all probes          74        80        80        80        70        80              88    108
(%)                  68.52     74.07     74.07     74.07     64.81     74.07           81.48    100
Use 5005 probes         70        77        77        79        69        78              87    108
(%)                  64.81     71.30     71.30     73.15     63.89     72.22           80.56    100
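The leave-one-out procedure and the percentage rows of Table 4.1 can be illustrated with a short sketch. The nearest-mean rule below is a hypothetical stand-in for the classification methods of this study, the one-dimensional toy data are invented for the example, and all function names are assumptions; only the counts 88 and 108 are taken from the table.

```python
from statistics import mean


def nearest_mean_predict(train_x, train_y, x):
    """Hypothetical stand-in classifier: pick the class with the closest mean."""
    classes = sorted(set(train_y))
    centers = {c: mean(v for v, y in zip(train_x, train_y) if y == c)
               for c in classes}
    return min(classes, key=lambda c: abs(x - centers[c]))


def leave_one_out(xs, ys, predict):
    """Hold out each sample once, train on the rest, count correct predictions."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct


def percent(correct, total):
    """Percentage of correctly classified arrays, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


# Toy data (hypothetical): the last sample lies closer to the other class.
xs = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 2.1]
ys = ["a", "a", "a", "b", "b", "b", "a"]

n_correct = leave_one_out(xs, ys, nearest_mean_predict)
print(n_correct, percent(n_correct, len(xs)))

# Percentage for 88 correctly classified arrays out of 108.
print(percent(88, 108))
```

Each percentage in the table is obtained in the same way, as the number of correctly classified arrays divided by the 108 arrays used in the cross validation.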