
Gene Expression Analysis for Cancer-Related Genes Using Public Microarray Databases (分析公開微陣列資料庫中癌症相關基因的基因表現)


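Taipei Medical University, Graduate Institute of Medical Informatics
Master's Thesis

Gene Expression Analysis for Cancer-Related Genes Using Public Microarray Databases
(分析公開微陣列資料庫中癌症相關基因的基因表現)

Advisor: Chiu, Hung-Wen (邱泓文)
Graduate student: Chen, Lillian Yu-Hsuan (陳宇瑄)

January 2006 (year 95 of the Republic of China calendar)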

Acknowledgements

Two and a half years have passed almost without my noticing. Although some setbacks and frustrations along the way nearly made me give up my studies, I am glad that I persevered; otherwise I would certainly have regretted it later.

When I look back on the days I spent completing my master's degree, I cannot overlook the great support of my beloved family, including Daddy, Mommy, Brother Cliff, my aunties and Grandma. Without your warm care and continuous encouragement, I would not have made it to this point.

I cannot fully express my gratitude to my advisor, Dr. Hung-Wen Chiu, for his superb guidance and generosity throughout the past two and a half years. Without Dr. Chiu's generous assistance, I could not have completed this master's thesis.

My appreciation also goes to my dearest friends and colleagues, including Kiss, YiFen, Cool8, Ted, amebajoe, FHHUNG and ncrain from GIMI; Debby, Mike, Paul, Elisa and Jack from UBC; and David, Sylvia, Grace, Kevin, Dr. Shau and Dr. Luke Lin from GSK.

Lastly, very special thanks to Linus, who believes in me and is always there to share ideas and offer constant support.

Written on 27 Jan 2006 at the Graduate Institute of Medical Informatics, Taipei Medical University.

Table of Contents

Title Page
Signature and Approval Page
Online Publication Authorization (上網授權書)
National Science Council Authorization (國科會授權書)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Abstract in Chinese (論文摘要)
Abstract

I. Introduction
    1.1 Background
    1.2 Motivation
    1.3 Objective
II. Literature Review
    2.1 Cancer
    2.2 OMIM
    2.3 Cancer Genome Anatomy Project
    2.4 Microarray
        2.4.1 cDNA Microarray Technology
        2.4.2 Oligonucleotide Microarray Technology
    2.5 Microarray Databases
        2.5.1 Stanford Microarray Database
        2.5.2 Gene Expression Omnibus
    2.6 Biological Pathways
    2.7 Gene Ontology
III. Materials and Methods
    3.1 Data Source
    3.2 Classification of Cancer-Related Genes from OMIM Database
    3.3 Microarray Data Extraction from GEO Database
    3.4 Microarray Data Extraction from SMD Database
    3.5 Datasets Analysis
        3.5.1 Downloaded Datasets Processing
        3.5.2 Search for Cancer-Related Gene Involvement in Biological Pathway
IV. Result
    4.1 Cancer-Related Genes from OMIM
    4.2 Microarray Gene Expression
        4.2.1 Cancer-Related Genes from GEO Database
        4.2.2 Cancer-Related Genes from SMD Database
    4.3 Expression Patterns for Cancer-Related Genes
        4.3.1 GEO-Downloaded Expressions for OMIM Cancer-Related Genes
        4.3.2 SMD-Downloaded Expressions for OMIM Cancer-Related Genes
    4.4 Gene Ontology for the Cancer-Related Genes
V. Discussion and Conclusion
    5.1 Biological Pathway for OMIM and Microarray Cancer-Related Genes
    5.2 Comparison between OMIM and Microarray Databases
    5.3 Perspective
    5.4 Conclusion
VI. References
Appendix
    Appendix I: Twenty-three cancer-related genes and their associated pathways from the intersection of OMIM cancer-related gene lists for the colon, liver, pancreas and stomach tissues
    Appendix II: Nineteen cancer-related genes and their associated pathways from the intersection of OMIM cancer-related gene lists for the breast, prostate and cervical tissues

List of Figures

Figure 1: GeneFinder search result for TP53 gene
Figure 2: Visualization of TP53 gene expression in NCI60_Stanford
Figure 3: Overview Process of Making cDNA Microarray Chips
Figure 4: Oligonucleotide microarray technology
Figure 5: Oligonucleotide Probe Pair Design
Figure 6: An overview of the metabolic pathways
Figure 7: DNA sequences are transcribed into RNA sequences in the nucleus
Figure 8: Biosignaling transduction
Figure 9: Flowchart for extracting cancer-related gene lists from the OMIM database
Figure 10: Browse all the available datasets on GEO
Figure 11: Sort out all GEO datasets according to organisms
Figure 12: GEO dataset download and information page
Figure 13: Process flowchart for extracting the datasets from GEO database
Figure 14: Query publications related to breast cancer using the search engine
Figure 15: Using SMD-incorporated data analysis and retrieval system to extract required information
Figure 16: Snapshot of the downloaded dataset format
Figure 17: Process flowchart to extract datasets from SMD database
Figure 18: Dataset format for the analysis
Figure 19: Interface of the analytical tool
Figure 20: Microarray expression datasets analysis result
Figure 21: Intersection of OMIM cancer-related gene lists for the colon, liver, pancreas and stomach tissues
Figure 22: Intersection of OMIM cancer-related gene lists for the breast, prostate and cervical tissues
Figure 23: Intersection of GEO cancer-related gene lists for the breast, liver, lung, prostate and pancreatic tissues
Figure 24: Intersection of SMD cancer-related gene lists for the breast, liver, lung, prostate, pancreatic and stomach tissues
Figure 25: Gene expression levels from GEO database for OMIM cancer-related genes
Figure 26: Gene expression levels from SMD database for OMIM cancer-related genes
Figure 27: Gene ontology for OMIM cancer-related genes in terms of molecular function
Figure 28: Gene ontology for GEO cancer-related genes in terms of molecular function
Figure 29: Gene ontology for SMD cancer-related genes in terms of molecular function
Figure 30: Schematic presentation of the Wnt signaling pathway

List of Tables

Table 1: Explanation of the name, content and search tips in different search field
Table 2: Summary of the Synonyms for Ten Different Cancer Types Defined by ICD-O
Table 3: Summary of the Synonyms for Ten Different Cancer Types Defined by MeSH
Table 4: Summary of cancer-related literatures used for analysis from the GEO database
Table 6: Summary of cancer-related literatures used for the analysis from SMD database
Table 7: Number of cancer-related genes in relation to each specific cancer type
Table 8: Summary of OMIM defined cancer-related genes in relation to the biological pathways
Table 9: Combined summary for the intersection of gene lists from liver, stomach, colon and pancreas plus the gene lists intersection from breast, cervical and prostate in relation to the biological pathways
Table 10: Gene ontology for cancer-related genes from OMIM, GEO and SMD in the molecular function category

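Abstract in Chinese (論文摘要)

Thesis title: Gene Expression Analysis for Cancer-Related Genes Using Public Microarray Databases (分析公開微陣列資料庫中癌症相關基因的基因表現)
Graduate Institute of Medical Informatics, Taipei Medical University
Graduate student: Lillian Yu-Hsuan Chen (陳宇瑄)
Graduation: first semester of academic year 94 (Republic of China calendar)
Advisor: Hung-Wen Chiu (邱泓文), Associate Professor, Graduate Institute of Medical Informatics, Taipei Medical University

Background: Cancer is a disease common to the whole world today, and developing drugs that treat it effectively is a principal goal of the medical and pharmaceutical community. Applications of microarray technology have become mainstream in biomedical research in recent years. Many studies use microarrays to observe gene expression in human cancer cell lines or organ tissues and thereby identify cancer-related genes efficiently.

Materials and Methods: We mined the OMIM database for cancer-related genes and compiled a list of cancer-related genes. We also collected human cancer-related microarray data from the GEO and SMD databases, classified the data by cancer type, and applied statistical principles to screen for candidate gene lists. We then analyzed the biochemical pathways of these gene lists and compared the differences between the results obtained from the different database sources.

Results and Discussion: Intersecting the OMIM gene lists for ten different cancers gave three common cancer-related genes: APC, CDKN2A and PTEN. Intersecting the cancer-related gene lists for six organs (breast, prostate, liver, lung, pancreas and stomach) showed that APC, CDKN2A, PTEN, TP53 and BRAF are common cancer-related genes. In addition, intersecting the cancer-related gene lists obtained from the microarray databases gave nine genes from the oligonucleotide microarray data, namely FOXM1, HNRPDL, BIN1, BUB3, CCNI, PMS1, PRKCBP1, PURA and RPA3, and six common cancer-related genes from the cDNA microarray data, namely ARGBP2, CD53, FCGBP, JUN, MME and VBP1. The common cancer genes we observed all play important roles in the Wnt signaling pathway.

Conclusion: Intersecting the OMIM gene lists for the ten major cancers gave three common cancer-related genes. Intersecting the gene lists for six cancers gave five common cancer-related genes, and we compared whether the expression of these genes differs in the microarray data. We conclude that the cancer-related genes found in OMIM do not necessarily agree with the results of microarray experiments.

Keywords: cancer, cancer-related genes, microarray, OMIM, GEO, SMD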

Abstract

Title of Thesis: Gene Expression Analysis for Cancer-Related Genes Using Public Microarray Databases
Author: Lillian Yu-Hsuan Chen
Thesis advised by: Hung-Wen Chiu, Taipei Medical University, Graduate Institute of Medical Informatics

Introduction: Cancer draws attention worldwide, and the development of effective drugs is a central focus of today's medical research. Microarray technologies have become a trend in biological research over the last few years, and using microarray data to monitor gene expression in human cell lines and tissues is an efficient way to identify cancer-related genes.

Materials and Methods: The cancer-related gene lists were obtained by reviewing the literature in the OMIM database. The microarray expression datasets were downloaded from the GEO and SMD websites. After collecting the cancer-related genes and the microarray expression data, we classified them by the cancer type associated with each record.

Results and Discussion: Intersecting the ten OMIM cancer-related gene lists gave APC, CDKN2A and PTEN as the three common cancer-related genes. Intersecting the lists for breast, prostate, liver, lung, pancreatic and stomach tissues gave APC, CDKN2A, PTEN, TP53 and BRAF. Based on microarray gene expression, intersecting the cancer-related gene lists from oligonucleotide arrays gave nine common genes: FOXM1, HNRPDL, BIN1, BUB3, CCNI, PMS1, PRKCBP1, PURA and RPA3. Intersecting the cancer-related gene lists from cDNA arrays gave six common genes: ARGBP2, CD53, FCGBP, JUN, MME and VBP1. Many of these cancer-related genes were found to play important roles in the Wnt signaling pathway.

Conclusion: Three OMIM cancer-related genes were common to ten cancer types, and five OMIM cancer-related genes were obtained from the intersection of six cancer types. Cancer-related genes mentioned in OMIM are not necessarily supported by microarray gene expression patterns.

Keywords: Cancer, Cancer-Related Genes, Microarray, OMIM, GEO, SMD

I. Introduction

1.1 Background

Cancer has become one of the deadliest diseases worldwide. According to statistical reports published by the World Health Organization (WHO), lung, colorectal and stomach cancers have for many years been the three major cancer types affecting both sexes globally. Lung and stomach cancers are the leading threats among men, as breast and cervical cancers are among women. WHO has estimated that more than ten million people are diagnosed with cancer each year, and that by 2020 approximately fifteen million new cancer patients will be reported annually (Yang et al., 2002).

According to the top ten leading causes of death for 2004 published by the Taiwanese Department of Health, cancer was once again the leading cause of death, for the twenty-second consecutive year. Among cancer types, lung (19.67%), liver (19.42%), colorectal (19.73%), breast (3.68%, calculated on the female population only) and stomach (6.88%) cancers ranked as the five deadliest in Taiwan. Beyond these grim statistics, on average one Taiwanese person dies of cancer approximately every fifteen minutes (DOH Website, 2004).

Because cancer is life-threatening, it has drawn great attention in the medical field, and many researchers around the world have dedicated their time to finding better treatments. To date, however, no perfect medication has been developed. Effective treatment of cancer patients relies on a better understanding of the tumour genes associated with each specific cancer type. Scientists have used high-throughput methods to define the relationship between cancers and genes; gel electrophoresis, microarray technology and serial analysis of gene expression (SAGE) are the most popular ones known today. Among them, microarray technology is best known for its ability to determine gene expression in relation to cancer types (Liotta et al., 2000; Nelson et al., 2000; SEER's Training Website, 2005).

1.2 Motivation

Current cancer research favours the idea that genetic mutation drives the initial formation of malignant tumours. It is generally believed that cancer begins at the cellular level: the disease initiates in a single cell that eventually passes its acquired abnormality on to its progeny (Lu et al., 2003). In Aranda-Anzaldo's view, those initiated cells must contain a few "cancer-causing genes" in their DNA. These cancer-causing genes may well remain latent for a long time, waiting to be triggered by a cancer-promoting agent. Even if the cancer-causing genes are never activated, or never transform cells into lethal cancerous cells, they still carry a degree of risk that might affect people's lives (Aranda-Anzaldo et al., 2001).

Within the OMIM database, scientists at Johns Hopkins University have reviewed a large body of literature and categorized it into groups by research topic and content. Articles that contain information on genes that cause cancer can easily be singled out by limiting the search to "cancer", "carcinoma" and "tumor", and a list of cancer-related genes can then be compiled by reading through these articles (OMIM, 2000). Meanwhile, microarray technologies have become a trend in biological research over the last few years for monitoring gene expression in human cell lines and tissues. Earlier work on gene expression levels in different cancer types measured by microarray hybridization suggests that this is a useful, and eventually essential, method for identifying possible biomarkers and drug targets.

Today, researchers worldwide use microarray gene expression mostly to categorize different cancer types. Moreover, the literature found in the OMIM database reveals that different cancer types show different microarray gene expression. Based on these two observations, we would like to find out whether one or more genes are related to several types of cancer at once, using the OMIM literature as our source of evidence for cancer genes and microarray gene expression to confirm our findings.

1.3 Objective

Our goal is to focus on the cancer-related genes mentioned in the literature. We would like to identify the gene-tissue relationships as well as how those genes are expressed in normal and cancer tissues. Based on the cancer-related gene list obtained from the OMIM database, we match those genes with the gene expression values from the microarray datasets. At the same time, we also determine cancer-related gene lists from the microarray expression datasets themselves.

After collecting all the relevant data, including the cancer-related genes from both the OMIM database and the publicly accessible microarray databases, together with the microarray expression datasets, we further analyze the relationships between those cancer-related gene lists and the biological pathways in the KEGG database. Once the relationship between cancer-related genes and pathways has been determined, we would like to see whether one or more genes are located upstream in a pathway. Medical researchers could then develop both preventive measures and more effective cancer treatments that specifically target those genes.

II. Literature Review

2.1 Cancer

In every healthy human body, ten million cells undergo normal cell division every minute. When a cell undergoes mutations in its deoxyribonucleic acid (DNA), the genetic material that carries the hereditary code of the human body, it becomes a cancerous cell that reproduces without restraint. In other words, cancerous cells not only divide faster than normal cells, but also grow indefinitely and immaturely (Affymetrix et al., 2001; Lu et al., 2003).

As time passes, a single cancerous cell eventually grows into a microscopic collection of cells and ultimately begins to invade surrounding tissues. Each cancer has its own distinctive course. In leukemia, for example, the abnormal cells disperse throughout the body via the bloodstream and bone marrow. In most other cancer types, however, a mass of cancer cells called a tumour grows at its own rate. Some tumours double in size in a month, while others require two months or even more than a year to double (Nelson et al., 2000; SEER's Training Website, 2005).

Tumours can be categorized into two types: benign and malignant. Benign tumours remain localized to the tissue where they arise; they may grow large but do not spread to other parts of the body. If diagnosed at an early stage, they can be cured by surgical removal or by radiation therapy. Malignant, or cancerous, tumours are a more serious matter. Some of their cells may break off, invading and destroying surrounding tissue or travelling through the blood or lymph to distant parts of the body, where new tumours may form. From these new tumours, malignant cells can break off again and establish even more colonies; this invasive process is known as metastasis. For example, breast cancer and lung cancer have many different characteristics. When metastatic breast cancer is observed in the lungs, however, the characteristics of lung cancer are not readily seen under a microscope, and the cancer in the lung behaves just like a cancer that originated in the breast. It is therefore important to understand that a cancer originating in one organ carries its characteristics with it even if it spreads to another part of the body (Liotta et al., 2000; Nelson et al., 2000; SEER's Training Website, 2005).

2.2 OMIM

OMIM, short for Online Mendelian Inheritance in Man, is a continuously updated catalog of human genes and inherited genetic disorders, authored and edited by Dr. Victor A. McKusick and his coworkers at Johns Hopkins University. The database, provided by the National Center for Biotechnology Information, is publicly accessible on the World Wide Web at http://www.ncbi.nlm.nih.gov/omim/. OMIM contains not only a variety of textual information and references, but also links to records in the Entrez system and to relevant resources at MEDLINE and the NCBI databases. As of 25 Jun 2005, OMIM included a total of 16,115 entries and 9,288 loci entries for the synopsis of the Human Gene Map (OMIM, 2000).

Each OMIM entry is assigned a unique six-digit number whose first digit indicates the mode of inheritance of the gene involved. For example, 100000- and 200000- both refer to autosomal loci or phenotypes created before 15 May 1994. The numbering 300000- denotes X-linked loci or phenotypes, while 400000- denotes Y-linked loci or phenotypes. Mitochondrial loci or phenotypes are numbered 500000-, while autosomal loci or phenotypes created after 15 May 1994 are numbered starting with 600000-. An allelic variant is named after its parent entry, followed by a decimal point and a distinct four-digit variant number. For example, the beta-globin locus (HBB) is numbered 141900, and sickle hemoglobin is numbered 141900.0243.
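The numbering scheme described above can be illustrated with a short sketch. This is not part of the thesis workflow: the function name `parse_mim_number` and the dictionary `MIM_PREFIX` are hypothetical, and the category labels are transcribed only from the paragraph above, so first digits that the text does not mention are rejected rather than guessed.

```python
import re

# First digit -> category, transcribed from the description in the text.
# Both the dictionary name and the wording of the labels are illustrative.
MIM_PREFIX = {
    "1": "autosomal locus or phenotype (created before 15 May 1994)",
    "2": "autosomal locus or phenotype (created before 15 May 1994)",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "mitochondrial locus or phenotype",
    "6": "autosomal locus or phenotype (created after 15 May 1994)",
}

# Six digits, optionally followed by a decimal point and a four-digit variant.
_MIM_PATTERN = re.compile(r"^(\d{6})(?:\.(\d{4}))?$")


def parse_mim_number(mim):
    """Split an OMIM number into parent entry, allelic variant and category."""
    match = _MIM_PATTERN.match(mim.strip())
    if match is None:
        raise ValueError(f"not a six-digit OMIM number: {mim!r}")
    entry, variant = match.groups()
    category = MIM_PREFIX.get(entry[0])
    if category is None:
        raise ValueError(f"first digit {entry[0]!r} is not described in the text")
    return {"entry": entry, "variant": variant, "category": category}


hbb = parse_mim_number("141900")            # beta-globin locus
sickle = parse_mim_number("141900.0243")    # sickle hemoglobin variant
```

Here `sickle["entry"]` equals `hbb["entry"]`, which reflects the rule that an allelic variant is named after its parent entry.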

2.3 Cancer Genome Anatomy Project

The Cancer Genome Anatomy Project (http://cgap.nci.nih.gov/), also known as CGAP, is a large undertaking sponsored by the U.S. National Institutes of Health. CGAP aims not only to determine, catalog and annotate the genes expressed during cancer development, but also ultimately to improve detection, diagnosis and treatment for cancer patients. Through cooperation with researchers worldwide, CGAP seeks both to increase scientific expertise and to enlarge its databases so that all cancer researchers can benefit (CGAP Website, 2005).

CGAP incorporates a variety of search tools to meet each researcher's needs, including tools for finding genes, cDNA libraries and single nucleotide polymorphisms (SNPs), and tools for examining gene expression and chromosomes. For example, to check the gene expression profile of TP53, we first go to the "GeneFinder" function and enter "TP53" as the search term. The search results offer three choices for viewing gene expression visually (Figure 1). The expression data are displayed in different array formats: "NCI60_Novartis" is the gene expression data of the NCI 60 cell lines on oligonucleotide arrays; "NCI60_Stanford" is the gene expression data of the NCI 60 cell lines on spotted arrays; and "SAGE Summary" is a 2-dimensional display by common tissue and histology, such as brain cancer vs. normal brain or lung cancer vs. normal lung. Clicking on any of the three choices gives the gene expression data for TP53 in the corresponding array format. Figure 2 shows the visualization of TP53 gene expression in "NCI60_Stanford". On the colour scale, higher expression levels are shown in red and lower expression levels in blue.

Figure 1: GeneFinder search result for the TP53 gene. The red box indicates the three visualization options for TP53 gene expression (CGAP Website, 2005).

Figure 2: Visualization of TP53 gene expression in NCI60_Stanford (CGAP Website, 2005).

2.4 Microarray

Transcription of DNA into RNA and the subsequent translation of messenger RNA into protein are the basic mechanisms by which cells mediate their growth, function and metabolism. Now that the human genome has been sequenced and annotated, the next step in functional genomics is to analyze the transcriptome, which can be defined as the complete collection of transcribed elements of the genome. In addition to messenger RNAs (mRNAs), the transcriptome also includes non-coding RNAs whose main functions are structural and regulatory. Alterations in the structure or expression levels of any of these RNAs or their proteins can eventually contribute to disease (Nelson et al., 2000). The use of microarray technologies to monitor gene expression in organisms, cell lines and human tissues has become very important in today's biological research (Schadt et al., 2000). The best-known technologies for examining the expression of thousands of genes are cDNA microarrays and oligonucleotide arrays. Both are valued for their ability to compare and contrast expression levels across tissue types (Gibson et al., 2002).

There are a few major differences between cDNA and oligonucleotide microarrays. One is that cDNA microarrays provide gene expression data only as relative values, whereas oligonucleotide arrays provide absolute values. Another is the design of the array: cDNA microarrays use PCR-amplified cDNA fragments (ESTs) extracted from a sequenced cDNA library, whereas oligonucleotide microarrays use a series of 25-mer oligonucleotides to represent known or predicted open reading frames (Gibson et al., 2002; Lipshutz et al., 1999; Wilson et al., 2003).

2.4.1 cDNA Microarray Technology

cDNA microarrays are designed to monitor the relative expression levels of thousands of genes in cells simultaneously. On a typical cDNA microarray chip, PCR-amplified cDNA fragments, also known as expressed sequence tags (ESTs), are spotted at high density, usually 10-50 spots per mm², onto a glass microarray slide (Gibson et al., 2002). Two independently derived mRNA samples are reverse-transcribed into cDNA and labeled with two different fluorescent dyes, usually the red dye Cy5 and the green dye Cy3. The labeled cDNA populations are then hybridized simultaneously to the glass microarray slide (Yang et al., 2002). Red and green laser beams scan the slide separately, and for each cDNA spot the signal intensity of the experimental sample (Cy5) is divided by the signal intensity of the reference sample (Cy3). Each derived gene expression level is therefore a relative ratio for that cDNA spot in the sample (Figure 3).

The central idea behind the relative ratio obtained from cDNA microarrays is that the change in the relative level of expression is what is of biological interest. A gene with a higher expression level does not necessarily have a higher fluorescence intensity than a gene with a lower expression level, because fluorescence intensity depends on the length of the EST, the amount of label incorporated into the cDNA during reverse transcription, the DNA concentration prepared for the particular clone, and the efficiency of hybridization (GEO Website, 2005).

Figure 3: Overview of the process of making cDNA microarray chips (www.fao.org).

According to Claverie, a meaningful change in gene expression can be identified as a twofold induction or repression of the experimental sample relative to the reference sample. Although this rule does not meet standard statistical definitions of significance, genes in cDNA microarrays are classified as "differentially expressed" only if they show at least a 2-fold change in expression (Claverie et al., 1999):

$$\frac{\mathrm{Gene}_A}{\mathrm{Gene}_B} \ge 2 \quad \text{or} \quad \frac{\mathrm{Gene}_A}{\mathrm{Gene}_B} \le \frac{1}{2}$$
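As a minimal sketch of how the Cy5/Cy3 ratio and the twofold rule fit together, the following Python functions compute the relative ratio for one spot and label it. The function names and the labels "induced", "repressed" and "unchanged" are illustrative choices, not terms from the cited sources, and the intensities are assumed to be positive and already background-corrected, a step the text does not describe.

```python
import math


def expression_ratio(cy5_intensity, cy3_intensity):
    """Relative expression of one spot: experimental (Cy5) over reference (Cy3)."""
    if cy5_intensity <= 0 or cy3_intensity <= 0:
        raise ValueError("intensities must be positive")
    return cy5_intensity / cy3_intensity


def log2_ratio(cy5_intensity, cy3_intensity):
    """Log2 of the ratio, so twofold induction and repression map to +1 and -1."""
    return math.log2(expression_ratio(cy5_intensity, cy3_intensity))


def classify_spot(cy5_intensity, cy3_intensity, fold=2.0):
    """Apply the fold-change rule: ratio >= fold or ratio <= 1/fold."""
    ratio = expression_ratio(cy5_intensity, cy3_intensity)
    if ratio >= fold:
        return "induced"
    if ratio <= 1.0 / fold:
        return "repressed"
    return "unchanged"


# Hypothetical intensities for three spots, for illustration only.
calls = [classify_spot(1200, 400), classify_spot(300, 600), classify_spot(500, 400)]
```

Working on the log2 scale is a common convenience because it makes induction and repression symmetric around zero; the threshold itself is the same twofold rule.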

2.4.2. Oligonucleotide Microarray Technology

High-density oligonucleotide arrays are built, or synthesized in situ, on a silicon chip by Affymetrix. Each gene is uniquely represented by 10 to 20 different oligonucleotide probes on a probe array. Probe synthesis takes place in a parallel fashion, in which an A, T, C, or G nucleotide is added to multiple growing chains simultaneously. After a series of photolithographic and combinatorial chemistry steps, each probe reaches its final length of 25 nucleotides (Lipshutz et al., 1999; Schadt et al., 2000) (Figure 4).

Figure 4: Oligonucleotide microarray technology (Lipshutz et al., 1999).

To account for cross-hybridization with similar short sequences in transcripts other than the one being probed, a partner probe is designed to be identical to the target probe except that a single base at its centre is purposely altered, resulting in a mismatch (MM) probe. As shown in Figure 5, each Mismatch (MM) probe, also known as the partner probe, is paired with a Perfect Match (PM) probe, also known as the reference probe, and these probe pairs allow the quantification and subtraction of intensity signals caused by non-specific cross-hybridization (Gibson et al., 2002; Lipshutz et al., 1999; Schadt et al., 2000). In oligonucleotide arrays, the expression level of each gene is calculated from the average of the differences between PM and MM, which means that the derived expression level is an absolute amount rather than the relative ratio obtained from cDNA arrays (Schadt et al., 2000).

Figure 5: Oligonucleotide probe pair design. Oligonucleotide probes are chosen based on composition design rules, and probes for eukaryotic organisms are chosen particularly from the 3’ end. The use of PM-MM differences averaged across probe sets reduces cross-hybridization problems and increases quantitative accuracy (Lipshutz et al., 1999).
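The average of the PM and MM differences can be illustrated with a short Python sketch. It assumes that the PM and MM intensities of one probe set are supplied as two equally long lists in matching order; the function name is hypothetical and the sketch omits any outlier handling that production software may apply.

```python
def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    pm: perfect-match intensities for the probe set.
    mm: mismatch intensities, paired by position with pm.
    """
    if not pm or len(pm) != len(mm):
        raise ValueError("pm and mm must be non-empty and of equal length")
    differences = [p - m for p, m in zip(pm, mm)]
    return sum(differences) / len(differences)
```

Because the mismatch signal is subtracted probe by probe, the result is an absolute expression estimate for the gene rather than a ratio against a second sample.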

2.5. Microarray Databases

A number of microarray databases are now available for public access. Each has its own features and data sources; the two major ones, into which most microarray data have been incorporated, are introduced here.

2.5.1. Stanford Microarray Database

The Stanford Microarray Database (http://genome-www.stanford.edu/microarray), also known as SMD, is a research tool designed for scientists to study biomedical problems using multiple microarray platforms. Today, SMD supports the research of more than 1,000 users in over 260 laboratories worldwide. These users can enter data generated from more than 50,000 microarrays used to study the biology of thirty-four organisms, including but not limited to Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster and Escherichia coli. In addition, over a hundred papers referring to data in SMD have been published, and the complete raw data of more than 7,000 microarrays are freely accessible via the SMD website. In other words, SMD allows users to upload and store raw and/or normalized data from microarray experiments. Moreover, SMD provides functions such as data retrieval, data analysis and visualization interfaces for viewing gene expression patterns (SMD Database Website, 2004).

2.5.2. Gene Expression Omnibus

The Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo) is a high-throughput gene expression / molecular abundance data repository designed to give the scientific community a place to share, browse, query and retrieve data. The datasets include single- and multiple-channel microarray-based experiments measuring mRNA, genomic DNA and protein molecules. Serial Analysis of Gene Expression (SAGE) datasets are also accepted by GEO even though SAGE is not an array-based high-throughput technology. As of July 2005, GEO had archived 43,010 publicly released samples from organisms including but not limited to Homo sapiens, Mus musculus, Drosophila melanogaster and Rattus norvegicus, across 1,446 publicly released platforms, such as in situ oligonucleotide, spotted oligonucleotide and spotted DNA/cDNA arrays (GEO Website, 2005).

2.6. Biological Pathways

Biological pathways are defined as the thousands of enzyme-catalyzed chemical reactions in cells that are functionally organized into many different sequences of consecutive reactions (Nelson et al., 2000). In other words, the product of one reaction becomes the reactant of the next.

Every biological reaction has its own features and distinct role, and together they maintain cellular functions in all living organisms. For example, the main function of catabolic pathways is to degrade organic nutrients into simple products, releasing chemical energy that is eventually converted for use by the cell. Anabolic pathways, on the other hand, start with small molecules and convert them into larger and more complex molecules, such as proteins (Nelson et al., 2000). Together, catabolic and anabolic pathways form the major metabolic pathways in all living organisms (Figure 6).

Figure 6: An overview of the metabolic pathways (www.genome.ad.jp/kegg/pathway/map/map01100.html).

Besides metabolic pathways, regulatory pathways also have a significant influence on living organisms. The cellular utilization of genetic information is one of the major regulatory pathways known today. The process starts with DNA replication, the copying of double-helix DNA to form daughter DNA molecules with identical nucleotide sequences, followed by transcription, in which DNA is copied into RNA, and ends with translation, in which the genetic message encoded in messenger RNA is translated into protein (Figure 7).

Figure 7: DNA sequences are transcribed into RNA sequences in the nucleus. The RNA sequences are then moved to the cytoplasm and translated into linear protein chains (campus.queens.edu/.../bio103/tests/TEST3Help.htm).

Another major category of regulatory pathways is the biosignaling process. The ability of cells to receive and act on signals is fundamental to life (Figure 8). If defective signaling proteins, which are produced by oncogenes, continually give the signal for cell division, tumours form as a consequence. More generally, cancer results when the fundamental biological processes that regulate cell development, growth and death malfunction (SEER’s Training Website, 2005).

Figure 8: Biosignaling transduction (http://webhost.bridgew.edu/fgorga/ras/signaling.htm).

2.7. Gene Ontology

Gene Ontology is a set of controlled vocabularies that describe the cell functions and biomedical knowledge of genes or proteins in eukaryotic organisms. These vocabularies are updated as knowledge changes. At present, three independent vocabularies, or ontologies, have been developed: biological process, molecular function and cellular component. Molecular function refers to activities at the molecular level rather than to the entities that perform them; examples are hydrolase activity and enzyme inhibitor activity. Biological process means a biological goal achieved by one or more ordered assemblies of molecular functions, such as cell death. Cellular component describes where the gene product is located at the level of subcellular structures and macromolecular complexes; an example is the nuclear inner membrane, or inner envelope (Harris et al., 2004).

III. Materials and Methods

3.1. Data Source

Most experimental microarray data can now be obtained from public websites and/or from the authors upon request. For this research, we focused on microarray data from the Stanford Microarray Database (SMD) and the Gene Expression Omnibus (GEO).

3.2. Classification of Cancer-Related Genes from OMIM Database

OMIM contains a variety of articles, or records, on genetic diseases and inherited genes that have been read and summarized in a few sentences by scientists. In other words, OMIM acts as a miniature reading environment in which readers can view a variety of article sources at once. OMIM is also a high-quality information source and is considered a key reference database by the genetics community. We therefore chose OMIM as our major resource for deriving the cancer-related gene list.

We searched the OMIM search engine with the key word “cancer” to extract all cancer-related gene records. To restrict the search to the text portions of each record, we focused only on the information contained in the “title”, “text” and “clinical synopsis” fields (Table 1).

Table 1: Explanation of the name, content and search tips in different search fields (http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#SearchFields)

| Search Field | Description | Qualifier |
|---|---|---|
| All Fields | Contains all terms from all searchable database fields in the database. | [ALL] |
| Allelic Variant | Describes a subset of disease-producing mutations. | [AV] or [VAR] |
| Chromosome | The chromosome onto which a gene or disorder has been mapped, as reported in the OMIM Gene or Morbid Map. | [CH] or [CHR] |
| Clinical Synopsis | Clinical features of a disorder and the mode of inheritance (e.g., autosomal dominant, autosomal recessive, x-linked), if known. | [CS] or [CLIN] |
| Contributor | Contributor to an OMIM record. Names are in the format of lastname followed by one or more initials (with no periods), e.g., Smith AB. | [AU] or [CTRB] |
| Creation Date | The date on which an OMIM record was created, in the format YYYY/MM/DD. | [CD] or [CDAT] |
| EC/RN Number | Number assigned by the Enzyme Commission or Chemical Abstract Service (CAS) to designate a particular enzyme or chemical, respectively. | [EC] or [ECNO] |
| Editor | Editor of OMIM record. Names are in the format of lastname followed by one or more initials (with no periods), e.g., Smith AB. | [ED] or [EDTR] |
| Filter | Primarily used to retrieve subsets of records that contain crosslinks to other Entrez databases, and LinkOuts to external (non-Entrez) resources. There is a separate LinkOut Overview document which provides more detail about that service. | [FI] or [FILT] |
| Gene Map | Cytogenetic map location represented in the OMIM Gene Map. | [GM] or [MAP] |
| Gene Map Disorder | Text words appearing in the Disorder column of the OMIM Gene Map. | [DIS] or [DI] |
| Gene Name | The official gene symbol, and alternate gene symbols, associated with a record. Currently limited to gene symbols present on the OMIM Gene Map. All gene symbols represented in OMIM (mapped or unmapped) can be searched in the Title Word field, described below. | [GN] or [GENE] |
| MIM Number | For information on the numbering system, see the OMIM FAQs. | [ID] or [MIM] |
| Modification Date | Date on which the record was last modified, in the format YYYY/MM/DD. | [MD] or [MDAT] |
| Modification History | All dates on which an OMIM record was updated, in the format YYYY/MM/DD. | [MDH] or [HIST] |
| Properties | An index containing various properties of OMIM records, identifying those which have attributes such as Allelic Variants, Clinical Synopsis, or Gene Map locus. The most commonly used attributes are presented as check boxes on the Limits page. To see a complete list of attributes, you can browse the index of the Properties field by using the Index option. | [PR] or [PROP] |
| Reference | Contains author names and title words from the articles cited in an OMIM entry. Names are in the format of lastname followed by one or more initials (with no periods), e.g., Smith AB. | [RE] or [REF] |
| Text Word | Contains terms from the main text-containing section of a record, which begins under the title of a record and ends above the Allelic Variants section (if present), or above the References section (if no Allelic Variants are described). | [TXT] or [WORD] |
| Title Word | Words in title of an OMIM record. Includes words in the primary title, alternative titles, and included titles. | [TI] or [TITL] |

In other words, if the key word “cancer” is not found in any of the three sections, we assume that the record does not contain any cancer-relevant information. The next step was to review each gene’s OMIM record to confirm its role in different cancer types, which were defined as the ten leading causes of cancer mortality in the Taiwanese population as reported by the Department of Health for the year 2004. The ten cancer types are lung cancer, hepatocellular carcinoma (HCC), colorectal carcinoma, female breast cancer, gastric carcinoma, oral cancer, cervical cancer, prostate cancer, esophageal cancer and pancreatic cancer.
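The screening rule above, under which a record is kept only if the key word appears in its title, text or clinical synopsis, might be sketched as follows. The dictionary layout of a record and the field names are hypothetical, chosen only for illustration; they are not the OMIM download format.

```python
# Hypothetical field names standing in for the three OMIM sections searched.
SEARCH_FIELDS = ("title", "text", "clinical_synopsis")


def mentions_keyword(record, keyword="cancer"):
    """Return True if the keyword occurs in any of the three searched fields.

    record: dict mapping field names to text (an assumed representation).
    The match is a case-insensitive substring test.
    """
    needle = keyword.lower()
    return any(needle in record.get(field, "").lower() for field in SEARCH_FIELDS)


def screen_records(records, keyword="cancer"):
    """Keep only the records that mention the keyword."""
    return [record for record in records if mentions_keyword(record, keyword)]
```

A substring test also matches longer words that contain the key word, so a stricter word-boundary match may be preferable in practice.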

In addition to the key word “cancer” search in the OMIM database, we also searched for each of the ten cancer types individually so that a more comprehensive cancer-related gene list could be obtained. Since many different terms can refer to one cancer type, all the possibilities have to be taken into consideration in the search. For example, breast cancer can be described as breast carcinoma, mammary gland neoplasm, etc. We therefore used both the synonyms for each cancer type based on the International Classification of Diseases for Oncology (ICD-O) and the synonyms, near-synonyms and closely related concepts for cancers defined by the Medical Subject Headings (MeSH). ICD-O is used mainly by cancer and/or tumour registries for coding the histology and site of neoplasms (ICD-O Website, 2005). Table 2 summarizes the synonyms for the ten cancer types defined by ICD-O, while Table 3 lists all related terms for the listed cancer types in MeSH.

Table 2: Summary of the Synonyms for Ten Different Cancer Types Defined by ICD-O

| Words | Synonyms by ICD-O |
|---|---|
| Cancer | cancer//carcinoma//leukaemia//leukemia//lymphoma//malignancy//melanoma//myeloma//neoplasm//tumor//tumour// |
| Lung | bronchiole//bronchogenic//bronchus//carina//hilus//lingula//lung//pulmonary// |
| Liver | liver//hepatocellular//hepatoma// |
| Colorectal | bowel//cecum//colon//colorectal//ileocecal//intestine//pelvirectal//rectal//rectosigmoid//rectum//sigmoid// |
| Female Breast | areola//breast//mammary//nipple// |
| Stomach | antrum//cardia//cardioesophageal//esophagogastric//fundus//gastric//"nos"//prepylorus//pyloric//pylorus//stomach// |
| Oral | alveolar//alveolus//buccal//cheek//frenulum//gingiva//"gum"//labial//linguae//molar//mouth//oral//palate//periodontal//retromolar//salivary//tongue//tonsil//tooth//uvula// |
| Cervical | cervical//cervix//endocervical//endocervix//exocervical//exocervix//internal os//nabothian// |
| Prostate | prostate//prostatic// |
| Esophageal | esophageal//esophagus// |
| Pancreatic | langerhans//pancreas//pancreatic//santorini//wirsung// |
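How the word lists in Table 2 were combined into search terms is not spelled out here, so the following Python sketch rests on an assumption: that each site word is paired with each general cancer word to form a two-word phrase. The function names are hypothetical.

```python
from itertools import product


def parse_terms(cell):
    """Split a '//'-delimited cell from Table 2 into a list of clean terms."""
    terms = []
    for raw in cell.split("//"):
        term = raw.strip().strip('"')
        if term:
            terms.append(term)
    return terms


def search_phrases(site_terms, cancer_terms):
    """Pair every site word with every cancer word (assumed combination rule)."""
    return [f"{site} {cancer}" for site, cancer in product(site_terms, cancer_terms)]
```

For instance, the Liver row paired with only "cancer" and "carcinoma" would yield phrases such as "liver cancer" and "hepatocellular carcinoma". Not every pairing produced this way is a meaningful term, so the output would still need manual review.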

Table 3: Summary of the Synonyms for Ten Different Cancer Types Defined by MeSH

| Cancer Type | MeSH Headings | MeSH Synonyms |
|---|---|---|
| Lung Cancer | Lung Neoplasms | Lung Neoplasms//Cancer of Lung//Lung Cancer//Pulmonary Cancer//Pulmonary Neoplasms//Cancer of the Lung//Neoplasms, Lung//Neoplasms, Pulmonary//Non-Small-Cell Lung Carcinoma//Carcinoma, Non-Small Cell Lung// |
| Liver Cancer | Liver Neoplasms | Liver Neoplasms//Cancer of Liver//Hepatic Cancer//Liver Cancer//Cancer of the Liver//Hepatic Neoplasms//Neoplasms, Hepatic//Neoplasms, Liver//Carcinoma, Hepatocellular//Hepatocellular Carcinoma//Hepatoma// |
| Colorectal Cancer | Colorectal Neoplasms | Colonic Neoplasms//Cancer of Colon//Colon Cancer//Cancer of the Colon//Colon Neoplasms//Colonic Cancer//Neoplasms, Colonic//Colorectal Neoplasms, Hereditary Nonpolyposis//Hereditary Nonpolyposis Colorectal Cancer//Hereditary Nonpolyposis Colorectal Neoplasms//Lynch Syndrome//Colon Cancer, Familial Nonpolyposis//Lynch Cancer Family Syndrome I//Lynch Syndrome I//Lynch Syndrome II// |
| Breast Cancer | Breast Neoplasms | Breast Neoplasms//Breast Cancer//Breast Tumors//Cancer of Breast//Cancer of the Breast//Human Mammary Carcinoma//Mammary Carcinoma, Human//Mammary Neoplasm, Human//Mammary Neoplasms, Human//Neoplasms, Breast//Tumors, Breast// |
| Stomach Cancer | Stomach Neoplasms | Stomach Neoplasms//Cancer of Stomach//Gastric Cancer//Gastric Neoplasms//Stomach Cancer//Cancer of the Stomach//Neoplasms, Gastric//Neoplasms, Stomach// |
| Oral Cancer | Mouth Neoplasms | Mouth Neoplasms//Cancer of Mouth//Mouth Cancer//Oral Cancer//Oral Neoplasms//Cancer of the Mouth//Neoplasms, Mouth//Neoplasms, Oral//Oral Cavity//Cavitas Oris//Cavitas oris propria//Mouth Cavity Proper//Oral Cavity Proper//Vestibule Oris//Vestibule of the Mouth// |
| Cervical Cancer | Cervix Neoplasms | Cervix Neoplasms//Cancer of Cervix//Cervical Cancer//Cancer of the Cervix//Cervical Neoplasms//Cervix Cancer//Neoplasms, Cervical//Neoplasms, Cervix//Cervical Intraepithelial Neoplasia//Neoplasia, Cervical Intraepithelial//Cervical Intraepithelial Neoplasia, Grade III//Cervical Intraepithelial Neoplasms//Intraepithelial Neoplasia, Cervical// |
| Prostate Cancer | Prostatic Neoplasms | Prostatic Neoplasms//Cancer of Prostate//Prostate Cancer//Cancer of the Prostate//Neoplasms, Prostate//Neoplasms, Prostatic//Prostate Neoplasms//Prostatic Cancer//Prostatic Hyperplasia//Adenoma, Prostatic//Benign Prostatic Hyperplasia//Prostatic Adenoma//Prostatic Hyperplasia, Benign//Prostatic Hypertrophy//Prostatic Hypertrophy, Benign//Prostatism// |
| Esophageal Cancer | Esophageal Neoplasms | Esophageal Neoplasms//Cancer of Esophagus//Esophageal Cancer//Cancer of the Esophagus//Esophagus Cancer//Esophagus Neoplasm//Neoplasms, Esophageal// |
| Pancreatic Cancer | Pancreatic Neoplasms | Pancreatic Ductal Carcinoma//Duct-Cell Carcinoma of the Pancreas//Duct-Cell Carcinoma, Pancreas//Ductal Carcinoma of the Pancreas//Pancreatic Duct Cell Carcinoma//Pancreatic Neoplasms//Cancer of Pancreas//Pancreatic Cancer//Cancer of the Pancreas//Neoplasms, Pancreatic//Pancreas Cancer//Pancreas Neoplasms//Carcinoma, Pancreatic Ductal// |

Ten gene lists for the ten different cancer types were derived from reviewing and categorizing each OMIM record. We define each gene within a cancer-specific gene list as a cancer-related gene, since the gene has been confirmed by OMIM to be related to that particular cancer type. In other words, we obtained a total of ten specific cancer-related gene lists from the OMIM database.

With the ten cancer-related gene lists identified, we intersected the lists to look for common genes present across the ten cancer types. The first step was to use Microsoft Access to create one table for each of the ten cancer types. Each database table was created using SQL. For example, the script for creating the breast cancer table is as follows:

CREATE TABLE Breast (GeneSymbol VARCHAR(15));

After creating the ten tables, the next step was to perform the gene list intersections, again using SQL. For example, the script for the intersection of the breast, cervical and prostate gene lists is:

SELECT Breast.GeneSymbol FROM (Breast INNER JOIN Cervical ON Breast.GeneSymbol = Cervical.GeneSymbol) INNER JOIN Prostate ON Cervical.GeneSymbol = Prostate.GeneSymbol;

The process flow for obtaining the ten cancer-related gene lists and the common cancer-related genes is summarized in Figure 9 below.
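The same table-per-cancer-type intersection can be reproduced outside Microsoft Access. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of Access, which is a substitution made for illustration; the function name and the toy gene lists in the usage note are assumptions and are not the lists derived in this study.

```python
import sqlite3


def common_genes(gene_lists):
    """Intersect gene-symbol lists by chaining INNER JOINs, one table per list.

    gene_lists: dict mapping a table name (e.g. "Breast") to gene symbols.
    Table names are interpolated into SQL, so they must be trusted values.
    """
    names = list(gene_lists)
    if not names:
        return []
    conn = sqlite3.connect(":memory:")
    try:
        for name in names:
            conn.execute(f'CREATE TABLE "{name}" (GeneSymbol VARCHAR(15))')
            rows = [(symbol,) for symbol in gene_lists[name]]
            conn.executemany(f'INSERT INTO "{name}" VALUES (?)', rows)
        first = names[0]
        sql = f'SELECT DISTINCT "{first}".GeneSymbol FROM "{first}"'
        for name in names[1:]:
            # Each extra table narrows the result to symbols it also contains.
            sql += f' INNER JOIN "{name}" ON "{first}".GeneSymbol = "{name}".GeneSymbol'
        sql += f' ORDER BY "{first}".GeneSymbol'
        return [row[0] for row in conn.execute(sql)]
    finally:
        conn.close()
```

With toy lists such as {"Breast": ["APC", "PTEN", "GENE_A"], "Cervical": ["PTEN", "APC", "GENE_B"], "Prostate": ["APC", "PTEN"]}, only the symbols present in every table are returned. DISTINCT is added so that a symbol entered twice in one list is reported once.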

Figure 9: Flowchart for extracting cancer-related gene lists from the OMIM database.

3.3. Microarray Data Extraction from GEO Database

Many researchers have deposited their microarray datasets in GEO; thus, we could download the relevant cancer datasets from its website. We used the GEO built-in function “Browse->DataSets” to list all available datasets within GEO (Figure 10).

Figure 10: Browse all the available datasets on GEO (GEO Website, 2005).

Since we were only interested in human datasets, we first used the function “Sort by Organisms” to list all human datasets together (Figure 11).

Figure 11: Sort all GEO datasets according to organism (GEO Website, 2005).

The next step was to focus on the datasets using oligonucleotide chips and to remove those uploaded by authors who had already published their data in the SMD database. In addition, we were most interested in dataset records that consisted of normal tissue samples, cancer tissue samples, or both, but no cell line samples (Figure 12).

Figure 12: GEO dataset download and information page (GEO Website, 2005).

As a result of the pre-filtering process, three published datasets each for breast cancer and gastric cancer, one each for prostate cancer, colorectal cancer, cervical cancer and hepatocellular carcinoma, and two for lung cancer were left for thorough review. Moreover, two published datasets by Su et al., one consisting of various types of normal human tissues, including lung, prostate, liver, etc., and the other including a variety of tumour samples, were also included in our review (Su et al., 2001; Su et al., 2002). The final datasets used for analysis were those from breast, prostate, liver, lung and pancreas tissues. The details of the datasets we used for analysis are summarized in Table 4 below.

Table 4: Summary of cancer-related literatures used for analysis from the GEO database

| Cancer Type | Author | Paper Title | Array Type | Experiment Description | Used for Our Analysis |
|---|---|---|---|---|---|
| Breast | Mecham BH, et al. | Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements | HG-U95A | 2 normal and 4 cancer state breast tissues | Yes |
| Breast | Mecham BH, et al. | (same paper) | HG-U133A | 2 normal and 4 cancer state breast tissues | Yes |
| Breast | Mecham BH, et al. | (same paper) | HG-U133B | 2 normal and 4 cancer state breast tissues | Yes |
| Breast | Mecham BH, et al. | (same paper) | cDNA (G4100A) | 2 cell line sets | No |
| N/A | Su et al. | Large-scale analysis of the human and mouse transcriptomes | HG-U95A | 2 breast normal, 2 prostate normal, 2 liver normal, 2 lung normal and 2 pancreatic normal | Yes |
| N/A | Su et al. | Molecular classification of human carcinomas by use of gene expression signature | HG-U95A | 31 prostate cancer samples, 7 liver cancer samples, 28 lung cancer samples, 6 pancreatic cancer samples, 23 breast cancer samples | Yes |

Upon obtaining the five specific cancer-related gene lists, we again intersected the gene lists in Microsoft Access to look for common genes present across the five cancer types. The detailed steps for the gene list intersections are described in Section 3.2. The process flow from collecting the microarray datasets from the GEO database to obtaining the common cancer-related gene lists is summarized in Figure 13 below.

Figure 13: Process flowchart for extracting the datasets from GEO database.

3.4. Microarray Data Extraction from SMD Database

SMD is another well-known microarray database that consists mainly of cDNA microarray datasets. Each SMD publication includes a considerable number of samples, which in turn increases the reliability of the experiment. We used the search engine developed by Stanford University to query all the cancer-related publications (Figure 14).

Figure 14: Querying publications related to breast cancer using the search engine (SMD Database Website, 2004).

The next step was to keep only those publications that matched our ten cancer types and to download their microarray data from the SMD website. In total, we obtained twelve experimental datasets for breast cancer, three for gastric cancer, one for hepatocellular carcinoma, six for prostate cancer and two each for lung cancer and pancreatic cancer. Since we wanted to filter out datasets containing cell line samples and include only those with both normal and cancer tissue samples, only one dataset for each of the breast

cancer, lung cancer, gastric cancer, prostate cancer, pancreatic cancer and hepatocellular carcinoma was analyzed (Table 6). We then used the SMD built-in data analysis and retrieval system to pull out the required information, such as the gene symbol and the log base 2 of the R/G normalized ratio of the mean (Figure 15).

Figure 15: Using the SMD built-in data analysis and retrieval system to extract required information, such as gene symbol, log base 2 of R/G normalized ratio of the mean, etc. (SMD Database Website, 2004).

Table 6: Summary of cancer-related literatures used for the analysis from SMD database

| Cancer Type | Author | Paper Title | Array Type | Experiment Description | Used for Our Analysis |
|---|---|---|---|---|---|
| Breast | Sorlie T, et al. | Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications | cDNA | 4 normal breast tissues and 81 primary tumour | Yes |
| Lung | Garber ME, et al. | Diversity of gene expression in adenocarcinoma of the lung | cDNA | 6 normal tissue, 61 primary lung tumour | Yes |
| Gastric | Chen X, et al. | Variation in gene expression patterns in human gastric cancers | cDNA | 103 gastric cancer tissues and 29 non-neoplastic gastric tissues | Yes |
| Gastric | Leung SY, et al. | Expression profiling identifies chemokine (C-C motif) ligand 18 as an independent prognostic indicator in gastric cancer | cDNA | 23 non-tumour tissue and 103 primary tumour (Part of Chen X Data) | Yes |
| Gastric | Leung SY, et al. | Phospholipase A2 group IIA expression in gastric adenocarcinoma is associated with prolonged survival and less frequent metastasis | cDNA | 23 non-tumour tissue and 103 primary tumour (Part of Chen X Data) | Yes |
| Prostate | Lapointe J, et al. | Gene expression profiling identifies clinically relevant subtypes of prostate cancer | cDNA | 62 primary prostate tumors, 41 normal prostate specimens and nine lymph node metastases | Yes |
| Pancreatic | Iacobuzio-Donahue CA, et al. | Exploration of global gene expression patterns in pancreatic adenocarcinoma using cDNA microarrays | cDNA | 17 infiltrating pancreatic cancer tissues, and 5 samples of normal pancreas | Yes |
| Liver | Chen X, et al. | Gene expression patterns in human liver cancers | cDNA | 102 primary HCC tumour tissues, 74 non-tumour liver tissues, 10 metastatic cancers, 3 adenoma tumour samples and 4 FNH tumour samples | Yes |

The downloaded file is in text format and contains detailed information about the publication’s datasets. An example of the downloaded dataset format is shown in Figure 16. Column A contains the clone ID, column B gives each gene’s name, and column D onwards hold each experimental slide’s log base 2 of the R/G normalized ratio of the mean. Before doing any further analysis, we had to categorize each experimental slide manually into the cancer or normal group.

Figure 16: Snapshot of the downloaded dataset format.

With the six specific cancer-related gene lists in hand, we again intersected the gene lists in Microsoft Access to look for common genes present across the six cancer types. The detailed steps for the gene list intersections are described in Section 3.2. An overall process flow is shown in Figure 17 to illustrate, step by step, how we extracted the required data from the SMD database.
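The grouping of slides into cancer and normal sets for the downloaded file can be illustrated with a small Python sketch. It assumes a tab-delimited file whose first row names the columns, with the clone ID in column A, the gene name in column B and slide values from column D onwards; the function name, the header row and the set of normal slide names are hypothetical and still have to be decided by the analyst.

```python
import csv
import io


def split_slides(text, normal_slides):
    """Group each clone's log2 ratios into 'normal' and 'cancer' lists.

    text: contents of the tab-delimited download (layout assumed as described).
    normal_slides: names of the slide columns the analyst marked as normal.
    Empty cells are kept as None so that missing values stay visible.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    slide_names = header[3:]  # column D onwards
    grouped = {}
    for row in reader:
        if not row:
            continue
        cells = row[3:] + [""] * (len(slide_names) - len(row[3:]))
        values = [float(cell) if cell.strip() else None for cell in cells]
        normal = [v for name, v in zip(slide_names, values) if name in normal_slides]
        cancer = [v for name, v in zip(slide_names, values) if name not in normal_slides]
        grouped[row[0]] = {"gene": row[1], "normal": normal, "cancer": cancer}
    return grouped
```

Keeping the two groups separate at this stage makes the later ratio calculation a simple comparison of two lists per clone.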

Figure 17: Process flowchart to extract datasets from SMD database.

3.5. Datasets Analysis

3.5.1. Downloaded Datasets Processing

The downloaded datasets were analyzed with a purpose-built tool to extract the cancer-related gene lists. Before being imported into the tool, each dataset has to be organized into the format shown in Figure 18. Column A has to be the unique microarray ID: the probe set ID for oligonucleotide array data from Affymetrix, and the clone image ID for cDNA data from SMD. Column B has to be the GenBank accession number, followed by a series of columns holding the experimental slides’ data points. Lastly, the final column has to contain the gene symbol.

Figure 18: Dataset format for the analysis.

The next step was to import the data into our tool for further analysis. The tool interface is shown in Figure 19. Before submitting the data for analysis, we had to indicate how many data columns contained normal values. In other words, the normal experimental slides’ data had to be placed right after Column B.

Figure 19: Interface of the analytical tool.

In the analysis, the first step was to eliminate genes for which more than half of the expression values were missing. The tool then calculated each gene’s expression ratio between cancer and normal tissues, as well as the mean expression level and the standard deviation of the gene. Since we only wanted to extract genes with a ratio greater than 1.5 or less than 2/3, the tool used a solid circle to mark genes that met these criteria. Figure 20 shows an example of the result format after the analysis is complete. The last four columns hold the results of our analysis: the calculated ratio of the cancer and normal samples, the average expression level, the standard deviation, and a final column marking which genes fulfilled our criteria for inclusion in the cancer-related gene list.
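The per-gene steps described above can be sketched in Python as follows. This is not the tool's actual code. It assumes linear-scale expression values, takes the ratio as the mean of the cancer values divided by the mean of the normal values, and uses the sample standard deviation over all non-missing values; these choices and the function name are assumptions where the text does not specify the details.

```python
import statistics


def analyse_gene(normal, cancer, upper=1.5, lower=2 / 3):
    """Return ratio, mean, SD and a selection flag for one gene, or None.

    normal, cancer: expression values per slide, with None for missing data.
    A gene is dropped (None) when more than half of its values are missing
    or when no ratio can be formed.
    """
    values = normal + cancer
    present = [v for v in values if v is not None]
    if len(values) - len(present) > len(values) / 2:
        return None
    normal_ok = [v for v in normal if v is not None]
    cancer_ok = [v for v in cancer if v is not None]
    if not normal_ok or not cancer_ok:
        return None
    normal_mean = statistics.mean(normal_ok)
    if normal_mean == 0:
        return None
    ratio = statistics.mean(cancer_ok) / normal_mean
    return {
        "ratio": ratio,
        "mean": statistics.mean(present),
        "sd": statistics.stdev(present) if len(present) > 1 else 0.0,
        "selected": ratio > upper or ratio < lower,  # the solid-circle column
    }
```

The thresholds 1.5 and 2/3 are reciprocals, so over- and under-expression are treated symmetrically. For log-transformed values such as the SMD log2 ratios, the values would first have to be converted back to a linear scale or the criteria restated on the log scale.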

Figure 20: Microarray expression datasets analysis result.

3.5.2. Search for Cancer-Related Gene Involvement in Biological Pathway

After obtaining the common cancer-related genes from the gene list intersections, we determined those genes’ functions and their locations in each biological pathway using the KEGG database (http://www.genome.ad.jp/kegg-bin/mk_point_html). KEGG is short for Kyoto Encyclopedia of Genes and Genomes, a publicly available pathway database containing up-to-date knowledge on molecular interaction networks, which include metabolic pathways, regulatory pathways and molecular complexes (KEGG Website, 2005). KEGG provides a large collection of biological pathway diagrams in which gene-to-pathway relationships can be viewed clearly. In other words, each gene’s specific location in each biological pathway can easily be seen on the diagram.

IV. Result

4.1. Cancer-Related Genes from OMIM

After reviewing the gene lists queried from OMIM for the different cancer types, we obtained ten specific cancer-related gene lists for our ten designated cancer types. The number of genes associated with each cancer type is summarized in Table 7.

Table 7: Number of cancer-related genes in relation to each specific cancer type

| Cancer Type | Breast | Cervical | Colon | Esophageal | Liver | Lung | Pancreas | Oral | Prostate | Stomach |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of cancer-related genes | 388 | 106 | 287 | 136 | 610 | 491 | 193 | 47 | 236 | 133 |

The ten specific cancer-related gene lists were then imported into the KEGG pathway database to obtain pathway lists. The relationship of the cancer-related genes to biological pathways is summarized in Table 8. Only the twelve biological pathways in which cancer-related genes are present across all ten cancer types were extracted. Moreover, although quite a number of specific cancer-related genes are mentioned in the OMIM literature, only the APC, CDKN2A and PTEN genes were found to be present across all ten cancers. APC is mainly involved in the Wnt signaling pathway (environmental information processing) and in regulation of actin cytoskeleton (cellular processes). CDKN2A has a role in the cell cycle (cellular processes). PTEN takes part in both the phosphatidylinositol signaling system (environmental information processing) and inositol phosphate metabolism.

Table 8: Summary of OMIM-defined cancer-related genes in relation to the biological pathways. Each number is the count of genes associated with the given pathway for the given cancer type.

Pathway Name                                                                    Breast     Cervical   Colon      Esophageal Liver      Lung       Oral       Pancreas   Prostate   Stomach    Total
Environmental Information Processing: MAPK signaling pathway                    22         8          16         14         26         18         3          13         8          17         145
Environmental Information Processing: Wnt signaling pathway                     14         5          18         4          20         23         2          10         11         9          116
Cellular Processes: Regulation of actin cytoskeleton                            24         8          15         7          16         15         3          6          4          11         109
Environmental Information Processing: Cytokine-cytokine receptor interaction    8          2          9          8          27         25         1          4          5          10         99
Cellular Processes: Cell cycle                                                  7          4          13         4          10         14         1          8          7          3          71
Environmental Information Processing: Neuroactive ligand-receptor interaction   7          3          5          2          13         17         2          6          10         3          68
Cellular Processes: Adherens junction                                           6          3          12         6          10         9          2          3          2          6          59
Cellular Processes: Apoptosis                                                   4          2          8          6          12         6          1          4          3          8          54
Environmental Information Processing: TGF-beta signaling pathway                7          1          4          3          8          10         1          4          3          3          44
Cellular Processes: Focal adhesion                                              5          3          5          2          5          4          2          4          3          5          38
Environmental Information Processing: Phosphatidylinositol signaling system     11         2          4          1          5          4          1          3          2          3          36
Metabolism: Inositol phosphate metabolism                                       9          2          3          1          5          3          1          3          1          2          30
Total Genes                                                                     124/388    43/106     112/287    58/136     157/610    148/491    20/47      68/193     59/236     80/133
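The arithmetic in Table 8 can be checked with a short script. The sketch below transcribes the counts from the table (the variable and function names are our own) and recomputes each row total and each column sum. The numerators in the "Total Genes" row equal the column sums, which suggests that a gene involved in more than one of the twelve pathways, as APC is, may be counted once per pathway rather than once per cancer type.

```python
# Column order used in Table 8.
COLUMNS = ["Breast", "Cervical", "Colon", "Esophageal", "Liver",
           "Lung", "Oral", "Pancreas", "Prostate", "Stomach"]

# pathway -> (counts in COLUMNS order, row total as reported in Table 8)
TABLE8 = {
    "MAPK signaling pathway": ([22, 8, 16, 14, 26, 18, 3, 13, 8, 17], 145),
    "Wnt signaling pathway": ([14, 5, 18, 4, 20, 23, 2, 10, 11, 9], 116),
    "Regulation of actin cytoskeleton": ([24, 8, 15, 7, 16, 15, 3, 6, 4, 11], 109),
    "Cytokine-cytokine receptor interaction": ([8, 2, 9, 8, 27, 25, 1, 4, 5, 10], 99),
    "Cell cycle": ([7, 4, 13, 4, 10, 14, 1, 8, 7, 3], 71),
    "Neuroactive ligand-receptor interaction": ([7, 3, 5, 2, 13, 17, 2, 6, 10, 3], 68),
    "Adherens junction": ([6, 3, 12, 6, 10, 9, 2, 3, 2, 6], 59),
    "Apoptosis": ([4, 2, 8, 6, 12, 6, 1, 4, 3, 8], 54),
    "TGF-beta signaling pathway": ([7, 1, 4, 3, 8, 10, 1, 4, 3, 3], 44),
    "Focal adhesion": ([5, 3, 5, 2, 5, 4, 2, 4, 3, 5], 38),
    "Phosphatidylinositol signaling system": ([11, 2, 4, 1, 5, 4, 1, 3, 2, 3], 36),
    "Inositol phosphate metabolism": ([9, 2, 3, 1, 5, 3, 1, 3, 1, 2], 30),
}

# Numerators of the "Total Genes" row, in COLUMNS order.
REPORTED_COLUMN_SUMS = [124, 43, 112, 58, 157, 148, 20, 68, 59, 80]


def row_mismatches(table):
    """Return the pathways whose counts do not add up to the reported total."""
    return [name for name, (counts, total) in table.items()
            if sum(counts) != total]


def column_sums(table):
    """Return the sum of each cancer-type column over all pathways."""
    return [sum(column) for column in zip(*(counts for counts, _ in table.values()))]


print(row_mismatches(TABLE8))                        # [] when every row total matches
print(column_sums(TABLE8) == REPORTED_COLUMN_SUMS)   # True when every column matches
```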

Furthermore, classifying the cancer tissue types into groups gave some intriguing results. When we intersected the cancer-related gene lists obtained for stomach, pancreatic, liver and colon tissues, a total of twenty-three genes were present in all four tissues (Figure 21; Appendix I). When we queried the KEGG system for the pathways associated with those twenty-three genes, we found that fourteen of them are not involved in any pathway. Among the remaining nine genes, no common pathway was found.

Figure 21: Intersection of OMIM cancer-related gene lists for the colon, liver, pancreas and stomach tissues. The grey coloured overlapping part in the middle represents the number of common cancer-related genes among those four tissues.

On the other hand, when we combined the cancer-related gene lists from breast, cervical and prostate tissues, nineteen genes were found in all three tissues (Figure 22; Appendix II). Again, when we checked each gene's association with the pathways via the KEGG system, only eight of the nineteen genes were found to have a role in one or more pathways. Those eight genes, likewise, do not share a role in any common pathway.

Figure 22: Intersection of OMIM cancer-related gene lists for the breast, prostate and cervical tissues. The grey coloured overlapping part in the middle represents the number of common cancer-related genes among those three tissues.

The combined summary of the gene list intersection for liver, stomach, pancreas and colon together with the gene list intersection for breast, cervical and prostate can be seen below in Table 9. We also completed one further intersection of cancer-related gene lists, for breast, cervical, prostate, liver, lung, pancreatic and stomach tissues. This intersection yielded the genes APC, BRAF, CDKN2A, PTEN and TP53.

Table 9: Combined summary of the gene list intersection for liver, stomach, colon and pancreas and the gene list intersection for breast, cervical and prostate, in relation to the biological pathways. Each pathway name is followed by the gene symbols from the liver, stomach, colon and pancreas intersection and, after "|", the gene symbols from the breast, cervical and prostate intersection. Where only one gene set is shown, the row has an entry in only one of the two columns.

Cellular Processes: Adherens junction
    CDH1, IQGAP1
Cellular Processes: Apoptosis
    TP53 | TP53
Cellular Processes: Axon guidance
    DCC
Cellular Processes: Cell cycle
    CDKN2A, MADH4, TP53 | CDKN2A, CHEK2, TP53
Cellular Processes: Focal adhesion
    BRAF, HRAS | BRAF
Cellular Processes: Regulation of actin cytoskeleton
    APC, BRAF, HRAS, IQGAP1 | AFC, APC, BRAF
Environmental Information Processing: Hedgehog signaling pathway
    IHH
Environmental Information Processing: MAPK signaling pathway
    BRAF, HRAS, TP53 | NF1, BRAF, TP53
Environmental Information Processing: Phosphatidylinositol signaling system, and Metabolism: Inositol phosphate metabolism
    PTEN | PTEN
Environmental Information Processing: TGF-beta signaling pathway
    MADH4
Environmental Information Processing: Wnt signaling pathway
    APC, FZD4, MADH4, TP53 | APC
Human Diseases: Amyotrophic lateral sclerosis (ALS)
    TP53 | TP53
Human Diseases: Huntington's disease
    TP53 | TP53
Metabolism: Fatty acid biosynthesis (path 1)
    FASN
Metabolism: Fatty acid biosynthesis (path 2)
    BASE, FASN
Prostaglandin and leukotriene metabolism
    PTGS2
Reactome Event: Cell Cycle Checkpoints 69620
    CHEK2
Reactome Event: DNA Repair 73894
    XPA, XPC | BRCA1, BRCA2

4.2. Microarray Gene Expression

4.2.1. Cancer-Related Genes from GEO Database

After thoroughly reviewing the datasets downloaded from each publication, we analyzed each dataset individually. The breast carcinoma datasets yielded a list of 1,471 breast cancer-related genes. A total of 1,928 cancer-related genes for hepatocellular carcinoma were found to be expressed differently between normal and cancerous liver tissues. The analysis of the lung carcinoma datasets yielded 1,531 cancer-related genes. For pancreatic cancer, 2,613 cancer-related genes were confirmed to have clearly different expression levels between normal and cancer tissues, while the prostate cancer datasets yielded 829 cancer-related genes.

Having derived the cancer-related genes from the microarray gene expression datasets for the breast, liver, lung, pancreas and prostate tissue types, we intersected those five lists to examine whether any cancer-related genes are common to them. Nine genes are present across all five cancer types (Figure 23): FOXM1, HNRPDL, BIN1, BUB3, CCNI, PMS1, PRKCBP1, PURA and RPA3. Unfortunately, when we looked up those nine genes in the KEGG database, we could not locate any of them in any biological pathway.

Figure 23: Intersection of GEO cancer-related gene lists for the breast, liver, lung, prostate and pancreatic tissues. The grey coloured overlapping part in the middle of the right bottom diagram represents the number of common cancer-related genes among those five tissues.
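The numbers shown in an intersection diagram such as Figure 23 can be derived by recording, for each gene, the exact set of tissues whose list contains it. The following is a minimal sketch under that assumption; the function name is our own, and the three toy lists with genes g1 to g5 are made-up placeholders rather than GEO results.

```python
from collections import Counter


def venn_regions(gene_lists):
    """Count genes per exact combination of tissues (one count per diagram region)."""
    membership = {}
    for tissue, genes in gene_lists.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(tissue)
    return Counter(frozenset(tissues) for tissues in membership.values())


# Toy lists with placeholder gene names (NOT the thesis data).
toy_lists = {
    "breast": {"g1", "g2", "g3"},
    "liver": {"g2", "g3", "g4"},
    "lung": {"g3", "g5"},
}

regions = venn_regions(toy_lists)
central = regions[frozenset(toy_lists)]  # genes present in every list
print("genes common to all tissues:", central)
for tissues, count in regions.items():
    print(sorted(tissues), count)
```

The central overlapping region corresponds to the key that contains every tissue name, so its count equals the size of the full intersection.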

4.2.2. Cancer-Related Genes from SMD Database

After comprehensively reviewing the datasets downloaded from each publication, we analyzed each dataset individually according to its nature. For each cancer type, we selected the genes whose expression levels vary greatly between normal and cancer tissues. A list of 1,849 prostate cancer-related genes was extracted from Lapointe et al.'s prostate cancer datasets. A total of 1,077 cancer-related genes for breast carcinoma were found to be expressed differently between normal and cancer tissues using Sorlie et al.'s datasets. Garber et al.'s datasets yielded a list of 3,653 cancer-related genes for lung carcinoma, while Chen et al.'s datasets yielded 2,888 cancer-related genes for gastric cancer. Moreover, a list of 2,357 cancer-related genes from Chen et al.'s hepatocarcinoma data were confirmed to have rather different expression levels between normal and cancer tissues, while 3,818 cancer-related genes resulted from the analysis of Iacobuzio-Donahue et al.'s pancreatic cancer data.

Having derived the cancer-related genes from the microarray gene expression datasets for the breast, liver, lung, pancreas, prostate and stomach tissue types, we intersected those six lists to see whether any cancer-related genes are common to them. Six genes are present across all six cancer types (Figure 24): ARGBP2, CD53, FCGBP, JUN, MME and VBP1. Among those six genes, JUN is actively involved in the MAPK signaling pathway (environmental information processing), the Wnt signaling pathway, the Toll-like receptor signaling pathway, the T cell receptor signaling pathway, focal adhesion and the B cell receptor signaling pathway. In addition, MME has a role in both Alzheimer's disease and the Hematopoietic cell lineage. The genes ARGBP2, CD53, FCGBP and VBP1, however, have not yet been found to participate in any biological pathway.

Figure 24: Intersection of SMD cancer-related gene lists for the breast, liver, lung, prostate, pancreatic and stomach tissues. The yellow coloured overlapping part in the middle of the bottom diagram represents the number of common cancer-related genes among those six tissues.
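Sections 4.2.1 and 4.2.2 both start from lists of genes whose expression differs between normal and cancer tissues. The selection criterion actually used is not restated in this chapter, so the sketch below only illustrates one common possibility: keeping genes whose mean log2 expression differs between the two groups by at least a threshold. The threshold of 1.0 (a two-fold change), the function name and the toy values with placeholder gene names are all assumptions made for illustration.

```python
from statistics import mean


def differential_genes(expression, threshold=1.0):
    """Return {gene: tumor-minus-normal mean log2 difference} for genes
    whose absolute difference is at least `threshold` (an assumed cut-off)."""
    selected = {}
    for gene, groups in expression.items():
        diff = mean(groups["tumor"]) - mean(groups["normal"])
        if abs(diff) >= threshold:
            selected[gene] = diff
    return selected


# Toy log2 expression values with placeholder gene names (NOT the thesis data).
toy_expression = {
    "GENE_UP": {"normal": [5.0, 5.2, 4.8], "tumor": [7.1, 6.9, 7.0]},
    "GENE_DOWN": {"normal": [8.0, 8.2, 7.8], "tumor": [6.0, 6.1, 5.9]},
    "GENE_FLAT": {"normal": [6.0, 6.1, 5.9], "tumor": [6.1, 6.0, 6.2]},
}

selected = differential_genes(toy_expression)
for gene, diff in sorted(selected.items()):
    direction = "higher" if diff > 0 else "lower"
    print(f"{gene}: {direction} in tumor (log2 difference {diff:+.2f})")
```

A stricter or looser threshold changes the length of each per-cancer list and therefore the size of the intersections reported above, so the cut-off should be stated alongside any such list.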
