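Exploring Regulations and Functions of Transcribed Pseudogenes in Homo Sapiens

National Chiao Tung University

Institute of Bioinformatics and Systems Biology

Doctoral Dissertation

Student: Wen-Ling Chan (詹雯玲)

Advisors: Prof. Hsien-Da Huang (黃憲達)

Prof. Jan-Gowth Chang (張建國)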


A Thesis

Submitted to Institute of Bioinformatics and Systems Biology,

College of Biological Science and Technology,

National Chiao Tung University

in Partial Fulfillment of the Requirements

for the Degree of Ph.D. in Bioinformatics and Systems Biology

November 2012

Hsinchu, Taiwan, Republic of China

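Exploring Regulations and Functions of Transcribed Pseudogenes in Homo Sapiens

Student: Wen-Ling Chan

Advisors: Dr. Hsien-Da Huang, Dr. Jan-Gowth Chang

Institute of Bioinformatics and Systems Biology, National Chiao Tung University

Abstract (Chinese)

Pseudogenes have mainly been regarded as junk DNA sequences produced during evolution. Recent studies, however, have found that pseudogenes, especially transcribed ones, may act as gene regulators by generating endogenous small interfering RNAs, antisense RNAs, or inhibitory RNAs. In mice and fruit flies, transcribed pseudogenes have been shown to generate endogenous small interfering RNAs that regulate the expression of protein-coding genes, but this mechanism is unknown in humans. The main aim of this study is therefore to explore systematically two mechanisms of human transcribed pseudogenes: the generation of endogenous small interfering RNAs and the miRNA decoy function. To process and update these results and related information systematically, we also built a novel integrated database, pseudoMap, as a research platform for studying human transcribed pseudogenes. A pseudogene identified by bioinformatics, ψPPM1K, a transcribed pseudogene that is a partial retrotranscript of PPM1K, forms hairpin structures through inverted repeats and generates two endogenous small interfering RNAs that regulate many cellular genes, including NEK8. A study of 41 pairs of liver cancer and adjacent non-cancerous tissues showed that the two endogenous small interfering RNAs generated by ψPPM1K and the corresponding normal protein-coding gene were expressed at lower levels in cancer tissues than in non-cancerous tissues, the opposite of the expression of NEK8, the predicted target gene of esiRNA1. In addition, NEK8 and PPM1K were expressed at lower levels with the ψPPM1K-overexpressing vector than with the vector in which esiRNA1 was deleted. Furthermore, expressing NEK8 in cells already transfected with the ψPPM1K vector showed that NEK8 can counteract the inhibition of cell growth by ψPPM1K. To our knowledge, this is the first experimental demonstration that endogenous small interfering RNAs generated by a human transcribed pseudogene regulate liver cancer cells. It also verifies the feasibility of bioinformatic prediction. This study offers a more complete and deeper understanding of the functions and regulation of human transcribed pseudogenes and of their relationships with other protein-coding genes.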

Exploring Regulations and Functions of Transcribed Pseudogenes in Homo Sapiens

Student: Wen-Ling Chan

Advisors: Dr. Hsien-Da Huang, Dr. Jan-Gowth Chang

Institute of Bioinformatics and Systems Biology, National Chiao Tung University

Abstract

Pseudogenes have mainly been considered “junk” DNA, failed copies of genes that arise during the evolution of genomes. However, recent studies indicate that pseudogenes, especially transcribed ones, may function as gene regulators through the generation of endogenous small interfering RNAs (esiRNAs), antisense RNAs, or RNA decoys. In mice and flies, pseudogene transcripts can be processed into esiRNAs that regulate protein-coding genes, but whether this mechanism operates in humans remains unknown. The aim of this work is therefore to demonstrate systematically two mechanisms of human transcribed pseudogenes (TPGs): encoding esiRNAs and decoying miRNAs that target the parental gene. To enable the systematic compilation and updating of these results and additional information, we have developed a database, pseudoMap, an innovative and comprehensive resource for studying human TPGs. A genome-wide survey revealed a TPG, ψPPM1K, a partial retrotranscript from PPM1K (protein phosphatase, Mg2+/Mn2+ dependent, 1K), containing inverted repeats capable of folding into hairpin structures that can be processed into two esiRNAs; these esiRNAs potentially target many cellular genes, including NEK8. In 41 paired surgical specimens, we found significantly reduced expression of the two predicted ψPPM1K-specific esiRNAs and of the cognate gene PPM1K in hepatocellular carcinoma (HCC) compared with matched non-tumor tissues, whereas expression of the target gene NEK8 was increased in tumors. Additionally, NEK8 and PPM1K were down-regulated in stably transfected ψPPM1K-overexpressing cells, but not in cells transfected with an esiRNA1-deletion mutant of ψPPM1K. Furthermore, expression of NEK8 in ψPPM1K-transfected cells demonstrated that NEK8 can counteract the growth-inhibitory effects of ψPPM1K. These findings indicate that a transcribed pseudogene can exert tumor suppressor activity independent of its parental gene by generating esiRNAs that regulate human cell growth. To our knowledge, this is the first investigation of an esiRNA-mediated role of human pseudogenes in HCC, and it also verifies the computational prediction. This study provides further information for elucidating the function and regulation of human TPGs and their relationships with protein-coding genes.

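Acknowledgements

My student days have finally come to a close! I started out full of ambition and qualified as a doctoral candidate in my second year, then hit a wall as my research stalled, and at last found a way through and saw the work bear fruit. I was fortunate to have the help and support of many people along the way. I especially thank Professor Jan-Gowth Chang of China Medical University, who sparked my interest in biological research and kindled my enthusiasm for it; he is truly the teacher who first opened this field to me. During my doctoral studies I was privileged to be supervised by Professor Hsien-Da Huang, whose equal regard for informatics and biology and whose lively thinking allowed me to integrate two different fields, and who was understanding about my commuting between Taichung and Hsinchu. I thank Professor 楊文光 of China Medical University and his late wife for leading me into the temple of cell culture; their application of cell therapy to cancer treatment has offered patients a ray of hope and has shown notable results. I also thank Professor 鐘育志 of Kaohsiung Medical University and Professor 廖光文 of National Chiao Tung University for taking the time to advise on my dissertation and for their many valuable suggestions, which made it more complete. In addition, I thank 宗夷, 熙淵, 博凱, 勝達, 豐茂 and my labmates at National Chiao Tung University and China Medical University for their mutual support in research and in daily life; they have been both mentors and friends. Whatever I have achieved today I owe to the strong support of my family. I thank my parents-in-law, my sister-in-law and my other family members for their understanding during my long years of study and for helping whenever it was needed, so that I had no worries at home. Finally, the greatest credit goes to my husband, who gave me the utmost tolerance and support, and to my two children, who cheered their mother on whenever I was exhausted. I offer you all my deepest thanks; you are my strong backing. I dedicate this dissertation to my mother, who did not live to see me graduate.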

List of Figures

Figure 1.1 The central dogma of molecular biology.

Figure 1.2 Human genome.

Figure 1.3 Schematic representation of the emerging ncRNA world.

Figure 1.4 Micro-RNA biogenesis.

Figure 1.5 The functions of lncRNA.

Figure 1.6 The mechanisms of pseudogene.

Figure 1.7 The mechanism of pseudogene-mediated production of esiRNAs.

Figure 1.8 TPG-PTENP1 may act as a miRNA decoy.

Figure 1.9 Our hypothesis that human TPGs produce esiRNAs to regulate protein-coding genes.

Figure 1.10 The system flow of this project.

Figure 2.1 Workflow for identification of TPG-derived esiRNA-target interactions.

Figure 2.2 Identification of miRNA-target interactions of TPG and its cognate gene.

Figure 3.1 Chromosome location of pseudogenes and protein-coding genes.

Figure 3.2 Gene expression profiles of TPGs in 79 human physiologically normal tissues.

Figure 3.3 Gene expression profiles of TPGs. (a) Expression heat-map of TPGs in paired cancers compared to non-malignant tissues. (b) TPG expression profiles in the tumor tissues/cells referenced by normal liver tissues.

Figure 3.4 Statistics of classification of TPG-NG pairs.

Figure 3.5 Enriched GO terms of NGs in classification of TPG-NG pairs.

Figure 3.6 Gene expression profiles of TPGs referenced by their paired NGs in 79 human physiologically normal tissues.

Figure 3.7 Gene expression profiles of TPGs referenced by their paired NGs in 61 human tumor tissues.

Figure 3.8 pseudoMap: a resource for exploring esiRNA-mediated mechanisms in human transcribed pseudogenes.

Figure 3.9 Web interface of pseudoMap.

Figure 3.10 Search interface of pseudoMap.

Figure 3.11 The genomic view of ψPPM1K.

Figure 3.12 Schematic representation of ψPPM1K and its parental gene, PPM1K. (a) ψPPM1K is located on chromosome 4 proximal to PPM1K. (b) Alignment of ψPPM1K and its cognate gene PPM1K.

Figure 3.13 Candidate ψPPM1K-derived esiRNAs and their targets. (a) Location and read counts of transcribed ψPPM1K RNA from sRNA deep sequencing data. (b) [...] esiRNA3 mapping to PPM1K gene. (d) Hairpin structure prediction of precursor esiRNA1. (e) Matches of esiRNA1 and esiRNA2 sequences with target gene NEK8 and parental gene PPM1K. (f) Expression profiles of ψPPM1K, PPM1K and NEK8 in HCC tissues/cells.

Figure 3.14 Characterization of esiRNA1-targeted genes. The gene ontology was categorized according to biological process (a), molecular function (b) and cellular component (c) to determine the common cellular functions affected by esiRNA1. (d) Common transcription factor binding sites (TFBS) of target genes.

Figure 3.15 Expression patterns of HCC tissues and cell lines. (a) RT-qPCR of two esiRNA precursors, PPM1K and NEK8, in paired HCC tissues. (b) RNA levels of ψPPM1K and PPM1K in HepG2 and Huh-7 cells.

Figure 3.16 Effect of overexpressed ψPPM1K on cell growth and clonogenic activity in transfected Huh-7 cell clones. (a) HCC line Huh-7 and HepG2 cells were transfected with ψPPM1K-expressing recombinant plasmid to isolate stably transfected cell clones. (b) All three ψPPM1K-expressing cell lines have a slower proliferation rate than the vector control cell line. (c) Serial photographs of the same colonies at day 5, day 7 and day 9 showing the two-dimensional growth of mock2, TPG1, TPG2 and TPG7 transfected Huh-7 clones on plastic culture dishes. (d) Clonogenic activity of mock2, TPG1, TPG2 and TPG7.

Figure 3.17 Expression of target genes in HCC cell clones.

Figure 3.18 Expression of ψPPM1K-derived esiRNAs.

Figure 3.19 Localization of pseudogene-derived esiRNAs by FISH analysis.

Figure 3.20 Expression of NEK8 in Huh-7 cells transfected with synthetic siRNA1. (a) The sequence of synthetic siRNA1. (b) Huh-7 cells were transfected with synthetic siRNA1. (c) Expression of NEK8 in an esiRNA1-deletion mutant cell line. (d) Growth of Huh-7 TPG7 cells transfected with either NEK8-overexpressing plasmid or empty vector analyzed by cell proliferation assay.

Figure 3.21 ψPPM1K alters PPM1K expression and mitochondrial function.

Figure 3.22 FACS analysis of mitochondrial Rh123 uptake and release from mock2 (a), TPG1 (b), TPG2 (c) and TPG7 (d) transfected HCC Huh-7 clones.

Figure 3.23 miRNA regulation of ψPPM1K and PPM1K. (a) Alignments of miRNA targets with ψPPM1K and PPM1K. (b) miR-3174 down-regulation of ψPPM1K and PPM1K.

List of Tables

Table 2.1 Summary of public deep sequencing data from various sRNA libraries and gene expression profiles.

Table 2.2 Classification of TPG-NG pairs.

Table 2.3 Supported databases and tools in pseudoMap.

Table 2.4 The sequences of the probes and primers used in RQ-PCR/RT-PCR.

Table 3.1 Human TPG-derived miRNA-target interactions.

Table 3.2 piRNAs derived from TPGs and supported by deep sequencing data.

Table 3.3 Comparison of pseudoMap with current public pseudogene databases.

Table 3.4 The top 20 predicted esiRNA1 target genes (score above 200) from modified TargetScan, RNAhybrid and miRanda.

Chapter 1 Introduction

The human genome contains more pseudogenes than corresponding functional genes [1]. Pseudogenes are generally defined as nonfunctional copies of gene fragments incorporated into the genome either by duplication of genomic DNA or by retrotransposition of mRNA [2]. In apparent contradiction to the assumption that pseudogenes are genomic fossils, genome-wide investigations have recently provided evidence for actively transcribed pseudogenes (TPGs) with potential functional implications [3-10]. This implies that pseudogenes, especially transcribed ones, may not be mere genomic fossils, but their biological significance remains unclear. This dissertation asks which mechanisms human TPGs are involved in. To address this question, we focus on two topics: the relationship between a TPG and its cognate gene through the miRNA decoy mechanism, and the interactions between TPG-derived endogenous small interfering RNAs (esiRNAs) and their targets. Additionally, to enable the systematic compilation and updating of these results and additional information, a database, pseudoMap, is constructed to study human TPGs; it captures various types of information, including sequence data, TPG and cognate annotation, deep sequencing data, RNA-folding structure, gene expression profiles, miRNA annotation and target prediction. Hepatocellular carcinoma (HCC) is one of the most common human cancers worldwide, particularly in Asia and Africa [11]. We are therefore interested in exploring the relationship between TPGs and HCC. Finally, considering the prediction results and the functions of the TPG, its cognate gene and its target genes, a TPG, ψPPM1K, a partial retrotranscript from PPM1K (protein phosphatase, Mg2+/Mn2+ dependent, 1K), is [...]

1.1 Biological background

1.1.1 Central dogma

The central dogma of molecular biology (Figure 1.1) describes the flow of genetic information within a biological system. It was first stated by Francis Crick [12]. In brief, the dogma involves four steps. First, RNA polymerase docks onto the chromosome and slides along the gene, transcribing the sequence on one strand of DNA into a single strand of RNA. Next, all introns (the noncoding parts of the initial RNA transcript) are spliced out, and the remaining segments are joined to make a messenger RNA. The mRNA then moves out of the nucleus into the cytosol, where molecular machines translate it into chains of amino acids. Finally, each chain twists and folds into an intricate three-dimensional shape. Traditionally, proteins have been regarded as the main carriers of biological function.

1.1.2 Non-coding RNA

Classically, proteins are recognized as having the main responsibility for biological function, with RNA merely a messenger that transfers protein-coding information from DNA [13, 14]. This concept has changed in recent years: although less than 2% of the genome encodes protein, over 80% of the genome produces non-protein-coding RNA transcripts (Figure 1.2) [13-17], and these non-coding RNAs (ncRNAs) have important biological functions, including gene regulation [18, 19], imprinting [20-24], epigenetic regulation [25, 26], cell cycle control [27], and regulation of transcription, translation and splicing [19, 28-32]. Many studies have found that ncRNAs regulate the expression of protein-coding genes. An ncRNA is any RNA molecule that is not translated into a protein, such as piwi-interacting RNAs (piRNAs), microRNAs (miRNAs), short interfering RNAs (siRNAs), long ncRNAs (lncRNAs) and transcribed pseudogenes (TPGs) (Figure 1.3). The following sections introduce the functions of these ncRNAs.

Figure 1.3 Schematic representation of the emerging ncRNA world.

1.1.3 Piwi interacting RNA (piRNA)

Piwi-interacting RNAs (piRNAs), which form RNA-protein complexes through interactions with piwi proteins, are the largest class of small RNA (sRNA) molecules expressed in animal cells [33]. The biogenesis of piRNAs is not yet fully understood. piRNAs have been linked to both epigenetic and post-transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells, particularly during spermatogenesis [34].

1.1.4 MicroRNA (miRNA)

MicroRNAs (miRNAs) play important roles in development, oncogenesis and apoptosis by binding to mRNAs and regulating gene expression at the post-transcriptional level in mammals, plants and insects [35, 36]. The general biogenesis of miRNAs is shown in Figure 1.4. In brief, miRNAs are single-stranded RNAs of ~22 nt generated from endogenous transcripts. A miRNA gene is transcribed by RNA polymerase II [8], and the primary miRNA (pri-miRNA) is first processed by the nuclear RNase III enzyme Drosha to release a hairpin-shaped intermediate, the precursor miRNA (pre-miRNA) [37]. A pre-miRNA is typically 60-70 nt long and forms a hairpin containing a ~22 bp double-stranded stem and a ~10 nt terminal loop. The nuclear export factor Exportin 5 exports the pre-miRNA from the nucleus to the cytoplasm [38]. The pre-miRNA is then cleaved by another RNase III enzyme, Dicer, to generate a ~22 nt RNA duplex that includes the mature miRNA, which becomes part of the RNA-induced silencing complex (RISC) [39]. The mature miRNA then binds to complementary sites in the target mRNA and negatively regulates gene expression through two major mechanisms: mRNA degradation, when the miRNA hybridizes perfectly with its target site, and translational repression, when hybridization is imperfect.

1.1.5 Small interfering RNA (siRNA)

Small interfering RNAs (siRNAs) are a class of double-stranded RNA molecules 20-25 bp in length. siRNAs play many roles, the most notable being in the RNA interference (RNAi) pathway, where they interfere with the expression of specific genes that have complementary nucleotide sequences [40].

1.1.6 Long non-coding RNA (lncRNA)

A long non-coding RNA (lncRNA) is generally defined as a non-protein-coding transcript more than 200 nt in length. This cutoff reflects practical considerations, including the separation of RNAs in common experimental protocols. Large-scale sequencing of cDNA libraries and, more recently, transcriptome sequencing by next-generation sequencing indicate that the human genome contains more than ten thousand lncRNAs [41]. The functions of lncRNAs are shown in Figure 1.5 [42]. In brief, an lncRNA transcribed from an upstream non-coding promoter can negatively (Figure 1.5 ①) or positively (Figure 1.5 ②) affect expression of the downstream gene by inhibiting RNA polymerase II recruitment and/or inducing chromatin remodelling, respectively. An lncRNA can hybridize to the pre-mRNA and block recognition of the splice sites by the spliceosome, resulting in an alternatively spliced transcript (Figure 1.5 ③). Alternatively, hybridization of sense and antisense transcripts can allow Dicer to generate endo-siRNAs (Figure 1.5 ④). Binding of an lncRNA to a miRNA results in silencing of the miRNA (Figure 1.5 ⑤). A complex of an lncRNA and specific protein partners can modulate protein activity (Figure 1.5 ⑥), structure (Figure 1.5 ⑦), localization (Figure 1.5 ⑧) or epigenetic regulation (Figure 1.5 ⑨). Finally, [...]

Figure 1.5 The functions of lncRNA.

1.1.7 Pseudogene

Pseudogenes are DNA sequences in the genome that are similar to specific protein-coding genes but are unable to produce functional proteins because of frameshifts, premature stop codons or other deleterious mutations [2]. Pseudogenes have been denoted in several ways, including a prefixed Greek symbol ψ, for example ψPPM1K, or a capital ‘P’ suffix, for example ZNF355P. There are two major classes of pseudogene (Figure 1.6): processed pseudogenes, which contain poly-A tails, lack introns and arise through retrotransposition, and nonprocessed pseudogenes, which result from gene duplication and retain the exon/intron structure, although sometimes incompletely [2]. Pseudogenes are usually considered to be junk DNA and genomic fossils; however, a number of recent studies have shown that pseudogenes, especially transcribed ones, may not be mere genomic fossils but may function as gene regulators. The following section introduces the functions of TPGs.

Figure 1.6 The mechanisms of pseudogene.
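One of the disabling features named in this section, the premature stop codon, can be illustrated with a minimal Python sketch. It scans a coding sequence in frame and reports the first stop codon that occurs before the final codon. The function name and the toy sequences are hypothetical, and real pseudogene annotation relies on alignment against the parental gene rather than on this simple scan.

```python
from typing import Optional

# Standard DNA stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop codon
    that is not the final codon, or None if there is none."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    for index, codon in enumerate(codons[:-1]):  # the last codon may be a normal stop
        if codon in STOP_CODONS:
            return index
    return None


# Toy example (hypothetical sequences): the second one carries TAA at codon 2.
intact = "ATGAAAGGGTGA"
disabled = "ATGAAATAAGGGTGA"
print(first_premature_stop(intact), first_premature_stop(disabled))
```

A frameshift could be checked in a similar spirit by comparing the length of an aligned pseudogene segment with that of its parental exon, but that requires an alignment and is left out of the sketch.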

1.1.8 Transcribed pseudogene (TPG)

Transcribed pseudogenes (TPGs) are disabled but nonetheless transcribed. TPGs may function as gene regulators through the generation of endogenous siRNAs (esiRNAs), antisense RNAs, or RNA decoys. For instance, the NOS pseudogene transcript acts as a natural antisense regulator of neuronal NOS protein synthesis in snails [44, 45]; and in mice, reduced expression of makorin1-p1 due to a transgene insertion caused mRNA instability of its [...] contradictory results were also reported [47]. Additionally, a transcript of ψPTEN/PTENP1, a highly homologous processed TPG of the tumor suppressor gene PTEN, not only interacts with its cognate sequence but also exerts a growth-suppressive role as a decoy by binding to PTEN-targeting miRNAs [48]. These findings clearly imply that TPGs may play active regulatory roles in cellular functions.

1.2 Related works

RNA interference (RNAi) is a natural cellular process that defends cells against viruses and transposons and also regulates gene expression in a sequence-specific manner [40]. Three RNAi pathways can be distinguished on the basis of the biogenesis and functional roles of the classes of small RNA involved. Two of them involve siRNAs, which result from the processing of dsRNA, and miRNAs, which derive from shRNA [39, 49]. The third involves piRNAs: ssRNA sequences that interact with piwi proteins and appear to be involved in transcriptional gene silencing of retrotransposons and other genetic elements in germ line cells [50]. By binding to mRNAs and thereby repressing protein synthesis, miRNAs may regulate cellular development, oncogenesis and apoptosis [35, 36]. Previous studies showed that pseudogenes may produce esiRNAs that regulate protein-coding genes through the RNAi mechanism (Figure 1.7). In mice and fruit flies, dsRNAs arising from the antisense/sense transcripts of processed TPGs and their cognate genes (Figure 1.7 b), or hairpin structures resulting from inversion and duplication (Figure 1.7 a), are cut by Dicer into 21 nt siRNAs that bind the RNA-induced silencing complex (RISC) and regulate the expression of the parental gene [49, 51-55]. Whether such a regulatory mechanism exists in humans remains unclear.
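The hairpin route described above depends on an inverted repeat: a segment followed, after a short loop, by its own reverse complement. The following Python sketch shows how such repeats could be located by exact matching. The function names, the arm length and the loop limits are illustrative assumptions, not parameters taken from this study, and real hairpin prediction tolerates mismatches and uses thermodynamic folding.

```python
from typing import List, Tuple

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_inverted_repeats(seq: str, arm: int = 8,
                          min_loop: int = 3, max_loop: int = 40) -> List[Tuple[int, int]]:
    """Return (left_start, right_start) pairs where the right arm is the exact
    reverse complement of the left arm and the loop length is within bounds."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    hits = []
    for i in range(n - 2 * arm - min_loop + 1):
        target = reverse_complement(seq[i:i + arm])
        first = i + arm + min_loop                 # closest allowed right arm
        last = min(i + arm + max_loop, n - arm)    # farthest allowed right arm
        for j in range(first, last + 1):
            if seq[j:j + arm] == target:
                hits.append((i, j))
    return hits


# Toy sequence (hypothetical): an 8 nt arm, a 4 nt loop, and the matching arm.
toy = "GACUGACU" + "AAAA" + "AGUCAGUC"
print(find_inverted_repeats(toy))
```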

Another study indicated that a pseudogene may act as a miRNA decoy (Figure 1.8). A transcript of ψPTEN/PTENP1, a highly homologous processed TPG of the tumor suppressor gene PTEN, not only interacts with its cognate sequence but also exerts a growth-suppressive role as a decoy by binding to PTEN-targeting miRNAs [48].

At present, most computational studies focus on the evolution of pseudogenes. A number of public databases and tools for studying pseudogenes are described below.

Figure 1.8 TPG-PTENP1 may act as a miRNA decoy.

1.2.1 Public biological databases

At present, computational analyses focus on the evolution of pseudogenes, and no resource provides functional information on pseudogenes. The following sections describe the pseudogene-related databases.

1.2.1.1 Pseudogene.org

Pseudogene.org [56] is developed and maintained by the Gerstein group at Yale. The site contains a comprehensive database of identified pseudogenes, utilities for finding pseudogenes, various publication data sets and a pseudogene knowledgebase.

1.2.1.2 Hoppsigen DB

Hoppsigen [57] is a nucleic acid database of homologous processed pseudogenes, developed at the PBIL (Pôle Bioinformatique Lyonnais). The authors identified 5,823 human retroelements and 3,934 mouse retroelements, which were annotated and stored in the database HOPPSIGEN (Homologous processed pseudogenes). Sequences were grouped into families according to their homologies. The database contains 3,168 families of exclusively human (1,966) or mouse (1,202) retroelements and 323 families containing both human and mouse retroelements. In total, 5,206 human retroelements were annotated as processed pseudogenes. The database also contains functional genes from Ensembl that are homologous to Hoppsigen retroelements.

1.2.1.3 UI Pseudogenes

The UI Pseudogenes website [58] serves as a repository for all pseudogenes in the human genome. It also provides a ranked list of human pseudogenes that have been identified as candidates for gene conversion.

1.2.1.4 Human Pseudogenes

Torrents et al. [1] published “A genome-wide survey of human pseudogenes” in Genome Research. The authors screened all intergenic regions in the human genome to identify pseudogenes by combining homology searches with a functionality test based on the ratio of silent to replacement nucleotide substitutions. They detected 19,537 pseudogenes, comprising 17,759 processed pseudogenes and 1,778 non-processed pseudogenes.

1.2.1.5 miRBase

The miRBase database [59] aims to provide integrated interfaces to comprehensive microRNA sequence data, annotation and predicted gene targets. miRBase takes over functionality from the microRNA Registry and fulfils three main roles; among them, the miRBase Registry acts as an independent arbiter of microRNA gene nomenclature, assigning names prior to publication of novel miRNA sequences.

1.2.1.6 Ensembl

Ensembl [60] is a joint project between the European Molecular Biology Laboratory (EMBL) and the Wellcome Trust Sanger Institute (WTSI) to develop a software system that produces and maintains automatic annotation on selected eukaryotic genomes.

1.2.1.7 UCSC Genome Browser Database

The UCSC Genome Browser Database [61] contains the reference sequence and working draft assemblies for a large collection of genomes. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety of annotations at all scales, from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering and comparison of genes by several metrics, including expression data and several gene properties. BLAT and In Silico PCR search for sequences in entire genomes in seconds.

1.2.1.8 fRNAdb

The Functional RNA Database (fRNAdb) [62] hosts a large collection of known/predicted non-coding RNA sequences from public databases: H-invDB v5.0 [10], FANTOM3 [63], miRBase 17.0 [64], NONCODE v1.0 [65], Rfam v8.1 [66], RNAdb v2.0 [67] and snoRNA-LBME-db rel. 3 [68].

1.2.1.9 Gene Expression Omnibus (GEO)

The Gene Expression Omnibus (GEO) [69] at the National Center for Biotechnology Information (NCBI) is a gene expression/molecular abundance repository supporting MIAME [70] compliant data submissions, and an online resource for gene expression data browsing, query and retrieval.

1.2.2 Public analysis tools

1.2.2.1 BLAST

The basic local alignment search tool (BLAST) [71] finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
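The notion of a region of local similarity can be made concrete with a small Python sketch. It computes the best local alignment score between two sequences with the Smith-Waterman dynamic programme. This is a stand-in chosen for brevity, not the BLAST algorithm itself: BLAST uses a faster seed-and-extend heuristic and adds statistical significance estimates. The scoring values are arbitrary assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with linear gap penalties."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# Toy example: the shared core "ACG" yields three matches.
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```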

1.2.2.2 ClustalW

ClustalW [72] is a general purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen.

1.2.2.3 RNAhybrid

RNAhybrid [73] is a variation of the classic RNA secondary structure prediction. It determines the most favorable hybridization site between two sequences. RNAhybrid does not use any RNA folding or pairwise sequence alignment code, but implements an algorithm that was specifically designed for RNA hybridization.

1.2.2.4 TargetScan

Lewis et al. [74] predicted regulatory targets of mammalian microRNAs (miRNAs) by identifying mRNAs with conserved complementarity to the seed (nucleotides 2-8) of the miRNA. They developed an algorithm, TargetScan, that combines the minimum free energy of the miRNA/mRNA duplex with comparative sequence analysis to predict miRNA targets conserved across multiple genomes.
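The seed rule can be sketched in a few lines of Python: take nucleotides 2-8 of the miRNA, form the reverse complement, and look for that 7-mer in a 3’UTR. This is only an illustration of seed complementarity under exact Watson-Crick pairing; the function names are hypothetical, and the conservation and free-energy components of TargetScan are not reproduced here.

```python
from typing import List

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA string (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna: str, utr: str) -> List[int]:
    """Return 0-based start positions in the UTR that pair perfectly
    with the miRNA seed (nucleotides 2-8, counted from the 5' end)."""
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    site = reverse_complement(mirna[1:8])   # slice [1:8] = nucleotides 2..8
    return [i for i in range(len(utr) - len(site) + 1)
            if utr[i:i + len(site)] == site]


# let-7a as the example miRNA; the UTR fragment is a made-up toy sequence.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(seed_sites(let7a, "AAACUACCUCAAA"))
```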

1.2.2.5 miRanda

miRanda [75] is open-source software for miRNA target prediction. It was used to scan all available miRNA sequences for a given genome against the 3’UTR sequences of that genome derived from the Ensembl database and, tabulated separately, against all cDNA sequences and coding regions. The algorithm uses dynamic programming to search for maximal local complementarity alignments, corresponding to a double-stranded antiparallel duplex.

1.2.2.6 Mfold

Mfold [76] is a tool for predicting the secondary structure of RNA and DNA by free energy minimization. The ‘m’ simply refers to ‘multiple’. A dynamic programming algorithm is used to predict a minimum free energy as well as minimum free energies for foldings that must contain any particular base pair. Base-pairs within this free energy increment are chosen either automatically or else by the user. A web system [77] is also provided for predicting the secondary structure of single-stranded nucleic acids.
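To give a feel for the dynamic programming behind secondary structure prediction, the sketch below implements the Nussinov recursion, which maximizes the number of base pairs. This is a deliberately simplified stand-in: Mfold minimizes free energy with nearest-neighbour energy parameters, which this example does not attempt. The function name and the minimum loop length are illustrative assumptions.

```python
def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Nussinov-style maximum number of nested base pairs, requiring at
    least `min_loop` unpaired bases between paired positions."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case: j is unpaired
            for k in range(i, j - min_loop):         # case: j pairs with k
                if (seq[k], seq[j]) in allowed:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


# Toy hairpin: three G-C pairs closing a four-base loop.
print(max_base_pairs("GGGAAAACCC"))
```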


1.3 Motivation

In the last ten years, a variety of complex mechanisms, such as gene silencing, gene transcription, DNA imprinting, DNA demethylation, chromatin structure dynamics and RNA interference, have been connected to ncRNAs [78]. TPGs resemble lncRNAs, and some of them regulate protein-coding genes. A previous study indicated that TPGs are significantly over-represented near both the 5’ and 3’ ends of genes, which suggests that TPGs can be formed through gene-promoter co-option or intrusion into untranslated regions [5]. However, roughly half of the TPGs are located away from genes, in intergenic DNA [5]. Following the finding that the NOS pseudogene transcript acts as a natural antisense regulator of neuronal NOS protein synthesis in snails [44, 45], in 2005 we hypothesized that human TPGs may produce esiRNAs that regulate protein-coding genes (Figure 1.9). However, the project stalled because we could not predict the exact esiRNAs derived from TPGs. In 2008, next-generation sequencing (NGS) studies showed that pseudogene-derived esiRNAs regulate transcripts in mouse oocytes and Drosophila somatic cells [49, 51-55]. This evidence supported our earlier hypothesis. In recent years, NGS data from various sRNA libraries have accumulated rapidly, which allows us to predict TPG-derived esiRNAs. Additionally, a transcript of ψPTEN/PTENP1, a highly homologous processed TPG of the tumor suppressor gene PTEN, not only interacts with its cognate sequence but also exerts a growth-suppressive role as a decoy by binding to PTEN-targeting miRNAs [48]. This implies that miRNAs can co-regulate a TPG and its parental gene and that a TPG may act as a miRNA decoy. We therefore also want to know how many miRNAs co-regulate TPGs and their cognate genes.

Figure 1.9 Our hypothesis that human TPGs produce esiRNAs to regulate protein-coding genes.

1.4 Research goals

The purpose of this project is to identify systematically the regulation and functions of TPGs in Homo sapiens. Figure 1.10 shows the system flow of the project, which comprises four major tasks. First, we want to test our hypothesis that human TPGs may produce esiRNAs/miRNAs that regulate protein-coding genes. To address this, we develop a computational pipeline that maps TPGs to small RNAs (sRNAs) supported by public deep sequencing data from various sRNA libraries and constructs the TPG-derived esiRNA-target interactions. Second, we examine the regulation of TPGs, focusing specifically on miRNAs that co-regulate a TPG and its cognate gene. Third, to enable the systematic compilation and updating of these results and additional information, we develop a database, pseudoMap, capturing various types of information, including sequence data, TPG and cognate annotation, deep sequencing data, RNA-folding structure, gene expression profiles, miRNA annotation and target prediction. Finally, we select ψPPM1K, a partial retrotranscript from PPM1K (protein phosphatase, Mg2+/Mn2+ dependent, 1K), to verify our in silico results.

Figure 1.10 The system flow of this project.

1.5 Dissertation Organization

This dissertation analyzes the regulation and functions of TPGs in Homo sapiens and is structured as follows. Chapter 1 provides background on the research area within which the present study was carried out and states the specific research questions. Chapter 2 describes the computational pipeline and the subsequent experimental tests constructed to identify TPG-derived esiRNA-target interactions and the miRNA-mediated mechanism of a TPG and its cognate gene. Chapter 3 presents the human TPG resource, pseudoMap, developed to enable the systematic compilation and updating of these results and additional information. Chapter 3 also presents the TPG ψPPM1K, a partial retrotranscript from PPM1K (protein phosphatase, Mg2+/Mn2+ dependent, 1K), which was the first TPG selected for detailed model studies to evaluate the performance and feasibility of this systematic approach. Chapter 4 discusses the experimental results and the perspective of our study, and Chapter 5 briefly concludes the study. Finally, future work is outlined.


Chapter 2 Materials and Methods

In this chapter, we present the computational and experimental approaches. The system flow of this study is shown in Figure 1.10. First, we drew on a number of public databases to create the dataset. Second, various public tools and in-house programs were integrated into the analysis. Third, to enable the systematic compilation and updating of our results and additional information, we developed a database, pseudoMap. Finally, a series of experiments was performed to verify our in silico results.

2.1 Data generation

In total, more than 20,000 human pseudogenes and their cognate genes were obtained from the Ensembl database (Ensembl 63, GRCh37) using BioMart (http://www.ensembl.org/index.html). The Affymetrix GeneChip® Human Genome U133A and U133Plus2 microarrays comprise oligonucleotide probes that measure the transcription level of each represented sequence, including transcribed pseudogenes. A total of 1,404 pseudogenes are detectable by these chips; they were therefore considered to be transcribed and named transcribed pseudogenes (TPGs). Functional small RNAs (fsRNAs) with sequence lengths of 18 to 40 nt were collected from the Functional RNA Database (fRNAdb) [62], which hosts a large collection of known/predicted non-coding RNA sequences from public databases: H-invDB v5.0 [10], FANTOM3 [63], miRBase 17.0 [64], NONCODE v1.0 [65], Rfam v8.1 [66], RNAdb v2.0 [67] and snoRNA-LBME-db rel. 3 [68]. The sequences of human miRNAs were obtained from miRBase 18 [79]. Genomic sequences and conservation data were collected from UCSC hg19 [80] (http://hgdownload.cse.ucsc.edu/downloads.html).


2.2 System flow of identifying pseudogene-derived esiRNA-target interactions

Figure 2.1 depicts the workflow for identifying pseudogene-derived esiRNA-target interactions (eSTIs). After collection of pseudogenes, protein-coding genes and fsRNAs, the pseudogene-specific esiRNAs were examined by aligning the pseudogenes with fsRNAs, excluding alignments with parental genes. Candidate pseudogene-specific esiRNAs were validated by reference to publicly available deep sequencing data from various sRNA libraries. Additionally, eSTIs were analyzed by three target prediction tools and verified with gene expression profiles. Detailed procedures are described below.

2.2.1 Identification of pseudogene-derived esiRNAs

To predict candidate pseudogene-derived esiRNAs, we aligned the pseudogene sequences with fsRNAs of 18-40 nt obtained from fRNAdb. Deep sequencing data from sRNA libraries derived from human embryonic stem cells or HCC/liver tissues were used to verify these candidates [81-83]. The extended sequences of the candidate esiRNAs were then used to predict hairpin structures with Mfold [77]. Details of the publicly available deep sequencing data are shown in Table 2.1.
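The matching step described above can be illustrated with a minimal sketch. It assumes exact substring matching on the sense strand in place of the alignment tool actually used, which this section does not name; the function name and the toy sequences are hypothetical and are not taken from the thesis data.

```python
def pseudogene_specific_srnas(srnas, pseudogene_seq, parental_seq,
                              min_len=18, max_len=40):
    """Return IDs of sRNAs that match the pseudogene but not its parental gene.

    Exact substring matching is an assumption made for illustration only.
    """
    hits = []
    for srna_id, seq in srnas.items():
        seq = seq.upper().replace("U", "T")  # compare RNA reads on the DNA alphabet
        if not (min_len <= len(seq) <= max_len):
            continue  # keep only the 18-40 nt length window
        if seq in pseudogene_seq and seq not in parental_seq:
            hits.append(srna_id)
    return hits


# Toy sequences (hypothetical): the parental copy differs by one base.
pseudogene = "TTGACCATGCAAGGCTTACGGATCCATGGTCAAGCTTAGGA"
parental = "TTGACCATGCAAGGCTTTCGGATCCATGGTCAAGCTTAGGA"
srnas = {
    "s1": "AUGCAAGGCUUACGGAUCCAUG",  # spans the differing base
    "s2": "GGAUCCAUGGUCAAGCUUAGGA",  # shared with the parental gene
    "s3": "AUGCAAGG",                # shorter than 18 nt
}
print(pseudogene_specific_srnas(srnas, pseudogene, parental))
```

In this toy input only `s1` is retained, because `s2` also occurs in the parental sequence and `s3` falls outside the length window. A real pipeline would also need to handle the antisense strand and mismatches, which this sketch deliberately ignores.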


Figure 2.1 Workflow for identification of TPG-derived esiRNA-target interactions.

2.2.2 Identification of esiRNA-target interactions (eSTIs)

Based on experimentally supported data sets, Sethupathy et al. [11] and Baek et al. [64] have shown that the intersection of miRNA target prediction tools can yield improved specificity with only a marginal decrease in sensitivity relative to any individual algorithm. We modified our previous approach [84] for identifying pseudogene-derived esiRNA targets. Briefly, three previously developed computational approaches, TargetScan [85-87], miRanda [75] and RNAhybrid [73], were employed to identify esiRNA target sites within the conserved regions of the 3’-UTR of genes in 12 metazoan genomes. The minimum free energy (MFE) threshold was -20 kcal/mol with score ≥ 150 for miRanda; default parameters were used for TargetScan and RNAhybrid. The three criteria for identifying targets were: (1) potential target sites must be predicted by at least two tools; (2) hits with multiple target sites are prioritised; (3) target sites must be located in accessible regions. Finally, three gene expression profiles were obtained from NCBI GEO [88] to verify those eSTIs in which pseudogene expression was higher than that of the target genes. The gene expression profiles were GDS596 [89], GSE5364 [90] and GSE6222 [91]; detailed experimental conditions are described in Table 2.1. The Pearson correlation coefficient was computed for pseudogenes and their target genes.
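Criteria (1) and (2) above can be sketched as a simple consensus filter. The input layout (a dictionary from tool name to predicted hits), the function name and the gene labels are assumptions made for illustration; criterion (3), site accessibility, is not modelled here.

```python
from collections import defaultdict


def consensus_targets(predictions, min_tools=2):
    """Keep (esiRNA, gene) pairs predicted by at least `min_tools` tools.

    `predictions` maps a tool name to a list of (esiRNA, gene, n_sites) tuples.
    Pairs are returned sorted so that targets with more sites come first.
    """
    tools_per_pair = defaultdict(set)
    max_sites = defaultdict(int)
    for tool, hits in predictions.items():
        for esirna, gene, n_sites in hits:
            pair = (esirna, gene)
            tools_per_pair[pair].add(tool)
            max_sites[pair] = max(max_sites[pair], n_sites)
    kept = [pair for pair, tools in tools_per_pair.items()
            if len(tools) >= min_tools]
    return sorted(kept, key=lambda pair: (-max_sites[pair], pair))


# Hypothetical predictions; GENE_A, GENE_B and GENE_C are placeholder names.
predictions = {
    "TargetScan": [("esiRNA1", "GENE_A", 2), ("esiRNA1", "GENE_B", 1)],
    "miRanda": [("esiRNA1", "GENE_A", 3), ("esiRNA1", "GENE_C", 1)],
    "RNAhybrid": [("esiRNA1", "GENE_B", 1)],
}
print(consensus_targets(predictions))
```

With this toy input, `GENE_C` is dropped because only one tool predicts it, and `GENE_A` is listed before `GENE_B` because it carries more predicted sites.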

2.3 miRNA-target interactions (MTI)

In this process, we focus on the miRNA-target interactions (MTIs) of a TPG and its cognate gene (NG). The scheme for identifying MTIs is shown in Figure 2.2. To obtain the pairs of TPG and parental gene (TPG-NG), TPGs were mapped to all human protein-coding genes with BLAST (P-value ≤ 0.00001) [71]. We calculated the TPG-similarity and NG-similarity as alignment-length/TPG-length and alignment-length/NG-length, respectively. According to these similarity scores, we classified the TPG-NG pairs as shown in Table 2.2. Next, to explore the relationships of a TPG and its cognate gene with miRNA targeting, we obtained MTIs from miRTarBase [91], a database that curates experimentally validated miRNA-target interactions. Where no experimental data supported a TPG and its cognate gene, we used the eSTI approach to predict the MTIs.


Table 2.1 Summary of public deep sequencing data from various sRNA libraries and gene expression profiles.

Experiment | GEO ID | Reads Count/Samples
Human embryo stem cell-hB* | - | 5,026,203
Human embryo stem cell-hESC* | - | 5,031,920
Human embryo stem cell-hues6 | GSM339994 | 13,869
Human embryo stem cell-hues6NP | GSM339995 | 11,883
Human embryo stem cell-hues6Neuron | GSM339996 | 1,786
HBV(+) Adjacent Tissue Sample 1 | GSM531980 | 358,994
HBV(+) Adjacent Tissue Sample 2 | GSM531983 | 652,641
HBV(+) Distal Tissue Sample 1 | GSM531979 | 372,454
HBV(+) HCC Tissue Sample 1 | GSM531982 | 372,454
HBV(+) HCC Tissue Sample 2 | GSM531984 | 423,338
HBV-infected Liver Tissue | GSM531977 | 354,524
HBV(+) Side Tissue Sample 1 | GSM531981 | 315,838
HCV(+) Adjacent Tissue Sample | GSM531985 | 2,503,533
HCV(+) HCC Tissue Sample | GSM531986 | 2,543,980
HBV(-) HCV(-) Adjacent Tissue Sample | GSM531987 | 369,926
HBV(-) HCV(-) HCC Tissue Sample | GSM531988 | 618,689
Human Normal Liver Tissue Sample 1 | GSM531974 | 328,915
Human Normal Liver Tissue Sample 2 | GSM531975 | 282,476
Human Normal Liver Tissue Sample 3 | GSM531976 | 238,522
Severe Chronic Hepatitis B Liver Tissue | GSM531978 | 259,727
Gene expression profiles of 79 human physiologically normal tissues | GDS596 | 158
Gene expression profiles of tumor and adjacent non-tumor tissues from colon, liver, lung, oesophagal and thyroid cancers | GSE5364 | 341
Gene expression profiles of Huh-7 cells, normal liver and cancer tissues | GSE6222 | 13


Figure 2.2 Identification of miRNA-target interactions of TPG and its cognate gene.

Table 2.2 Classification of TPG-NG pairs.

Class | Description | Definition
1 | TPG highly similar to NG | TPG-similarity ≥ 0.8 & NG-similarity ≥ 0.8
2 | TPG similar to NG | 0.5 ≤ TPG-similarity < 0.8 & 0.5 ≤ NG-similarity < 0.8
3 | TPG highly similar to partial NG | TPG-similarity ≥ 0.8 & NG-similarity < 0.5
4 | NG highly similar to partial TPG | TPG-similarity < 0.5 & NG-similarity ≥ 0.8
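The similarity scores defined in Section 2.3 and the thresholds of Table 2.2 can be expressed as a short sketch. The function name is hypothetical, and returning `None` for pairs that fall outside the four listed classes is an assumption, since the table does not say how such pairs are handled.

```python
def classify_pair(alignment_len, tpg_len, ng_len):
    """Assign a TPG-NG pair to a class of Table 2.2, or None if no class fits."""
    tpg_sim = alignment_len / tpg_len  # alignment-length / TPG-length
    ng_sim = alignment_len / ng_len    # alignment-length / NG-length
    if tpg_sim >= 0.8 and ng_sim >= 0.8:
        return 1  # TPG highly similar to NG
    if 0.5 <= tpg_sim < 0.8 and 0.5 <= ng_sim < 0.8:
        return 2  # TPG similar to NG
    if tpg_sim >= 0.8 and ng_sim < 0.5:
        return 3  # TPG highly similar to part of the NG
    if tpg_sim < 0.5 and ng_sim >= 0.8:
        return 4  # NG highly similar to part of the TPG
    return None   # combination not covered by Table 2.2


# Hypothetical lengths in nucleotides.
print(classify_pair(900, 1000, 1000))  # both similarities are 0.9
print(classify_pair(450, 500, 1500))   # TPG-similarity 0.9, NG-similarity 0.3
```

Note that some combinations, for example a TPG-similarity of 0.9 with an NG-similarity of 0.6, match none of the four definitions as written.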


2.4 Gene expression analysis

The mRNA abundances of TPGs and protein-coding genes were obtained from the Gene Expression Omnibus [88]: GDS596, covering 79 physiologically normal human tissues [89]; GSE2109, covering 2158 samples from 61 tumor tissues; GSE3526, covering 353 samples from 65 normal tissues [35]; and GSE5364, covering primary human tumors and adjacent non-tumor tissues, which includes 270 tumors and 71 normal-cancer pairs from patients with breast, colon, liver, lung, oesophageal and thyroid cancers [90]. In addition, the Pearson correlation coefficient was computed between TPGs and protein-coding genes.
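As a reminder of what the Pearson correlation coefficient measures, the following minimal sketch computes it for two expression vectors measured over the same samples. The function name and the example values are hypothetical; the thesis does not state which software was used for this calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 intensities of a TPG and a gene across four samples.
tpg = [7.1, 8.4, 6.9, 9.0]
gene = [5.2, 4.1, 5.5, 3.8]
print(round(pearson(tpg, gene), 3))
```

A coefficient near -1 for a TPG and a predicted target would be consistent with, though not proof of, a repressive relationship.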

2.5 GO and KEGG enrichment analyses

The functions of target genes were examined by GO term and KEGG pathway enrichment annotation [92] using the DAVID gene annotation scheme [93]. DAVID is a biological knowledgebase that integrates up to 60 published resources for functional annotation of large gene/protein lists. In the GO term and KEGG pathway enrichment analysis of esiRNA targets with DAVID, Fisher’s exact P-values were calculated to ensure that the enrichments are statistically significant rather than random occurrences. The biological meanings associated with esiRNA-mediated genes allow investigators to understand the molecular functions of these esiRNAs comprehensively.
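The idea behind an enrichment P-value can be sketched with a plain one-sided Fisher's exact test, written here as a hypergeometric tail probability. This is an illustrative assumption about the form of the test and not a reproduction of DAVID's own computation; the function name, parameter names and gene counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact P-value for over-representation.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the target list, k: target genes annotated with the term.
    Returns P(X >= k) for X drawn from the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 8 of 50 target genes carry a term that annotates
# 200 of 20,000 background genes.
print(fisher_enrichment_p(8, 50, 200, 20000))
```

The smaller the returned value, the less likely it is that the observed overlap arose by chance under random sampling from the background.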

2.6 Construction of pseudoMap

To enable the systematic compilation and updating of these results and additional information, a database, pseudoMap, was developed to capture various types of information, including sequence data, TPG and cognate gene annotation, deep sequencing data, RNA-folding structure, gene expression profiles, miRNA annotation and target prediction. In pseudoMap, various databases and tools (Table 2.3) for mining potential regulators and functions of human TPGs are integrated and maintained with the MySQL (http://www.mysql.com/) relational database management system. pseudoMap runs on an Apache HTTP server (http://www.apache.org/) with PHP (http://www.php.net/) on a Linux operating system (http://www.linux.com/) and was constructed using the Smarty template engine (http://www.smary.net/). Based on PHP, JavaScript (http://www.javascriptsource.com/), CSS (http://www.w3schools.com/css/) and HTML (http://www.w3schools.com/html/), the web interface enables dynamic MySQL queries with user-friendly graphics.

2.7 Bio-experiments

In accordance with the predicted distinct esiRNAs, RNA structure folding, multiple target sites, gene expression profiles and the functions of both the cognate and target genes, ψPPM1K, a partial retrotranscript from PPM1K (protein phosphatase, Mg2+/Mn2+ dependent, 1K), was selected first for detailed model studies. The experiments are described in the following sections.

2.7.1 Samples

Resected primary HCC and nearby non-cancerous tissue samples (n=41) were obtained from 41 patients at the Changhua Christian Hospital. The tumor tissues were composed of 90-100% tumor cells and frozen immediately after surgical resection, then stored in liquid nitrogen until extraction of either RNA or DNA. All studies were approved by the Institutional Review Board of Changhua Christian Hospital.


Table 2.3 Supported databases and tools in pseudoMap.

Integrated database or tools | Dataset | Description
miRBase [79, 94] | miRNA annotation | This database not only provides published miRNA sequences and annotations but also supplies known/predicted targets.
fRNAdb [62] | sRNA annotation | A database to support mining and annotation of functional RNAs.
Ensembl Genome Browser [95] | Pseudogene, protein-coding gene | It produces genome databases for vertebrates and other eukaryotic species.
UCSC Genome Browser [80] | Conserved region; genomic view of genes | This browser provides a rapid and reliable display of any requested portion of genomes at any scale, together with dozens of aligned annotation tracks.
GeneCards [96] | Gene annotation | GeneCards is a searchable, integrated database of human genes that provides concise genomic related information on all known and predicted human genes.
Mfold [77] | RNA folding tool | Folding RNA structure
GEO [88] | Gene expression profiles and deep sequencing data | A public functional genomics data
BLAST [97] | Sequence alignment tool | BLAST finds regions of similarity between biological sequences.

2.7.2 Cell culture

Human hepatoma Huh-7 and HepG2 cells were grown using standard procedures for all experiments. Cells were maintained in DMEM supplemented with 10% FBS, 2 mM glutamine, and antibiotics (100 units/ml penicillin and 100 μg/ml streptomycin) at 37°C in a humidified incubator with 5% CO2.


2.7.3 RNA isolation, reverse transcription and real-time quantitative PCR (RT-qPCR) analysis

RNA isolation from specimens or cultured cells and reverse transcription were performed as described [98, 99]. RT-qPCR analysis of PPM1K and ψPPM1K in HepG2 and Huh-7 cells, and of PPM1K, NEK8, TBRG1 and BMPR2 in ψPPM1K-expressing Huh-7/HepG2 stable cell lines, was performed using SYBR Green with the ABI 7500 Real-Time PCR System (Applied Biosystems). RT-qPCR of precursor esiRNA1 (24-144 nt), precursor esiRNA2 (170-273 nt), PPM1K, and NEK8 in paired HCC tumor and non-tumor tissues was performed using a LightCycler 480 (Roche, Mannheim, Germany) with a primer/probe system. The specific primer/probe sets are shown in Table 2.4. All RNA expression levels were normalized to GAPDH (glyceraldehyde-3-phosphate dehydrogenase) RNA with the ΔCt method according to Liu et al. [100].
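The normalization to GAPDH can be illustrated with a minimal sketch of the ΔCt calculation. The conversion of ΔCt to a relative level as 2^(-ΔCt), which assumes perfect doubling per cycle, is the common convention and is an assumption here, since the exact formula of the cited method is not restated in this section. The function name and the Ct values are hypothetical.

```python
def relative_expression(ct_target, ct_reference):
    """Relative expression of a target normalized to a reference gene.

    delta_ct = Ct(target) - Ct(reference); the level is 2 ** (-delta_ct),
    assuming 100% amplification efficiency for both amplicons.
    """
    delta_ct = ct_target - ct_reference
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values: target detected 5 cycles later than GAPDH.
print(relative_expression(ct_target=25.0, ct_reference=20.0))
```

A target that crosses the threshold five cycles after the reference is therefore estimated at 1/32 of the reference level under these assumptions.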

RT-PCR of mature esiRNA1 levels in Huh-7 stable cell lines was performed using a TaqMan MicroRNA Assay designed for esiRNA1 according to the manufacturer’s instructions (Applied Biosystems) following isolation of small RNA with the mirVana miRNA Isolation Kit. U6 small nuclear RNA was used as an internal control.

2.7.1 Northern blot of pseudogene-derived esiRNAs

Northern blotting was performed according to a previous study [101] with minor modifications. Briefly, 10 μg of total RNA from the human hepatoma cell line Huh-7 was dissolved in loading buffer (50 mM EDTA, 8 M urea, 20% formamide, xylene cyanol), loaded onto a 2% agarose gel, and run for 1.5 h at 120 V at room temperature. The biotin-labeled esiRNA probes (5′-GTGGCACGCGCCTGTAGTCCCAGC-3′ for esiRNA1 and 5′-GAGGCAGGAGAATGGCGTGAACC-3′ for esiRNA2; Genomics BioSci & Tech Co., Taipei, Taiwan) were used as the positive control for the avidin-biotin reaction and as the size control for esiRNAs. The agarose gel was incubated sequentially in 0.05 M NaOH/NaCl, 0.05 M Tris/NaCl and 2x sodium citrate. RNA was then transferred to a nitrocellulose membrane (Pall Corporation, East Hills, NY, USA) and cross-linked with 254-nm UV radiation. The membrane was hybridized with the biotinylated esiRNA probes overnight and then washed sequentially with 2x SSC/0.1% SDS, 1x SSC/0.1% SDS and 0.5x SSC/0.1% SDS at 42°C. The membrane was incubated with horseradish peroxidase (HRP)-conjugated avidin (Biolegend, San Diego, CA, USA), and the probe was detected by chemiluminescence with the WesternBright ECL kit (Advansta, Menlo Park, CA, USA).

Table 2.4. The sequences of the probes and primers used in RQ-PCR/RT-PCR.

Gene | Probe seq. | Primer (5’→3’)
siRNA1 (24-144 nt) | CTCTGCCT | F: GGAGTACAGTGGTGCGGTCT; R: GCTGAGGCAGAAGAATCGTT
siRNA2 (170-273 nt) | GGCTGGAG | F: GAGAAGGAGTCTCTCTCTGTCACC; R: CAGTGAGCCGAGATCGTG
PPM1K | TGGGGCAG | F: TGACCATTGACCATACTCCAGA; R: CAAGCCTGCCATTTACGTG
NEK8 (for tissues) | CTGGGGCC | F: TGTCCACTGAGCGAGAACTATT; R: TCTGACCCCGATCCGTGG
GAPDH | TGGGGAAG | F: AGCCACATCGCTCAGACAC; R: GCCCAATACGACCAAATCC
TPG_PPM1K | - | F: GGAATTCCTCCATCAGCTGTTCGTTTG; R: GCTCTAGATGGCAAAACCCCATCTCTAC
NEK8 (for cell) | - | F: TCCACTGAGCGAGAACTATTTGC; R: GGATCATGGAGGAATCGATACC
TBRG1 | - | F: CCGTGGGCTATTGCAGTACTC; R: AAGAGCTGACAATGGCATTCTG
BMPR2 | - | F: GGCCATCAAAGCCCAGAAG; R: CTGATCCTGATTTGCCATCTTG


2.7.2 Fluorescent in situ hybridization (FISH)

Nocodazole and colchicine were added to cell lines before FISH was performed. Interphase and metaphase spreads were prepared for FISH using standard methods [102]. DNA probes (S1: GTGGCACGCGCCTGTAGTCCCAGC, antisense of esiRNA1; S2: GAGGCAGGAGAATGGCGTGAACC, antisense of esiRNA2; scramble 1: GTGGCTCATGCCTGTAATCCCAGCACTTTG; and scramble 2: TTAAGACATACAAAGATCTGGCCAGGTGCG) were mixed with hybridization buffer, centrifuged, and heated to 73°C for 5 min in a water bath. Slides were immersed in 70% formamide/2× standard saline citrate for 5 min at 73°C, then dehydrated, dried, and hybridized with the probe mix in a 42°C incubator for 30 min. Slides were then washed in 0.4× standard saline citrate/0.3% NP-40 for 2 min, air dried in the dark, and counterstained with DAPI (4′,6-diamidino-2-phenylindole) (1 μg/ml, Abbott, Illinois, USA). Imaging was performed on a Nikon E600 microscope with CytoVision software.

2.7.3 Transduction of the pseudogene transcript in Huh-7 and HepG2 stable cell lines

TPG-expressing (and vector control) Huh-7/HepG2 stable cell lines were established by G418 selection after transfection with the ψPPM1K-expression plasmid or blank vector. Total RNA was isolated from cells and subjected to RT-PCR analysis to amplify ψPPM1K mRNA (primers shown in Table 2.4). The PCR was performed with a denaturing step at 95°C for 2 min, then 30 cycles of 30 s at 95°C, 1 min at 60°C and 1 min at 72°C, followed by a final 7 min at 72°C.


2.7.4 Cell proliferation assay

To investigate the proliferation of ψPPM1K-expressing Huh-7 stable cell lines, 2.5 × 10^4 cells were plated in each well of a 12-well plate. Cells were trypsinized and counted with a hemocytometer every day until day 4. Each experiment was repeated twice, each time in triplicate wells. Huh-7 TPG7 cells were transfected with the NEK8-overexpressing plasmid or empty vector using Lipofectamine 2000 (Invitrogen, Carlsbad, CA, USA). The NEK8-overexpressing plasmid, which contains the human NEK8 ORF without the 3’-UTR in the pCMV6-Entry vector, was obtained from Origene (Rockville, MD, USA). Twenty-four hours after transfection, 2.5 × 10^4 cells were plated in each well of a 12-well plate. Cells were trypsinized and counted with a hemocytometer every day until day 4.

2.7.5 Clonogenic activity

To determine clonogenic activity, we plated 1000 cells each of mock2, TPG1, TPG2 and TPG7 in 10 ml growth medium in 100 mm dishes. Soft agar culture was also performed by inoculating 500 cells/ml in 0.3% agar-growth medium over 0.5% agar-growth medium in 6-well culture dishes. The dishes were incubated under normoxic (19% O2) and hypoxic (3% O2) conditions in 5% CO2 incubators for 12 days and then fixed and stained for counting colony formation. We also took serial photographs of the same colonies at day 5, day 7 and day 9 to visualize the two-dimensional growth of mock2 and the three transfected clones.

2.7.6 Transfection of synthetic siRNA1 into Huh-7 cells

siRNA1 was chemically synthesized by Invitrogen (Carlsbad, CA, USA). Oligonucleotides were annealed before use in annealing buffer containing 100 mM potassium acetate, 30 mM Hepes-KOH (pH 7.4), and 2 mM magnesium acetate. Negative Control #1 siRNA was also obtained from Invitrogen. Huh-7 cells in 6 cm culture plates were transfected with 200 pmol siRNA using 10 μl Lipofectamine 2000 according to the manufacturer’s instructions.

2.7.7 Construction of the esiRNA1-deleted ψPPM1K-expressing plasmid

To delete the esiRNA1 sequence from ψPPM1K, the overlap extension method of PCR-based mutagenesis was used. First, two complementary mutagenic primers, forward 5’-CCTCAGCCTCCTGAGTACACCCCTGGCTAATTTT-3’ and reverse 5’-AAAATTAGCCAGGGGTGTACTCAGGAGGCTGAGG-3’, were synthesized. Two PCRs, using the mutagenic forward primer/outer ψPPM1K reverse primer pair and the mutagenic reverse primer/outer ψPPM1K forward primer pair, were performed to amplify the right and left ψPPM1K fragments, respectively. The two fragments were then mixed and further amplified using the outer ψPPM1K primers to generate the esiRNA1-deleted ψPPM1K fragment. Finally, this fragment was inserted between the EcoRI and XbaI sites of the pCI-neo vector to generate the esiRNA1-deleted ψPPM1K-expressing plasmid.
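Overlap extension requires that the two mutagenic primers be exact reverse complements of each other, so that the left and right fragments can anneal in the final amplification. The short check below confirms this property for the two primer sequences quoted above; the helper name is hypothetical.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Mutagenic primers as quoted in the text (5' to 3').
FORWARD = "CCTCAGCCTCCTGAGTACACCCCTGGCTAATTTT"
REVERSE = "AAAATTAGCCAGGGGTGTACTCAGGAGGCTGAGG"

print(reverse_complement(FORWARD) == REVERSE)  # prints True
```

The same helper can be used to check any other primer or probe pair that is expected to anneal over its full length.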

2.7.8 Mitochondrial activities

For indirect assay of mitochondrial membrane potential and permeability transition pore activity [103], overnight-plated monolayer cells on 100 mm dishes were exposed to 0.5 μg/ml or 1.0 μg/ml Rh123 in the growth medium [104]. The kinetics of dye uptake were determined by harvesting the cells after 10 min, 30 min, 5 h and 24 h incubation in Rh123-containing media. To determine the Rh123 retention activity of cancer cells [103-105], monolayer cells exposed to Rh123-containing medium for 30 min were rinsed 3x with Hanks’ balanced salt solution (HBSS) to remove the dye, and replenished with fresh medium for a further 18 h incubation before harvest. Cells harvested by trypsinization were washed 2x with cold PBS and collected by centrifugation at 300 × g at 4°C for 5 min. With or without further reaction with fluorescent monoclonal antibody against cell surface markers, e.g. CD133/Prominin (BD Pharmingen), the doubly- or triply-labeled washed cells were analyzed in a fluorocytometer.

2.7.9 miRNA-mediated knockdown of PPM1K and ψPPM1K

The stable negative control (siCon: 5’FAM-UUC UCC GAA CGU GUC ACG UTT), hsa-miR-650 (miR-650: 5’-AGG AGG CAG CGC UCU CAG GAC) and hsa-miR-3174 (miR-3174: 5’-UAG UGA GUU AGA GAU GCA GAG CC-3’) miRNAs were purchased from GeneDireX, Inc. (Las Vegas, NV, USA) and transfected into cells with Lipofectamine™ RNAiMAX (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s protocol. The efficacy of mRNA knockdown 48 h after miRNA transfection was determined by RT-qPCR.

2.7.10 Statistical analysis

Student’s t-test was used for analysis of the cell assays. Significance was accepted at P-value < 0.05.


Chapter 3 Results

In this chapter, we describe the results of the computational analyses, followed by the experimental results. Details are given in the following sections.

3.1 Overview of TPGs

To test the hypothesis that human pseudogenes may generate esiRNAs that regulate protein-coding genes, we developed a computational pipeline (Figure 2.1). A total of 16,524 genes, comprising 15,003 pseudogenes and 1,521 processed transcripts, were collected from BioMart, which is integrated in the Ensembl Genome Browser, by filtering on the gene type “pseudogene” or “pseudogene-related gene”. The percentage of pseudogenes on each chromosome is similar to that of protein-coding genes (Figure 3.1). Of these pseudogenes, 1,404 are detectable by the Affymetrix Human Genome U133A/U133Plus2 microarray, and are thus considered to be transcribed pseudogenes (TPGs). We then examined the gene expression profiles (GDS596) of TPGs in 79 normal human tissues as log2 average intensities. Hierarchical clustering showed that most TPGs were highly expressed in 19 tissues, especially in blood-related cells, where expression levels were ≥ 4-fold higher than average (Figure 3.2). Another assessment of gene expression in six different tissues (GSE5364) showed that TPGs were often expressed differently in tumor material compared to paired non-tumor tissue (Figure 3.3a). In particular, most TPGs in HCC samples were under-expressed in comparison to normal tissues (GSE6222) (Figure 3.3b).

3.2 TPGs vs. NGs

According to the sequence mapping results, a total of 1313 TPG-NG pairs were obtained. Of these, 1004, 179 and 130 TPGs were classified as highly similar (TPG-similarity≥0.8), similar (0.5<TPG-similarity<0.8) and of low similarity (TPG-similarity≤0.5), respectively (Figure 3.4). To further understand the biological functions of these TPGs, enriched GO terms and KEGG pathways of the TPG-paired NGs were annotated using DAVID. The enriched KEGG pathways and GO terms with P-values smaller than 0.001 are presented in Figure 3.5. The GO terms in the three groups of TPG-paired NGs are alike: they involve protein binding, nucleic acid binding and nucleotide binding among the molecular function terms, and cellular metabolic process, biosynthetic process and cellular component assembly among the biological process terms (Figure 3.5). Of 1000 TPG-NG pairs on the Affymetrix U133A chip, 288 TPGs share probesets with their paired NGs, 264 paired NGs have no probesets, and 448 TPGs have unique probesets. On the Affymetrix U133Plus2 chip, 236 TPGs share probesets with their paired NGs, 262 paired NGs have no probesets, and 619 TPGs have unique probesets. To examine the expression profiles of TPGs and paired NGs, the intensity of each TPG was referenced to that of its paired NG as a log2 ratio in a number of public gene expression profiles. Hierarchical clustering showed that about half of the TPGs (200/448) were more highly expressed than their paired NGs in 79 normal tissues (Figure 3.6). Another assessment of gene expression in 61 tumor tissues (GSE2109) showed that nearly half of the TPGs (250/619) were more highly expressed than their paired NGs (Figure 3.7).
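The log2 ratio used to reference a TPG to its paired NG can be sketched as follows. The function names and intensities are hypothetical, and counting a TPG as "more highly expressed" whenever the ratio is positive is an assumption, since the threshold applied in the clustering is not stated here.

```python
import math


def log2_ratio(tpg_intensity, ng_intensity):
    """log2 of the TPG intensity relative to its paired NG intensity."""
    return math.log2(tpg_intensity / ng_intensity)


def count_tpg_higher(pairs):
    """Count pairs whose TPG intensity exceeds that of the paired NG."""
    return sum(1 for tpg, ng in pairs if log2_ratio(tpg, ng) > 0)


# Hypothetical (TPG, NG) intensities for three pairs in one tissue.
pairs = [(800.0, 200.0), (100.0, 400.0), (300.0, 300.0)]
print([log2_ratio(t, n) for t, n in pairs])
print(count_tpg_higher(pairs))
```

A ratio of 2 means the TPG signal is four times that of its paired NG, 0 means equal signal, and negative values mean the NG is the more abundant of the two.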


Figure 3.3 Gene expression profiles of TPGs. (a) Expression heat-map of TPGs in pair of cancers compared to non-malignant tissues. (b) TPGs expression profiles in the tumor tissues/cells referenced by normal liver tissues.


Figure 3.6 Gene expression profiles of TPGs referenced by its paired NG in 79 human physiologically normal tissues.



