以系統化方法辨識出能用於標靶藥物輸送的癌症專一性細胞膜受體

(1)

國立交通大學

生物資訊及系統生物研究所

碩士論文

以系統化方法辨識出能用於標靶藥物輸送的癌症專一性細

胞膜受體

A Systematic Method for Identifying Cancer-Specific

Membrane Receptors for Targeted Drug Delivery

研究生：黃煒志

指導教授：黃憲達博士

(2)

以系統化方法辨識出能用於標靶藥物輸送的癌症專一

性細胞膜受體

A Systematic Method for Identifying Cancer-Specific

Membrane Receptors for Targeted Drug Delivery

研究生：黃煒志

Student: Wei-Chih Huang

指導教授：黃憲達博士

Advisor: Dr. Hsien-Da Huang

國立交通大學

生物資訊及系統生物研究所

碩士論文

A Thesis

Submitted to Institute of Bioinformatics and Systems Biology

College of Biological Science and Technology

National Chiao Tung University

in partial Fulfillment of the Requirements

for the Degree of

Master

in

Bioinformatics

July 2010

Hsinchu, Taiwan, Republic of China

(3)

以系統化方法辨識出能用於標靶藥物輸送的癌

症專一性

症專一性細胞膜

細胞膜

細胞膜受體

細胞膜

受體

學生：黃煒志指導教授：黃憲達博士國立交通大學生物資訊及系統生物研究所碩士班

中文摘要

癌症是世界上主要死亡原因之一，目前已研發出相當多種類的抗癌藥作為

治療手段。但是傳統的抗癌藥無法準確輸送到癌症患處，身體上其他正常

的組織會因為抗癌藥的毒性而受到傷害。抗癌藥的副作用既會影響病人的

生活品質，也降低了治療效果，故本研究的目標即為如何將抗癌藥準確送

至癌細胞。現在的新方法是可將抗癌藥包裹在免疫微脂體中

(immuno-liposome)，免疫微脂體表面的特定抗體會因為抗體-抗原親和力

(antigen-antibody affinity)的反應而與癌細胞表面的目標蛋白質結合，達到抗

癌藥對癌細胞的專一性輸送，然後抗癌藥會經由內吞作用進入癌細胞內。

我們以一系統化的方法來分析去氧核醣核酸微陣列(DNA microarray)資

料，以期能找出癌症專一性細胞膜受體(cancer-specific membrane receptor,

CSMR)作為藥物輸送的目標。我們比較了癌症專一性細胞膜受體在癌症組

織與在人體正常組織的 mRNA 表現量，希望能找出在癌症組織中為高表

現、在大部分正常組織中為低表現的癌症專一性細胞膜受體。除此之外，

我們也比較了癌症專一性細胞膜受體於數種癌症細胞株中的表現量。若癌

症專一性細胞膜受體在某癌症細胞株中的表現量與在同類癌症組織中的表

現量相似，該癌症細胞株則建議可用作實驗證明。

(4)

A Systematic Method for Identifying

Cancer-Specific Membrane Receptors for

Targeted Drug Delivery

Student: Wei-Chih Huang Advisor: Dr. Hsien-Da Huang Institute of Bioinformatics and Systems Biology,

National Chiao Tung University

Abstract

Cancer is a leading cause of death worldwide. Lots of anti-cancer drugs have

been invented for cancer therapeutics. Traditional anti-cancer drugs with serious

cytotoxicity would cause injuries of normal tissues owing to the incorrect drug

delivery. The side-effects of anti-cancer drugs would not only debase the life

quality of patients but decrease the therapeutic efficacy. The purpose of this

study is to find a way to accurately deliver anti-cancer drugs to cancer cells.

Currently, anti-cancer drugs encapsulated in the immunoliposome would be

specifically transported to cancer cells by the mechanism of antigen-antibody

affinity between the coated antibody of immunoliposome and its target protein

on the surface of cancer cell and then enter the cancer cell by endocytosis. We

propose a systematic method to identify cancer-specific membrane receptors

(CSMRs) as the delivery target by analyzing DNA microarray data. We also

compare the expression level of each CSMR in cancer tissue to that in normal

tissues of body, and expect to identify the CSMR with high expression level in

cancer but generally low expression level in all normal tissues. Besides, we

compare the expression level of each CSMR in several cancer cell-lines. If the

CSMR expression levels of cancer cell-lines are similar to that in the same type

of cancer tissue, the CSMR is recommended for experimental verification.

(5)

誌謝

在碩士班生活的這段日子裡，很高興能夠身處於一個很有活力的實驗室。

和善的學長們總是會認真回答學弟的問題，和同儕們也相處地很自在。途

中也有調適不當而使情緒陷入谷底的時候，不過很感謝實驗室各位同仁與

老師的包容，以及家人的關心和支持。

非常感謝豐茂學長給予的協助，不論是在實作過程或是論文寫作上的指導

修正。也深深感謝指導教授黃憲達老師在實作過程中提供的意見以及對於

研究的看法，還有廖光文老師在討論時根據實作情形提出的建議，使得這

份研究能夠更上一層樓。

林林總總的事情多不勝數，一併感謝天。

(6)

中文

中文摘要

摘要

摘要 ... i

Abstract ...ii

誌謝

誌謝 ...iii

Contents... iv

List of Tables ... vi

List of Figures ...vii

List of Abbreviations ...viii

1. Introduction... 1

1.1 Cancer ... 1

1.1.1 Brief Introduction of Cancer ... 1

1.1.2 Highly-Expressed Membrane Receptors in Cancer ... 1

1.2 Microarray... 2

1.3 Targeted Drug Delivery ... 5

1.3.1 Passive and Active Targeted Drug Delivery ... 5

1.3.2 Liposome ... 6

1.4 Motivation and Specific Aims ... 7

2. Related Studies ... 9

2.1 Integrin targeting ... 9

2.2 Folate receptor targeting ... 9

2.3 Transferrin receptor targeting... 9

3.1 Microarray Data Collection ... 11

3.1.1 Gene Expression Omnibus ... 11

3.1.2 Lung Cancer Microarray Samples... 14

3.2 Genomic Annotation ... 14

3.3 Membrane Receptor Classification... 15

3.4 Proteomic Knowledge ... 15

3.5 Pathway Information of Cellular Processes of Genes ... 16

3.6 Antibody Manufacturers... 16

4. Method ... 17

4.1 Flowchart Overview... 17

4.2 Data Preprocessing ... 17

4.2.1 Sample Classification... 17

4.2.2 Microarray Data Normalization... 18

4.2.2.1 Robust Multi-array Analysis... 18

4.2.2.2 Data Normalization... 18

4.2.3 Classification of Membrane Receptors ... 19

4.3 Microarray Data Analysis ... 19

4.3.1 Significance Analysis of Microarrays... 20

4.3.2 False Discovery Rate... 22

(7)

4.3.5 Comparing Gene Expression of Cancer-Specific Membrane

Receptors between Cancer Tissue and Normal Tissues over Human

Body... 25

5. Results ... 27

5.1 Statistics of identified CSMRs ... 27

5.2 Case Study: Identifying CSMRs in Lung Adenocarcinoma ... 30

5.3 Web Interface ... 60

6. Discussions ... 62

6.1 Inconsistent expressions for probesets of the same gene... 62

6.2 Results with few or zero identified CSMRs... 62

6.3 Defining membrane receptors by GO terms ... 62

6.4 Tissue index calculation... 63

6.5 CSMR verification ... 63

7. Conclusions ... 65

7.1 Conclusions ... 65

7.2 Future works ... 65

8. References ... 66

9. Appendix ... 70

(8)

List of Tables

Table 1.1 Highly-expressed membrane receptors in cancer... 2

Table 3.1 Statistics of analyzed microarray data of two platforms ... 13

Table 3.2 Statistics of analyzed microarray data... 13

Table 3.3 Microarray data of cancer cell-lines ... 13

Table 3.4 Microarray data of normal human tissues... 14

Table 3.5 Membrane receptors defined by GO terms ... 15

Table 5.1 Statistics of identified CSMRs ... 27

Table 5.2 Analyzed datasets of lung adenocarcinoma... 31

Table 5.3 Identified CSMRs in lung adenocarcinoma dataset, GSE10799,

not disseminate into bone marrow ... 32

Table 5.4 Identified CSMRs in lung cancer dataset, LCH ... 32

Table 5.5 Identified CSMRs in lung adenocarcinoma dataset, GSE7670.... 33

Table 5.6 Identified CSMRs in lung adenocarcinoma dataset, GSE10072.. 34

Table 5.7 Comparisons of identified CSMRs from different datasets ... 34

Table 5.8 Examples of selected CSMR ... 36

Table 5.9 Classification of CSMRs by tissue index ... 37

Table 5.10 Corresponding antibodies of CSMRs ... 60

Table 5.11 Statistics of related literatures of CSMRs in UniProtKB... 60

Table 9.1 NCI60 Cancer cell-lines, 9 tissues (HG-U133A) ... 70

Table 9.2 Cancer cell-lines, 6 tissues (HG-U133A Plus 2.0)... 71

Table 9.3 74 Normal tissues (HG-U133A) ... 74

Table 9.4 65 Normal tissues (HG-U133A Plus 2.0)... 75

Table 9.5 Analyzed datasets and samples (HG-U133A)... 77

(9)

List of Figures

Figure 1.1 Hybridization principle of microarray ... 3

Figure 1.2 Affymetrix GeneChip array... 4

Figure 1.3 Graphics of targeted delivery ... 5

Figure 3.1 Statistics of classified cancer datasets ... 12

Figure 3.2 Statistics of microarray samples... 12

Figure 4.1 Workflow for identifying cancer-specific membrane receptors . 17

Figure 4.2 Classification of analyzed microarray datasets ... 18

Figure 4.3 SAM plot ... 22

Figure 4.4 Graphics of definition of GO: Cellular Component... 24

Figure 4.5 Graphics of tissue index ... 26

Figure 5.1 Elucidation of CSMR expression plot... 38

Figure 5.2 Expression of CLDN3 (203953_s_at) in dataset LCH ... 39

Figure 5.3 Expression of CLDN3 (203954_x_at) in dataset LCH ... 40

Figure 5.4 Expression of CLDN4 (201428_at) in dataset LCH ... 41

Figure 5.5 Expression of DDR1 (1007_s_at) in dataset LCH... 42

Figure 5.6 Expression of DDR1 (207169_x_at) in dataset LCH ... 43

Figure 5.7 Expression of DDR1 (208779_x_at) in dataset LCH ... 44

Figure 5.8 Expression of DDR1 (210749_x_at) in dataset LCH ... 45

Figure 5.9 Expression of EFNA4 (205107_s_at) in dataset LCH ... 46

Figure 5.10 Expression of EGFR (201983_s_at) in dataset LCH ... 47

Figure 5.11 Expression of ERBB3 (226213_at) in dataset LCH ... 48

Figure 5.12 Expression of F2RL1 (213506_at) in dataset LCH... 49

Figure 5.13 Expression of GPR110 (238689_at) in dataset LCH... 50

Figure 5.14 Expression of IGHG1 (211908_x_at) in dataset LCH ... 51

Figure 5.15 Expression of IGHM (209374_s_at) in dataset LCH... 52

Figure 5.16 Expression of IGHM (216491_x_at) in dataset LCH ... 53

Figure 5.17 Expression of LGR4 (218326_s_at) in dataset LCH... 54

Figure 5.18 Expression of LSR (208190_s_at) in dataset LCH ... 55

Figure 5.19 Expression of MET (203510_at) in dataset LCH... 56

Figure 5.20 Expression of PVRL4 (223540_at) in dataset LCH ... 57

Figure 5.21 Expression of STX1A (204729_s_at) in dataset LCH... 58

Figure 5.22 Expression of XPR1 (222581_at) in dataset LCH... 59

Figure 5.23 Web interface (1): browse from analyzed results... 61

(10)

List of Abbreviations

WHO

World Health Organization

CSMR

cancer-specific membrane receptor

NSCLC

non-small cell lung cancer

SCLC

small cell lung cancer

HNSCC

head and neck squamous cell carcinoma

DNA

deoxyribonucleic acid

RES

reticuloendothelial system

PEG

polyethyleneglycol

GEO

Gene Expression Omnibus

NCBI

National Center for Biotechnology Information

LCH

lung cancer microarray dataset, Huang

GO

Gene Ontology

HPMR database

Human Plasma Membrane Receptome database

GPCR

G protein-coupled receptor

UniProtKB

UniProt Knowledgebase

UniProt

Universal Protein Resource

KEGG

Kyoto Encyclopedia of Genes and Genomes

MAS5.0

Microarray Suite User Guide, Version 5

RMA

Robust Multiarray Analysis

FDR

false discovery rate

FWER

family-wise error rate

DEG

differentially expressed gene

(11)

1. Introduction

1.1 Cancer

1.1.1 Brief Introduction of Cancer

Cancer is a disease caused by accumulation of genetic and epigenetic

aberrations within a cell. The cause that normal cells transform to cancer cells is

ascribed to abnormal cell division. Under normal circumstances, cells would

undergo apoptosis if there is a mistake occurred in the cell division progress.

Cancer cells bypass the monitoring mechanism of apoptosis and continue to

increase their population.

According to the reports of WHO, cancer is the leading cause of death

worldwide. It accounts for about 13% of all deaths in 2004 [1]. Currently,

chemotherapy has been the main modality of cancer treatment. However, the

high doses of administration in order to destroy the tumors cannot be given to

patients because the overdose of chemotherapy agents would be fatal to patients.

To increase the efficacy of cancer treatment, researchers focus on targeted

cancer therapies: find genes which play an important role in carcinogenesis and

correct the abnormal mechanisms caused by those target genes. On the other

side, researchers are seeking for targeting agents to promote the efficiencies of

accurately delivering anti-cancer drugs to the tumor site.

1.1.2 Highly-Expressed Membrane Receptors in Cancer

(12)

a growth factor binds to the membrane receptor on the cell surface, cell receives

the signal and begins to grow. The abnormality of the membrane receptors in

cancer would cause uncontrolled cell growth. There are several membrane

receptors which are highly-expressed in cancer, such as EGFR in lung cancer

and ERBB2 in breast cancer (Table 1.1

1

) [2].

Table 1.1 Highly-expressed membrane receptors in cancer

Gene cancer

EGFR NSCLC; lung squamous cell carcinoma; mesothelioma; breast, head and neck, stomach, colon, esophageal, prostate, bladder, renal, pancreatic, ovarian carcinoma; glioblastoma

ERBB2 breast adenocarcinoma; ovarian carcinoma ERBB3 oral squamous cell and ovarian carcinoma ERBB4 oral squamous cell carcinoma

FLT3 acute myeloid leukemia

KIT gastrointestinal stromal tumor; Ewing's sarcoma; SCLC RET papillary thyroid cancer

FGFR3 multiple myeloma; bladder, cervical carcinoma

MET endocrinal tumor, osteosarcoma, invasive breast and lung cancer

IGF1R colon cancer IL6R myeloma, HNSCC IL8RA bladder cancer

PDGFRA/B osteosarcoma, glioma PRLR breast carcinoma

VEGFR neuroblastoma; prostate cancer GRPR SCLC

1.2 Microarray

DNA microarray is amenable to the analysis of multiple samples, and it

generates a large amount of gene expression data for statistical analysis. The

basic principle of microarray technology is complementary hybridization of

(13)

nucleotides (Figure 1.1

2

). After analyzing the hybridization results and obtaining

the mRNA expression levels, researchers can undergo the advanced analysis to

extract the critical genes by bioinformatic softwares.

Figure 1.1 Hybridization principle of microarray. Target DNA would hybrid to probe by

nucleotide base pairing.

GeneChip (Affymetrix, Santa Clara, CA, USA) is one of the most popular DNA

microarrays (Figure 1.2

3

). Generally, GeneChips are designed with 16 to 20,

preferably nonoverlapping 25-mers representing each gene on the array. Each

oligonucleotide on the chip is matched with an almost identical one, differing

only by a central, single-base mismatch. This mismatched oligonucleotide serves

as an internal control for hybridization specificity and allows for determination

2_{retrieved from http://www.members.cox.net/amgough/FISH_olgio_hybridization-deep01_01_03.jpg} 3

(14)

of the degree of non-specific binding by comparison of target binding intensity

between the two partner oligonucleotides.

Figure 1.2 Affymetrix GeneChip array. Hybridization between the sample DNA and

reference DNA allows detection of genetic variation.

The advent of DNA microarray technology provides a powerful tool in various

aspects of cancer research. Identifying altered genes related to cancer

development has been one of the central research questions in microarray data

analysis. DNA microarrays offer a possibility to compare the results of detailed

combinatory analysis of global expression profiles for normal and cancer cells at

separate experimental conditions. For example, Ye et al. used microarray to

identify significant genes in oral tongue squamous cell carcinoma [3] and Scotto

et al. identified over-expressed genes in cervical cancer progression [4].

(15)

1.3 Targeted Drug Delivery

1.3.1 Passive and Active Targeted Drug Delivery

Traditional anti-cancer drugs have been used for several decades; however, there

are serious toxic effects on normal cells due to incorrectly delivery of drugs. The

purpose of targeted drug delivery is to specifically transport the anti-cancer

drugs to the tumor sites to achieve a therapeutic effect. There are two kinds of

targeted drug delivery, passive targeted drug delivery and active targeted drug

delivery [5] (Figure 1.3

4

).

Figure 1.3 Graphics of targeted delivery. Active targeted delivery (Left and Middle) and

passive targeted delivery (Right)

(

Farokhzad O. C. and Langer R., 2009

)

Passive targeted drug delivery is a process that drug would reach the tumor sites

based on some biophysical properties. Cancer cell needs lots of nutrients and

4

(16)

oxygen supplied when it began to proliferate, so rapid angiogenesis occurred.

Owing to the rapid angiogenesis, there are open spaces between endothelial cells

in vasculature of tumor, that is, leaky vasculature [6-7]. The defective vascular

architecture and the loss of lymphatic drainage system [8] in cancer tissue

attribute to an enhanced permeation and retention effect (EPR effect) [9-11]. For

molecules with molecular weights larger than about 40kDa, such as liposomes,

they would accumulate more rapidly and get a longer retention in tumor than in

normal tissues by EPR effect [9,11].

The mechanism of active targeted drug delivery is based on the affinity between

antibody/ligand and target antigen. The antigen-binding sites in the light chains

of antibody recognize the epitope of antigen and bind to it. Each antibody binds

to a specific antigen, and this interaction is similar to a lock and key. For

example, antibody-conjugated liposome targets to the specific membrane

receptor on the surface of cancer cell and enter the cell by receptor-mediated

endocytosis [32,37].

1.3.2 Liposome

Liposome is an artificial microscopic vesicle consisting of an aqueous core

enclosed in one or more phospholipid layers. Liposomes have been developed

since 1965 [12], and they were suggested as drug carriers in cancer

chemotherapy by Gregoriadis et al. in 1974 [13]. When used in the delivery of

anti-cancer drugs, liposomes help to shield healthy cells from the toxicity of

(17)

(RES) limits the application of plain liposomes as drug carriers [7,14].

Liposomes were recognized as successful drug delivery carriers when it was

discovered that the polyethyleneglycol (PEG) coated liposomes had significantly

increased the circulation time [15-17]. For example, PEGylated liposomal

doxorubicin [18] (with brand names of Doxil in the US and Caelyx in Europe)

has been shown to significantly improve the therapeutic index of doxorubicin

both in preclinical [19-21] and clinical [22-24] studies.

Long circulating liposomes would accumulate significantly in tumors by EPR

effect. However, vascular permeability in tumors is heterogeneous with respect

to tumor type and tumor microenvironment. It is essential to obtain a higher

degree of liposome accumulation by active targeting. Site-specific targeted

delivery can be achieved by coating the liposomes with ligands or antibodies

that target over-expressed membrane receptors on the surface of tumor [26-27].

It can avoid the non-selective toxic effects of the carried drug at other normal

human tissues or organs. With the advantages of long circulation time, long

retention time in tumor and tumor specificity, PEGylated immunoliposome has

proven most successful [28-29].

1.4 Motivation and Specific Aims

Because of the non-specific delivery of anti-cancer drugs, parts of drugs reached

the tumor site and the side-effects owing to incorrect delivery occurred in other

normal tissues. To achieve the optimal therapeutic effects, the overall dosage

should increase but the toxicity accompanied high dosage is fatal. On the other

(18)

hand, the side-effects of anti-cancer drugs cause severe harm on patients’ body

and mind. Although practical anti-cancer drugs have been developed, the

problems of side-effects are less improved. Rather than giving analgesics to

patients undergoing chemotherapy, we should seek for new methods to increase

targeting specificity of anti-cancer drugs to tumor and thus achieve high

bioavailability with low dosage.

We propose a systematic method to identify cancer-specific membrane receptors

(CSMRs) which are differentially-expressed between cancer tissue and

corresponding normal tissue. Besides, the expression of identified CSMRs in

cancer tissue is compared with the expressions in normal tissues of human body

in order to identify ‘good’ CSMRs which are lowly-expressed in other normal

tissues of human body. Those ‘good’ CSMRs can serve as cancer targeted

antigens with a little or without the adverse effects in targeted drug delivery. In

addition, expression profiles of several cancer cell-lines are provided so that the

expression level in cancer tissue and cancer cell-lines are available for

cross-comparison and those CSMR whose expressions are similar in cancer

tissue and cancer cell-lines are recommended for biological validation.

(19)

2. Related Studies

2.1 Integrin targeting

Integrins are surface receptors that interact with the extracellular matrix and

regulate the cell signaling. Integrins are highly-expressed in the new blood

vessels during the tumor angiogenesis, and targeting to integrins by small

peptides sequences selected from phage display library was investigated [30].

Doxorubicin encapsulated within PEG-liposomes that conjugated with

integrin-targeting peptides could target the endothelial cells of C26 colon cancer

xenograft model and showed a better result when compared to non-targeting

PEGylated liposomes [31].

2.2 Folate receptor targeting

Folic acid is a vitamin that is essential for the biosynthesis of nucleotides and it

is especially important during periods of rapid cell division and growth. Folic

acid is a ligand with high affinity for folate receptor. Folate receptor is

over-expressed on tumor cells frequently [32]. PEGylated liposomes coated with

folate would target to cancer cells and be internalized by endocytosis [33].

Doxorubicin encapsulated within folate-coated liposomes has shown effective in

both in vitro [34] and in vivo experiments [35-36].

2.3 Transferrin receptor targeting

Transferrins are blood plasma proteins and their principal biological function is

thought to be related to iron binding properties. Transferrin receptor is a carrier

(20)

protein for transferring and is mediated by intracellular iron concentration.

Transferrin receptor 1 is over-expressed on cancer cells [37] and transferrins

may facilitate proliferation of tumor cells [38]. Doxorubicin encapsulated within

transferring-coated liposomes showed an enhanced uptake in the C6 glioma cells

via the receptor-mediated mechanism in contrast with free doxorubicin [39].

(21)

3. Material

3.1 Microarray Data Collection

3.1.1 Gene Expression Omnibus

Gene Expression Omnibus (GEO) in NCBI is currently the largest public

genomic data repository [40]. Currently, GEO preserves about half a million

microarray expression profiles which is freely to be used.

In this research, the analyzed microarray data are the expression profiles from

Affymetrix platform. The collected microarray data with raw CEL files consist

of three parts: (1) microarray datasets which contain both the cancer microarray

samples and the corresponding normal microarray samples for identifying the

cancer-specific membrane receptors, (2) microarray samples of several kinds of

cancer cell-lines for comparing the mRNA expressions of membrane receptors

in cancer tissue with those in cancer cell-lines, (3) microarray samples of normal

human tissues for comparing the mRNA expressions of membrane receptors in

cancer to those in normal human tissues for indicating the putatively occurred

side-effects in targeted drug delivery (Table 3.1 – Table 3.4; Figure 3.1, 3.2).

(22)

Figure 3.1 Statistics of classified cancer datasets. There are total fifty datasets for twenty

groups.

(23)

Table 3.1 Statistics of analyzed microarray data of two platforms

Microarray Platform Dataset Sample Affymetrix HG-U133A 21 1081 Affymetrix HG-U133A Plus 2.0 29 1402

Table 3.2 Statistics of analyzed microarray data

Cancer Type Dataset Number Sample Number Lung Cancer 4 242 Blood Cancer 8 572 Kidney Cancer 5 179 Breast Cancer 6 397 Liver Cancer 2 88 Prostate Cancer 1 19 Skin Cancer 1 87 Neural Cancer 1 25 Esophageal Cancer 1 24 Nasopharyngeal Cancer 1 41 Ovarian Cancer 3 157 Gastrointestinal Cancer 2 81 Colorectal Cancer 3 83 Bladder Cancer 2 72 Testicular Cancer 1 107 Cervical Cancer 3* 135 Endocrine Tumor 2 32 Head and Neck Cancer 4* 126

Other 1 16

*

Dataset GSE6791 are classified into two cancer types

Table 3.3 Microarray data of cancer cell-lines

Data Source Dataset ID Platform Cancer Type Sample GEO GSE5720 Affymetrix HG-U133A 9 60 GEO GSE10843 Affymetrix HG-U133A

Plus 2.0

(24)

Table 3.4 Microarray data of normal human tissues

Data Source Dataset ID Platform Tissue Type Sample GEO GSE1133 Affymetrix HG-U133A 73* 146* GEO GSE3526 Affymetrix HG-U133A

Plus 2.0

65 353

*

Remove the samples not belong to normal tissues

3.1.2 Lung Cancer Microarray Samples

The lung cancer microarray dataset (LCH) with twenty five pairs microarray

samples of lung adenocarcinoma provided by Dr. Chi-Ying F. Huang,

inaugurated as professor in National Yang Ming University. The lung

adenocarcinoma samples are analyzed for identifying the differentially

expressed membrane receptors and validating the analyzed results of other lung

cancer datasets from GEO.

3.2 Genomic Annotation

Gene Ontology (GO) database provides a controlled vocabulary to describe gene

and gene product features [41]. The descriptions are divided into three types:

cellular component, biological process and molecular function.

We use the GO terms as the filtering condition to select the cancer-specific

membrane receptors from differentially expressed gene list of each analyzed

microarray dataset.

(25)

Table 3.5 Membrane receptors defined by GO terms Platform Probeset (total) Gene (total) Probeset (membrane receptor) Gene (membrane receptor) Affymetrix HG-U133A 22283 12634 1287 768 Affymetrix HG-U133A Plus 2.0 54675 19804 1966 956

3.3 Membrane Receptor Classification

Based on structural and functional similarities, we divide membrane receptors

into three main classes: the ion channel-linked receptor, the protein

kinase-linked receptor and G protein-coupled receptor.

Membrane receptor classification is based on the information of Human Plasma

Membrane Receptome database and IUPHAR database. The Human Plasma

Membrane Receptome (HPMR) database stores the information of human

membrane receptor families involved in signal transduction [42]. The IUPHAR

database incorporates pharmacological, functional and pathophysiological

information on the G protein-coupled receptors (GPCRs), voltage-gated and

ligand-gated ion channels of human, mouse and rat [43].

3.4 Proteomic Knowledge

We use the information in UniProtKB/Swiss-Prot section to annotate the

membrane receptors. The detailed information in UniProtKB/Swiss-Prot would

make researchers comprehensively realize the membrane receptors.

The UniProt Knowledgebase (UniProtKB) collects lots of functional

information on proteins, with accurate, consistent and rich annotation [44].

(26)

UniProtKB consists of two sections: one contained manually-annotated records

with information extracted from literature and curator-evaluated computational

analysis and another contained computationally analyzed records that await full

manual annotation. The two sections are referred to as "UniProtKB/Swiss-Prot"

(reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed,

automatically annotated), respectively.

3.5 Pathway Information of Cellular Processes of Genes

The information in KEGG PATHWAY is used to annotate the involved pathways

of membrane receptors. Once knowing the involved pathways of membrane

receptors, researchers would select those which are appropriate for verification.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) knowledgebase

provides systematic analysis of gene functions, linking genomic information

with higher order functional information [45]. In the KEGG PATHWAY

database, it displayed the graphical representations of various cellular processes

and involved genes.

3.6 Antibody Manufacturers

The identified cancer-specific membrane receptors should be verify by

biological experiments or clinical tests, and there is a need for the corresponding

antibodies. We collect and provide the information of anti-CSMR antibodies

from some well-known commercial antibody manufacturers, such as Abnova

[46], Abcam [47], Invitrogen [48], Millipore [49] and Novus Biologicals [50].

(27)

4. Method

4.1 Flowchart Overview

Figure 4.1 Workflow for identifying cancer-specific membrane receptors.

4.2 Data Preprocessing

4.2.1 Sample Classification

The analyzed microarray datasets are classified into several groups according to

different cancer types (Figure 4.2). There are nineteen classified groups, and

each group includes one to several datasets.

(28)

4 8 5 6 2 1 1 1 1 1 3 2 3 2 1 3 2 4 1 Lung Cancer Blood Cancer Kidney Cancer Breast Cancer Liver Cancer Prostate Cancer Skin Cancer Neural Cancer Esophageal Cancer Nasopharyngeal Cancer Ovarian Cancer Gastrointestinal Cancer Colorectal Cancer Bladder Cancer Testicular Cancer Cervical Cancer Endocrine Tumor Head and Neck Cancer Other

Figure 4.2 Classification of analyzed microarray datasets. There are total twenty groups.

4.2.2 Microarray Data Normalization

4.2.2.1 Robust Multi-array Analysis

The non-biological variances within microarray should be removed before

analyzing the results of multiple microarray data. There are some widely-used

methods, such as MAS 5.0 [51], dChip [52], and RMA [53-55].

Robust Multi-array Analysis (RMA) is proposed by Irizarry et al. in 2003. RMA

can adjust background intensity and normalize the intensity in probe level. It is a

widely-used method for microarray data normalization because of its high

sensitivity and specificity in detecting differentially expressed genes [55].

(29)

When comparing the results of multiple high density oligonucleotide arrays, it is

important to remove sources of variation between arrays of non-biological origin,

such as unequal quantities of starting RNA, differences in labeling or detection

efficiencies between the fluorescent dyes used, and systematic biases in the

measured expression levels. The purpose of normalization is to adjust the

non-biological effects which come from variation in the microarray technology

so that meaningful biological comparisons can be made [56].

The microarray samples within each analyzed microarray dataset and those

samples belonging to cancer cell-lines and normal human tissues are normalized

by RMA method. We use the “affy” package from Bioconductor [57] and do the

normalization with function rma in R software [58].

4.2.3 Classification of Membrane Receptors

In order to realize whether there is any commonality among the identified

cancer-specific membrane receptors, the membrane receptors are classified into

several types: kinase, G-protein coupled receptors, ion channels, other and

uncertain.

If the membrane receptor is recorded in HPMR database or IUPHAR database

but it does not belong to any one group of kinase, GPCR or ion channel, it is

classified “other”. If the membrane receptor has no record in HPMR database or

IUPHAR database, it is classified “uncertain”.

(30)

4.3.1 Significance Analysis of Microarrays

Generally, the purpose of microarray analysis is to identify differentially

expressed genes between test group (e.g. cancer tissue) and control group (e.g.

normal tissue). There are many developed methods for identifying statistically

significant genes, such as t test, Mann-Whitney U test, SAM [59], MaxT [60],

and Rank Products [61-62]. It is often the case that small pergene variances can

make small fold-changes statistically significant in the t-statistic results. Tusher

et al. in 2001 proposed the SAM (Significance Analysis of Microarrays) method

to deal with this problem. In SAM, there is a “fudge factor” adding to the

denominator of the test statistic for eliminating the small variances. The fudge

factor is calculated from the sum of the global standard error of the genes.

Besides, repeated permutations of the data are used to determine if the

expression of any gene is significant related to the response.

In SAM, each gene is assigned a score d(i), “relative difference”, based on its

gene expression change relative to the standard deviation of repeated

measurements for that gene.

x_I

( )

i

and

x_U

( )

i

are defined as the average levels

of expression for gene i in states I and U respectively.

( )

0

)

(

s

i

s

i

x

i

x

i

d

I U

+

−

=

(1)

“gene-specific scatter” s(i) is the standard deviation of repeated expression

measurements:

(31)

( )

i

=

a

{

∑

_m

[

x

_m

( ) ( )

i

−

x

_I

i

]

+

∑

_n

[

x

_n

( )

i

−

x

_U

( )

i

]

}

s

2 2

(2)

where

∑

_m

and

∑

_n

are summations of the expression measurements in states I

and U, respectively,

a=

(

1n₁+1n₂

) (

n₁+n₂ −2

)

, and n

1

and n

2

are the numbers of

measurements in states I and U.

To find significant changes in gene expression, genes were ranked by magnitude

of their d(i) values, so that d(1) was the largest relative difference, d(2) was the

second largest relative difference, and d(i) was the ith largest relative difference.

For each of the p balanced permutations, relative differences d

p

(i) were also

calculated, and the genes were again ranked such that d

p

(i) was the ith largest

relative difference for permutation p. The expected relative difference, d

E

(i), was

defined as the average over the p balanced permutations,

dE

( )

i =

∑

_pdp

( )

i p

.

To identify potentially significant changes in expression, a scatter plot of the

observed relative difference d(i) vs. the expected relative difference d

E

(i) is used

(Figure 4.3). The genes were called “significant” when the value of

d

( )

i −d_E

( )

i

or

d_E

( ) ( )

i −d i

exceeded the threshold

∆

.

In the problem of multiple testing in microarray analysis, SAM provides an

estimate of the FDR for each value of the tuning parameter

∆

. The estimated

FDR is computed from permutations of the data and hence assumes that all null

hypotheses are true, allowing for the possibility of dependent tests.

(32)

Figure 4.3 SAM plot. The “significant” genes are marked by red and green. Red dots

means the differentially expressed genes is highly-expressed in cancer tissue; green dots means the differentially expressed genes is highly-expressed in normal tissue.

4.3.2 False Discovery Rate

The DNA microarray is a powerful tool for studying expressions of thousands of

genes simultaneously so that microarray experiments generate large multiple

testing problems. The family-wise error rate (FWER) [63] and false discovery

rate (FDR) [64] are two common error measures for choosing a significant

threshold in multiple testing.

The FWER is the probability of at least one false positive over the collection of

tests, regardless of how many genes are tested. The simplest FWER method is

(33)

the Bonferroni correction, which divides the conventional α-level by the number

of tested genes m as a significance level for each individual gene test. The

FWER approach could present a problem in the analysis as the analysis tends to

screen out all but a handful of genes that show extreme differential expression if

m is large.

The FDR considers the probability of false rejections (discoveries) among the

rejections as a false-positive error measure. That is, FDR considers the

probability of false selections among the selected genes.

One common objective in microarray experiments is to identify a subset of

genes that are differentially expressed among different experimental conditions.

The FWER approach seems to be unnecessarily stringent because falsely

selecting a small number of genes may not be a serious problem. The FDR

approach may be more desirable because it controls the proportion of falsely

differentially expressed genes.

4.3.3 Defining Membrane Receptor Genes by GO Terms

The criteria for defining membrane receptors are described as below: One gene

is thought as a membrane receptor gene when (1) its GO term of biological

function contain “receptor activity” and (2) its GO term of cellular component

contain (a) “integral to plasma membrane” or (b) “integral to the external side of

plasma membrane” or (c) “extrinsic to external side of plasma membrane” or (d)

(34)

“integral to membrane” and “plasma membrane” (Figure 4.4

5

).

Figure 4.4 Graphics of definition of GO: Cellular Component. Three conditions in Rule 2

are showed.

4.3.4 Identifying Cancer-Specific Membrane Receptors

After normalization, differentially expressed genes (DEGs) with statistically

significant differences between cancer and normal tissue are extracted by SAM

method. SAM was applied using the “siggenes” package for Bioconductor in R.

Permutations of the measurements are used to estimate the false discovery rate.

For each result, the significance level was chosen not to exceed 5%

(

p

value

≤

0 .

05 ) and the level of false positive rate was set not to exceed 1%,

5%, or 10% (

FDR ≤0.01

or

0.01< FDR ≤0.05

or

0.05< FDR ≤0.1

), based

on the number of significant genes respectively. The identified differentially

Extrinsic to external side of plasma membrane

Integral to external side of plasma membrane

Integral to plasma membrane

(35)

expressed genes must satisfy these two criteria above and required at least a

2-fold expression ratio. Cancer-specific membrane receptors are extracted from

the identified differentially expressed genes by selecting genes with GO terms

conformed to the criteria for defining membrane receptors.

4.3.5 Comparing Gene Expression of Cancer-Specific

Membrane Receptors between Cancer Tissue and

Normal Tissues over Human Body

In targeted drug delivery, if the expressions of CSMR are highly-expressed in

normal tissues, tissue injuries may occur. We make comparison among the

expression of CSMR in cancer tissue and normal human tissues and calculate

the tissue index for each normal tissue in order to be aware of the adverse effects

would come up in which tissue in advance.

We supposed C

i

is the expression of the ith CSMR in a cancer,

i=1,...,I

, and N

ij

is

the expression of ith CSMR in the jth normal tissue,

j=1,...,J

. Tissue index, T

ij

,

is the log2 ratio of C

i

over N

ij

for the ith CSMR.













=

ij i ij

N

C

T

log

₂

₍₃₎

The threshold of T

ij

is set to one, and it means that the expression of the ith

CSMR in a cancer is higher than that in the jth normal tissue by two fold.

The numbers of tissues indexes which pass the threshold for a CSMR are

counted and CSMR classification is based on the counted number. Some normal

(36)

tissues are regarded as important tissues (See Appendix), e.g. neural-related

tissues (brain, spinal cord, dorsal root ganglion, etc), heart-related tissues

(atrium, ventricle, coronary artery, etc), blood-related tissues (blood cells), and

some large tissues (lung, liver, kidney, endocrine gland, etc), and the tissue

indexes of these tissues are weighted heavily.

For ith CSMR in a cancer, it belongs to class 1 CSMR if all the T

ij

pass the

threshold; it belongs to class 2 CSMR if all the T

ij

of important tissues pass the

threshold but at least one T

ij

of other tissues does not pass the threshold; it

belongs to class 3 CSMR if at least one T

ij

of important tissues does not pass the

threshold.

Figure 4.5 Graphics of tissue index. Example of Class 3 CSMR. For Class 3 CSMR, there

is at least one important normal tissue of body whose expression level is not lower than that in cancer tissue by 2-fold.

(37)

5. Results

5.1 Statistics of identified CSMRs

For the analyzed results, the lower threshold of false discovery rate is 1% and

the upper threshold of false discovery rate is 10%. The number of genes

identified by SAM would change along with the FDR value. The threshold of

FDR value which is set to 0.01, 0.05 or 0.1 depends on that the numbers of

identified genes exceed one thousand under the condition. Smaller FDR value is

preferred. There is no identified CSMR in some results.

Table 5.1 Statistics of identified CSMRs

Result Cancer (subtype) CSMRs FDR GSE781 Kidney cancer (clear cell renal cell carcinoma) 56 0.01 GSE6344 Kidney cancer (clear cell renal cell carcinoma) 88 0.01 GSE10927 Kidney cancer (adrenocortical carcinoma) 26 0.01 GSE11151-1 Kidney cancer (conventional renal cell carcinoma) 142 0.01 GSE11151-2 Kidney cancer (collecting duct carcinoma) 54 0.1 GSE11151-3 Kidney cancer (chromophobe renal cell carcinoma) 13 0.1 GSE11151-4 Kidney cancer (renal oncocytoma) 12 0.05 GSE11151-5 Kidney cancer (papillary renal cell carcinoma) 15 0.01 GSE11151-6 Kidney cancer (Wilms' tumor) 0 0.1

GSE1420 Esophageal cancer (Barrett's associated adenocarcinoma)

51 0.05 GSE1722-1 Head and neck cancer (head and neck squamous

cell carcinoma)

11 0.1 GSE1722-2 Head and neck cancer (lymph node metastasis) 12 0.1

GSE3524 Head and neck cancer (oral squamous cell carcinoma)

20 0.05 GSE6791-3 Head and neck cancer (HPV+ tumor tissue) 62 0.01

(38)

GSE6791-4 Head and neck cancer (HPV- tumor tissue) 49 0.01 GSE9844 Head and neck cancer (oral tongue squamous cell

carcinoma)

16 0.01 GSE3167-1 Bladder cancer (without carcinoma in situ) 33 0.01 GSE3167-2 Bladder cancer (carcinoma in situ) 54 0.01 GSE3167-3 Bladder cancer (invasive carcinoma) 49 0.01 GSE7476-1 Bladder cancer (low grade superficial tumor) 34 0.1 GSE7476-2 Bladder cancer (high grade superficial tumor) 9 0.1 GSE7476-3 Bladder cancer (invasive tumor) 26 0.1

GSE3218 Testicular cancer (17 seminomas, 15 pure EC, 15 pure T, 10 pure YS, 2 pure CC, and 42 NSGCT

with mixed histologies)

114 0.01

GSE5788 Blood cancer (T-cell prolymphocytic leukemia) 1 0.05 GSE6477-1 Blood cancer (relapsed multiple myeloma) 6 0.01 GSE6477-2 Blood cancer (monoclonal gammopathy of

undetermined significance multiple myeloma)

1 0.01 GSE6477-3 Blood cancer (new multiple myeloma) 12 0.01 GSE6477-4 Blood cancer (smoldering multiple myeloma) 5 0.01 GSE6691-1 Blood cancer (chronic lymphocytic leukemia) 10 0.01 GSE6691-2 Blood cancer (multiple myeloma) 28 0.01 GSE6691-3 Blood cancer (Waldenström's macroglobulinemia,

B-lymphocyte)

6 0.05 GSE6691-4 Blood cancer (Waldenström's macroglobulinemia,

plasma cell)

30 0.01 GSE8835-1 Blood cancer (chronic lymphocytic leukemia,

CD4+ T cell)

17 0.01 GSE8835-2 Blood cancer (chronic lymphocytic leukemia,

CD8+ T cell)

4 0.01 GSE9476 Blood cancer (acute myeloid leukemia) 4 0.01 GSE6338-1 Blood cancer (peripheral T-cell lymphoma

unspecified)

131 0.01 GSE6338-2 Blood cancer (angioimmunoblastic lymphoma) 80 0.01 GSE6338-3 Blood cancer (anaplastic large cell lymphoma) 101 0.01 GSE12195 Blood cancer (diffuse large B-cell lymphoma) 186 0.01 GSE12453-1 Blood cancer (classical Hodgkin's lymphoma) 84 0.01 GSE12453-2 Blood cancer (nodular lymphocte-predominant

Hodgkin's lymphoma)

60 0.01 GSE12453-3 Blood cancer (T-cell rich B-cell lymphoma) 45 0.05

(39)

GSE12453-4 Blood cancer (follicular lymphoma) 19 0.05 GSE12453-5 Blood cancer (Burkitt's lymphoma) 19 0.01 GSE12453-6 Blood cancer (diffuse large B-cell lymphoma) 52 0.01 GSE6008-1 Ovarian cancer (endometrioid carcinoma) 53 0.01 GSE6008-2 Ovarian cancer (serous carcinoma) 59 0.01 GSE6008-3 Ovarian cancer (mucinous carcinoma) 57 0.01 GSE6008-4 Ovarian cancer (clear cell carcinoma) 44 0.01 GSE10971 Ovarian cancer (adnexal serous carcinoma) 39 0.01 GSE15578 Ovarian cancer (ovarian epithelial carcinoma) 15 0.1 GSE6883-1 Breast cancer (tumorigenic cell) 12 0.1 GSE6883-2 Breast cancer (non-tumorigenic cell) 3 0.1 GSE9574 Breast cancer 0 0.1 GSE3744 Breast cancer (basal-like cancer of breast

carcinoma)

22 0.01 GSE7904-1 Breast cancer (basal-like cancer of breast

carcinoma)

30 0.01 GSE7904-2 Breast cancer (non-basal-like cancer of breast

carcinoma)

19 0.01 GSE8977 Breast cancer (stroma of invasive ductal

carcinoma)

50 0.01 GSE10780 Breast cancer (invasive ductal breast carcinoma) 19 0.01

GSE7670 Lung cancer (lung adenocarcinoma 26 pairs and large cell lung cancer 1 pair)

7 0.01 GSE10072 Lung cancer (lung adenocarcinoma) 8 0.01 GSE10799-1 Lung cancer (lung adenocarcinoma, disseminate

into bone marrow)

14 0.1 GSE10799-2 Lung cancer (lung adenocarcinoma, not

disseminate into bone marrow)

27 0.05 LCH Lung cancer (lung adenocarcinoma) 42 0.01 GSE7803 Cervical cancer (invasive cervical squamous cell

carcinoma)

3 0.01 GSE9750 Cervical cancer (cervical squamous cell carcinoma) 17 0.01 GSE6791-1 Cervical cancer (HPV+ tumor tissue) 133 0.01

(40)

GSE6791-2 Cervical cancer (HPV- tumor tissue) 33 0.01 GSE12907 Neural cancer (juvenile pilocytic astrocytoma) 73 0.05 GSE3325-1 Prostate cancer (primary tumor) 12 0.05 GSE3325-2 Prostate cancer (metastasis) 4 0.1 GSE3678 Endocrine tumor (papillary thyroid carcinoma) 17 0.01 GSE6004 Endocrine tumor (papillary thyroid carcinoma) 27 0.05 GSE4107 Colon cancer (early onset colorectal carcinoma) 30 0.01 GSE4183 Colon cancer (colorectal carcinoma) 35 0.01 GSE13471 Colon cancer (colorectal carcinoma) 19 0.1 GSE6222-1 Liver cancer (hepatocellular carcinoma, stage 1) 0 0.1 GSE6222-2 Liver cancer (hepatocellular carcinoma, stage 3) 0 0.1 GSE6764-1 Liver cancer (hepatocellular carcinoma, HCV

infection, early)

16 0.01 GSE6764-2 Liver cancer (hepatocellular carcinoma, HCV

infection, advanced)

15 0.01 GSE7553-1 Skin cancer (basal cell carcinoma) 56 0.01 GSE7553-2 Skin cancer (primary melanoma) 28 0.05 GSE7553-3 Skin cancer (melanoma in situ) 0 0.1 GSE7553-4 Skin cancer (metastatic melanoma) 72 0.01 GSE7553-5 Skin cancer (squamous cell carcinoma) 20 0.05 GSE9576-1 Gastrointestinal cancer (midgut carcinoid primary

tumor)

27 0.01 GSE9576-2 Gastrointestinal cancer (midgut carinoid liver

metastasis)

21 0.05 GSE13911 Gastrointestinal cancer (gastric carcinoma) 27 0.01 GSE12452 Nasopharyngeal cancer (nasopharyngeal

carcinoma)

28 0.01 GSE13433 Other (alveolar soft-part sarcoma) 33 0.01

5.2 Case Study: Identifying CSMRs in Lung Adenocarcinoma

(41)

are two main types of lung cancer: small cell lung cancer and non-small cell

lung cancer. Lung adenocarcinoma is one of the main types of non-small cell

lung cancer. Lung adenocarcinoma is the commonest type of lung cancer in

non-smokers [65] and accounts for about forty percent of all cases of lung

cancer [66]. The 5-year survival rate of non-small cell lung cancer is about 15

percent [67]. The currently used anti-cancer drugs, such as cisplatin or

carboplatin, would cause severe side-effects. Researchers are seeking for an

effective targeting molecule so that anti-cancer drugs could be specifically

delivered to cancer cells, and decreasing the chance of occurred side-effects.

We analyze four microarray datasets of lung adenocarcinoma and try identifying

cancer-specific membrane receptors (CSMRs) for use in targeted drug delivery.

The basic information of analyzed datasets and the results are listed below

(Table 5.2 – Table 5.6).

Table 5.2 Analyzed datasets of lung adenocarcinoma

Dataset ID Microarray platform

Cancer tissue Cancer samples

Normal tissue Normal samples GSE10799 Affymetrix HG-U133A Plus 2.0 Lung adenocarcinoma 16 Normal bronchial epithelial tissue 3 Lung cancer dataset (LCH) Affymetrix HG-U133A Plus 2.0 Lung adenocarcinoma 25 Adjacent normal lung tissue 25 GSE7670 Affymetrix HG-U133A Lung adenocarcinoma 27 Adjacent normal lung tissue 27 GSE10072 Affymetrix HG-U133A Lung adenocarcinoma 58 Normal lung tissue 49

(42)

Table 5.3 Identified CSMRs in lung adenocarcinoma dataset, GSE10799, not disseminate into bone marrow

SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances s0 = 0.1889 (The 15 % quantile of the s values.)

Number of permutations: 220 (complete permutation) Delta: 1.108049

cutlow: -1.865 cutup: 2.216 p0: 0.5 FDR: 0.05

Identified Genes (using Delta = 1.108049):

Probeset Gene ID Gene Symbol d.value stdev rawp q.value 203954_x_at 1365 CLDN3 2.34 0.3203 0.011634 0.03218 201428_at 1364 CLDN4 2.47 0.2555 0.008761 0.02765 1007_s_at 780 DDR1 2.79 0.1856 0.004539 0.01823 207169_x_at 780 DDR1 2.49 0.2351 0.008383 0.02694 208779_x_at 780 DDR1 2.64 0.2856 0.006127 0.02203 210749_x_at 780 DDR1 3.18 0.2097 0.002083 0.01196 205107_s_at 1945 EFNA4 4.39 0.1302 0.000294 0.0046 226213_at 2065 ERBB3 2.99 0.2515 0.003001 0.01437 224404_s_at 83416 FCRL5 2.51 0.4322 0.00808 0.02623 235988_at 266977 GPR110 3.91 0.6297 0.000585 0.00633 238689_at 266977 GPR110 2.9 0.5442 0.003542 0.01576 210473_s_at 166647 GPR125 2.24 0.2325 0.014822 0.03682 209631_s_at 2861 GPR37 2.37 0.639 0.011098 0.03129 229105_at 2863 GPR39 5.01 0.3346 0.000142 0.00339 219936_s_at 53836 GPR87 2.91 0.5867 0.003465 0.01562 213831_at 3117 HLA-DQA1 3.27 0.9784 0.001749 0.01079 236203_at 3117 HLA-DQA1 2.83 0.6874 0.004138 0.01723 209480_at 3119 HLA-DQB1 3.2 0.9685 0.001999 0.01163 242517_at 84634 KISS1R 2.8 0.6798 0.004462 0.01803 218326_s_at 55366 LGR4 5.45 0.2874 9.42E-05 0.00306 205282_at 7804 LRP8 2.79 0.3302 0.004548 0.01824 208190_s_at 51599 LSR 3.29 0.251 0.001683 0.01062 203510_at 4233 MET 2.41 0.5399 0.010003 0.02939 223540_at 81607 PVRL4 2.44 0.2598 0.009521 0.02879 204729_s_at 6804 STX1A 3.17 0.258 0.002098 0.01198 212800_at 10228 STX6 4.1 0.1982 0.000432 0.00552 222581_at 9213 XPR1 2.44 0.196 0.009519 0.02879

Table 5.4 Identified CSMRs in lung cancer dataset, LCH

SAM Analysis for the Two-Class Paired Case s0 = 0.0446 (The 0 % quantile of the s values.) Number of permutations: 1000

Delta: 1.819148 cutlow: -2.042 cutup: 2.335

(43)

FDR: 0.00999

Probeset Gene ID Gene Symbol d.value stdev rawp q.value 1555779_a_at 973 CD79A 3.09 0.2104 0.001137 0.001029

203953_s_at 1365 CLDN3 6.98 0.27 0 0 203954_x_at 1365 CLDN3 6.29 0.2546 9.99E-08 2.50E-07

201428_at 1364 CLDN4 7.65 0.1553 0 0 1007_s_at 780 DDR1 7.43 0.101 0 0 207169_x_at 780 DDR1 7.31 0.123 0 0 208779_x_at 780 DDR1 7.27 0.1093 0 0 210749_x_at 780 DDR1 7.32 0.1129 0 0 205107_s_at 1945 EFNA4 9.96 0.1438 0 0 201983_s_at 1956 EGFR 3.37 0.1873 0.000516 0.000509 201984_s_at 1956 EGFR 3.53 0.1965 0.000322 0.000335 1438_at 2049 EPHB3 4.89 0.1421 3.00E-06 5.17E-06 202454_s_at 2065 ERBB3 5.46 0.1671 3.00E-07 6.39E-07 226213_at 2065 ERBB3 5.41 0.1531 3.00E-07 6.39E-07 206429_at 2150 F2RL1 5.84 0.1424 9.99E-08 2.50E-07 213506_at 2150 F2RL1 9.1 0.1643 0 0 222906_at 28982 FLVCR1 5.38 0.1387 4.00E-07 8.40E-07 238689_at 266977 GPR110 7.87 0.3125 0 0 223423_at 26996 GPR160 8.62 0.159 0 0 211633_x_at 3500 IGHG1 3.27 0.3152 0.000685 0.000654 211908_x_at 3500 IGHG1 2.76 0.3021 0.002713 0.002252 209374_s_at 3507 IGHM 3.94 0.2837 8.28E-05 9.95E-05 216491_x_at 3507 IGHM 3.1 0.3676 0.001103 0.001003 235583_at 286676 ILDR1 5.63 0.1697 2.00E-07 4.53E-07 227314_at 3673 ITGA2 5.11 0.1865 1.50E-06 2.83E-06 204989_s_at 3691 ITGB4 4.29 0.2331 2.28E-05 3.11E-05 204990_s_at 3691 ITGB4 4.65 0.1997 5.70E-06 8.95E-06 218326_s_at 55366 LGR4 9.05 0.1901 0 0 208433_s_at 7804 LRP8 5.76 0.1572 9.99E-08 2.50E-07 208190_s_at 51599 LSR 6.17 0.1717 9.99E-08 2.50E-07 203510_at 4233 MET 5.97 0.1387 9.99E-08 2.50E-07 228592_at 931 MS4A1 3.1 0.3073 0.001122 0.001018 204213_at 5284 PIGR 4.17 0.2934 3.65E-05 4.75E-05 207011_s_at 5754 PTK7 6.72 0.1074 0 0 200635_s_at 5792 PTPRF 5.82 0.1382 9.99E-08 2.50E-07 200637_s_at 5792 PTPRF 4.64 0.1688 5.90E-06 9.20E-06 223540_at 81607 PVRL4 6.02 0.1395 9.99E-08 2.50E-07 204916_at 10267 RAMP1 4.96 0.1505 2.40E-06 4.31E-06 204729_s_at 6804 STX1A 5.9 0.1352 9.99E-08 2.50E-07 219360_s_at 54795 TRPM4 3.61 0.2534 0.000248 0.000265

222581_at 9213 XPR1 7.22 0.1757 0 0 226615_at 9213 XPR1 5.04 0.1766 1.90E-06 3.50E-06

Table 5.5 Identified CSMRs in lung adenocarcinoma dataset, GSE7670

(44)

s0 = 0 Number of permutations: 1000 Delta: 2.144888 cutlow: -2.605 cutup: 3.065 p0: 0.5 FDR: 0.00993

Probeset Gene ID Gene Symbol d.value stdev rawp q.value 203953_s_at 1365 CLDN3 6.6 0.2907 5.21E-07 1.80E-06 203954_x_at 1365 CLDN3 5.82 0.1741 1.56E-06 4.07E-06 201428_at 1364 CLDN4 6.3 0.1579 1.04E-06 3.18E-06 201983_s_at 1956 EGFR 4.8 0.2173 4.72E-05 8.72E-05 216491_x_at 3507 IGHM 4.2 0.2958 0.000253 0.000378 208190_s_at 51599 LSR 6.82 0.184 3.47E-07 1.28E-06 203510_at 4233 MET 3.15 0.2845 0.004049 0.004328

Table 5.6 Identified CSMRs in lung adenocarcinoma dataset, GSE10072

SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances s0 = 0.0389 (The 0 % quantile of the s values.)

Number of permutations: 1000 Delta: 1.576606 cutlow: -1.788 cutup: 1.902 p0: 0.5 FDR: 0.01

Probeset Gene ID Gene Symbol d.value stdev rawp q.value 203953_s_at 1365 CLDN3 8.75 0.1633 0 0

201428_at 1364 CLDN4 6.86 0.1349 0 0 213506_at 2150 F2RL1 7.64 0.1292 0 0 211908_x_at 3500 IGHG1 5 0.1417 0 0 209374_s_at 3507 IGHM 3.98 0.2846 2.91E-06 3.50E-06 216491_x_at 3507 IGHM 5.73 0.2475 0 0 208190_s_at 51599 LSR 8.54 0.1004 0 0

203510_at 4233 MET 3.07 0.2056 0.000128 0.000118

Table 5.7 Comparisons of identified CSMRs from different datasets

Probeset Gene Symbol GSE10799 LCH GSE7670 GSE10072 1555779_a_at CD79A o X X

203953_s_at CLDN3 o o o 203954_x_at CLDN3 o o o

(45)

207169_x_at DDR1 o o 208779_x_at DDR1 o o 210749_x_at DDR1 o o 205107_s_at EFNA4 o o 201983_s_at EGFR o o 201984_s_at EGFR o 1438_at EPHB3 o 202454_s_at ERBB3 o 226213_at ERBB3 o o X X 206429_at F2RL1 o 213506_at F2RL1 o o 224404_s_at FCRL5 o X X 222906_at FLVCR1 o X X 235988_at GPR110 o X X 238689_at GPR110 o o X X 210473_s_at GPR125 o 223423_at GPR160 o X X 209631_s_at GPR37 o 229105_at GPR39 o X X 219936_s_at GPR87 o 213831_at HLA-DQA1 o 236203_at HLA-DQA1 o X X 209480_at HLA-DQB1 o 211633_x_at IGHG1 o 211908_x_at IGHG1 o o 209374_s_at IGHM o o 216491_x_at IGHM o o o 235583_at ILDR1 o X X 227314_at ITGA2 o X X 204989_s_at ITGB4 o 204990_s_at ITGB4 o 242517_at KISS1R o X X 218326_s_at LGR4 o o 205282_at LRP8 o 208433_s_at LRP8 o 208190_s_at LSR o o o o 203510_at MET o o o o 228592_at MS4A1 o X X

(46)

204213_at PIGR o 207011_s_at PTK7 o 200635_s_at PTPRF o 200637_s_at PTPRF o 223540_at PVRL4 o o X X 204916_at RAMP1 o 204729_s_at STX1A o o 212800_at STX6 o 219360_s_at TRPM4 o 222581_at XPR1 o o X X 226615_at XPR1 o X X ’X’ means that the probeset exist in Affymetrix HG-U133A Plus 2.0 platform, but not in

Affymetrix HG-U133A platform

Table 5.8 Examples of selected CSMR

Probeset Gene Symbol GSE10799 LCH GSE7670 GSE10072 203953_s_at CLDN3 o o o 203954_x_at CLDN3 o o o 201428_at CLDN4 o o o o 1007_s_at DDR1 o o 207169_x_at DDR1 o o 208779_x_at DDR1 o o 210749_x_at DDR1 o o 205107_s_at EFNA4 o o 201983_s_at EGFR o o 226213_at ERBB3 o o X X 213506_at F2RL1 o o 238689_at GPR110 o o X X 211908_x_at IGHG1 o o 209374_s_at IGHM o o 216491_x_at IGHM o o o 218326_s_at LGR4 o o 208190_s_at LSR o o o o 203510_at MET o o o o 223540_at PVRL4 o o X X 204729_s_at STX1A o o 222581_at XPR1 o o X X ’X’ means that the probeset exist in Affymetrix HG-U133A Plus 2.0 platform, but not in

(47)

According to the comparative result (Table 5.7), there are several CSMRs which

are identified by more than one dataset. We choose the CSMRs in the

intersection of results and use the expression profiles of LCH dataset for

demonstration (Table 5.8, Figure 5.2 – 5.22).

The gene name and fold change are marked upper-left in the figure. The red bar

means the average expression in cancer tissue, and the green bar means the

average expression in corresponding normal tissue. The blue line in the upper

two parts is the value of tissue index. In the upper two parts of figure, each

yellow bar represents the expression in a normal tissue of human body. In the

lower three parts of figure, each yellow bar represents the expression in a cancer

cell-line. (Figure 5.1)

According to the results of tissue index of gene, the sixteen chosen CSMRs are

classified (Table 5.9). The CSMRs belong to class 1 and 2 are recommended,

but the low expression of GPR110 (238689_at) and EFNA4 (205107_s_at)

should be watched out.

Table 5.9 Classification of CSMRs by tissue index

Class Gene (Probeset)

Class 1 GPR110 (238689_at), MET (203510_at), XPR1 (222581_at) Class 2 CLDN4 (201428_at), EFNA4 (205107_s_at), F2RL1 (213506_at),

IGHG1 (211908_x_at) , IGHM (209374_s_at), IGHM (216491_x_at) Class 3 CLDN3 (203953_s_at), CLDN3 (203954_x_at), DDR1 (1007_s_at),

DDR1 (207169_x_at), DDR1 (208779_x_at), DDR1 (210749_x_at), EGFR (201983_s_at), ERBB3 (226213_at), LGR4 (218326_s_at), LSR

(48)

Figure 5.1 Elucidation of CSMR expression plot. Red bar is the average expression level in

cancer tissue. Green bar is the average expression in corresponding normal tissue. Yellow bars include the average expression level in normal tissues of body (upper two parts) or in cancer

cell-lines (lower three parts).

Normal tissues of body Tissue

Indexes

Gene symbol and fold change

Cancer cell-line

(49)

(50)

以系統化方法辨識出能用於標靶藥物輸送的癌症專一性細胞膜受體

國立交通大學

生物資訊及系統生物研究所

碩士論文

以系統化方法辨識出能用於標靶藥物輸送的癌症專一性細

胞膜受體

A Systematic Method for Identifying Cancer-Specific

Membrane Receptors for Targeted Drug Delivery

研究生：黃煒志

指導教授：黃憲達 博士

以系統化方法辨識出能用於標靶藥物輸送的癌症專一

性細胞膜受體

A Systematic Method for Identifying Cancer-Specific

Membrane Receptors for Targeted Drug Delivery

研 究 生：黃煒志

Student: Wei-Chih Huang

指導教授：黃憲達 博士

Advisor: Dr. Hsien-Da Huang

國立交通大學

生物資訊及系統生物研究所

碩士論文

A Thesis

Submitted to Institute of Bioinformatics and Systems Biology

College of Biological Science and Technology

National Chiao Tung University

in partial Fulfillment of the Requirements

for the Degree of

Master

in

Bioinformatics

July 2010

Hsinchu, Taiwan, Republic of China

以系統化方法辨識出能用於標靶藥物輸送的癌

以系統化方法辨識出能用於標靶藥物輸送的癌

以系統化方法辨識出能用於標靶藥物輸送的癌

以系統化方法辨識出能用於標靶藥物輸送的癌

症專一性

症專一性

症專一性

症專一性細胞膜

細胞膜

細胞膜受體

細胞膜

受體

受體

受體

中文摘要

中文摘要

中文摘要

中文摘要

癌症是世界上主要死亡原因之一，目前已研發出相當多種類的抗癌藥作為

治療手段。但是傳統的抗癌藥無法準確輸送到癌症患處，身體上其他正常

的組織會因為抗癌藥的毒性而受到傷害。抗癌藥的副作用既會影響病人的

生活品質，也降低了治療效果，故本研究的目標即為如何將抗癌藥準確送

至 癌 細 胞 。 現 在 的 新 方 法 是 可 將 抗 癌 藥 包 裹 在 免 疫 微 脂 體 中

(immuno-liposome)，免疫微脂體表面的特定抗體會因為抗體-抗原親和力

(antigen-antibody affinity)的反應而與癌細胞表面的目標蛋白質結合，達到抗

癌藥對癌細胞的專一性輸送，然後抗癌藥會經由內吞作用進入癌細胞內。

我們以一系統化的方法來分析去氧核醣核酸微陣列(DNA microarray)資

料，以期能找出癌症專一性細胞膜受體(cancer-specific membrane receptor,

CSMR)作為藥物輸送的目標。我們比較了癌症專一性細胞膜受體在癌症組

織與在人體正常組織的 mRNA 表現量，希望能找出在癌症組織中為高表

現、在大部分正常組織中為低表現的癌症專一性細胞膜受體。除此之外，

我們也比較了癌症專一性細胞膜受體於數種癌症細胞株中的表現量。若癌

症專一性細胞膜受體在某癌症細胞株中的表現量與在同類癌症組織中的表

現量相似，該癌症細胞株則建議可用作實驗證明。

A Systematic Method for Identifying

Cancer-Specific Membrane Receptors for

Targeted Drug Delivery

Abstract

Cancer is a leading cause of death worldwide. Lots of anti-cancer drugs have

been invented for cancer therapeutics. Traditional anti-cancer drugs with serious

cytotoxicity would cause injuries of normal tissues owing to the incorrect drug

delivery. The side-effects of anti-cancer drugs would not only debase the life

quality of patients but decrease the therapeutic efficacy. The purpose of this

study is to find a way to accurately deliver anti-cancer drugs to cancer cells.

Currently, anti-cancer drugs encapsulated in the immunoliposome would be

specifically transported to cancer cells by the mechanism of antigen-antibody

affinity between the coated antibody of immunoliposome and its target protein

on the surface of cancer cell and then enter the cancer cell by endocytosis. We

指導教授：黃憲達博士

研究生：黃煒志

指導教授：黃憲達博士

至癌細胞。現在的新方法是可將抗癌藥包裹在免疫微脂體中