4.1. Accuracy
In Example 1, we used three variants to demonstrate one application of Variants Search: identifying pathogenic variants that cause Brugada syndrome. Although all three variants have been reported as Brugada syndrome-related, two of them received index 3, whereas the third received index 1 and was assigned low prediction scores by both REVEL and CADD. Although the index system and these functional prediction scores may appear to give conflicting results when predicting a pathogenic variant, the index system takes a more comprehensive approach, combining the variant type, the REVEL score, the GERP++ score, and the gene expression profile of human heart muscle.
To improve the accuracy of pathogenic variant prediction, VariED could integrate protein structure prediction tools in the future. However, these tools require heavy computing resources and, because VariED places no limit on the number of queries, could increase processing time. At present, therefore, incorporating protein structure prediction tools into VariED may not be practical.
4.2. Tissue-based gene expression profiles
It is important to identify gene expression profiles in specific tissues. Take the natriuretic peptide B (NPPB) gene as an example: it encodes a secreted protein that functions as a cardiac hormone and shows its highest level of enriched expression in heart tissue. Protein Atlas also reports the highest tissue specificity score for this gene. Genes such as NPPB illustrate that different tissues have their own gene expression patterns. For this reason, the Expression Profiles service was built to help researchers understand gene expression patterns in a tissue of interest and thereby advance disease research.
In Example 3, we demonstrated one use of the Expression Profiles service: identifying candidate genes for heart diseases from gene expression profiles. Four genes (GAPDH, SCN5A, GK2 and MYBPC3) were used to illustrate the Expression Profiles function. GAPDH is one of the most commonly used housekeeping genes for comparing gene expression data. GAPDH is highly expressed in both heart muscle and testis, so its expression is stable across tissues relative to the other three genes. SCN5A is expressed mainly in heart muscle. GK2 is expressed at a high level in the testis rather than in heart tissue. In contrast, MYBPC3 shows very high expression in heart muscle. These results are not only consistent with the expression profiles of the adult male C57BL/6 mouse but also support the hypothesis of tissue-specific endogenous expression profiles. The Expression Profiles function helps researchers reduce unnecessary costs and discover the “gold” among a massive number of genes.
4.3. Data collection of gene expression profiles in different species
The expression profile data in VariED were obtained from human, mouse and zebrafish. Mouse and zebrafish are popular model organisms in heart disease research, and their genes show high conservation and identity with human genes.
Taking zebrafish as an example, 70 percent of human genes can be found in zebrafish [46]. One potential limitation is that endogenous expression profiles in a tissue may differ between growth stages. The expression data in VariED were obtained from each species at the adult stage and may therefore not represent gene expression at all growth stages. In addition, there are many mouse strains; VariED collected expression data from a male C57BL/6 mouse, which may not represent the expression profiles of all strains. Moreover, the gene expression profile data for mouse and zebrafish were obtained from a single RNA-Seq sample. Previous research suggested that at least six biological replicates should be used in an RNA-Seq experiment [47]. Only the human data were derived from more than six RNA-Seq replicates and thus meet this standard. We believe that sufficient replication effectively reduces experimental error, so these data can be used to confirm the gene expression pattern in a specific human tissue. Because the mouse and zebrafish RNA-Seq samples in VariED do not currently meet this standard, the sample sizes for mouse and zebrafish need to be increased in the future to provide more credible information to users.
4.4. Mapping RNA-Seq reads to the reference genome
The expression profile data in VariED were derived from RNA-Seq, a technology that measures expression levels accurately. To estimate expression from RNA-Seq data, the short sequencing reads must be mapped to a reference genome or a transcript set.
Depending on the reference sequence to which reads are mapped, expression profiles can usually be divided into two types: gene level or transcript level. A single gene can have multiple transcripts. Therefore, gene expression profiles are calculated from the overall expression of all transcripts of a gene, and transcript expression profiles from the overall expression of all exons of a transcript.
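The relationship between the two levels can be illustrated with a minimal Python sketch. It assumes that transcript-level abundances are already available in an additive unit such as TPM and that gene-level expression is taken as their sum; the function name, identifiers, and values are hypothetical and are not taken from VariED.

```python
from collections import defaultdict


def gene_level_expression(transcript_tpm, transcript_to_gene):
    """Sum transcript-level abundances (assumed additive, e.g. TPM) per gene."""
    gene_tpm = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        # Each transcript contributes its abundance to its parent gene.
        gene_tpm[transcript_to_gene[transcript_id]] += tpm
    return dict(gene_tpm)


# Hypothetical values for illustration only; not VariED data.
transcript_tpm = {"tx1": 12.0, "tx2": 3.5, "tx3": 7.25}
transcript_to_gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}

gene_tpm = gene_level_expression(transcript_tpm, transcript_to_gene)
print(gene_tpm)
```

In this sketch, a gene with two transcripts receives the sum of both abundances, so an error in assigning a read to one transcript rather than the other does not change the gene-level total.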
To map reads to the reference sequence, many tools, such as Cufflinks, use a maximum likelihood method. Reads are assigned to the reference sequence with the largest likelihood, but the actual mapping relationships between reads and reference sequences are difficult to determine in detail. Furthermore, short reads may map non-specifically, which makes it even more difficult to identify the correct target reference sequence. Consequently, gene expression profiles can be calculated more accurately than transcript expression profiles. For this reason, the Expression Profiles function provides tissue-based gene expression profiles rather than transcript expression profiles. The Variants Search function also uses expression profiles as a criterion when assigning an index to each variant. However, some diseases may involve only one mutated transcript of a gene.
In such cases, knowing whether the target transcript is expressed helps researchers decide the next step of a disease study. Transcript expression profiles are clearly more helpful for this kind of disease research, so they will be incorporated into the VariED service in the future.
For the sake of accuracy, we hope that a more accurate read-mapping method will soon become available and can be adopted in the VariED system.
4.5. Characteristics
Several important tools are already available that infer function and retrieve information from gene expression data and protein-protein interaction networks. Each of these tools performs specific functions but has its own limitations. For example, Protein Atlas [23], which uses quantitative transcriptomics at the tissue and organ level combined with protein profiling to provide a map of the human tissue proteome, can search only one gene or tissue at a time and has no information on variant location or sequence. GeneCards [48], an integrative database that provides a comprehensive, user-friendly platform for all annotated and predicted human genes by integrating data from approximately 125 web sources, offers batch search, but with a limit of 100 genes per query (or unlimited usage for an annual payment of $149). UniProt [45] provides a comprehensive and freely accessible resource of protein sequence and functional information but supports only a limited batch search of 100,000 variants. Sources such as 1000Genomes [30], NHLBI ESP, the Integrative Japanese Genome Variation Database (IJGVD) [32], and Taiwan BioBank provide allele frequencies for different populations but do not support batch search. Although ANNOVAR can annotate variants in batches, it does not provide information on gene expression in different organs.
For functional prediction of variants, VariED incorporates REVEL, GERP++ and CADD scores. REVEL incorporates pathogenicity predictions from 18 individual scores, comprising 8 conservation scores and 10 functional scores, and the GERP++ score is among them. However, the REVEL score behaves more like a functional score than a conservation score: it indicates the deleteriousness of a variant but not the evolutionary conservation at the variant site. For this reason, GERP++ scores are still necessary for users to assess evolutionary conservation at a variant site. CADD also uses an ensemble method to build its prediction scores. Although CADD does not perform as well as REVEL for missense variants, it has an important advantage for genome-wide NGS applications: it provides scores for noncoding variants. We therefore provide all three scores to users. Regarding variant phenotypes, HGMD, a searchable resource of comprehensive data on published mutations in human inherited disease, provides chromosome coordinate information and VCF file search only in the professional version. InterVar provides clinical interpretation information and supports VCF file search, but it takes a long time to process a VCF file of normal size. It should be noted that most current databases can search only one gene or variant at a time. VariED integrates information from multiple databases and offers one comprehensive platform with unlimited batch-search facilities.
4.6. Processing speed
VariED is a free, web-based tool with unlimited variant search that offers multi-level annotation information, supporting comparative scoring schemes, and tissue-specific gene expression profiles. Through its links to the 1000 Genomes and Taiwan Biobank databases, it is also a central source of allele frequencies for different populations, including East Asian, African, American, European, South Asian, Taiwanese and Chinese populations. VariED is user-friendly and returns a detailed, aggregated report as an exportable table (CSV file) for easy documentation of the variant review. The database is larger than 30 GB. With all options selected, querying a set of 100 variants takes approximately 100 seconds on average, and querying 16,078 variants takes approximately 15 minutes.
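As a rough illustration of these figures, the following Python sketch converts the two reported timings into an average time per variant. It uses only the numbers quoted above; the function name is hypothetical, and the result is simple arithmetic rather than a benchmark.

```python
def seconds_per_variant(n_variants, total_seconds):
    """Average wall-clock time per variant for one query."""
    return total_seconds / n_variants


# Reported timings with all options selected (approximate values from the text).
small_query = seconds_per_variant(100, 100)         # 100 variants in ~100 s
large_query = seconds_per_variant(16078, 15 * 60)   # 16,078 variants in ~15 min

print(f"100 variants:    {small_query:.3f} s per variant")
print(f"16,078 variants: {large_query:.3f} s per variant")
```

Under these figures, the larger query averages well under a tenth of a second per variant, compared with about one second per variant for the 100-variant query, so the total query time does not appear to grow in direct proportion to the number of variants.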
VariED integrates several databases and variant analysis tools, so its processing time depends on the tools it integrates, such as ANNOVAR and CADD. If an integrated tool takes a long time to run, the average query time increases; conversely, if these tools become more efficient, the average query time decreases.