Chapter 4 Evaluation and Results
4.1 Experimental Data
In this study, the National Health Insurance Research Database (NHIRD) in Taiwan and the MEDLINE database, combined with three ontology databases, serve as our data sources. Experimental data from the different sources are preprocessed in different ways, as described in Section 3.1.
4.1.1 NHIRD
We use NHIRD data covering 2000 to 2009. The control window and the surveillance window used in the data preprocessing phase are both set to 12 months. After generating patient visits and mapping NHIRD codes to ATC codes, DADs are generated for each patient visit from 2000 to 2009 by using the two time windows; data from 1999 serve as the control window for visits in 2000.
Because labeling signals takes considerable time and effort during query collection, we follow Hsieh (2014) and choose the same four disease types as our disease-anchored queries. Table 17 lists the ICD codes for each disease type.
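The two 12-month windows can be illustrated with a minimal sketch. It assumes that the control window precedes the visit date (consistent with 1999 serving visits in 2000) and that the surveillance window follows it; the second point, the treatment of window endpoints, and the names `add_months` and `windows_for_visit` are illustrative assumptions rather than the definitions given in Section 3.1.

```python
import calendar
from datetime import date
from typing import Tuple

WINDOW_MONTHS = 12  # both windows are set to 12 months in this study


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the end of the target month."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def windows_for_visit(visit: date) -> Tuple[Tuple[date, date], Tuple[date, date]]:
    """Return (control, surveillance) date bounds around one visit.

    Assumption: control = the 12 months before the visit,
    surveillance = the 12 months after the visit.
    """
    control = (add_months(visit, -WINDOW_MONTHS), visit)
    surveillance = (visit, add_months(visit, WINDOW_MONTHS))
    return control, surveillance


if __name__ == "__main__":
    control, surveillance = windows_for_visit(date(2000, 3, 15))
    print("control:", control[0], "to", control[1])
    print("surveillance:", surveillance[0], "to", surveillance[1])
```

Under this reading, a visit in early 2000 draws its control window from 1999, which is why one extra year of claims data is needed before the study period.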
Table 17. Disease types and their corresponding ICD-9-CM codes
Disease Type             ICD-9-CM codes                                            Total
Cancer                   140−165, 170−176, 179−208                                 63
Cardiovascular events    402, 404, 410, 411, 413, 414, 424, 426, 427, 428, 7943    11
Hepatotoxicity           2774, 570, 573, 576, 7824                                 5
Acute renal toxicity     584, 586                                                  2
Unique ICD codes                                                                   81
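The totals in Table 17 can be checked by expanding the code ranges. The following is a minimal sketch; the dictionary `TABLE_17` and the helper `expand_codes` are hypothetical names introduced for illustration, and ranges are written with an ASCII hyphen.

```python
from typing import Dict, List

# Code lists transcribed from Table 17 (ranges are inclusive).
TABLE_17: Dict[str, str] = {
    "Cancer": "140-165, 170-176, 179-208",
    "Cardiovascular events": "402, 404, 410, 411, 413, 414, 424, 426, 427, 428, 7943",
    "Hepatotoxicity": "2774, 570, 573, 576, 7824",
    "Acute renal toxicity": "584, 586",
}


def expand_codes(spec: str) -> List[str]:
    """Expand a comma-separated list of codes and inclusive ranges into single codes."""
    codes: List[str] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            codes.extend(str(n) for n in range(int(low), int(high) + 1))
        else:
            codes.append(part)
    return codes


counts = {disease: len(expand_codes(spec)) for disease, spec in TABLE_17.items()}
unique_codes = {code for spec in TABLE_17.values() for code in expand_codes(spec)}

if __name__ == "__main__":
    for disease, n in counts.items():
        print(f"{disease}: {n}")
    print("Unique ICD codes:", len(unique_codes))
```

The per-type counts (63, 11, 5, 2) sum to 81 and no code is shared between disease types, so the number of unique codes equals the sum.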
All the disease-based drug-outcome pairs were labeled by eight experts, including graduate students of the School of Pharmacy at National Taiwan University and pharmacists at National Taiwan University Hospital. The eight experts were divided into two groups. The drug-disease pairs in each disease query were separated into four parts, and each part was labeled by two coders, one from each group (see Figure 13).
Figure 13. The arrangement of coders
They used Micromedex, which provides information on the relations between drugs and diseases, to assign an appropriate label to each drug-disease pair. Note that these disease-based drug-outcome pairs were first filtered by the disproportionality thresholds (𝑎 ≥ 10 and 𝑅𝑂𝑅 ≥ 1.5) described in Section 3.4.1 to remove insignificant pairs. Because two coders label the same pairs, the rule described by Hsieh (2014) and shown in Figure 14 is used to resolve inconsistencies between the two coders from different groups.
Figure 14. The rule to solve the inconsistency problem
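The disproportionality filter applied before labeling (𝑎 ≥ 10 and 𝑅𝑂𝑅 ≥ 1.5) can be sketched as follows. The sketch assumes the conventional 2×2 contingency table, in which 𝑎 counts cases with both the drug and the outcome, and the standard reporting odds ratio 𝑅𝑂𝑅 = (𝑎·𝑑)/(𝑏·𝑐); the exact definitions used in this study are those of Section 3.4.1, and the class and function names below are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """2x2 table for one drug-outcome pair (assumed conventional layout)."""
    a: int  # drug present, outcome present
    b: int  # drug present, outcome absent
    c: int  # drug absent, outcome present
    d: int  # drug absent, outcome absent


def reporting_odds_ratio(t: Contingency) -> float:
    """Return (a*d)/(b*c); infinite or undefined when the denominator is zero."""
    if t.b == 0 or t.c == 0:
        return math.inf if t.a > 0 and t.d > 0 else math.nan
    return (t.a * t.d) / (t.b * t.c)


def passes_thresholds(t: Contingency, min_a: int = 10, min_ror: float = 1.5) -> bool:
    """Keep a pair only if it meets both thresholds (NaN never passes)."""
    return t.a >= min_a and reporting_odds_ratio(t) >= min_ror


if __name__ == "__main__":
    example = Contingency(a=20, b=80, c=100, d=800)
    print(reporting_odds_ratio(example), passes_thresholds(example))
```

Requiring both conditions discards pairs that are frequent but not disproportionate as well as pairs that are disproportionate but supported by very few cases.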
As a result, we adopt the labeled results from Hsieh (2014); Table 18 summarizes the count of each label type in the four disease types.
Table 18. Summary of different label types in four disease types
Cancer CV Hepatotoxicity Acute renal failure
However, since we intend to improve the effectiveness of drug safety detection by using biomedical literature as the secondary data source, the publication year of the literature must be considered. We retained as training data the drug-disease pairs (with ranks 4 and 3) for which the first literature on the relation between the drug and the disease was published in or after 2000. In other words, we used the pairs whose first reference appeared in 2000 or later as the input to our learning system, and used all the literature before 2000 to build the concept network. Table 19 summarizes all label types for each disease type after considering the year of literature.
Table 19. Summary of different label types in four disease types after considering the year of literature
Cancer CV Hepatotoxicity Acute renal failure
In addition, if the MeSH term of a drug is not mentioned before 2000, the pairs related to this drug are also removed. A drug term that does not appear in any literature before 2000 may cause many null values in the literature-based measures, because the concept network is built from the literature before 2000. As a result, we removed the drugs (114 unique ATC codes) that were not mentioned before 2000. Table 20 summarizes all label types for each disease type after considering both the year of literature and the year in which each drug emerged.
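The two temporal filters described above can be sketched as a single predicate. The sketch covers only pairs with ranks 4 and 3, since the text states the literature-year rule for those ranks; the record layout, the field names, and the placeholder drug codes are hypothetical.

```python
from typing import List, NamedTuple, Optional

CUTOFF_YEAR = 2000  # literature before this year builds the concept network


class CandidatePair(NamedTuple):
    """One labeled drug-disease pair with rank 4 or 3 (hypothetical record layout)."""
    atc_code: str
    disease_type: str
    first_reference_year: Optional[int]     # earliest paper relating drug and disease
    drug_first_mention_year: Optional[int]  # earliest paper mentioning the drug's MeSH term


def keep_for_training(pair: CandidatePair, cutoff: int = CUTOFF_YEAR) -> bool:
    """Apply both filters: first reference in or after the cutoff, drug known before it."""
    if pair.first_reference_year is None or pair.first_reference_year < cutoff:
        return False
    if pair.drug_first_mention_year is None or pair.drug_first_mention_year >= cutoff:
        return False
    return True


def filter_pairs(pairs: List[CandidatePair]) -> List[CandidatePair]:
    return [p for p in pairs if keep_for_training(p)]


if __name__ == "__main__":
    pairs = [
        CandidatePair("DRUG_A", "Cancer", 2003, 1985),          # kept
        CandidatePair("DRUG_B", "Cancer", 1996, 1980),          # relation known before 2000
        CandidatePair("DRUG_C", "Hepatotoxicity", 2005, 2002),  # drug not mentioned before 2000
    ]
    print([p.atc_code for p in filter_pairs(pairs)])
```

Both conditions serve the same purpose: the relation to be predicted must be absent from the pre-2000 concept network, while the drug itself must be present in it so that literature-based measures can be computed.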
Table 20. Summary of different label types in four disease types after considering the year of literature and the emerged year of drugs.
Cancer CV Hepatotoxicity Acute renal failure
4.1.2 MEDLINE
(MEDLINE 2011 baseline). We remove the irrelevant publication types suggested by Yetisgen-Yildiz and Pratt (2009) and adopt MeSH terms as our terminology in order to retrieve representative medical terms. Then, we extract 2,623,222 relations using Chen’s (2013) semantic subcategory filtering. After retrieving the relations, we construct two types of concept networks for the prediction of adverse drug reactions.
4.1.3 Ontologies
Three ontology databases are used in our method. First, DrugBank is a database which provides information about targets, pathways, indications, adverse effects, and
is 14,542. Second, Online Mendelian Inheritance in Man (OMIM) is an online knowledge base of human genes and genetic phenotypes. The number of gene-disease relations from OMIM is 4,380. Third, the Comparative Toxicogenomics Database (CTD) integrates data from the scientific literature to describe chemical interactions with genes and proteins, and associations between diseases and genes or proteins. The number of chemical-gene interactions from CTD is 869,902, and the number of gene-disease associations is 27,397.
As mentioned in Chapter 3, terms from the different coding systems are translated to MeSH terms as suggested by Chen (2013); Table 10 summarizes the number of relations and their relation types retrieved from the different ontology databases.
4.1.4 Term Mapping
In the Term Mapping step, we use the 2015AB version of UMLS as our thesaurus. The four disease types contain 81 unique ICD codes, and 1,177 unique ATC codes are involved in the drug-disease pairs related to the four disease types. After mapping, 52 of the 81 unique ICD codes and 1,041 of the 1,177 unique ATC codes map to the corresponding MeSH term(s).
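The mapping coverage implied by these counts can be computed directly. The following is a small sketch; the function name `coverage` is illustrative.

```python
def coverage(mapped: int, total: int) -> float:
    """Fraction of codes that map to at least one MeSH term."""
    if total <= 0:
        raise ValueError("total must be positive")
    return mapped / total


icd_coverage = coverage(52, 81)      # unique ICD codes with a MeSH mapping
atc_coverage = coverage(1041, 1177)  # unique ATC codes with a MeSH mapping

if __name__ == "__main__":
    print(f"ICD coverage: {icd_coverage:.1%}")  # 64.2%
    print(f"ATC coverage: {atc_coverage:.1%}")  # 88.4%
```

About 64.2% of the ICD codes and 88.4% of the ATC codes therefore have at least one MeSH counterpart; pairs involving unmapped codes cannot be linked to the literature-based concept network.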