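From Clinical Narrative Reports to Recurrence Predictive Models: Information Extraction, Data Query, and Data Mining for Liver Cancer Patients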

Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Doctoral Dissertation

From Clinical Narrative Reports to Recurrence Predictive Models: Information Extraction, Data Query, and Data Mining for Liver Cancer Patients

Xiao-Ou Ping

Advisor: Feipei Lai, Ph.D.

May 2014


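Acknowledgements

First, I would especially like to thank my advisor, Professor Feipei Lai, who guided, encouraged, accommodated, and led me throughout this journey of study at National Taiwan University. Like a lighthouse in a boundless sea of learning, you patiently led me to keep working and moving forward. You are not only my academic mentor; your enthusiasm and energy in dealing with people and matters are a fine example for your students. I am also very grateful that you generously encouraged students when they met setbacks, and that you gave me more opportunities to learn and grow from my mistakes. Thank you for unreservedly sharing and passing on your experience and wisdom in research and in life, which gave me many insights. In the days to come, we will always remember your selfless devotion and loving care for your students. With this period of learning behind us, I believe we will have more courage to face the many difficulties and setbacks that lie ahead in life.

Next, I thank the members of the team, whose continued help and support allowed the research to proceed more smoothly. I thank the senior and junior members of the laboratory for their guidance and for sharing their experience, from which I benefited greatly. I thank the senior colleagues at National Taiwan University Hospital for generously offering their advice and guiding the direction of my research. I thank the oral examination committee members for taking time out of their busy schedules to review this dissertation and for giving much valuable guidance and advice. Looking back on this journey, there are far too many people to thank, and the gratitude I wish to express is hard to convey in so short a space. I sincerely thank everyone who has ever helped me; these profound experiences have enriched my life.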


ABSTRACT

Background: According to global cancer statistics from 2011, liver cancer was the second most frequent cause of cancer death in men and the sixth leading cause of cancer death in women. The goal of this work is to develop a recurrence predictive model for patients who have received radiofrequency ablation (RFA) treatment, based on clinical narrative reports and structured data sources.

Introduction: With the increasing adoption of electronic medical record (EMR) systems, more and more medical records accumulate in the clinical data repository. To track patients’ conditions efficiently over long periods and to facilitate the collection of clinical data from different types of narrative reports, an efficient method for analyzing the clinical data accumulated in narrative reports is critical. For structured data, querying the data stored in the clinical data repository becomes increasingly important for discovering the knowledge contained in enormous volumes of data. In medical research, missing data occur frequently, and the data used in this work to develop a liver cancer recurrence predictive model are no exception. Methods for dealing with missing data are therefore necessary. In this study, several imputation methods and their effects on multiple measurement data sets with different sampling time periods are compared.

Materials and Methods: To facilitate liver cancer clinical research, a method is developed for extracting clinical factors from mixed types of narrative clinical reports, including ultrasound reports, radiology reports, pathology reports, operation notes, admission notes, and discharge summaries. An information extraction (IE) module is developed for tracking a liver cancer patient’s disease progression over time, and a rule-based classifier is developed for answering whether patients meet the clinical research eligibility criteria. The classifier provides answers together with direct and indirect evidence (i.e., evidence sentences) for the clinical questions. The ontology-driven clinical guideline representation language, Guideline Interchange Format version 3.5 (GLIF3.5), is used for formulating query tasks. The query tasks are formulated as GLIF3.5 flowcharts in the environment provided by Protégé. The query execution engine of the Flowchart-Based Data-Querying Model (FBDQM) is developed and implemented for executing the query tasks and presenting the query results in a visualized graphical interface. The correctness and the execution-time performance of the system are evaluated in experiments using three medical query tasks relevant to liver cancer. To develop predictive models from incomplete clinical data, several imputation methods are adopted for handling the missing data before data analysis. A support vector machine (SVM) is employed to build the recurrence predictive model. This study introduces a two-level method for evaluating imputation methods and predictive models when missing data occur. The first level evaluates the performance of an imputation method, and the second level evaluates the influence of imputation methods on predictive models.

Results: The IE module achieves F-scores from 92.40% to 99.59%, and the classifier achieves accuracy from 96.15% to 100%. The FBDQM-based query execution engine successfully retrieves the clinical data specified by query tasks formulated in GLIF3.5 in experiments with different numbers of patients.

The correctness of the three query tasks is 100% in four experiments. For each imputation method, more appropriate parameter settings for a specific data set can be selected based on the imputation simulation experiment. The results reveal that


According to the evaluation results, the leading imputation methods differ significantly (P < 0.001) from mean imputation, which is frequently applied to data sets with missing values.

Conclusions: The approach is successfully applied to mixed types of narrative clinical reports, and it might be applied to the extraction of key information for patients with other types of cancer. The ontology-driven, FBDQM-based approach enriches the capability of data query. The adoption of GLIF3.5 increases the potential for interoperability, shareability, and reusability of the query tasks. With the two-level evaluation approach, imputation methods and their effects on different multiple measurement data sets for the classification of liver cancer recurrence can be explored.

Key words: clinical narrative reports, electronic medical records, information extraction, query languages, missing values, imputation, predictive model, liver cancer

LIST OF FIGURES

Figure 1. The overview of this study: from clinical narrative reports to recurrence predictive model

Figure 2. The overview of this method: the information extraction module phase and rule-based classifier phase

Figure 3. The overview of the methodology used in the data query tool based on ontology-driven methodology and flowchart-based model

Figure 4. The architecture of the Flowchart-Based Data-Querying Model (FBDQM) query execution engine

Figure 5. The flowchart-based data querying model (FBDQM) contains the structure of query task workflow and the information of each node in the query task

Figure 6. The query criteria selection interface and the GLIF3.5 ontology information viewer

Figure 7. The flexibility in selecting all or partial nodes of the flowchart for participating in the execution of query operation

Figure 8. The query criteria of the selected nodes can be displayed in the query criteria selection interface

Figure 9. The GLIF3.5 ontology information viewer. The selected node, “degree decision”, and its corresponding information in GLIF3.5 format

Figure 10. The clinical data representation interface

Figure 11. The clinical data representation interface shows the retrieved results of “Degrees of liver damage with mutually exclusive setting” query task

Figure 12. The clinical data representation interface shows the retrieved results of

Figure 13. Evaluation of imputation methods based on simulation experiments and development of predictive model based on imputed data sets (DSs) without missing values (MVs)

Figure 14. An overview of methods for exploring imputation methods and their effects on different multiple measurement data sets

Figure 15. The simulation experiment based on data set without missing values using seven imputation methods

Figure 16. The performances of the system based on three query tasks in four experiments with different amounts of patients

Figure 17. The rate of imputation methods which has top 10 leading performances with different parameter settings among 18 features with missing values and among all data sets with different periods

Figure 18. The rate of imputation methods which has top 10 leading performances with different parameter settings for different sampling time periods among 18 features with missing values

Figure 19. Simulation results of ALP with different time periods (7 days to 30 days)

Figure 20. Simulation results of ALP with different time periods (60 days to one value)

Figure 21. The imputation method selection criterion (averaging Q1, median and Q3) of seven imputation methods for ALT in all sampling time periods

Figure 22. BACs based on eight multiple measurement data sets and seven imputation methods

Figure 23. Sensitivity and specificity of seven imputation methods and eight multiple measurement data sets

LIST OF TABLES

Table 1. The major concepts and their related concepts

Table 2. The textual expression examples of different concepts

Table 3. The evidence examples of a patient’s summarization status

Table 4. The examples of the query criteria in GLIF3.5 and the corresponding translated SQL-based query criteria

Table 5. Inter-annotator agreement for information extraction of concept entities

Table 6. Inter-annotator agreement for classification

Table 7. Results of evaluating the information extraction module

Table 8. Results of evaluating the rule-based classification

Table 9. The correctness of the system in experiments with three query tasks

Table 10. The performance of execution time in experiments with three query tasks

Table 11. Number of data items and missing rate of data items in multiple measurement data sets with eight sampling time periods

Table 12. Ranges of sensitivity and specificity of imputation methods applied to data sets with different sampling time periods

Table 13. Classification results of eight time periods, including the average and standard deviation classification results among different imputation methods

Table 14. Classification results of seven imputation methods, including the average and standard deviation classification results among different multiple measurement data sets

Table 15. BACs of seven imputation methods among different multiple measurement data sets and t-test results are compared between BACs of eight time … imputation methods

Table 16. Classification results of three leading imputation methods


Chapter 1 Introduction

1.1 Motivation

Hepatocellular carcinoma (HCC) is the most common primary liver cancer [1-3] and has been the leading cause of cancer death in Taiwan since 1984 [4]. The treatments for HCC include liver transplantation, surgical resection, local ablation therapy (e.g., percutaneous ethanol injection therapy (PEIT), microwave coagulation therapy (MCT), and radiofrequency ablation (RFA)), and transcatheter arterial chemoembolization (TACE) [5-6].

Studies reported that ablative therapy was effective mainly in patients with smaller HCC tumors [7-12]. Among local ablation therapies, RFA became the most frequently adopted. For patients with early-stage HCC who were not suitable for surgical resection or liver transplantation, RFA was the best alternative treatment [13].

In previous studies, researchers estimated that the cumulative 5-year recurrence rate was more than 70% for patients who received RFA [14-16].

The goal of this work is to develop a recurrence predictive model, based on clinical data, for patients who have received RFA treatment, in order to support decision-making (e.g., scheduling more frequent follow-up to track the clinical status of patients at higher risk of recurrence).


Figure 1. The overview of this study: from clinical narrative reports to recurrence predictive model

This study covers the following aspects (Figure 1): (1) developing an information extraction module for extracting clinical factors from mixed types of clinical narrative reports, (2) developing rule-based classifiers for identifying patients who meet the clinical research eligibility criteria from large numbers of clinical narrative reports, (3) developing and implementing the Flowchart-Based Data-Querying Model (FBDQM) for executing query tasks and presenting the query results in a visualized graphical interface, (4) using missing data handling methods for processing clinical data with missing values, and (5) developing a recurrence predictive model.

1.2 Hepatocellular Carcinoma

According to global cancer statistics, liver cancer is the second leading cause of cancer death in men and the sixth leading cause of cancer death in women [17], and HCC is the most common primary liver cancer [2, 18-19]. One study reported that both the incidence and mortality rates of HCC continued to increase in the United States from 1975 to 2005 [20]. HCC has been the leading cause of cancer death in Taiwan since 1984 [4].

According to the HCC literature [13], a consensus-based practice guideline for hepatology [21], and the clinical practice guidelines for hepatobiliary cancers of the National Comprehensive Cancer Network (NCCN) [12], the following clinical factors related to HCC were collected to represent a patient’s disease progression over time (from screening and diagnosis through treatment to follow-up). In screening, patients at high risk of HCC (e.g., those infected with hepatitis B virus (HBV) or hepatitis C virus (HCV), or those with liver cirrhosis) were recommended to have their serum alpha-fetoprotein (AFP) value tested and to receive ultrasound examinations for a specific period of time. Depending on the screening results (e.g., a liver mass or nodule found by ultrasound examination), patients might need further examinations (i.e., pathology and/or typical vascular enhancement patterns in imaging examinations) for HCC diagnosis. Tumor information, including number, location, and size variation, was observed. Liver function as measured by the Child-Pugh score [22] and cancer staging (e.g., tumor-node-metastasis (TNM) staging [23] and the Barcelona Clinic Liver Cancer (BCLC) staging classification [24]) were also evaluated to provide reference information for decision-making in the assessment of HCC treatment. After the specific treatment, the patient’s status was observed in continuous follow-up for checking the patient’s response to the


including ultrasound reports, radiology reports, pathology reports, operation notes, admission notes, and discharge summaries. In the proposed environment, the laboratory test results could be directly retrieved from the structured data sources.

To track patients’ conditions efficiently over long periods and to facilitate the collection of clinical features relevant to HCC from narrative reports, an efficient method for analyzing the clinical data accumulated in narrative reports is critical; that is, a process of information extraction (IE) is required.

1.3 Related Work

1.3.1 Handling Data from Clinical Narrative Reports

During the past two decades, researchers have successfully applied natural language processing (NLP) to medical narrative reports for case finding, case identification, and information extraction for specific diseases and topics (e.g., tuberculosis [25], pneumonia [26-28], trauma [29], heart failure [30-31], diabetes [32], hypertension [33], smoking [34-36], and cancer [37-40]), for identifying and detecting problems [41] and adverse events [42-43], for acquiring relations (e.g., disease-manifestation related symptoms and drug-adverse drug events [44]), and for assessing the quality of care [45].

To reduce the information overload faced by clinicians reviewing large numbers of patient records, previous works focused on creating concept-oriented and problem-oriented views of patient records. A concept-oriented information retrieval system was developed to reduce information overload by retrieving relevant reports based on predefined concepts [46-47]. Three views for displaying the retrieved reports were provided: “view by department”, “view by time”, and “view by topic”. A system with problem-centered organization was developed for organizing and navigating patient records, including radiology reports and discharge summaries [48]. Other related problem-oriented works included a system for automating the creation and maintenance of a problem list [41] and semantic relation classification for problem-oriented medical records on discharge summaries [49-50].

Radiology reports are an extremely important source of clinical data and may be the most studied type of clinical narrative report [51]. Previous works focused on classifying [52-53], extracting [54-56], encoding [57-60], and mapping findings, expressions, and diagnoses [26, 61]. Numerous works also focused on extracting information from pathology reports to collect the characteristics of cancers, including breast cancer [62], prostate cancer [63], and lung cancer [64]. Other systems and methods designed for extracting information from pathology reports include MEDSYNDIKATE [65-66], caTIES [67-68], MedTAS/P [69], and the works of Schadow and McDonald [70], Liu et al. [71], and Napolitano [72].

The efforts of this system are directed at the following respects. To handle information extraction with mixed languages, synonyms, and typical and atypical abbreviations, it follows the approach proposed by Friedman et al. [73] and identifies the liver cancer concepts and their various expressions. To extract information from clinical reports containing grammatical or ungrammatical sentences, the clinical data are retrieved and extracted based on the predefined concept model using the hot-spotting technique [36] based on regular expressions. To have a better understanding about a


work not only could extract related information (i.e., filtering out the reports without the target information), but could also group duplicated results extracted from different reports. To provide summarized information based on the extracted evidence, the method proposed by Denny et al. [74], which handled ‘oblique’ references to colorectal cancer (CRC) screening, is extended to develop rule-based classifiers that answer the patient’s summarized status by providing direct and indirect evidence.

1.3.2 Data Query from Structured Data Sources

Enormous numbers of medical records accumulate in the clinical data repository due to the increasing use of electronic medical record (EMR) systems. These large volumes of clinical data are helpful for discovering new knowledge and improving the quality of healthcare [75-76].

A query approach is required so that domain experts can better understand the complex data by retrieving the desired data from the enormous volume of clinical data. Previously, query approaches were proposed and utilized for assisting domain experts in retrieving clinical data based on query tasks for further analyses [77-79].

In the literature review, works relevant to data querying from clinical data repositories could be categorized by three considerations: (1) the design of the query interface with which users formulate query tasks, (2) the representation of query results, and (3) the models used for formulating query criteria.

(1) The design of query interface for users formulating query tasks. Query tasks could be formulated in two different ways. One was to formulate the query logic using a low-level query language such as Structured Query Language (SQL). The other was to formulate the query tasks through query-building tools, which let users easily create query tasks using the features the tools provide.

The use of a low-level query language such as SQL might involve the following difficulties. First, experience in database querying was required. Users who queried data directly with SQL commands needed to understand the details of the database, including the definitions of tables, the types of columns in the tables, and the relationships among the tables. Second, the SQL statements and query processes might be complicated. When a data query scenario contained a sequence of query processes, the SQL statement became complicated, and the results of the intermediate processes were hard to analyze.
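The second difficulty can be illustrated with a minimal sketch in Python using the standard `sqlite3` module. The table names `diagnosis` and `lab_result`, their columns, the sample rows, and the AFP threshold are all hypothetical and are not taken from the clinical data repository described in this work.

```python
import sqlite3

# Hypothetical in-memory tables standing in for a clinical data repository.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE diagnosis (patient_id INTEGER, code TEXT);
    CREATE TABLE lab_result (patient_id INTEGER, item TEXT, value REAL);
    INSERT INTO diagnosis VALUES (1, 'HCC'), (2, 'HCC'), (3, 'OTHER');
    INSERT INTO lab_result VALUES (1, 'AFP', 25.0), (2, 'AFP', 8.0), (3, 'AFP', 40.0);
""")

# Step 1 (patients with an HCC diagnosis) and step 2 (AFP above a threshold)
# are fused into one nested statement, so the intermediate patient list is
# never visible unless the inner query is run separately.
nested_sql = """
    SELECT patient_id FROM lab_result
    WHERE item = 'AFP' AND value > 20
      AND patient_id IN (SELECT patient_id FROM diagnosis WHERE code = 'HCC')
    ORDER BY patient_id
"""
final_ids = [row[0] for row in conn.execute(nested_sql)]

# Inspecting the intermediate step requires writing and running it on its own.
step1_ids = [row[0] for row in conn.execute(
    "SELECT patient_id FROM diagnosis WHERE code = 'HCC' ORDER BY patient_id")]
```

Even with only two steps, the user has to know both table definitions and has to rerun part of the statement to see which patients passed the first step; longer sequences of query processes compound the problem.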

To decrease the complexity of formulating query tasks directly with SQL commands, query-building tools were proposed and developed for assisting users in creating query tasks. RetroGuide was proposed for assisting users with limited database experience in formulating query tasks [79]. The query tasks were formulated as flowcharts using a workflow editor, and the query criteria were specified in the nodes of the flowchart.

(2) The representation of query results. Query results could be presented in different formats, including free text, structured tables, and visualized charts. Richer representations could help users better understand the query results.

RetroGuide provided table-based three-level hierarchical reports of query results, including a summary report, a detailed report, and an information view of an individual patient [80]. Mabotuwana et al. introduced a prescription timeline visualization utility to display the prescription information through visualized timeline graphical charts for


interoperable formulation information models, the opportunity of querying different clinical data repositories in a consistent way was increasing. Besides, the query capability could be enhanced using the models with powerful expressions for the query criteria formulation.

Austin et al. defined an information model for designing generic interfaces to EMR systems [78]. They collected various sets of example clinical questions that could be represented as database queries, inferred general patterns from these queries, and designed an information model to represent clinical research queries.

Mabotuwana et al. [77] proposed an ontology-driven [81-84] approach to formulate query criteria for enhancing query capabilities of general practice medicine for better management of hypertension.

The goal of this study is to develop an approach that enriches data query in the three considerations mentioned above. The proposed approach not only assists users in formulating query tasks using flowchart-based models, but also presents the query results through a visualized graphical interface. In addition, it provides the opportunity to enhance the query capability through the adoption of an ontology-based guideline representation language, GLIF3.5.

1.3.3 Handling Missing Data during the Development of Predictive Model

In medical research, missing data frequently occur during data analyses. According to a review study, 81 of 100 articles from seven cancer journals mentioned evidence of missing data [85]. In this study, the clinical data of patients with hepatocellular carcinoma (HCC) also include missing data. Therefore, methods for dealing with missing data are necessary during the development of the recurrence predictive model. In this study, the recurrence predictive model is developed for predicting the recurrence status of patients with HCC after radiofrequency ablation (RFA) treatment.

To develop predictive models from incomplete clinical data, several methods have been proposed for dealing with missing data before the data analyses [86]: (1) complete case analysis, which analyzes only the data of patients who have no missing data, (2) complete variable analysis, which analyzes only the variables (i.e., features) that have no missing data, and (3) imputation, which fills in the missing values using an imputation method.
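The three strategies can be contrasted on a toy data set, as in the following minimal Python sketch. The feature names and values are hypothetical, and mean imputation is used only as the simplest example of an imputation method; it does not represent the set of imputation methods evaluated in this study.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
records = [
    {"afp": 12.0, "alt": 40.0, "size_cm": 2.0},
    {"afp": None, "alt": 55.0, "size_cm": 3.0},
    {"afp": 30.0, "alt": None, "size_cm": 1.5},
]
features = ["afp", "alt", "size_cm"]

def complete_case(rows):
    """Keep only patients (rows) that have no missing value."""
    return [r for r in rows if all(r[f] is not None for f in features)]

def complete_variable(rows):
    """Keep only features (columns) that are never missing."""
    kept = [f for f in features if all(r[f] is not None for r in rows)]
    return [{f: r[f] for f in kept} for r in rows]

def mean_impute(rows):
    """Fill each missing value with the mean of the observed values of that feature."""
    means = {f: mean(r[f] for r in rows if r[f] is not None) for f in features}
    return [{f: (r[f] if r[f] is not None else means[f]) for f in features}
            for r in rows]
```

On this toy data, complete case analysis keeps one of three patients and complete variable analysis keeps one of three features, whereas imputation keeps every patient and feature at the cost of estimating the missing entries.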

Previous studies compared the methods mentioned above. Janssen et al. employed logistic regression to model the risk of deep venous thrombosis (DVT) using complete case analyses, complete variable analyses, and multiple imputation methods for estimating the missing values. According to their study, multiple imputation methods were preferred over complete case analyses and complete variable analyses [86].

Similarly, Dimou et al. evaluated imputation methods in an ovarian tumor diagnostic model based on generalized linear models and support vector machines. According to their study, missing value imputation was preferred over complete case analyses [87].

For the evaluation of imputation methods, previous studies could be divided into two categories. The first one used normalized root mean squared error (NRMSE) for evaluating the imputation accuracy of the imputation methods [88-91], and the second


methods are employed in presenting the imputation accuracy of the imputation methods for specific data sets and their influence on predictive models which are developed based on imputed data sets.
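The normalized root mean squared error (NRMSE) named in this section as a measure of imputation accuracy can be sketched as follows for a simulation in which known values are hidden and then imputed. The normalization by the variance of the true values is one common convention and is an assumption here; the cited studies and this work may normalize differently.

```python
from math import sqrt
from statistics import mean, pvariance

def nrmse(true_values, imputed_values):
    """Square root of (mean squared imputation error / variance of the true values)."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return sqrt(mean(squared_errors) / pvariance(true_values))

# Hypothetical values that were hidden in a simulation and then imputed.
true_vals = [1.0, 2.0, 3.0, 4.0]
perfect = nrmse(true_vals, true_vals)        # every value recovered exactly
mean_filled = nrmse(true_vals, [2.5] * 4)    # every value replaced by the mean
```

Under this normalization, a perfect imputation scores 0 and replacing every hidden value with the mean of the true values scores 1, so values below 1 indicate an improvement over that naive baseline.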


Chapter 2 Information Extraction for Tracking Liver Cancer Patients’ Statuses: from Mixture of Clinical Narrative Report Types

The purpose of this work is to provide liver cancer patients’ statuses from several types of narrative reports to facilitate clinical research at three levels. (1) The single report level identifies the major target information and its related attributes in each report. (2) The cross-report temporal level sorts and arranges the extracted results by temporal order and report type, in order to present the changes in each observed target or to obtain a specific status before or after a specific time point. (3) The summary level provides answers as to whether patients’ statuses meet the criteria specified in the clinical research.

The proposed method contains two phases and is designed for tracking patients’ disease progression over time and for answering whether patients meet specific criteria of a clinical research study, using mixed types of narrative reports. The overview of this method is shown in Figure 2. In the first phase, an information extraction module is developed for extracting concepts related to a patient’s disease from narrative reports. After extraction, the extracted concepts are sorted and arranged to group duplicated information and to present the changes in patients’ statuses over time. In the second phase, a rule-based classifier is developed for answering three clinical questions according to the arranged information.
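The sorting, grouping, and time-point lookup described for the cross-report temporal level can be sketched in Python as follows. The record layout, the concept name `tumor_size_cm`, and the sample values are hypothetical; this is an illustration of the idea, not the module implemented in this study.

```python
from datetime import date

# Hypothetical extracted results: (report date, report type, concept, value).
extracted = [
    (date(2008, 5, 2), "ultrasound", "tumor_size_cm", 2.1),
    (date(2008, 3, 1), "radiology", "tumor_size_cm", 1.8),
    (date(2008, 5, 2), "discharge", "tumor_size_cm", 2.1),  # repeats the ultrasound finding
]

def arrange(results):
    """Sort by date, then merge entries that share the same date, concept, and value."""
    grouped = {}
    for day, report_type, concept, value in sorted(results, key=lambda r: (r[0], r[1])):
        grouped.setdefault((day, concept, value), []).append(report_type)
    return [(day, concept, value, types)
            for (day, concept, value), types in grouped.items()]

def status_before(timeline, concept, cutoff):
    """Return the latest value of a concept observed strictly before the cutoff date."""
    values = [v for day, c, v, _ in timeline if c == concept and day < cutoff]
    return values[-1] if values else None

timeline = arrange(extracted)
```

Here the duplicated finding from the discharge summary is merged with the ultrasound finding of the same date, and `status_before(timeline, "tumor_size_cm", date(2008, 4, 1))` looks up the status that held before a chosen time point.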


Figure 2. The overview of this method: the information extraction module phase and rule-based classifier phase

2.1 Material

In this study, narrative reports are collected from 152 patients who received ultrasound-guided radiofrequency ablation (RFA) at National Taiwan University Hospital (NTUH) between 2007 and 2009. These reports are divided into two groups (a development set and a testing set) for developing and evaluating the methods. The development set contains narrative reports from 74 patients who received RFA between 2007 and 2008 and is used for developing the information extraction module and the rule-based classifier. The testing set contains narrative reports from 78 patients who received RFA in 2009. The reports in the testing set were not previously reviewed or analyzed by the system developers, and this data set is used for evaluating the proposed methods. Different types of reports are created by different groups of clinicians. For example, radiology reports are created by about five radiologists specialized in abdominal interpretation; ultrasound reports are created by about 15 trained gastroenterology-hepatology specialists; discharge summaries and admission notes are created by supervised attending physicians with after-review of trained residents; and operation notes are created by about five liver surgeons.

The time span of a patient’s reports in the development set ranges from 0.13 years to 10 years (a range of 9.87 years; mean ± standard deviation, 5.63 ± 3.41 years), and the average number of reports per patient is 36. In the testing set, the time span ranges from 0.15 years to 10 years (a range of 9.85 years; mean ± standard deviation, 5.55 ± 3.58 years), and the average number of reports per patient is 30.

A total of 759 reports are used in the evaluation of concept identification. All of the patients’ personal reports (2351 reports) are used in the evaluation of classifying the personal summarized status. The report types account for the following percentages of all reports: radiology (53%), ultrasound (24%), discharge (14%), pathology (5%), admission (2%), and operation (2%).

In this environment, the computerized patient records contain structured data (such as laboratory test results) and non-structured data (such as narrative reports). This study focuses on tracking data from narrative reports, but not from structured data sources such as laboratory test results which could be analyzed directly.

2.2 Concept Model and Expressions

2.2.1 Concepts

Obeying the approach proposed by Friedman et al. [73], a liver cancer concept model is defined for assisting in the information extraction (IE). In the work, the important clinical factors are mentioned in the existed research, including cancer diagnosis, cancer staging, tumor information, comorbidity diagnosis, treatment, and recurrent status [12-13, 21]. These clinical factors are regarded as the major concepts in the study and are used for tracking and observing the patients’ disease progression over time.

To identify the information related to a major concept, each major concept has a corresponding set of related concepts that specify its relevant information; a single major concept contains multiple related concepts. Regular expression sets for the major and related concepts are used to identify them in the reports. For instance, the major concept tumor contains the related concepts tumor, lesion, mass, and nodule. Furthermore, each related concept has a character property (positive or negative) that indicates whether the matched information is positively correlated information, which should be reserved, or unconcerned information, which should be filtered out.

Temporal information and report types are related concepts common to all major concepts; they are used for tracking patients’ disease progression and confirming the examination sources. Although the date and report type of each narrative report are available in other structured data sources, extracting temporal information from the narratives is still required. For example, clinicians may mention a patient’s history records in a narrative report, and the temporal concepts have to be identified and extracted from these reports. The system can thus manage history records marked with time and report type. The diagnosis status is also a common related concept, used for checking the further status of a finding. Table 1 shows the major concepts and a listing of their related concepts in this study.

Table 1. The major concepts and their related concepts

Major concepts | Related concepts
Diagnosis of cancer (HCC) | Diagnosis of cancer, Diagnosis status (i.e., confirmed, suspected, and no evidence of finding), Temporal information, Report type
Staging (BCLC) | BCLC staging (e.g., BCLC: class A), Diagnosis status, Temporal information, Report type
Tumor | Tumor major object (e.g., tumor, lesion, mass, and nodule), Target location (e.g., liver and segment # 7), Size (e.g., 1 cm), Quantifier (e.g., one, two, and three), Temporal information, Report type, Non-target location (e.g., breast), Non-tumor size items (e.g., LeVeen, which is used in RFA treatment)
Comorbidity diagnosis (Liver Cirrhosis) | Comorbidity diagnosis (i.e., liver cirrhosis), Diagnosis status, Child-Pugh staging (e.g., Child-Pugh score: class A), Temporal information, Report type
Treatment | Treatment type (e.g., RFA and TACE), Treatment status (e.g., performed and status post, for confirming that the treatment was performed), Temporal information, Report type
Recurrent Status | Recurrent status, Diagnosis status, Temporal information, Report type


Table 2. The textual expression examples of different concepts

Categories | Expression examples
Diagnosis of cancer | HCC, hepatocellular carcinoma, hepatoma
Diagnosis status | suspected, suspicious, impression, probable, definite, no definite, no evidence of, without
BCLC staging | BCLC stage A1, BCLC A1, BCLC clinical stage A2, Barcelona clinic liver cancer stage A4
Tumor major object | tumor, lesion, mass, nodule, metastasis, HCC, hepatoma
Target location | liver, hepatic, right superior liver, medial segment of left hepatic lobe, segment 7, S#8
Size | 0.6cm, 1.8 x 1.4cm, 0.3 x 0.5 x 0.7 cm, 3.7~4cm, 4.0 * 3.8 cm, 24.0 mm
Quantifier | a, one, single, two, three, four, five, several, multiple
Liver Cirrhosis | liver cirrhosis, cirrhosis of the liver, liver cirrhosis(+)
Child-Pugh staging | Child's B, Child A, Child's class A, Child classification A, Child-Pugh A, Child-Pugh classification A
Treatment type | RFA, radiofrequency ablation, TACE, PEI, PMCT, hepatectomy, liver transplantation
Treatment status | status post, s/p, perform, receive, evaluate, suggest, arrange, prefer, discuss
Recurrent (HCC) | tumor recurrence, recurrent, no recurrent HCCs, no definite evidence of tumor recurrence
Temporal information | 2007/11/11, 2007-08-04, 2010_03_10, 2010.03.04, 099/10/27, June 10, 2009
Report type | Computed Tomography (CT), Magnetic resonance imaging (MRI), MR, angiography, echo, sonar, ultrasonography, ultrasound, 超音波檢查 (ultrasound examination)

2.2.2 Expressions

To identify the concepts in clinical texts, each concept has a corresponding set of regular expressions that match its different textual expressions. These regular expressions are defined based on the reports in the development set. The reports in the development set were reviewed beforehand, and the different expressions of each concept, including synonyms, typical and atypical abbreviations, and common misspellings, were manually identified. Table 2 shows examples of expressions for the concepts.

2.3 Information Extraction Module (First phase)

2.3.1 Identification of Major and Related Concepts

To identify the locations of the major concepts of interest (e.g., the expressions of tumor: tumor, lesion, nodule, and mass) in the clinical text, the hot-spotting technique is employed. The identified locations serve as the bases from which related information (e.g., location and size) is searched in the surrounding text. After the major concepts are located, their related concepts are captured within a predefined window. Each concept has a matching property whose value, major or related, specifies whether the concept is a major or a related concept. To match the concepts in the text, the regular expressions defined in the concept model are used for capturing the texts and values relevant to the major concepts and their corresponding related concepts.
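The hot-spotting idea can be illustrated with a minimal Python sketch. The regular expressions, the window size of 40 characters, and the function name `hot_spot` are assumptions made for illustration only; the expression sets actually used in this study are derived from the development-set reports and are far more extensive.

```python
import re

# Hypothetical, heavily simplified patterns (illustration only).
MAJOR_TUMOR = re.compile(r"\b(tumor|lesion|mass|nodule)s?\b", re.IGNORECASE)
RELATED = {
    "size": re.compile(r"\d+(?:\.\d+)?\s*(?:cm|mm)\b", re.IGNORECASE),
    "location": re.compile(
        r"\bS#?\d\b|\bsegment\s*#?\s*\d\b|\bliver\b", re.IGNORECASE
    ),
}


def hot_spot(text, window=40):
    """Locate major concepts, then collect related concepts near each one."""
    findings = []
    for major in MAJOR_TUMOR.finditer(text):
        # The search window extends `window` characters on both sides
        # of the matched major concept (assumed window definition).
        lo = max(0, major.start() - window)
        hi = min(len(text), major.end() + window)
        related = {
            name: [m.group(0) for m in pattern.finditer(text, lo, hi)]
            for name, pattern in RELATED.items()
        }
        findings.append(
            {"major": major.group(0), "span": major.span(), "related": related}
        )
    return findings
```

For the sentence “A 3.3 cm tumor at S8.”, this sketch locates the major concept “tumor” and captures the size and location expressions found inside the window around it.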

2.3.2 Binding of Matched Concepts

For disambiguating the relationships among multiple extracted concepts in the same document or even the same sentence, a rule-based scheme is built for binding


first checked. When the closest location concept is already bound with another size concept, the second closest location concept, rather than the closest one, is selected for binding (e.g., in “A 3.3 cm tumor at S8 and a 1.0 cm tumor at S7.”, “3.3 cm” is bound with “S8”, and “1.0 cm” is bound with “S7” because “S8” is already bound with “3.3 cm”). Multiple size concepts might share the same location concept (e.g., in “Two lesions are noted at S6 (1.4cm and 2.0cm)”, “1.4cm” and “2.0cm” share the location concept “S6”). From the quantifier value “two”, the method recognizes two pieces of tumor information. From the conjunction “and” in “1.4cm and 2.0cm” and the single location concept “S6”, the method confirms that the two tumors share the location concept.
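The binding behavior in the two example sentences can be sketched in Python as follows. This is a simplified illustration and not the rule set of this study: the patterns and the names `gap` and `bind_sizes` are hypothetical, distance is measured in characters, and a location is shared whenever no unbound location remains, whereas the method described above additionally checks the quantifier and the conjunction.

```python
import re

# Hypothetical, simplified patterns (illustration only).
SIZE = re.compile(r"\d+(?:\.\d+)?\s*cm", re.IGNORECASE)
LOCATION = re.compile(r"\bS#?\d\b")


def gap(a, b):
    """Character distance between two non-overlapping regex matches."""
    return max(a.start() - b.end(), b.start() - a.end())


def bind_sizes(sentence):
    """Bind each size concept to a location concept, in text order."""
    sizes = list(SIZE.finditer(sentence))
    locations = list(LOCATION.finditer(sentence))
    bound = set()  # indexes of locations that are already bound
    pairs = []
    for size in sizes:
        ranked = sorted(
            range(len(locations)), key=lambda i: gap(size, locations[i])
        )
        free = [i for i in ranked if i not in bound]
        # Prefer the closest unbound location; if every location is
        # already bound, share the closest one (assumed simplification).
        choice = free[0] if free else (ranked[0] if ranked else None)
        if choice is None:
            pairs.append((size.group(0), None))
            continue
        bound.add(choice)
        pairs.append((size.group(0), locations[choice].group(0)))
    return pairs
```

In the first example sentence, “S8” is the closest location to both sizes in character distance, so the second size falls back to “S7”; in the second example, both sizes are bound to the only location, “S6”.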

A set of rules is defined for reserving or filtering out the bound concepts according to their character property. A positive concept indicates positively correlated information to be reserved; a negative concept indicates unconcerned information to be filtered out. For example, when the extraction target is the tumor in “liver”, matched tumor information such as “the tumor in the breast” is filtered out based on the identification of a negative concept, the non-target location (i.e., breast in this example).

For temporal information extraction (TIE), the method extracts only temporal information explicitly stated in the reports. To identify the different expressions of temporal information, their patterns are encoded as regular expressions. The temporal elements (year, month, and date) might appear together (e.g., “2011/10/27”) or separately (e.g., “In 2011, ..., 10/27”). When only partial temporal concepts are identified (e.g., “month/date”), the method searches the surrounding text for the other related temporal information (e.g., “year”) and combines these concepts into one complete temporal concept (e.g., “year/month/date”).

2.3.3 Normalization

The normalization process handles the following tasks: (1) normalizing synonyms of the same concept; (2) normalizing the various extracted textual expressions that mention the same concept by full name, typical abbreviation, or atypical abbreviation; (3) standardizing different units of the same numeric clinical measurement; (4) translating mixed-language text (English and Chinese) into English (only for certain items, such as names of examinations); and (5) standardizing two types of year notation, Anno Domini (A.D.) and the “year” used in Taiwan, for which the A.D. year equals the Taiwan year plus 1911 (e.g., “100/10/27”, given in the “year” used in Taiwan, is standardized as “2011/10/27”).
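Task (5) can be illustrated with a short Python sketch. The function name `normalize_date`, the accepted separators, and the rule that a year written with at most three digits is a Taiwan-calendar year are assumptions for illustration; textual dates such as “June 10, 2009” are not handled here.

```python
import re

# Numeric dates with the separators seen in Table 2: / - _ .
DATE = re.compile(r"^(\d{2,4})[/\-_.](\d{1,2})[/\-_.](\d{1,2})$")


def normalize_date(text):
    """Return the date as an A.D. 'YYYY/MM/DD' string, or None if unparsed."""
    match = DATE.match(text.strip())
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    # Assumption: a year written with at most three digits is a
    # Taiwan-calendar year, which is converted by adding 1911.
    if len(match.group(1)) <= 3:
        year += 1911
    return f"{year:04d}/{month:02d}/{day:02d}"
```

With this rule, “100/10/27” and “099/10/27” are converted to A.D. dates, while four-digit years such as “2007-08-04” only have their separators standardized.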

2.3.4 Sorting and Grouping

The previous examination results might be mentioned again as a patient’s clinical history in a current examination report, and the duplicated descriptions of findings might frequently appear in different reports. To reduce the duplicated information scattered among different reports, the extracted results are further sorted and grouped.

All major concepts in the model contain the temporal and report type related concepts. These extracted concepts from the clinical text can be sorted according to the temporal information of reports. A grouping method congregates extracted clinical findings which originally came from the same report but appeared and were mentioned in different reports. For example, a patient’s status of liver cirrhosis was originally described in the ultrasound report on 2009/5/1 and this finding was also described in the


came from the ultrasound report on 2009/5/1 (relevant to the diagnosis of liver cirrhosis) are grouped together.

2.4 Rule-based classifier for summarizing the patient’s status (Second phase)

A rule-based classifier is designed for answering questions about the patient’s summarized status and for checking whether a patient meets specific criteria, providing direct and indirect evidence for the summarized results. The classifier answers the following questions. (1) HCC patient: Does the patient have HCC? (2) First treatment: Is the specific treatment the patient’s first treatment for HCC? (3) Recurrent HCC: Has the patient’s HCC recurred after the specific treatment? Table 3 shows examples of evidence for these questions. According to the direct or indirect evidence found in the sorted and grouped extraction results, the classifier gives the answer to each question.

When conducting clinical research, definite eligibility is required to determine which patients can be included in the data set for analysis. Therefore, the classifier has to give binary answers, such as whether the patient’s HCC is recurrent (yes or no). When the status is suspected, the answer is “no” (because recurrence is not confirmed). For a positive HCC patient, who indeed has HCC, the classifier might provide direct evidence such as confirmed diagnoses from pathology or image reports. For a negative first treatment, the classifier might provide the following evidence. (1) Direct evidence: another treatment was performed before the current RFA treatment. (2) Indirect evidence: a description of the patient’s status after a specific treatment is found before the current RFA treatment, or recurrent HCC is found before the current RFA treatment. For a positive recurrent HCC (i.e., the HCC is confirmed recurrent), the classifier might provide direct evidence extracted from pathology or image reports. For a negative recurrent HCC, the classifier might provide evidence of (1) confirmed diagnoses of no evidence of recurrent HCC or (2) a suspicious recurrent status that has not been further confirmed.

Table 3. The evidence examples of a patient’s summarization status

Statuses of summarization | Example evidence
Positive HCC patient | Direct evidence: In the pathology report, the evidence sentence is found: “it shows a hepatocellular carcinoma in microtrabecular pattern”
Negative HCC patient | Indirect evidence: There is no HCC diagnosis in all reports
Positive first treatment | Indirect evidence: There is no other treatment before the specific RFA treatment and there is no recurrent HCC statement before the specific RFA treatment
Negative first treatment | Direct evidence: In the operation note, the evidence sentence is found (Other treatment is performed before current RFA treatment): “Atypical hepatectomy is performed”. Indirect evidence: The description of recurrent HCC is found before current RFA treatment: “Recurrent HCC at S8”
Positive recurrent HCC | Direct evidence: In the radiology (CT) report, the evidence sentence is found: “Recurrent HCC at S8 status post RFA”
Negative recurrent HCC | Indirect evidence: There is no recurrent HCC information in all reports (only the “non-recurrence” statement: “no evidence of local recurrence is noted”)


Moreover, when concepts are extracted from more than one type of report, the report types are also used during classification. For example, when checking confirmed diagnoses, the method checks the extracted concepts in the order of pathology, radiology, ultrasound, and other types of reports.
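A minimal Python sketch of how the report-type order and the binary answer could interact is given below. The field names (`report_type`, `status`, `sentence`) and the decision rule (the first confirmed finding in priority order yields “yes”, anything else yields “no”) are assumptions for illustration; the rules of the classifier in this study are more detailed.

```python
# Report types in the checking order described above; any other
# report type is ranked after these three.
REPORT_PRIORITY = ["pathology", "radiology", "ultrasound"]


def rank(report_type):
    """Position of a report type in the checking order."""
    if report_type in REPORT_PRIORITY:
        return REPORT_PRIORITY.index(report_type)
    return len(REPORT_PRIORITY)


def classify_recurrence(findings):
    """Return (answer, evidence) for the question 'Is the HCC recurrent?'.

    `findings` is a list of dicts with the hypothetical keys
    'report_type', 'status', and 'sentence'.
    """
    for finding in sorted(findings, key=lambda f: rank(f["report_type"])):
        if finding["status"] == "confirmed":
            return "yes", finding
    # Suspected or absent findings give a negative answer, because
    # recurrence is not confirmed.
    return "no", None
```

Because the sort is stable, findings of the same report type keep their original (for example, chronological) order.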

2.5 Evaluation for information extraction module and classification

2.5.1 Report Annotation

Gold-standard annotations for concepts in reports and classifications of patients’ personal summarized statuses are determined by annotators. Each report is annotated by two annotators. When the two annotators disagree, a third annotator adjudicates the annotations and classifications.

2.5.2 Inter-Annotator Agreement

The annotated results of two annotators for information extraction and classification are evaluated based on the inter-annotator agreements (IAAs). In information extraction, F-score is employed in measuring IAA due to the following two reasons [92]. (1) In the tasks of text mark-up, the number of negative cases is poorly defined because the phrases might overlap and vary in length. F-score could be calculated without the number of negative cases. (2) F-score approximates traditional kappa [93-95] due to the large number of true negatives. In the tasks of classification, Cohen’s kappa is directly employed in measuring IAA because both positive and negative cases of patients’ classifications could be specified clearly.


k = (P(a) - P(e)) / (1 - P(e)), (2.1)

where P(a) is the relative observed agreement among the annotators and P(e) is the expected agreement due to chance.

Total cases = (TP + FP + FN + TN), (2.2)

P(a) = (TP + TN) / Total cases, (2.3)

P(e) = ((TP + FP) / Total cases) * ((TP + FN) / Total cases) + ((FN + TN) / Total cases) * ((FP + TN) / Total cases), (2.4)

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative.
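As a worked illustration, Equations (2.1) to (2.4) can be written as a small Python function. The function name `cohen_kappa` and the example counts are hypothetical and are not results of this study.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from a 2x2 agreement table, following Eq. (2.1)-(2.4)."""
    total = tp + fp + fn + tn                        # Eq. (2.2)
    p_a = (tp + tn) / total                          # Eq. (2.3)
    p_e = ((tp + fp) / total) * ((tp + fn) / total) \
        + ((fn + tn) / total) * ((fp + tn) / total)  # Eq. (2.4)
    # Undefined (division by zero) when p_e equals 1.
    return (p_a - p_e) / (1 - p_e)                   # Eq. (2.1)
```

For example, with TP = 20, FP = 5, FN = 10, and TN = 15, P(a) = 0.7 and P(e) = 0.5, so kappa = (0.7 - 0.5) / (1 - 0.5) = 0.4.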

2.5.3 Evaluation Metrics for Information Extraction Module

Precision, recall, and F-score are used for evaluating this proposed information extraction module. These metrics are frequently employed in evaluating methods proposed for information extraction [69, 96-97].

The definitions are as follows.

Precision = TP / (TP + FP), (2.5)

Recall = TP / (TP + FN), (2.6)

F-score = (2 * precision * recall) / (precision + recall). (2.7)
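The following Python sketch shows one way to obtain these three metrics from annotated and extracted concepts. It assumes that a concept counts as a true positive only when the extracted item exactly matches a gold-standard item; this matching criterion, the function name `extraction_scores`, and the zero-denominator handling are assumptions for illustration.

```python
def extraction_scores(gold, predicted):
    """Precision, recall, and F-score under exact matching (assumed)."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # extracted and in the gold standard
    fp = len(predicted - gold)   # extracted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Note that no count of true negatives is needed, which is the reason the F-score is also suitable for measuring agreement in text mark-up tasks.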

2.5.4 Evaluation Metrics for Classification

The following metrics are employed in evaluating the performance of this rule-based classifier, including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy.

Sensitivity (= Recall) = TP / (TP + FN), (2.8)

Specificity = TN / (TN + FP), (2.9)

PPV (= Precision) = TP / (TP + FP), (2.10)

NPV = TN / (TN + FN), (2.11)

Accuracy = (TP + TN) / (TP + FP + FN + TN). (2.12)
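For illustration, the classification metrics can be computed from the four confusion-matrix counts as in the Python sketch below. The function name `classification_metrics` and the example counts are hypothetical; specificity is computed with its standard definition, TN / (TN + FP).

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV, and accuracy from 2x2 counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```

For example, TP = 30, FP = 10, FN = 20, and TN = 40 give a sensitivity of 0.6, a specificity of 0.8, a PPV of 0.75, an NPV of about 0.667, and an accuracy of 0.7.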


Chapter 3 Data Query from Structured Data Sources

An overview of the methods is shown in Figure 3. The text-based query tasks are formulated using the ontology of the clinical guideline representation language, GLIF3.5, through the editing environment provided by Protégé [98-100]. The formulated query tasks can be exported in Extensible Markup Language (XML) format by a plug-in tool of Protégé. The exported XML-format instances are imported into the proposed query execution engine, which is based on the Flowchart-Based Data-Querying Model (FBDQM). The engine interprets the XML-format query tasks formulated using GLIF3.5, executes the query operations, retrieves clinical data, and presents the query results.

Figure 3. The overview of the methodology used in the data query tool based on


3.1 Clinical Practice Guideline Representation Languages

Clinical guideline representation languages are used for formulating paper-based clinical guidelines in a computer-interpretable format. Numerous clinical practice guideline representation languages and models have been developed by medical institutions, including GLIF, EON, Asbru, GUIDE, PRODIGY, and PROforma [101-102]. These languages can represent medical logic, criteria, and data items. In this study, this capability is adopted for formulating the query criteria, and the ontology-based representation language GLIF3.5 is chosen for formulating the query tasks.

3.2 Query Tasks Formulation

In order to formulate the query tasks based on the classes and concepts of GLIF3.5, the concepts and criteria of the text-based query tasks should be categorized as the corresponding classes of the ontology in GLIF3.5. The instance of algorithms class in GLIF3.5 is a flowchart used for describing the workflow of the clinical guideline. In this study, the flowchart is used for presenting the workflow of the query task.

The process of formulating query tasks uses the knowledge-based editing software Protégé, which provides a graphical user interface for formulating query tasks as GLIF3.5 flowcharts. A query task is formulated by building the flowchart and specifying the criteria and data items in each node of the flowchart through the environment provided by Protégé.

The entire query task can be divided into numerous sub-tasks, and each sub-task can be represented using one node in the flowchart. Five predefined types of classes in GLIF3.5 can be used: action, decision, branch, synchronization, and patient state steps. The detailed components of each node can be further specified using the predefined ontology of GLIF3.5 through Protégé.

After the query tasks are formulated using the ontology of GLIF3.5 in the editing environment provided by Protégé, they can be exported from Protégé in XML format. The exported XML-format query tasks can then be imported into the query execution engine for querying and retrieving data from the clinical data repository.

3.3 System Architecture

A web-based data querying tool is implemented for querying the clinical data based on the FBDQM-based query execution engine. The architecture of this query execution engine contains seven major components (Figure 4), which can be further divided into two sets; one is the logical processing components and the other is the visual representation components. The logical processing components contain GLIF3.5 ontology interpreter, flowchart-based model builder, query language generator, and clinical data retriever. The visual representation components contain GLIF3.5 ontology information viewer, query criteria selection interface, and clinical data representation interface.


Figure 4. The architecture of the Flowchart-Based Data-Querying Model (FBDQM) query execution engine

3.4 GLIF3.5 Ontology Interpreter

The formulated query tasks are further exported as XML-format documents which could be imported into the GLIF3.5 ontology interpreter in the query execution engine.

A GLIF3.5 ontology interpreter is responsible for interpreting the query criteria and data items in five types of classes, namely the action, decision, branch, synchronization, and patient state classes. The original meanings of these classes in GLIF3.5 and their usages in this study are described as follows [103]. (1) Action indicates the actions to be performed. For example, medically oriented actions such as medical treatment strategies can be described using this class. When the concepts in the query operation are medically oriented actions, they can be described using the attributes of the action class. (2) Decision specifies the criteria of the different choices at a decision point. Each decision option contains a condition value attribute that describes the detailed criteria of the option. When a query task contains a decision point and needs different criteria to decide the corresponding query operations, the decision class can be used. (3) Branch and synchronization work together. In a flowchart, these two classes express multiple concurrent paths that separate at the branch class and recombine at the synchronization class. In a query task, they can be used for representing multiple concurrent paths. (4) Patient state has two functions: describing the clinical state of a patient and serving as an entry point of the flowchart. When the concepts and rules in the query operation are related to the patient’s status, this class can be used for describing the status. It can also be used for describing the start status of a query task.

3.5 Flowchart-based Model Builder

The flowchart-based model builder generates the flowchart-based model from the results interpreted by the GLIF3.5 ontology interpreter. One FBDQM is generated for each interpreted query task. An FBDQM includes the information relevant to the formulated query task: the structure of the flowchart, which describes the overall query task, and the detailed information of each query sub-task (e.g., the query criteria and the related data items in each node) (Figure 5).


Figure 5. The flowchart-based data querying model (FBDQM) contains the structure of query task workflow and the information of each node in the query task

3.6 Query Criteria Selection Interface

The graphical flowchart of the FBDQM is presented in the query criteria selection interface shown in Figure 6. In this interface, all or some of the nodes of the flowchart can be dynamically selected to participate in the execution of the query operation. On the left side of Figure 7, all nodes in the flowchart are selected for the query operation. On the right side of Figure 7, only certain nodes are selected. The query criteria of the selected nodes are displayed in the interface shown in Figure 8.


Figure 6. The query criteria selection interface and the GLIF3.5 ontology information viewer

The query criteria selection interface also supports a mutually exclusive setting. For example, in Figure 7, the “degree decision” node has three child nodes: “degree A”, “degree B”, and “degree C”. When a patient can be classified into only one of these three degrees, the “degree decision” node can be set as a “mutually exclusive” node, and priorities can be assigned to its child nodes. When a patient meets the criteria of all three child nodes, the patient is classified into the node with the highest priority.
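The mutually exclusive setting can be sketched in Python as follows. The function name `assign_exclusive` and the representation of priorities as a list ordered from highest to lowest are assumptions for illustration; the priority order used in the usage example is also hypothetical.

```python
def assign_exclusive(matched_children, priority):
    """Pick one child node for a patient under a mutually exclusive node.

    matched_children: set of child nodes whose criteria the patient meets
    priority: child node names ordered from highest to lowest priority
    """
    for child in priority:
        if child in matched_children:
            return child
    return None  # the patient meets none of the child criteria
```

For instance, if the priority order is set to “degree C”, “degree B”, “degree A” and a patient meets the criteria of all three child nodes, the patient is assigned to “degree C” only.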


Figure 7. The flexibility in selecting all or partial nodes of the flowchart for participating in the execution of query operation

Figure 8. The query criteria of the selected nodes can be displayed in the query criteria selection interface

3.7 GLIF3.5 Ontology Information Viewer

In Figure 9, the GLIF3.5 ontology information viewer shows the detailed information relevant to the ontology of GLIF3.5 in each node of a FBDQM.


Figure 9. The GLIF3.5 ontology information viewer. The selected node, “degree decision”, and its corresponding information in GLIF3.5 format

When a specific node of the flowchart is selected, the viewer displays its detailed information of the GLIF3.5 ontology, including the name of this selected node, the query criterion expressions specified in this node, and the information of the data items in the query criterion expressions.

3.8 Query Language Generator

To generate the low-level query language, the query criteria in these selected nodes of a FBDQM are transferred to the query language generator. During the process of query language generation, the data items included in the query criteria of these selected


mapping the data items. The mapping is either direct or indirect. In direct mapping, a data item is mapped directly to a value of one specific column of a database table (e.g., select Personal_ID from Diagnosis where ICD9_Code = '155.0'; the data item is directly mapped to the value of the “ICD9_Code” column). In indirect mapping, a data item is mapped through the values of multiple columns of a database table (e.g., select Personal_ID from Laboratory where Result_String = 'Controllable' and Item_Name = 'Ascites'; the data item is indirectly mapped through the “Result_String” column and the “Item_Name” column).

After the mapping process of the data items, the query language generator creates the corresponding SQL-based query languages for data query. The query language generator gets the query criterion in GLIF3.5 format and translates it into one or more simplified SQL-based query criteria. Table 4 shows the examples of the query criteria in GLIF3.5 and the corresponding translated SQL-based query criteria. For example, the GLIF3.5-based query criterion, “ICD9 = 155.0”, is translated into one SQL-based query criterion, “select Personal_ID from Diagnosis where ICD9_Code = '155.0'”. The GLIF3.5-based query criterion, “at least 2 of (Ascites = "Controllable", ICG is within 15 to 40, Prothrombin_Activity is within 50 to 80, Serum_Albumin is within 3.0 to 3.5, Serum_Bilirubin is within 2.0 to 3.0)”, is translated and divided into five SQL-based query criteria (Table 4).


Table 4. The examples of the query criteria in GLIF3.5 and the corresponding translated SQL-based query criteria

Query criteria format | Query criteria
GLIF3.5 | 1: ICD9 = 155.0
Translated SQL-based criteria | 1: select Personal_ID from Diagnosis where ICD9_Code = '155.0'
GLIF3.5 | 2: at least 2 of (Ascites == "Controllable", ICG is within 15 to 40, Prothrombin_Activity is within 50 to 80, Serum_Albumin is within 3.0 to 3.5, Serum_Bilirubin is within 2.0 to 3.0)
Translated SQL-based criteria | 2.1: select Personal_ID from Laboratory where Result_String = 'Controllable' and Item_Name = 'Ascites'
Translated SQL-based criteria | 2.2: select Personal_ID from Laboratory where Result_Nubmer between 15 and 40 and Item_Name = 'ICG'
Translated SQL-based criteria | 2.3: select Personal_ID from Laboratory where Result_Nubmer between 50 and 80 and Item_Name = 'Prothrombin activity'
Translated SQL-based criteria | 2.4: select Personal_ID from Laboratory where Result_Nubmer between 3.0 and 3.5 and Item_Name = 'Serum albumin'
Translated SQL-based criteria | 2.5: select Personal_ID from Laboratory where Result_Nubmer between 2.0 and 3.0 and Item_Name = 'Serum bilirubin'
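How the five translated sub-criteria might be recombined into the original “at least 2 of” criterion is sketched below in Python with the standard sqlite3 module. The SQL strings and column names are copied verbatim from Table 4; the function name `at_least` and the counting strategy are assumptions for illustration, since the combination step of the engine is not detailed here.

```python
import sqlite3
from collections import Counter

LAB = "select Personal_ID from Laboratory where "

# Sub-criteria 2.1 to 2.5 of Table 4 (column names kept verbatim).
SUB_QUERIES = [
    LAB + "Result_String = 'Controllable' and Item_Name = 'Ascites'",
    LAB + "Result_Nubmer between 15 and 40 and Item_Name = 'ICG'",
    LAB + "Result_Nubmer between 50 and 80 and Item_Name = 'Prothrombin activity'",
    LAB + "Result_Nubmer between 3.0 and 3.5 and Item_Name = 'Serum albumin'",
    LAB + "Result_Nubmer between 2.0 and 3.0 and Item_Name = 'Serum bilirubin'",
]


def at_least(conn, queries, k):
    """Return the patients who satisfy at least k of the sub-queries."""
    hits = Counter()
    for sql in queries:
        # A patient counts once per sub-query, even with repeated tests.
        for patient_id in {row[0] for row in conn.execute(sql)}:
            hits[patient_id] += 1
    return sorted(pid for pid, count in hits.items() if count >= k)
```

Here `conn` is any open sqlite3 connection whose `Laboratory` table has the columns used in Table 4; calling `at_least(conn, SUB_QUERIES, 2)` returns the identifiers of patients matching two or more sub-criteria.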

3.9 Clinical Data Retriever

The clinical data retriever performs the query operation based on the query criteria in the selected nodes of the FBDQM. Query execution proceeds from the nodes on the top layer of the flowchart to the nodes at the bottom layer. During the query operation in each node, the patients’ data are retrieved based on the query criteria in the node. The following notations are introduced for illustrating the workflow and operations of query execution in the FBDQM-based query execution engine.

(1) QC(node): The query criteria included in the node.

(2) PL(node): The patient list which contains the patients satisfying the query criteria included in the node.

(3) PLS(node): The size of the patient list, PL(node), indicating the number of patients satisfying the query criteria included in the node.

(4) PN(node): The list of the parent nodes of the node.
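These notations can be related to one another in a short Python sketch of top-down query execution. The function name `run_queries`, the use of in-memory predicates in place of database queries, the intersection of patient lists when a node has several parents, and the skipping of unselected parents are all assumptions for illustration.

```python
def run_queries(nodes, parents, criteria, patients, selected):
    """Compute PL(node) for every selected node, top layer first.

    nodes:    node names listed from the top layer to the bottom layer
    parents:  PN(node), as a dict mapping a node to its parent nodes
    criteria: QC(node), as a dict mapping a node to a predicate
    patients: dict mapping a patient identifier to the patient record
    selected: set of nodes participating in the query execution
    """
    pl = {}
    for node in nodes:
        if node not in selected:
            continue
        # A node without parents targets all patients in the repository.
        target = set(patients)
        for parent in parents.get(node, []):
            if parent in pl:  # assumed: unselected parents are skipped
                target &= pl[parent]
        pl[node] = {pid for pid in target if criteria[node](patients[pid])}
    return pl
```

PLS(node) then corresponds to `len(pl[node])` for each node in the returned dictionary.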

In the left side of Figure 5, six nodes are included in the flowchart, including nodes A, B, C, D, E, and F. Node A has two child nodes, node B and node C. Node B also has two child nodes, node D and node E. Node C has only one child node, node F. In the example, five nodes are selected for participating in the process of query execution, including nodes A, B, C, D, and E. When the query operation is executed upon one node, the query target set is the patients in PL(PN(node)), and the query criteria contained in QC(node) are applied to this query target set, PL(PN(node)). It means that the query operation of the node is performed on the patients satisfying the query criteria of its parent nodes (i.e., PL(PN(node))) with the query criteria of the node, QC(node). When a node has no parent node, all patients in the clinical data repository are regarded as its query target set. For example, the first query operation in Figure 5 is executed upon node A, which has no parent node. The query execution of node A is performed based on QC(A), which is applied to all patients in the repository. After this query operation, the query result of node A, PL(A), is retrieved from the clinical data repository. The following query operation is performed based on the query criteria in QC(B) of node B and the patients in PL(A). For each patient in PL(A), when a patient satisfies all the
